we are using keycloak in standalone-cluster mode, with a custom user federation SPI that create & manage users in a separate database from keycloak.
We’ve deployed our service in a 3 containers setup running in docker (rancher cluster), with jgroups setup to use JDBC_PING (in production) but we have also successfully tested dns.DNS_PING, kubernetes.KUBE_PING and a custom RANCHER_PING (support for rancher 1.X rancher meta-data service, that we intend to contribute when it’s ready)
Our main issue with this setup is that we often have to update the federation code, so we publish an updated image for our keycloak service (we built it upon jboss/keycloak:9.0.3, and use cli scripts to install the jgroups stack and tune other parameters), and start a rolling upgrade of each container of the cluster. During that rolling upgrade, we often have issues with the infinispan caches. Any access to one of those caches will take ages and often end up with the edge load-balancer cutting the HTTP connection and return a 504 to the clients (affecting both logins, token refreshes, you name it). We’ve seen response times up to the minute, and infinispan never recover existing sessions keys from the “sessions” cache. Surprisingly, after a while, new logins work fine and the service resumes operations - but users logged in with existing sessions will lose them and have to reconnect. We have setup our LB to timeout after 15s, but this is far from satisfactory, as, depending on the load, we have approximately 80% chance to lose all existing sessions during a server upgrade. Usually, we end up stopping and restarting the whole cluster. I can provide traces of those upgrades, infinispan reports losing contact with a cache coordinator and every access to that cache is blocked for a very long time (If anyone has an idea/tip of how to prevent or control such blocking with the infinispan subsystem, i’d be very glad)
A the moment, the cache owners for keycloak’s distributed-caches are set to 3 (3 containers, 3 owners => all data is replicated on every nodes, no matter what)
To solve that federation update issue, and the fact that keycloak/infinispan don’t seem to support an environment where a cluster member can disappear and another one re-appear without notice. we are exploring the following solutions:
deploy keycloak outside of kube/rancher/docker on a set of more stable hosts, and use the old-fashioned “connect and deploy using the CLI with capistrano/ansible” deployment for our custom SPI, not requiring a server restart and the infinispan cluster to lose a member. For me, given the effort we spent deploying our infrastructure to support containers and fast update pace, this would be a shame, and a serious step backward
use a separate, more stable infinispan cluster. This seems to us like the most practical solution at the moment. We have all the CLI scripts ready to modify the standalone-ha.xml, but at the moment we are hitting the following roadblock: when configured to use a remote-store for the
actionTokenscaches, keycloak will setup an
InfinispanNotficationManagerto propagate clustered messages, requiring a
sitename to be specified (https://github.com/keycloak/keycloak/blob/master/model/infinispan/src/main/java/org/keycloak/cluster/infinispan/InfinispanNotificationsManager.java#L100). The service will not start, always throwing an
java.lang.IllegalStateException: Multiple datacenters available, but site name is not configured! Check your configuration(no, keycloak, i don’t have multiple datacenters for this deployment :-D) Obviously, in standalone mode, this variable is not set, and i have yet to find a way to pass it through configuration. I’m inferring from that code that the configuration we’re trying to setup will not be supported. Am I right ?
Thanks for reading that looooooong message, glad if you can provide some tips