Cluster troubles : enabling remote infinispan cache

Hello,

We are using keycloak in standalone-cluster mode, with a custom user federation SPI that creates and manages users in a database separate from keycloak's.

We’ve deployed our service in a 3-container setup running in docker (rancher cluster), with jgroups set up to use JDBC_PING (in production), but we have also successfully tested dns.DNS_PING, kubernetes.KUBE_PING and a custom RANCHER_PING (support for the rancher 1.X meta-data service, which we intend to contribute when it’s ready).

Our main issue with this setup is that we often have to update the federation code, so we publish an updated image for our keycloak service (we built it upon jboss/keycloak:9.0.3, and use CLI scripts to install the jgroups stack and tune other parameters) and start a rolling upgrade of each container of the cluster.

During that rolling upgrade, we often have issues with the infinispan caches. Any access to one of those caches takes ages and often ends with the edge load-balancer cutting the HTTP connection and returning a 504 to the clients (affecting logins, token refreshes, you name it). We’ve seen response times of up to a minute, and infinispan never recovers existing session keys from the “sessions” cache. Surprisingly, after a while, new logins work fine and the service resumes operations, but users logged in with existing sessions lose them and have to reconnect.

We have set up our LB to time out after 15s, but this is far from satisfactory: depending on the load, we have approximately an 80% chance of losing all existing sessions during a server upgrade. Usually, we end up stopping and restarting the whole cluster. I can provide traces of those upgrades: infinispan reports losing contact with a cache coordinator, and every access to that cache is blocked for a very long time. (If anyone has an idea or tip on how to prevent or control such blocking with the infinispan subsystem, I’d be very glad.)

At the moment, the cache owners for keycloak’s distributed-caches are set to 3 (3 containers, 3 owners => all data is replicated on every node, no matter what).
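To illustrate why 3 owners on 3 nodes means every entry lives on every node, here is a minimal Python sketch of owner assignment on a hash ring. It is not Infinispan's actual consistent-hash implementation; the node names (`kc-1`, ...), the session keys and the helper functions are all hypothetical, and the model ignores rebalancing after a node leaves.

```python
import hashlib


def _h(value: str) -> int:
    """Stable integer hash used to place nodes and keys on the ring."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


def owners_for(key, nodes, num_owners):
    """Pick up to `num_owners` distinct nodes for `key` by walking the ring."""
    ring = sorted(nodes, key=_h)
    if not ring:
        return []
    start = _h(key) % len(ring)
    count = min(num_owners, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]


def surviving_keys(keys, nodes, num_owners, lost_node):
    """Keys that still have at least one owner once `lost_node` is gone
    (no rebalancing is modelled)."""
    return [
        k for k in keys
        if any(owner != lost_node for owner in owners_for(k, nodes, num_owners))
    ]


NODES = ["kc-1", "kc-2", "kc-3"]              # hypothetical container names
KEYS = [f"session-{i}" for i in range(100)]   # hypothetical session ids

# With owners == number of nodes, every key is held by every node.
fully_replicated = all(set(owners_for(k, NODES, 3)) == set(NODES) for k in KEYS)
print("every node owns every key:", fully_replicated)

# Compare how many keys keep at least one owner when one node disappears.
for owners in (1, 2, 3):
    kept = surviving_keys(KEYS, NODES, owners, "kc-2")
    print(f"owners={owners}: {len(kept)} of {len(KEYS)} keys still have an owner")
```

In this simplified model, any setting of 2 or more owners keeps a copy of every key when a single node leaves, so the owner count by itself would not explain the lost sessions described above.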

To solve that federation update issue, and the fact that keycloak/infinispan don’t seem to support an environment where a cluster member can disappear and another one re-appear without notice, we are exploring the following solutions:

  • deploy keycloak outside of kube/rancher/docker on a set of more stable hosts, and use the old-fashioned “connect and deploy using the CLI with capistrano/ansible” deployment for our custom SPI, which requires neither a server restart nor the infinispan cluster losing a member. For me, given the effort we spent deploying our infrastructure to support containers and a fast update pace, this would be a shame, and a serious step backward :frowning:

  • use a separate, more stable infinispan cluster. This seems to us like the most practical solution at the moment. We have all the CLI scripts ready to modify the standalone-ha.xml, but at the moment we are hitting the following roadblock: when configured to use a remote-store for the work, sessions, authenticationSessions, offlineSessions, clientSessions, offlineClientSessions, loginFailures and actionTokens caches, keycloak will set up an InfinispanNotificationsManager to propagate clustered messages, requiring a site name to be specified (https://github.com/keycloak/keycloak/blob/master/model/infinispan/src/main/java/org/keycloak/cluster/infinispan/InfinispanNotificationsManager.java#L100). The service will not start, always throwing a java.lang.IllegalStateException: Multiple datacenters available, but site name is not configured! Check your configuration (no, keycloak, I don’t have multiple datacenters for this deployment :-D). Obviously, in standalone mode, this variable is not set, and I have yet to find a way to pass it through configuration. I’m inferring from that code that the configuration we’re trying to set up will not be supported. Am I right?

Thanks for reading that looooooong message, glad if you can provide some tips :wink:

One thing that I forgot to mention: plain keycloak without our user-federation SPI installed displays the exact same behaviour. I obviously suspected our own code at first, but we get the same with a plain keycloak image with the following changes:

  • startup timeout set to 90s
  • keycloak caches set up to use remote-store
  • jdbc_ping, dns_ping and rancher_ping jgroups stacks installed, and some custom startup CLI run in the entrypoint to select the stack based on environment variables
  • loggers level for org.jgroups, org.infinispan set through environment variables
  • public interface set to eth0

I can provide the CLI scripts used to build the image; I didn’t include them because I thought they would clutter the conversation.

Again, thanks for any help or tip :wink:

And here is what we’re seeing in the logs during a rolling upgrade

I just found out that the issue occurred because I was trying to offload the work cache. In that case, the InfinispanNotificationsManager expects a multi-site configuration and tries to forward events to a remote cluster.
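The startup behaviour described above can be paraphrased as a small decision function. This is a hedged Python sketch of the logic as reported in this thread, not a translation of Keycloak's Java code: `startup_mode`, `IllegalStateError` and the returned mode strings are hypothetical names, and only the cache list and the exception message come from the posts above.

```python
from typing import Iterable, Optional

# Caches the original post tried to back with a remote-store.
OFFLOADED_CACHES = [
    "work", "sessions", "authenticationSessions", "offlineSessions",
    "clientSessions", "offlineClientSessions", "loginFailures", "actionTokens",
]


class IllegalStateError(RuntimeError):
    """Stand-in for java.lang.IllegalStateException (hypothetical name)."""


def startup_mode(remote_store_caches: Iterable[str],
                 site_name: Optional[str] = None) -> str:
    """Model of the reported check: only a remote-store on the `work` cache
    makes the notifications manager expect a multi-site setup."""
    if "work" not in set(remote_store_caches):
        # Cluster events stay inside the local cluster; no site name needed.
        return "cluster-only"
    if site_name is None:
        raise IllegalStateError(
            "Multiple datacenters available, but site name is not configured! "
            "Check your configuration"
        )
    # Assumption: with a site name set, events are forwarded across sites.
    return f"cross-site:{site_name}"


# Offloading everything except `work` passes the check without a site name.
print(startup_mode(c for c in OFFLOADED_CACHES if c != "work"))

# Offloading `work` as well reproduces the reported startup failure.
try:
    startup_mode(OFFLOADED_CACHES)
except IllegalStateError as exc:
    print("startup failed:", exc)
```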

For the first time in 2 years, I could perform a load test of our keycloak service during which I simulated 3 full service upgrades with only 3 reported errors.


Hi zuzur,
happy to hear you managed to get the setup running. Do you mind sharing these CLI scripts? I am also interested in externalizing the infinispan cache and they would be helpful.

Thanks!

I am facing the same issue as well. Could you please share the configuration change required to fix it?