Cluster troubles: enabling remote Infinispan cache

zuzur · May 14, 2020, 9:58am

Hello,

We are using Keycloak in standalone-cluster mode, with a custom user federation SPI that creates and manages users in a database separate from Keycloak’s.

We’ve deployed our service as 3 containers running in Docker (Rancher cluster), with JGroups set up to use JDBC_PING (in production), but we have also successfully tested dns.DNS_PING, kubernetes.KUBE_PING and a custom RANCHER_PING (support for the Rancher 1.x metadata service, which we intend to contribute when it’s ready).

Our main issue with this setup is that we often have to update the federation code, so we publish an updated image for our Keycloak service (built upon jboss/keycloak:9.0.3, with CLI scripts that install the JGroups stack and tune other parameters) and start a rolling upgrade of each container in the cluster.

During that rolling upgrade, we often have issues with the Infinispan caches. Any access to one of those caches takes ages and often ends with the edge load balancer cutting the HTTP connection and returning a 504 to the clients (affecting logins, token refreshes, you name it). We’ve seen response times of up to a minute, and Infinispan never recovers existing session keys from the “sessions” cache. Surprisingly, after a while, new logins work fine and the service resumes operations, but users logged in with existing sessions lose them and have to reconnect.

We have set up our LB to time out after 15s, but this is far from satisfactory: depending on the load, we have approximately an 80% chance of losing all existing sessions during a server upgrade. Usually, we end up stopping and restarting the whole cluster. I can provide traces of those upgrades: Infinispan reports losing contact with a cache coordinator, and every access to that cache is blocked for a very long time. (If anyone has an idea or tip on how to prevent or control such blocking in the Infinispan subsystem, I’d be very glad.)

At the moment, the cache owners for Keycloak’s distributed-caches are set to 3 (3 containers, 3 owners => all data is replicated on every node, no matter what).
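
To illustrate the arithmetic behind "3 containers, 3 owners", here is a minimal Python sketch. It is a toy placement model, not Infinispan's actual consistent-hash algorithm; the node names and the `place_key` helper are hypothetical. It only shows that when the owner count equals the cluster size, every node holds every key, so a single departing node should not, in principle, take any data with it.

```python
import hashlib


def place_key(key, nodes, num_owners):
    """Toy placement: pick num_owners distinct nodes, starting at a hash-derived offset.

    This is NOT Infinispan's real algorithm, just an illustration of owner counts.
    """
    owners = min(num_owners, len(nodes))
    start = int(hashlib.sha256(key.encode()).hexdigest(), 16) % len(nodes)
    return {nodes[(start + i) % len(nodes)] for i in range(owners)}


nodes = ["kc-1", "kc-2", "kc-3"]  # hypothetical container names
keys = [f"session-{i}" for i in range(100)]

# owners == cluster size: every node holds a copy of every key
fully_replicated = all(place_key(k, nodes, 3) == set(nodes) for k in keys)

# One node leaves during a rolling upgrade: each key still has a surviving owner
survivors = {"kc-1", "kc-2"}
still_available = all(place_key(k, nodes, 3) & survivors for k in keys)
```

With 2 owners instead of 3, each key would live on only two of the three nodes, and the cluster would have to move data around whenever a member leaves or joins.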

To solve that federation update issue, and the fact that Keycloak/Infinispan don’t seem to support an environment where a cluster member can disappear and another one reappear without notice, we are exploring the following solutions:

Deploy Keycloak outside of kube/Rancher/Docker on a set of more stable hosts, and use the old-fashioned “connect and deploy using the CLI with capistrano/ansible” deployment for our custom SPI, which does not require a server restart or the Infinispan cluster losing a member. For me, given the effort we spent deploying our infrastructure to support containers and a fast update pace, this would be a shame, and a serious step backward.
Use a separate, more stable Infinispan cluster. This seems to us like the most practical solution at the moment. We have all the CLI scripts ready to modify the standalone-ha.xml, but we are hitting the following roadblock: when configured to use a remote-store for the work, sessions, authenticationSessions, offlineSessions, clientSessions, offlineClientSessions, loginFailures and actionTokens caches, Keycloak sets up an InfinispanNotificationsManager to propagate clustered messages, which requires a site name to be specified (https://github.com/keycloak/keycloak/blob/master/model/infinispan/src/main/java/org/keycloak/cluster/infinispan/InfinispanNotificationsManager.java#L100). The service will not start, always throwing a java.lang.IllegalStateException: Multiple datacenters available, but site name is not configured! Check your configuration (no, Keycloak, I don’t have multiple datacenters for this deployment :-D). Obviously, in standalone mode, this variable is not set, and I have yet to find a way to pass it through configuration. I’m inferring from that code that the configuration we’re trying to set up is not supported. Am I right?

Thanks for reading that looooooong message, glad if you can provide some tips

zuzur · May 14, 2020, 10:10am

One thing that I forgot to mention: plain Keycloak without our user-federation SPI installed displays the exact same behaviour. I obviously suspected our own code at first, but we get the same with a plain Keycloak image with the following changes:

startup timeout set to 90s
keycloak caches setup to use remote-store
jdbc_ping, dns_ping and rancher_ping jgroups stacks installed, and some custom startup CLI run in the entrypoint to select the stack based on environment variables
logger levels for org.jgroups and org.infinispan set through environment variables
public interface set to eth0

I can provide the CLI scripts used to build the image; I didn’t include them because I thought they would clutter the conversation.

Again, thanks for any help or tip

zuzur · May 14, 2020, 10:14am

And here is what we’re seeing in the logs during a rolling upgrade

zuzur · May 15, 2020, 10:04pm

I just found out that the issue occurred because i was trying to offload the work cluster. In that case, the InfinispanNotificationsManager expects a multi-site configuration and tries to forward events to a remote cluster.

For the first time in 2 years, I could perform a load test of our Keycloak service during which I simulated 3 full service upgrades, with only 3 reported errors.
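
As a rough illustration of that finding, the following Python sketch models the startup check as it is described in this thread. It is not Keycloak's Java code: the function name `check_site_name` is hypothetical, and the rule that only a remote-store on the work cache triggers the site-name requirement is taken from the post above, not verified against the Keycloak source.

```python
# Cache names as listed earlier in this thread.
ALL_CACHES = [
    "work", "sessions", "authenticationSessions", "offlineSessions",
    "clientSessions", "offlineClientSessions", "loginFailures", "actionTokens",
]


def check_site_name(remote_store_caches, site_name=None):
    """Simplified model of the check: offloading 'work' without a site name fails."""
    if "work" in remote_store_caches and site_name is None:
        raise RuntimeError(
            "Multiple datacenters available, but site name is not configured! "
            "Check your configuration"
        )


# Assumed fix, per the post above: offload every cache except "work".
offloaded = [cache for cache in ALL_CACHES if cache != "work"]
check_site_name(offloaded)  # passes: the work cache stays local

try:
    check_site_name(ALL_CACHES)  # original setup: work cache offloaded too
    failed = False
except RuntimeError:
    failed = True
```

Under that assumption, the session-related caches can still live in the external Infinispan cluster, while cluster notifications keep flowing through the local work cache.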

Groeg · November 24, 2020, 2:18pm

Hi zuzur,
happy to hear you managed to get the setup running. Do you mind sharing these CLI scripts? I am also interested in externalizing the infinispan cache and they would be helpful.

Thanks!

bramalingam81 · March 30, 2022, 12:45pm

I am facing the same issue as well. Could you please share the configuration change required to fix it?

AlexeiKlimenko · May 31, 2022, 2:29pm

Hello.
I am facing the same error. How did you resolve it?

“i was trying to offload the work cluster” - what does that mean?
