Keycloak slow/misbehaving when clustered

When our Docker-based Keycloak system is set up as a single node, it behaves perfectly. When I cluster it into a Docker swarm with 3 nodes, the admin console becomes incredibly slow to load, and users are forced to authenticate 3 times on every login (as if once for each node). Has anyone set up a cluster and seen this kind of behavior? I have verified it both when pointing directly at the machine URL and when going through our F5 proxy.

First advice: look at the org.infinispan and org.jgroups logs. Set their respective levels to DEBUG.
In the huge swamp of messages, you should find the reason why the nodes in the cluster can’t see each other (jgroups) and, as a result, keep the caches (infinispan) from starting up properly.
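
Once DEBUG logging is on, one quick thing to check is whether a node ever sees more than itself in its cluster view. Here is a minimal Python sketch for that. It assumes the server log contains Infinispan's "ISPN000094: Received new cluster view ..." lines in the `[coordinator|viewId] (N) [members]` shape; the exact format may differ between versions, so treat the regex as an assumption and adjust it to what your log actually prints. The function names are made up for this example.

```python
import re

# Assumed line shape (may differ between Infinispan versions):
#   ... ISPN000094: Received new cluster view for channel ejb: [node1|2] (3) [node1, node2, node3]
VIEW_RE = re.compile(
    r"Received new cluster view for channel (?P<channel>\S+): "
    r"\[(?P<coord>[^|\]]+)\|(?P<view_id>\d+)\] \((?P<size>\d+)\) \[(?P<members>[^\]]*)\]"
)


def cluster_views(log_lines):
    """Yield (channel, view_id, members) for every cluster-view line found."""
    for line in log_lines:
        match = VIEW_RE.search(line)
        if match:
            members = [m.strip() for m in match.group("members").split(",") if m.strip()]
            yield match.group("channel"), int(match.group("view_id")), members


def largest_view(log_lines):
    """Return the largest member count this node ever logged (0 if no view line was found)."""
    return max((len(members) for _, _, members in cluster_views(log_lines)), default=0)
```

Feed it the lines of each node's server log. On a 3-node swarm, a result of 1 on every node would mean the nodes never discovered each other, which would be consistent with each node behaving like its own standalone server.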

The default jgroups stacks for keycloak (udp by default, though a tcp stack can be used as well) rely on a discovery protocol (PING for udp, MPING for tcp) that requires IP multicast to be available in your deployment environment. On AWS? Nope, IP multicast isn’t available there. You need to set up aws_ping or another discovery protocol that doesn’t require multicast (JDBC_PING, etc.).
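
If you are not sure whether multicast works between your containers, you can test it outside of keycloak. Below is a rough Python sketch of such a probe using plain UDP sockets. The group address and port are placeholders that I am assuming here; replace them with the values from your own jgroups socket bindings. The function names are made up for this example.

```python
import socket
import struct

# Assumption: placeholder multicast group and port, not necessarily your settings.
GROUP = "230.0.0.4"
PORT = 45700


def membership_request(group=GROUP):
    """Build the 8-byte IP_ADD_MEMBERSHIP argument: group address + any local interface."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))


def open_receiver(group=GROUP, port=PORT, timeout=5.0):
    """Join the multicast group and return a socket that waits up to `timeout` seconds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership_request(group))
    sock.settimeout(timeout)
    return sock


def send_probe(message=b"probe", group=GROUP, port=PORT, ttl=2):
    """Send one datagram to the multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    try:
        sock.sendto(message, (group, port))
    finally:
        sock.close()


def wait_for_probe(sock):
    """Return the sender's IP if a datagram arrives before the timeout, else None."""
    try:
        _data, sender = sock.recvfrom(1024)
        return sender[0]
    except socket.timeout:
        return None
```

Run `wait_for_probe(open_receiver())` in one container and `send_probe()` in another. If the receiver keeps returning None, multicast-based discovery is unlikely to work in that network either, and a protocol like JDBC_PING is the way to go.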

Another important parameter in a wildfly cluster is the number of cache owners. This is the number of nodes on which the infinispan layer keeps a copy of each entry of a distributed cache. Keycloak sets up 6 of those caches (sessions, authenticationSessions, clientSessions, offlineSessions, offlineClientSessions and actionTokens).

On a docker swarm, where you don’t have much control over the stability of the cluster nodes, I’d suggest using as many owners as you have members in your swarm. Or, depending on how you upgrade the service, at least as many owners as there are running, healthy containers during such an upgrade.
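
To make that rule of thumb concrete, here is a tiny Python sketch that just restates it as arithmetic. The function and its parameters are hypothetical; `max_down_during_upgrade` stands for how many containers your rolling upgrade stops at the same time.

```python
def suggested_owners(replicas, max_down_during_upgrade=0):
    """Return (minimum, maximum) owners following the rule of thumb above.

    maximum: one owner per swarm member.
    minimum: one owner per container still running during a rolling upgrade.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if not 0 <= max_down_during_upgrade < replicas:
        raise ValueError("max_down_during_upgrade must be between 0 and replicas - 1")
    healthy_during_upgrade = replicas - max_down_during_upgrade
    return healthy_during_upgrade, replicas
```

For a 3-node swarm upgraded one container at a time, `suggested_owners(3, 1)` gives a range of 2 to 3. Whatever number you pick goes in the `owners` attribute of each distributed cache in the infinispan subsystem configuration.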

My experience is that keycloak doesn’t support such setups very well. It takes infinispan a very long time to recover from losing a cluster member, and under load this makes the cluster unresponsive. I’m still looking for a solution on that side. I lose the session cache every time I upgrade the cluster (or a node gets killed by the environment, or a swarm node is decommissioned because it’s part of a spot ASG; pick your problem, infinispan won’t survive it…).