Keycloak HA issue in Kubernetes when restarting the Infinispan "master"

I am currently working on a Keycloak HA setup on Kubernetes. I am not a Keycloak, Kubernetes, or Infinispan expert, but I managed to get the setup working: the pods can see each other and use Infinispan for auth sessions.

  • I am using two pods, and CACHE_OWNERS_COUNT and CACHE_OWNERS_AUTH_SESSIONS_COUNT are both set to two.
  • Keycloak version 12.0.4 from the jboss/keycloak image.
  • The setup uses a Kubernetes StatefulSet. I have a Kubernetes service that distributes the requests and a headless service that I use for JGROUPS_DISCOVERY via DNS (dns.DNS_PING).

While performing load tests, the requests are distributed across both pods (stickiness to optimize processing is a topic I will take care of later). I can see that both pods can respond to any token refresh or started login, no matter whether the login happened on the current pod or the other one. Accessing the admin interface also works well.

The challenging part is the failover scenario. I can restart one of the pods while the load test is running, but not the other. Without a deep understanding of Infinispan, it looks to me as if Infinispan has a master (the pod that logs messages like: [Context=sessions] ISPN100010: Finished rebalance with members [cim-1, cim-0,…], topology id …).

I can restart almost any pod in the middle of a load test without getting a single error, but when I restart the Infinispan master, my Keycloak cluster collapses and I get nothing but errors for the next 30 seconds.

Even when the cluster is not under load and I delete the pod, it seems as if the master tries to hand over the "master flag" to another pod but does not get any response (I am running into a timeout). Our cluster admins tell me that this is expected behaviour, as communication with the master is cut off as soon as the termination starts.

The pod I am killing has the node identifier cim-1, and cim-0 should become the new master.

The error message I am getting is:
…"loggerName":"org.infinispan.interceptors.impl.InvocationContextInterceptor","level":"ERROR","message":"ISPN000136: Error executing command PrepareCommand on Cache 'http-remoting-connector', writing keys [cim-1]","threadName":"timeout-thread--p17-t1","threadId":127,"mdc":{},"ndc":"","hostName":"cim-1","processName":"jboss-modules.jar","processId":842,"exception":{"refId":1,"exceptionType":"org.infinispan.util.concurrent.TimeoutException","message":"ISPN000476: Timed out waiting for responses for request 21 from cim-0"
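To see which node reports the timeouts and which peer it was waiting for, the ISPN000476 messages could be tallied across the pod logs. This is only a minimal sketch: it assumes one complete JSON object per log line with the hostName and exception.message fields seen in the excerpt above, and the sample line in it is a hypothetical reconstruction, not a real log line.

```python
import json
import re
from collections import Counter

# Matches the ISPN000476 text seen in the error message above.
TIMEOUT_RE = re.compile(
    r"ISPN000476: Timed out waiting for responses for request (\d+) from ([\w.-]+)"
)


def tally_timeouts(lines):
    """Count (reporting host, unresponsive peer) pairs in JSON log lines."""
    pairs = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON, e.g. startup banners
        if not isinstance(record, dict):
            continue
        exc = record.get("exception")
        if not isinstance(exc, dict):
            continue
        match = TIMEOUT_RE.search(str(exc.get("message", "")))
        if match:
            pairs[(record.get("hostName"), match.group(2))] += 1
    return pairs


# Hypothetical sample line shaped like the excerpt above (not a real log line).
sample = json.dumps({
    "level": "ERROR",
    "hostName": "cim-1",
    "exception": {
        "exceptionType": "org.infinispan.util.concurrent.TimeoutException",
        "message": "ISPN000476: Timed out waiting for responses for request 21 from cim-0",
    },
})

if __name__ == "__main__":
    for (reporter, peer), count in tally_timeouts([sample]).items():
        print(f"{reporter} timed out waiting for {peer}: {count}x")
```

Feeding it the logs of both pods would show whether the timeouts only ever go in one direction (the terminating pod waiting for the surviving one) or in both.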

Has anyone else experienced this? How can I get more information on which connection is failing (is it a TCP or a UDP connection, is it a new one or an existing one, …)?
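One check I could run from inside a pod while the other one is terminating is whether fresh TCP connections to the peers behind the headless service still succeed. The following is a rough sketch under assumptions: the service name and port are passed on the command line, and both are placeholders for whatever the JGroups stack in this setup actually uses. It only tests new TCP connections; it says nothing about already established connections or about UDP traffic.

```python
import socket
import sys


def resolve_peers(dns_name, port):
    """Return the sorted set of IP addresses that a DNS name resolves to."""
    infos = socket.getaddrinfo(dns_name, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def probe(host, port, timeout=2.0):
    """Try to open a fresh TCP connection; return (ok, detail)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except OSError as exc:
        # Covers refused connections, timeouts and unreachable hosts.
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    # Usage: python probe.py <headless-service-dns-name> <jgroups-tcp-port>
    if len(sys.argv) == 3:
        name, port = sys.argv[1], int(sys.argv[2])
        for ip in resolve_peers(name, port):
            ok, detail = probe(ip, port)
            print(f"{ip}:{port} {'OK' if ok else 'FAILED'} ({detail})")
```

Running this in a loop during a pod deletion would at least show whether new connections to the remaining pod are refused or time out, which helps to separate a discovery problem from broken existing connections.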

Could this be related to a bug in the Cilium network overlay we are using? See cilium/cilium issue #14844 on GitHub: "LB service (pod) backend deletion breaks existing connections".

Any thoughts on how to continue are appreciated.