We are running Keycloak in Docker containers on AWS Fargate. We deploy by adding new containers to the cluster before removing the old ones to avoid downtime. However, we are experiencing some issues when making changes to the Infinispan configuration, e.g. changing the number of owners in the distributed cache. It seems the cluster needs to be taken down to apply the changes. Is this expected behaviour?
Specifically, we changed the number of owners in the distributed cache from 1 to 2, but still lost data when a single node was removed from the cluster. In the application log we could see the warning “Lost data because of graceful leaver”, even though all the nodes in the cluster were seemingly configured with 2 owners. After the cluster was shut down and started again, the configuration worked as expected. We also tested the other way around, i.e. changing the number of owners from 2 to 1. Even though all the nodes in the cluster were configured to have only 1 owner, no data was lost during deployment until the cluster was restarted. Should it be possible to make changes to the caching configuration without downtime? If so, how can we achieve this?
We’re experiencing the same issues on a similar setup: Keycloak deployed on a Rancher cluster (no multicast, so we’re currently using jdbc_ping in production, but we’ve also tested dns_ping, tcp_gossip, kube_ping and a custom rancher_ping based on kube_ping). It’s pretty stable until you have to update the containers in the cluster (we have some custom user-federation SPIs that we update often - way too often for my taste, but…)
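For anyone curious, our jdbc_ping setup is the usual trick of replacing the discovery protocol in the JGroups TCP stack with JDBC_PING, pointed at the Keycloak datasource so nodes register themselves in a database table instead of multicasting. A rough sketch of the stack in standalone-ha.xml (the JNDI name is from our setup, and the protocol list below is the stock TCP stack, so adjust both to your environment):

```xml
<stack name="tcp">
    <transport type="TCP" socket-binding="jgroups-tcp"/>
    <!-- discovery via a shared DB table instead of multicast -->
    <protocol type="JDBC_PING">
        <property name="datasource_jndi_name">java:jboss/datasources/KeycloakDS</property>
    </protocol>
    <protocol type="MERGE3"/>
    <protocol type="FD_SOCK"/>
    <protocol type="FD_ALL"/>
    <protocol type="VERIFY_SUSPECT"/>
    <protocol type="pbcast.NAKACK2"/>
    <protocol type="UNICAST3"/>
    <protocol type="pbcast.STABLE"/>
    <protocol type="pbcast.GMS"/>
    <protocol type="MFC"/>
    <protocol type="FRAG3"/>
</stack>
```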
Our finding is that it has nothing to do with updating owners at runtime. You shouldn’t do that: this setting controls the number of replicas (owners) for a distributed cache. You start your caches with a defined set of owners and don’t need to change it. Ever. At the moment, we have as many owners as containers in our cluster (3).
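Concretely, that means pinning owners in the keycloak cache container of standalone-ha.xml once, at startup. Something like this (cache names as in the default Keycloak distribution; 3 matches our container count, adjust to yours):

```xml
<cache-container name="keycloak">
    <transport lock-timeout="60000"/>
    <!-- owners = number of nodes holding a replica of each entry; set once, never changed at runtime -->
    <distributed-cache name="sessions" owners="3"/>
    <distributed-cache name="authenticationSessions" owners="3"/>
    <distributed-cache name="offlineSessions" owners="3"/>
    <distributed-cache name="loginFailures" owners="3"/>
</cache-container>
```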
The “lost segment” we (and you) are experiencing seems to be caused by Infinispan losing its cache coordinator; suddenly, any access (get/set) on the cache gets stuck until the LB (you have an LB on top of your cluster, right?) decides to kill the HTTP connection. I’ve seen timeouts of up to 60s (due to the ELB default settings), but Infinispan will happily keep the client waiting for a very long time, during which you simply lose your cluster.
We have a load test that simulates simultaneous clients logging in and refreshing their tokens like crazy (very far from normal operating conditions), and under that load I simply cannot safely upgrade my cluster the “docker way” (kill a container, start a fresh one, wait for it to be healthy, repeat).
We have tried so many different Infinispan settings and JGroups stacks to get this working… Under load, Keycloak simply cannot afford to lose a cluster member, and this totally defeats our original idea of using such a cluster for HA.
My current opinion on Wildfly/EJB on top of Infinispan caches is that it adds a lot of complexity to solve a problem that 1% of users may have (multi-DC replication) while making the life of the other 99% miserable. I would be happy if I could just store my sessions/offlineSessions/etc. caches on a Redis/memcache service, with which we have a pretty good operating experience.
If someone here has a sample Keycloak 9.0.X configuration that survives losing one of the cluster members under load, it would really, really save my day, and I would be grateful forever. Any tips or suggestions are welcome… Currently, we’re trying (and failing miserably) to host the caches on a separate cluster from Keycloak, one that would be much more stable and shouldn’t suffer from members leaving.
A Gist of our current “production” Infinispan setup…
And here is the kind of behaviour the cluster shows during an upgrade under stress (the sessions cache failing here, but it could be any other cache in Keycloak). When there are no requests, everything works fine, but that’s not how it’s supposed to work, right?
For the sake of completeness, I wanted to mention that we have also tried adding a new member (container) before stopping an existing one. The results are even worse: we go from an 80% chance of having to restart the whole cluster (losing the session caches and requiring all users to log in again) to 100%.