Changing cache configuration without down time

emil.lunde · December 2, 2019, 10:13am

We are running Keycloak in docker containers in AWS Fargate. We deploy by adding new containers to the cluster before removing the old ones to avoid down time. However we are experiencing some issues when making changes to the Infinispan configuration, e.g. changing the number of owners in the distributed cache. It seems as though the cluster needs to be taken down to apply the changes. Is this expected behaviour?

Specifically we changed the number of owners in the distributed cache from 1 to 2, but still lost data when a single node was removed from the cluster. In the application log we could see the warning: “Lost data because of graceful leaver” even though all the nodes in the cluster were seemingly configured with 2 owners. After the cluster was shut down and started again the configuration seemed to be working as expected. We also tested the other way around, i.e. changing the number of owners from 2 to 1. Even though all the nodes in the cluster were configured to have only 1 owner, no data was lost during deploy until the cluster was restarted. Should it be possible to make changes to the caching configuration without down time? If so, how can we achieve this?

Thanks!

slaskawi · December 2, 2019, 11:20am

Unfortunately yes, this is expected. At the moment, Infinispan’s configuration is static. Changing the number of owners in a running system may result in unexpected behavior and has never been tested.

But this may improve once we migrate to Infinispan 10. The way Infinispan consumes configuration has been completely refactored. But this requires some experimentation and testing.

zuzur · May 13, 2020, 8:23am

We’re experiencing the same issues on a similar setup. Keycloak deployed on a rancher cluster (no multicast, so we’re currently using jdbc_ping in production but made tests with dns_ping, tcp_gossip, kube_ping and a custom rancher_ping based on kube_ping). It’s pretty stable until you have to update the containers in the cluster (we have some custom user-federation SPIs that we update often - way too often to my taste, but…)

Our findings is that it has nothing to do with updating owners at run time. You shouldn’t do that. This setting controls the number of replicas (owners) for a distributed cache. You start your caches with a defined set of owners and don’t need to change it. Ever. At the moment, we have as many owners as containers in our cluster (3)

The “lost segment” we (and you) are experimenting seems to be caused by infinispan losing its cache coordinator, and suddenly, any access (get / set) on the cache gets stuck until the LB (you have an LB on top of your cluster, right ?) decides to kill the HTTP connection. I’ve seen timeouts up to 60s (due to ELBs default settings) but infinispan will happily keep the client waiting for a very long time, during which you simply lose your cluster.

We have a load test that runs simultaneous clients logins and refreshing their token like crazy (very far from normal operations conditions) and i simply cannot safely upgrade my cluster the “docker way” (kill a container, start a fresh one, wait for it to be healthy, repeat)
under that load.

We have tried so many different infinispan settings and jgroups stacks to get this working… Under load, keycloak simply cannot afford to lose a cluster member, and this totally defeats our original idea of using such a cluster for HA.

My current opinion on wildfly/EJB on top of infinispan caches is that it adds a lot of complexity to solve a problem that 1% of users may have (multi-DC replication) while making the life of the 99% others miserable I would be happy if i could just store my sessions/offlineSessions/etc caches on a redis/memcache service with which we have a pretty good operating experience.

If someone here has a sample Keycloak 9.0.X configuration that supports losing one of the cluster member under load, it would really really save my day, and i would be grateful for EVER. Any tips or suggestion would be welcome… Currently, we’re trying (and failing miserably) to host the caches on a separate cluster from keycloak that would be way more stable and shouldn’t suffer from members leaving.

A Gist of our current “production” infinispan setup…

gist.github.com

https://gist.github.com/earzur/2296fdc4a23d92fa63f3624ac0bf8e5a

infinispan.xml

       <subsystem xmlns="urn:jboss:domain:infinispan:9.0">
            <cache-container name="keycloak">
                <transport lock-timeout="60000"/>
                <local-cache name="realms">
                    <object-memory size="10000"/>
                </local-cache>
                <local-cache name="users">
                    <object-memory size="10000"/>
                </local-cache>
                <local-cache name="authorization">

This file has been truncated. show original

jgroups.xml

        <subsystem xmlns="urn:jboss:domain:jgroups:7.0">
            <channels default="ee">
                <channel name="ee" stack="jdbc-ping" cluster="ejb"/>
            </channels>
            <stacks>
                <stack name="jdbc-ping">
                    <transport type="TCP" socket-binding="jgroups-tcp"/>
                    <protocol type="org.jgroups.protocols.JDBC_PING">
                        <property name="connection_url">jdbc:mysql://${env.MYSQL_KEYCLOAK_SERVER:mysql}:${env.MYSQL_KEYCLOAK_PORT:3306}/${env.MYSQL_KEYCLOAK_DATABASE:key
cloak}?${env.MYSQL_KEYCLOAK_JDBC_PARAMS:useSSL=false}</property>

This file has been truncated. show original

And here is the kind of behaviour the cluster shows (sessions cache failing here, but that could be any other cache in keycloak) during an upgrade under stress. When there is no requests, everything works fine, but that’s not how it’s supposed to work, right ?

This results in tons of 504s and 404s on the client side.

zuzur · May 13, 2020, 9:00am

For the sake of completeness, i wanted to mention that we have also tried to add a new member (container) before stopping an existing container. The results are even worse, we just go from 80% chances of having to restart the whole cluster (losing the session caches and requesting all users to login again) to 100%

Topic		Replies	Views
Cluster troubles : enabling remote infinispan cache Configuring the server	6	3679	May 31, 2022
Configuring distributed caches - impact of the owner parameter on infinispan configuration Configuring the server	3	300	January 16, 2024
Sessions lost during deployments as pods try reaching distributed cache owners that are taken down Getting advice	3	1014	September 30, 2022
Keycloak clustering in a dynamic environment Configuring the server	2	1749	October 15, 2019
Keycloak web GUI is down, when the db connection is lost despite the infinispan distributed cache is configured. Any ideas why? Configuring the server	0	322	June 12, 2023

Changing cache configuration without down time

Related Topics