Degraded Performance during Rolling Deployments

Hi,

During rollouts of our deployment, we see a significant slowdown on requests to the existing Keycloak pods from the moment the first pod in the StatefulSet begins terminating. Worst-case request times go from about 200 ms to a consistent 2 to 10 seconds until the rollout finishes.

We’re currently running Keycloak 6.0.1 on an AWS EKS cluster (v1.13) with Amazon Linux 2 nodes, deployed with a slightly modified version of the Codecentric Keycloak chart (a sidecar container was added for metrics collection). The cluster has three c5.xlarge (4 vCPU, 8 GB RAM) worker nodes, with memory utilization typically under 2 GB. Our active session count during rollouts is usually fewer than 50.

I’ve tried raising the WildFly log level to debug, but didn’t see anything obvious that would indicate which calls are taking longer during the rollout. The Keycloak pods run in a standalone-ha configuration, with the distributed Infinispan caches set up to use TCP instead of UDP for the JGroups channel and an owner count of 3 for every cache modified in the Helm chart’s CLI script.
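
For context, the transport and owner changes are applied through a jboss-cli script at startup; roughly like the sketch below (the channel name "ee" and the cache names are taken from the stock standalone-ha.xml rather than copied from our actual script, so treat the details as assumptions):

    # switch the JGroups channel from the udp to the tcp stack (assumed channel name "ee")
    /subsystem=jgroups/channel=ee:write-attribute(name=stack, value=tcp)
    # raise the owner count on the Keycloak distributed caches (only a subset shown)
    /subsystem=infinispan/cache-container=keycloak/distributed-cache=sessions:write-attribute(name=owners, value=3)
    /subsystem=infinispan/cache-container=keycloak/distributed-cache=authenticationSessions:write-attribute(name=owners, value=3)
    /subsystem=infinispan/cache-container=keycloak/distributed-cache=offlineSessions:write-attribute(name=owners, value=3)
    /subsystem=infinispan/cache-container=keycloak/distributed-cache=loginFailures:write-attribute(name=owners, value=3)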

Please let me know what additional information I can provide. I’m working on steps to reproduce the issue, but I wanted to start a thread in case someone has ideas on what to look at or is already familiar with this issue.

Thank you,
Michael

Hi, we are facing the same issue; any hints on how to do rolling updates without performance degradation would be highly appreciated :-), thanks.

Same issue here. I am going to move to a remote Infinispan to see whether that solves the problem; I’ll keep you posted if it improves anything. This is how the spike looks during deployment…

There are big spikes in CPU as well, and from the logs I see a lot of messages being transmitted to move keys around the cluster when a pod goes down, exactly because of the replication within the cache-container. We also use owners=2; I will see whether Infinispan is the reason for this.

Hi,

Have you resolved this issue? We are facing the same problem with slowdowns during rolling updates of the server pods. Our setup is the following:

  • Keycloak 10.0.2
  • AWS EKS 1.18
  • Codecentric Helm chart
  • 3 replicas in a StatefulSet

According to JMX metrics, Infinispan cache write time and cache replication time increase to 500 ms and more during the rollout.
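
For anyone who wants to check the same numbers without a JMX client, the values can also be read through the WildFly management CLI; a rough sketch (the cache name "sessions" and the idea of watching the write/replication time attributes are what I’d expect on the stock Keycloak cache-container, so treat the details as assumptions):

    # make sure statistics are collected, otherwise the runtime metrics stay empty
    /subsystem=infinispan/cache-container=keycloak:write-attribute(name=statistics-enabled, value=true)
    /subsystem=infinispan/cache-container=keycloak/distributed-cache=sessions:write-attribute(name=statistics-enabled, value=true)
    # dump the runtime attributes of the sessions cache; the write-time and
    # replication-time values are the ones we watch during a rollout
    /subsystem=infinispan/cache-container=keycloak/distributed-cache=sessions:read-resource(include-runtime=true)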

Thank you in advance!

Same here, Keycloak 10…
We have had this since Keycloak 6, but Keycloak has not taken the issue seriously.
We see delays of 10–30 seconds, and therefore timeouts, during rollouts.
We discovered it is because Infinispan blocks all pods while updating/syncing sessions.
This happens with both distributed and replicated caches.

We hope it improves with Keycloak 12, which includes WildFly 21 and therefore Infinispan 10.
In there they rewrote the cache to be almost non-blocking.
In Infinispan 11 that big change is fully finished, so maybe we even have to wait for Keycloak 13…
But we hope it will already improve in Keycloak 12…

Thanks for the answer! We will wait for Keycloak 12 :-)
In the meantime, have you implemented any workaround to mitigate this issue?

No, unfortunately we did not find a workaround.
We even enabled offline sessions, in the hope that loading sessions from the database would reduce the resyncing of online sessions between pods. But even though sessions are loaded from the database, they are still synced the moment they land in the cache, with the pods communicating among themselves, and again the timeouts are there and logins are blocked.
We even see optimistic-lock and duplicate-key exceptions surfacing through JPA; maybe that is related to all of this.
It is frustrating that there is no solution and that Keycloak does not do proper Kubernetes testing to catch issues like this. They say to hand over a test that simulates the issue, but the test is just creating lots of sessions (> 100,000), so they should be able to test it themselves.
My guess is they introduced it a long time ago when the WildFly Infinispan cache configuration was changed from mode=async to mode=sync. That config change was mandatory because async cache mode was no longer available; instead there was a whole new async Infinispan API. It is Keycloak 12, and partly 13, where they rewrite the code to use this async API. They should have given that priority long ago, together with the config change…
Let’s pray it improves in the next version. They do not say when Keycloak 12 will be released, but it looks like the issues are all done…

Hi, was this ever solved?
We think we are seeing similar problems while evaluating Keycloak. We are using Keycloak 17.0.0 (Quarkus), running in Kubernetes. Whenever we enable the distributed cache, the admin UI becomes very slow even with just one user, and creating tokens takes roughly 2 s. After a few minutes the UI becomes responsive again and tokens take about 200 ms to create.
Making changes such as deploying or killing one of the Keycloak pods makes Keycloak slow again for a few minutes.
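
For reference, by "enabling the distributed cache" I mean starting Keycloak with the Infinispan cache and a Kubernetes-aware JGroups stack, roughly as in the sketch below; the stack choice and the placeholder hostname/database values are assumptions about our setup, while the flags themselves are standard Keycloak 17 options:

    # build-time options: Infinispan cache plus the "kubernetes" JGroups stack (DNS_PING discovery)
    bin/kc.sh build --cache=ispn --cache-stack=kubernetes --db=postgres
    # the kubernetes stack needs -Djgroups.dns.query=<headless-service>.<namespace>.svc.cluster.local
    # in JAVA_OPTS_APPEND so the pods can discover each other
    bin/kc.sh start --hostname=keycloak.example.com \
      --db-url=jdbc:postgresql://postgres:5432/keycloak --db-username=keycloak --db-password=changeme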