During rolling updates of our deployment, we see a significant slowdown in requests to the existing Keycloak pods from the moment the first pod in the StatefulSet begins terminating. Request times jump from a worst case of roughly 200 ms to a consistent 2 to 10 seconds, and stay there until the rollout completes.
We’re currently running Keycloak 6.0.1 on an AWS EKS cluster (v1.13) with Amazon Linux 2 nodes, deployed via a slightly modified version of the codecentric Keycloak Helm chart (we added a sidecar container for metrics collection). The cluster contains three c5.xlarge worker nodes (4 vCPU, 8 GB RAM each), with memory utilization typically under 2 GB. Our active session count during rollouts is usually fewer than 50.
I’ve tried raising the WildFly log level to DEBUG, but didn’t see anything obvious indicating which calls slow down during the rollout. The Keycloak pods run in the standalone-ha configuration, with the distributed Infinispan caches set up to use TCP instead of UDP for the JGroups channel, and an owners count of 3 for all of the caches modified in the Helm chart’s CLI script (sketched below).
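For context, the cache changes in the chart’s CLI script amount to jboss-cli commands along the lines of the following sketch. This is paraphrased rather than copied verbatim from our config, so the exact cache list may differ slightly from what the chart applies:

```
# Paraphrased sketch of the chart's jboss-cli customization, not our exact script.

# Switch the JGroups "ee" channel from the default udp stack to tcp:
/subsystem=jgroups/channel=ee:write-attribute(name=stack, value=tcp)

# Raise owners to 3 on the distributed caches so that, with three nodes,
# every node holds a replica of each entry. The same command is applied to
# the remaining distributed caches modified by the script (offlineSessions,
# loginFailures, actionTokens, etc.):
/subsystem=infinispan/cache-container=keycloak/distributed-cache=sessions:write-attribute(name=owners, value=3)
/subsystem=infinispan/cache-container=keycloak/distributed-cache=authenticationSessions:write-attribute(name=owners, value=3)
```

My understanding is that with owners=3 on a three-node cluster, no cache data should need to be rebalanced away when a single pod terminates, which is part of why the slowdown surprises us.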
Please let me know what additional information I can provide. I’m working on steps to reproduce the issue, but I wanted to start a thread in case someone has ideas on what to look at or is already familiar with this issue.