I am running Keycloak 8.0.1 in HA (2 replicas) on AWS ECS Fargate.
There are about 20 realms, each with at least 100 users and 100 groups (coming from an external OpenLDAP server).
One of those realms is huge and has ~2000 users and ~7000 groups.
Everything works fine, but we notice a linear increase in CPU and memory over time until they reach their limits: CPU ~98% and memory ~71% on both running instances.
I am really worried about this behavior.
Attached is the CPU graph (not the memory one, because I can only attach one image).
Do you have any advice?
Since it is running on Fargate, I have no way to take a thread dump (TD) or even a heap dump (HD).
However, I could run it on ECS (EC2 instances) if that would help to understand the problem.
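When there is no shell access to the container, one option is to capture a thread dump from inside the JVM itself. Below is a minimal sketch using only the standard `java.lang.management` API. The class name `InProcessThreadDump` is hypothetical, and how you would deploy and trigger it inside Keycloak (for example from a custom provider) is an assumption that is not covered here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper: builds a text thread dump of the JVM it runs in.
class InProcessThreadDump {

    /** Returns the name, state and stack trace of every live thread. */
    static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip monitor/synchronizer details to keep the call cheap.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (info == null) {
                continue; // defensive: skip a thread that is no longer alive
            }
            out.append('"').append(info.getThreadName()).append("\" ")
               .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                out.append("    at ").append(frame).append('\n');
            }
            out.append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // On Fargate, stdout usually ends up in the configured log driver.
        System.out.print(dump());
    }
}
```

Writing the dump to stdout means it lands wherever the container logs already go, so no extra volume or file access is needed.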
My team is running Keycloak 7.0.0 Standalone HA (2 nodes) on EC2 instances and we had a similar experience. We managed to slow the growth so that it takes days instead of hours, as in your case.
Can you check and answer the following?
- How often do your User Federations (LDAP connections) sync?
- Do you use full and changed users syncs?
- Do you have any LDAP errors in your logs?
- Are you sure that all users and groups from LDAP are synced?
- Are your nodes (Keycloak) in Sync?
- Do you have a single LDAP instance or a cluster?
What we discovered:
LDAP cluster: If the LDAP nodes are out of sync, the Keycloak members won’t stop updating/syncing the users and will fill the heap.
Low sync periods: The longer the sync period, the lower the usage. I’d guess from our CPU utilization that your User Federations sync every hour.
- Don’t use the “Periodic Changed Users” setting if you use LDAP as the source of truth (no other tools like AD, ADFS, … are enabled). There is no need for syncing if you don’t do management in your LDAP system.
Wrong User Federation configuration: If you have any errors in the LDAP configuration, it may leak open TCP connections, slowly filling up your memory.
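To make the point about sync periods concrete, here is a small arithmetic sketch of how many periodic sync runs per day a given period produces. It assumes one LDAP provider per realm and a period expressed in seconds; the class and method names are hypothetical and this is not Keycloak code.

```java
// Hypothetical helper: rough count of periodic sync runs per day.
class SyncLoadEstimate {

    private static final long SECONDS_PER_DAY = 86_400;

    /** Sync runs per day for one provider with the given period in seconds. */
    static long syncsPerDay(long periodSeconds) {
        if (periodSeconds <= 0) {
            throw new IllegalArgumentException("period must be positive");
        }
        return SECONDS_PER_DAY / periodSeconds;
    }

    /** Sync runs per day across all providers sharing the same period. */
    static long totalSyncsPerDay(int providers, long periodSeconds) {
        return providers * syncsPerDay(periodSeconds);
    }

    public static void main(String[] args) {
        // Assumption: 20 realms, one LDAP provider each.
        System.out.println(totalSyncsPerDay(20, 3_600));  // hourly sync: 480 runs per day
        System.out.println(totalSyncsPerDay(20, 21_600)); // every 6 hours: 80 runs per day
    }
}
```

Each run on a realm with thousands of groups is expensive, so going from an hourly to a six-hourly period cuts the number of runs by a factor of six.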
Do you have access to VisualVM, JProfiler, YourKit Java Profiler, NetBeans Profiler, Stackify or New Relic?
Ref: Hunting Java Memory Leaks
Thank you for your feedback.
We run LDAP in master/slave mode. The first URL is the master one. According to the Keycloak user sync log, users are not updated when there is no need.
We use LDAP as the source of truth for authentication, but we also use LDAP to get user groups; in this case the source of truth is Keycloak. Since user groups change in LDAP, we need to sync them often.
I do not notice any error in the Keycloak logs.
I made some configuration changes in the LDAP federation:
- Set a higher sync period.
- Removed the “Preserve Group Inheritance” flag from the groups-mapper (this reduces the sync time by ~30%).
- Removed a second groups-to-role-mapper (no longer needed by our application).
After those changes I restarted both Keycloak instances, and the following is the CPUUtilization graph:
I have also conducted a heap analysis.
Here is some of the evidence:
I deduce that there is a leak in the sync between the Keycloak instances. So I checked the configuration (default, no customization) and nothing rang a bell.
I am really convinced that there is some leak somewhere.
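For anyone who wants to repeat a heap analysis without shell access to the container, a heap dump can also be triggered from inside the JVM. This is a minimal sketch assuming a HotSpot/OpenJDK runtime, where `com.sun.management.HotSpotDiagnosticMXBean` is available. The class name `InProcessHeapDump` and the temp-file location are hypothetical, and getting the resulting file out of the container (for example by uploading it somewhere) is left open.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;

// Hypothetical helper: writes a heap dump of the JVM it runs in.
class InProcessHeapDump {

    /** Dumps live objects to the given .hprof file and returns it. */
    static File dump(File target) {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            // dumpHeap refuses to overwrite an existing file.
            Files.deleteIfExists(target.toPath());
            // true: dump only reachable objects, which keeps the file smaller.
            diagnostics.dumpHeap(target.getAbsolutePath(), true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return target;
    }

    /** Dumps to a fresh file in the default temp directory. */
    static File dumpToTempFile() {
        try {
            // Newer JDKs require the file name to end with ".hprof".
            return dump(Files.createTempFile("heap-", ".hprof").toFile());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        File file = dumpToTempFile();
        System.out.println("Heap dump written to " + file + " (" + file.length() + " bytes)");
    }
}
```

The resulting `.hprof` file can be opened in the profilers mentioned earlier in this thread, such as VisualVM. Note that the dump pauses the JVM while it is written and needs enough free disk space for the whole live heap.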
I was running Keycloak 4.8.3 in production. The CPU utilization was always under 1% and the RAM under 30% (of 2GB).
Since we do NOT have a big realm in production, I upgraded Keycloak to version 8.0.1 without changing any configuration.
Now the CPU utilization is growing day by day, as is the RAM (now at 35%).
So I can confirm that version 4.8.3 does not have this problem.
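To turn "growing day by day" into a number, a least-squares line through a few daily readings gives the growth rate and a rough time until a limit is reached. This is a sketch only: the sample readings in `main` are hypothetical placeholders, not measurements from this thread, and the class and method names are made up for illustration.

```java
// Hypothetical helper: fits a straight line to daily usage samples.
class UsageTrend {

    /** Least-squares slope, in percentage points per day. */
    static double slopePerDay(double[] days, double[] percent) {
        if (days.length != percent.length || days.length < 2) {
            throw new IllegalArgumentException("need at least two paired samples");
        }
        int n = days.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += days[i];
            meanY += percent[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            sxy += (days[i] - meanX) * (percent[i] - meanY);
            sxx += (days[i] - meanX) * (days[i] - meanX);
        }
        if (sxx == 0) {
            throw new IllegalArgumentException("all samples share the same day");
        }
        return sxy / sxx;
    }

    /** Days until the limit is hit at the given slope; infinite if usage is not growing. */
    static double daysUntil(double limitPercent, double currentPercent, double slope) {
        if (slope <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return Math.max(0, (limitPercent - currentPercent) / slope);
    }

    public static void main(String[] args) {
        // Hypothetical readings: day index and RAM usage in percent.
        double[] days = {0, 1, 2, 3, 4};
        double[] ram = {31, 32, 33, 34, 35};
        double slope = slopePerDay(days, ram);
        System.out.println("growth per day: " + slope);
        System.out.println("days until 71%: " + daysUntil(71, 35, slope));
    }
}
```

A steady positive slope over several days, with no plateau after garbage collection, is the pattern that points to a leak and not just a cache warming up.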
If you want the Keycloak team to look into your findings, you will need to open an issue.
@lmolinaro, have you raised an issue? If so, could you please provide the issue number on this thread?
I’m currently facing the same problem. I found several issues in the Keycloak Jira that seem to be related to this. I’ll add them here for follow-up:
Upgrading to Keycloak 10 seems to fix this issue. The CPU level no longer rises after the update.