Configurations for a large cluster

jichen-amplify · July 16, 2021, 7:42pm

Hi,

We have been running a 3-instance Keycloak cluster on AWS EC2 with some modifications to the standalone-ha.xml file that comes with the official Keycloak distribution. The instance type we are using is fairly large (c5.4xlarge) and the whole system is fairly stable.

Recently, our company has moved to using AWS Fargate, which means we will have much less powerful machines. During our load testing, we are now running 18 instances of Keycloak and we are seeing a small number of errors. Some requests are taking a really long time to complete. CPU and network usage also becomes very unstable, even after the load testing is done.

Does anyone have suggestions on areas we can look into to eliminate these issues? It would also be really helpful if anyone could point us to information on how to configure and run a large Keycloak cluster.

Much appreciated in advance!
Jia

xgp · July 17, 2021, 6:49am

Depending on your target number of users and sessions, I’d consider running an external Infinispan cluster. I’ve been running ~15-20 Keycloak instances on Fargate with 5 Infinispan instances (r5.2xlarge) on EC2, and it’s much more stable than running everything on Fargate.

There have been some discussions here about config:

jichen-amplify · July 19, 2021, 11:55pm

@xgp Thanks for the information! I’ll give it a try for sure.

Thanks again.

ashishtchaudhari · July 20, 2021, 12:25pm

@jichen-amplify @xgp
Can you please let us know the kind of traffic that you are managing with your deployments? We are trying to understand what scale of infrastructure is needed when the peak load is expected to reach around 1000 authentications per second.

xgp · July 20, 2021, 4:42pm

Is that 1000 auths per second sustained? You’re really going to have almost 100 million auths per day!? That would certainly be one of the biggest deployments of Keycloak I’ve heard of.

The cluster I described above does ~200 auths per second at peak load, but definitely not sustained. The most auths in one day is ~2 million.
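
As a rough check on these numbers, here is a small sketch of the arithmetic. It is not part of the measurements above, and the function names are made up for illustration. It only converts between a per-second rate and a per-day total.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def auths_per_day(rate_per_second: float) -> float:
    """Total auths in a day if the given rate were sustained all day."""
    return rate_per_second * SECONDS_PER_DAY


def average_rate(auths_in_day: float) -> float:
    """Average auths per second implied by a daily total."""
    return auths_in_day / SECONDS_PER_DAY


# 1000 auths/second sustained for a full day
sustained = auths_per_day(1000)

# Average rate on a day with ~2 million auths (hypothetical helper, real figure from the thread)
busiest_day_avg = average_rate(2_000_000)

print(f"1000/s sustained: {sustained:,.0f} auths per day")
print(f"2M auths per day: {busiest_day_avg:.1f} auths per second on average")
print(f"peak-to-average ratio at a 200/s peak: {200 / busiest_day_avg:.1f}x")
```

So 1000 auths per second sustained works out to about 86 million per day, while a 2 million auth day averages only about 23 per second. The gap between that average and the 200 per second peak is why sizing here is driven by bursts rather than by daily volume.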

ashishtchaudhari · July 20, 2021, 4:50pm

No, it’s not sustained. It would be a peak load that may be experienced intermittently.
We are integrating with a mobile app which will be used by about 15 to 20 million users. The app may send periodic notifications for offers that it has. During that period, we expect that a lot of users may try to access the application, which would create load on Keycloak to provide tokens to users who need a new token or a new authentication.

In your scenario, where you have about 200 auths per second, what kind of infrastructure did you need in terms of sizing?

Regards,
Ashish

xgp · July 20, 2021, 5:12pm

The cluster I described above:

ashishtchaudhari · July 20, 2021, 5:14pm

OK, so are the 15-20 instances needed all the time, or is that just the peak when containers auto-scale on Fargate?

xgp · July 20, 2021, 5:26pm

In the use case that I have, there would be no time to auto-scale to meet peaks, so we currently keep 16 instances running at all times.

ashishtchaudhari · July 20, 2021, 5:32pm

OK, thanks a lot, this information helps us decide the way forward.
Just one last question: can you please share the total user volume that you cater to?

xgp · July 20, 2021, 5:43pm

There are just over 18 million users stored, but the MAU number is only 4 million.

ashishtchaudhari · July 20, 2021, 5:44pm

Thanks again, this is really helpful. Just to understand the sizing, can I know how many vCPUs and how much RAM each of your containers had on Fargate?

xgp · July 20, 2021, 6:09pm

  cpu                 = 4096
  memory              = 8192

Happy to share information, but I’d really recommend doing load testing against your own specific use cases before you put something of this scale into production. There are a few starting points out there for Keycloak load testing frameworks (e.g. lgraf/keycloak-gatling on GitHub: a basic Gatling simulation to load test your Keycloak installation).
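
To give an idea of what a very small load test could look like, here is a rough sketch in plain Python (standard library only) rather than Gatling. It assumes a legacy Keycloak where the base URL includes `/auth`, a test realm, and a public client with Direct Access Grants enabled so the password grant works. The host, realm, client, and user names are all hypothetical, and `run_load` and `send` are illustrative helpers, not part of any Keycloak tooling.

```python
import time
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def build_token_request(base_url, realm, client_id, username, password):
    """Build a password-grant POST for the realm's OIDC token endpoint."""
    url = f"{base_url}/realms/{realm}/protocol/openid-connect/token"
    body = urllib.parse.urlencode({
        "grant_type": "password",
        "client_id": client_id,
        "username": username,
        "password": password,
    }).encode()
    return urllib.request.Request(url, data=body, method="POST")


def send(request, timeout=10.0):
    """Send one request and return (ok, elapsed_seconds)."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            ok = resp.status == 200
    except OSError:  # covers URLError, HTTPError and socket timeouts
        ok = False
    return ok, time.perf_counter() - start


def run_load(requests, concurrency=50, sender=send):
    """Fire all requests through a thread pool and summarize the results."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(sender, requests))
    latencies = sorted(seconds for _, seconds in results)
    errors = sum(1 for ok, _ in results if not ok)
    p95 = latencies[int(0.95 * (len(latencies) - 1))] if latencies else None
    return {"total": len(results), "errors": errors, "p95_seconds": p95}


# Example usage (all values are placeholders for your own test environment):
# reqs = [build_token_request("https://keycloak.example.com/auth", "demo",
#                             "load-test", f"user{i}", "secret")
#         for i in range(1000)]
# print(run_load(reqs, concurrency=100))
```

This only exercises the token endpoint and reports the error count and an approximate 95th percentile latency. A real test should mix in the flows your app actually uses (browser logins, token refresh, logout) and run long enough to fill the session caches.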

ashishtchaudhari · July 20, 2021, 6:24pm

Yes, we are going to do the load test; I was just trying to understand the sizing for a similar scale. Thanks again.

atulchauhan01 · September 29, 2021, 7:21am

@ashishtchaudhari, are you running Keycloak and Infinispan on Fargate, or Infinispan on EC2 instances as suggested by @xgp?
If you are running on Fargate, can you please share the container sizing details (vCPU and memory) for Keycloak and Infinispan? And how many instances are you running all the time?

ashishtchaudhari · September 29, 2021, 7:46am

Hi Atul,
At this point we are running it on an auto-scaling group of EC2 instances. We have decided to go with R5b instances with 2 CPUs and 16 GB of RAM. Our performance testing so far has shown that we would need more memory to store the session cache. Our Infinispan servers are embedded and not external as of now. We are still evaluating the best approach.

Regards,
Ashish

atulchauhan01 · September 29, 2021, 9:11am

Thanks Ashish. Can you share how many users you have in your database? Which database are you using, and what is the configuration?

ashishtchaudhari · October 16, 2021, 9:45pm

We have about 4 million users.

ashishtchaudhari · October 19, 2021, 4:27am

@xgp
We are facing issues with Keycloak in certain scenarios. When a rolling restart of the servers happens, we see issues related to refresh tokens. We have offline tokens configured, but even then, in some cases Keycloak returns an error stating that the token is invalid even though the token sent is valid. By any chance, did you face a similar issue? If yes, any guidance on this would help.

xgp · October 19, 2021, 7:30am

I haven’t seen that specific problem. If you could post more about what you’re seeing, we could try to recreate it and help you debug.
