Configurations for a large cluster

Hi,

We have been running a 3-instance Keycloak cluster on AWS EC2 with some modifications to the standalone-ha.xml file that comes with the official Keycloak distribution. The instance type we are using is fairly large (c5.4xlarge) and the whole system is fairly stable.

Recently, our company has moved to AWS Fargate, which means we will have much less powerful machines. During our load testing, we are now running 18 instances of Keycloak and we are seeing a small number of errors. Some requests are taking a really long time to complete. CPU and network usage also become very unstable, even after the load testing is done.

Does anyone have suggestions on areas we could look into to eliminate these issues? It would also be really helpful if anyone could point us to information on how to configure and run a large Keycloak cluster.

Much appreciated in advance!
Jia

Depending on your target number of users and sessions, I’d consider running an external Infinispan cluster. I’ve been running ~15-20 Keycloak instances on Fargate with 5 Infinispan instances (r5.2xlarge) on EC2, and it’s much more stable than running everything on Fargate.

There have been some discussions here about config:

@xgp Thanks for the information! I’ll give it a try for sure.

Thanks again.

@jichen-amplify @xgp
Can you please let us know the kind of traffic you are managing with your deployments? We are trying to understand what the scale of the infrastructure should be when the peak load is expected to reach about 1000 authentications per second.

Is that 1000 auths per second sustained? You’re really going to have almost 100 million auths per day!? That would certainly be one of the biggest deployments of Keycloak I’ve heard of.

The cluster I described above does ~200 auths per second peak load, but definitely not sustained. Most auths in one day is ~2 million.
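For anyone following along, the arithmetic behind these figures can be checked with a quick sketch. The input numbers are the ones quoted in this thread; the function and variable names are only illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def auths_per_day(rate_per_second: float) -> float:
    """Total authentications in one day if the rate were sustained."""
    return rate_per_second * SECONDS_PER_DAY


def average_rate(auths_in_day: float) -> float:
    """Average authentications per second over a full day."""
    return auths_in_day / SECONDS_PER_DAY


# 1000 auths per second, sustained for a whole day
sustained_1000 = auths_per_day(1000)

# ~2 million auths on the busiest day, averaged over that day
avg_rate = average_rate(2_000_000)

# How far the ~200/s peak sits above that daily average
peak_to_average = 200 / avg_rate

print(f"{sustained_1000:,.0f} auths/day at a sustained 1000/s")
print(f"{avg_rate:.1f} auths/s on average for 2 million auths/day")
print(f"peak is about {peak_to_average:.1f}x the daily average")
```

So a sustained 1000 auths per second works out to about 86 million per day, while the cluster described above averages only around 23 per second on its busiest day. Sizing is driven by the peak, not the average.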

No, it's not sustained. It would be a peak load that may be experienced intermittently.
We are integrating with a mobile app that will be used by about 15 to 20 million users. The app may send periodic notifications about offers. During those periods, we expect that a lot of users may try to access the application, which would put load on Keycloak to provide tokens to users who need a new token or a new authentication.

In your scenario, where you have about 200 auths per second, what kind of infrastructure did you need in terms of sizing?

Regards,
Ashish

The cluster I described above:

OK, so are 15-20 instances needed all the time, or is that just the peak when the containers auto-scale on Fargate?

In the use case that I have, there would be no time to auto-scale to meet peaks, so we currently keep 16 instances running at all times.

OK, thanks a lot, this information helps us decide the way forward.
Just one last question: can you please share the total user volume that you cater to?

There are just over 18 million users stored, but the MAU number is only 4 million.

Thanks again, this is really helpful. Just to understand the sizing, can I ask how many vCPUs and how much RAM each of your containers has on Fargate?

  cpu                 = 4096
  memory              = 8192
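
As a rough illustration of what those two values mean, here is a small sketch that converts them into vCPUs and GiB and totals them across the 16 always-on instances mentioned earlier. It assumes the usual ECS conventions (1024 CPU units per vCPU, memory given in MiB); the helper names are made up for this example.

```python
CPU_UNITS_PER_VCPU = 1024  # ECS expresses CPU in units; 1024 units = 1 vCPU
MIB_PER_GIB = 1024         # ECS task memory is specified in MiB


def task_size(cpu_units: int, memory_mib: int):
    """Convert ECS task settings to (vCPUs, GiB of memory)."""
    return cpu_units / CPU_UNITS_PER_VCPU, memory_mib / MIB_PER_GIB


def cluster_totals(instances: int, cpu_units: int, memory_mib: int):
    """Aggregate (vCPUs, GiB) across identically sized tasks."""
    vcpus, gib = task_size(cpu_units, memory_mib)
    return instances * vcpus, instances * gib


per_task = task_size(4096, 8192)             # one Keycloak task
keycloak_total = cluster_totals(16, 4096, 8192)  # 16 tasks kept running

print(f"per task: {per_task[0]:.0f} vCPU, {per_task[1]:.0f} GiB")
print(f"16 tasks: {keycloak_total[0]:.0f} vCPU, {keycloak_total[1]:.0f} GiB")
```

That is 4 vCPU and 8 GiB per Keycloak task, or 64 vCPU and 128 GiB in total for the Keycloak tier, not counting the Infinispan instances on EC2.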

Happy to share information, but I’d really recommend doing load testing against your own specific use cases before you put something of this scale into production. There are a few starting points out there for Keycloak load testing frameworks (e.g. lgraf/keycloak-gatling on GitHub, a basic Gatling simulation to load test your Keycloak installation).

Yes, we are going to do the load test. I was just trying to understand the sizing for a similar scale. Thanks again.

@ashishtchaudhari, are you running both Keycloak and Infinispan on Fargate, or Infinispan on EC2 instances as suggested by @xgp?
If you are running on Fargate, can you please share the container sizing details (vCPU and memory) for Keycloak and Infinispan, and how many instances you keep running all the time?

Hi Atul,
At this point we are running it on an auto-scaling group of EC2 instances. We have decided to go with R5b instances with 2 CPUs and 16 GB of RAM. Our performance testing so far has shown that we would need more memory to store the session cache. Our Infinispan servers are embedded, not external, as of now. We are still evaluating the best approach.

Regards,
Ashish

Thanks Ashish. Can you share how many users you have in your database? Which database are you using, and what is its configuration?

We have about 4 million users.

@xgp
We are facing issues with Keycloak in certain scenarios. When a rolling restart of the servers happens, we see problems related to refresh tokens. We have offline tokens configured, but even then, in some cases Keycloak returns an error stating that the token is invalid even though the token sent is valid. By any chance, did you face a similar issue? If so, any guidance would help.

I haven’t seen that specific problem. If you could post more about what you’re seeing, we could try to recreate it and help you debug.