Random Keycloak Timeouts for Login

Hey everybody! Full disclosure, I am fairly new to Keycloak, so I apologize if the answer to this question is obvious to some. A little background on my issue: we currently have a three-node EKS cluster sitting in a private VPC, and we access our endpoints via AWS Workspaces.

I am deploying Keycloak v14.0 using a custom Helm chart; I can access the admin console fine and can likewise connect to an upstream AD server and pull in users.

The issue I am currently facing is as follows: we have GitLab and Grafana set up to use OIDC via Keycloak, and we are running into seemingly random timeouts when trying to sign in. Sometimes the timeout occurs before the Keycloak sign-in page comes up, and sometimes it occurs after entering credentials. Other times it signs in totally fine, though that outcome is pretty rare, especially with GitLab.

Here’s what I found in the Keycloak logs for the following scenarios:

  1. If the Keycloak login screen never comes up, there isn’t anything related to the request in the Keycloak logs, as if the request never made it across the cluster.
  2. If I get to the Keycloak login page and put in my credentials, I do see successful token creation in the logs.

I am wondering if anybody else has deployed Keycloak in a multi-node EKS cluster and run into similar issues, or if there is somewhere I should start looking.

An example error that I get from Grafana is `login.OAuthLogin(NewTransportWithCode)`.

I have a separate ticket open with the GitLab team in case it is a configuration issue on that side as well.

Thanks!

Attached is our GitLab client setup as an example. Note that we are using a client ID and secret as our credentials.

That Grafana case sounds like Grafana can’t exchange the authorization code for a token. Did you configure stickiness on the LB/reverse proxy you are using? Also make sure the Grafana client doesn’t have full scope enabled and that only minimal roles/groups are exposed via the role/group mappers. And of course, check the Keycloak and LB/proxy logs to find where it is timing out.
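For example, with an AWS load balancer whose target groups are managed by the AWS Load Balancer Controller, stickiness can usually be enabled via a target-group-attribute annotation on the Keycloak Service. A minimal sketch, with a hypothetical Service name and namespace (adjust both to whatever your chart creates):

```sh
# Sketch: assumes the AWS Load Balancer Controller reconciles this Service's
# target groups; "keycloak-http" and the "keycloak" namespace are placeholders.
kubectl -n keycloak annotate service keycloak-http --overwrite \
  'service.beta.kubernetes.io/aws-load-balancer-target-group-attributes=stickiness.enabled=true,stickiness.type=source_ip'
```

Stickiness only papers over the symptom, though; a correctly clustered Keycloak should be able to redeem the code on any node.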

I think you are probably right on that thought. We use Network Load Balancers, so as far as I know we cannot configure stickiness. I did ensure that the Grafana client does not have full scope enabled, and I don’t believe we have any roles other than the single Grafana role enabled at this time. I do notice this error in the Keycloak logs when I hit the Grafana error I posted above:

```
19:28:17,815 INFO  [org.keycloak.events] (default task-81) type=LOGIN, realmId=LLD, clientId=grafana, userId=bb2e1a22-c1e2-4742-a917-cf57aaf198c6, ipAddress=10.1.38.209, auth_method=openid-connect, auth_type=code, response_type=code, redirect_uri=https://grafana.domain.com/login/generic_oauth, consent=no_consent_required, code_id=8122f858-d9dd-46a6-9c75-bff242fb5055, username=jimi.doddo@domain.com, response_mode=query, authSessionParentId=8122f858-d9dd-46a6-9c75-bff242fb5055, authSessionTabId=yXN50Q1Qtvg
19:30:27,286 WARN  [org.keycloak.protocol.oidc.utils.OAuth2CodeParser] (default task-84) Code '38d32898-5ba0-4157-a5ad-2b8e73aa5586' already used for userSession '8122f858-d9dd-46a6-9c75-bff242fb5055' and client '6575d7db-b51d-4a64-a64f-1571873ca105'.
19:30:27,287 WARN  [org.keycloak.events] (default task-84) type=CODE_TO_TOKEN_ERROR, realmId=LLD, clientId=grafana, userId=null, ipAddress=10.1.38.209, error=invalid_code, grant_type=authorization_code, code_id=8122f858-d9dd-46a6-9c75-bff242fb5055, client_auth_method=client-secret
```

It’s almost as if the request times out and Grafana maybe tries to create another new session, which throws an error? It definitely feels like something is timing out, though. Note that traffic for our applications is routed through a private Istio ingress gateway, with Grafana connected to a passthrough Istio ingress gateway.

I believe your Infinispan cache setup is somehow broken. The request is routed to a different node, where the cache doesn’t have the code and the code can’t be retrieved from the other nodes, so the error is invalid_code.
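To make that concrete: on the WildFly-based distribution (which 14.x is), authorization codes and auth sessions live in distributed Infinispan caches with owners=1 by default, so every node must be able to see the cluster to look a code up remotely. If the pods never discover each other, each node only knows its own codes. One common approach with the jboss/keycloak image is DNS-based JGroups discovery plus extra cache owners; a sketch with hypothetical names (in a Helm deployment you would set these through the chart’s values rather than kubectl, but the effect is the same):

```sh
# Sketch: assumes the chart runs the jboss/keycloak (WildFly) image as a
# StatefulSet named "keycloak" with a headless Service "keycloak-headless"
# for JGroups discovery; all of these names are placeholders.
kubectl -n keycloak set env statefulset/keycloak \
  JGROUPS_DISCOVERY_PROTOCOL=dns.DNS_PING \
  JGROUPS_DISCOVERY_PROPERTIES=dns_query=keycloak-headless.keycloak.svc.cluster.local \
  CACHE_OWNERS_COUNT=3 \
  CACHE_OWNERS_AUTH_SESSIONS_COUNT=3
```

With the owner count equal to the node count, every pod holds a replica of the auth-session entries, so the code can be redeemed no matter which pod the load balancer picks.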


Got it. OK, yeah, I did nothing to set up the Infinispan cache on my end, so I wonder how it is being set up by default. Any ideas on how I can check whether that is the culprit?
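From what I can tell, one way to check is to look at each pod’s startup logs for the Infinispan/JGroups cluster view; a formed cluster lists every member, while an isolated node only lists itself. A sketch with hypothetical pod names:

```sh
# Each node logs its membership at startup and on topology changes. A healthy
# 3-node cluster looks something like:
#   ISPN000094: Received new cluster view for channel ejb: [keycloak-0|2] (3) [keycloak-0, keycloak-1, keycloak-2]
# An isolated node shows (1) with only itself in the member list.
kubectl -n keycloak logs keycloak-0 | grep -i "cluster view"
```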

This may be unrelated, but I noticed that the Grafana sign-in works ~90% of the time, while GitLab almost never even makes it to the Keycloak sign-in screen. The Grafana and Keycloak pods reside on the same node, while the GitLab webservice pods reside on different nodes from Keycloak. I wonder if that has something to do with it?

¯\_(ツ)_/¯ Only you know how it is deployed, and only you have access to the logs.

That’s fair. FWIW, we are deploying from an upstream open-source Helm chart for Keycloak (chart · main · Platform One / Big Bang / Packages / Security Tools / Keycloak · GitLab); the only values we updated were to attach our TLS certs and our custom realms.
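For what it’s worth, one way to rule the chart in or out is to dump what it (and the rendered workload) actually sets for discovery and cache owners; a sketch with placeholder repo/workload names:

```sh
# Placeholder repo/chart/workload names; substitute the ones you deploy.
# The workload may be a StatefulSet or a Deployment, depending on the chart.
helm show values <repo>/keycloak | grep -iE -A3 "jgroups|cache|discovery"
kubectl -n keycloak get statefulset keycloak -o yaml | grep -iE -A2 "JGROUPS|CACHE"
```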