Keycloak fails when it’s deployed, scaled in and/or scaled out

I’m running keycloak:9.0.3 on Docker Swarm with mssql and JDBC_PING as the discovery protocol. I noticed that Infinispan starts to fail with the exception below whenever a container is shut down.

Steps to reproduce

  1. Run 3 replicas (on different nodes) in Docker Swarm with mssql and JDBC_PING configured with:

JGROUPS_DISCOVERY_PROPERTIES=datasource_jndi_name=java:jboss/datasources/KeycloakDS,remove_all_data_on_view_change=true,info_writer_sleep_time=500

  2. Start a script that keeps making login requests to Keycloak.

  3. Then start killing containers and watch the logs; the errors should appear there.
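For completeness, the service is created roughly like this (a sketch of my setup; the jboss/keycloak image name, network, and DB values are placeholders):

    # Sketch only: 3 Keycloak replicas on Swarm, mssql database, JDBC_PING discovery.
    docker service create \
      --name keycloak \
      --replicas 3 \
      --network keycloak-net \
      -e DB_VENDOR=mssql \
      -e DB_ADDR=mssql \
      -e DB_DATABASE=keycloak \
      -e DB_USER=keycloak \
      -e DB_PASSWORD=change-me \
      -e JGROUPS_DISCOVERY_PROTOCOL=JDBC_PING \
      -e JGROUPS_DISCOVERY_PROPERTIES="datasource_jndi_name=java:jboss/datasources/KeycloakDS,remove_all_data_on_view_change=true,info_writer_sleep_time=500" \
      jboss/keycloak:9.0.3

The exception that shows up when a container goes down: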

    : java.lang.NullPointerException
    at org.jgroups@4.1.4.Final//org.jgroups.protocols.JDBC_PING.clearTable(JDBC_PING.java:362)
    at org.jgroups@4.1.4.Final//org.jgroups.protocols.JDBC_PING.removeAll(JDBC_PING.java:190)
    at org.jgroups@4.1.4.Final//org.jgroups.protocols.JDBC_PING.stop(JDBC_PING.java:119)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
    at org.jgroups@4.1.4.Final//org.jgroups.stack.ProtocolStack.stopStack(ProtocolStack.java:906)
    at org.jgroups@4.1.4.Final//org.jgroups.JChannel.stopStack(JChannel.java:1076)
    at org.jgroups@4.1.4.Final//org.jgroups.JChannel._close(JChannel.java:1063)
    at org.jgroups@4.1.4.Final//org.jgroups.JChannel.close(JChannel.java:454)
    at org.jboss.as.clustering.jgroups@18.0.1.Final//org.jboss.as.clustering.jgroups.subsystem.ChannelServiceConfigurator.accept(ChannelServiceConfigurator.java:132)
    at org.jboss.as.clustering.jgroups@18.0.1.Final//org.jboss.as.clustering.jgroups.subsystem.ChannelServiceConfigurator.accept(ChannelServiceConfigurator.java:58)
    at org.wildfly.clustering.service@18.0.1.Final//org.wildfly.clustering.service.FunctionalService.stop(FunctionalService.java:77)
    at org.wildfly.clustering.service@18.0.1.Final//org.wildfly.clustering.service.AsyncServiceConfigurator$AsyncService.lambda$stop$1(AsyncServiceConfigurator.java:142)
    at org.jboss.threads@2.3.3.Final//org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35)
    at org.jboss.threads@2.3.3.Final//org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:1982)
    at org.jboss.threads@2.3.3.Final//org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1486)
    at org.jboss.threads@2.3.3.Final//org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1377)
    at java.base/java.lang.Thread.run(Thread.java:834)
    at org.jboss.threads@2.3.3.Final//org.jboss.threads.JBossThread.run(JBossThread.java:485)

Also, in some cases there are a lot of errors like these (I think they’re related):

 Error executing command PutKeyValueCommand on Cache 'authenticationSessions', writing keys [5efd900f-05b2-4aac-9c6c-f9cc74490f01]
Error executing command PutKeyValueCommand on Cache 'clientSessions', writing keys [ca20afb1-53bb-4d59-8914-12b6c1962c15]
... similar messages ...

Those errors make my system unstable, and clients start receiving a lot of errors (400s and 500s).

Another related exception:

: java.sql.SQLException: javax.resource.ResourceException: IJ000470: You are trying to use a connection factory that has been shut down: java:jboss/datasources/KeycloakDS
at org.jboss.ironjacamar.jdbcadapters@1.4.17.Final//org.jboss.jca.adapters.jdbc.WrapperDataSource.getConnection(WrapperDataSource.java:159)
at org.jboss.as.connector@18.0.1.Final//org.jboss.as.connector.subsystems.datasources.WildFlyDataSource.getConnection(WildFlyDataSource.java:64)
at org.jgroups@4.1.4.Final//org.jgroups.protocols.JDBC_PING.getConnection(JDBC_PING.java:310)
at org.jgroups@4.1.4.Final//org.jgroups.protocols.JDBC_PING.clearTable(JDBC_PING.java:361)
at org.jgroups@4.1.4.Final//org.jgroups.protocols.JDBC_PING.removeAll(JDBC_PING.java:190)
at org.jgroups@4.1.4.Final//org.jgroups.protocols.JDBC_PING.stop(JDBC_PING.java:119)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
.....

Hi,
If it’s helpful, we are using KUBE_PING with a Kubernetes cluster. Our DevOps engineer created a service account with RBAC and the pods could see each other.
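Roughly, the RBAC side just gives the pods’ service account permission to list pods in the namespace, something like this (the names and namespace here are examples, not our exact manifests):

    # Example names/namespace only. KUBE_PING needs to get/list pods in the
    # namespace so the Keycloak members can discover each other.
    kubectl create serviceaccount keycloak -n keycloak
    kubectl create role keycloak-pod-reader -n keycloak --verb=get --verb=list --resource=pods
    kubectl create rolebinding keycloak-pod-reader -n keycloak \
      --role=keycloak-pod-reader \
      --serviceaccount=keycloak:keycloak

The Keycloak pods then simply run under that service account.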

We finally managed to merge one of my fixes for JDBC_PING: https://github.com/keycloak/keycloak-containers/pull/255

It should get better.

Hey, thanks for your reply!

Unfortunately, Kubernetes is not an option since the company uses Docker Swarm :confused:

Hey,

Thanks for your suggestion, I’ll give it a try.

Regarding 15c5b97:
Basically, you removed the PING and MPING protocols from the configuration. Is that OK? Since I’m using 9.0.3, I’ll create a custom startup script, something like the sketch below.
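Something along these lines is what I have in mind (an untested sketch; the stack and protocol names may differ in the standalone-ha.xml that ships with 9.0.3):

    # Untested sketch: drop PING/MPING so the JDBC_PING added via
    # JGROUPS_DISCOVERY_PROTOCOL stays the only discovery protocol in the stack,
    # mirroring what 15c5b97 does. Run it before the server starts.
    $JBOSS_HOME/bin/jboss-cli.sh <<'EOF'
    embed-server --server-config=standalone-ha.xml --std-out=echo
    batch
    /subsystem=jgroups/stack=udp/protocol=PING:remove()
    /subsystem=jgroups/stack=tcp/protocol=MPING:remove()
    run-batch
    stop-embedded-server
    EOF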

Yeah, sure. You need to have (at least) one discovery protocol in your stack. MPING/PING/JDBC_PING are all different implementations of the discovery protocol.


Hi @slaskawi,

After some tests, the errors about deleting the ping data disappeared, but downtime is still a concern: clients still get 500 errors.

Is there a way to avoid these kinds of errors?

I’m thinking about lowering the discovery timeout, or maybe configuring the cache to be async?

Any help would be great :confused:

It seems to be working with Keycloak 11.0.2 + the JDBC_PING patch.

I’ll keep you posted after a while

Thanks!