Issue with Keycloak 6.0.0 Cluster Utilizing 20 Gb/s Bandwidth

Hello -

We’re running Keycloak 6.0.0, clustered using the default configuration that ships in standalone-ha.xml (we’re not using cross-datacenter replication).
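For context, the relevant piece of the stock standalone-ha.xml is the Infinispan cache container that holds the session caches, including the offline-session cache involved below. This is paraphrased from memory rather than copied from our deployment, so treat the exact attributes as approximate:

```xml
<!-- Approximate sketch of the stock Keycloak cache definitions in
     standalone-ha.xml; attribute values here are illustrative. -->
<cache-container name="keycloak">
    <transport lock-timeout="60000"/>
    <distributed-cache name="sessions" owners="1"/>
    <distributed-cache name="offlineSessions" owners="1"/>
    <distributed-cache name="clientSessions" owners="1"/>
    <distributed-cache name="offlineClientSessions" owners="1"/>
</cache-container>
```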

After a lengthy period of stability, we can no longer cluster the nodes together. The application logs have error messages such as:

2019-12-05 13:36:19,494 ERROR [org.keycloak.models.sessions.infinispan.util.FuturesHelper] (Timer-3) Exception when waiting for future: java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 5242042 from XXXXXX
    at java.util.concurrent.CompletableFuture.reportGet(
    at java.util.concurrent.CompletableFuture.get(
    at org.keycloak.models.sessions.infinispan.util.FuturesHelper.waitForAllToFinish(
    at org.keycloak.models.sessions.infinispan.InfinispanUserSessionProvider.removeExpiredOfflineUserSessions(
    at org.keycloak.models.sessions.infinispan.InfinispanUserSessionProvider.removeExpired(
    at org.keycloak.timer.basic.BasicTimerProvider$
    at java.util.TimerThread.mainLoop(
Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 5242042 from XXXXXX
    at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(
    at java.util.concurrent.ScheduledThreadPoolExecutor$
    at java.util.concurrent.ThreadPoolExecutor.runWorker(
    at java.util.concurrent.ThreadPoolExecutor$

When this happens, we’ve observed the Keycloak nodes generating 20 Gb/s of network traffic.

We’ve been able to reproduce the issue in our QA environment by generating a large number of offline tokens and then letting them expire from the Infinispan cache. We took a network capture while the cache was being purged. Each datagram is 50 kB in size, which is enormous. The traffic eventually grows until every network link saturates and everything fails. We examined one datagram and were surprised to see Java stack traces inside (best seen in the following screenshot, since it is difficult to format):
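For anyone wanting to reproduce this, the offline tokens were generated by hammering Keycloak’s OpenID Connect token endpoint with scope=offline_access. A minimal sketch of how we built those requests follows; the realm, client, and credential values are placeholders, not our real configuration:

```python
# Sketch of the offline-token generation step used in our QA reproduction.
# The realm name, client_id, and credentials below are placeholders.
from urllib.parse import urlencode


def build_offline_token_request(base_url, realm, client_id, username, password):
    """Build the URL and form body for Keycloak's OIDC token endpoint.

    Requesting scope=offline_access makes Keycloak issue an offline token,
    which is what populates the offline-session Infinispan cache.
    """
    url = f"{base_url}/auth/realms/{realm}/protocol/openid-connect/token"
    form = {
        "grant_type": "password",
        "client_id": client_id,
        "username": username,
        "password": password,
        "scope": "offline_access",  # this is what makes the token offline
    }
    return url, urlencode(form)


# Example: POST this body in a loop (e.g. once per test user, tens of
# thousands of times) to fill the offline-session cache, then wait for
# the tokens to expire and watch the purge traffic.
url, body = build_offline_token_request(
    "http://localhost:8080", "demo", "test-client", "user1", "secret")
```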

We haven’t confirmed it yet, but we suspect the nodes are sending each other a storm of sorts: one node sends a datagram containing an exception, which triggers an error and a further exception on the receiving side, and so on until the datagrams grow past the network packet size limit of ~60 kB.

Has anybody seen anything like this? Is it a known bug? Any advice for troubleshooting or fixing?


I posted a more detailed version of this question and answered it here: