Experiencing Infinispan Timeouts

We have been running Keycloak on AWS as an ECS application for about 3 months now. We occasionally experience Infinispan timeouts (shown below) when under load. These cause a spate of login errors until the system recovers. Our Infinispan stack is also shown below. Does anyone have any thoughts on how we might debug this? We have reviewed other posts in this forum, but nothing has helped so far. Thanks!

Error:

2020-08-09 01:44:13,170 ERROR [org.keycloak.services.error.KeycloakErrorHandler] (default task-1957) env=dev node_ip=10.116.53.26 ecs_cluster_name=keycloak-service-cluster ecs_service_name=keycloak-service Uncaught server error: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 30113426 from ip-10-116-52-251
	at org.infinispan@9.4.16.Final//org.infinispan.interceptors.impl.AsyncInterceptorChainImpl.invoke(AsyncInterceptorChainImpl.java:259)
	at org.infinispan@9.4.16.Final//org.infinispan.cache.impl.CacheImpl.executeCommandAndCommitIfNeeded(CacheImpl.java:1918)
	at org.infinispan@9.4.16.Final//org.infinispan.cache.impl.CacheImpl.putIfAbsent(CacheImpl.java:1474)
	at org.infinispan@9.4.16.Final//org.infinispan.cache.impl.DecoratedCache.putIfAbsent(DecoratedCache.java:695)

Infinispan stack configuration:

                <stack name="tcp">
                    <transport type="TCP" socket-binding="jgroups-tcp"/>
                    <protocol type="JDBC_PING">
                        <property name="datasource_jndi_name">java:jboss/datasources/KeycloakDS
                        </property>
                        <property name="remove_old_coords_on_view_change">true</property>
                        <property name="remove_all_data_on_view_change">true</property>
                        <property name="initialize_sql">
                            CREATE TABLE IF NOT EXISTS JGROUPSPING (
                            own_addr varchar(200) NOT NULL,
                            bind_addr varchar(200) NOT NULL,
                            created timestamp NOT NULL,
                            cluster_name varchar(200) NOT NULL,
                            ping_data BYTEA,
                            constraint PK_JGROUPSPING PRIMARY KEY (own_addr, cluster_name)
                            )
                        </property>
                        <property name="insert_single_sql">INSERT INTO JGROUPSPING (own_addr, bind_addr, created, cluster_name, ping_data) values (?, '${jboss.bind.address.private:UNKNOWN}', NOW(), ?, ?)</property>
                        <property name="delete_single_sql">DELETE FROM JGROUPSPING WHERE own_addr=? AND cluster_name=?</property>
                        <property name="select_all_pingdata_sql">SELECT ping_data, own_addr, cluster_name FROM JGROUPSPING WHERE cluster_name=?</property>
                    </protocol>
                    <protocol type="MERGE3"/>
                    <socket-protocol type="FD_SOCK" socket-binding="jgroups-tcp-fd"/>
                    <protocol type="FD_ALL"/>
                    <protocol type="VERIFY_SUSPECT"/>
                    <protocol type="pbcast.NAKACK2"/>
                    <protocol type="UNICAST3"/>
                    <protocol type="pbcast.STABLE"/>
                    <protocol type="pbcast.GMS"/>
                    <protocol type="MFC"/>
                    <protocol type="FRAG3"/>
                </stack>
            </stacks>

Are you set up to use a cross-datacenter remote Infinispan deployment? That is, do you have Keycloak nodes and separate Infinispan nodes?

The configuration you posted is the JGroups TCP JDBC_PING configuration for cluster discovery. It’s how new Keycloak nodes discover the existing nodes in the cluster to sync their caches from.

There’s also configuration for state-transfer timeouts on caches:

<replicated-cache name="offlineSessions">
    <!-- default settings -->
    <state-transfer timeout="240000" chunk-size="512"/>
</replicated-cache>
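
For context, here is a rough sketch of where a fragment like that would sit, assuming the WildFly-based Keycloak distribution where caches are defined in the `keycloak` cache container of standalone-ha.xml. As far as I understand it, the state-transfer timeout covers a joining node pulling initial state, while a timeout on a normal cache operation such as the `putIfAbsent` in the stack trace is more likely governed by the cache's `remote-timeout` attribute. The `remote-timeout` value below is purely illustrative, not a tested recommendation, and the surrounding container layout is assumed rather than taken from the posted config.

```xml
<!-- Sketch only: assumes the WildFly Infinispan subsystem layout in standalone-ha.xml -->
<cache-container name="keycloak">
    <transport lock-timeout="60000"/>
    <!-- remote-timeout is in milliseconds; 30000 is an illustrative value, not a tuned one -->
    <replicated-cache name="offlineSessions" remote-timeout="30000">
        <!-- timeout for a joining node to receive initial state (values from the fragment above) -->
        <state-transfer timeout="240000" chunk-size="512"/>
    </replicated-cache>
</cache-container>
```

Raising a timeout only hides a slow or unreachable peer, so it is probably worth checking network latency and GC pauses on the node named in the ISPN000476 message before changing anything.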

Hi Eloyot,

I have the same issue. How did you fix it?