Hello,
I have a high availability deployment of Keycloak 24 with an independent Infinispan distributed cache.
There is an intermittent error during a PKCE Authorization Code login when the user authenticates via the Google Identity Provider.
I have traced the problem down to the Authentication Session's active code not matching the code carried in the state parameter when the callback from Google is processed.
I can see in the logs that the Authentication Session's active code is updated before the state parameter for the redirect to Google is generated. The active code forms part of the state value.
Sometimes the Authentication Session returns the old active code. Because the active code and the state do not match, Keycloak considers this an Illegal Hash and directs the user to the Login Timeout page.
The active code may be updated by a different Keycloak instance than the one that handles the callback, or by the same instance; I believe I have observed the failure in both scenarios. There appears to be a problem with the Authentication Session distributed cache not being updated consistently.
Here is the configuration for Keycloak’s authenticationSessions cache:
<distributed-cache name="authenticationSessions" owners="2" statistics="true">
    <expiration lifespan="-1"/>
    <persistence passivation="false">
        <remote-store xmlns="urn:infinispan:config:store:remote:14.0"
                      cache="authenticationSessions"
                      raw-values="true"
                      shared="true"
                      segmented="false"
                      marshaller="org.keycloak.cluster.infinispan.KeycloakHotRodMarshallerFactory">
            <remote-server host="${env.KC_CACHE_REMOTE_STORE_HOST}" port="${env.KC_CACHE_REMOTE_STORE_PORT}"/>
            <connection-pool max-active="16" exhausted-action="CREATE_NEW"/>
            <security>
                <authentication server-name="infinispan">
                    <digest username="${env.KC_CACHE_REMOTE_STORE_USERNAME}"
                            password="${env.KC_CACHE_REMOTE_STORE_PASSWORD}"
                            realm="${env.KC_CACHE_REMOTE_STORE_REALM}"/>
                </authentication>
            </security>
        </remote-store>
    </persistence>
</distributed-cache>
The Infinispan cluster consists of 6 instances with the cache owners set to 2 for each distributed cache.
Here is the Infinispan configuration for the authenticationSessions cache:
infinispan:
  cacheContainer:
    # Cache container name is used for defining health status probe URL paths
    name: default
    statistics: true
    caches:
      # Template for distributed caches
      distributed-cache-cfg:
        distributed-cache-configuration:
          mode: SYNC
          statistics: "true"
          locking:
            isolation: READ_COMMITTED
            # Disable striping to create a new lock per entry
            striping: false
            # Amount of time, in milliseconds, to wait for a contended lock
            acquire-timeout: 20000
          transaction:
            mode: NON_XA
            locking: PESSIMISTIC
          encoding:
            mediaType: application/x-jboss-marshalling
      # Keycloak authentication sessions cache
      authenticationSessions:
        distributed-cache:
          configuration: distributed-cache-cfg
          owners: 2
The Infinispan documentation states:
.. read can return the value from any owner, depending on how fast the primary owner replies. The write is not atomic across all the owners. In fact, the primary commits the update only after it receives a confirmation from the backup. While the primary is waiting for the confirmation message from the backup, reads from the backup will see the new value, but reads from the primary will see the old one.
This may explain why we intermittently see an old value returned by Infinispan. However, the update of the "active_code" occurs many seconds before the read, since the read does not happen until the user has completed their login with Google (20+ seconds). One would expect any delay in updating/committing the entry on both owner nodes to be measured in milliseconds, not seconds or minutes.
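As a sanity check, below is a minimal probe I am considering running against the external Infinispan cluster. It is only a sketch: it assumes a Java Hot Rod client, reuses the connection environment variables from the remote-store configuration above, and uses a hypothetical scratch cache named "staleReadTest" (created from a built-in template) rather than the real authenticationSessions cache, since Keycloak marshals its own values into that one. Running it with the argument "write" on one host and without arguments on another should show whether the cluster itself still serves an old value seconds after an update.

import org.infinispan.client.hotrod.RemoteCache;
import org.infinispan.client.hotrod.RemoteCacheManager;
import org.infinispan.client.hotrod.configuration.ConfigurationBuilder;

// Rough stale-read probe: run with the argument "write" on one host,
// then run it without arguments on another host to poll the value.
public class StaleReadProbe {
    public static void main(String[] args) throws Exception {
        ConfigurationBuilder builder = new ConfigurationBuilder();
        // Connection and credential details are assumed to mirror the remote-store settings above.
        builder.addServer()
               .host(System.getenv("KC_CACHE_REMOTE_STORE_HOST"))
               .port(Integer.parseInt(System.getenv("KC_CACHE_REMOTE_STORE_PORT")))
               .security().authentication()
                   .username(System.getenv("KC_CACHE_REMOTE_STORE_USERNAME"))
                   .password(System.getenv("KC_CACHE_REMOTE_STORE_PASSWORD"))
                   .realm(System.getenv("KC_CACHE_REMOTE_STORE_REALM"));

        try (RemoteCacheManager manager = new RemoteCacheManager(builder.build())) {
            // Scratch cache created from a built-in template; do not use the real
            // authenticationSessions cache, which holds Keycloak-marshalled values.
            RemoteCache<String, String> cache = manager.administration()
                    .getOrCreateCache("staleReadTest", "org.infinispan.DIST_SYNC");
            String key = "probe-key"; // hypothetical key

            if (args.length > 0 && "write".equals(args[0])) {
                cache.put(key, "v-" + System.currentTimeMillis());
                System.out.println("wrote " + cache.get(key));
            } else {
                // Poll for ~60 seconds, printing a timestamp and the value each time,
                // so a read that still returns an old value long after the write is visible.
                for (int i = 0; i < 120; i++) {
                    System.out.printf("%d %s%n", System.currentTimeMillis(), cache.get(key));
                    Thread.sleep(500);
                }
            }
        }
    }
}

Note that this goes through the Hot Rod endpoint, so it does not exercise Keycloak's embedded distributed cache or the L1 cache discussed below; it only checks whether the shared remote store returns stale values.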
Additionally, the Infinispan documentation on the L1 cache states:
Infinispan nodes create local replicas when they retrieve entries from another node in the cluster. L1 caches avoid repeatedly looking up entries on primary owner nodes and adds performance.
Enabling L1 improves performance for read operations but requires primary owner nodes to broadcast invalidation messages when entries are modified.
An L1 cache (disabled by default) only exists if you set your cache mode to distribution. An L1 cache prevents unnecessary remote fetching of entries mapped to remote caches by storing them locally for a short time after the first time they are accessed. By default, entries in L1 have a lifespan of 60,000 milliseconds (though you can configure how long L1 entries are cached for). L1 entries are also invalidated when the entry is changed elsewhere in the cluster so you are sure you don't have stale entries cached in L1. Caches with L1 enabled will consult the L1 cache before fetching an entry from a remote cache.
The documentation for the distributed cache "l1-lifespan" configuration attribute states:
Maximum lifespan in milliseconds of an entry placed in the L1 cache. By default L1 is disabled unless a positive value is configured for this attribute. If the attribute is not present, L1 is disabled.
The documentation on the L1 cache is contradictory, as it says "By default L1 is disabled unless a positive value is configured" but also "By default, entries in L1 have a lifespan of 60,000 milliseconds".
If the L1 lifespan is in fact 60,000 ms (60 seconds), that could explain why we are seeing old values returned by the cache 22+ seconds after the update. A possible solution may be to explicitly disable the L1 cache by setting the "l1-lifespan" configuration attribute to -1.
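For reference, this is roughly how I would expect that change to look in both places, assuming the attribute is accepted alongside the other distributed-cache attributes shown above (the kebab-case spelling in the YAML follows my existing config and is an assumption on my part):

Keycloak cache XML:

<distributed-cache name="authenticationSessions" owners="2" statistics="true" l1-lifespan="-1">
    <!-- non-positive l1-lifespan should leave L1 explicitly disabled; remote-store etc. unchanged -->
    ...
</distributed-cache>

Infinispan server YAML:

      authenticationSessions:
        distributed-cache:
          configuration: distributed-cache-cfg
          owners: 2
          # non-positive lifespan => L1 explicitly disabled (attribute spelling assumed)
          l1-lifespan: "-1"

If the stale reads persist even with L1 explicitly off, that would point back to the owner-commit behaviour quoted above rather than to L1.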