Experience with "Cross-Datacenter Replication Mode" in a production environment?

I see that Keycloak's “Cross-Datacenter Replication Mode” has been in the technical preview stage for several years.

I wonder whether somebody can share their experience with this feature. Is it stable enough? Has anyone used it in a critical scenario?

Thank you.

Hi! We have tried to set up this mode a few times on different versions of Keycloak, but every attempt was unsuccessful. Here is a discussion of these issues: https://groups.google.com/g/keycloak-user/c/mm04q0C9oW0/m/8thToqQeBAAJ?utm_medium=email&utm_source=footer&pli=1
We finally managed to set up a production cross-regional Keycloak infrastructure based on a master-slave cross-regional database (AWS Aurora PostgreSQL), using the JDBC_PING discovery protocol from JGroups (see "Reliable group communication with JGroups").

Hi,
I had a look at the discussion you mentioned. May I ask whether that mode offers good performance and is stable (even though it has been in tech preview for years)?

Also, after you set up that mode, what happens when you try to upgrade the Keycloak version, say from 14.0 to 15.0? Will Keycloak 15.0 be able to pick up session data/cache from the Infinispan remote cache?

Thanks.

As I mentioned, we failed to set up this mode in accordance with the Keycloak documentation, so we chose our own path, and the performance is good for our case.
Regarding your second question, Keycloak does not guarantee zero-downtime upgrades regardless of the architecture you are using (see "11.0.3 to 12.0.1 Upgrade fails - #5 by dasniko"). E.g. we cannot upgrade from 11.x to 12.x without losing all authenticated sessions. We do perform zero-downtime upgrades while staying on the 11.x version: we drain traffic from one region, upgrade that region, switch traffic to the upgraded region, and then upgrade the second region. Each region has a cluster of 3 Keycloak nodes. Cross-region replication of the Infinispan caches is enabled by setting CACHE_OWNERS_COUNT=4, so at least one copy of each cache entry is stored in the other region.
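The reasoning behind CACHE_OWNERS_COUNT=4 can be checked with a quick counting sketch. It assumes two regions of 3 nodes each joined into one cluster (the node names are made up), and it only enumerates possible owner sets; it does not model how Infinispan actually picks owners.

```python
from itertools import combinations

# Hypothetical node names: two regions, three Keycloak nodes each.
NODES = {
    "region-a": ["a1", "a2", "a3"],
    "region-b": ["b1", "b2", "b3"],
}


def regions_always_covered(nodes_by_region, owners):
    """Return True if every possible set of `owners` distinct nodes
    contains at least one node from each region."""
    region_of = {n: r for r, ns in nodes_by_region.items() for n in ns}
    all_regions = set(nodes_by_region)
    for owner_set in combinations(region_of, owners):
        if {region_of[n] for n in owner_set} != all_regions:
            return False  # found an owner set confined to one region
    return True


# Print, for each owner count, whether both regions are guaranteed a copy.
for owners in range(1, 7):
    print(owners, regions_always_covered(NODES, owners))
```

With 3 nodes per region, any owner count of 3 or less can end up entirely in one region, while 4 or more cannot. The guarantee weakens if a region grows beyond 3 nodes without the owner count being raised accordingly.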

Hi,

So, from your experience, may I conclude that:

  1. Up to now, zero-downtime upgrades have never worked for ‘major’ Keycloak version upgrades (with or without external Infinispan). I am actually using RHSSO, the commercial version of Keycloak (but without a dedicated/external Infinispan cache). Even for a ‘minor’ version upgrade (e.g. RHSSO 7.4.0 [based on Keycloak 9.0.3] to 7.4.4 [based on Keycloak 9.0.10]), we lose all the authenticated sessions!
  2. For minor version upgrades, zero downtime may work.

I don’t have much experience with Infinispan/Keycloak session replication, but do you think it is possible to write a helper utility that listens for session data creation/deletion/update and ‘replicates’ the changes to the new Keycloak version?

Use case:

  1. Set up a separate deployment of the new Keycloak version, say Keycloak 15.0 (existing = 14.0), with a dedicated DB. At this point the two share nothing, and the new version does not serve traffic initially.
  2. Before we start switching to the new version, read the session data in the existing version and manually apply the changes to the new version (this may involve some Infinispan cache listener?). This may require handling changes in the Keycloak session data model (e.g. mapping a 14.0 session object to a 15.0 session object). During this period (e.g. 5 mins), we may also need to set up a DB tool to sync data from the existing DB to the new DB (if the table schema changed).
  3. When all existing session data (or at least most of it) has been ‘replicated’ to the new version, switch all the traffic to the new version. This task should be done around midnight on a Saturday/Sunday.
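
The idea in steps 1 to 3 could be sketched roughly as below. This is a toy model in Python, not Keycloak or Infinispan code: the stores are plain dictionaries, and `SessionStore`, the event names, and the field mapping in `map_session` are all invented for illustration. A real tool would need an actual cache listener and detailed knowledge of both versions' session models.

```python
from dataclasses import dataclass, field


@dataclass
class SessionStore:
    """Stand-in for one cluster's session cache (hypothetical)."""
    sessions: dict = field(default_factory=dict)
    listeners: list = field(default_factory=list)

    def _emit(self, event, session_id, data=None):
        for listener in self.listeners:
            listener(event, session_id, data)

    def put(self, session_id, data):
        event = "updated" if session_id in self.sessions else "created"
        self.sessions[session_id] = data
        self._emit(event, session_id, data)

    def remove(self, session_id):
        if self.sessions.pop(session_id, None) is not None:
            self._emit("removed", session_id)


def map_session(old):
    """Invented mapping from an 'old' session shape to a 'new' one."""
    return {
        "userId": old["user"],
        "realm": old["realm"],
        "lastAccess": old["last_access"],
    }


def start_replication(old_store, new_store):
    """Subscribe to changes first, then copy what already exists."""
    def on_event(event, session_id, data):
        if event == "removed":
            new_store.sessions.pop(session_id, None)
        else:  # "created" or "updated"
            new_store.sessions[session_id] = map_session(data)

    old_store.listeners.append(on_event)
    # Bulk copy of the sessions that existed before the listener was added.
    for session_id, data in list(old_store.sessions.items()):
        new_store.sessions[session_id] = map_session(data)


old, new = SessionStore(), SessionStore()
old.put("s1", {"user": "alice", "realm": "demo", "last_access": 100})
start_replication(old, new)
old.put("s2", {"user": "bob", "realm": "demo", "last_access": 200})
old.remove("s1")
print(sorted(new.sessions))
```

The listener is registered before the bulk copy so that changes arriving during the copy are not missed. In a concurrent system the ordering between the two would still need care, which this single-threaded sketch does not address.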

I am not sure how complex it would be to create such a ‘migration’ tool. I see that some other OIDC implementations use a similar approach for zero-downtime migration (e.g. WSO2 Identity Server).

Thank you.

Hi @ping ,

  1. We managed to upgrade from 10.x to 11.x without losing any auth sessions, but not from 11.x to 12.x, so a major version upgrade may or may not succeed with zero downtime. It depends on the changes made to the Infinispan subsystem in the new version.
  2. For minor version upgrades zero-downtime maintenance works.

Regarding the cache syncing/porting automation, do you really want to spend so much effort on such a rare event just to preserve auth sessions? I do not know your case, but it does not seem that disruptive, given how rarely it happens. Also, why are you forced to move to a major version as soon as it is out?

Hi @efimovms,

My organization runs some 24x7 operations, so I want to explore whether a zero-downtime mechanism is possible.

As we use an RHSSO cluster (without external/standalone Infinispan, for certain reasons), we found that the cluster loses auth sessions on a minor version upgrade (e.g. RHSSO 7.4.3 to 7.4.4). It seems that when the JGroups library is upgraded, an RHSSO minor version upgrade fails due to incompatibility in clustering messages/session replication.

Anyway, many thanks for your advice and sharing.