Three-server clustered config

Dear folks,

We already have a working standalone.xml version with a separate backend database on another server.

I have three servers on which I want to spin up a clustered Keycloak environment. The documentation is a little confusing and mentions things like XA, node IDs, etc.

Is it just a matter of booting the same config using standalone-ha.xml on the three servers, or do I have to do more? The official Keycloak docs don’t mention node IDs, but other articles do… and I’m afraid I’m getting confused.

Hope someone can point me in the right direction!

Hi,

You can use the same standalone-ha.xml on each of the three servers, but the startup command line is individual to each node:
./standalone.sh --server-config=standalone-ha.xml -b=0.0.0.0 -bprivate=LOCAL_IP_ADDRESS -Djboss.node.name=NODE_X -Djboss.tx.node.id=NODE_X
where X is 1, 2, 3, …
In the Infinispan subsystem of standalone-ha.xml, the owners attribute of the distributed caches has to be set to a minimum of 2, so that every cache entry is held on at least two nodes.
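A minimal sketch of what that looks like, assuming the cache names shipped in recent WildFly-based Keycloak distributions (check your own standalone-ha.xml, the list may differ):

	<cache-container name="keycloak">
		<transport lock-timeout="60000"/>
		<distributed-cache name="sessions" owners="2"/>
		<distributed-cache name="authenticationSessions" owners="2"/>
		<distributed-cache name="offlineSessions" owners="2"/>
		<distributed-cache name="clientSessions" owners="2"/>
		<distributed-cache name="offlineClientSessions" owners="2"/>
		<distributed-cache name="loginFailures" owners="2"/>
		<distributed-cache name="actionTokens" owners="2"/>
		<!-- local and replicated caches stay as shipped -->
	</cache-container>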

I run an Apache SSL proxy in AJP balancer mode in front of the nodes; the config in your case could look like this:

<Proxy balancer://cluster>
	BalancerMember ajp://ADDRESS_OF_NODE_1:8009 route=NODE_1 status=
	BalancerMember ajp://ADDRESS_OF_NODE_2:8009 route=NODE_2 status=
	BalancerMember ajp://ADDRESS_OF_NODE_3:8009 route=NODE_3 status=
	ProxySet stickysession=AUTH_SESSION_ID
	Allow from all
</Proxy>
<IfModule mod_ssl.c>
	<VirtualHost _default_:443>
		#...SSL-Stuff...omitted
		ProxyPass /auth balancer://cluster/auth
		ProxyPassReverse /auth balancer://cluster/auth
	</VirtualHost>
</IfModule>

If you run the Apache proxy on every node and put a network load balancer in front of the proxies, you get a rock-solid setup…

But you should consider switching to the Quarkus-based distribution; support and development of the WildFly-based version is being phased out…
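For comparison, a clustered start on the Quarkus distribution looks roughly like this (hostnames and credentials are placeholders, and the flag names are from the early Quarkus releases, so check kc.sh --help for your version):

	bin/kc.sh build --db=mariadb
	bin/kc.sh start \
		--db-url=jdbc:mariadb://DB_HOST:3306/keycloak \
		--db-username=keycloak --db-password=CHANGE_ME \
		--hostname=auth.example.com --proxy=edge \
		--cache=ispn --cache-stack=tcp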

Thanks so much for your help!

I am almost there, but I think I might be missing something still…

I have three nodes sitting behind an HAProxy load balancer with pinned sessions. Login on the admin interface is working, but login via the sites is not… I think this is because there are a lot of services that require OIDC auth tokens, and I’m guessing the cluster nodes can’t validate each other’s tokens.

Does this sound likely? Does that mean the broadcast network isn’t working properly?

Does it work with only one node running? Is there something helpful in the logs? Messages like “… joined the cluster” or “…has left the cluster” when starting or shutting down nodes?

Yes, it does work with one node running, but I am not seeing any messages about joining or leaving the cluster…

I did see a few “Received new cluster view for channel ejb: [NODE_DK3|0] (1) [NODE_DK3]” lines.

One thing I did notice, which I imagine has something to do with it, is that the new nodes are generating self-signed certs, whereas the single node I was originally using is not. Presumably there’s some way to inject a certificate into the container, which must be a step I’m missing…?

By default, standalone-ha.xml is configured to use JGroups over UDP to discover the cluster nodes, using the multicast address 230.0.0.4 on port 45688. Is there anything in your network or on the hosts blocking this?
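The relevant socket bindings usually look like this in standalone-ha.xml (names and ports taken from the stock WildFly config, so verify against your own file):

	<socket-binding name="jgroups-mping" interface="private" multicast-address="${jboss.default.multicast.address:230.0.0.4}" multicast-port="45700"/>
	<socket-binding name="jgroups-tcp" interface="private" port="7600"/>
	<socket-binding name="jgroups-udp" interface="private" port="55200" multicast-address="${jboss.default.multicast.address:230.0.0.4}" multicast-port="45688"/>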

I’ve made sure that was exposed and visible on the external interface, but it didn’t seem to make a difference. Or at least, I’m not seeing any mention of, e.g., NODE_DK3 finding NODE_DK1 or NODE_DK2.

Something I’ve just thought of, which probably has a lot to do with it, is that the three servers are actually three VPSs on the same network, each running Docker. For “reasons” this is not one Kubernetes cluster running three Keycloak containers; it’s three VPSs, each running its own Docker engine with a single Keycloak container.

In this layout, I’m wondering whether this constitutes a “cross-datacenter” config, even though they all share the same Galera cluster DB backend?

OK… dusting this off, and I think I’m making progress.

I’ve switched to using TCPPING, explicitly setting initial_hosts to the IP addresses of the three nodes.
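For anyone following along, the change is along these lines (node addresses are placeholders, and the exact protocol schema varies between WildFly versions, so treat this as a sketch rather than a drop-in config):

	<channels default="ee">
		<channel name="ee" stack="tcp" cluster="ejb"/>
	</channels>
	<stacks>
		<stack name="tcp">
			<transport type="TCP" socket-binding="jgroups-tcp"/>
			<protocol type="org.jgroups.protocols.TCPPING">
				<property name="initial_hosts">NODE_DK1_IP[7600],NODE_DK2_IP[7600],NODE_DK3_IP[7600]</property>
				<property name="port_range">0</property>
			</protocol>
			<!-- remaining protocols (MERGE3, FD_SOCK, pbcast.NAKACK2, ...) left as shipped -->
		</stack>
	</stacks>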

However, when I fire this up I’m getting:

WFLYCTL0062: Composite operation failed and was rolled back. Steps that failed:
Step: step-9
Operation: /subsystem=datasources/jdbc-driver=mariadb:add(driver-name=mariadb, driver-module-name=org.mariadb.jdbc, driver-xa-datasource-class-name=org.mariadb.jdbc.MySQLDataSource)
Failure: WFLYCTL0212: Duplicate resource [
    ("subsystem" => "datasources"),
    ("jdbc-driver" => "mariadb")
]

To me, this says clustering is… kind of… working, in that I only get this error when initial_hosts is set and node discovery is allowed.
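In case it helps with diagnosis, the driver resources already present on a node can be listed through the management CLI (assuming jboss-cli.sh is reachable inside the container):

	./jboss-cli.sh --connect --command="/subsystem=datasources:read-children-names(child-type=jdbc-driver)"
	./jboss-cli.sh --connect --command="/subsystem=datasources/jdbc-driver=mariadb:read-resource"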

But I’m scratching my head as to what’s going on. All three nodes are identical Keycloak containers configured with identical standalone-ha.xml files, and the only per-node differences are the jboss.tx.node.id, jboss.node.name and jboss.bind.address / jboss.bind.address.private values.
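For completeness, each container is launched along these lines (image name, mount path and addresses are illustrative rather than my exact setup; the jboss/keycloak image passes extra arguments through to standalone.sh, and host networking is assumed so JGroups binds the VPS’s real interface):

	docker run -d --name keycloak --network host \
		-v /opt/keycloak/standalone-ha.xml:/opt/jboss/keycloak/standalone/configuration/standalone-ha.xml \
		jboss/keycloak \
		--server-config=standalone-ha.xml \
		-Djboss.node.name=NODE_DK1 -Djboss.tx.node.id=NODE_DK1 \
		-Djboss.bind.address=0.0.0.0 -Djboss.bind.address.private=VPS_LOCAL_IP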

Any clues where I might look?