Kubernetes is often described as self-healing.
But if you run Keycloak at scale, you’ve probably learned the hard way that “self-healing” doesn’t always mean identity-safe.
One short infrastructure event — sometimes just a few seconds — can cascade into random logouts, stuck sessions, database locks, or even a full authentication outage.
This post explains why Keycloak nodes appear to “freeze,” what actually breaks under the hood, and how to respond like an SRE instead of firefighting in panic.
A Familiar Alert, a Missing Node
You get an alert:
“Potential Split-Brain Risk detected on node aks-worker-9u”
You immediately check the cluster:
kubectl get nodes
The node is gone.
No NotReady.
No SchedulingDisabled.
No trace.
The Keycloak pod that was running there? Also gone.
At first glance, this feels terrifying — did Kubernetes just eat my node?
Not quite.
This is not a Kubernetes bug.
And it’s not a Keycloak bug either.
It’s the interaction between cloud infrastructure maintenance and distributed identity coordination.
1. The “Freeze”: What Is Actually Happening?
Cloud providers like Azure must regularly maintain physical hosts. Your VMs don’t live on magic — they live on real hardware.
Azure typically handles this in one of two ways:
Option 1: Planned Upgrade (Reboot)
- The VM is stopped
- The OS is patched
- The VM reboots
- Kubernetes sees a clean node restart
This is noisy but predictable.
Option 2: Live Migration (“The Freeze”)
- Azure migrates the VM to a new physical host
- To copy memory safely, the VM’s CPU is paused
- The pause typically lasts a few seconds (sometimes longer)
💥 This is the dangerous one
From the outside:
- The node stops responding
- Heartbeats disappear
- Network traffic pauses
From inside the VM:
- Time effectively jumps forward
- The process never “crashed”
- Keycloak thinks it was running continuously
This mismatch is the root of the problem.
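Azure announces much of this maintenance in advance through its Instance Metadata Service. If you want early warning, you can poll the Scheduled Events endpoint from inside any node; a Freeze event type is exactly the live-migration pause described above. A minimal check, run from the VM (or a pod on it):
# Ask Azure what maintenance is planned for this VM.
# EventType "Freeze" means the CPU will be paused for a live migration.
curl -s -H "Metadata: true" "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"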
2. Why Keycloak Is Especially Sensitive
Keycloak is not a stateless web app.
Under the hood, it relies on:
- Infinispan for distributed caches (sessions, tokens, login state)
- JGroups for cluster membership and messaging
- Leader election for background jobs
- Database locks to prevent duplicate work
This design is powerful — but it assumes reasonable clock and heartbeat consistency.
A freeze violates that assumption.
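You can inspect this machinery directly. In the stock Keycloak (Quarkus) container image, the Infinispan cache and JGroups stack definitions live in a single XML file; the path below is an assumption based on the default image layout, so adjust it for custom builds:
# Dump the distributed cache and JGroups configuration Keycloak is running with.
kubectl exec <keycloak-pod> -- cat /opt/keycloak/conf/cache-ispn.xml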
3. How a 10-Second Freeze Becomes a Split-Brain
Let’s walk through a real failure scenario.
Step 1: Normal Operation
- Nodes A, B, and C are running Keycloak
- Node A is the cluster coordinator
- Node A holds leadership-related locks
- Sessions and cache ownership are stable
Step 2: Node A Freezes
- Azure pauses Node A’s CPU for ~10 seconds
- No heartbeats are sent
- No messages are received
To the rest of the cluster:
“Node A is dead.”
Step 3: Re-election Happens
- Nodes B and C trigger a new election
- Node B becomes the new coordinator
- Node B starts background jobs
- Node B writes new lock state to the database
So far, everything looks healthy.
Step 4: Node A Wakes Up
- CPU resumes
- Internal clock jumps forward
- Node A never realized it stopped
From Node A’s perspective:
“I’m still the leader.”
Now you have two leaders.
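You can see the disagreement in the logs. Each member logs the cluster view it believes in, and if the latest views differ between pods, you are looking at a split brain. A rough sketch (ISPN000094 is Infinispan's "Received new cluster view" message code, though exact wording varies by version, and the label selector is an assumption):
# Print the most recent cluster view each Keycloak pod has seen.
for p in $(kubectl get pods -l app.kubernetes.io/name=keycloak -o name); do
  echo "== $p =="
  kubectl logs "$p" | grep "ISPN000094" | tail -n 1
done
If the views do not match, two sets of nodes each believe they form the cluster.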
4. The Fallout: What Breaks in Production
This is where things get ugly — and unpredictable.
Common Symptoms
- Users are randomly logged out
- Sessions vanish or reappear
- Token refreshes fail
- Admin console becomes unstable
- Database logs fill with lock errors
Typical Log Messages
ISPN000040: Two masters announced
Could not acquire change log lock
Failed to update session state
Why This Happens
- Two nodes attempt to:
  - Expire the same sessions
  - Update the same rows
  - Own the same cache segments
- Infinispan tries to reconcile conflicting ownership
- The database becomes the last line of defense — and starts rejecting writes
This is classic distributed split-brain behavior, and identity systems suffer the most because state correctness matters more than availability.
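The "change log lock" messages come from Liquibase, which Keycloak uses for schema migrations and which keeps its lock in a dedicated table. If you suspect a stale lock after a split brain, the table itself tells you who holds it. A sketch assuming a PostgreSQL backend, a database named keycloak, and a psql client available in the database pod (all assumptions; adapt to your setup):
# Show the Liquibase lock row: LOCKED=true plus a stale LOCKEDBY points at the old leader.
kubectl exec <postgres-pod> -- psql -U keycloak -d keycloak -c "SELECT id, locked, lockgranted, lockedby FROM databasechangeloglock;"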
5. “The Node Is Gone” — Why That’s Actually Good News
If you ran:
kubectl get nodes
…and the node no longer exists, congratulations — AKS already saved you.
Azure Kubernetes Service includes Auto-Repair:
- If a node stays NotReady or unresponsive beyond a threshold
- Azure assumes hardware failure
- The VM is deleted
- A fresh node is provisioned
Why This Matters
Deleting the node:
- Permanently kills the “old brain”
- Prevents it from waking up later
- Guarantees only one leader remains
In other words:
The most dangerous node is a frozen one that comes back.
A deleted node is safe.
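You can confirm the repair from the cluster's side by reviewing node events; the exact event reasons vary by Kubernetes version and cloud provider:
# List recent node lifecycle events (registration, removal, NotReady transitions).
kubectl get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp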
6. The SRE Playbook: What To Do When This Happens
When you get a freeze or maintenance alert involving Keycloak, act calmly and methodically.
Step 1: Find Where Keycloak Is Running Now
kubectl get pods -A -l app.kubernetes.io/name=keycloak -o wide
Confirm:
- All pods are running
- They are on different nodes
- No pod is stuck in Terminating
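A quick way to verify the spread, using the same label selector as above, is to count pods per node; every count should be 1:
# One line per pod, grouped by node; any count above 1 means co-located replicas.
kubectl get pods -A -l app.kubernetes.io/name=keycloak -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c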
Step 2: Inspect Logs for Split-Brain Residue
kubectl logs <pod-name> | grep -E "ISPN|Lock|database"
Pay special attention to:
- Repeated lock acquisition failures
- “Two masters” messages
- Continuous session rebalancing
A few messages during recovery are normal.
Persistent errors are not.
Step 3: The Safe Reset (When Things Go Sideways)
If login failures spike or behavior becomes erratic, don’t try to outsmart a broken cluster.
Do a clean reset:
kubectl scale deployment keycloak --replicas=0
Wait until:
- All pods are terminated
- Database locks are released
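Rather than eyeballing the pod list, you can block until every replica is actually gone (the label selector is an assumption; match it to your deployment):
# Returns once all matching pods have been deleted, or fails after the timeout.
kubectl wait --for=delete pod -l app.kubernetes.io/name=keycloak --timeout=120s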
Then bring it back:
kubectl scale deployment keycloak --replicas=3
This:
- Clears Infinispan state
- Forces a clean leader election
- Restores a single cluster “brain”
Yes, it’s disruptive — but far less damaging than silent identity corruption.
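Before declaring victory, confirm the rollout finished, then rerun the per-pod cluster-view check from section 3; every replica should report the same single view:
# Blocks until the new replicas are ready, or fails after the timeout.
kubectl rollout status deployment/keycloak --timeout=180s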
7. How to Reduce the Risk Going Forward
While freezes can’t be eliminated entirely, you can limit the blast radius:
- Run at least 3 Keycloak replicas
- Use pod anti-affinity to spread replicas across nodes (see the sketch after this list)
- Tune JGroups and Infinispan timeouts carefully
- Monitor:
  - Coordinator changes
  - Lock acquisition latency
  - Session eviction rates
- Prefer node replacement over resurrection
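For the anti-affinity item, here is a minimal sketch as a JSON merge patch. It assumes your Deployment is named keycloak and your pods carry the app.kubernetes.io/name=keycloak label; with a Helm chart or operator you would set the same stanza through its values instead:
# Require one Keycloak pod per node; the scheduler will refuse to co-locate replicas.
kubectl patch deployment keycloak --type merge -p '
{
  "spec": {
    "template": {
      "spec": {
        "affinity": {
          "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
              {
                "labelSelector": {
                  "matchLabels": { "app.kubernetes.io/name": "keycloak" }
                },
                "topologyKey": "kubernetes.io/hostname"
              }
            ]
          }
        }
      }
    }
  }
}'
The required variant fails scheduling rather than stacking replicas on one node; if your node pool is small, the preferred variant is the softer choice.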
Final Thoughts
Infrastructure will always be transient:
- Hardware ages
- Hosts get patched
- VMs migrate
- Nodes disappear
The goal isn’t to prevent change — it’s to design identity systems that survive it.
The Key Insight
A frozen node is more dangerous than a dead one.
If the platform deletes it, that’s not failure — that’s protection.
When your Keycloak cluster has one healthy coordinator and a clean cache state, authentication becomes boring again — exactly how it should be.
