Why Your Keycloak Nodes “Freeze” — and How to Save Your Identity Cluster

Kubernetes is often described as self-healing.
But if you run Keycloak at scale, you’ve probably learned the hard way that “self-healing” doesn’t always mean identity-safe.

One short infrastructure event — sometimes just a few seconds — can cascade into random logouts, stuck sessions, database locks, or even a full authentication outage.

This post explains why Keycloak nodes appear to “freeze,” what actually breaks under the hood, and how to respond like an SRE instead of firefighting in panic.


A Familiar Alert, a Missing Node

You get an alert:

“Potential Split-Brain Risk detected on node aks-worker-9u”

You immediately check the cluster:

kubectl get nodes

The node is gone.
No NotReady.
No SchedulingDisabled.
No trace.

The Keycloak pod that was running there? Also gone.

At first glance, this feels terrifying — did Kubernetes just eat my node?

Not quite.

This is not a Kubernetes bug.
And it’s not a Keycloak bug either.

It’s the interaction between cloud infrastructure maintenance and distributed identity coordination.


1. The “Freeze”: What Is Actually Happening?

Cloud providers like Azure must regularly maintain physical hosts. Your VMs don’t live on magic — they live on real hardware.

Azure typically handles this in one of two ways:

Option 1: Planned Upgrade (Reboot)

  • The VM is stopped
  • The OS is patched
  • The VM reboots
  • Kubernetes sees a clean node restart

This is noisy but predictable.

Option 2: Live Migration (“The Freeze”)

  • Azure migrates the VM to a new physical host
  • To copy memory safely, the VM’s CPU is paused
  • The pause typically lasts a few seconds (sometimes longer)

💥 This is the dangerous one

From the outside:

  • The node stops responding
  • Heartbeats disappear
  • Network traffic pauses

From inside the VM:

  • Time effectively jumps forward
  • The process never “crashed”
  • Keycloak thinks it was running continuously

This mismatch is the root of the problem.


2. Why Keycloak Is Especially Sensitive

Keycloak is not a stateless web app.

Under the hood, it relies on:

  • Infinispan for distributed caches (sessions, tokens, login state)
  • JGroups for cluster membership and messaging
  • Leader election for background jobs
  • Database locks to prevent duplicate work

This design is powerful — but it assumes reasonable clock and heartbeat consistency.

A freeze violates that assumption.
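
You can watch this membership machinery directly in the logs. A minimal check, assuming kubectl access and Keycloak's default logging (the exact Infinispan log codes vary by version):

kubectl logs <keycloak-pod> | grep -i "cluster view"

Each "cluster view" line lists the members Infinispan currently believes are in the cluster, which is useful context for the failure scenario below.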


3. How a 10-Second Freeze Becomes a Split-Brain

Let’s walk through a real failure scenario.

Step 1: Normal Operation

  • Nodes A, B, and C are running Keycloak
  • Node A is the cluster coordinator
  • Node A holds leadership-related locks
  • Sessions and cache ownership are stable

Step 2: Node A Freezes

  • Azure pauses Node A’s CPU for ~10 seconds
  • No heartbeats are sent
  • No messages are received

To the rest of the cluster:

“Node A is dead.”

Step 3: Re-election Happens

  • Nodes B and C trigger a new election
  • Node B becomes the new coordinator
  • Node B starts background jobs
  • Node B writes new lock state to the database

So far, everything looks healthy.

Step 4: Node A Wakes Up

  • CPU resumes
  • Internal clock jumps forward
  • Node A never realized it stopped

From Node A’s perspective:

“I’m still the leader.”

Now you have two leaders.


4. The Fallout: What Breaks in Production

This is where things get ugly — and unpredictable.

Common Symptoms

  • Users are randomly logged out
  • Sessions vanish or reappear
  • Token refreshes fail
  • Admin console becomes unstable
  • Database logs fill with lock errors

Typical Log Messages

ISPN000040: Two masters announced
Could not acquire change log lock
Failed to update session state

Why This Happens

  • Two nodes attempt to:
    • Expire the same sessions
    • Update the same rows
    • Own the same cache segments
  • Infinispan tries to reconcile conflicting ownership
  • The database becomes the last line of defense — and starts rejecting writes

This is classic distributed split-brain behavior, and identity systems suffer the most because state correctness matters more than availability.


5. “The Node Is Gone” — Why That’s Actually Good News

If you ran:

kubectl get nodes

…and the node no longer exists, congratulations — AKS already saved you.

Azure Kubernetes Service includes Auto-Repair:

  • If a node stays NotReady or unresponsive beyond a threshold
  • Azure assumes hardware failure
  • The VM is deleted
  • A fresh node is provisioned

Why This Matters

Deleting the node:

  • Permanently kills the “old brain”
  • Prevents it from waking up later
  • Guarantees only one leader remains

In other words:

The most dangerous node is a frozen one that comes back.
A deleted node is safe.
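
If you want to confirm that a replacement really happened, two quick checks help (a sketch, assuming kubectl access to the cluster):

kubectl get nodes --sort-by=.metadata.creationTimestamp
kubectl get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp

A freshly provisioned node shows a much younger age than its peers, and the node-level events record the old VM being removed.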


6. The SRE Playbook: What To Do When This Happens

When you get a freeze or maintenance alert involving Keycloak, act calmly and methodically.


Step 1: Find Where Keycloak Is Running Now

kubectl get pods -A -l app.kubernetes.io/name=keycloak -o wide

Confirm:

  • All pods are running
  • They are on different nodes
  • No pod is stuck in Terminating
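
One way to check all three at a glance (a sketch, assuming the same label selector as above):

kubectl get pods -A -l app.kubernetes.io/name=keycloak -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName,STATUS:.status.phase

If two replicas share a node, note it and tighten your anti-affinity rules after the incident (see section 7).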

Step 2: Inspect Logs for Split-Brain Residue

kubectl logs <pod-name> | grep -iE "ISPN|lock|database"

Pay special attention to:

  • Repeated lock acquisition failures
  • “Two masters” messages
  • Continuous session rebalancing

A few messages during recovery are normal.
Persistent errors are not.
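
To sweep every replica at once instead of one pod at a time, a small loop works (a sketch, assuming a bash shell, the current namespace, and the same label selector):

for pod in $(kubectl get pods -l app.kubernetes.io/name=keycloak -o name); do
  echo "=== $pod ==="
  kubectl logs "$pod" --since=1h | grep -icE "ispn|lock"
done

A match count that keeps climbing on one pod is the strongest hint that it never fully rejoined the cluster.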


Step 3: The Safe Reset (When Things Go Sideways)

If login failures spike or behavior becomes erratic, don’t try to outsmart a broken cluster.

Do a clean reset:

kubectl scale deployment keycloak --replicas=0

Wait until:

  • All pods are terminated
  • Database locks are released
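
Rather than eyeballing the pod list, you can block until every replica is gone (a sketch, assuming the label used earlier; the lock release itself is something to verify on the database side if errors persist):

kubectl wait --for=delete pod -l app.kubernetes.io/name=keycloak --timeout=180s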

Then bring it back:

kubectl scale deployment keycloak --replicas=3

This:

  • Clears Infinispan state
  • Forces a clean leader election
  • Restores a single cluster “brain”

Yes, it’s disruptive — but far less damaging than silent identity corruption.


7. How to Reduce the Risk Going Forward

While freezes can’t be eliminated entirely, you can limit the blast radius:

  • Run at least 3 Keycloak replicas
  • Use pod anti-affinity to spread replicas across nodes (see the sketch after this list)
  • Tune JGroups and Infinispan timeouts carefully
  • Monitor:
    • Coordinator changes
    • Lock acquisition latency
    • Session eviction rates
  • Prefer node replacement over resurrection
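
Here is a minimal sketch of the anti-affinity piece, assuming a plain Deployment carrying the app.kubernetes.io/name=keycloak label used earlier in this post; adapt it to your Helm values or operator CR:

# Fragment of the Keycloak Deployment's pod template spec
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          # Hard rule: never schedule two Keycloak replicas on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/name: keycloak
              topologyKey: kubernetes.io/hostname

If your node pool is small and a hard rule could block scheduling during upgrades, the preferred (weighted) variant or a topologySpreadConstraint is the softer alternative.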

Final Thoughts

Infrastructure will always be transient:

  • Hardware ages
  • Hosts get patched
  • VMs migrate
  • Nodes disappear

The goal isn’t to prevent change — it’s to design identity systems that survive it.

The Key Insight

A frozen node is more dangerous than a dead one.
If the platform deletes it, that’s not failure — that’s protection.

When your Keycloak cluster has one healthy coordinator and a clean cache state, authentication becomes boring again — exactly how it should be.
