Kubernetes is often described as self-healing.
But if you run Keycloak at scale, you’ve probably learned the hard way that “self-healing” doesn’t always mean identity-safe.
One short infrastructure event — sometimes just a few seconds — can cascade into random logouts, stuck sessions, database locks, or even a full authentication outage.
This post explains why Keycloak nodes appear to “freeze,” what actually breaks under the hood, and how to respond like an SRE instead of firefighting in panic.
A Familiar Alert, a Missing Node
You get an alert:
“Potential Split-Brain Risk detected on node aks-worker-9u”
You immediately check the cluster:
kubectl get nodes
The node is gone.
No NotReady.
No SchedulingDisabled.
No trace.
The Keycloak pod that was running there? Also gone.
At first glance, this feels terrifying — did Kubernetes just eat my node?
Not quite.
This is not a Kubernetes bug.
And it’s not a Keycloak bug either.
It’s the interaction between cloud infrastructure maintenance and distributed identity coordination.
1. The “Freeze”: What Is Actually Happening?
Cloud providers like Azure must regularly maintain physical hosts. Your VMs don’t live on magic — they live on real hardware.
Azure typically handles this in one of two ways:
Option 1: Planned Upgrade (Reboot)
- The VM is stopped
- The OS is patched
- The VM reboots
- Kubernetes sees a clean node restart
This is noisy but predictable.
Option 2: Live Migration (“The Freeze”)
- Azure migrates the VM to a new physical host
- To copy memory safely, the VM’s CPU is paused
- The pause typically lasts a few seconds (sometimes longer)
💥 This is the dangerous one
From the outside:
- The node stops responding
- Heartbeats disappear
- Network traffic pauses
From inside the VM:
- Time effectively jumps forward
- The process never “crashed”
- Keycloak thinks it was running continuously
This mismatch is the root of the problem.
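Azure announces much of this maintenance in advance through its Instance Metadata Service. If you want early warning, you can poll the Scheduled Events endpoint from inside any node; a Freeze event type is exactly the live-migration pause described above. A minimal check, run from the VM (or a pod on it):
# Ask Azure what maintenance is planned for this VM.
# EventType "Freeze" means the CPU will be paused for a live migration.
curl -s -H "Metadata: true" "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"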
2. Why Keycloak Is Especially Sensitive
Keycloak is not a stateless web app.
Under the hood, it relies on:
- Infinispan for distributed caches (sessions, tokens, login state)
- JGroups for cluster membership and messaging
- Leader election for background jobs
- Database locks to prevent duplicate work
This design is powerful — but it assumes reasonable clock and heartbeat consistency.
A freeze violates that assumption.
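You can inspect this machinery directly. In the stock Keycloak (Quarkus) container image, the Infinispan cache and JGroups stack definitions live in a single XML file; the path below is an assumption based on the default image layout, so adjust it for custom builds:
# Dump the distributed cache and JGroups configuration Keycloak is running with.
kubectl exec <keycloak-pod> -- cat /opt/keycloak/conf/cache-ispn.xml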
3. How a 10-Second Freeze Becomes a Split-Brain
Let’s walk through a real failure scenario.
Step 1: Normal Operation
- Nodes A, B, and C are running Keycloak
- Node A is the cluster coordinator
- Node A holds leadership-related locks
- Sessions and cache ownership are stable
Step 2: Node A Freezes
- Azure pauses Node A’s CPU for ~10 seconds
- No heartbeats are sent
- No messages are received
To the rest of the cluster:
“Node A is dead.”
Step 3: Re-election Happens
- Nodes B and C trigger a new election
- Node B becomes the new coordinator
- Node B starts background jobs
- Node B writes new lock state to the database
So far, everything looks healthy.
Step 4: Node A Wakes Up
- CPU resumes
- Internal clock jumps forward
- Node A never realized it stopped
From Node A’s perspective:
“I’m still the leader.”
Now you have two leaders.
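You can see the disagreement in the logs. Each member logs the cluster view it believes in, and if the latest views differ between pods, you are looking at a split brain. A rough sketch (ISPN000094 is Infinispan's "Received new cluster view" message code, though exact wording varies by version, and the label selector is an assumption):
# Print the most recent cluster view each Keycloak pod has seen.
for p in $(kubectl get pods -l app.kubernetes.io/name=keycloak -o name); do
  echo "== $p =="
  kubectl logs "$p" | grep "ISPN000094" | tail -n 1
done
If the views do not match, two sets of nodes each believe they form the cluster.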
4. The Fallout: What Breaks in Production
This is where things get ugly — and unpredictable.
Common Symptoms
- Users are randomly logged out
- Sessions vanish or reappear
- Token refreshes fail
- Admin console becomes unstable
- Database logs fill with lock errors
Typical Log Messages
ISPN000040: Two masters announced
Could not acquire change log lock
Failed to update session state
Why This Happens
- Two nodes attempt to:
  - Expire the same sessions
  - Update the same rows
  - Own the same cache segments
- Infinispan tries to reconcile conflicting ownership
- The database becomes the last line of defense — and starts rejecting writes
This is classic distributed split-brain behavior, and identity systems suffer the most because state correctness matters more than availability.
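The "change log lock" messages come from Liquibase, which Keycloak uses for schema migrations and which keeps its lock in a dedicated table. If you suspect a stale lock after a split brain, the table itself tells you who holds it. A sketch assuming a PostgreSQL backend, a database named keycloak, and a psql client available in the database pod (all assumptions; adapt to your setup):
# Show the Liquibase lock row: LOCKED=true plus a stale LOCKEDBY points at the old leader.
kubectl exec <postgres-pod> -- psql -U keycloak -d keycloak -c "SELECT id, locked, lockgranted, lockedby FROM databasechangeloglock;"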
5. “The Node Is Gone” — Why That’s Actually Good News
If you ran:
kubectl get nodes
…and the node no longer exists, congratulations — AKS already saved you.
Azure Kubernetes Service includes Auto-Repair:
- If a node stays NotReady or unresponsive beyond a threshold
- Azure assumes hardware failure
- The VM is deleted
- A fresh node is provisioned
Why This Matters
Deleting the node:
- Permanently kills the “old brain”
- Prevents it from waking up later
- Guarantees only one leader remains
In other words:
The most dangerous node is a frozen one that comes back.
A deleted node is safe.
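You can confirm the repair from the cluster's side by reviewing node events; the exact event reasons vary by Kubernetes version and cloud provider:
# List recent node lifecycle events (registration, removal, NotReady transitions).
kubectl get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp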
6. The SRE Playbook: What To Do When This Happens
When you get a freeze or maintenance alert involving Keycloak, act calmly and methodically.
Step 1: Find Where Keycloak Is Running Now
kubectl get pods -A -l app.kubernetes.io/name=keycloak -o wide
Confirm:
- All pods are running
- They are on different nodes
- No pod is stuck in Terminating
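A quick way to verify the spread, using the same label selector as above, is to count pods per node; every count should be 1:
# One line per pod, grouped by node; any count above 1 means co-located replicas.
kubectl get pods -A -l app.kubernetes.io/name=keycloak -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort | uniq -c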
Step 2: Inspect Logs for Split-Brain Residue
kubectl logs <pod-name> | grep -E "ISPN|Lock|database"
Pay special attention to:
- Repeated lock acquisition failures
- “Two masters” messages
- Continuous session rebalancing
A few messages during recovery are normal.
Persistent errors are not.
Step 3: The Safe Reset (When Things Go Sideways)
If login failures spike or behavior becomes erratic, don’t try to outsmart a broken cluster.
Do a clean reset:
kubectl scale deployment keycloak --replicas=0
Wait until:
- All pods are terminated
- Database locks are released
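Rather than eyeballing the pod list, you can block until every replica is actually gone (the label selector is an assumption; match it to your deployment):
# Returns once all matching pods have been deleted, or fails after the timeout.
kubectl wait --for=delete pod -l app.kubernetes.io/name=keycloak --timeout=120s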
Then bring it back:
kubectl scale deployment keycloak --replicas=3
This:
- Clears Infinispan state
- Forces a clean leader election
- Restores a single cluster “brain”
Yes, it’s disruptive — but far less damaging than silent identity corruption.
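Before declaring victory, confirm the rollout finished, then rerun the per-pod cluster-view check from section 3; every replica should report the same single view:
# Blocks until the new replicas are ready, or fails after the timeout.
kubectl rollout status deployment/keycloak --timeout=180s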
7. How to Reduce the Risk Going Forward
While freezes can’t be eliminated entirely, you can limit the blast radius:
- Run at least 3 Keycloak replicas
- Use pod anti-affinity to spread replicas across nodes (see the sketch after this list)
- Tune JGroups and Infinispan timeouts carefully
- Monitor:
  - Coordinator changes
  - Lock acquisition latency
  - Session eviction rates
- Prefer node replacement over resurrection
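For the anti-affinity item, here is a minimal sketch as a JSON merge patch. It assumes your Deployment is named keycloak and your pods carry the app.kubernetes.io/name=keycloak label; with a Helm chart or operator you would set the same stanza through its values instead:
# Require one Keycloak pod per node; the scheduler will refuse to co-locate replicas.
kubectl patch deployment keycloak --type merge -p '
{
  "spec": {
    "template": {
      "spec": {
        "affinity": {
          "podAntiAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": [
              {
                "labelSelector": {
                  "matchLabels": { "app.kubernetes.io/name": "keycloak" }
                },
                "topologyKey": "kubernetes.io/hostname"
              }
            ]
          }
        }
      }
    }
  }
}'
The required variant fails scheduling rather than stacking replicas on one node; if your node pool is small, the preferred variant is the softer choice.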
Final Thoughts
Infrastructure will always be transient:
- Hardware ages
- Hosts get patched
- VMs migrate
- Nodes disappear
The goal isn’t to prevent change — it’s to design identity systems that survive it.
The Key Insight
A frozen node is more dangerous than a dead one.
If the platform deletes it, that’s not failure — that’s protection.
When your Keycloak cluster has one healthy coordinator and a clean cache state, authentication becomes boring again — exactly how it should be.
