
Ghost Clusters: The Kubernetes Spending Hidden Behind "It's Used"


Where this term came from

I didn't come up with the phrase "ghost cluster." A reader did.

A couple of weeks ago I posted on LinkedIn arguing that observability and FinOps are not the same tool — observability tells you why the app broke, FinOps tells you why the bill broke. Ali Qureshi (Solution Architect, FinOps author) replied with something that's stuck with me ever since:

"Observability answers why the app broke. FinOps answers why the bill broke. Conflating them means you fix the query and still pay for the ghost cluster nobody owns."

I called it "orphan spend." Ali called it a ghost cluster. He was right and I was wrong. As he put it in a follow-up reply: "The ghost cluster sticks because it has a face. Orphan spend is accounting language. Nobody feels urgency about a line item. A cluster with no owner running for 6 months is a character."

That reframe is the whole thing. The framing is the urgency. Nobody walks into a standup and says "we have orphan spend." But "there's a ghost cluster burning $12K/month and nobody knows who launched it" — that gets a reaction in five seconds.

So credit where it's due. The rest of this post is what we've learned digging into the pattern since Ali named it.

Look at your fleet right now

Pull up your Kubernetes inventory. Count the clusters. Now answer this: how many of them has anyone deployed to in the last 30 days?

You probably don't know. And that's exactly the problem.

Every Kubernetes fleet I've looked at — from tight 5-cluster shops to sprawling 80-cluster enterprise messes — has the same pattern: a stubborn long tail of clusters that exist, cost money every hour, and aren't doing meaningful work. They're not flagged as "idle" in any obvious way. The dashboard shows them green. kubectl get nodes works. Pods are scheduled. Everything looks alive.

But nobody is using them. They're ghosts.
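One rough way to answer the "last 30 days" question yourself: every Deployment rollout that changes the pod template creates a new ReplicaSet, so the newest ReplicaSet creation time is a usable proxy for "someone deployed here." The sketch below assumes you have saved the output of `kubectl get replicasets --all-namespaces -o json` and parsed it into a dict. The function name and the choice to skip `kube-system` are my own assumptions, not a standard.

```python
from datetime import datetime, timezone
from typing import Optional


def days_since_last_rollout(replicasets: dict, now: datetime,
                            ignore_namespaces=("kube-system",)) -> Optional[float]:
    """Days since the newest ReplicaSet was created, or None if none were found.

    `replicasets` is the parsed JSON from
    `kubectl get replicasets --all-namespaces -o json`.
    System namespaces are skipped (an assumption) so that managed add-on
    updates do not make a ghost look active.
    """
    newest = None
    for item in replicasets.get("items", []):
        meta = item["metadata"]
        if meta.get("namespace") in ignore_namespaces:
            continue
        # Kubernetes timestamps look like "2025-02-01T00:00:00Z"
        created = datetime.strptime(
            meta["creationTimestamp"], "%Y-%m-%dT%H:%M:%SZ"
        ).replace(tzinfo=timezone.utc)
        if newest is None or created > newest:
            newest = created
    if newest is None:
        return None
    return (now - newest).total_seconds() / 86400
```

A result well above 30, or `None`, does not prove the cluster is a ghost, but it tells you which clusters to ask questions about first.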

What a ghost cluster actually is

Let me define this in human terms because the FinOps literature usually doesn't.

A ghost cluster is a Kubernetes cluster that exists, costs money every hour, and isn't connected to any active product purpose. All three pieces of that definition matter: it exists, it bills every hour, and no active product depends on it.

This is different from the textbook "idle cluster." Idle means zero CPU, zero pods. Ghost means the cluster looks just busy enough to pass the smell test of a casual look at the dashboard — and that's exactly why it's still there.

The cluster isn't dead. It's haunted.

Why ghost clusters are particularly nasty

1. Three cost drivers, not one

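An EC2 instance is one cost: instance-hours. A Kubernetes cluster is at least three: the control plane fee, the compute nodes, and the supporting resources around them, such as load balancers and storage.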

Even an "empty" cluster — one node, no real workload — is rarely under $120/month all-in. A medium one is $300–600. A "we forgot about this" production-shaped one with 4–6 nodes can hit $1,500–3,000/month doing nothing of value.

2. They look healthy on dashboards

This is the tricky part. A ghost cluster passes most automated checks: the dashboard shows it green, kubectl get nodes works, and pods are scheduled.

None of those signals answer the question that actually matters: is this cluster serving traffic to a real user, or is it just keeping itself company? Nobody is monitoring "is this cluster providing business value." So it stays green forever.

3. Nobody owns deletion

This is the human reason ghosts persist. The team that created the cluster moved on to other projects. Or got reorganized. Or that one engineer who really knew what it was for left for a startup. The Slack channel is archived. The Confluence page hasn't been updated since 2024.

Now imagine being the new platform engineer. You see a cluster called edge-data-prod-2. You don't know what it does. Nobody you ask seems to know. Are you going to be the one who deletes it? What if it actually does something?

So it stays. Year after year. $300/month, forever.

The signals that say "this is a ghost"

Here's a checklist of objective tests you can run yourself, no special tooling needed. If a cluster fails most of these, it's probably a ghost:

One signal alone doesn't prove anything. Three or more together is a very strong tell.
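The "three or more together" rule is easy to encode. The sketch below is illustrative only: the four signals, the field names, and the thresholds (30 days, 5% CPU) are my assumptions for the example, not a definitive checklist, so swap in whichever tests you actually run.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClusterSignals:
    """Hypothetical per-cluster facts gathered by hand or from your own tooling."""
    name: str
    days_since_last_deploy: float
    has_known_owner: bool
    serves_real_traffic: bool
    avg_cpu_utilization: float  # fraction between 0.0 and 1.0


def ghost_signals(c: ClusterSignals, stale_days: float = 30,
                  low_cpu: float = 0.05) -> List[str]:
    """Return the names of the ghost signals this cluster trips."""
    tripped = []
    if c.days_since_last_deploy > stale_days:
        tripped.append("no recent deploys")
    if not c.has_known_owner:
        tripped.append("no known owner")
    if not c.serves_real_traffic:
        tripped.append("no real traffic")
    if c.avg_cpu_utilization < low_cpu:
        tripped.append("near-zero CPU")
    return tripped


def looks_like_ghost(c: ClusterSignals, min_signals: int = 3) -> bool:
    """One signal proves nothing; three or more together is the tell."""
    return len(ghost_signals(c)) >= min_signals
```

Returning the list of tripped signals, not just a boolean, matters: when you ask around about a cluster, "no deploys in 200 days, no owner, no traffic" is a much better opening line than "the script says it's a ghost."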

CP / Compute / Support — the three-part bill

Most cluster cost views show you one number: "$X this month." That's not enough. The right view splits the cost into the three structural drivers:

| Cluster size | Control plane | Compute | Support | Annual total |
| Tiny (2 × t3.medium nodes) | $72/mo (EKS) | ~$60/mo | ~$15/mo (one ALB) | ~$1,764/yr |
| Small (3 × m5.large) | $72/mo | ~$210/mo | ~$50/mo | ~$3,984/yr |
| Medium (4–6 m5.xlarge + storage) | $72/mo | ~$700/mo | ~$150/mo | ~$11,064/yr |

Now multiply by the typical fleet's 5–15 ghost clusters — the math gets uncomfortable fast. $15,000–50,000 a year in real money quietly disappearing into the long tail. Not enough per cluster to make a single ticket urgent. Enough across the fleet to fund a hire.
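If you want to rerun that arithmetic with your own numbers, here is a minimal sketch. The monthly figures are copied from the table above; the class and function names are just mine for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterBill:
    """Monthly cost of one cluster in USD, split into the three drivers."""
    control_plane: float
    compute: float
    support: float

    @property
    def monthly(self) -> float:
        return self.control_plane + self.compute + self.support

    @property
    def annual(self) -> float:
        return 12 * self.monthly


# Monthly figures from the table above.
TINY = ClusterBill(control_plane=72, compute=60, support=15)
SMALL = ClusterBill(control_plane=72, compute=210, support=50)
MEDIUM = ClusterBill(control_plane=72, compute=700, support=150)


def fleet_annual(bills) -> float:
    """Annual spend across a list of clusters."""
    return sum(b.annual for b in bills)


print(TINY.annual, SMALL.annual, MEDIUM.annual)  # 1764 3984 11064
print(fleet_annual([SMALL] * 5))                 # 19920
```

Five small ghosts alone come to roughly $20K a year, which is how a fleet lands in the range above without any single cluster looking alarming.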

Splitting the bill into CP / Compute / Support also changes the conversation. "Delete this cluster and save $300/month" is abstract. "$72 of that is the EKS fee, $200 is two t3.large nodes, $30 is the ALB — here are the three resources you'd remove" is actionable.

How to actually decommission a ghost cluster

Don't just hit delete. There's a sane workflow that protects you against the rare case where the ghost turns out to be load-bearing:

  1. Stop scheduling new workloads. Cordon all nodes (kubectl cordon <node>). Optionally taint with NoSchedule. Existing pods keep running, but nothing new can land.
  2. Wait 7–14 days and watch. Monitor for any anomaly in dependent services — missing data, broken cron jobs, alarms firing somewhere unexpected. If something genuinely depended on this cluster, you'll find out now, not after deletion.
  3. Snapshot persistent volumes if anything matters. Even if you're confident, take an EBS or managed-disk snapshot of the volume behind any PVC that is still bound. Snapshots are cheap. Recovering data from a deleted cluster is not.
  4. Tear down via the original IaC. Whatever provisioned the cluster (Terraform, CloudFormation, Pulumi, ArgoCD, Crossplane), use that to destroy it. Do not delete it out of band, whether from the cloud console or with ad hoc kubectl delete commands. You'll leak load balancers and disks, the IaC state will drift, and the next apply will try to recreate the cluster.
  5. Hold a 15-minute post-mortem. Who created this cluster, what was it for, what changed, what did we save, what's the policy update? Don't make it a process festival. Just write it down so the next ghost gets caught earlier.
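Step 1 is the easiest part to script. A minimal sketch, assuming you have the parsed output of `kubectl get nodes -o json` and a kubeconfig context that points at the ghost cluster: it only builds the `kubectl cordon` commands and leaves running them to you, so you can review the list first. The function name and the `context` parameter are assumptions for the example.

```python
from typing import List


def cordon_commands(nodes: dict, context: str) -> List[List[str]]:
    """Build one `kubectl cordon` command per node that is still schedulable.

    `nodes` is the parsed JSON from `kubectl get nodes -o json`.
    `context` is the kubeconfig context name for the cluster being retired.
    Nothing is executed here; each command is an argv list suitable for
    subprocess.run once you have reviewed it.
    """
    commands = []
    for item in nodes.get("items", []):
        name = item["metadata"]["name"]
        # spec.unschedulable is true on nodes that are already cordoned
        if item.get("spec", {}).get("unschedulable"):
            continue
        commands.append(["kubectl", "--context", context, "cordon", name])
    return commands
```

Passing the context explicitly on every command is deliberate. The worst outcome of this workflow is cordoning the wrong cluster because your current context was not what you thought it was.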

The whole workflow is maybe 2–3 hours of actual engineering time per cluster, spread over two weeks of waiting. The savings start the moment the IaC destroy completes.

The harder organizational fix

Here's the part nobody likes to hear: even with a tool that flags every ghost cluster in your fleet with the exact dollar amount on screen, deletion still requires someone with authority willing to make the call. And most companies don't have that role clearly assigned.

The platform team doesn't want to delete "someone else's" cluster. The cost-center owner is two reorgs removed from the original use case and doesn't feel qualified to say it's safe. The FinOps lead can flag it but can't unilaterally schedule a deletion. So the JIRA gets created, assigned to "the team," and ages.

What's actually missing is a default. Something like:

"This cluster is officially abandoned. Deleting in 14 days unless someone objects with a real use case."

That sentence shifts the burden of action. Right now in most orgs, the burden is on the person who wants to delete — they have to go ask permission. Flip it: the burden is on whoever wants to keep it. If nobody objects in 14 days, it's gone. Silence equals consent to delete.
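The rule itself is small enough to write down as code. This is a sketch of the policy logic only, assuming a 14-day grace period and a plain list of objections. How objections get recorded (a ticket comment, a Slack reply) is up to your org, and the function names are hypothetical.

```python
from datetime import date, timedelta
from typing import Sequence


def deletion_date(flagged_on: date, grace_days: int = 14) -> date:
    """The day the cluster becomes deletable if nobody objects."""
    return flagged_on + timedelta(days=grace_days)


def may_delete(flagged_on: date, today: date,
               objections: Sequence[str], grace_days: int = 14) -> bool:
    """Silence equals consent: deletable once the grace period has passed
    and no objection with a real use case has been recorded."""
    if objections:
        return False
    return today >= deletion_date(flagged_on, grace_days)
```

Note what the function does not ask for: an approval. The only input that can stop the deletion is an objection, which is the whole point of flipping the burden.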

A tool can do the flagging, the cost breakdown, the JIRA creation, the Slack announcement. The org has to provide the "silence equals deletion" backstop. Without it, you'll keep paying for ghosts forever no matter how good your tooling is.

CLARITY automatically flags clusters with no recent deployment activity and surfaces them with their three-part cost breakdown so you know exactly what you'll save by killing each one. The deletion itself, and the policy that authorizes it — that part still has to live in your org.

Find your ghost clusters before they cost another year

CLARITY breaks down every Kubernetes cluster into Control Plane / Compute / Support, flags ones with no recent activity, and gives you the dollar number you need to make the deletion case. Works across EKS, AKS, and GKE.


Bottom line

Ghost clusters are not a tooling failure. They're an organizational drift problem with a tooling-shaped symptom. People reorg, projects pivot, ownership evaporates, and the clusters keep billing because nobody felt empowered to be the one who pulled the plug.

The fix has two halves and you need both:

  1. Visibility — a clear list of clusters with the three-part cost breakdown, the activity signals, and the dollar amount of what you'd save. Without numbers, nobody acts.
  2. A default-delete policy: "deleted in 14 days unless someone objects." Without policy, even the clearest list still gets ignored.

Get both right and the long tail of your Kubernetes fleet stops being a slow leak. Get only one and you'll be writing this same blog post next quarter under a different title.

For a deeper view of the K8s allocation model itself, read Kubernetes Cost Allocation: A Practical Guide for EKS, AKS, and GKE. For the broader untagged-resource problem ghosts are a flavor of, see Orphan Spend: The Hidden 79% of Your Cloud Bill Nobody Owns. For the live-API vs detailed-export debate that determines how fast you can spot ghosts in the first place, see Why Your FinOps Tool Tells You to Come Back in 48 Hours.
