Friends & Family Preview

Stop babysitting your Kubernetes clusters.

ClusterMind deploys an AI agent inside your cluster that proactively detects issues, diagnoses root causes, and sends fix PRs — all surfaced in Slack.


ClusterMind App #ops-alerts — 6:00 AM

Diagnostic Run — galena2 cluster

Critical

Node iron1 NotReady for 17h 23m

StatefulSet pod n8n-postgresql-0 stuck on dead node

High

Longhorn instance-manager-e error state

CPU request 12,000m on galena3 blocking attachments

Medium

Disk iron3 88% full

82GB / 99GB — cleanup recommended

Born from Production

Every check has a scar behind it.

Our diagnostic engine wasn't designed in a vacuum. Each of our 20+ default checks was born from a real production incident on real clusters — and they're fully customizable.

CRITICAL

The 17-Hour Blind Spot

A node went NotReady and nobody noticed for 17 hours. The PostgreSQL StatefulSet pod was stuck on the dead node the entire time: unlike a Deployment, a StatefulSet won't create a replacement pod until the old one is confirmed deleted, and an unreachable node never confirms it.

Undetected for 17 hours · ClusterMind catches in < 6 min
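The logic behind this check is simple to state: find nodes whose Ready condition is not True, then find StatefulSet-owned pods still bound to them. The following is a minimal sketch of that idea, not ClusterMind's actual code. It assumes plain dictionaries shaped like the Kubernetes Node and Pod JSON (status.conditions, spec.nodeName, metadata.ownerReferences), as returned by kubectl get nodes -o json and kubectl get pods -o json; the function names and sample objects are hypothetical.

```python
from datetime import datetime, timezone


def not_ready_nodes(nodes, now):
    """Map each NotReady node's name to how long it has been NotReady."""
    down = {}
    for node in nodes:
        for cond in node["status"]["conditions"]:
            # "False" or "Unknown" on the Ready condition both mean NotReady.
            if cond["type"] == "Ready" and cond["status"] != "True":
                since = datetime.fromisoformat(
                    cond["lastTransitionTime"].replace("Z", "+00:00"))
                down[node["metadata"]["name"]] = now - since
    return down


def stuck_statefulset_pods(nodes, pods, now):
    """List (pod, node, downtime) for StatefulSet pods bound to NotReady nodes."""
    down = not_ready_nodes(nodes, now)
    stuck = []
    for pod in pods:
        owners = pod["metadata"].get("ownerReferences", [])
        node = pod["spec"].get("nodeName")
        if node in down and any(o["kind"] == "StatefulSet" for o in owners):
            stuck.append((pod["metadata"]["name"], node, down[node]))
    return stuck


# Hypothetical objects modeled on the incident above.
now = datetime(2025, 1, 2, 6, 0, tzinfo=timezone.utc)
nodes = [
    {"metadata": {"name": "iron1"},
     "status": {"conditions": [{"type": "Ready", "status": "Unknown",
                                "lastTransitionTime": "2025-01-01T12:37:00Z"}]}},
    {"metadata": {"name": "iron2"},
     "status": {"conditions": [{"type": "Ready", "status": "True",
                                "lastTransitionTime": "2024-12-01T00:00:00Z"}]}},
]
pods = [
    {"metadata": {"name": "n8n-postgresql-0",
                  "ownerReferences": [{"kind": "StatefulSet", "name": "n8n-postgresql"}]},
     "spec": {"nodeName": "iron1"}},
    # A Deployment's pod on the same node: its ReplicaSet creates a replacement,
    # so this check does not flag it.
    {"metadata": {"name": "n8n-7d9f-abcde",
                  "ownerReferences": [{"kind": "ReplicaSet", "name": "n8n-7d9f"}]},
     "spec": {"nodeName": "iron1"}},
]

for name, node, downtime in stuck_statefulset_pods(nodes, pods, now):
    # prints: n8n-postgresql-0 stuck on iron1, NotReady for 17:23:00
    print(f"{name} stuck on {node}, NotReady for {downtime}")
```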

HIGH

The Phantom CPU Crisis

Three nodes cordoned during an upgrade. Cluster showed 99% CPU request utilization — triggering panic. Actual usage? 16%. ClusterMind distinguishes requests from real usage.

Reported 99% CPU · Reality 16% CPU
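The distinction drawn here is between CPU requests (what the scheduler has reserved) and CPU usage (what processes actually consume). A minimal sketch of that comparison, under assumptions: per-node figures are passed in as Kubernetes CPU quantity strings, usage comes from a metrics source such as kubectl top nodes, and the function names and the 90% and 80% alarm thresholds are hypothetical. Only the "m" millicore suffix and decimal core counts are parsed.

```python
def to_millicores(quantity):
    """Convert a Kubernetes CPU quantity such as "250m", "2" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def cpu_percentages(allocatable, requested, used):
    """Return (requested %, used %) of total allocatable CPU across nodes."""
    total = sum(map(to_millicores, allocatable))
    req = sum(map(to_millicores, requested))
    use = sum(map(to_millicores, used))
    return 100 * req / total, 100 * use / total


def classify(request_pct, usage_pct, request_alarm=90, usage_alarm=80):
    """Separate real CPU saturation from request pressure (thresholds are hypothetical)."""
    if usage_pct >= usage_alarm:
        return "saturated"          # the CPUs really are busy
    if request_pct >= request_alarm:
        return "request-pressure"   # scheduling may stall, but nodes are mostly idle
    return "ok"


# Hypothetical numbers: two schedulable 8-core nodes left after cordoning.
req_pct, use_pct = cpu_percentages(
    allocatable=["8", "8"],
    requested=["7920m", "7920m"],   # sum of pod CPU requests per node
    used=["1200m", "1360m"],        # e.g. from kubectl top nodes
)
# prints: requested 99%, used 16% -> request-pressure
print(f"requested {req_pct:.0f}%, used {use_pct:.0f}% -> {classify(req_pct, use_pct)}")
```

Request pressure is still worth reporting, since it can keep new pods from scheduling; it just calls for a different response than a CPU-starved cluster.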

CRITICAL

Generation 991

A Longhorn volume entered an infinite attach/detach loop, reaching generation count 991. Each cycle risked data corruption. ClusterMind flags volumes with generation >50.

Volume generation 991 · Alert threshold > 50
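Once the volume objects are in hand, a threshold check like this takes only a few lines. The sketch below is illustrative only: it assumes the generation count in question is the metadata.generation field of Longhorn's Volume custom resources, read from kubectl get volumes.longhorn.io -n longhorn-system -o json. The function name and sample volumes are hypothetical.

```python
import json

GENERATION_ALERT_THRESHOLD = 50  # the "> 50" threshold quoted above


def runaway_volumes(volume_list_json, threshold=GENERATION_ALERT_THRESHOLD):
    """Return (volume, generation) pairs above the threshold, worst first."""
    flagged = []
    for vol in json.loads(volume_list_json)["items"]:
        generation = vol["metadata"].get("generation", 0)
        if generation > threshold:
            flagged.append((vol["metadata"]["name"], generation))
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)


# Hypothetical output of: kubectl get volumes.longhorn.io -n longhorn-system -o json
sample = json.dumps({"items": [
    {"metadata": {"name": "pvc-postgres", "generation": 991}},
    {"metadata": {"name": "pvc-redis", "generation": 7}},
    {"metadata": {"name": "pvc-minio", "generation": 64}},
]})
print(runaway_volumes(sample))  # [('pvc-postgres', 991), ('pvc-minio', 64)]
```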

HIGH

The 12-Core Ghost

A backup restore accidentally set a Longhorn instance-manager's CPU request to 12 cores on a single node, silently blocking all volume attachments. Generic monitoring missed it entirely.

CPU request 12,000m · Expected < 500m
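Generic monitoring tends to miss this because nothing is down; the signal is a resource request that is wildly out of line for one specific workload. Below is a minimal sketch of a per-workload request check. It assumes Longhorn instance-manager pods can be recognized by an "instance-manager" name prefix (an assumption about naming, not a documented guarantee) and uses the 500m expectation quoted above. The function names and sample pods are hypothetical.

```python
EXPECTED_MAX_MILLICORES = 500  # "Expected < 500m" from the incident above


def to_millicores(quantity):
    """Convert a Kubernetes CPU quantity such as "250m" or "12" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def oversized_instance_managers(pods, limit=EXPECTED_MAX_MILLICORES):
    """Return (pod, node, millicores) for instance-manager pods requesting >= limit."""
    findings = []
    for pod in pods:
        name = pod["metadata"]["name"]
        if not name.startswith("instance-manager"):  # assumed naming convention
            continue
        # Sum CPU requests across containers; a missing request counts as zero.
        requested = sum(
            to_millicores(c.get("resources", {}).get("requests", {}).get("cpu", "0"))
            for c in pod["spec"]["containers"]
        )
        if requested >= limit:
            findings.append((name, pod["spec"].get("nodeName"), requested))
    return findings


# Hypothetical pods, e.g. from kubectl get pods -n longhorn-system -o json
pods = [
    {"metadata": {"name": "instance-manager-e-3f9c"},
     "spec": {"nodeName": "galena3",
              "containers": [{"resources": {"requests": {"cpu": "12"}}}]}},
    {"metadata": {"name": "instance-manager-e-81ad"},
     "spec": {"nodeName": "galena1",
              "containers": [{"resources": {"requests": {"cpu": "480m"}}}]}},
    {"metadata": {"name": "longhorn-ui-5c6d"},
     "spec": {"nodeName": "galena1",
              "containers": [{"resources": {"requests": {"cpu": "4"}}}]}},
]
print(oversized_instance_managers(pods))  # [('instance-manager-e-3f9c', 'galena3', 12000)]
```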

How It Works

Three steps to calmer clusters.

Deploy once, diagnose forever. No agents phoning home with your data.

Deploy

$ helm install clustermind

A lightweight StatefulSet with a Redis sidecar deploys inside your cluster. Your API key, your infrastructure. Data never leaves.

Detect

20+ checks / every 6h

AI runs comprehensive diagnostics: nodes, pods, storage, certificates, ArgoCD sync, AlertManager alerts. Four severity tiers, zero false positives on healthy clusters.
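The four-tier report that lands in Slack (like the sample at the top of this page) is essentially a grouping step: order findings by tier, most severe first, and leave out empty tiers. Here is a minimal sketch in Python. The Finding class and render_digest function are hypothetical, and the name of the fourth tier is an assumption, since only Critical, High, and Medium appear in the sample.

```python
from dataclasses import dataclass

# Three tiers appear in the sample report; "Low" as the fourth is an assumption.
SEVERITY_ORDER = ("Critical", "High", "Medium", "Low")


@dataclass
class Finding:
    severity: str
    title: str
    detail: str = ""


def render_digest(cluster, findings):
    """Group findings by tier, most severe first, skipping empty tiers."""
    unknown = {item.severity for item in findings} - set(SEVERITY_ORDER)
    if unknown:
        raise ValueError(f"unknown severity tier(s): {sorted(unknown)}")
    lines = [f"Diagnostic Run: {cluster} cluster"]
    for tier in SEVERITY_ORDER:
        tier_findings = [item for item in findings if item.severity == tier]
        if tier_findings:
            lines += ["", tier]
            lines += [f"- {item.title}" + (f" ({item.detail})" if item.detail else "")
                      for item in tier_findings]
    if not findings:
        lines.append("All checks passed.")  # a healthy cluster gets a one-line report
    return "\n".join(lines)


print(render_digest("galena2", [
    Finding("Medium", "Disk iron3 88% full", "82GB / 99GB, cleanup recommended"),
    Finding("Critical", "Node iron1 NotReady for 17h 23m"),
    Finding("High", "CPU request 12,000m on galena3 blocking attachments"),
    Finding("Critical", "StatefulSet pod n8n-postgresql-0 stuck on dead node"),
]))
```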

Resolve

#incidents → Slack

Alerts land in Slack with root cause analysis and remediation steps. Ask follow-up questions in threads. Or let ClusterMind send a fix PR to your GitOps repo.

Capabilities

Built for the 3am page.

Not another dashboard. An AI operations engineer that lives in your cluster and reports through Slack.

Proactive Diagnostics

20+ customizable checks across 4 severity tiers. Nodes, pods, Longhorn storage, certificates, ArgoCD, AlertManager. Runs every 6 hours, fully configurable.

~$0.15 per diagnostic run

Slack-Native

Alerts with severity, root cause, and kubectl commands. Ask follow-up questions in threads. No context switching to dashboards or terminals.

Response in < 30s

GitOps Auto-Remediation

Connects to your GitHub repos. Creates fix branches, opens PRs, monitors CI, watches ArgoCD rollouts. Multi-repo support out of the box.

PRs with full CI validation

Human-in-the-Loop Safety

Two-bot RBAC architecture. Read-only by default. Dangerous operations require human approval via the dashboard. Protected namespaces can't be touched.

kube-system, argocd, cert-manager protected

BYOK — Your Data Stays

Bring Your Own Anthropic API Key. The agent runs inside your cluster. We receive only minimal metadata. Your secrets, logs, and data never leave your infrastructure.

You pay Anthropic directly — ~$5/mo

ROI You Can Measure

Track AI costs per diagnostic run. See estimated downtime avoided and value delivered. We track what ClusterMind saves you so ROI is never a question. Prometheus metrics for everything.

Know exactly the value delivered

Pricing

See the value before you pay.

Start free during Friends & Family. We show you exactly what you're saving so you can decide if it's worth it.

Starter

$0 platform

BYOK — $49-$149 per successful fix

Bring your own Anthropic key
20+ customizable diagnostic checks
Slack alerts & threads
GitOps auto-remediation
Approval dashboard
Pay only for proven fixes

Join Waitlist

Growth

$1,500/mo

Discounted per-fix pricing

Everything in Starter
Priority support
Multi-cluster management
Advanced cost analytics
Custom diagnostic checks
SLA: 15-min MTTA

Join Waitlist

Enterprise

$5,000/mo

100 free fixes + volume discounts

Everything in Growth
Dedicated support engineer
Custom runbooks
One-click Slack sign-in
SLA: 2-min MTTA, 15-min MTTR
Compliance audit trail

Contact Sales

FAQ

Common questions.

Security & Privacy

Where does my cluster data go?

Nowhere. The agent runs inside your Kubernetes cluster. Your logs, secrets, and cluster state never leave your infrastructure. We receive only minimal metadata (incident counts, heartbeats) for the dashboard. This is architecturally enforced, not just a policy.

What can the agent do in my cluster?

Read-only RBAC by default. The agent can run kubectl get, describe, logs, and top, but cannot modify anything. Write operations (scale, delete, drain) require a separate privileged bot that only executes after human approval via the dashboard.

What operations are blocked entirely?

Hard-blocked operations include: kubectl delete namespace, kubectl delete --all, any operation on protected namespaces (kube-system, argocd, cert-manager, longhorn-system, monitoring), and shell injection patterns. These cannot be bypassed, even with approval.
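Taken together, these rules amount to a three-way decision for every command the agent proposes: run it on the read-only bot, hold it for human approval, or refuse it outright. The sketch below shows one way to express that decision; it is an illustration under assumptions, not ClusterMind's implementation. Commands are assumed to arrive as kubectl strings, the injection check is a deliberately crude metacharacter blocklist, unknown verbs are denied by default, and "any operation on protected namespaces" is read as any write operation, since checks such as the Longhorn ones need to read longhorn-system. The function and constant names are hypothetical.

```python
import re
import shlex

READ_ONLY_VERBS = {"get", "describe", "logs", "top"}
APPROVAL_VERBS = {"scale", "delete", "drain"}
PROTECTED_NAMESPACES = {"kube-system", "argocd", "cert-manager",
                        "longhorn-system", "monitoring"}
NAMESPACE_KINDS = {"namespace", "namespaces", "ns"}
# Crude, hypothetical injection screen: reject shell metacharacters outright.
SHELL_METACHARACTERS = re.compile(r"[;&|`$<>\n]")


def _namespaces(args):
    """Collect namespaces named via -n / --namespace in any of their spellings."""
    found = set()
    for i, arg in enumerate(args):
        if arg in ("-n", "--namespace") and i + 1 < len(args):
            found.add(args[i + 1])
        elif arg.startswith("--namespace="):
            found.add(arg.split("=", 1)[1])
        elif arg.startswith("-n") and len(arg) > 2:  # -nkube-system, -n=kube-system
            found.add(arg[2:].lstrip("="))
    return found


def _targets_namespace(args):
    """True if any argument names the Namespace kind (ns, namespace/foo, pod,ns)."""
    return any(kind in NAMESPACE_KINDS
               for arg in args for kind in arg.split("/")[0].split(","))


def classify_command(command):
    """Return "allow", "needs_approval" or "blocked" for a kubectl command line."""
    if SHELL_METACHARACTERS.search(command):
        return "blocked"
    try:
        args = shlex.split(command)
    except ValueError:  # unbalanced quotes
        return "blocked"
    if len(args) < 2 or args[0] != "kubectl":
        return "blocked"
    verb, rest = args[1], args[2:]
    if verb in READ_ONLY_VERBS:
        return "allow"          # handled by the read-only bot
    if verb not in APPROVAL_VERBS:
        return "blocked"        # default-deny anything unclassified
    # Hard blocks: no approval can unlock these.
    if verb == "delete" and _targets_namespace(rest):
        return "blocked"
    if verb == "delete" and any(a == "--all" or a.startswith("--all=") for a in rest):
        return "blocked"
    if "-A" in rest or "--all-namespaces" in rest:
        return "blocked"
    if _namespaces(rest) & PROTECTED_NAMESPACES:
        return "blocked"
    return "needs_approval"     # privileged bot runs it only after a human approves


print(classify_command("kubectl scale deployment web --replicas=2 -n apps"))
```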

How are credentials and repository access handled?

Your Anthropic API key is stored encrypted in your cluster's Kubernetes Secrets. We never see or handle it. Repository access is managed through a GitHub App, so there are no SSH keys to share or rotate. All authentication uses Slack OAuth with JWT RS256 signing.

Do you have compliance certifications?

On the roadmap for Enterprise launch (Q2 2026). The BYOK architecture inherently reduces our compliance surface: since cluster data stays in your infrastructure, our platform handles only authentication and metadata.

General

Which Kubernetes versions and distributions are supported?

Kubernetes 1.25 and above. Any distribution: EKS, GKE, AKS, k3s, RKE2, kind, OrbStack. If kubectl can reach it, ClusterMind works.

How long does setup take?

About 5 minutes. Add the Slack app to your workspace, click "Connect Cluster" in the dashboard, and run the helm install command. The first diagnostic runs immediately.

How much does it cost to run?

About $5/month in Anthropic API costs for a typical cluster running diagnostics every 6 hours. Each run costs ~$0.15 using Claude Sonnet. You pay Anthropic directly via your own API key. During Friends & Family, the ClusterMind platform itself is free.

Do you support Longhorn, ArgoCD, cert-manager, and AlertManager?

Yes. Purpose-built checks for each: Longhorn volume health, instance-manager errors, and degraded replicas. ArgoCD sync status and application health. cert-manager certificate expiry. AlertManager firing alerts. These aren't generic; they're informed by real production incidents.

How is this different from traditional monitoring and alerting tools?

Those tools alert on metric thresholds. ClusterMind reasons about your cluster. It understands that a StatefulSet pod stuck on a NotReady node won't auto-reschedule. It knows that 99% CPU requests with 16% actual usage is a false alarm, not a crisis. And it doesn't just alert: it opens fix PRs against your GitOps repo.