ClusterMind deploys an AI agent inside your cluster that proactively detects issues, diagnoses root causes, and sends fix PRs — all surfaced in Slack.
n8n-postgresql-0 stuck on dead node
12,000m on galena3 blocking attachments
82GB / 99GB, cleanup recommended
Born from Production
Our diagnostic engine wasn't designed in a vacuum. Each of our 20+ default checks was born from a real production incident on real clusters — and they're fully customizable.
A node went NotReady and nobody noticed for 17 hours. The PostgreSQL StatefulSet pod was stuck on the dead node the entire time — because StatefulSet pods don't auto-reschedule like Deployments.
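A minimal sketch of how a check like this could work, assuming the pod list and each node's Ready condition have already been fetched from the cluster. The `Pod` type and the `stranded_statefulset_pods` helper are hypothetical names for illustration, not ClusterMind's actual code.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    node: str
    owner_kind: str  # e.g. "StatefulSet" or "ReplicaSet"


def stranded_statefulset_pods(pods, node_ready):
    """Return names of StatefulSet pods scheduled on nodes that are not Ready.

    node_ready maps node name -> bool (the node's Ready condition).
    A node missing from the map is treated as not Ready.
    """
    return [
        p.name
        for p in pods
        if p.owner_kind == "StatefulSet" and not node_ready.get(p.node, False)
    ]


# Illustrative data: node-a is NotReady, node-b is healthy.
pods = [
    Pod("n8n-postgresql-0", "node-a", "StatefulSet"),
    Pod("web-7d9f", "node-a", "ReplicaSet"),
    Pod("redis-0", "node-b", "StatefulSet"),
]
node_ready = {"node-a": False, "node-b": True}

stranded = stranded_statefulset_pods(pods, node_ready)
print(stranded)  # ['n8n-postgresql-0']
```

The Deployment-owned pod on the same dead node is deliberately not reported here, since its controller replaces it elsewhere without intervention.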
Three nodes cordoned during an upgrade. Cluster showed 99% CPU request utilization — triggering panic. Actual usage? 16%. ClusterMind distinguishes requests from real usage.
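The difference between requested and used CPU can be illustrated with a little arithmetic. The millicore figures below are made up to reproduce the 99% and 16% ratios from the incident, and the 80% alert threshold is an assumption, not a documented ClusterMind default.

```python
def cpu_pressure(requested_m, used_m, allocatable_m, usage_alert_pct=80.0):
    """Compare CPU requests and real usage against allocatable capacity.

    All CPU values are in millicores. usage_alert_pct is a hypothetical
    threshold above which real usage counts as genuine pressure.
    """
    request_pct = 100.0 * requested_m / allocatable_m
    usage_pct = 100.0 * used_m / allocatable_m
    return {
        "request_pct": round(request_pct, 1),
        "usage_pct": round(usage_pct, 1),
        "real_pressure": usage_pct >= usage_alert_pct,
    }


# Illustrative figures: requests nearly fill capacity, usage does not.
report = cpu_pressure(requested_m=23760, used_m=3840, allocatable_m=24000)
print(report)
# {'request_pct': 99.0, 'usage_pct': 16.0, 'real_pressure': False}
```

Reporting both ratios side by side is what separates a scheduling constraint (pods cannot be placed) from actual CPU saturation.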
A Longhorn volume entered an infinite attach/detach loop, reaching generation count 991. Each cycle risked data corruption. ClusterMind flags volumes with generation >50.
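The threshold rule can be sketched in a few lines. This assumes the generation count for each volume has already been read from the cluster; the volume names and the `flag_looping_volumes` helper are hypothetical.

```python
GENERATION_THRESHOLD = 50  # from the text: flag volumes with generation > 50


def flag_looping_volumes(volume_generations, threshold=GENERATION_THRESHOLD):
    """Return the sorted names of volumes whose generation exceeds the threshold.

    volume_generations maps volume name -> observed generation count.
    """
    return sorted(
        name for name, gen in volume_generations.items() if gen > threshold
    )


# Illustrative data: one volume has looped 991 times, the others are quiet.
flagged = flag_looping_volumes({"pvc-a": 991, "pvc-b": 3, "pvc-c": 50})
print(flagged)  # ['pvc-a']
```

A volume sitting exactly at 50 is not flagged, matching the strict ">50" wording.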
A backup restore accidentally set a Longhorn instance-manager to 12 CPU cores on a single node, silently blocking all volume attachments. Generic monitoring missed it entirely.
How It Works
Deploy once, diagnose forever. No agents phoning home with your data.
$ helm install clustermind
A lightweight StatefulSet with a Redis sidecar deploys inside your cluster. Your API key, your infrastructure. Your data never leaves the cluster.
20+ checks / every 6h
AI runs comprehensive diagnostics: nodes, pods, storage, certificates, ArgoCD sync, AlertManager alerts. Four severity tiers, zero false positives on healthy clusters.
#incidents → Slack
Alerts land in Slack with root cause analysis and remediation steps. Ask follow-up questions in threads. Or let ClusterMind send a fix PR to your GitOps repo.
Capabilities
Not another dashboard. An AI operations engineer that lives in your cluster and reports through Slack.
20+ customizable checks across 4 severity tiers. Nodes, pods, Longhorn storage, certificates, ArgoCD, AlertManager. Runs every 6 hours, fully configurable.
~$0.15 per diagnostic run
Alerts with severity, root cause, and kubectl commands. Ask follow-up questions in threads. No context switching to dashboards or terminals.
Response in < 30s
Connects to your GitHub repos. Creates fix branches, opens PRs, monitors CI, watches ArgoCD rollouts. Multi-repo support out of the box.
PRs with full CI validation
Two-bot RBAC architecture. Read-only by default. Dangerous operations require human approval via dashboard. Protected namespaces can't be touched.
kube-system, argocd, cert-manager protected
Bring Your Own Anthropic API Key. The agent runs inside your cluster. We receive only minimal metadata. Your secrets, logs, and data never leave your infrastructure.
You pay Anthropic directly — ~$5/mo
Track AI cost per diagnostic run alongside estimated downtime avoided, so ROI is never a question. Everything is exposed as Prometheus metrics.
Know exactly the value delivered
Pricing
Start free during Friends & Family. We show you exactly what you're saving so you can decide if it's worth it.
BYOK — $49-$149 per successful fix
Discounted per-fix pricing
100 free fixes + volume discounts
FAQ
The read-only bot can run get, describe, logs, and top, but cannot modify anything. Write operations (scale, delete, drain) require a separate privileged bot that only executes after human approval via the dashboard.
Always blocked: kubectl delete namespace, kubectl delete --all, any operation on protected namespaces (kube-system, argocd, cert-manager, longhorn-system, monitoring), and shell injection patterns. These cannot be bypassed, even with approval.
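The guardrail rules above can be sketched as a three-way classifier. This is a simplified illustration under stated assumptions, not ClusterMind's implementation: the `classify` and `namespace_of` helpers, the metacharacter set used to spot shell injection, and the decision to let read-only verbs through even in protected namespaces are all assumptions. Flag parsing is deliberately naive and expects the verb right after `kubectl`.

```python
import shlex

READ_ONLY_VERBS = {"get", "describe", "logs", "top"}
PROTECTED_NAMESPACES = {
    "kube-system", "argocd", "cert-manager", "longhorn-system", "monitoring",
}
# Hypothetical stand-in for "shell injection patterns".
SHELL_METACHARS = set(";|&`$><")


def namespace_of(args):
    """Return the namespace given via -n / --namespace, or None."""
    for i, arg in enumerate(args):
        if arg in ("-n", "--namespace") and i + 1 < len(args):
            return args[i + 1]
        if arg.startswith("--namespace="):
            return arg.split("=", 1)[1]
    return None


def classify(command):
    """Return 'allowed', 'needs_approval', or 'blocked' for a kubectl command."""
    if any(ch in SHELL_METACHARS for ch in command):
        return "blocked"
    try:
        args = shlex.split(command)
    except ValueError:  # unbalanced quotes
        return "blocked"
    if len(args) < 2 or args[0] != "kubectl":
        return "blocked"
    verb = args[1]
    if verb in READ_ONLY_VERBS:
        return "allowed"  # assumption: reads are permitted everywhere
    resource = args[2] if len(args) > 2 else ""
    if verb == "delete" and (
        "--all" in args or resource in ("namespace", "namespaces", "ns")
    ):
        return "blocked"
    if namespace_of(args) in PROTECTED_NAMESPACES:
        return "blocked"
    return "needs_approval"


print(classify("kubectl get pods -n kube-system"))       # allowed
print(classify("kubectl scale deploy/web --replicas=3")) # needs_approval
print(classify("kubectl delete pod x -n argocd"))        # blocked
```

Checking the hard blocks before the approval path means an approval can never unlock them, which mirrors the "cannot be bypassed" guarantee.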
kubectl, ClusterMind works.
helm install command. The first diagnostic runs immediately.
Get Started
Join the Friends & Family preview. Full product, zero cost. We'll show you the value before we ever charge.