Solo DevOps engineers become bottlenecks because all operational knowledge lives in their head, and new team members are too afraid to touch production.
RunbookHQ scans your IaC (Terraform, Helm), CI/CD pipelines, monitoring alerts, and incident history to auto-generate interactive runbooks with step-by-step remediation guides, blast-radius warnings, and safe rollback procedures.
Subscription: $49/mo for small teams, $199/mo for orgs with multiple clusters/environments
This is a hair-on-fire problem. The Reddit thread you cited is one of hundreds—solo DevOps engineers being single points of failure is possibly the #1 complained-about issue in r/devops. When that person goes on vacation or quits, production operations effectively stop. The fear of 'touching production' by new hires is universal and costs companies weeks of ramp-up time. Bus factor of 1 is an existential risk for small companies.
The target is small-to-mid engineering teams with 1-3 DevOps engineers. There are roughly 500K-1M such teams globally across startups and mid-market companies. At $49-199/mo, the addressable market is roughly $300M-$2.4B/year (500K teams at $49/mo at the low end, 1M teams at $199/mo at the high end). Not a massive TAM compared to broad DevOps tooling, but large enough for a very successful company. The sweet spot is the 10-100 employee company with 1-2 DevOps people; there are hundreds of thousands of these.
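The market-size range can be checked with a few lines of arithmetic. This is a back-of-envelope sketch that assumes every team in the estimate pays for exactly one plan all year, so it is a ceiling on addressable revenue rather than a forecast; the function name `annual_tam` is illustrative.

```python
# Back-of-envelope TAM check using the team counts and prices quoted above.
MONTHS_PER_YEAR = 12


def annual_tam(teams: int, price_per_month: int) -> int:
    """Annual revenue if every team paid the given monthly price."""
    return teams * price_per_month * MONTHS_PER_YEAR


low = annual_tam(500_000, 49)       # fewest teams, cheapest plan
high = annual_tam(1_000_000, 199)   # most teams, priciest plan

print(f"low:  ${low / 1e6:,.0f}M/year")   # low:  $294M/year
print(f"high: ${high / 1e9:,.2f}B/year")  # high: $2.39B/year
```

Any realistic penetration rate scales both ends down proportionally.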
DevOps teams already pay for PagerDuty ($20+/user), Datadog ($15+/host), and dozens of other tools. $49-199/mo is well within budget tolerance. However, the buyer persona (solo DevOps engineer) often doesn't control budget and must convince engineering leadership. The ROI story is compelling (reduce onboarding from months to days, reduce incident MTTR) but 'documentation tooling' historically has lower willingness-to-pay than 'monitoring' or 'security' tooling. Price the value (reduced risk, faster onboarding), not the category.
Parsing Terraform/Helm is well-documented—HCL and YAML have mature parsers. CI/CD pipeline analysis (GitHub Actions, GitLab CI) is doable via API. Monitoring alert integration (PagerDuty, OpsGenie APIs) is straightforward. The HARD part is generating actually-useful, context-aware runbooks from this data—this requires strong LLM integration and careful prompt engineering. An MVP that scans Terraform + generates basic runbooks is buildable in 6-8 weeks by a strong solo dev. Full blast-radius analysis and incident history correlation pushes to 10-12 weeks. Not trivial, but achievable.
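To make the parsing and blast-radius steps concrete, here is a minimal sketch that works on JSON shaped like the output of `terraform show -json` instead of parsing HCL directly. The sample data is hypothetical and trimmed. The sketch assumes each resource entry carries an `address` and a `depends_on` list, which should be verified against the output of your Terraform version, and it ignores child modules.

```python
import json
from collections import deque

# Hypothetical, trimmed sample shaped like `terraform show -json` output.
SAMPLE_STATE = json.loads("""
{
  "values": {"root_module": {"resources": [
    {"address": "aws_vpc.main", "type": "aws_vpc", "depends_on": []},
    {"address": "aws_subnet.app", "type": "aws_subnet",
     "depends_on": ["aws_vpc.main"]},
    {"address": "aws_db_instance.orders", "type": "aws_db_instance",
     "depends_on": ["aws_subnet.app"]},
    {"address": "aws_instance.api", "type": "aws_instance",
     "depends_on": ["aws_subnet.app", "aws_db_instance.orders"]}
  ]}}
}
""")


def build_dependents(state: dict) -> dict[str, set[str]]:
    """Map each resource address to the resources that directly depend on it."""
    dependents: dict[str, set[str]] = {}
    for res in state["values"]["root_module"]["resources"]:
        dependents.setdefault(res["address"], set())
        for dep in res.get("depends_on", []):
            dependents.setdefault(dep, set()).add(res["address"])
    return dependents


def blast_radius(dependents: dict[str, set[str]], start: str) -> list[str]:
    """Breadth-first walk over everything downstream of `start`."""
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


deps = build_dependents(SAMPLE_STATE)
print(blast_radius(deps, "aws_subnet.app"))
# ['aws_db_instance.orders', 'aws_instance.api']
```

A declared-dependency walk like this only captures what Terraform knows about. Runtime coupling (for example, a service calling another over the network) would need data from monitoring or incident history.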
This is the strongest dimension. NO existing product auto-generates runbooks from IaC. Rundeck requires manual authoring. Shoreline is enterprise-only automation. Confluence is static wikis. FireHydrant has manual templates. The gap between 'what exists' and 'what RunbookHQ proposes' is massive. The insight that runbooks should be GENERATED from infrastructure code rather than WRITTEN by humans is genuinely novel and technically timely (LLMs make this possible now in a way it wasn't 2 years ago).
Textbook SaaS subscription. Infrastructure changes continuously, so runbooks need continuous regeneration—this creates natural ongoing value. As teams grow, they need more runbooks for more services. As infrastructure evolves (new clusters, new services, new environments), the product becomes more valuable. Strong expansion revenue potential: start with one cluster, expand to all environments. Very low churn risk once integrated into onboarding workflow.
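The "infrastructure changes, so runbooks need regeneration" loop could be driven by a simple fingerprint comparison. The sketch below is one possible approach, not a description of an existing implementation: it hashes each resource's configuration and reports which runbooks are stale. The function names and the shape of the inputs are assumptions made for illustration.

```python
import hashlib
import json


def fingerprint(resource_config: dict) -> str:
    """Stable hash of a resource's configuration, independent of key order."""
    canonical = json.dumps(resource_config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def stale_runbooks(previous: dict[str, str],
                   current_resources: dict[str, dict]) -> dict[str, list[str]]:
    """Compare stored fingerprints with the current scan.

    `previous` maps resource address -> fingerprint from the last run.
    `current_resources` maps resource address -> current configuration.
    """
    current = {addr: fingerprint(cfg) for addr, cfg in current_resources.items()}
    return {
        # New or changed resources need a fresh runbook.
        "regenerate": sorted(a for a, h in current.items() if previous.get(a) != h),
        # Resources that disappeared should have their runbooks retired.
        "retire": sorted(a for a in previous if a not in current),
    }


last_scan = {
    "aws_instance.api": {"instance_type": "t3.small"},
    "aws_s3_bucket.logs": {"versioning": True},
}
stored = {addr: fingerprint(cfg) for addr, cfg in last_scan.items()}

new_scan = {
    "aws_instance.api": {"instance_type": "t3.large"},   # changed
    "aws_db_instance.orders": {"engine": "postgres"},    # added
}
print(stale_runbooks(stored, new_scan))
```

Regenerating only what changed keeps LLM cost proportional to the size of each change instead of the size of the whole estate.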
- +Massive competition gap—no one auto-generates runbooks from IaC, this is genuinely novel
- +Pain intensity is extreme and well-validated across DevOps communities (bus factor, onboarding fear)
- +LLM timing is perfect—this product wasn't technically feasible 2 years ago, now it is
- +Natural recurring revenue: infrastructure changes = runbooks need regeneration
- +Clear, understandable value prop that sells itself: 'new hire operates production safely on day one'
- +Strong expansion path: one team → entire org, one cluster → all environments
- !Generated runbook quality is the make-or-break factor—if the output is generic or wrong, trust is destroyed immediately. A bad runbook in production is worse than no runbook.
- !Buyer persona (solo DevOps engineer) may not have purchasing authority—may need to sell to engineering managers instead
- !Large cloud providers (AWS, GCP) or incumbents (PagerDuty/Rundeck) could add auto-generation features as LLMs become commoditized
- !Security sensitivity: scanning IaC and infrastructure configs means handling sensitive data—SOC2/security posture will be required sooner than expected
- !Risk of being perceived as 'AI-generated docs' (low trust category) rather than 'operational safety platform' (high trust category)—positioning matters enormously
Rundeck: Runbook automation platform that lets teams define, build, and safely execute operational procedures as automated or semi-automated workflows. Integrates with CI/CD and monitoring tools.
Shoreline: Incident automation platform that lets DevOps teams create automated remediations.
Confluence wikis: Most DevOps teams cobble together runbooks as wiki pages in Confluence, often linked from PagerDuty or OpsGenie alerts. The de facto 'solution' for operational documentation.
FireHydrant: Incident management platform with runbook features. Provides incident workflows, status pages, retrospectives, and runbook templates that can be attached to services.
Backstage / Port: Internal developer portals that catalog services, infrastructure, and documentation. Backstage is open-source; Port is commercial. Both aim to reduce cognitive load for developers.
MVP scope: Terraform plus one CI/CD platform (GitHub Actions). Skip Helm, monitoring integration, and blast-radius analysis for the MVP; add these based on user feedback. Build order:
- Week 1-2: Terraform parser that extracts resources and their dependencies from HCL, and reads current state where a state file or backend is accessible.
- Week 3-4: LLM pipeline that generates runbooks from the parsed infrastructure (focus on 'what does this do', 'how to safely modify', 'how to rollback').
- Week 5-6: GitHub/GitLab integration to auto-detect IaC repos and regenerate runbooks on PR merge.
- Week 7-8: Simple web UI showing runbooks organized by service/resource, with search.
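The week 3-4 step, turning parsed infrastructure into an LLM request, might start as a plain prompt builder. The sketch below assembles only the prompt text around the three questions named in the plan. It deliberately makes no call to any model API, and the resource shape (`address`, `type`, `values`) and the wording of the instructions are assumptions.

```python
import json

# The three questions the plan says each runbook should answer.
QUESTIONS = (
    "What does this resource do?",
    "How can it be modified safely?",
    "How is a change rolled back?",
)


def build_runbook_prompt(resource: dict, dependents: list[str]) -> str:
    """Assemble a prompt for one resource; sending it to a model is out of scope."""
    affected = ", ".join(dependents) if dependents else "none detected"
    lines = [
        "You are writing an operational runbook for an on-call engineer.",
        f"Resource: {resource['address']} (type: {resource['type']})",
        f"Downstream resources that may be affected: {affected}",
        "Configuration:",
        json.dumps(resource.get("values", {}), indent=2, sort_keys=True),
        "Answer each question in its own section. If the configuration does "
        "not contain enough information, say so instead of guessing.",
    ]
    lines += [f"{i}. {q}" for i, q in enumerate(QUESTIONS, start=1)]
    return "\n".join(lines)


prompt = build_runbook_prompt(
    {"address": "aws_instance.api", "type": "aws_instance",
     "values": {"instance_type": "t3.small"}},
    dependents=["aws_lb.public"],
)
print(prompt)
```

Asking the model to admit missing information is one cheap guard against the "generic or wrong runbook" risk. It does not replace human review of generated output.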
- Free: Scan 1 repo, generate up to 10 runbooks (read-only, no regeneration)
- Starter ($49/mo): 3 repos, unlimited runbooks, auto-regeneration on infrastructure changes, team sharing
- Pro ($199/mo): Unlimited repos, multiple environments, incident history integration, blast-radius analysis, custom runbook templates, SSO
- Enterprise ($499/mo): On-prem/VPC deployment, SOC2 compliance, dedicated support, custom integrations
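The tier limits above translate naturally into a small plan table that the product could consult before allowing an action. This is an illustrative sketch: the `Plan` fields and the helper `can_add_repo` are hypothetical, and treating Enterprise as unlimited repos is an assumption, since the tier description does not state a limit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Plan:
    name: str
    price_per_month: int
    max_repos: Optional[int]      # None means unlimited
    max_runbooks: Optional[int]   # None means unlimited
    auto_regenerate: bool


PLANS = {
    "free": Plan("Free", 0, max_repos=1, max_runbooks=10, auto_regenerate=False),
    "starter": Plan("Starter", 49, max_repos=3, max_runbooks=None, auto_regenerate=True),
    "pro": Plan("Pro", 199, max_repos=None, max_runbooks=None, auto_regenerate=True),
    # Assumption: Enterprise inherits Pro's unlimited limits.
    "enterprise": Plan("Enterprise", 499, max_repos=None, max_runbooks=None,
                       auto_regenerate=True),
}


def can_add_repo(plan: Plan, repos_in_use: int) -> bool:
    """True if the account may connect one more repository."""
    return plan.max_repos is None or repos_in_use < plan.max_repos


print(can_add_repo(PLANS["free"], 1))     # False
print(can_add_repo(PLANS["starter"], 2))  # True
```

Keeping limits in data, not scattered through conditionals, makes it easier to adjust packaging as pricing is tested with early customers.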
8-10 weeks to MVP, 12-14 weeks to first paying customer. The DevOps community is highly active on Reddit, HN, and dev.to—a Show HN post with a working demo scanning a public Terraform repo could generate significant interest. First revenue likely from a small startup team that recognizes the pain immediately. Target: 10 paying customers within 4 months of launch.
- “leading another DevOps Engineer who joined recently and isn't really confident about touching anything production related”
- “You are not a DevOps Engineer. You are an entire IT department”
- “I am often expected to be available outside my working hours when something goes down”