Data engineers inherit large legacy databases with no documentation, no formalized schema, and no information_schema — forcing weeks of manual code reading and guesswork to understand data flows and relationships.
Connects to a target database, samples data, analyzes column names/types/values/foreign-key patterns, and uses AI to infer entity relationships, generate ER diagrams, and produce queryable documentation. Supports Oracle, SQL Server, MySQL, Postgres, and other legacy systems.
Freemium — free for small databases (<50 tables), paid tiers for enterprise-scale databases, team collaboration, and ongoing schema drift monitoring. $49/mo individual, $299/mo team.
This is a genuine, visceral pain. The Reddit thread confirms it — someone literally spent 3 weeks reading code to understand a database. This isn't a 'nice to have' — it blocks entire migration projects worth millions. When a consultant bills $200/hr and spends 80 hours manually reverse-engineering a schema, that's $16K of pure waste per engagement. The pain is acute, time-bound, and has real dollar cost.
TAM is meaningful but niche. There are ~500K data engineers globally and millions of DBAs. Legacy database modernization is a $20B+ market. However, the specific tool market (reverse-engineering documentation) is a subset. Estimated serviceable market: $200M-$500M if you capture consultants, enterprises, and data teams. Not a billion-dollar standalone market, but strong enough for a very profitable SaaS.
Strong signals. Enterprises already pay $50K+ for Alation/Collibra. Consultants bill $150-300/hr and would happily pay $49/mo to save days of work — the ROI is absurd (save 40 hours = $6K-12K vs a $49 cost). The Reddit thread shows a practitioner who went as far as building their own tool to solve this — the ultimate willingness-to-pay signal. Enterprise procurement for migration projects has budget. $299/mo for teams is well within 'expense it on a credit card' range.
Core MVP is buildable by a strong solo dev in 6-8 weeks: connect to DB via JDBC/ODBC, query system catalogs + sample data, analyze column name patterns (user_id → users.id), check referential integrity in actual data, feed to LLM for relationship inference, generate ER diagrams. The hard parts: (1) handling databases that truly lack information_schema (old Oracle, AS/400, etc.) requires specialized connectors; (2) AI inference accuracy needs to be high enough to be trusted — hallucinated relationships are worse than none; (3) scale testing on 1000+ table databases. Doable but not trivial.
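The first two inference steps — matching column-name conventions (user_id → users.id) and checking referential integrity against sampled data — can be sketched in a few lines. This is a minimal illustration, not the product's implementation; the naming conventions and naive pluralization rules below are assumptions a real connector would need to extend per dialect.

```python
import re

def candidate_fk_targets(column, tables):
    """Guess which tables a column like 'user_id' might reference.
    The *_id/*_key/*_code convention and naive pluralization are
    assumptions, not an exhaustive rule set."""
    m = re.fullmatch(r"(\w+?)_(id|key|code)", column)
    if not m:
        return []
    stem = m.group(1)
    # try the stem as-is, plus simple plural forms, in a fixed order
    guesses = [stem, stem + "s", stem + "es"]
    return [t for t in guesses if t in tables]

def referential_overlap(child_values, parent_values):
    """Fraction of non-null sampled child values found in the parent
    key set — high overlap supports an inferred foreign key."""
    child = {v for v in child_values if v is not None}
    if not child:
        return 0.0
    return len(child & set(parent_values)) / len(child)
```

A relationship would only be proposed when both signals agree, e.g. a name match plus an overlap above some threshold (0.95, say), which keeps precision high — important given that hallucinated relationships are worse than none.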
This is the strongest signal. Every existing tool falls into one of two buckets: (1) free/cheap tools that only read declared metadata — useless for the core problem of undocumented databases; (2) enterprise data catalogs that cost $50K+ and take months to deploy. There is NO mid-market, AI-powered tool focused specifically on the acute problem of 'understand this legacy database fast.' The gap is enormous and well-defined.
Mixed. The acute use case (reverse-engineer a database during migration) is project-based, not recurring — you need it intensely for 2-4 weeks, then you're done. Schema drift monitoring adds recurring value but is a weaker pain point. Team collaboration and living documentation improve stickiness. The best recurring path is per-database pricing for consultants who do this repeatedly across clients. Enterprise contracts for ongoing documentation maintenance are possible but require more product depth.
- +Massive, validated gap — no AI-powered tool exists in the mid-market for this specific acute pain
- +Clear, quantifiable ROI — saves weeks of manual work that costs thousands in labor
- +Strong pain signals from real practitioners (Reddit thread, someone built their own tool)
- +Natural enterprise upsell path — starts with individual data engineers, expands to team/org
- +AI timing is perfect — LLMs are now good enough to do credible schema inference that wasn't possible 3 years ago
- +Defensible moat potential — training on patterns from thousands of legacy databases creates compounding data advantage
- !Recurring revenue challenge — core use case is project-based (migration), not ongoing. Must find sticky features (drift monitoring, living docs) or target consultants who do this repeatedly
- !Accuracy trust gap — if AI infers wrong relationships, users lose trust fast. False positives in schema inference could lead to bad migration decisions. Need high precision over recall
- !Legacy database connectivity is a long tail of pain — each old system (AS/400, Informix, ancient Oracle versions) has its own quirks. Supporting the truly legacy databases that need this most is hard
- !Enterprise sales cycle — the teams with the biggest pain are inside large enterprises with procurement processes, security reviews, and data access restrictions. Getting DB credentials from a Fortune 500 is non-trivial
- !Open-source risk — SchemaSpy or SchemaCrawler could add AI features, or someone could build an open-source alternative quickly
Database documentation tool that connects to databases, imports metadata, lets teams add descriptions, and generates documentation with ER diagrams. Supports reverse-engineering schemas from 20+ database types.
Open-source tool that analyzes database metadata and generates interactive HTML documentation with ER diagrams. Reads information_schema and foreign keys to map relationships.
Open-source database schema discovery and comprehension tool. Provides detailed schema metadata, generates ER diagrams, and supports scripting/automation for schema analysis.
General-purpose database IDE tools that include ER diagram generation and schema browsing as part of their feature set. DBeaver is open-source with a Pro tier; DbVisualizer is commercial.
Enterprise data catalog platforms that crawl databases, infer lineage, and provide searchable metadata with collaborative documentation. Increasingly adding AI features.
CLI + web UI tool that connects to PostgreSQL, MySQL, SQL Server, and Oracle. Queries system catalogs AND samples actual data (first 1000 rows per table). Uses column name pattern matching (regex-based: *_id, *_code, *_key) plus data value intersection analysis to infer foreign key relationships. Feeds metadata + samples to GPT-4/Claude to generate natural language table/column descriptions and relationship confidence scores. Outputs an interactive ER diagram (via Mermaid.js or D3) and a searchable HTML documentation site. Free for <50 tables, with sign-up required for larger databases. Ship in 6 weeks.
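The diagram output step is straightforward: inferred relationships, each carrying a confidence score, render directly to Mermaid's `erDiagram` syntax. A minimal sketch, assuming relationships arrive as (child_table, parent_table, confidence) tuples — the tuple shape and the choice to surface confidence in the edge label are illustrative, not the tool's actual format:

```python
def to_mermaid(relationships):
    """Render inferred FK relationships as a Mermaid erDiagram.

    relationships: iterable of (child_table, parent_table, confidence)
    where confidence is the inference score in [0, 1].
    """
    lines = ["erDiagram"]
    for child, parent, conf in relationships:
        # many-to-one crow's-foot edge; the label carries the
        # AI-inferred confidence so users can eyeball weak links
        lines.append(f'    {parent} ||--o{{ {child} : "conf {conf:.2f}"')
    return "\n".join(lines)
```

Surfacing the confidence score on every edge matters for the trust problem: users can visually triage low-confidence inferences instead of taking the whole diagram on faith.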
Free tier (<50 tables, single database) drives adoption with individual data engineers → $49/mo Individual (unlimited tables, multiple databases, export to PDF/Confluence, AI-generated documentation) → $299/mo Team (shared workspace, collaborative annotations, schema diff/drift alerts, SSO) → Enterprise ($1K+/mo, on-prem deployment, audit logs, API access, custom integrations). Secondary revenue: consulting marketplace connecting SchemaLens power users with companies needing legacy DB expertise.
8-12 weeks. Weeks 1-6: build MVP with 4 database connectors, AI inference, and basic web UI. Weeks 7-8: beta with 20-30 users from Reddit/HN data engineering communities. Weeks 9-10: iterate on accuracy based on feedback. Weeks 11-12: launch paid tier. First paying customers likely from consultants and freelance data engineers who hit this pain regularly. Could see $1K-5K MRR within 3 months of launch if product-market fit is validated.
- “reverse engineering a very large legacy enterprise database, no formalised schema, no information_schema, no documentation”
- “interested in tools that infer relationships automatically, or whether it's always a manual grind”
- “I just read the code for like 3 weeks. Noted down what I thought was the flow”
- “the maintainer of that old code was very open about not understanding it, because he didn't write the origin”
- “I built a tool to solve that problem a few years ago based on queries of the Oracle data dictionary”