Data teams rely on 'users telling me something seems wrong' as their primary data quality tool because dedicated tools are too expensive, require lengthy procurement, and don't stick.
A Python/SQL library that deploys as scheduled queries directly in your data warehouse (Snowflake, BigQuery, Redshift). Auto-generates anomaly detection checks from table metadata, sends alerts via Slack/PagerDuty. Zero infrastructure — runs as warehouse jobs.
Freemium — free OSS for manual checks; paid tier ($99-299/mo) for auto-generated monitors, alert routing, and incident tracking
The Reddit thread is textbook evidence — 183 upvotes on a pain-venting thread, with multiple comments confirming that 'users telling me something seems wrong' is their primary data quality tool. Data quality is a top-3 pain point in every data engineering survey. Teams are actively looking for solutions and failing to find affordable ones that stick.
TAM for data quality tooling is $3-5B. The specific segment (small-to-mid data teams, $99-299/mo) is smaller but substantial — there are ~50k+ companies with 1-5 person data teams using warehouses. At a $200/mo average, that's $120M+ in annual addressable revenue. Not venture-scale, but excellent for a bootstrapped/indie product.
$99-299/mo is in the 'put it on the team credit card' range — no procurement needed. Data teams already pay for dbt Cloud, Fivetran, etc. in this range. The pain signals show teams WANT to pay but existing options are too expensive. Risk: some teams will just stick with free OSS and never convert.
A Python/SQL library that generates and schedules warehouse queries is very buildable. No ML infrastructure needed for V1 — statistical anomaly detection (z-scores, IQR) on query results is sufficient. Metadata introspection APIs exist for all three warehouses. Slack/PagerDuty webhooks are trivial. A strong solo dev with warehouse experience could ship MVP in 4-6 weeks.
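The statistical baseline described above fits in a few lines of stdlib Python. A minimal sketch (function names and sample data are illustrative, not from any existing library) of z-score and Tukey-fence (IQR) detectors over a table's daily row counts:

```python
import statistics

def zscore_anomalies(values, threshold=2.0):
    """Indices of points more than `threshold` sample standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

def iqr_anomalies(values, k=1.5):
    """Indices of points outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < lo or v > hi]

# Seven days of row counts for one table; the last load is suspiciously small.
row_counts = [10_120, 10_340, 9_980, 10_210, 10_415, 10_050, 1_200]
print(zscore_anomalies(row_counts))  # [6]
print(iqr_anomalies(row_counts))     # [6]
```

Shipping both detectors matters in practice: a large outlier inflates the standard deviation enough to mask itself at a strict z-score threshold (here it scores z ≈ 2.3, so a threshold of 3 would miss it), while the IQR fences are robust to it.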
The gap is real and specific: Elementary requires dbt, Great Expectations is too complex, Soda doesn't auto-generate well, Monte Carlo is too expensive, and native dbt tests are too basic. No one has nailed 'install a Python package, point at warehouse, get monitors in 10 minutes' without requiring dbt or a new DSL. The auto-generation from metadata angle is particularly underserved.
Data quality monitoring is inherently ongoing — tables change, schemas evolve, anomalies appear continuously. Once monitors are in place, removing them feels like turning off smoke detectors. The paid tier (auto-generation, alert routing, incident tracking) provides continuous value. Natural expansion as teams add more tables and data sources.
- +Perfectly positioned in the 'missing middle' between free-but-basic and enterprise-expensive
- +Zero infrastructure approach removes the #1 adoption barrier — nothing to deploy, maintain, or get IT approval for
- +Auto-generation from metadata is a genuine differentiator — most tools require manual check writing which is why they don't stick
- +$99-299/mo pricing hits the credit card threshold — no procurement, no budget approval, instant adoption
- +The pain is validated by real community discussion with high engagement, not hypothetical
- +Python/SQL-native means zero new DSL to learn — meets data engineers where they already are
- !Elementary Data is very close to this idea and has funding + community momentum — if they drop the dbt requirement, the gap narrows significantly
- !Warehouse vendors (Snowflake, BigQuery, Databricks) are building native data quality features — could commoditize this layer over time
- !OSS-to-paid conversion is historically hard in data tooling — many teams will use free tier forever and resist paying
- !The 'auto-generate monitors from metadata' promise is easy to market but hard to make accurate — noisy alerts will kill adoption faster than no alerts
- !Small data teams (1-2 people) may not have enough pain to pay — they manage with manual checks and don't monitor enough tables to need automation
Open-source data observability built on top of dbt. Provides data quality tests, anomaly detection, and Slack/email alerts. Runs as dbt packages inside the warehouse.
Open-source data quality framework. Checks are written in SodaCL, a YAML-based DSL, and scans run against the warehouse — which means learning a new check language rather than writing plain SQL.
Python-based data quality framework. Define 'expectations' (e.g. `expect_column_values_to_not_be_null`) and group them into suites; powerful but configuration-heavy to set up and maintain.
Enterprise data observability platform. Automated anomaly detection, lineage, root cause analysis across the data stack.
Native data testing in dbt — schema tests (`unique`, `not_null`, `accepted_values`, `relationships`) plus custom SQL tests; pass/fail assertions only, with no anomaly detection.
Python CLI/library: `pip install datapulse && datapulse init --snowflake`. Connects to warehouse, introspects table metadata (row counts, null rates, cardinality, freshness), auto-generates a baseline set of anomaly checks as SQL queries, deploys them as scheduled warehouse jobs. Alerts go to a single Slack channel. Free tier: up to 5 tables, manual check writing only. Paid tier: unlimited auto-generated monitors, alert routing rules, and a simple web dashboard showing check history. Ship Snowflake support first — it has the most vocal small-team users.
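The auto-generation step — introspected metadata in, SQL checks out — can be sketched as pure string generation, which keeps it testable without a live warehouse. All names here (`ColumnMeta`, `generate_checks`, the default thresholds) are hypothetical, and the SQL assumes Snowflake dialect (`COUNT_IF`, `DATEDIFF`) since that ships first:

```python
from dataclasses import dataclass

@dataclass
class ColumnMeta:
    name: str
    null_rate: float        # baseline null fraction observed during introspection
    is_timestamp: bool = False

def generate_checks(table: str, columns: list[ColumnMeta],
                    null_margin: float = 0.05, max_lag_hours: int = 24) -> list[str]:
    """Turn introspected column metadata into SQL checks.

    A check 'fires' when its query returns a row; each query would be
    deployed as a scheduled warehouse job (e.g. a Snowflake task).
    """
    checks = []
    for col in columns:
        limit = col.null_rate + null_margin
        # Null-rate drift: fire if nulls exceed the observed baseline + margin.
        checks.append(
            f"SELECT COUNT_IF({col.name} IS NULL) / COUNT(*) AS null_rate "
            f"FROM {table} "
            f"HAVING COUNT_IF({col.name} IS NULL) / COUNT(*) > {limit:.3f}"
        )
        if col.is_timestamp:
            # Freshness: fire if the newest row is older than max_lag_hours.
            checks.append(
                f"SELECT DATEDIFF('hour', MAX({col.name}), CURRENT_TIMESTAMP()) AS lag_hours "
                f"FROM {table} "
                f"HAVING DATEDIFF('hour', MAX({col.name}), CURRENT_TIMESTAMP()) > {max_lag_hours}"
            )
    return checks

cols = [ColumnMeta("user_id", 0.0), ColumnMeta("created_at", 0.01, is_timestamp=True)]
for sql in generate_checks("analytics.events", cols):
    print(sql)
```

Baselining thresholds from observed metadata (rather than fixed defaults) is what keeps the generated monitors from being noisy — the risk flagged above as the fastest way to kill adoption.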
Phase 1 (Free OSS): Python library for manual SQL check writing, basic CLI, community traction. Phase 2 ($99/mo Starter): Auto-generated monitors, Slack alerting, up to 25 tables. Phase 3 ($299/mo Pro): Unlimited tables, PagerDuty/OpsGenie integration, incident tracking, alert routing rules, check history dashboard. Phase 4 ($499+/mo Team): Multi-user access, role-based alert ownership, SLA tracking, API access. Long-term: usage-based pricing on number of monitored tables/checks executed.
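The Phase 2 Slack alerting needs no SDK: a Slack incoming webhook accepts a plain JSON POST. A minimal sketch (function names and message format are illustrative):

```python
import json
import urllib.request

def format_alert(table: str, check: str, observed: float, threshold: float) -> dict:
    """Build a Slack incoming-webhook payload for a failed check."""
    return {
        "text": (
            f":rotating_light: {check} failed on `{table}`: "
            f"observed {observed:g}, threshold {threshold:g}"
        )
    }

def send_alert(webhook_url: str, payload: dict) -> int:
    """POST the payload to the webhook; Slack returns HTTP 200 on success."""
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

payload = format_alert("analytics.events", "null_rate(user_id)", 0.12, 0.05)
print(payload["text"])
```

PagerDuty's Events API follows the same shape — one JSON POST per incident — so the Phase 3 integration is a second formatter, not new infrastructure.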
6-10 weeks to MVP with free tier and community launch. 3-4 months to first paying customer if launched with strong content marketing on Reddit/HN/data engineering communities. The key is shipping a genuinely useful free tier fast, then converting power users who hit the 5-table limit or want auto-generation.
- “I use the one as old as time: users telling me 'something seems wrong'”
- “I just use python to be honest”
- “They can be expensive, and often have severe limitations”
- “most teams tried several on the list, but no tool stuck”
- “build in-house or use native features from their data warehouse”