Data engineers working on identity resolution face a complex web of decisions: field-level merge strategies, child data deduplication, unmerge/backtracking, data recency trust scoring, and unique ID management. Most teams rebuild these from scratch or cobble together partial solutions like Splink.
A SaaS platform that provides configurable entity resolution pipelines with built-in merge strategies, automatic lineage/audit trails, one-click unmerge with child cascade, recency-weighted field resolution, and a unified ID graph. Integrates with warehouses (Snowflake, BigQuery, Databricks) and exposes APIs.
subscription
The Reddit thread is dense with pain signals: each comment describes a different dimension of complexity (merge strategies, unmerge, child cascade, recency trust, ID management). This is a known, recurring headache that data engineers face across companies. The phrase 'welcome to the problem space' implies that veterans regard the problem as unsolved. Teams spend months rebuilding these pipelines from scratch. The pain is real, frequent, and expensive.
The entity resolution/MDM TAM is $15-20B, growing at a 12-15% CAGR. Even capturing a niche (developer-first, warehouse-native ER for mid-to-large companies) represents a $500M+ addressable segment. Every company with customer data eventually needs identity resolution. The segment is neither consumer-tiny nor enterprise-only, which makes it a sweet spot for a focused SaaS.
Enterprise MDM buyers already pay $200K-$1M+/year for Reltio/Tamr/Informatica. Data engineering teams have tooling budgets ($5K-$50K/year per tool is normal in the Snowflake/dbt/Fivetran ecosystem). A warehouse-native ER platform at $1K-$10K/month ($12K-$120K/year) would be a fraction of what enterprises pay today. However, open-source alternatives (Splink) create a free floor, and convincing data engineers to pay for a managed service over DIY requires proving significant time savings. Score docked because the buyer (data engineer) often isn't the budget holder.
Entity resolution is genuinely hard computer science — probabilistic matching, graph algorithms, conflict resolution logic, lineage DAGs, warehouse-native execution (Snowflake UDFs vs BigQuery remote functions vs Databricks). A true MVP covering configurable merge strategies, unmerge with cascade, lineage tracking, recency weighting, AND multi-warehouse integration is ambitious for 4-8 weeks. A solo dev could build a proof-of-concept for ONE warehouse with basic merge/unmerge in 8 weeks, but production-grade multi-warehouse support with all promised features is more like 4-6 months. The core matching engine alone is a deep problem.
The whitespace is clear and validated: no existing product combines configurable merge/unmerge, lineage, recency-weighted resolution, warehouse-native execution, and developer-first UX. Open-source tools (Splink, Zingg) only do matching. Enterprise platforms (Reltio, Informatica, Tamr) cost $200K+/year, are not warehouse-native, and are not built for data engineers. API tools (Senzing, Tilores) resolve entities but don't manage the merge, unmerge, and lineage workflow around them. The gap is real and well-defined.
Entity resolution is inherently ongoing — new records arrive daily, matches evolve, merges/unmerges happen continuously, data quality degrades over time. This is not a one-time ETL job. Companies need persistent identity graphs maintained in perpetuity. Usage-based pricing on record volume + monthly platform fee is natural. Very strong subscription/consumption model fit. Once integrated into a data pipeline, switching costs are extremely high.
- +Clearly validated pain with specific, articulated sub-problems (merge, unmerge, lineage, recency) — not a solution looking for a problem
- +Massive competition gap: nothing is both warehouse-native AND developer-first with full MDM workflows
- +Existing market spending proves willingness to pay — you just need to offer 80% of value at 10% of enterprise MDM price
- +Extremely high switching costs once integrated into data pipelines — strong retention moat
- +Growing market with tailwinds: cloud warehouse adoption, data mesh, regulatory pressure all increase demand
- !Technical complexity is high — entity resolution is a deep domain with many edge cases. Underestimating build time is the #1 risk
- !Open-source Splink is 'good enough' for many teams, creating a free floor that makes initial conversion harder
- !Selling to data engineers (influencers) vs. data platform leaders (budget holders) creates a two-step sale that slows deals
- !Multi-warehouse support (Snowflake + BigQuery + Databricks) triples integration surface area — scope creep risk
- !Enterprise MDM vendors (Reltio, Informatica) could build warehouse-native connectors and close the gap from above
Open-source Python library for probabilistic record linkage and entity resolution. Built by the UK Ministry of Justice. Runs on Spark, DuckDB, or Athena. Identifies matching records using the Fellegi-Sunter probabilistic model.
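To make the Fellegi-Sunter idea concrete, here is a minimal sketch of how per-field agreement evidence can be combined into a match weight and then a match probability. It uses only the standard library and is not this library's API. The field names, the m/u values, and the prior are illustrative assumptions, not figures from any real dataset.

```python
import math

# Hypothetical m/u probabilities per comparison field (illustrative values only).
# m = P(field agrees | the two records are a true match)
# u = P(field agrees | the two records are not a match)
FIELD_PARAMS = {
    "email": {"m": 0.95, "u": 0.001},
    "last_name": {"m": 0.90, "u": 0.01},
    "postcode": {"m": 0.80, "u": 0.05},
}


def match_weight(agreements: dict) -> float:
    """Sum the log2 Bayes factors of each field comparison."""
    total = 0.0
    for field_name, agrees in agreements.items():
        m = FIELD_PARAMS[field_name]["m"]
        u = FIELD_PARAMS[field_name]["u"]
        # Agreement adds log2(m/u); disagreement adds log2((1-m)/(1-u)).
        total += math.log2(m / u) if agrees else math.log2((1 - m) / (1 - u))
    return total


def match_probability(weight: float, prior: float = 0.001) -> float:
    """Convert a match weight into a posterior probability, given a prior match rate."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** weight
    return posterior_odds / (1 + posterior_odds)


all_agree = match_weight({"email": True, "last_name": True, "postcode": True})
all_differ = match_weight({"email": False, "last_name": False, "postcode": False})
```

In this sketch a pair that agrees on every field gets a large positive weight and a pair that disagrees on every field gets a negative one. In practice the m and u values are estimated from data rather than set by hand.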
ML-powered data mastering platform combining machine learning with human-in-the-loop curation for entity resolution, schema mapping, and data classification. Targets large enterprise data unification.
Cloud-native MDM SaaS platform providing golden record management, matching, merging, and graph-based relationship visualization. Strong in healthcare, financial services, life sciences.
Embeddable entity resolution API/engine using proprietary AI. Self-hosted or cloud-deployed. Focuses purely on entity resolution.
Open-source ML-based entity resolution built on Apache Spark. Uses active learning — you label a few examples, it trains a model and scales matching across large datasets.
Start with ONE warehouse (Snowflake — largest data engineering community). Build a managed entity resolution pipeline with: (1) configurable field-level merge strategies via YAML/code, (2) basic unmerge with child cascade, (3) automatic merge lineage/audit log, (4) recency-weighted field resolution, (5) unified ID graph queryable via SQL. Skip the UI initially — expose everything via SQL functions + a CLI/API. Use Splink's matching under the hood for the probabilistic linkage layer and focus your differentiation on the MDM workflow layer (merge/unmerge/lineage/survivorship). Deploy as a Snowflake Native App or dbt package + managed service.
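As a rough illustration of items (1), (3), and (4) above, the sketch below resolves each field with a configurable strategy, scores candidate values by source trust decayed over time, and records a lineage entry for every surviving value. Everything here is an assumption made for illustration: the strategy names, the `SOURCE_TRUST` table, the 180-day half-life, and the record layout are hypothetical, and a real implementation would run inside the warehouse rather than in plain Python.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FieldValue:
    value: str
    source: str           # hypothetical source system name
    updated_at: datetime


# Hypothetical config, as it might be loaded from a YAML file.
STRATEGIES = {"email": "recency_weighted", "name": "longest"}
SOURCE_TRUST = {"crm": 1.0, "web_form": 0.6}   # assumed trust per source
HALF_LIFE_DAYS = 180                            # assumed decay half-life


def recency_score(candidate: FieldValue, as_of: datetime) -> float:
    """Source trust, halved for every HALF_LIFE_DAYS of age."""
    age_days = (as_of - candidate.updated_at).days
    trust = SOURCE_TRUST.get(candidate.source, 0.5)
    return trust * 0.5 ** (age_days / HALF_LIFE_DAYS)


def resolve_field(field_name: str, candidates: list, as_of: datetime) -> FieldValue:
    """Pick the surviving value for one field according to its strategy."""
    strategy = STRATEGIES.get(field_name, "recency_weighted")
    if strategy == "longest":
        return max(candidates, key=lambda c: len(c.value))
    return max(candidates, key=lambda c: recency_score(c, as_of))


def merge_records(records: list, as_of: datetime):
    """Merge dicts of field -> FieldValue into a golden record plus a lineage log."""
    golden, lineage = {}, []
    for field_name in sorted({f for r in records for f in r}):
        candidates = [r[field_name] for r in records if field_name in r]
        winner = resolve_field(field_name, candidates, as_of)
        golden[field_name] = winner.value
        lineage.append({"field": field_name, "value": winner.value,
                        "source": winner.source, "candidates": len(candidates)})
    return golden, lineage


crm = {"email": FieldValue("old@example.com", "crm", datetime(2023, 1, 1)),
       "name": FieldValue("Ann Lee", "crm", datetime(2023, 1, 1))}
web = {"email": FieldValue("new@example.com", "web_form", datetime(2024, 1, 1)),
       "name": FieldValue("Ann-Marie Lee", "web_form", datetime(2024, 1, 1))}
golden, lineage = merge_records([crm, web], as_of=datetime(2024, 1, 2))
```

Here the web form's email wins despite its lower trust, because the CRM value is about a year old and has decayed through two half-lives. The lineage list is the raw material for an audit log: it records which source supplied each surviving value.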
Free: open-source dbt package or Snowflake Native App for basic entity matching (captures Splink users). Paid ($500-$2K/month): managed merge/unmerge workflows, lineage tracking, recency weighting, conflict resolution UI. Enterprise ($5K-$20K/month): multi-warehouse support, SSO, audit compliance (SOC2/HIPAA), dedicated support, custom merge strategies. Scale: consumption-based pricing on records resolved per month.
3-5 months to MVP with first design partner paying. 6-9 months to repeatable revenue with 5-10 paying customers. The key is finding 2-3 design partners from the Reddit thread commenters or similar communities who will co-develop the MVP in exchange for discounted pricing.
- “Welcome to the problem space” (implying it's a known, recurring headache)
- “do you throw away fields of data, or do you consider both sets, to enrich your master”
- “Do you keep a backtracking trace, to be able to unmerge. Unmerge of children too”
- “Do you trust more recent data more than older data”
- “What unique ID do you keep, or do you make up a [new one]”
- “If they have child data, do you keep the union of all children”
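The questions about backtracking, child data, and surviving IDs can be sketched in a few lines. The example below is one possible answer among several and not a recommendation from the thread: it keeps the survivor's ID instead of minting a new one, takes the union of children on merge, and stores a snapshot per merge so the last merge can be undone with children cascaded back. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    entity_id: str
    children: set = field(default_factory=set)   # e.g. order IDs


class IdentityStore:
    def __init__(self):
        self.entities = {}
        self.merge_log = []   # backtracking trace: one snapshot per merge

    def add(self, entity_id, children=()):
        self.entities[entity_id] = Entity(entity_id, set(children))

    def merge(self, survivor_id, loser_id):
        """Fold the loser into the survivor, keeping the union of children."""
        survivor = self.entities[survivor_id]
        loser = self.entities.pop(loser_id)
        self.merge_log.append({
            "survivor": survivor_id,
            "loser": loser_id,
            "survivor_children_before": set(survivor.children),
            "loser_children": set(loser.children),
        })
        survivor.children |= loser.children

    def unmerge_last(self):
        """Undo the most recent merge, cascading children back to their owners."""
        entry = self.merge_log.pop()
        survivor = self.entities[entry["survivor"]]
        # Assumption: children attached after the merge stay with the survivor.
        added_later = (survivor.children
                       - entry["survivor_children_before"]
                       - entry["loser_children"])
        survivor.children = entry["survivor_children_before"] | added_later
        self.add(entry["loser"], entry["loser_children"])


store = IdentityStore()
store.add("A", {"o1"})
store.add("B", {"o2", "o3"})
store.merge("A", "B")
merged_children = set(store.entities["A"].children)
store.entities["A"].children.add("o4")   # a child that arrives after the merge
store.unmerge_last()
```

The hard policy question is what to do with children that arrive after the merge. This sketch leaves them with the survivor, but a real system would likely need a rule or a manual review step for them.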