7.7 · high · GO

DocScan Pipeline API

Drop-in OCR API that auto-preprocesses messy document images before extraction using best-in-class small VLMs.

DevTools · Developers building document processing workflows, fintech companies doing KY...
The Gap

Developers using local OCR models waste time building image preprocessing pipelines (rotation correction, quality enhancement) and switching between models to handle edge cases like MRZ zones, angled photos, and low-quality scans.

Solution

A self-hostable or cloud OCR API that chains image preprocessing (auto-rotation, deskewing, enhancement) with the best small VLM for the task, returning structured JSON output. Automatically detects document type and routes to the right model/config.
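The "detect document type, then route to the right model/config" step could look roughly like the sketch below. Everything named here is hypothetical: the route table, the model identifiers, the preprocessing step names, and the 0.8 confidence cutoff are illustrative assumptions, not part of any existing product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str          # hypothetical model identifier
    preprocess: tuple   # ordered names of preprocessing steps to run


# Hypothetical routing table: document type -> model and preprocessing config.
ROUTES = {
    "passport": Route("small-vlm-id", ("auto_rotate", "deskew", "enhance", "crop_mrz")),
    "invoice": Route("small-vlm-table", ("auto_rotate", "deskew")),
}

# Fallback used when the type is unknown or the detector is unsure.
DEFAULT_ROUTE = Route("small-vlm-general", ("auto_rotate", "deskew", "enhance"))


def pick_route(doc_type: str, confidence: float, min_confidence: float = 0.8) -> Route:
    """Return the route for a detected document type.

    Falls back to the general route when the detector's confidence is
    below the cutoff or the type has no dedicated entry.
    """
    if confidence < min_confidence:
        return DEFAULT_ROUTE
    return ROUTES.get(doc_type, DEFAULT_ROUTE)
```

Falling back to a general route instead of failing keeps a misclassified document from producing no output at all, which matters because routing reliability is the main technical risk.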

Revenue Model

Freemium API: free tier at 500 pages/month, paid tiers based on volume. Self-hosted license for enterprises.

Feasibility Scores
Pain Intensity: 8/10

The Reddit thread and pain signals are textbook developer frustration. Preprocessing is genuinely tedious - rotation correction, deskew, quality enhancement are all well-known time sinks. The MRZ edge case alone can burn a week. Every developer who has tried to build document processing has hit this wall. The pain is real, recurring, and currently solved by duct-taping 3-4 libraries together.

Market Size: 7/10

Document processing TAM is enormous ($10B+), but this targets a specific niche: developers who want better-than-Tesseract but simpler-than-Google-Document-AI. Estimated serviceable market is $200M-500M covering SMB fintech KYC, SaaS document workflows, and developer tools. The self-hosted angle opens enterprise deals. Not a tiny market, but you are competing for a slice of a market with very large incumbents.

Willingness to Pay: 7/10

Developers already pay for OCR APIs (Google, AWS, Azure). Companies doing KYC pay $0.05-$0.50 per verification. The key insight is that teams currently paying $0.10/page to Google would pay $0.03-0.05/page for equivalent quality with a self-host option. Enterprise self-hosted licenses ($500-2000/month) are viable for data-sensitive industries. The 109 upvotes on an OCR post suggest an engaged audience, but there is always a gap between Reddit enthusiasm and paying customers.

Technical Feasibility: 8/10

Highly feasible for a solo dev with an ML/CV background. The core components exist: OpenCV for preprocessing, Qwen2.5-VL/Florence for extraction, FastAPI for the API layer. The innovation is in the orchestration and document-type routing, not in building models from scratch. An MVP in 4-6 weeks is realistic. The preprocessing pipeline (deskew, rotation, enhancement) is well-understood computer vision. The main risk is making the auto-routing reliable across diverse document types.
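To make the preprocessing chain concrete, here is a dependency-free sketch of two of its steps on a grayscale image stored as a list of pixel rows. It is only an illustration: a real build would presumably use OpenCV as described above, the function names are made up, and the rotation amount is passed in by the caller rather than detected automatically.

```python
def rotate_quarter_turns(gray, turns):
    """Rotate a grayscale image clockwise by turns * 90 degrees."""
    for _ in range(turns % 4):
        # Reverse the row order, then transpose: one clockwise quarter turn.
        gray = [list(col) for col in zip(*gray[::-1])]
    return gray


def stretch_contrast(gray):
    """Linearly rescale pixels so the darkest maps to 0 and the brightest to 255."""
    flat = [p for row in gray for p in row]
    if not flat:
        return []
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [row[:] for row in gray]  # uniform image: nothing to stretch
    # Integer arithmetic (floor division) keeps results exact and in 0..255.
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in gray]


def preprocess(gray, turns=0):
    """Chain the steps: fix orientation first, then enhance contrast."""
    return stretch_contrast(rotate_quarter_turns(gray, turns))
```

For example, `preprocess([[100, 120], [160, 200]], turns=1)` rotates the 2x2 image clockwise and then spreads its 100-200 value range across the full 0-255 range.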

Competition Gap: 8/10

This is the strongest dimension. No existing product combines (1) a smart preprocessing pipeline, (2) small-VLM-powered extraction, (3) automatic document-type detection and routing, (4) structured JSON output, and (5) self-hosting. Cloud giants don't offer self-hosting. Open-source tools don't offer the orchestration layer. The 'intelligent preprocessing + VLM routing' combo is genuinely underserved. The Reddit comments confirm developers are manually stitching these pieces together.

Recurring Potential: 9/10

Document processing is inherently ongoing - companies process documents continuously, not once. KYC verification is per-customer. Invoice processing is monthly. API usage is naturally metered and recurring. Self-hosted licenses renew annually. This is one of the most naturally recurring use cases in developer tools. Once integrated into a pipeline, switching costs are high.

Strengths
  • Clear, validated pain point with direct user quotes from an engaged community (109 upvotes, 43 comments)
  • Strong competition gap - no one combines preprocessing + VLM routing + self-hosting in a drop-in API
  • Excellent recurring revenue dynamics - document processing is continuous, not one-time
  • Timing is perfect - small VLMs (Qwen2.5-VL 3B/7B) just crossed the quality threshold to make this viable without GPU clusters
  • Self-hosted angle is a massive differentiator for regulated industries (fintech, healthcare, government) where data cannot leave premises
  • Technical moat grows with each document type and preprocessing rule added - hard for competitors to replicate the routing intelligence
Risks
  • Cloud giants (Google, AWS, Azure) could add better preprocessing and VLM-based extraction to their existing products, compressing the gap
  • Small VLM quality may not match cloud API quality for edge cases, leading to churn from developers who expected parity
  • Developer tools market is notoriously hard to monetize - many will use the free tier or self-host and never pay
  • Document type auto-detection and routing is the hardest technical challenge - if it fails on edge cases, the whole value prop collapses
  • Supporting the long tail of document types (global IDs, varied invoice formats, handwritten forms) could become an endless engineering treadmill
Competition
Google Document AI

Cloud-based document processing with specialized processors for invoices, receipts, IDs, and custom documents. Handles preprocessing internally with Google's infrastructure.

Pricing: Pay-per-use: ~$0.01-$0.10/page depending on processor type. 1000 pages/month free tier.
Gap: Not self-hostable. Expensive at scale. Black box - no control over preprocessing pipeline. Vendor lock-in. Overkill for teams that just need clean OCR with good preprocessing. No local/on-prem option for data-sensitive industries.
AWS Textract

Amazon's document text extraction service with table, form, and query-based extraction. Includes some built-in image correction.

Pricing: $1.50/1000 pages for basic OCR, up to $15/1000 pages for specialized extraction like lending documents.
Gap: No self-hosted option. Preprocessing is limited and opaque - developers still need to handle rotation/deskew for edge cases. Poor with non-standard documents. MRZ and passport handling requires separate services. No VLM-based extraction.
docTR (mindee/doctr)

Open-source OCR library with built-in preprocessing.

Pricing: Free and open source. Mindee's hosted API starts at $0.05/page.
Gap: No VLM integration - purely classical OCR pipeline. No automatic document type detection and routing. No structured JSON output by document type. Requires ML expertise to tune. No MRZ specialization. Not a drop-in API - you still build the orchestration layer yourself.
PaddleOCR

Open-source OCR toolkit from Baidu with text detection, recognition, and some preprocessing. Supports 80+ languages.

Pricing: Free and open source.
Gap: Not a turnkey API - significant integration work needed. No document-type-aware routing. Preprocessing pipeline is manual. No VLM fallback for complex documents. Configuration is complex. No structured JSON output mapped to document fields. Community comments specifically cite it as 'not a simple model' to use.
Unstructured.io

Open-source library and hosted API for extracting and transforming unstructured data from documents, PDFs, images, and more. Focuses on RAG pipeline preprocessing.

Pricing: Open-source core. Hosted API: free tier, then usage-based starting ~$0.01/page. Enterprise pricing available.
Gap: Optimized for text extraction for LLM ingestion, NOT structured document field extraction. Weak on image preprocessing for messy photos. No VLM-powered OCR. Not designed for KYC/identity document workflows. No MRZ handling. Image quality enhancement is not a focus.
MVP Suggestion

FastAPI service with 3 endpoints: /ocr (general), /ocr/id (identity documents), /ocr/invoice. Preprocessing pipeline: auto-rotation via OpenCV, deskew, contrast enhancement. Use Qwen2.5-VL-3B as the default model with MRZ-specific handling for passports/IDs. Return structured JSON with confidence scores. Docker image for self-hosting. Ship with a simple web playground for testing. Focus the MVP on identity documents (passport, driver's license, national ID) since KYC is the highest willingness-to-pay use case. Skip auto document-type detection in the MVP - let the developer specify the endpoint.
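One inexpensive piece of the MRZ-specific handling is validating check digits after extraction. MRZ fields such as document number, birth date, and expiry date each carry a check digit computed with repeating weights 7, 3, 1, as defined in ICAO Doc 9303. The sketch below shows that calculation; the function names are illustrative, and using the result to adjust a confidence score is an assumption about how the API might apply it.

```python
def mrz_char_value(ch: str) -> int:
    """Map one MRZ character to its numeric value for the check-digit sum."""
    if "0" <= ch <= "9":
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10  # A=10 ... Z=35
    if ch == "<":
        return 0  # filler character
    raise ValueError(f"invalid MRZ character: {ch!r}")


def mrz_check_digit(field: str) -> int:
    """Compute the check digit: weighted sum with weights 7, 3, 1, modulo 10."""
    weights = (7, 3, 1)
    total = sum(mrz_char_value(c) * weights[i % 3] for i, c in enumerate(field))
    return total % 10


def mrz_field_is_valid(field: str, check_char: str) -> bool:
    """True when the extracted check character matches the computed digit."""
    return check_char == str(mrz_check_digit(field))
```

A failed check is a strong hint that the model misread a character (for example `0` versus `O`), so the service could lower the field's confidence score or retry extraction on an enhanced crop of the MRZ.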

Monetization Path

Free tier (500 pages/month, community support) -> Pro ($49/month for 10K pages, priority models, webhook callbacks) -> Business ($199/month for 100K pages, custom document types, SLA) -> Enterprise (self-hosted license $999-2999/month, on-prem deployment support, custom model fine-tuning). Early revenue from Pro tier targeting indie SaaS builders doing KYC. Scale revenue from Enterprise self-hosted licenses to fintechs and banks.
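As a quick check on the tier arithmetic, the sketch below picks the smallest listed tier that covers a monthly page volume and computes the effective per-page price at full utilization. The tier names, page limits, and prices come from the path above; the helper names are made up, and overage handling is not specified in the plan, so volumes above 100K pages simply return `None`.

```python
# (name, pages included per month, monthly price in USD), from the tiers above.
TIERS = [
    ("Free", 500, 0),
    ("Pro", 10_000, 49),
    ("Business", 100_000, 199),
]


def smallest_covering_tier(pages_per_month: int):
    """Return (name, price) of the first tier whose allowance covers the volume.

    Returns None above the largest listed allowance, since the plan does not
    state pricing for that range.
    """
    for name, included, price in TIERS:
        if pages_per_month <= included:
            return name, price
    return None


def price_per_page(price: float, included_pages: int) -> float:
    """Effective USD per page if the whole monthly allowance is used."""
    return price / included_pages
```

At full utilization, Pro works out to $0.0049/page and Business to about $0.002/page, both well under the $0.03-0.05/page range discussed under Willingness to Pay.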

Time to Revenue

4-6 weeks to MVP, 8-10 weeks to first paying customer. The path: in Weeks 1-2, build the preprocessing pipeline and API skeleton; in Weeks 3-4, integrate the VLM and build document-type handlers; in Weeks 5-6, add billing/auth and deploy. First revenue will likely come from a Hacker News/Reddit launch targeting the same community where the pain signals originated. Identity document processing for KYC is the fastest path to revenue since those buyers have budget and urgency.

What people are saying
  • needed some image pre-processing to rotate images correctly for good results
  • MRZ at the bottom of Passport or ID documents throws it in a loop
  • from clear scans to potato phone pics
  • Paddle but that's not a simple model like qwen