Why we rebuilt our document OCR pipeline from scratch

Three failed iterations, one rewrite, and a 3× accuracy gain on low-quality passport scans. A retrospective on engineering the verification core.

Platform Engineering Mar 19, 2026

Eighteen months ago, our document OCR pipeline looked reasonable on the dashboard. 97.2% character accuracy across our evaluation set. A P95 latency under 800ms. We shipped it, and for six months it ran without complaint from the metrics side.

Then the support tickets started telling a different story. Passport scans rejected that looked fine. Utility bills that had clearly been manipulated passing authenticity checks. Complaints that field extraction was confusing middle name with given name on Ethiopian passports. The evaluation set was clean. The production traffic was not.

This is the story of why that divergence happened, why the first three fixes failed, and why we eventually rebuilt the entire pipeline.

The evaluation set was a lie

Our original evaluation corpus was ~8,000 documents collected over the year the platform was being built. The collection process: the compliance team asked for submissions from "real users", engineers opened the samples, and anything obviously broken was discarded and re-requested.

In hindsight, the discard step was the poison. The "obviously broken" category (pictures taken in poor light, phone cameras with oil on the lens, documents photographed through glass, scans with aggressive compression) is not rare in production. It is modal. By discarding those samples from the eval set, we had trained and benchmarked against a world that did not exist.

Our real production distribution, once we started logging it properly:

  • 31% had motion blur above a perceivable threshold
  • 24% were taken in lighting that crushed the document contrast below 0.25
  • 18% had rotation greater than 15° from horizontal
  • 12% were partially cropped at the edges
  • 9% were photos of photos (screen reflections visible)
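
Logging a distribution like this requires a cheap quality measurement for every incoming image. The post does not say how blur or contrast were measured, so the following is only a minimal sketch under stated assumptions: blur is approximated by the variance of a 4-neighbour Laplacian, contrast by Michelson contrast, and the names `laplacian_variance`, `michelson_contrast`, `quality_flags` and the `blur_floor` value are hypothetical. Only the 0.25 contrast figure comes from the list above.

```python
from typing import Dict, List

# Assumption: a grayscale image as rows of pixel intensities in [0.0, 1.0],
# at least 3x3 so that interior pixels exist.
Gray = List[List[float]]


def laplacian_variance(img: Gray) -> float:
    """Variance of the 4-neighbour Laplacian; low values suggest a blurry image."""
    h, w = len(img), len(img[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (img[y - 1][x] + img[y + 1][x] + img[y][x - 1] + img[y][x + 1]
                   - 4.0 * img[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def michelson_contrast(img: Gray) -> float:
    """(max - min) / (max + min) over the whole image, in [0.0, 1.0]."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    return 0.0 if hi + lo == 0 else (hi - lo) / (hi + lo)


def quality_flags(img: Gray, blur_floor: float = 0.01,
                  contrast_floor: float = 0.25) -> Dict[str, bool]:
    """Tag an image so production traffic can be counted per degradation type."""
    return {
        "blurry": laplacian_variance(img) < blur_floor,
        "low_contrast": michelson_contrast(img) < contrast_floor,
    }
```

Counting these flags over a day of traffic gives the kind of breakdown listed above. A production version would likely work on real image arrays and use percentiles rather than the raw min and max, which a single stray pixel can skew.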

97% accuracy on the eval set became 71% accuracy on the actual distribution. The right lesson was not "train a bigger model". It was "the eval set is wrong."

Iteration one: throw more data at it

The obvious first move was to build a real eval set. We relabelled 12,000 production samples, including the broken ones. We retrained. Accuracy on the new eval set: 84%.

That felt like progress until we looked at the failure modes. The model was confidently wrong on specific patterns: Ethiopian Amharic characters getting misrecognised as similar-looking Arabic, British passports with their security-feature hologram overlapping the surname field, Nigerian utility bills where the company logo looked like a "2" in certain lighting.

More data did not fix these. They were systematic errors, not statistical ones.

Iteration two: specialised models per document class

The theory: instead of one OCR model that handles everything, route documents to specialist models: an Ethiopian passport model, a UK passport model, a Nigerian utility bill model, and so on. Each specialist is trained on its narrow domain and should crush it.

The operational problem emerged within weeks: routing is itself a verification problem. To send a document to the "Ethiopian passport" model, you first have to know it is an Ethiopian passport, which requires an OCR pass to read the place of issue, using a model that doesn't yet know the document is Ethiopian. Classic chicken-and-egg.

Also: we had 180+ document classes in active production. Maintaining and retraining 180+ specialist models was a full-time job for a team we did not have.

Iteration three: multi-modal with large vision models

By late 2024, vision-language models had become competent enough that we tried replacing the classical OCR pipeline with a large vision model that could look at a document image and just answer questions about it: "What is the surname?" "When does this document expire?"

Accuracy on our messy production distribution: 91%. A meaningful jump. But:

  • P95 latency rose to 2.1 seconds, four times our budget
  • Per-verification cost rose 9× at 2024 GPU prices
  • The model occasionally hallucinated confidently, making up a passport number that was not on the document

Hallucinations in a compliance pipeline are disqualifying. You cannot tell a regulator that the verification engine sometimes makes up data. That killed the approach.

The rewrite

The fourth attempt was not a model change. It was an architectural one. We rebuilt the pipeline around a pre-processing stage that did the classical computer vision work (deskew, deblur, light normalisation, crop detection) as a dedicated stage before any model saw the image. Then a small, fast OCR model handled the text. Then a verification stage cross-checked extracted fields against the document's known structure.
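
To make "cross-checked extracted fields against the document's known structure" concrete: the post does not list which structural checks the verification stage runs, but one publicly documented example is the check digit in a passport's machine-readable zone, defined in ICAO Doc 9303. The sketch below assumes that kind of check; the function names are hypothetical and the sample values are the widely published ICAO specimen, not production data.

```python
DIGITS = "0123456789"


def mrz_char_value(ch: str) -> int:
    """Map an MRZ character to its numeric value: 0-9, A-Z as 10-35, '<' as 0."""
    if ch in DIGITS and len(ch) == 1:
        return int(ch)
    if len(ch) == 1 and "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"invalid MRZ character: {ch!r}")


def mrz_check_digit(field: str) -> int:
    """Weighted sum with repeating weights 7, 3, 1, taken modulo 10."""
    weights = (7, 3, 1)
    return sum(mrz_char_value(c) * weights[i % 3] for i, c in enumerate(field)) % 10


def field_passes(field: str, check: str) -> bool:
    """True when the printed check digit matches the one computed from the field."""
    return len(check) == 1 and check in DIGITS and mrz_check_digit(field) == int(check)


# ICAO specimen document number "L898902C3" carries check digit 6.
print(field_passes("L898902C3", "6"))   # True
# A typical OCR confusion (letter O read in place of digit 0) breaks the checksum.
print(field_passes("L8989O2C3", "6"))   # False
```

A check like this cannot say what the correct value is, only that the extracted one is inconsistent, which makes it a natural trigger for lowering a field's confidence rather than for correcting it.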

Three stages, each doing one job well, each with explicit confidence scores that propagated forward. If pre-processing could not produce a clean image, we asked the user to retake before wasting model cycles. If the OCR stage was uncertain about a field, we flagged it for human review rather than guessing.
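
The control flow described above can be sketched in a few lines. This is an illustration under assumptions, not the production code: the stage callables, the `Decision` type, both thresholds, and the choice to cap each field's confidence by the image-quality score are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Tuple


class Outcome(Enum):
    ACCEPT = "accept"
    HUMAN_REVIEW = "human_review"
    RETAKE = "retake"


@dataclass
class Decision:
    outcome: Outcome
    fields: Dict[str, str] = field(default_factory=dict)
    flagged: List[str] = field(default_factory=list)


# Stages are injected as callables so the sketch stays model-agnostic.
Preprocess = Callable[[bytes], Tuple[bytes, float]]        # cleaned image, quality score
Extract = Callable[[bytes], Dict[str, Tuple[str, float]]]  # field -> (text, confidence)
Verify = Callable[[str, str], bool]                        # (field name, text) -> valid?


def run_pipeline(image: bytes, preprocess: Preprocess, extract: Extract,
                 verify: Verify, min_quality: float = 0.5,
                 min_field_conf: float = 0.8) -> Decision:
    # Stage 1: classical clean-up. Bail out before spending any model cycles.
    cleaned, quality = preprocess(image)
    if quality < min_quality:
        return Decision(Outcome.RETAKE)

    # Stage 2: OCR. Every field arrives with its own confidence.
    fields: Dict[str, str] = {}
    flagged: List[str] = []
    for name, (text, conf) in extract(cleaned).items():
        fields[name] = text
        # Stage 3: structural verification. The image-quality score is carried
        # forward by capping how much any single field can be trusted.
        if min(conf, quality) < min_field_conf or not verify(name, text):
            flagged.append(name)

    outcome = Outcome.HUMAN_REVIEW if flagged else Outcome.ACCEPT
    return Decision(outcome, fields, flagged)
```

The point of the shape is that only values returned by the extraction stage can ever reach the output: the later stage can flag or reject a field, but it has no code path for producing one.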

Results:

  • Production accuracy: 94.8% across the messy distribution
  • Human-review rate: down to 3.1% from 12%
  • P95 latency: 620ms
  • Zero hallucinations, because the architecture structurally cannot emit a field that wasn't extracted from the image

The instinct for any model performance problem is "bigger model, more data." Most of the time the real answer is "do less, do it better, and stop pretending the inputs are clean."

Lessons that generalised

Three things we have carried forward into every CredFlare model since:

  1. The eval set is the product. If it doesn't mirror production, every accuracy number it produces is theatre. Re-sample the eval set from live traffic quarterly.
  2. Confidence scores beat binary answers. A pipeline that can say "I am 0.62 confident the surname is BELLO" is a pipeline you can build a review queue around. A pipeline that only emits a final answer is a pipeline you cannot debug.
  3. Separate the work. Pre-processing, extraction, verification are three different jobs with three different failure modes. Pushing them all into one model hides where the errors actually come from.
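
For the first lesson, re-sampling from live traffic can be as simple as a stratified draw that preserves the production mix, with no discard step. A minimal sketch, assuming each logged sample already carries a stratum label (for example a document class or a degradation tag); the function name and input shape are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def resample_eval_set(live: Sequence[Tuple[str, Hashable]], size: int,
                      seed: int = 0) -> List[str]:
    """Draw sample ids so each stratum's share matches its share of live traffic.

    `live` is a sequence of (sample_id, stratum) pairs. Quotas are rounded per
    stratum, so the result can differ from `size` by a few samples.
    """
    rng = random.Random(seed)  # seeded so a given quarter's draw is reproducible
    by_stratum: Dict[Hashable, List[str]] = defaultdict(list)
    for sample_id, stratum in live:
        by_stratum[stratum].append(sample_id)

    total = len(live)
    chosen: List[str] = []
    for ids in by_stratum.values():
        quota = round(size * len(ids) / total)
        # Nothing is filtered for looking "broken": every stratum keeps its quota.
        chosen.extend(rng.sample(ids, min(quota, len(ids))))
    return chosen
```

Very rare strata can round down to zero under this scheme, so a real version would probably add a per-stratum minimum and report accuracy per stratum as well as overall.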

The OCR pipeline is unglamorous infrastructure. But it is the load-bearing wall of the whole verification platform: every claim the product makes about document authenticity, expiry checking, or field extraction rests on it being right. Getting it to 94.8% on the actual distribution of production inputs was the single highest-leverage work the platform team did in 2025.
