NVIDIA has released Nemotron OCR v2, a multilingual OCR model trained on 12M synthetic images that reaches 34.7 pages/second on a single A100.
NVIDIA's team built Nemotron OCR v2 on the FOTS architecture, with a shared convolutional backbone (RegNetX-8GF) that unifies text detection, recognition, and relational reasoning in a single pass. The model was trained on 12 million synthetic images with precise bounding boxes, transcriptions, and reading-order labels, bypassing the quality/scale tradeoffs of manual annotation or web-scraped PDFs. The multilingual variant uses a 6-layer Transformer recognizer with a 14,244-token vocabulary; throughput reaches 34.7 pages/second on a single A100, and the English-only variant runs faster thanks to a smaller 3-layer recognizer with an 855-token vocabulary.
The FOTS-based architecture runs the expensive convolution once and reuses the feature maps across detection, recognition, and layout reasoning, so you're not paying triple compute for three tasks. At 34.7 pages/sec on an A100, this crushes most production OCR pipelines that chain separate models. The multilingual recognizer's 14,244-token vocabulary is the main throughput cost, so if you're English-only, the smaller variant is faster still.
Benchmark Nemotron OCR v2 against your current OCR stack (Tesseract, PaddleOCR, or cloud APIs) on a 100-page sample document this week — measure pages/sec and character error rate to decide if you can drop your cloud OCR bill entirely.
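A minimal harness for that comparison might look like the following. The `benchmark` helper and the identity `ocr_fn` in the usage line are hypothetical placeholders; swap in calls to your actual OCR stacks. CER here is standard Levenshtein edit distance over characters, normalized by reference length.

```python
import time

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance / reference length."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(
                prev[j] + 1,                              # deletion
                cur[j - 1] + 1,                           # insertion
                prev[j - 1] + (ref[i - 1] != hyp[j - 1]), # substitution
            )
        prev = cur
    return prev[n] / max(m, 1)

def benchmark(ocr_fn, pages, refs):
    """Time ocr_fn over pages; return (pages/sec, mean CER vs refs)."""
    t0 = time.perf_counter()
    hyps = [ocr_fn(p) for p in pages]
    elapsed = time.perf_counter() - t0
    pages_per_sec = len(pages) / elapsed
    mean_cer = sum(cer(r, h) for r, h in zip(refs, hyps)) / len(refs)
    return pages_per_sec, mean_cer

# Toy usage: an identity "OCR" on already-transcribed pages.
pps, mean_cer = benchmark(lambda page: page,
                          ["hello world"], ["hello world"])
print(round(mean_cer, 3))  # 0.0
```

Run the same 100-page sample through each candidate stack and compare the two numbers; a stack that wins on pages/sec but loses badly on CER is rarely a net savings once correction costs are counted.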
Install the model with pip install nemo_toolkit, or follow the setup instructions in NVIDIA's Nemotron OCR v2 repo on GitHub.