NVIDIA has released Nemotron OCR v2, a multilingual OCR model trained on 12M synthetic images that reaches 34.7 pages/second on a single A100.
NVIDIA's team built Nemotron OCR v2 on the FOTS architecture, with a shared convolutional backbone (RegNetX-8GF) that unifies text detection, recognition, and relational reasoning in a single pass. The model was trained on 12 million synthetic images with precise bounding boxes, transcriptions, and reading-order labels, bypassing the quality/scale tradeoffs of manual annotation or web-scraped PDFs. The multilingual variant uses a 6-layer Transformer recognizer with a 14,244-token vocabulary; throughput reaches 34.7 pages/second on a single A100, and the English-only variant runs faster thanks to a smaller 3-layer recognizer with an 855-token vocabulary.
The FOTS-based architecture runs the expensive convolution once and reuses the feature maps across detection, recognition, and layout reasoning, so you're not paying triple compute for three tasks. At 34.7 pages/sec on an A100, this crushes most production OCR pipelines that chain separate models. The multilingual recognizer's 14,244-token vocabulary is the main throughput cost, so if you're English-only, the smaller variant is faster still.
Benchmark Nemotron OCR v2 against your current OCR stack (Tesseract, PaddleOCR, or cloud APIs) on a 100-page sample document this week — measure pages/sec and character error rate to decide if you can drop your cloud OCR bill entirely.
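A minimal harness for that comparison might look like the following. The `benchmark` helper and the identity `ocr_fn` in the usage line are hypothetical placeholders; swap in calls to your actual OCR stacks. CER here is standard Levenshtein edit distance over characters, normalized by reference length.

```python
import time

def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance / reference length."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(
                prev[j] + 1,                              # deletion
                cur[j - 1] + 1,                           # insertion
                prev[j - 1] + (ref[i - 1] != hyp[j - 1]), # substitution
            )
        prev = cur
    return prev[n] / max(m, 1)

def benchmark(ocr_fn, pages, refs):
    """Time ocr_fn over pages; return (pages/sec, mean CER vs refs)."""
    t0 = time.perf_counter()
    hyps = [ocr_fn(p) for p in pages]
    elapsed = time.perf_counter() - t0
    pages_per_sec = len(pages) / elapsed
    mean_cer = sum(cer(r, h) for r, h in zip(refs, hyps)) / len(refs)
    return pages_per_sec, mean_cer

# Toy usage: an identity "OCR" on already-transcribed pages.
pps, mean_cer = benchmark(lambda page: page,
                          ["hello world"], ["hello world"])
print(round(mean_cer, 3))  # 0.0
```

Run the same 100-page sample through each candidate stack and compare the two numbers; a stack that wins on pages/sec but loses badly on CER is rarely a net savings once correction costs are counted.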
Install the model with pip install nemo_toolkit, or follow the setup instructions in NVIDIA's Nemotron OCR v2 repo on GitHub.