Nvidia-Nemotron-OCR-v1
Identity
| Property | Value |
| --- | --- |
| ID | nvidia-nemotron-ocr-v1 |
| Parameters | ~52M recognizer core + detector/relational OCR components |
| HuggingFace | affectively-ai/nvidia-nemotron-ocr-v1 |
| Quantization | FP32 PyTorch checkpoints (~0.21 GB) |
| License | NVIDIA Open Model License (HF metadata: other) |
Axis 1: Architecture
| Property | Value |
| --- | --- |
| Family | Hybrid OCR pipeline (detector + recognizer + relational reasoning) |
| Primary artifacts | checkpoints/detector.pth, checkpoints/recognizer.pth, checkpoints/relational.pth |
| Charset | checkpoints/charset.txt |
| Input granularity | Page/image-level OCR with line-aware extraction |
| Output form | Plain text transcription |
| Precision | FP32 checkpoints |
Architecture assessment: This is a specialist OCR model family, not a general VLM. It is optimized for text extraction from document-like images and should be routed through an OCR-native runtime path rather than chat-first multimodal compatibility layers.
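Because the pipeline depends on several separate artifacts, a startup check that all of them are present can catch a partial download before the first request. The following is a minimal sketch, not the deployment's actual loader: the relative paths come from the table above, while `missing_artifacts` and the `model_root` argument are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Relative artifact paths as listed in the architecture table.
EXPECTED_ARTIFACTS = (
    "checkpoints/detector.pth",
    "checkpoints/recognizer.pth",
    "checkpoints/relational.pth",
    "checkpoints/charset.txt",
)


def missing_artifacts(model_root):
    """Return the expected artifact paths that are absent under model_root.

    `model_root` is a hypothetical local directory holding the downloaded
    model files; an empty list means every listed artifact was found.
    """
    root = Path(model_root)
    return [rel for rel in EXPECTED_ARTIFACTS if not (root / rel).is_file()]
```

A caller could refuse to start the OCR lane when the returned list is non-empty and log which files are missing.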
Axis 2: Runtime
| Runtime | Viable | Notes |
| --- | --- | --- |
| WASM (browser) | No | Checkpoints/runtime exceed practical browser constraints |
| ONNX/WebGPU | No | No maintained ONNX export in this deployment path |
| Native (device) | Conditional | Possible with local OCR runtime and dependencies |
| Edge Worker | No | OCR binary/runtime requirements exceed worker limits |
| Cloud Run (coordinator-only) | Yes | Primary runtime (ENTRY_POINT=nemotron-ocr) |
| Cloud GPU | Optional | Not required for current production OCR lane |
Primary runtime: Cloud Run coordinator-only native OCR lane, scale-to-zero, with no layer-node fan-out.
Axis 3: Modality
| Property | Value |
| --- | --- |
| Input | Image (URL, data URL, or base64 payload) |
| Output | Extracted text |
| Category | OCR / image-to-text |
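Since the input can arrive in three forms, a caller or endpoint typically needs to tell them apart before handing image bytes to the OCR runtime. The sketch below shows one way to do that with the Python standard library only. The function names are hypothetical, the actual endpoint's request schema is not described here, and fetching remote URLs is deliberately left to the caller.

```python
import base64
import binascii
from urllib.parse import urlparse


def classify_image_input(value: str) -> str:
    """Classify an image reference as 'url', 'data_url', or 'base64'."""
    text = value.strip()
    if text.startswith("data:"):
        return "data_url"
    if urlparse(text).scheme.lower() in ("http", "https"):
        return "url"
    # Assumption: anything else is treated as a raw base64 payload.
    return "base64"


def decode_inline_image(value: str) -> bytes:
    """Decode a data URL or raw base64 payload into image bytes."""
    kind = classify_image_input(value)
    if kind == "url":
        raise ValueError("remote URLs must be fetched by the caller")
    text = value.strip()
    if kind == "data_url":
        # Data URLs look like: data:<mediatype>;base64,<payload>
        header, sep, payload = text.partition(",")
        if not sep or not header.endswith(";base64"):
            raise ValueError("only base64-encoded data URLs are handled here")
        text = payload
    try:
        return base64.b64decode(text, validate=True)
    except binascii.Error as exc:
        raise ValueError("payload is not valid base64") from exc
```

Using `validate=True` makes malformed payloads fail early instead of being silently truncated by the decoder.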
Axis 4: Task Fitness
| Task | Fitness | Notes |
| --- | --- | --- |
| Printed document OCR | Very good | Primary workload |
| Screenshot text extraction | Very good | Works well on high-contrast captures |
| Mixed-layout pages (tables/forms) | Good | Reliable extraction, layout semantics remain limited |
| Handwriting OCR | Moderate | Accuracy depends heavily on handwriting style/quality |
| Vision-language reasoning | Poor | Not a chat-oriented VLM |
Role in the zoo: Dedicated OCR specialist for document transcription and ingestion. It should back OCR endpoints directly, not serve as a general multimodal reasoning model.
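The fitness ratings above can be turned into a simple routing gate. This is an illustrative sketch only: the task keys, the `accepts_task` helper, and the default threshold of "good" are assumptions made for the example, not part of the zoo's actual router.

```python
# Ordering of the fitness labels used in the task table.
FITNESS_RANK = {"poor": 0, "moderate": 1, "good": 2, "very good": 3}

# Ratings copied from the task fitness table; the keys are hypothetical
# identifiers chosen for this example.
TASK_FITNESS = {
    "printed_document_ocr": "very good",
    "screenshot_text_extraction": "very good",
    "mixed_layout_pages": "good",
    "handwriting_ocr": "moderate",
    "vision_language_reasoning": "poor",
}


def accepts_task(task: str, minimum: str = "good") -> bool:
    """Return True if the task's rating meets the minimum fitness level.

    Unknown tasks are rejected so they fall through to another model.
    """
    rating = TASK_FITNESS.get(task)
    if rating is None:
        return False
    return FITNESS_RANK[rating] >= FITNESS_RANK[minimum]
```

With the default threshold, handwriting requests would be declined. Lowering `minimum` to `"moderate"` admits them while still keeping vision-language reasoning off this model.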
Axis 5: Operational Cost
| Property | Value |
| --- | --- |
| Checkpoint footprint | ~0.21 GB |
| Cloud Run topology | 1 coordinator, 0 layer nodes |
| Cloud Run resources | 2 vCPU, 4 GiB memory |
| Request timeout | 300s |
| Idle cost | ~$0/month (min-instances = 0) |
| Cold start profile | Present (scale-to-zero), acceptable for OCR workloads |
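For reference, the resource settings in this table map onto standard `gcloud run deploy` flags. The sketch below only assembles the argument list and does not run it. The service name and image reference are hypothetical placeholders, and it is an assumption that the lane is deployed through `gcloud` at all; the values themselves (2 vCPU, 4 GiB, 300 s, min-instances 0, `ENTRY_POINT=nemotron-ocr`) are taken from this document.

```python
import shlex


def build_deploy_command(service: str, image: str) -> list:
    """Assemble a `gcloud run deploy` argument list for the OCR lane.

    `service` and `image` are caller-supplied placeholders; nothing is
    executed here.
    """
    return [
        "gcloud", "run", "deploy", service,
        "--image", image,
        "--cpu", "2",            # 2 vCPU
        "--memory", "4Gi",       # 4 GiB memory
        "--timeout", "300",      # request timeout in seconds
        "--min-instances", "0",  # scale-to-zero, so idle cost is ~$0
        "--set-env-vars", "ENTRY_POINT=nemotron-ocr",
    ]


if __name__ == "__main__":
    # Hypothetical service name and image path, for illustration only.
    cmd = build_deploy_command("nemotron-ocr", "REGION-docker.pkg.dev/PROJECT/REPO/IMAGE")
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids quoting problems if it is later passed to `subprocess.run`.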
Verdict
NVIDIA Nemotron OCR v1 is a strong specialist for production OCR extraction in the model zoo. It should remain on a native OCR runtime lane with coordinator-only deployment and scale-to-zero economics. Use it for transcription and ingestion pipelines, not for general multimodal reasoning.