# Nvidia-Nemotron-OCR-v1

## Identity

| Property | Value |
| --- | --- |
| ID | nvidia-nemotron-ocr-v1 |
| Parameters | ~52M recognizer core + detector/relational OCR components |
| HuggingFace | affectively-ai/nvidia-nemotron-ocr-v1 |
| Quantization | FP32 PyTorch checkpoints (~0.21 GB) |
| License | NVIDIA Open Model License (HF metadata: `other`) |

## Axis 1: Architecture

| Property | Value |
| --- | --- |
| Family | Hybrid OCR pipeline (detector + recognizer + relational reasoning) |
| Primary artifacts | `checkpoints/detector.pth`, `checkpoints/recognizer.pth`, `checkpoints/relational.pth` |
| Charset | `checkpoints/charset.txt` |
| Input granularity | Page/image-level OCR with line-aware extraction |
| Output form | Plain text transcription |
| Precision | FP32 checkpoints |

Architecture assessment: This is a specialist OCR model family, not a general VLM. It is optimized for text extraction from document-like images and should be routed through an OCR-native runtime path rather than chat-first multimodal compatibility layers.
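
The detector/recognizer/relational split above implies a three-stage flow: detect text regions, recognize each region, then use relational reasoning to fix reading order. The sketch below illustrates that flow with stand-in stages; the actual checkpoint APIs are not published in this card, so every function here is a hypothetical placeholder, not the model's real interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) of a detected text region

@dataclass
class OcrPipeline:
    """Three-stage flow implied by the checkpoints: detect, recognize, reorder."""
    detect: Callable[[bytes], List[Box]]      # role of detector.pth
    recognize: Callable[[bytes, Box], str]    # role of recognizer.pth
    order: Callable[[List[Box]], List[int]]   # role of relational.pth

    def run(self, image: bytes) -> str:
        boxes = self.detect(image)
        reading_order = self.order(boxes)
        lines = [self.recognize(image, boxes[i]) for i in reading_order]
        return "\n".join(lines)

# Stand-in stages for illustration: fixed boxes, canned text, top-to-bottom order.
demo = OcrPipeline(
    detect=lambda img: [(0, 40, 100, 20), (0, 0, 100, 20)],
    recognize=lambda img, box: {0: "first line", 40: "second line"}[box[1]],
    order=lambda boxes: sorted(range(len(boxes)), key=lambda i: boxes[i][1]),
)
print(demo.run(b"fake-image-bytes"))
```

The point of the sketch is the separation of concerns: the relational stage only reorders detected boxes, which is why layout semantics (tables, forms) remain limited even when raw extraction is reliable.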

## Axis 2: Runtime

| Runtime | Viable | Notes |
| --- | --- | --- |
| WASM (browser) | No | Checkpoints/runtime exceed practical browser constraints |
| ONNX/WebGPU | No | No maintained ONNX export in this deployment path |
| Native (device) | Conditional | Possible with a local OCR runtime and dependencies |
| Edge Worker | No | OCR binary/runtime requirements exceed worker limits |
| Cloud Run (coordinator-only) | Yes | Primary runtime (`ENTRY_POINT=nemotron-ocr`) |
| Cloud GPU | Optional | Not required for the current production OCR lane |

Primary runtime: Cloud Run coordinator-only native OCR lane, scale-to-zero, with no layer-node fan-out.
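
A coordinator-only lane means clients talk to a single HTTP service. The sketch below shows what building such a request might look like; the endpoint URL and JSON payload schema are assumptions for illustration, since the card only specifies `ENTRY_POINT=nemotron-ocr`, not the wire format.

```python
import base64
import json
import urllib.request

# Placeholder URL and payload schema; not the documented API of this deployment.
ENDPOINT = "https://example-ocr-coordinator.run.app/v1/ocr"

def build_ocr_request(image_bytes: bytes) -> urllib.request.Request:
    """Construct (but do not send) a POST carrying a base64-encoded image."""
    payload = {"image": base64.b64encode(image_bytes).decode("ascii")}
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending requires a live deployment; here we only construct the request object.
req = build_ocr_request(b"\x89PNG...")
print(req.get_method(), req.get_full_url())
```

Because the lane is scale-to-zero, the first request after idle will absorb a cold start; clients should set timeouts accordingly (the service-side limit is 300 s).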

## Axis 3: Modality

| Property | Value |
| --- | --- |
| Input | Image (URL, data URL, or base64 payload) |
| Output | Extracted text |
| Category | OCR / image-to-text |
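
Of the three accepted input forms, the data URL is the one callers most often have to construct themselves. A minimal sketch, assuming standard RFC 2397 formatting (the exact MIME types the service accepts are not specified in this card):

```python
import base64

def to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Wrap raw image bytes as a data URL (data:<mime>;base64,<payload>)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"

print(to_data_url(b"\x89PNG\r\n")[:40])
```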

## Axis 4: Task Fitness

| Task | Fitness | Notes |
| --- | --- | --- |
| Printed document OCR | Very good | Primary workload |
| Screenshot text extraction | Very good | Works well on high-contrast captures |
| Mixed-layout pages (tables/forms) | Good | Reliable extraction; layout semantics remain limited |
| Handwriting OCR | Moderate | Accuracy depends heavily on handwriting style/quality |
| Vision-language reasoning | Poor | Not a chat-oriented VLM |

Role in the zoo: Dedicated OCR specialist for document transcription and ingestion. It should back OCR endpoints directly, not serve as a general multimodal reasoning model.

## Axis 5: Operational Cost

| Property | Value |
| --- | --- |
| Checkpoint footprint | ~0.21 GB |
| Cloud Run topology | 1 coordinator, 0 layer nodes |
| Cloud Run resources | 2 vCPU, 4 GiB memory |
| Request timeout | 300 s |
| Idle cost | ~$0/month (`min-instances = 0`) |
| Cold start profile | Present (scale-to-zero); acceptable for OCR workloads |

## Verdict

NVIDIA Nemotron OCR v1 is a strong specialist for production OCR extraction in the model zoo. It should remain on a native OCR runtime lane with coordinator-only deployment and scale-to-zero economics. Use it for transcription and ingestion pipelines, not for general multimodal reasoning.
