# Nvidia-Nemotron-OCR-v1

## Identity

| Property | Value |
| --- | --- |
| ID | nvidia-nemotron-ocr-v1 |
| Parameters | ~52M recognizer core + detector/relational OCR components |
| HuggingFace | affectively-ai/nvidia-nemotron-ocr-v1 |
| Quantization | FP32 PyTorch checkpoints (~0.21 GB) |
| License | NVIDIA Open Model License (HF metadata: `other`) |

## Axis 1: Architecture

| Property | Value |
| --- | --- |
| Family | Hybrid OCR pipeline (detector + recognizer + relational reasoning) |
| Primary artifacts | `checkpoints/detector.pth`, `checkpoints/recognizer.pth`, `checkpoints/relational.pth` |
| Charset | `checkpoints/charset.txt` |
| Input granularity | Page/image-level OCR with line-aware extraction |
| Output form | Plain text transcription |
| Precision | FP32 checkpoints |

Architecture assessment: This is a specialist OCR model family, not a general VLM. It is optimized for text extraction from document-like images and should be routed through an OCR-native runtime path rather than chat-first multimodal compatibility layers.
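
The detector/recognizer/relational split above implies a three-stage flow: detect text regions, recognize each region, then use relational reasoning to fix reading order. The sketch below illustrates that flow with stand-in stages; the actual checkpoint APIs are not published in this card, so every function here is a hypothetical placeholder, not the model's real interface.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) of a detected text region

@dataclass
class OcrPipeline:
    """Three-stage flow implied by the checkpoints: detect, recognize, reorder."""
    detect: Callable[[bytes], List[Box]]      # role of detector.pth
    recognize: Callable[[bytes, Box], str]    # role of recognizer.pth
    order: Callable[[List[Box]], List[int]]   # role of relational.pth

    def run(self, image: bytes) -> str:
        boxes = self.detect(image)
        reading_order = self.order(boxes)
        lines = [self.recognize(image, boxes[i]) for i in reading_order]
        return "\n".join(lines)

# Stand-in stages for illustration: fixed boxes, canned text, top-to-bottom order.
demo = OcrPipeline(
    detect=lambda img: [(0, 40, 100, 20), (0, 0, 100, 20)],
    recognize=lambda img, box: {0: "first line", 40: "second line"}[box[1]],
    order=lambda boxes: sorted(range(len(boxes)), key=lambda i: boxes[i][1]),
)
print(demo.run(b"fake-image-bytes"))
```

The point of the sketch is the separation of concerns: the relational stage only reorders detected boxes, which is why layout semantics (tables, forms) remain limited even when raw extraction is reliable.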

## Axis 2: Runtime

| Runtime | Viable | Notes |
| --- | --- | --- |
| WASM (browser) | No | Checkpoints/runtime exceed practical browser constraints |
| ONNX/WebGPU | No | No maintained ONNX export in this deployment path |
| Native (device) | Conditional | Possible with a local OCR runtime and dependencies |
| Edge Worker | No | OCR binary/runtime requirements exceed worker limits |
| Cloud Run (coordinator-only) | Yes | Primary runtime (`ENTRY_POINT=nemotron-ocr`) |
| Cloud GPU | Optional | Not required for the current production OCR lane |

Primary runtime: Cloud Run coordinator-only native OCR lane, scale-to-zero, with no layer-node fan-out.
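
A coordinator-only lane means clients talk to a single HTTP service. The sketch below shows what building such a request might look like; the endpoint URL and JSON payload schema are assumptions for illustration, since the card only specifies `ENTRY_POINT=nemotron-ocr`, not the wire format.

```python
import base64
import json
import urllib.request

# Placeholder URL and payload schema; not the documented API of this deployment.
ENDPOINT = "https://example-ocr-coordinator.run.app/v1/ocr"

def build_ocr_request(image_bytes: bytes) -> urllib.request.Request:
    """Construct (but do not send) a POST carrying a base64-encoded image."""
    payload = {"image": base64.b64encode(image_bytes).decode("ascii")}
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending requires a live deployment; here we only construct the request object.
req = build_ocr_request(b"\x89PNG...")
print(req.get_method(), req.get_full_url())
```

Because the lane is scale-to-zero, the first request after idle will absorb a cold start; clients should set timeouts accordingly (the service-side limit is 300 s).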

## Axis 3: Modality

| Property | Value |
| --- | --- |
| Input | Image (URL, data URL, or base64 payload) |
| Output | Extracted text |
| Category | OCR / image-to-text |
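
Of the three accepted input forms, the data URL is the one callers most often have to construct themselves. A minimal sketch, assuming standard RFC 2397 formatting (the exact MIME types the service accepts are not specified in this card):

```python
import base64

def to_data_url(image_bytes: bytes, mime: str = "image/png") -> str:
    """Wrap raw image bytes as a data URL (data:<mime>;base64,<payload>)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{encoded}"

print(to_data_url(b"\x89PNG\r\n")[:40])
```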

## Axis 4: Task Fitness

| Task | Fitness | Notes |
| --- | --- | --- |
| Printed document OCR | Very good | Primary workload |
| Screenshot text extraction | Very good | Works well on high-contrast captures |
| Mixed-layout pages (tables/forms) | Good | Reliable extraction; layout semantics remain limited |
| Handwriting OCR | Moderate | Accuracy depends heavily on handwriting style/quality |
| Vision-language reasoning | Poor | Not a chat-oriented VLM |

Role in the zoo: Dedicated OCR specialist for document transcription and ingestion. It should back OCR endpoints directly, not serve as a general multimodal reasoning model.

## Axis 5: Operational Cost

| Property | Value |
| --- | --- |
| Checkpoint footprint | ~0.21 GB |
| Cloud Run topology | 1 coordinator, 0 layer nodes |
| Cloud Run resources | 2 vCPU, 4 GiB memory |
| Request timeout | 300 s |
| Idle cost | ~$0/month (`min-instances = 0`) |
| Cold start profile | Present (scale-to-zero); acceptable for OCR workloads |

## Verdict

NVIDIA Nemotron OCR v1 is a strong specialist for production OCR extraction in the model zoo. It should remain on a native OCR runtime lane with coordinator-only deployment and scale-to-zero economics. Use it for transcription and ingestion pipelines, not for general multimodal reasoning.
