metadata
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
library_name: mlx
tags:
  - mlx
  - ocr
  - apple-silicon
  - swift
  - nemotron
  - nvidia
  - document-understanding
  - text-recognition
pipeline_tag: image-to-text
base_model: nvidia/nemotron-ocr-v2

Nemotron OCR v2 — MLX (Apple Silicon)

MLX-format weights for NVIDIA Nemotron OCR v2, converted from PyTorch for native Apple Silicon inference via mlx-swift.

Model Details

Three-stage OCR pipeline optimized for Apple Silicon CPU and GPU:

Component    Architecture                        Parameters
Detector     RegNet-X-8GF + ASPP + FPN           ~43M
Recognizer   CNN encoder + Transformer decoder   ~6M
Relational   Graph neural network + Transformer  ~2M

Variants

Variant          Charset          Vocab Size   Recognizer Seq Length
v2_english       855 characters   858 tokens   32
v2_multilingual  ~42K characters  ~42K tokens  512

Device-Tuned Presets

Device  Detector Resolution  dtype     Notes
GPU     512×512              bfloat16  Best throughput on Apple Silicon GPU via Metal
CPU     256×256              float32   Reduced resolution avoids expensive full-res CPU inference
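
The presets above can be pictured as a small lookup keyed by device name. The following is a minimal Python sketch for illustration only: it does not reproduce the actual layout of config.json, and the names `DevicePreset`, `PRESETS`, and `preset_for` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DevicePreset:
    detector_resolution: tuple  # (height, width) in pixels
    dtype: str


# Values copied from the preset table; the structure itself is an assumption.
PRESETS = {
    "GPU": DevicePreset(detector_resolution=(512, 512), dtype="bfloat16"),
    "CPU": DevicePreset(detector_resolution=(256, 256), dtype="float32"),
}


def preset_for(device: str) -> DevicePreset:
    """Return the preset for a device name, accepting any letter case."""
    try:
        return PRESETS[device.upper()]
    except KeyError:
        expected = ", ".join(sorted(PRESETS))
        raise ValueError(f"unknown device {device!r}; expected one of: {expected}") from None
```

At these resolutions the CPU preset gives the detector one quarter as many input pixels as the GPU preset (256×256 versus 512×512).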

Usage with SwiftNemotronOCR

This model is designed for use with SwiftNemotronOCR, an all-Swift OCR pipeline.

Quick Start

# Clone the Swift package
git clone https://github.com/mweinbach/SwiftNemotronOCR.git
cd SwiftNemotronOCR

# Build
swift build -c release

# Compile Metal shaders (required when building with swift build; see repo README)
# ... (see SwiftNemotronOCR README for the metallib build step)

# Download this model
# Place v2_english/ and/or v2_multilingual/ under a model/mlx/ directory

# Run OCR on GPU
.build/release/apple-ocr-runner \
  --runtime mlx \
  --device GPU \
  --model-dir /path/to/model/mlx \
  --image /path/to/image.png \
  --variant en \
  --level paragraph
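
To call the runner from a script, the same invocation can be assembled programmatically. This is a rough Python sketch that uses only the flags shown in the command above; the helper names `build_ocr_command` and `run_ocr` are hypothetical, and it assumes the runner writes its JSON report to standard output, which the repo README should confirm.

```python
import subprocess


def build_ocr_command(runner, model_dir, image, device="GPU", variant="en", level="paragraph"):
    """Assemble the argv list for apple-ocr-runner using the flags from the Quick Start."""
    return [
        str(runner),
        "--runtime", "mlx",
        "--device", device,
        "--model-dir", str(model_dir),
        "--image", str(image),
        "--variant", variant,
        "--level", level,
    ]


def run_ocr(runner, model_dir, image, **options):
    """Run the CLI and return its standard output as text.

    Assumption: the JSON report is printed to stdout.
    """
    command = build_ocr_command(runner, model_dir, image, **options)
    completed = subprocess.run(command, check=True, capture_output=True, text=True)
    return completed.stdout
```

Passing the arguments as a list avoids shell quoting problems when the model or image path contains spaces.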

Output

JSON with detected text regions, confidence scores, and bounding quads:

{
  "images": [{
    "region_count": 24,
    "latency_ms": 2841.6,
    "regions": [
      {"text": "Council", "confidence": 0.48, "quad": {"points": [...]}},
      {"text": "RECONCILIATION", "confidence": 0.41, "quad": {"points": [...]}}
    ]
  }]
}
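
A consumer of this output will often want only the regions above some confidence level. The sketch below shows one way to do that in Python, assuming the field names in the sample above; the function name `confident_texts` and the 0.45 threshold are arbitrary illustrative choices.

```python
import json


def confident_texts(report_json, min_confidence=0.45):
    """Return (text, confidence) pairs at or above the threshold, across all images."""
    report = json.loads(report_json)
    results = []
    for image in report.get("images", []):
        for region in image.get("regions", []):
            if region["confidence"] >= min_confidence:
                results.append((region["text"], region["confidence"]))
    return results


# Trimmed copy of the sample output; quad points are omitted because the sample elides them.
sample = """
{"images": [{"region_count": 24, "latency_ms": 2841.6, "regions": [
  {"text": "Council", "confidence": 0.48},
  {"text": "RECONCILIATION", "confidence": 0.41}
]}]}
"""

print(confident_texts(sample))  # [('Council', 0.48)]
```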

File Structure

v2_english/
  config.json              # Model configuration + device presets
  charset.txt              # Character vocabulary (JSON array)
  manifest.json            # Component metadata
  detector.safetensors     # RegNet-X-8GF detector (~181 MB)
  recognizer.safetensors   # Transformer recognizer (~24 MB)
  relational.safetensors   # GNN relational model (~9 MB)

v2_multilingual/
  config.json
  charset.txt
  manifest.json
  detector.safetensors     # Same detector (~181 MB)
  recognizer.safetensors   # Larger recognizer (~144 MB)
  relational.safetensors   # (~9 MB)
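
Before running inference it can help to confirm that a downloaded variant directory is complete. The following Python sketch checks for the six file names listed above; `EXPECTED_FILES` and `missing_files` are hypothetical names, and the check looks only at file presence, not contents.

```python
from pathlib import Path

# File names taken from the listing above.
EXPECTED_FILES = (
    "config.json",
    "charset.txt",
    "manifest.json",
    "detector.safetensors",
    "recognizer.safetensors",
    "relational.safetensors",
)


def missing_files(model_dir, variant):
    """Return the expected file names that are absent from model_dir/variant."""
    variant_dir = Path(model_dir) / variant
    return [name for name in EXPECTED_FILES if not (variant_dir / name).is_file()]
```

For example, `missing_files("/path/to/model/mlx", "v2_english")` returns an empty list when all six files are present.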

Related

Conversion

Weights were converted from the original PyTorch checkpoints:

  • Conv weights transposed from OIHW → OHWI (MLX convention)
  • Saved in safetensors format
  • Config generated with device-specific detector resolution presets
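
The layout change in the first step moves the input-channel axis from second position to last. The sketch below illustrates that reordering on plain nested lists in Python; it is not the conversion script used for these weights, and `oihw_to_ohwi` is a hypothetical name. With an array library the same operation is a transpose with axis order (0, 2, 3, 1).

```python
def oihw_to_ohwi(weight):
    """Reorder a nested-list conv kernel from [O][I][H][W] to [O][H][W][I]."""
    out_ch = len(weight)
    in_ch = len(weight[0])
    height = len(weight[0][0])
    width = len(weight[0][0][0])
    return [
        [
            [[weight[o][i][h][w] for i in range(in_ch)] for w in range(width)]
            for h in range(height)
        ]
        for o in range(out_ch)
    ]


# A kernel with 2 output channels, 3 input channels, and a 1x2 spatial window.
kernel = [[[[o * 100 + i * 10 + w for w in range(2)]] for i in range(3)] for o in range(2)]
converted = oihw_to_ohwi(kernel)

# The element at (o=1, i=2, h=0, w=1) is now found at (o=1, h=0, w=1, i=2).
assert converted[1][0][1][2] == kernel[1][2][0][1]
```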

License

This model inherits the NVIDIA Open Model License from the original Nemotron OCR v2 release.