---
license: other
license_name: nvidia-open-model-license
license_link: >-
  https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-open-model-license/
library_name: mlx
tags:
  - mlx
  - ocr
  - apple-silicon
  - swift
  - nemotron
  - nvidia
  - document-understanding
  - text-recognition
pipeline_tag: image-to-text
base_model: nvidia/nemotron-ocr-v2
---

Nemotron OCR v2 — MLX (Apple Silicon)

MLX-format weights for NVIDIA Nemotron OCR v2, converted from PyTorch for native Apple Silicon inference via mlx-swift.

Model Details

A three-stage OCR pipeline (detector, recognizer, relational model) optimized for Apple Silicon CPU and GPU:

| Component | Architecture | Parameters |
| --- | --- | --- |
| Detector | RegNet-X-8GF + ASPP + FPN | ~43M |
| Recognizer | CNN encoder + Transformer decoder | ~6M |
| Relational | Graph neural network + Transformer | ~2M |

Variants

| Variant | Charset | Vocab Size | Recognizer Seq Length |
| --- | --- | --- | --- |
| `v2_english` | 855 characters | 858 tokens | 32 |
| `v2_multilingual` | ~42K characters | ~42K tokens | 512 |

Device-Tuned Presets

| Device | Detector Resolution | dtype | Notes |
| --- | --- | --- | --- |
| GPU | 512×512 | bfloat16 | Best throughput on Apple Silicon GPU via Metal |
| CPU | 256×256 | float32 | Reduced resolution avoids expensive full-res CPU inference |

Usage with SwiftNemotronOCR

This model is designed for use with SwiftNemotronOCR, an all-Swift OCR pipeline.

Quick Start

# Clone the Swift package
git clone https://github.com/mweinbach/SwiftNemotronOCR.git
cd SwiftNemotronOCR

# Build
swift build -c release

# Compile Metal shaders (required for swift build — see repo README)
# ... (see SwiftNemotronOCR README for the metallib build step)

# Download this model
# Place v2_english/ and/or v2_multilingual/ under a model/mlx/ directory

# Run OCR on GPU
.build/release/apple-ocr-runner \
  --runtime mlx \
  --device GPU \
  --model-dir /path/to/model/mlx \
  --image /path/to/image.png \
  --variant en \
  --level paragraph
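
The "Download this model" step above does not name a command. One possible way to fetch the weights is sketched below in Python. It assumes the `huggingface_hub` package is installed and that this repository's id is `mweinbach/nemotron-ocr-v2-mlx` (an assumption; the card body does not state it). The helper names `variant_patterns` and `download_variants` are hypothetical and are not part of SwiftNemotronOCR.

```python
from pathlib import Path

# Assumption: repository id of this model card on the Hugging Face Hub.
REPO_ID = "mweinbach/nemotron-ocr-v2-mlx"

# Variant directories listed in this card.
VARIANTS = ("v2_english", "v2_multilingual")


def variant_patterns(variants):
    """Build glob patterns that select only the requested variant folders."""
    unknown = [v for v in variants if v not in VARIANTS]
    if unknown:
        raise ValueError(f"unknown variant(s): {unknown}")
    return [f"{v}/*" for v in variants]


def download_variants(model_dir="model/mlx", variants=("v2_english",)):
    """Download the chosen variants into model_dir and return the local path."""
    # Imported lazily so the helpers above work without the package installed.
    from huggingface_hub import snapshot_download

    Path(model_dir).mkdir(parents=True, exist_ok=True)
    return snapshot_download(
        repo_id=REPO_ID,
        local_dir=model_dir,
        allow_patterns=variant_patterns(variants),
    )
```

Calling `download_variants("model/mlx", ["v2_english"])` should leave a `v2_english/` folder under `model/mlx/`, which is the layout the `--model-dir` flag above expects.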

Output

JSON with detected text regions, confidence scores, and bounding quads:

{
  "images": [{
    "region_count": 24,
    "latency_ms": 2841.6,
    "regions": [
      {"text": "Council", "confidence": 0.48, "quad": {"points": [...]}},
      {"text": "RECONCILIATION", "confidence": 0.41, "quad": {"points": [...]}}
    ]
  }]
}
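
A short Python sketch of consuming this output, using only the field names shown above (`images`, `regions`, `text`, `confidence`). The function name `confident_texts` and the 0.45 threshold are illustrative assumptions, and the `points` arrays are left empty because the sample above elides them.

```python
import json


def confident_texts(result_json, min_confidence=0.45):
    """Return the text of every region at or above min_confidence."""
    result = json.loads(result_json)
    texts = []
    for image in result["images"]:
        for region in image["regions"]:
            if region["confidence"] >= min_confidence:
                texts.append(region["text"])
    return texts


# Sample mirroring the output above; quad points are elided in the source.
sample = """
{
  "images": [{
    "region_count": 2,
    "latency_ms": 2841.6,
    "regions": [
      {"text": "Council", "confidence": 0.48, "quad": {"points": []}},
      {"text": "RECONCILIATION", "confidence": 0.41, "quad": {"points": []}}
    ]
  }]
}
"""

print(confident_texts(sample))  # prints ['Council']
```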

File Structure

v2_english/
  config.json              # Model configuration + device presets
  charset.txt              # Character vocabulary (JSON array)
  manifest.json            # Component metadata
  detector.safetensors     # RegNet-X-8GF detector (~181 MB)
  recognizer.safetensors   # Transformer recognizer (~24 MB)
  relational.safetensors   # GNN relational model (~9 MB)

v2_multilingual/
  config.json
  charset.txt
  manifest.json
  detector.safetensors     # Same detector (~181 MB)
  recognizer.safetensors   # Larger recognizer (~144 MB)
  relational.safetensors   # (~9 MB)
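
As a rough sanity check, the file sizes above can be converted to approximate parameter counts. The sketch below assumes the weights are stored as 32-bit floats (4 bytes per parameter), takes 1 MB as 10^6 bytes, and ignores the small safetensors header; none of this is confirmed by the card. Under those assumptions the multilingual recognizer would hold roughly 36M parameters, so the ~6M recognizer figure under Model Details presumably describes the English variant.

```python
BYTES_PER_PARAM = 4  # assumption: float32 storage


def approx_params_millions(file_size_mb, bytes_per_param=BYTES_PER_PARAM):
    """Estimate parameters (in millions) from a file size in MB (10^6 bytes)."""
    return file_size_mb / bytes_per_param


# Approximate file sizes in MB, taken from the listing above.
files = {
    "detector": 181,
    "recognizer (v2_english)": 24,
    "recognizer (v2_multilingual)": 144,
    "relational": 9,
}

for name, size_mb in files.items():
    print(f"{name}: ~{approx_params_millions(size_mb):.1f}M parameters")
```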

Swift Package: mweinbach/SwiftNemotronOCR — Full Swift implementation with CoreML + MLX backends
CoreML Weights: mweinbach/nemotron-ocr-v2-coreml — ANE-optimized CoreML models
Original Model: nvidia/nemotron-ocr-v2 — NVIDIA's original PyTorch release

Conversion

Weights were converted from the original PyTorch checkpoints:

Conv weights transposed from OIHW → OHWI (MLX convention)
Saved as safetensors format
Config generated with device-specific detector resolution presets
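
The layout change in the first step can be illustrated with a small pure-Python sketch. It is not the conversion script used for this repository; the helper names `oihw_to_ohwi` and `shape` are hypothetical, and nested lists stand in for tensors. With NumPy the same reordering would be `np.transpose(w, (0, 2, 3, 1))`.

```python
def oihw_to_ohwi(weight):
    """Reorder a 4-D nested list from [O][I][H][W] to [O][H][W][I]."""
    out_ch = len(weight)
    in_ch = len(weight[0])
    height = len(weight[0][0])
    width = len(weight[0][0][0])
    return [
        [
            [
                [weight[o][i][h][w] for i in range(in_ch)]
                for w in range(width)
            ]
            for h in range(height)
        ]
        for o in range(out_ch)
    ]


def shape(nested):
    """Return the dimensions of a regular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Toy kernel: 2 output channels, 3 input channels, 1x2 spatial extent.
# Each value encodes its own (o, i, w) position so the move is easy to check.
oihw = [
    [[[100 * o + 10 * i + w for w in range(2)]] for i in range(3)]
    for o in range(2)
]
ohwi = oihw_to_ohwi(oihw)

print(shape(oihw), "->", shape(ohwi))  # prints (2, 3, 1, 2) -> (2, 1, 2, 3)
```

The input-channel axis moves to the last position while the output-channel axis stays first, so `ohwi[o][h][w][i]` equals `oihw[o][i][h][w]` for every index.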

License

This model inherits the NVIDIA Open Model License from the original Nemotron OCR v2 release.
