Nemotron OCR v2 CoreML

CoreML conversion of the English neural stages from NVIDIA Nemotron OCR v2.

SwiftPM package: https://github.com/mweinbach/OCRCoreML

Included Models

| stage | file | input | outputs |
| --- | --- | --- | --- |
| detector | DetectorGPUInt8_768.mlpackage | image: Float32[1, 3, 768, 768] | prob, rboxes, features |
| recognizer | RecognizerFeaturesInt8.mlpackage | regions: Float32[128, 128, 8, 32] | logits, features |
| relational | RelationalInt8.mlpackage | rectified regions, original quads, recognizer features, valid count | words, lines, line_log_var |

The recognizer emits transformer features; those are required by the relational model, so this bundle covers the full neural OCR pipeline rather than detector-only inference.
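If you drive the detector package directly instead of going through the SwiftPM wrapper (which, per the usage section, accepts an image), the `Float32[1, 3, 768, 768]` input implies a planar, channel-first buffer. The following is a minimal sketch of that layout conversion. The RGB channel order and the 0...1 scaling are assumptions, not taken from this bundle; check `model_config.json` and the conversion configs for the actual preprocessing. The function name `planarCHW` is hypothetical.

```swift
// Sketch only: converts interleaved RGBA8 pixels into a planar CHW Float buffer.
// Assumptions: RGB channel order and scaling to 0...1. Alpha is dropped.
func planarCHW(fromRGBA pixels: [UInt8], width: Int, height: Int) -> [Float] {
    precondition(pixels.count == width * height * 4, "expected tightly packed RGBA8")
    let planeSize = width * height
    var out = [Float](repeating: 0, count: 3 * planeSize)
    for y in 0..<height {
        for x in 0..<width {
            let src = (y * width + x) * 4   // index into interleaved RGBA
            let dst = y * width + x         // index inside one channel plane
            out[dst] = Float(pixels[src]) / 255.0                      // R plane
            out[planeSize + dst] = Float(pixels[src + 1]) / 255.0      // G plane
            out[2 * planeSize + dst] = Float(pixels[src + 2]) / 255.0  // B plane
        }
    }
    return out
}

// Tiny 2x1 image: one red pixel followed by one blue pixel.
let tiny = planarCHW(fromRGBA: [255, 0, 0, 255, 0, 0, 255, 255], width: 2, height: 1)
```

For the detector, `width` and `height` would both be 768 after resizing, and the resulting buffer would be copied into the model's input array.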

Pipeline Boundary

The original Python package uses CUDA/C++ helpers for the non-neural stages: rotated-box NMS, rbox-to-quad conversion, quad rectification, feature-map grid sampling, sequence decoding, relation-graph decoding, and reading-order formatting. None of these operations is provided as a CoreML model. Apple apps integrating this bundle must port or replace those post-processing steps.
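As an illustration of the kind of geometry that has to be ported, here is a rough sketch of turning a rotated box into a four-point quad. The `RotatedBox` layout (center, size, angle in radians) is hypothetical; the detector's real `rboxes` encoding may be different, so treat this as a template rather than a drop-in port of the upstream helper.

```swift
import Foundation

struct Point: Equatable { var x: Double; var y: Double }

// Hypothetical rbox layout: center, size, rotation in radians.
// The actual `rboxes` tensor from the detector may use another encoding.
struct RotatedBox {
    var cx: Double, cy: Double
    var width: Double, height: Double
    var angle: Double
}

// Returns corners in the order top-left, top-right, bottom-right, bottom-left
// (relative to the unrotated box), rotated about the center.
func quad(from box: RotatedBox) -> [Point] {
    let c = cos(box.angle), s = sin(box.angle)
    let hw = box.width / 2, hh = box.height / 2
    let local: [(Double, Double)] = [(-hw, -hh), (hw, -hh), (hw, hh), (-hw, hh)]
    var corners: [Point] = []
    for (dx, dy) in local {
        // Standard 2D rotation; with y pointing down (image coordinates),
        // positive angles appear clockwise on screen.
        corners.append(Point(x: box.cx + dx * c - dy * s,
                             y: box.cy + dx * s + dy * c))
    }
    return corners
}

let axisAligned = quad(from: RotatedBox(cx: 10, cy: 5, width: 4, height: 2, angle: 0))
```

With a zero angle the result is simply the axis-aligned rectangle around the center, which makes a convenient sanity check before wiring in real detector output.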

The linked SwiftPM package includes wrappers for all three CoreML models and a greedy recognizer decoder. It exposes raw tensors rather than claiming complete image-to-text OCR until the geometric and graph post-processing is ported.
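For readers unfamiliar with greedy decoding, the idea is to take the highest-scoring class at each sequence step and map it to a character. The sketch below is not the package's decoder. It assumes CTC-style output (a blank class, with repeated classes collapsed) and a `vocabulary` array indexed by class id, neither of which is confirmed by this bundle. If the recognizer instead ends sequences with a stop token, the collapse rule would be replaced by a break at that token.

```swift
// Greedy (argmax) decoding sketch under CTC-style assumptions.
// `blankIndex` and the class-id-to-string mapping are hypothetical.
func greedyDecode(logits: [[Float]], vocabulary: [String], blankIndex: Int = 0) -> String {
    var text = ""
    var previous = blankIndex
    for step in logits {
        // Argmax over the class scores at this step.
        guard let best = step.indices.max(by: { step[$0] < step[$1] }) else { continue }
        // Emit only when the class is not blank and differs from the previous step.
        if best != blankIndex && best != previous && best < vocabulary.count {
            text += vocabulary[best]
        }
        previous = best
    }
    return text
}

// Steps pick: a, a, blank, a, b. The repeated "a" collapses; the blank separates the next "a".
let sample: [[Float]] = [[0, 1, 0], [0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
let decodedText = greedyDecode(logits: sample, vocabulary: ["", "a", "b"])
```

In practice the vocabulary would be built from `charset.txt`, with any special tokens placed at the indices the checkpoint expects.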

Files

| file | purpose |
| --- | --- |
| DetectorGPUInt8_768.mlpackage/ | detector CoreML package |
| RecognizerFeaturesInt8.mlpackage/ | recognizer CoreML package with logits and features |
| RelationalInt8.mlpackage/ | relational CoreML package |
| charset.txt | English checkpoint charset |
| model_config.json | English checkpoint config |
| configs/ | conversion configs used for the three packages |
| benchmarks/ | local CoreML benchmark results |
| parity/ | PyTorch-vs-CoreML parity reports |
| checksums.sha256 | SHA-256 checksums for package files |
| LICENSE, NOTICE | license terms and redistribution notice |

Performance

Local median latencies after warmup:

| stage | GPU/ALL median | CPU+ANE median | CPU median |
| --- | --- | --- | --- |
| detector | 10.65 ms | 50.46 ms | 157.71 ms |
| recognizer + features | 4.53 ms | 11.04 ms | 47.58 ms |
| relational | 1.72 ms | 6.38 ms | 34.53 ms |

The GPU/ALL configuration gives the lowest single-shot latency on the test machine. CPU+ANE is useful when the GPU needs to be reserved for rendering or other workloads.
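To reproduce this kind of measurement on your own hardware, a small timing helper is enough. This is a sketch, not the harness behind the numbers above: the helper names `median` and `medianLatencyMs` and the default warmup and run counts are assumptions chosen for illustration.

```swift
import Dispatch

// Median of a non-empty list of samples.
func median(_ values: [Double]) -> Double {
    precondition(!values.isEmpty, "need at least one sample")
    let sorted = values.sorted()
    let mid = sorted.count / 2
    return sorted.count % 2 == 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid]
}

// Runs `body` a few times untimed (warmup), then returns the median wall-clock
// time of the timed runs in milliseconds.
func medianLatencyMs(warmup: Int = 5, runs: Int = 50, _ body: () throws -> Void) rethrows -> Double {
    for _ in 0..<warmup { try body() }
    var samples: [Double] = []
    for _ in 0..<runs {
        let start = DispatchTime.now().uptimeNanoseconds
        try body()
        let end = DispatchTime.now().uptimeNanoseconds
        samples.append(Double(end - start) / 1_000_000)
    }
    return median(samples)
}

// Example with an empty workload; in an app the closure would wrap a model call,
// such as the detector prediction from the usage section.
let idleMs = medianLatencyMs(warmup: 1, runs: 3) { }
```

Warmup matters here because the first predictions typically include one-time model loading and compilation costs that would otherwise skew the median.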

Swift Usage

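import OCRCoreML

// Load the three CoreML models with the requested compute units.
let pipeline = try OCRPipeline(computeUnits: .cpuAndGPU)
let detectorPrediction = try pipeline.detect(image: cgImage)

// `regions` and `detectedRegionCount` must come from your own port of the
// detector post-processing (NMS, quad conversion, rectification).
let recognizerPrediction = try pipeline.recognize(regions: regions)
let decoded = try pipeline.recognizer.decode(
    logits: recognizerPrediction.output.logits,
    count: detectedRegionCount
)

// `relationalRegionFeatures` and `originalQuads` are likewise produced by
// your post-processing, not by the package.
let relationalPrediction = try pipeline.relate(
    rectifiedQuads: relationalRegionFeatures,
    originalQuads: originalQuads,
    recognizerFeatures: recognizerPrediction.output.features,
    numValid: detectedRegionCount
)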

See the SwiftPM docs for exact app integration notes: https://github.com/mweinbach/OCRCoreML
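The recognizer input shape `Float32[128, 128, 8, 32]` together with the `count` and `numValid` arguments above suggests a fixed-size batch plus a separate count of valid regions. The sketch below shows one way to pad a smaller set of regions to that batch size. Treating the leading 128 as the batch dimension and padding with zeros are both assumptions, and the wrapper may already handle this internally; `padRegions` is a hypothetical helper, not part of the package.

```swift
// Assumed interpretation of Float32[128, 128, 8, 32]: 128 regions, each 128 x 8 x 32.
let recognizerBatchSize = 128
let regionElementCount = 128 * 8 * 32

// Packs up to 128 per-region buffers into one flat, zero-padded batch and
// returns how many leading slots hold real data.
func padRegions(_ regions: [[Float]]) -> (flat: [Float], validCount: Int) {
    precondition(regions.count <= recognizerBatchSize, "split larger inputs into several batches")
    var flat = [Float](repeating: 0, count: recognizerBatchSize * regionElementCount)
    for (i, region) in regions.enumerated() {
        precondition(region.count == regionElementCount, "unexpected region size")
        let start = i * regionElementCount
        flat.replaceSubrange(start..<(start + regionElementCount), with: region)
    }
    return (flat, regions.count)
}

// Two real regions filled with ones; the remaining 126 slots stay zero.
let oneRegion = [Float](repeating: 1, count: regionElementCount)
let batch = padRegions([oneRegion, oneRegion])
```

The returned `validCount` plays the role of `detectedRegionCount` in the usage above, so downstream stages can ignore the padded slots.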

License

The converted model weights inherit the NVIDIA Open Model License Agreement. The upstream source code and helper scripts are Apache 2.0. See LICENSE and NOTICE.
