chandra-ocr-2 — NVFP4A16 (W4A16)

NVFP4A16 (4-bit weight, 16-bit activation) quantization of datalab-to/chandra-ocr-2 produced with llm-compressor and packed as compressed-tensors for native vLLM inference.

Fastest of the three quants we measured on Blackwell in the mode that matters most for real OCR pipelines: concurrent page-level fan-out. About 5058 ms/page (0.20 pages/s), 2.5× faster than bf16.

For the original model description, intended uses, accuracy benchmarks (olmOCR-bench, 90-language) and license terms, see the upstream card: https://huggingface.co/datalab-to/chandra-ocr-2.


Quantization recipe

# recipe.yaml (shipped in this repo)
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore:
        - lm_head
        - 're:.*visual.*'        # keep vision tower in bf16
        - 're:.*linear_attn.*'   # keep linear-attn fp16
      scheme: NVFP4A16

  • Weights: NVFP4 (4-bit microscaled FP4, Blackwell-native)
  • Activations: FP16
  • lm_head, the entire visual.* ViT tower, and linear_attn.* left in fp16/bf16 (per the upstream Qwen3.5-VL NVFP4 recipe).
  • Calibration: 512 samples × 4096 tokens from HuggingFaceH4/ultrachat_200k.
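To see which layers the ignore list leaves in 16-bit, the patterns can be applied to module names by hand. This is a minimal sketch, assuming that entries prefixed with `re:` are regular expressions matched from the start of the module name and that other entries must equal the full name. The module names are hypothetical and were not read from the checkpoint, and `is_ignored` is an illustrative helper, not part of llm-compressor.

```python
import re

# Ignore list copied from recipe.yaml.
IGNORE = ["lm_head", "re:.*visual.*", "re:.*linear_attn.*"]


def is_ignored(module_name, ignore=IGNORE):
    """Return True if the module would stay unquantized.

    Assumption: 're:' entries are regexes anchored at the start of the
    name; plain entries must equal the full module name.
    """
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
names = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.linear_attn.in_proj",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.language_model.layers.3.mlp.down_proj",
]

# Only the layers that match no ignore entry are candidates for NVFP4.
quantized = [n for n in names if not is_ignored(n)]
print(quantized)
```

Under these assumptions only the last two names (regular attention and MLP projections in the language model) would be quantized.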

Hardware requirements

| GPU family | Compute capability | NVFP4 status | Recommended? |
|---|---|---|---|
| Blackwell (RTX PRO 6000, B100/B200, RTX 5090) | sm_100+ | Native FP4 tensor cores | ✅ ideal |
| Hopper (H100/H200) | sm_90 | Software emulation via marlin | runnable but slower; prefer FP8_DYNAMIC |
| Ada (RTX 4090/L40S) | sm_89 | No FP4 | ❌ use FP8_DYNAMIC variant |
| Ampere / older | ≤ sm_86 | No FP4 | ❌ use BF16 / FP8 elsewhere |

vLLM ≥ 0.19.1 is required: the compressed-tensors NVFP4A16 VL kernel landed in 0.19.1, and v0.17 rejects the checkpoint with Unsupported data_type: nv_fp.
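The version floor can be checked before launching the server. The sketch below is one possible pre-flight check, not something shipped in this repo; the helper names are illustrative, and pre-release suffixes such as `rc1` are ignored, so `0.19.1rc1` is treated as `0.19.1`.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_VLLM = (0, 19, 1)  # minimum version stated in this card


def parse_version(text):
    """Turn '0.19.1' or '0.19.1.dev3+g1234' into (0, 19, 1)."""
    numbers = re.findall(r"\d+", text.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def vllm_is_new_enough(installed):
    """Compare an installed version string against MIN_VLLM."""
    return parse_version(installed) >= MIN_VLLM


def installed_vllm_version():
    """Return the installed vLLM version string, or None if absent."""
    try:
        return version("vllm")
    except PackageNotFoundError:
        return None


found = installed_vllm_version()
if found is None:
    print("vLLM is not installed")
elif vllm_is_new_enough(found):
    print(f"vLLM {found} is new enough for NVFP4A16")
else:
    print(f"vLLM {found} is too old; upgrade to 0.19.1 or later")
```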

Benchmark (vs. other Chandra-2 quants)

Test bed: RTX PRO 6000 Blackwell Max-Q (96 GB), 14-page Vietnamese financial-statement PDF, vLLM 0.19.1, max-num-seqs=128, max-num-batched-tokens=32768, kv-cache=fp8.

| Build | Sequential per-doc | Concurrent per-page | Best ms/page | vs bf16 |
|---|---|---|---|---|
| bf16 baseline | 12724 ms | 12642 ms | 12642 | 1.0× |
| FP8_DYNAMIC | 5434 ms | 9525 ms | 5434 | 2.3× |
| NVFP4A16 | 12280 ms | 5058 ms | 5058 | 2.5× |
| NVFP4 (W4A4) | 10092 ms | 5794 ms | 5794 | 2.2× |

Take-away: NVFP4A16 wins only with concurrent page fan-out (continuous batching). Single-request serial loops favor FP8_DYNAMIC.
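The derived columns follow directly from the measured latencies: the best figure is the lower of the two modes, throughput is 1000 divided by that latency in ms, and the speedup is the bf16 best divided by each build's best. A short sketch that reproduces them (the dictionary layout and function name are illustrative):

```python
# Per-page latencies in ms, copied from the benchmark table.
RESULTS = {
    "bf16": {"sequential": 12724, "concurrent": 12642},
    "FP8_DYNAMIC": {"sequential": 5434, "concurrent": 9525},
    "NVFP4A16": {"sequential": 12280, "concurrent": 5058},
    "NVFP4 (W4A4)": {"sequential": 10092, "concurrent": 5794},
}


def summarize(results, baseline="bf16"):
    """Compute best latency, pages/s and speedup over the baseline."""
    base_best = min(results[baseline].values())
    rows = {}
    for build, modes in results.items():
        best = min(modes.values())
        rows[build] = {
            "best_mode": min(modes, key=modes.get),
            "best_ms": best,
            "pages_per_s": round(1000 / best, 2),
            "speedup": round(base_best / best, 1),
        }
    return rows


summary = summarize(RESULTS)
for build, row in summary.items():
    print(f"{build:13s} {row['best_ms']:6d} ms  "
          f"{row['pages_per_s']:.2f} pages/s  {row['speedup']:.1f}x  "
          f"({row['best_mode']})")
```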

Usage

vLLM (OpenAI-compatible server) — recommended

vllm serve dangvansam/chandra-ocr-2-NVFP4A16 \
  --served-model-name chandra \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 32768 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --trust-remote-code \
  --mm-processor-kwargs '{"min_pixels": 3136, "max_pixels": 6291456}'

# Client: call exactly like the bf16 original
from openai import OpenAI
import base64, pathlib

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
img_b64 = base64.b64encode(pathlib.Path("page.png").read_bytes()).decode()

resp = client.chat.completions.create(
    model="chandra",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            {"type": "text", "text": "<ocr_layout>"},
        ],
    }],
    max_tokens=12000,
    temperature=0.0,
)
print(resp.choices[0].message.content)

HuggingFace Transformers

The vision tower is left in bf16, so transformers ≥ 5.2 loads this checkpoint the same way as the upstream model; only the LLM-side weights are 4-bit. Use the snippet from the upstream card, replacing "datalab-to/chandra-ocr-2" with "dangvansam/chandra-ocr-2-NVFP4A16".

When to pick which Chandra-2 quant

| Workload | Pick |
|---|---|
| Page-concurrent fan-out on Blackwell | NVFP4A16 (this repo) |
| Single sequential request per doc | FP8_DYNAMIC |
| Ada / Hopper GPU, FP8 acceptable | FP8_DYNAMIC |
| Max compression, accuracy not critical | NVFP4 (W4A4) |
| Reference accuracy / older hardware | upstream bf16 |
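The same guidance can be folded into a small helper for deployment scripts. This sketch combines the hardware and workload recommendations of this card into one hypothetical function, `recommend_quant`. It deliberately leaves out the W4A4 row, since "accuracy not critical" is a judgment the caller has to make.

```python
def recommend_quant(compute_capability, concurrent_pages):
    """Suggest a Chandra-2 build from GPU capability and workload shape.

    compute_capability is a (major, minor) tuple such as (12, 0).
    concurrent_pages is True when pages are fanned out concurrently.
    """
    major, minor = compute_capability
    sm = major * 10 + minor
    if sm >= 100:  # Blackwell: native FP4 tensor cores
        return "NVFP4A16" if concurrent_pages else "FP8_DYNAMIC"
    if sm >= 89:   # Ada (sm_89) and Hopper (sm_90)
        return "FP8_DYNAMIC"
    return "bf16"  # Ampere and older


print(recommend_quant((12, 0), concurrent_pages=True))
print(recommend_quant((9, 0), concurrent_pages=True))
print(recommend_quant((8, 6), concurrent_pages=False))
```

With PyTorch installed, the tuple can be obtained from `torch.cuda.get_device_capability()`.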

Files

  • model.safetensors — NVFP4A16-packed weights (~11 GB)
  • config.json, processor_config.json, preprocessor_config.json, tokenizer.json, tokenizer_config.json, chat_template.jinja, generation_config.json — copied from upstream
  • recipe.yaml — exact llm-compressor recipe used

License & attribution

Inherits the upstream OpenRAIL-M license from datalab-to/chandra-ocr-2. Free for research, personal use, and startups <$2M; not for use competing with Datalab's hosted API. For broader commercial use see Datalab pricing.

This is an unofficial community quant. No additional weights or data were added — only a numerical re-encoding of the upstream model. All credit for the model itself goes to Datalab.

Citation

@misc{chandra_ocr_2,
  author = {Datalab},
  title  = {Chandra OCR 2},
  year   = {2026},
  url    = {https://huggingface.co/datalab-to/chandra-ocr-2}
}