chandra-ocr-2 — NVFP4 (W4A4)

NVFP4 W4A4 (4-bit weight, 4-bit activation) quantization of datalab-to/chandra-ocr-2 produced with llm-compressor and packed as compressed-tensors for native vLLM inference.

The maximum-compression variant. Both weights and activations are FP4, so it depends on Blackwell's native FP4 tensor cores; other GPUs either emulate FP4 in software or lack it entirely (see Hardware requirements). In our OCR benchmark it was slower than NVFP4A16 in the concurrent per-page run (5794 vs 5058 ms/page) despite the smaller activation footprint: for long-output OCR workloads, the extra compression of the W4A4 path did not translate into wall-clock wins. Ship the NVFP4A16 sibling instead unless you specifically need W4A4 for memory reasons.

For the original model description, intended uses, accuracy benchmarks (olmOCR-bench, 90-language) and license terms, see the upstream card: https://huggingface.co/datalab-to/chandra-ocr-2.


Quantization recipe

# recipe.yaml (shipped in this repo)
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore:
        - 're:.*lm_head'
        - 're:visual.*'             # keep ViT vision tower bf16
        - 're:model.visual.*'
        - 're:.*mlp.gate$'
        - 're:.*embed_tokens$'
        - 're:.*shared_expert_gate$'
        - 're:.*mlp\.shared_expert$'
        - 're:.*linear_attn.*'
      scheme: NVFP4          # W4A4

  • Weights: NVFP4 (4-bit microscaled FP4)
  • Activations: NVFP4 (4-bit, computed dynamically)
  • Vision tower, lm_head, MoE gates, linear_attn.* kept in bf16.
  • Calibration: 512 samples × 4096 tokens from HuggingFaceH4/ultrachat_200k.
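To see which modules the ignore list spares, the patterns can be checked against module names. The sketch below assumes that entries prefixed with `re:` are applied with Python's `re.match` (anchored at the start of the name); that is an assumption about how llm-compressor resolves them, and the module names are hypothetical examples, not read from the checkpoint.

```python
import re

# Ignore patterns copied from recipe.yaml.
IGNORE = [
    r"re:.*lm_head",
    r"re:visual.*",
    r"re:model.visual.*",
    r"re:.*mlp.gate$",
    r"re:.*embed_tokens$",
    r"re:.*shared_expert_gate$",
    r"re:.*mlp\.shared_expert$",
    r"re:.*linear_attn.*",
]


def is_ignored(module_name, patterns=IGNORE):
    """Return True if any ignore entry matches the module name.

    Assumption: "re:" entries use re.match (anchored at the start);
    entries without the prefix must equal the name exactly.
    """
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
for name in [
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.gate_proj",
    "model.language_model.layers.0.linear_attn.in_proj",
    "lm_head",
]:
    print(f"{name}: {'kept in bf16' if is_ignored(name) else 'quantized to NVFP4'}")
```

Under that assumption, the `$` anchor in `re:.*mlp.gate$` is what separates a router gate from a `gate_proj` projection, and the two `visual` patterns cover vision-tower names with and without the `model.` prefix.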

Hardware requirements

| GPU family | Compute capability | NVFP4 status | Recommended? |
| --- | --- | --- | --- |
| Blackwell (RTX PRO 6000, B100/B200, RTX 5090) | sm_100+ | Native FP4 tensor cores | ✅ only here |
| Hopper (H100/H200) | sm_90 | Software emulation | ❌ slower than FP8 |
| Ada (RTX 4090/L40S) | sm_89 | No FP4 | ❌ use FP8_DYNAMIC |
| Ampere / older | ≤ sm_86 | No FP4 | |

vLLM ≥ 0.19.1 required (compressed-tensors NVFP4 W4A4 VL kernel landed there; v0.17 rejects with Unsupported data_type: nv_fp).
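The hardware guidance can be turned into a small pre-flight check. This is a minimal sketch: the function name `nvfp4_recommendation` is hypothetical, and the thresholds simply mirror this card's hardware and quant-selection guidance rather than anything vLLM enforces.

```python
def nvfp4_recommendation(major: int, minor: int) -> str:
    """Map a CUDA compute capability to this card's quant guidance."""
    sm = major * 10 + minor
    if sm >= 100:
        # Blackwell: native FP4 tensor cores.
        return "NVFP4: native FP4 tensor cores"
    if sm == 90:
        # Hopper: FP4 is emulated in software and slower than FP8.
        return "FP8_DYNAMIC: NVFP4 is only software-emulated here"
    if sm == 89:
        # Ada: no FP4 support.
        return "FP8_DYNAMIC: no FP4 support"
    # Ampere and older: no FP4 support.
    return "upstream bf16: no FP4 support"


print(nvfp4_recommendation(10, 0))  # B200-class Blackwell
print(nvfp4_recommendation(8, 9))   # Ada

# With PyTorch installed, the capability of the current GPU is available as:
#   import torch
#   major, minor = torch.cuda.get_device_capability()
```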

Benchmark (vs. other Chandra-2 quants)

Test bed: RTX PRO 6000 Blackwell Max-Q (96 GB), 14-page Vietnamese financial-statement PDF, vLLM 0.19.1, max-num-seqs=128, max-num-batched-tokens=32768, kv-cache=fp8.

| Build | Sequential per-doc | Concurrent per-page | Best ms/page | vs bf16 |
| --- | --- | --- | --- | --- |
| bf16 baseline | 12724 ms | 12642 ms | 12642 | 1.0× |
| FP8_DYNAMIC | 5434 ms | 9525 ms | 5434 | 2.3× |
| NVFP4A16 | 12280 ms | 5058 ms | 5058 | 2.5× |
| NVFP4 (W4A4) | 10092 ms | 5794 ms | 5794 | 2.2× |

Take-away: NVFP4 W4A4 sits behind both FP8_DYNAMIC (faster sequential) and NVFP4A16 (faster concurrent) on real OCR workloads. Keep it as a reference point or for memory-constrained deployments — production should prefer NVFP4A16.
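The derived benchmark columns can be reproduced from the raw timings. The sketch below assumes that "Best ms/page" is the smaller of the two measured modes and that "vs bf16" is the bf16 best divided by each build's best; both assumptions are consistent with the reported figures. The helper name `summarize` is hypothetical.

```python
# (sequential per-doc, concurrent per-page) timings in ms, from the benchmark.
results = {
    "bf16 baseline": (12724, 12642),
    "FP8_DYNAMIC": (5434, 9525),
    "NVFP4A16": (12280, 5058),
    "NVFP4 (W4A4)": (10092, 5794),
}


def summarize(results, baseline="bf16 baseline"):
    """Return {build: (best_ms, speedup_vs_baseline)}."""
    best = {name: min(pair) for name, pair in results.items()}
    base = best[baseline]
    return {name: (ms, round(base / ms, 1)) for name, ms in best.items()}


summary = summarize(results)
for name, (ms, speedup) in summary.items():
    print(f"{name:<14} best {ms:>6} ms/page  {speedup:.1f}x vs bf16")
```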

Usage

vLLM (OpenAI-compatible server) — recommended

vllm serve dangvansam/chandra-ocr-2-NVFP4 \
  --served-model-name chandra \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 32768 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --trust-remote-code \
  --mm-processor-kwargs '{"min_pixels": 3136, "max_pixels": 6291456}'

from openai import OpenAI
import base64, pathlib

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
img_b64 = base64.b64encode(pathlib.Path("page.png").read_bytes()).decode()

resp = client.chat.completions.create(
    model="chandra",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            {"type": "text", "text": "<ocr_layout>"},
        ],
    }],
    max_tokens=12000,
    temperature=0.0,
)
print(resp.choices[0].message.content)
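For page-concurrent fan-out, the same request can be issued for several pages in parallel. The following is a minimal sketch of one way to do that with a thread pool; it is not necessarily how the benchmark was driven. The helper names (`build_messages`, `make_ocr_fn`, `ocr_pages`) and the `max_workers=8` default are hypothetical choices.

```python
import base64
from concurrent.futures import ThreadPoolExecutor


def build_messages(png_bytes, prompt="<ocr_layout>"):
    """Build the chat payload for one PNG page."""
    img_b64 = base64.b64encode(png_bytes).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            {"type": "text", "text": prompt},
        ],
    }]


def make_ocr_fn(client, model="chandra", max_tokens=12000):
    """Wrap an OpenAI-compatible client into a page -> text function."""
    def ocr_page(png_bytes):
        resp = client.chat.completions.create(
            model=model,
            messages=build_messages(png_bytes),
            max_tokens=max_tokens,
            temperature=0.0,
        )
        return resp.choices[0].message.content
    return ocr_page


def ocr_pages(pages, ocr_fn, max_workers=8):
    """Run ocr_fn over all pages concurrently; results keep page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(ocr_fn, pages))


# Usage against the server started above (requires the openai package):
#   from openai import OpenAI
#   client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
#   texts = ocr_pages(list_of_png_bytes, make_ocr_fn(client))
```

Threads are sufficient here because each worker spends its time waiting on the HTTP response; the server-side `--max-num-seqs` setting still bounds how many requests are batched together.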

HuggingFace Transformers

The vision tower is kept in bf16, so transformers ≥ 5.2 can load this checkpoint via AutoModelForImageTextToText. Use the upstream snippet, substituting dangvansam/chandra-ocr-2-NVFP4 for the base model id.

When to pick which Chandra-2 quant

| Workload | Pick |
| --- | --- |
| Page-concurrent fan-out on Blackwell | NVFP4A16 |
| Single sequential request per doc | FP8_DYNAMIC |
| Ada / Hopper GPU, FP8 acceptable | FP8_DYNAMIC |
| Max compression, accuracy not critical | NVFP4 W4A4 (this repo) |
| Reference accuracy / older hardware | upstream bf16 |

Files

  • model.safetensors — NVFP4 W4A4-packed weights (~11 GB)
  • config.json, processor_config.json, preprocessor_config.json, tokenizer.json, tokenizer_config.json, chat_template.jinja, generation_config.json — copied from upstream
  • recipe.yaml — exact llm-compressor recipe used

License & attribution

Inherits the upstream OpenRAIL-M license from datalab-to/chandra-ocr-2. Free for research, personal use, and startups <$2M; not for use competing with Datalab's hosted API. For broader commercial use see Datalab pricing.

This is an unofficial community quant. No additional weights or data were added — only a numerical re-encoding of the upstream model. All credit for the model itself goes to Datalab.

Citation

@misc{chandra_ocr_2,
  author = {Datalab},
  title  = {Chandra OCR 2},
  year   = {2026},
  url    = {https://huggingface.co/datalab-to/chandra-ocr-2}
}