chandra-ocr-2 — NVFP4 (W4A4)
NVFP4 W4A4 (4-bit weight, 4-bit activation) quantization of datalab-to/chandra-ocr-2, produced with llm-compressor and packed as compressed-tensors for native vLLM inference.
The maximum-compression variant. Both weights and activations are FP4, so it needs Blackwell's native FP4 tensor cores to run at all. In our OCR benchmark it was slower than NVFP4A16 (5794 vs 5058 ms/page) despite the smaller activation footprint — for long-output OCR workloads the W4A4 path didn't translate the extra compression into wall-clock wins. Ship the NVFP4A16 sibling instead unless you specifically need W4A4 for memory reasons.
For the original model description, intended uses, accuracy benchmarks (olmOCR-bench, 90-language) and license terms, see the upstream card: https://huggingface.co/datalab-to/chandra-ocr-2.
Quantization recipe
# recipe.yaml (shipped in this repo)
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore:
        - 're:.*lm_head'
        - 're:visual.*'            # keep ViT vision tower bf16
        - 're:model.visual.*'
        - 're:.*mlp.gate$'
        - 're:.*embed_tokens$'
        - 're:.*shared_expert_gate$'
        - 're:.*mlp\.shared_expert$'
        - 're:.*linear_attn.*'
      scheme: NVFP4                # W4A4
- Weights: NVFP4 (4-bit microscaled FP4)
- Activations: NVFP4 (4-bit, computed dynamically)
- Vision tower, lm_head, MoE gates, linear_attn.* kept in bf16.
- Calibration: 512 samples × 4096 tokens from HuggingFaceH4/ultrachat_200k.
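To see which layers the ignore list keeps in bf16, the patterns can be tried against module names by hand. The sketch below assumes that a `re:` prefix marks a regular expression applied with `re.match` (anchored at the start of the name) and that unprefixed entries are exact names; the module names are hypothetical and only illustrate the matching, so check them against the real checkpoint before relying on the result.

```python
import re

# Ignore patterns copied from recipe.yaml above.
IGNORE = [
    "re:.*lm_head",
    "re:visual.*",
    "re:model.visual.*",
    "re:.*mlp.gate$",
    "re:.*embed_tokens$",
    "re:.*shared_expert_gate$",
    r"re:.*mlp\.shared_expert$",
    "re:.*linear_attn.*",
]


def is_ignored(module_name: str, patterns=IGNORE) -> bool:
    """Return True if any ignore pattern matches the module name.

    Assumption: "re:" entries are regexes applied with re.match;
    all other entries are compared as exact names.
    """
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
examples = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.gate_proj",
    "model.language_model.layers.0.linear_attn.in_proj",
    "model.language_model.layers.0.self_attn.q_proj",
]
for name in examples:
    kept = "bf16 (ignored)" if is_ignored(name) else "NVFP4"
    print(f"{name}: {kept}")
```

Note the `$` anchor on `mlp.gate$`: it skips a router gate named `mlp.gate` while still quantizing a projection named `mlp.gate_proj`.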
Hardware requirements
| GPU family | Compute capability | NVFP4 status | Recommended? |
|---|---|---|---|
| Blackwell (RTX PRO 6000, B100/B200, RTX 5090) | sm_100+ | Native FP4 tensor cores | ✅ only here |
| Hopper (H100/H200) | sm_90 | Software emulation | ❌ slower than FP8 |
| Ada (RTX 4090/L40S) | sm_89 | No FP4 | ❌ use FP8_DYNAMIC |
| Ampere / older | ≤ sm_86 | No FP4 | ❌ |
vLLM ≥ 0.19.1 is required: the compressed-tensors NVFP4 W4A4 VL kernel
landed in that release, and v0.17 rejects the checkpoint with
Unsupported data_type: nv_fp.
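The hardware table can be encoded as a small lookup so a deployment script can fail early on an unsupported GPU. This is a minimal sketch: `nvfp4_status` is a hypothetical helper, and the handling of capabilities that fall between the table rows is an assumption.

```python
def nvfp4_status(major: int, minor: int) -> str:
    """Map a CUDA compute capability to the NVFP4 status in the table above.

    Assumption: capabilities between the listed rows are grouped with the
    nearest lower row.
    """
    cc = major * 10 + minor
    if cc >= 100:   # Blackwell, sm_100+
        return "native FP4 tensor cores"
    if cc >= 90:    # Hopper, sm_90
        return "software emulation, slower than FP8"
    if cc == 89:    # Ada, sm_89
        return "no FP4, use FP8_DYNAMIC"
    return "no FP4"  # Ampere and older


# With PyTorch installed, the capability of the current GPU can be read with
# torch.cuda.get_device_capability(), which returns a (major, minor) tuple.
for capability in [(12, 0), (10, 0), (9, 0), (8, 9), (8, 6)]:
    print(capability, "->", nvfp4_status(*capability))
```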
Benchmark (vs. other Chandra-2 quants)
Test bed: RTX PRO 6000 Blackwell Max-Q (96 GB), 14-page Vietnamese
financial-statement PDF, vLLM 0.19.1, max-num-seqs=128,
max-num-batched-tokens=32768, kv-cache=fp8.
| Build | Sequential per-doc | Concurrent per-page | Best ms/page | vs bf16 |
|---|---|---|---|---|
| bf16 baseline | 12724 ms | 12642 ms | 12642 | 1.0× |
| FP8_DYNAMIC | 5434 ms | 9525 ms | 5434 | 2.3× |
| NVFP4A16 | 12280 ms | 5058 ms | 5058 | 2.5× |
| NVFP4 (W4A4) | 10092 ms | 5794 ms | 5794 | 2.2× |
Take-away: NVFP4 W4A4 sits behind both FP8_DYNAMIC (faster sequential) and NVFP4A16 (faster concurrent) on real OCR workloads. Keep it as a reference point or for memory-constrained deployments — production should prefer NVFP4A16.
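The last two columns of the table follow from the first two. As a quick check, assuming "Best ms/page" is the smaller of the two timings and "vs bf16" is the bf16 best divided by each build's best:

```python
# (sequential, concurrent) timings in ms, copied from the table above.
results = {
    "bf16 baseline": (12724, 12642),
    "FP8_DYNAMIC": (5434, 9525),
    "NVFP4A16": (12280, 5058),
    "NVFP4 (W4A4)": (10092, 5794),
}


def best_ms(build: str) -> int:
    """Best ms/page: the faster of the sequential and concurrent runs."""
    return min(results[build])


def speedup(build: str) -> float:
    """Speedup over the bf16 baseline, rounded to one decimal."""
    return round(best_ms("bf16 baseline") / best_ms(build), 1)


for build in results:
    print(f"{build}: best {best_ms(build)} ms/page, {speedup(build)}x vs bf16")
```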
Usage
vLLM (OpenAI-compatible server) — recommended
vllm serve dangvansam/chandra-ocr-2-NVFP4 \
    --served-model-name chandra \
    --max-model-len 16384 \
    --max-num-seqs 128 \
    --max-num-batched-tokens 32768 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.92 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --trust-remote-code \
    --mm-processor-kwargs '{"min_pixels": 3136, "max_pixels": 6291456}'
from openai import OpenAI
import base64
import pathlib

# Point the OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Send the page image inline as a base64 data URL.
img_b64 = base64.b64encode(pathlib.Path("page.png").read_bytes()).decode()

resp = client.chat.completions.create(
    model="chandra",  # matches --served-model-name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            {"type": "text", "text": "<ocr_layout>"},
        ],
    }],
    max_tokens=12000,
    temperature=0.0,
)
print(resp.choices[0].message.content)
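For multi-page documents, the benchmark's concurrent per-page mode corresponds to sending one request per page in parallel. The following is a minimal sketch of that fan-out using only the standard library. `build_ocr_messages` and `ocr_pages` are hypothetical helpers, and the actual request is injected as `ocr_fn` so the fan-out logic can be tried without a running server.

```python
import base64
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def build_ocr_messages(image_bytes: bytes, prompt: str = "<ocr_layout>",
                       mime: str = "image/png") -> list:
    """Build the chat payload for one page, with the image as a data URL."""
    img_b64 = base64.b64encode(image_bytes).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{img_b64}"}},
            {"type": "text", "text": prompt},
        ],
    }]


def ocr_pages(pages: Sequence[bytes], ocr_fn: Callable[[list], str],
              max_workers: int = 8) -> list:
    """Run ocr_fn on every page concurrently; results keep page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda page: ocr_fn(build_ocr_messages(page)), pages))


# Stand-in for a real request, so the sketch runs without a server.
def fake_ocr(messages: list) -> str:
    url = messages[0]["content"][0]["image_url"]["url"]
    return f"page with {len(url)}-char data URL"


print(ocr_pages([b"page-1-bytes", b"page-2-bytes"], fake_ocr))
```

Against the server above, `ocr_fn` would wrap `client.chat.completions.create(model="chandra", messages=messages, max_tokens=12000, temperature=0.0)` and return `choices[0].message.content`. The `max_workers` value of 8 is an arbitrary default, not a tuned setting.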
HuggingFace Transformers
The vision tower is kept in bf16, so transformers ≥ 5.2 loads this
checkpoint via AutoModelForImageTextToText. Use the upstream snippet
with dangvansam/chandra-ocr-2-NVFP4 substituted for the base model id.
When to pick which Chandra-2 quant
| Workload | Pick |
|---|---|
| Page-concurrent fan-out on Blackwell | NVFP4A16 |
| Single sequential request per doc | FP8_DYNAMIC |
| Ada / Hopper GPU, FP8 acceptable | FP8_DYNAMIC |
| Max compression, accuracy not critical | NVFP4 W4A4 (this repo) |
| Reference accuracy / older hardware | upstream bf16 |
Files
- model.safetensors: NVFP4 W4A4-packed weights (~11 GB)
- config.json, processor_config.json, preprocessor_config.json, tokenizer.json, tokenizer_config.json, chat_template.jinja, generation_config.json: copied from upstream
- recipe.yaml: exact llm-compressor recipe used
License & attribution
Inherits the upstream OpenRAIL-M license from
datalab-to/chandra-ocr-2. Free for research, personal use, and
startups <$2M; not for use competing with Datalab's hosted API.
For broader commercial use see Datalab pricing.
This is an unofficial community quant. No additional weights or data were added — only a numerical re-encoding of the upstream model. All credit for the model itself goes to Datalab.
Citation
@misc{chandra_ocr_2,
  author = {Datalab},
  title  = {Chandra OCR 2},
  year   = {2026},
  url    = {https://huggingface.co/datalab-to/chandra-ocr-2}
}