Libraries

How to use dangvansam/chandra-ocr-2-NVFP4 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="dangvansam/chandra-ocr-2-NVFP4")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
# AutoModelForImageTextToText is the class this card names under "HuggingFace Transformers"
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("dangvansam/chandra-ocr-2-NVFP4")
model = AutoModelForImageTextToText.from_pretrained("dangvansam/chandra-ocr-2-NVFP4", device_map="auto")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use dangvansam/chandra-ocr-2-NVFP4 with vLLM:

Install from pip and serve model

# Install vLLM from pip (this checkpoint needs vLLM >= 0.19.1, see "Hardware requirements"):
pip install "vllm>=0.19.1"
# Start the vLLM server:
vllm serve "dangvansam/chandra-ocr-2-NVFP4"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dangvansam/chandra-ocr-2-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/dangvansam/chandra-ocr-2-NVFP4

SGLang

How to use dangvansam/chandra-ocr-2-NVFP4 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "dangvansam/chandra-ocr-2-NVFP4" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "dangvansam/chandra-ocr-2-NVFP4",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "dangvansam/chandra-ocr-2-NVFP4" \
        --host 0.0.0.0 \
        --port 30000

Docker Model Runner
How to use dangvansam/chandra-ocr-2-NVFP4 with Docker Model Runner:
```
docker model run hf.co/dangvansam/chandra-ocr-2-NVFP4
```

chandra-ocr-2 — NVFP4 (W4A4)

NVFP4 W4A4 (4-bit weight, 4-bit activation) quantization of datalab-to/chandra-ocr-2 produced with llm-compressor and packed as compressed-tensors for native vLLM inference.

This is the maximum-compression variant. Both weights and activations are FP4, so it needs Blackwell's native FP4 tensor cores to run efficiently; older GPUs either emulate FP4 in software or lack it entirely (see Hardware requirements). In our OCR benchmark it was slower than NVFP4A16 on concurrent per-page latency (5794 vs 5058 ms/page) despite the smaller activation footprint: for long-output OCR workloads, the W4A4 path did not turn the extra compression into wall-clock gains. Ship the NVFP4A16 sibling instead unless you specifically need W4A4 for memory reasons.

For the original model description, intended uses, accuracy benchmarks (olmOCR-bench, 90-language) and license terms, see the upstream card: https://huggingface.co/datalab-to/chandra-ocr-2.

Quantization recipe

# recipe.yaml (shipped in this repo)
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore:
        - 're:.*lm_head'
        - 're:visual.*'             # keep ViT vision tower bf16
        - 're:model.visual.*'
        - 're:.*mlp.gate$'
        - 're:.*embed_tokens$'
        - 're:.*shared_expert_gate$'
        - 're:.*mlp\.shared_expert$'
        - 're:.*linear_attn.*'
      scheme: NVFP4          # W4A4

Weights: NVFP4 (4-bit microscaled FP4)
Activations: NVFP4 (4-bit, computed dynamically)
Vision tower, lm_head, MoE gates, linear_attn.* kept in bf16.
Calibration: 512 samples × 4096 tokens from HuggingFaceH4/ultrachat_200k.
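
To make the `ignore` list in the recipe easier to read, the sketch below checks a few module names against the same patterns. It assumes the `re:` entries are matched from the start of the module name (the behaviour of Python's `re.match`), and the module names are hypothetical, chosen only for illustration; the real names depend on the checkpoint.

```python
import re

# Ignore patterns copied from recipe.yaml, with the "re:" prefix stripped.
IGNORE_PATTERNS = [
    r".*lm_head",
    r"visual.*",
    r"model.visual.*",
    r".*mlp.gate$",
    r".*embed_tokens$",
    r".*shared_expert_gate$",
    r".*mlp\.shared_expert$",
    r".*linear_attn.*",
]


def is_ignored(module_name: str) -> bool:
    """True if any pattern matches from the start of the name (assumed semantics)."""
    return any(re.match(pattern, module_name) for pattern in IGNORE_PATTERNS)


# Hypothetical module names, for illustration only.
EXAMPLE_NAMES = [
    "model.visual.blocks.0.attn.qkv",                  # vision tower
    "model.language_model.layers.0.self_attn.q_proj",  # regular Linear layer
    "model.language_model.layers.0.mlp.gate",          # MoE router gate
    "model.language_model.layers.0.mlp.gate_proj",     # MLP projection, not a gate
    "lm_head",
]

for name in EXAMPLE_NAMES:
    status = "kept in bf16" if is_ignored(name) else "quantized to NVFP4"
    print(f"{name}: {status}")
```

Under this assumption, `visual.*` alone would not catch a name that begins with `model.`, which is presumably why the recipe lists both `visual.*` and `model.visual.*`. The trailing `$` on `.*mlp.gate$` keeps `gate_proj` out of the ignore list.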

Hardware requirements

GPU family	Compute capability	NVFP4 status	Recommended?
Blackwell (RTX PRO 6000, B100/B200, RTX 5090)	sm_100+	Native FP4 tensor cores	✅ only here
Hopper (H100/H200)	sm_90	Software emulation	❌ slower than FP8
Ada (RTX 4090/L40S)	sm_89	No FP4	❌ use FP8_DYNAMIC
Ampere / older	≤ sm_86	No FP4	❌

vLLM ≥ 0.19.1 required (compressed-tensors NVFP4 W4A4 VL kernel landed there; v0.17 rejects with Unsupported data_type: nv_fp).
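
The hardware table above can be turned into a small preflight check. This is a minimal sketch: the function name `nvfp4_status` and its return strings are made up for illustration, and the thresholds simply restate the table rows.

```python
def nvfp4_status(major: int, minor: int) -> str:
    """Map a CUDA compute capability to the NVFP4 status from the hardware table."""
    capability = (major, minor)
    if capability >= (10, 0):
        return "native"       # Blackwell (sm_100+): native FP4 tensor cores
    if capability == (9, 0):
        return "emulated"     # Hopper (sm_90): software emulation, slower than FP8
    return "unsupported"      # Ada (sm_89), Ampere and older: no FP4


# Example capabilities: (12, 0), (9, 0), (8, 9), (8, 6)
for cap in [(12, 0), (9, 0), (8, 9), (8, 6)]:
    print(f"sm_{cap[0]}{cap[1]}: {nvfp4_status(*cap)}")
```

With PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple for the current GPU, so `nvfp4_status(*torch.cuda.get_device_capability())` could run before starting the server.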

Benchmark (vs. other Chandra-2 quants)

Test bed: RTX PRO 6000 Blackwell Max-Q (96 GB), 14-page Vietnamese financial-statement PDF, vLLM 0.19.1, max-num-seqs=128, max-num-batched-tokens=32768, kv-cache=fp8.

Build	Sequential per-doc	Concurrent per-page	Best ms/page	vs bf16
bf16 baseline	12724 ms	12642 ms	12642	1.0×
FP8_DYNAMIC	5434 ms	9525 ms	5434	2.3×
NVFP4A16	12280 ms	5058 ms	5058	2.5×
NVFP4 (W4A4)	10092 ms	5794 ms	5794	2.2×

Take-away: NVFP4 W4A4 sits behind both FP8_DYNAMIC (faster sequential) and NVFP4A16 (faster concurrent) on real OCR workloads. Keep it as a reference point or for memory-constrained deployments — production should prefer NVFP4A16.
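
For readers who want to re-derive the last two columns, the sketch below recomputes "Best ms/page" and the speedup from the timings in the table. It assumes "Best" is the lower of the two measured modes and that the speedup is the bf16 best divided by each build's best, rounded to one decimal; the helper name `summarize` is illustrative.

```python
# (sequential per-doc ms, concurrent per-page ms), copied from the benchmark table.
RESULTS = {
    "bf16 baseline": (12724, 12642),
    "FP8_DYNAMIC": (5434, 9525),
    "NVFP4A16": (12280, 5058),
    "NVFP4 (W4A4)": (10092, 5794),
}


def summarize(results, baseline="bf16 baseline"):
    """Return {build: (best_ms, speedup_vs_baseline)} for every build."""
    best = {name: min(times) for name, times in results.items()}
    base = best[baseline]
    return {name: (ms, round(base / ms, 1)) for name, ms in best.items()}


for name, (ms, speedup) in summarize(RESULTS).items():
    print(f"{name:14s} best {ms:>6d} ms/page  {speedup}x vs bf16")
```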

Usage

vLLM (OpenAI-compatible server) — recommended

vllm serve dangvansam/chandra-ocr-2-NVFP4 \
  --served-model-name chandra \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 32768 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --trust-remote-code \
  --mm-processor-kwargs '{"min_pixels": 3136, "max_pixels": 6291456}'

from openai import OpenAI
import base64, pathlib

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
img_b64 = base64.b64encode(pathlib.Path("page.png").read_bytes()).decode()

resp = client.chat.completions.create(
    model="chandra",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            {"type": "text", "text": "<ocr_layout>"},
        ],
    }],
    max_tokens=12000,
    temperature=0.0,
)
print(resp.choices[0].message.content)
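
Since the benchmark distinguishes sequential from page-concurrent requests, here is a minimal sketch of fanning out one request per page against the same server. The helper names (`build_ocr_messages`, `ocr_page`, `ocr_pages`) and the default of 8 worker threads are assumptions for illustration, not part of this repo; `client` is expected to be an `OpenAI` client configured as in the snippet above.

```python
import base64
from concurrent.futures import ThreadPoolExecutor


def build_ocr_messages(png_bytes: bytes, prompt: str = "<ocr_layout>") -> list:
    """Build the chat payload for one page, mirroring the single-page example."""
    img_b64 = base64.b64encode(png_bytes).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            {"type": "text", "text": prompt},
        ],
    }]


def ocr_page(client, png_bytes: bytes, model: str = "chandra") -> str:
    """Send one page to the server and return the generated text."""
    resp = client.chat.completions.create(
        model=model,
        messages=build_ocr_messages(png_bytes),
        max_tokens=12000,
        temperature=0.0,
    )
    return resp.choices[0].message.content


def ocr_pages(client, pages: list, max_workers: int = 8) -> list:
    """OCR several pages concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda page: ocr_page(client, page), pages))
```

A typical call would read each page image into bytes (for example with `pathlib.Path(...).read_bytes()`) and pass the list to `ocr_pages(client, pages)`. Threads are enough here because the work is network-bound and the server does the batching.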

HuggingFace Transformers

Because the vision tower is kept in bf16, transformers ≥ 5.2 can load this checkpoint via AutoModelForImageTextToText. Use the upstream snippet, substituting dangvansam/chandra-ocr-2-NVFP4 for the base model id.

When to pick which Chandra-2 quant

Workload	Pick
Page-concurrent fan-out on Blackwell	NVFP4A16
Single sequential request per doc	FP8_DYNAMIC
Ada / Hopper GPU, FP8 acceptable	FP8_DYNAMIC
Max compression, accuracy not critical	NVFP4 W4A4 (this repo)
Reference accuracy / older hardware	upstream bf16

Files

model.safetensors — NVFP4 W4A4-packed weights (~11 GB)
config.json, processor_config.json, preprocessor_config.json, tokenizer.json, tokenizer_config.json, chat_template.jinja, generation_config.json — copied from upstream
recipe.yaml — exact llm-compressor recipe used

License & attribution

Inherits the upstream OpenRAIL-M license from datalab-to/chandra-ocr-2. Free for research, personal use, and startups <$2M; not for use competing with Datalab's hosted API. For broader commercial use see Datalab pricing.

This is an unofficial community quant. No additional weights or data were added — only a numerical re-encoding of the upstream model. All credit for the model itself goes to Datalab.

Citation

@misc{chandra_ocr_2,
  author = {Datalab},
  title  = {Chandra OCR 2},
  year   = {2026},
  url    = {https://huggingface.co/datalab-to/chandra-ocr-2}
}

Downloads last month: 2,246

Safetensors
Model size: 3B params
Tensor types: F32, BF16, F8_E4M3

Model tree for dangvansam/chandra-ocr-2-NVFP4
Base model: datalab-to/chandra-ocr-2
Quantized (35): this model