chandra-ocr-2 — NVFP4A16 (W4A16)

NVFP4A16 (4-bit weight, 16-bit activation) quantization of datalab-to/chandra-ocr-2 produced with llm-compressor and packed as compressed-tensors for native vLLM inference.

Fastest of the three quants we measured on Blackwell for page-level concurrent fan-out, the path that matters most in real OCR pipelines: about 5058 ms/page (≈ 0.20 pages/s), 2.5× faster than bf16.

For the original model description, intended uses, accuracy benchmarks (olmOCR-bench, 90-language) and license terms, see the upstream card: https://huggingface.co/datalab-to/chandra-ocr-2.

Quantization recipe

# recipe.yaml (shipped in this repo)
default_stage:
  default_modifiers:
    QuantizationModifier:
      targets: [Linear]
      ignore:
        - lm_head
        - 're:.*visual.*'        # keep vision tower in bf16
        - 're:.*linear_attn.*'   # keep linear-attn fp16
      scheme: NVFP4A16

Weights: NVFP4 (4-bit microscaled FP4, Blackwell-native)
Activations: FP16
Left unquantized in fp16/bf16: lm_head, the entire visual.* ViT tower, and linear_attn.* (following the upstream Qwen3.5-VL NVFP4 recipe).
Calibration: 512 samples × 4096 tokens from HuggingFaceH4/ultrachat_200k.
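
To make the effect of the ignore list concrete, here is a minimal sketch of how such entries could select modules. It assumes a simplified matching rule (entries prefixed with re: are regular expressions matched from the start of the module name, other entries must equal the name exactly); this approximates the behavior and is not llm-compressor's actual code. The module names are hypothetical and only illustrate the pattern.

```python
import re

# Ignore list copied from recipe.yaml.
IGNORE = ["lm_head", "re:.*visual.*", "re:.*linear_attn.*"]


def is_ignored(module_name: str, ignore=IGNORE) -> bool:
    """Return True if a module name matches any ignore entry.

    Simplified rule (an assumption, not llm-compressor's exact logic):
    "re:" entries are regexes matched from the start of the name,
    every other entry must equal the name exactly.
    """
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
names = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.linear_attn.in_proj",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.language_model.layers.3.mlp.down_proj",
]

# Modules that would receive NVFP4 weights under the simplified rule.
quantized = [n for n in names if not is_ignored(n)]

for n in names:
    status = "kept 16-bit" if is_ignored(n) else "NVFP4 weights"
    print(f"{n}: {status}")
```

Under this rule only the two language-model projections at the end of the list are quantized; the head, the vision tower, and the linear-attention layers stay in 16-bit.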

Hardware requirements

| GPU family | Compute capability | NVFP4 status | Recommended? |
| --- | --- | --- | --- |
| Blackwell (RTX PRO 6000, B100/B200, RTX 5090) | sm_100+ | Native FP4 tensor cores | ✅ ideal |
| Hopper (H100/H200) | sm_90 | Software emulation via Marlin | runnable but slower; prefer FP8_DYNAMIC |
| Ada (RTX 4090/L40S) | sm_89 | No FP4 | ❌ use FP8_DYNAMIC variant |
| Ampere / older | ≤ sm_86 | No FP4 | ❌ use BF16 / FP8 elsewhere |
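
The table can be turned into a small lookup, sketched below. The function name recommend_build and its return strings are illustrative assumptions, not part of any library; capabilities the table does not list (anything below sm_89) are treated as "older" here, which is also an assumption.

```python
def recommend_build(major: int, minor: int) -> str:
    """Map a CUDA compute capability to the build suggested by the table.

    Hypothetical helper: the name and return strings are illustrative.
    """
    cc = (major, minor)
    if cc >= (10, 0):
        # Blackwell: native FP4 tensor cores.
        return "NVFP4A16"
    if cc == (9, 0):
        # Hopper: NVFP4 runs through Marlin emulation but is slower.
        return "FP8_DYNAMIC"
    if cc == (8, 9):
        # Ada: no FP4 support.
        return "FP8_DYNAMIC"
    # Ampere and older (assumption: also any capability not listed).
    return "BF16"


for cc in [(12, 0), (10, 0), (9, 0), (8, 9), (8, 6)]:
    print(f"sm_{cc[0]}{cc[1]}: {recommend_build(*cc)}")
```

With PyTorch installed, torch.cuda.get_device_capability() returns such a (major, minor) pair for the current GPU and could be passed straight in.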

vLLM ≥ 0.19.1 is required: the compressed-tensors NVFP4A16 VL kernel landed in that release, and v0.17 rejects the checkpoint with "Unsupported data_type: nv_fp".
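
A quick pre-flight check for the version requirement might look like the sketch below. The helper names are hypothetical, and the comparison is simplified: it looks only at the numeric release part, so a pre-release such as 0.19.1rc1 is treated as 0.19.1.

```python
import re
from importlib.metadata import PackageNotFoundError, version

MIN_VLLM = "0.19.1"  # minimum version stated in this card


def release_tuple(version_string: str) -> tuple:
    """Extract the leading numeric release, e.g. '0.19.1rc2' -> (0, 19, 1)."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def vllm_is_new_enough(installed: str, minimum: str = MIN_VLLM) -> bool:
    """Compare numeric release tuples; pre-release suffixes are ignored."""
    return release_tuple(installed) >= release_tuple(minimum)


try:
    installed = version("vllm")
    print(f"vllm {installed}: new enough = {vllm_is_new_enough(installed)}")
except PackageNotFoundError:
    print("vllm is not installed in this environment")
```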

Benchmark (vs. other Chandra-2 quants)

Test bed: RTX PRO 6000 Blackwell Max-Q (96 GB), 14-page Vietnamese financial-statement PDF, vLLM 0.19.1, max-num-seqs=128, max-num-batched-tokens=32768, kv-cache=fp8.

| Build | Sequential per-doc (ms/page) | Concurrent per-page (ms/page) | Best (ms/page) | vs bf16 |
| --- | --- | --- | --- | --- |
| bf16 baseline | 12724 | 12642 | 12642 | 1.0× |
| FP8_DYNAMIC | 5434 | 9525 | 5434 | 2.3× |
| NVFP4A16 | 12280 | 5058 | 5058 | 2.5× |
| NVFP4 (W4A4) | 10092 | 5794 | 5794 | 2.2× |

Take-away: NVFP4A16 wins only with concurrent page fan-out (continuous batching). Single-request serial loops favor FP8_DYNAMIC.
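
The throughput and speedup figures follow directly from the best ms/page column. The short sketch below redoes that arithmetic; the numbers are copied from the table, and the helper name summarize is illustrative.

```python
# Best ms/page per build, copied from the benchmark table.
best_ms_per_page = {
    "bf16 baseline": 12642,
    "FP8_DYNAMIC": 5434,
    "NVFP4A16": 5058,
    "NVFP4 (W4A4)": 5794,
}


def summarize(ms_per_page: float, baseline_ms: float) -> tuple:
    """Return (pages per second, speedup over baseline), rounded as in the card."""
    pages_per_s = round(1000 / ms_per_page, 2)
    speedup = round(baseline_ms / ms_per_page, 1)
    return pages_per_s, speedup


baseline = best_ms_per_page["bf16 baseline"]
for build, ms in best_ms_per_page.items():
    pages_per_s, speedup = summarize(ms, baseline)
    print(f"{build:14s} {pages_per_s:.2f} pages/s  {speedup:.1f}x vs bf16")
```

For NVFP4A16 this gives 1000 / 5058 ≈ 0.20 pages/s and 12642 / 5058 ≈ 2.5×, matching the figures quoted above.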

Usage

vLLM (OpenAI-compatible server) — recommended

vllm serve dangvansam/chandra-ocr-2-NVFP4A16 \
  --served-model-name chandra \
  --max-model-len 16384 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 32768 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching \
  --enable-chunked-prefill \
  --trust-remote-code \
  --mm-processor-kwargs '{"min_pixels": 3136, "max_pixels": 6291456}'

# Client — call exactly like the bf16 original
from openai import OpenAI
import base64, pathlib

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
img_b64 = base64.b64encode(pathlib.Path("page.png").read_bytes()).decode()

resp = client.chat.completions.create(
    model="chandra",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            {"type": "text", "text": "<ocr_layout>"},
        ],
    }],
    max_tokens=12000,
    temperature=0.0,
)
print(resp.choices[0].message.content)
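
Since this quant only pays off with page-level concurrent fan-out, here is a minimal sketch of one way to send pages in parallel from a client. The helper names (build_messages, ocr_pages), the send callback, and max_workers=16 are illustrative assumptions rather than part of this repo; the stub stands in for a real request so the sketch runs without a server.

```python
import base64
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def build_messages(png_bytes: bytes, prompt: str = "<ocr_layout>") -> list:
    """Build the same single-turn payload as the client example, for one page."""
    img_b64 = base64.b64encode(png_bytes).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{img_b64}"}},
            {"type": "text", "text": prompt},
        ],
    }]


def ocr_pages(pages: List[bytes],
              send: Callable[[list], str],
              max_workers: int = 16) -> List[str]:
    """Send every page concurrently; results are returned in page order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda page: send(build_messages(page)), pages))


def fake_send(messages: list) -> str:
    """Stub for a real request: reports the size of the encoded image URL."""
    url = messages[0]["content"][0]["image_url"]["url"]
    return f"received {len(url)} characters"


if __name__ == "__main__":
    demo_pages = [b"page-1-bytes", b"page-2-bytes", b"page-3-bytes"]
    for i, text in enumerate(ocr_pages(demo_pages, fake_send), start=1):
        print(i, text)
```

With the OpenAI client from the example above, send could wrap client.chat.completions.create(model="chandra", messages=..., max_tokens=12000, temperature=0.0) and return resp.choices[0].message.content. Threads are used because each request is I/O-bound, and the server's continuous batching does the actual work of running pages together.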

HuggingFace Transformers

The vision tower is left in bf16, so transformers ≥ 5.2 loads this checkpoint identically to the upstream — only the LLM-side weights are 4-bit. Use the snippet from the upstream card, replacing "datalab-to/chandra-ocr-2" with "dangvansam/chandra-ocr-2-NVFP4A16".

When to pick which Chandra-2 quant

| Workload | Pick |
| --- | --- |
| Page-concurrent fan-out on Blackwell | NVFP4A16 (this repo) |
| Single sequential request per doc | FP8_DYNAMIC |
| Ada / Hopper GPU, FP8 acceptable | FP8_DYNAMIC |
| Max compression, accuracy not critical | NVFP4 (W4A4) |
| Reference accuracy / older hardware | upstream bf16 |

Files

model.safetensors — NVFP4A16-packed weights (~11 GB)
config.json, processor_config.json, preprocessor_config.json, tokenizer.json, tokenizer_config.json, chat_template.jinja, generation_config.json — copied from upstream
recipe.yaml — exact llm-compressor recipe used

License & attribution

Inherits the upstream OpenRAIL-M license from datalab-to/chandra-ocr-2. Free for research, personal use, and startups <$2M; not for use competing with Datalab's hosted API. For broader commercial use see Datalab pricing.

This is an unofficial community quant. No additional weights or data were added — only a numerical re-encoding of the upstream model. All credit for the model itself goes to Datalab.

Citation

@misc{chandra_ocr_2,
  author = {Datalab},
  title  = {Chandra OCR 2},
  year   = {2026},
  url    = {https://huggingface.co/datalab-to/chandra-ocr-2}
}
