Chandra GPTQ 4-bit

Chandra-GPTQ-4bit is a 4-bit GPTQ-quantized version of datalab-to/chandra, optimized for efficient inference with vLLM on modern GPUs such as the A100.

This quantization substantially reduces VRAM usage while preserving the base model's OCR and document-understanding performance.


🔎 Model Overview

  • Base Model: datalab-to/chandra
  • Quantization: GPTQ 4-bit
  • Precision: W4A16
  • Group Size: 128
  • Symmetric Quantization: Yes
  • Format: GPTQ
  • Primary Use: OCR / Image-to-Text / Document Understanding
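
These settings map onto a standard GPTQ configuration. As a rough illustration, here is how they would look using the auto-gptq library's BaseQuantizeConfig (an assumption made for illustration; the card does not state which toolchain produced this checkpoint):

# Illustrative sketch only: the card's settings expressed as an auto-gptq
# quantization config. The tool actually used for this checkpoint is not stated.
from auto_gptq import BaseQuantizeConfig

quantize_config = BaseQuantizeConfig(
    bits=4,          # W4: 4-bit integer weights; activations stay 16-bit (W4A16)
    group_size=128,  # one quantization scale per group of 128 weights
    sym=True,        # symmetric quantization (no zero-point offset)
)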

✨ Capabilities

Chandra is a multimodal OCR model capable of:

  • Converting documents to Markdown, HTML, or JSON
  • Preserving detailed layout information
  • Extracting tables, forms, and structured data
  • Handling handwriting
  • Reconstructing checkboxes and form elements
  • Extracting images and captions
  • Supporting 40+ languages
  • Understanding complex layouts (math, tables, diagrams)

🚀 Usage with vLLM (Recommended)

Install vLLM:

pip install vllm

Start an OpenAI-compatible API server ($PORT and $GPU_UTIL are shell placeholders for your port and GPU-memory fraction):

python -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 \
  --port $PORT \
  --model kishlay9890/chandra-gptq-4bit \
  --max-model-len 8192 \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization $GPU_UTIL \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 8 \
  --seed 1234
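
Once the server is running, you can send document images through the standard OpenAI chat-completions API. A minimal Python client sketch (this assumes the server listens on localhost:8000; invoice.png is a hypothetical input file):

# Minimal client sketch for the OpenAI-compatible endpoint started above.
# Assumes port 8000; "invoice.png" is a hypothetical input file.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("invoice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="kishlay9890/chandra-gptq-4bit",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Convert this document to Markdown."},
        ],
    }],
    max_tokens=4096,
)
print(response.choices[0].message.content)

Changing the text prompt (e.g. asking for HTML or JSON instead of Markdown) selects the other output formats listed under Capabilities.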

💾 Hardware Requirements

Recommended:

  • NVIDIA A100 (40 GB or higher)
  • CUDA 12+
  • 16–24 GB of VRAM minimum for inference

Compared to FP16 models, this 4-bit version significantly lowers memory requirements, as the estimate below illustrates.
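
Weight memory scales with bits per parameter. A back-of-envelope Python estimate, assuming roughly 9B parameters for this checkpoint (KV cache and activations come on top of this):

# Rough weight-memory estimate; assumes ~9B parameters.
# KV cache and activation memory are extra.
params = 9e9
fp16_gib  = params * 2 / 2**30         # 2 bytes/param  -> ~16.8 GiB
int4_gib  = params * 0.5 / 2**30       # 4 bits/param   -> ~4.2 GiB
scale_gib = params / 128 * 2 / 2**30   # one FP16 scale per 128-weight group
print(f"FP16 weights:       {fp16_gib:.1f} GiB")
print(f"GPTQ 4-bit weights: {int4_gib + scale_gib:.1f} GiB")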


📜 License

Please refer to the original base model (datalab-to/chandra) for license details.


🙌 Acknowledgements

  • Base model: datalab-to/chandra
  • Quantization: GPTQ
