
Sarvam-30b GPTQ Quantized (W4A16)

This is a 4-bit quantized version of sarvamai/sarvam-30b, created using the LLM Compressor library.

Compression Details

This model was quantized using Post-Training Quantization (PTQ) with the GPTQ algorithm. No pruning or distillation was applied.

1. Algorithm Configuration

The following parameters were used for the GPTQ quantization process:

  • Quantization Method: GPTQ
  • Weight Precision: 4-bit (W4A16 - 4-bit weights, 16-bit activations)
  • Scheme: W4A16
  • Block Size / Group Size: 128
  • Sequential Update: True
  • Symmetry: True
  • Target Modules: All Linear layers
  • Ignored Modules: lm_head and re:.*gate.* (MoE router gates were kept in FP16 to preserve routing accuracy)
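For reference, the configuration above maps roughly onto the following LLM Compressor recipe. This is a minimal sketch, not the exact script used; argument names follow the library's published examples and may differ between llmcompressor versions.

```python
# Hypothetical reconstruction of the quantization recipe described above.
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",                   # quantize all Linear layers
    scheme="W4A16",                     # 4-bit weights, 16-bit activations, group size 128
    block_size=128,                     # GPTQ processing block size
    ignore=["lm_head", "re:.*gate.*"],  # keep lm_head and the MoE router gates in FP16
)
```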

2. Calibration Dataset

  • Dataset: LinguaLift/IndicMMLU-Pro (Hindi subset)
  • Split: Validation
  • Number of Samples: 512
  • Maximum Sequence Length: 2048
  • Preprocessing: Samples were formatted using the model's chat template (User/Assistant format combining the question, options, and cot_content fields) before tokenization.
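A sketch of how this calibration set could be prepared and passed to a one-shot run. This is illustrative only: the dataset configuration name and the exact chat formatting are assumptions, and `recipe` is the `GPTQModifier` from the snippet above.

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from llmcompressor import oneshot  # older versions: from llmcompressor.transformers import oneshot

MODEL_ID = "sarvamai/sarvam-30b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# 512 samples from the Hindi validation split (the "hindi" config name is an assumption)
ds = load_dataset("LinguaLift/IndicMMLU-Pro", "hindi", split="validation")
ds = ds.shuffle(seed=42).select(range(512))

def to_chat(sample):
    # Combine the question, options and cot_content fields into a User/Assistant exchange
    user_turn = f"{sample['question']}\n{sample['options']}"
    messages = [
        {"role": "user", "content": user_turn},
        {"role": "assistant", "content": sample["cot_content"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

ds = ds.map(to_chat)

oneshot(
    model=MODEL_ID,
    dataset=ds,
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="sarvam-30b-gptq-4bit",
)
```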

3. Software & Hardware

YAML Config File Details

The YAML block at the top of this README serves as the configuration file, containing the model weights metadata and the relevant quantization parameters required to load and interpret the model correctly. Specifically:

  • quant_method: gptq: Tells the inference engine (vLLM/Transformers) which kernel to use.
  • bits: 4 / scheme: W4A16: Defines the precision of weights and activations.
  • group_size: 128 / block_size: 128: Defines the granularity of quantization.
  • desc_act: true / sequential_update: true: Indicates the order of weight processing during quantization (improves accuracy).
  • ignore_modules: Specifies which layers were excluded from quantization (critical for MoE routing layers).
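To double-check these fields after downloading, the quantization metadata can be read straight from the model config. This is a quick sanity check, assuming the standard `quantization_config` entry in `config.json`:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("amir22010/sarvam-30b-gptq-4bit", trust_remote_code=True)
# Should echo the quant_method, bits, group_size and ignored modules listed above
print(cfg.quantization_config)
```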

Usage

This model can be deployed efficiently using vLLM or Hugging Face Transformers.

vLLM Example (Recommended for L4/T4 GPUs):

```bash
python -m vllm.entrypoints.openai.api_server \
    --model amir22010/sarvam-30b-gptq-4bit \
    --quantization gptq \
    --dtype float16 \
    --max-model-len 2048 \
    --trust-remote-code
```
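The server exposes an OpenAI-compatible API, so any OpenAI client can talk to it. A minimal example follows; the prompt and sampling values are illustrative:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; the api_key is a placeholder unless you configure one
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="amir22010/sarvam-30b-gptq-4bit",
    prompt="भारत की राजधानी",   # "The capital of India"
    max_tokens=50,
    temperature=0.7,
)
print(response.choices[0].text)
```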

Transformers Example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amir22010/sarvam-30b-gptq-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True
)
```
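A short generation call on top of the snippet above (the Hindi prompt and sampling settings are just examples):

```python
# Tokenize an example prompt and generate a short completion
prompt = "भारत की राजधानी"   # "The capital of India"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```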

# `vllm_config.yaml` for GPTQ W4A16 Quantized Sarvam-30b

Based on the compression setup above, here is a well-tuned configuration:

```yaml
# === Model ===
model: sarvam-30b-gptq-4bit          # Path to your quantized model directory
trust_remote_code: true               # Required for sarvam models
dtype: float16                        # W4A16 → weights are int4, activations are fp16

# === Quantization ===
quantization: gptq_marlin             # Use Marlin kernel for GPTQ (much faster than plain gptq)
# If Marlin doesn't work, fall back to:
# quantization: gptq

# === Parallelism ===
tensor_parallel_size: 2               # Adjust: 1 for A100-80GB, 2 for A100-40GB / A6000, etc.
# pipeline_parallel_size: 1           # Increase if spanning multiple nodes

# === Context Length ===
max_model_len: 8192                   # Max sequence length for serving (calibration used 2048, but serving length can be higher)
# Note: MoE models have large KV caches; reduce if OOM

# === Memory Management ===
gpu_memory_utilization: 0.90          # Fraction of GPU memory to use (default: 0.9)
kv_cache_dtype: auto                  # Use "fp8_e5m2" on Ada/Hopper GPUs to save KV cache memory

# === Performance ===
max_num_seqs: 64                      # Max concurrent sequences
max_num_batched_tokens: 32768         # Max tokens per batch — tune up/down for throughput vs latency
enforce_eager: false                  # Set true only if CUDA graph capture fails
# cuda_graph_sizes: [1, 2, 4, 8, 16, 32, 64, 128]  # Default CUDA graph capture sizes

# === Sampling Defaults ===
temperature: 1.0
top_p: 1.0
max_tokens: 4096                      # Default max tokens per completion

# === Server ===
host: 0.0.0.0
port: 8000
api_key: null                         # Set if you need auth

# === Optional: Speculative Decoding ===
# speculative_model: null             # Enable if you have a draft model
# speculative_max_tokens: 5

# === Optional: LoRA ===
# enable_lora: false
# max_loras: 1
# lora_modules: null

How to launch

```bash
vllm serve --config vllm_config.yaml
```

Or equivalently (without the config file):

```bash
vllm serve sarvam-30b-gptq-4bit \
  --trust-remote-code \
  --dtype float16 \
  --quantization gptq_marlin \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9
```

Key decisions explained

| Parameter | Why |
| --- | --- |
| `quantization: gptq_marlin` | Marlin is the optimized kernel for GPTQ in vLLM. It is 2-4× faster than the plain `gptq` backend, and vLLM automatically converts the GPTQ checkpoint to Marlin format on first load. |
| `dtype: float16` | The compression was W4A16: weights are stored as int4, but activations compute in fp16. Don't use `auto` here, since it may pick bfloat16 and cause dtype mismatches with the GPTQ checkpoint. |
| `tensor_parallel_size: 2` | Sarvam-30b is an MoE model. At W4A16 the weight footprint is roughly 15-20 GB, and MoE models still have large KV-cache demands on top of that. Adjust based on your GPU VRAM. |
| `max_model_len: 8192` | Calibration used 2048 tokens, but that only affects quantization quality; serving length is independent. If you hit OOM, drop to 4096 or 2048. |
| `kv_cache_dtype: auto` | On Ada/Hopper GPUs (e.g. L4/H100), switch to `fp8_e5m2` to nearly double the effective KV-cache capacity, which is the bottleneck for MoE models. |
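The ~15-20 GB weight figure in the table is a back-of-the-envelope estimate, not a measured number. A rough way to reproduce it is below; the 5% FP16 fraction for embeddings, lm_head and router gates is an assumption.

```python
# Rough weight-memory estimate for a ~30B-parameter model quantized to 4-bit,
# with a small fraction of weights kept in FP16 (embeddings, lm_head, router gates).
total_params = 30e9
fp16_fraction = 0.05                                     # assumed share of unquantized weights

int4_bytes = total_params * (1 - fp16_fraction) * 0.5    # 4 bits = 0.5 bytes per param
fp16_bytes = total_params * fp16_fraction * 2.0          # 16 bits = 2 bytes per param
print(f"~{(int4_bytes + fp16_bytes) / 1e9:.0f} GB of weights")  # ≈ 17 GB, before KV cache
```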

Troubleshooting

If Marlin fails to load

```yaml
quantization: gptq    # Fallback: slower but more compatible
```

If you get OOM

```yaml
max_model_len: 4096              # Reduce the context window
gpu_memory_utilization: 0.95     # Let vLLM use more of the GPU for KV cache
kv_cache_dtype: fp8_e5m2         # Only on Ada/Hopper GPUs
max_num_seqs: 32                 # Fewer concurrent requests
```

If CUDA graph capture crashes

```yaml
enforce_eager: true   # Disables CUDA graphs (slightly slower but stable)
```

Verify the model loads correctly

```bash
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sarvam-30b-gptq-4bit",
    "prompt": "भारत की राजधानी",
    "max_tokens": 50,
    "temperature": 0.7
  }'
```