
Sarvam-30b GPTQ Quantized (W4A16)

This is a 4-bit quantized version of sarvamai/sarvam-30b, created using the LLM Compressor library.

Compression Details

This model was quantized using Post-Training Quantization (PTQ) with the GPTQ algorithm. No pruning or distillation was applied.

1. Algorithm Configuration

The following parameters were used for the GPTQ quantization process:

  • Quantization Method: GPTQ
  • Weight Precision: 4-bit (W4A16 - 4-bit weights, 16-bit activations)
  • Scheme: W4A16
  • Block Size / Group Size: 128
  • Sequential Update: True
  • Symmetry: True
  • Target Modules: All Linear layers
  • Ignored Modules: lm_head and re:.*gate.* (MoE router gates were kept in FP16 to preserve routing accuracy)
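For reference, the configuration above maps roughly onto the following LLM Compressor recipe. This is a minimal sketch, not the exact script used; argument names follow the library's published examples and may differ between llmcompressor versions.

```python
# Hypothetical reconstruction of the quantization recipe described above.
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",                   # quantize all Linear layers
    scheme="W4A16",                     # 4-bit weights, 16-bit activations, group size 128
    block_size=128,                     # GPTQ processing block size
    ignore=["lm_head", "re:.*gate.*"],  # keep lm_head and the MoE router gates in FP16
)
```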

2. Calibration Dataset

  • Dataset: LinguaLift/IndicMMLU-Pro (Hindi subset)
  • Split: Validation
  • Number of Samples: 512
  • Maximum Sequence Length: 2048
  • Preprocessing: Samples were formatted using the model's chat template (User/Assistant format combining the question, options, and cot_content fields) before tokenization.
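A sketch of how this calibration set could be prepared and passed to a one-shot run. This is illustrative only: the dataset configuration name and the exact chat formatting are assumptions, and `recipe` is the `GPTQModifier` from the snippet above.

```python
from datasets import load_dataset
from transformers import AutoTokenizer
from llmcompressor import oneshot  # older versions: from llmcompressor.transformers import oneshot

MODEL_ID = "sarvamai/sarvam-30b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# 512 samples from the Hindi validation split (the "hindi" config name is an assumption)
ds = load_dataset("LinguaLift/IndicMMLU-Pro", "hindi", split="validation")
ds = ds.shuffle(seed=42).select(range(512))

def to_chat(sample):
    # Combine the question, options and cot_content fields into a User/Assistant exchange
    user_turn = f"{sample['question']}\n{sample['options']}"
    messages = [
        {"role": "user", "content": user_turn},
        {"role": "assistant", "content": sample["cot_content"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

ds = ds.map(to_chat)

oneshot(
    model=MODEL_ID,
    dataset=ds,
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="sarvam-30b-gptq-4bit",
)
```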

3. Software & Hardware

YAML Config File Details

The YAML block at the top of this README serves as the configuration file, containing the model weights metadata and the relevant quantization parameters required to load and interpret the model correctly. Specifically:

  • quant_method: gptq: Tells the inference engine (vLLM/Transformers) which kernel to use.
  • bits: 4 / scheme: W4A16: Defines the precision of weights and activations.
  • group_size: 128 / block_size: 128: Defines the granularity of quantization.
  • desc_act: true / sequential_update: true: Indicates the order of weight processing during quantization (improves accuracy).
  • ignore_modules: Specifies which layers were excluded from quantization (critical for MoE routing layers).
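To double-check these fields after downloading, the quantization metadata can be read straight from the model config. This is a quick sanity check, assuming the standard `quantization_config` entry in `config.json`:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("amir22010/sarvam-30b-gptq-4bit", trust_remote_code=True)
# Should echo the quant_method, bits, group_size and ignored modules listed above
print(cfg.quantization_config)
```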

Usage

This model can be deployed efficiently using vLLM or Hugging Face Transformers.

vLLM Example (Recommended for L4/T4 GPUs):

```bash
python -m vllm.entrypoints.openai.api_server \
    --model amir22010/sarvam-30b-gptq-4bit \
    --quantization gptq \
    --dtype float16 \
    --max-model-len 2048 \
    --trust-remote-code
```
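The server exposes an OpenAI-compatible API, so any OpenAI client can talk to it. A minimal example follows; the prompt and sampling values are illustrative:

```python
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; the api_key is a placeholder unless you configure one
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="amir22010/sarvam-30b-gptq-4bit",
    prompt="भारत की राजधानी",   # "The capital of India"
    max_tokens=50,
    temperature=0.7,
)
print(response.choices[0].text)
```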

Transformers Example:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "amir22010/sarvam-30b-gptq-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True
)
```
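A short generation call on top of the snippet above (the Hindi prompt and sampling settings are just examples):

```python
# Tokenize an example prompt and generate a short completion
prompt = "भारत की राजधानी"   # "The capital of India"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```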

# `vllm_config.yaml` for GPTQ W4A16 Quantized Sarvam-30b

Based on the compression setup above, here is a well-tuned configuration:

```yaml
# === Model ===
model: sarvam-30b-gptq-4bit          # Path to your quantized model directory
trust_remote_code: true               # Required for sarvam models
dtype: float16                        # W4A16 → weights are int4, activations are fp16

# === Quantization ===
quantization: gptq_marlin             # Use Marlin kernel for GPTQ (much faster than plain gptq)
# If Marlin doesn't work, fall back to:
# quantization: gptq

# === Parallelism ===
tensor_parallel_size: 2               # Adjust: 1 for A100-80GB, 2 for A100-40GB / A6000, etc.
# pipeline_parallel_size: 1           # Increase if spanning multiple nodes

# === Context Length ===
max_model_len: 8192                   # Max sequence length for serving (calibration used 2048, but serving length can be higher)
# Note: MoE models have large KV caches; reduce if OOM

# === Memory Management ===
gpu_memory_utilization: 0.90          # Fraction of GPU memory to use (default: 0.9)
kv_cache_dtype: auto                  # Use "fp8_e5m2" on Ada/Hopper GPUs to save KV cache memory

# === Performance ===
max_num_seqs: 64                      # Max concurrent sequences
max_num_batched_tokens: 32768         # Max tokens per batch — tune up/down for throughput vs latency
enforce_eager: false                  # Set true only if CUDA graph capture fails
# cuda_graph_sizes: [1, 2, 4, 8, 16, 32, 64, 128]  # Default CUDA graph capture sizes

# === Sampling Defaults ===
temperature: 1.0
top_p: 1.0
max_tokens: 4096                      # Default max tokens per completion

# === Server ===
host: 0.0.0.0
port: 8000
api_key: null                         # Set if you need auth

# === Optional: Speculative Decoding ===
# speculative_model: null             # Enable if you have a draft model
# speculative_max_tokens: 5

# === Optional: LoRA ===
# enable_lora: false
# max_loras: 1
# lora_modules: null

How to launch

```bash
vllm serve --config vllm_config.yaml
```

Or equivalently (without the config file):

```bash
vllm serve sarvam-30b-gptq-4bit \
  --trust-remote-code \
  --dtype float16 \
  --quantization gptq_marlin \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9
```

Key decisions explained

| Parameter | Why |
| --- | --- |
| `quantization: gptq_marlin` | Marlin is the optimized kernel for GPTQ in vLLM. It is 2-4× faster than the plain `gptq` backend, and vLLM automatically converts the GPTQ checkpoint to Marlin format on first load. |
| `dtype: float16` | The compression was W4A16: weights are stored as int4, but activations compute in fp16. Don't use `auto` here, since it may pick bfloat16 and cause dtype mismatches with the GPTQ checkpoint. |
| `tensor_parallel_size: 2` | Sarvam-30b is an MoE model. At W4A16 the weight footprint is roughly 15-20 GB, and MoE models still have large KV-cache demands on top of that. Adjust based on your GPU VRAM. |
| `max_model_len: 8192` | Calibration used 2048 tokens, but that only affects quantization quality; serving length is independent. If you hit OOM, drop to 4096 or 2048. |
| `kv_cache_dtype: auto` | On Ada/Hopper GPUs (e.g. L4/H100), switch to `fp8_e5m2` to nearly double the effective KV-cache capacity, which is the bottleneck for MoE models. |
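The ~15-20 GB weight figure in the table is a back-of-the-envelope estimate, not a measured number. A rough way to reproduce it is below; the 5% FP16 fraction for embeddings, lm_head and router gates is an assumption.

```python
# Rough weight-memory estimate for a ~30B-parameter model quantized to 4-bit,
# with a small fraction of weights kept in FP16 (embeddings, lm_head, router gates).
total_params = 30e9
fp16_fraction = 0.05                                     # assumed share of unquantized weights

int4_bytes = total_params * (1 - fp16_fraction) * 0.5    # 4 bits = 0.5 bytes per param
fp16_bytes = total_params * fp16_fraction * 2.0          # 16 bits = 2 bytes per param
print(f"~{(int4_bytes + fp16_bytes) / 1e9:.0f} GB of weights")  # ≈ 17 GB, before KV cache
```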

Troubleshooting

If Marlin fails to load

```yaml
quantization: gptq    # Fallback: slower but more compatible
```

If you get OOM

```yaml
max_model_len: 4096              # Reduce the context window
gpu_memory_utilization: 0.95     # Let vLLM use more of the GPU for KV cache
kv_cache_dtype: fp8_e5m2         # Only on Ada/Hopper GPUs
max_num_seqs: 32                 # Fewer concurrent requests
```

If CUDA graph capture crashes

```yaml
enforce_eager: true   # Disables CUDA graphs (slightly slower but stable)
```

Verify the model loads correctly

```bash
curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sarvam-30b-gptq-4bit",
    "prompt": "भारत की राजधानी",
    "max_tokens": 50,
    "temperature": 0.7
  }'
```