# sarvam-30b-compressed – CompressED @ ARRC 2026
Compressed version of sarvamai/sarvam-30b submitted by team CompressED for the Resilient AI Challenge 2026 Text-to-Text category (Sarvam-30B track).
## Compression Summary
| Property | Value |
|---|---|
| Method | AWQ W4A16 (MoE experts) + FP8 Dynamic (attention + layer 0) |
| Original size | ~60 GB (BF16) |
| Compressed size | ~24 GB |
| Compression ratio | ~2.5× |
| Format | compressed-tensors (mixed-precision) |
| Tool | llm-compressor |
| Speculative decoding | Eagle3 (sulabhkatiyar/eagle3-sarvam-30b): 2.75× overall speedup, up to 3.59× on Indic languages |
## Method
Two-stage mixed-precision quantization applied in a single `oneshot()` pass:
### Stage 0 – AWQ W4A16 on MoE expert layers
- Targets: all `Linear` layers (128 MoE experts, layers 1–18)
- Ignore: `lm_head`, layer 0 (dense), all attention layers, MoE router gates
- Group size: 128, symmetric quantization
- SmoothQuant activation balancing: migrates outliers from activations into weights before quantization, preserving reasoning and mathematical quality
- Calibration: `sarvamai/indivibe` (512 samples) + `cais/mmlu` (256 samples)
MoE expert layers tolerate INT4 well due to the natural redundancy across 128 experts (top-6 active per token). SmoothQuant is critical for maintaining quality on logical reasoning and mathematical benchmarks.
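For reference, a minimal sketch of how this stage could be expressed with llm-compressor modifiers is shown below. It is not the shipped recipe (see `recipe.yaml` in this repo); the module-name patterns are placeholders, the smoothing strength is illustrative, and exact class/argument names can differ between llm-compressor releases.

```python
# Hedged sketch of Stage 0, not the shipped recipe.yaml. The "re:..." ignore
# patterns are placeholders; the real module names come from the sarvam-30b
# architecture and are listed in recipe.yaml.
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

stage0 = [
    # Migrate activation outliers into the weights before 4-bit quantization.
    SmoothQuantModifier(smoothing_strength=0.8),  # strength value is illustrative
    # W4A16 on the MoE expert Linear layers (group size 128, symmetric),
    # skipping lm_head, layer 0, attention, and the router gates.
    AWQModifier(
        scheme="W4A16",
        targets=["Linear"],
        ignore=[
            "lm_head",
            "re:.*layers\\.0\\..*",   # dense layer 0 is handled by Stage 1
            "re:.*attention.*",       # attention is handled by Stage 1
            "re:.*mlp\\.gate$",       # MoE router gates stay at BF16
        ],
    ),
]
```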
### Stage 1 – FP8 Dynamic on attention + dense layer 0
- Targets: `attention.query_key_value`, `attention.dense`, layer-0 MLP projections
- Scheme: FP8_DYNAMIC (per-token dynamic activation scaling, no calibration needed)
- Rationale: Attention with only 4 KV heads is more sensitive to quantization; FP8 preserves quality while reducing memory bandwidth vs BF16
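A corresponding hedged sketch for this stage is below; the `targets` patterns are again placeholders for the actual sarvam-30b module names recorded in `recipe.yaml`.

```python
# Hedged sketch of Stage 1. FP8_DYNAMIC uses per-token dynamic activation
# scales, so no calibration data is needed for this modifier.
from llmcompressor.modifiers.quantization import QuantizationModifier

stage1 = QuantizationModifier(
    scheme="FP8_DYNAMIC",
    targets=[
        "re:.*attention\\.query_key_value$",
        "re:.*attention\\.dense$",
        "re:.*layers\\.0\\.mlp.*",   # dense layer-0 MLP projections
    ],
)
```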
### What stays at BF16
- `lm_head` (output projection)
- MoE router gate weights (protecting expert routing decisions)
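Putting the two stages together in the single `oneshot()` pass described above, a hedged sketch follows. Calibration-data preparation for the indivibe + mmlu mix is omitted (`calibration_dataset` is a placeholder), the sequence length is an assumption, and the `oneshot` import path differs slightly between llm-compressor releases.

```python
# Hedged sketch of the single oneshot() pass; `calibration_dataset` stands in
# for the tokenized sarvamai/indivibe + cais/mmlu mix (preparation omitted).
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot

MODEL_ID = "sarvamai/sarvam-30b"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

oneshot(
    model=model,
    recipe=stage0 + [stage1],      # modifiers from the two sketches above
    dataset=calibration_dataset,   # 512 indivibe + 256 mmlu samples
    num_calibration_samples=768,
    max_seq_length=2048,           # assumption; not stated in this card
)

model.save_pretrained("sarvam-30b-compressed", save_compressed=True)
tokenizer.save_pretrained("sarvam-30b-compressed")
```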
## Precision Map
| Component | Precision | Why |
|---|---|---|
| MoE experts (layers 1–18) | INT4 AWQ | 128 experts tolerate 4-bit; ~4× bandwidth gain |
| Attention (all layers) | FP8 Dynamic | 4 KV heads are sensitive; FP8 preserves quality |
| Layer 0 MLP (dense) | FP8 Dynamic | Dense layer, more sensitive than MoE experts |
| Router gates | BF16 | Expert routing – must remain at full precision |
| `lm_head` | BF16 | Output layer – always full precision |
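The same map can be read back from the shipped `config.json`, whose `quantization_config` block records the compressed-tensors config groups and ignore list. A minimal sketch (repo id taken from this card):

```python
# Read the quantization_config embedded in config.json (compressed-tensors format).
import json
from huggingface_hub import hf_hub_download

cfg_path = hf_hub_download("CompressEDai4good/sarvam-30b-compressed", "config.json")
with open(cfg_path) as f:
    quant_cfg = json.load(f)["quantization_config"]

print(json.dumps(quant_cfg, indent=2))  # per-group schemes plus the BF16 ignore list
```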
## Usage with vLLM
```bash
vllm serve --config vllm_config.yaml
```
`vllm_config.yaml` is included in this repository. Key settings:
```yaml
model: CompressEDai4good/sarvam-30b-compressed
tensor_parallel_size: 1
max_model_len: 65536
quantization: compressed-tensors
```
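Alternatively, the model can be loaded directly from Python with vLLM's offline API. A minimal sketch mirroring the config above; the prompt and sampling values are illustrative.

```python
# Offline-inference sketch mirroring vllm_config.yaml.
from vllm import LLM, SamplingParams

llm = LLM(
    model="CompressEDai4good/sarvam-30b-compressed",
    quantization="compressed-tensors",
    tensor_parallel_size=1,
    max_model_len=65536,
)

params = SamplingParams(temperature=0.2, max_tokens=256)
out = llm.generate(["Explain AWQ quantization in two sentences."], params)
print(out[0].outputs[0].text)
```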
## Energy Efficiency
Three compounding optimisations drive energy reduction:
| Optimisation | Mechanism | Contribution |
|---|---|---|
| AWQ W4A16 on MoE experts | ~4× memory bandwidth reduction | ~2.5× throughput gain |
| FP8 Dynamic on attention | ~2× bandwidth reduction on attention | Additional ~10–15% gain |
| Eagle3 speculative decoding | 7 draft tokens verified per step | 2.2–2.75× additional speedup |
Combined effect on CodeCarbon wall-clock energy measurement (A100 80GB, single GPU):
- Model size: ~24 GB vs ~60 GB baseline (60% weight reduction)
- Throughput gain: ~5–7× over BF16 baseline (compression × Eagle3)
- Estimated energy reduction: ~80–86% vs BF16 baseline
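For reproducing the energy numbers, a minimal sketch of a CodeCarbon wall-clock measurement wrapped around an evaluation run is shown below; `run_evaluation()` is a placeholder for the actual benchmark harness.

```python
# Wrap the eval run in a CodeCarbon tracker; run_evaluation() is a placeholder.
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="sarvam-30b-compressed-eval")
tracker.start()
try:
    run_evaluation()              # placeholder: generate over the benchmark prompts
finally:
    emissions_kg = tracker.stop() # estimated kg CO2-eq for the tracked span
print(f"Estimated emissions: {emissions_kg:.4f} kg CO2-eq")
```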
Eagle3 per-task speedups (measured by sulabhkatiyar, 60-prompt eval):
| Task type | Speedup |
|---|---|
| Indic language generation (Hindi, Bengali, Tamil) | 3.19–3.59× |
| English generation | 2.63× |
| Mathematical reasoning | 2.22× |
| Long context / Code | 1.71–1.75× |
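Eagle3 is enabled through vLLM's speculative decoding support. The exact configuration keys vary between vLLM releases, so the sketch below is an assumption (the keys follow recent versions; the draft-model id and the 7 draft tokens come from this card). Treat the vLLM speculative decoding documentation as authoritative.

```python
# Assumption-laden sketch: the speculative_config keys below follow recent vLLM
# releases and may be named differently in other versions.
from vllm import LLM

llm = LLM(
    model="CompressEDai4good/sarvam-30b-compressed",
    quantization="compressed-tensors",
    max_model_len=65536,
    speculative_config={
        "method": "eagle3",
        "model": "sulabhkatiyar/eagle3-sarvam-30b",  # Eagle3 draft model from this card
        "num_speculative_tokens": 7,                 # 7 draft tokens verified per step
    },
)
```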
## Hardware Requirements
- Minimum: 1× NVIDIA A100 40GB (model fits in 24 GB; KV cache uses remaining VRAM)
- Recommended eval hardware: 1× NVIDIA A100 80GB (as per challenge specification)
- FP8 inference: requires NVIDIA Ampere or newer (A100, A10G, H100)
## Files
| File | Description |
|---|---|
| `model-*.safetensors` | Compressed model weights (mixed INT4/FP8/BF16) |
| `config.json` | Model config with `quantization_config` |
| `vllm_config.yaml` | vLLM serving configuration for evaluation |
| `recipe.yaml` | Full llm-compressor recipe used for compression |
| `chat_template.jinja` | Sarvam-30B chat template |
## License
Apache License 2.0, the same license as sarvamai/sarvam-30b.