You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

Log in or Sign Up to review the conditions and access this model content.

CompressED: Sarvam-30B AutoRound INT4 W4A16 (top-k=3)

Team: CompressED | Challenge: UNESCO Resilient AI Challenge 2026 — Text-to-Text
Round 1 result: #1 — 390.999 Wh, 100% Quality Recovery
Submission repo: CompressEDai4good/sarvam-30b-compressed


Compression Technique

Method: Intel AutoRound INT4 (GPTQ-format W4A16) + MoE Top-k=3 Routing Reduction

  • Quantizer: Intel AutoRound 0.13.0 — gradient-based (signed-SGD) weight-rounding optimizer
  • Weight precision: 4-bit INT4, group_size=128, symmetric, desc_act=false, damp_percent=0.01
  • Calibration: 512 samples (nsamples=512)
  • Output format: GPTQ-format INT4 packed weights, loaded via compressed-tensors in vLLM
  • Activation precision: 16-bit (W4A16 scheme)
  • Top-k routing: reduced from 6 → 3 experts per token (num_experts_per_tok: 3)
  • Compression ratio: 5.5× vs BF16 (120 GB → ~19.8 GB)

Key Innovation 1: MoE Gate Router Protection

Sarvam-30B is a Mixture-of-Experts model with 128 experts and top-6 routing. Quantizing the routing gate layer causes routing-cascade errors that collapse quality on reasoning tasks. AutoRound's dynamic rule keeps every mlp.gate layer in full precision (BF16):

"dynamic": { "-:.*mlp\\.gate.*": {} }

Verified post-quantization: all 36 gate tensors carry zero quantization parameters (no weight_scale / weight_packed) — they remain genuine BF16. Most competitors quantize all Linear layers; preserving the gate is the critical difference that keeps the routing distribution intact and enables high quality recovery.

Key Innovation 2: Top-k Routing Reduction

We reduce the number of active experts per token from 6 → 3 (num_experts_per_tok: 3 in config.json). This cuts expert-layer FLOPs by ~50%, reducing both inference latency and energy.

Why AutoRound

AutoRound applies a few hundred steps of gradient-based optimization to the rounding of each weight block, recovering accuracy that naïve round-to-nearest GPTQ loses at 4-bit — at the same bit-width and inference cost. The output is the standard vLLM-native GPTQ / compressed-tensors format, so there is no inference-time overhead versus ordinary GPTQ.


Model Size

Component BF16 (baseline) This model
Total size ~120 GB ~19.8 GB
VRAM required ~120 GB ~22 GB
Experts per token 6 3
Compression ratio ~5.5×

Quality Results (Internal Evaluation)

Measured on the exact uploaded weights (vLLM 0.19.1, k=3 active, thinking enabled, max_tokens set high enough that reasoning traces are never truncated — Math 8192, MCQ 4096). These are internal proxy benchmarks. The official Round-2 score is measured by the organizers on their own single A100 and task set — that is the figure that counts.

Benchmark 150q/cat run 50q/cat re-check (Jun 15) Calibrated Official*
GSM8K (Math proxy) 79.3% (119/150) 74.0% (37/50) ~0.75–0.92
MedMCQA (Medical proxy) 67.3% (101/150) 60.0% (30/50) ~0.57–0.65
ARC-Challenge (Questions) 90.0% (135/150) 90.0% (45/50) ~0.90
Writing 0.814 (carried from W8A16 official)
Mean Recovery ~97% ~90% organizer-measured

* In both runs all three re-measured categories clear the 80% official quality gate with margin. Proxy recovery brackets ~90–97% depending on sample size; we treat ~90% as the conservative floor. Writing is carried from the W8A16 official value (not re-evaluated). The calibration vs the Round-1 W8A16 reference is an estimate, not an exact official figure.

Energy (self-measured, inference-only, CodeCarbon NVML):

  • ~226 Wh / 300-question run (A100 80GB SXM4); Jun-15 re-check: ~147 Wh / 150-question run (A100 80GB PCIe)
  • vs Round 1 W8A16 baseline (390.999 Wh) and BF16 baseline (647 Wh) — large reductions on the same proxy set
  • Note: the official Round 2 figure is measured by the organizers on their hardware and task set, and may differ.

How to Run

pip install -r requirements.txt
vllm serve --config vllm_config.yaml

vllm_config.yaml

model: "CompressEDai4good/sarvam-30b-compressed"
served_model_name: "sarvam-30b"
trust_remote_code: true
gpu_memory_utilization: 0.92
max_model_len: 32768
max_num_seqs: 4
max_num_batched_tokens: 16384
quantization: "compressed-tensors"
enable_prefix_caching: true
enable_chunked_prefill: true

requirements.txt

vllm==0.19.1
torch>=2.1.0
transformers>=4.40.0
accelerate>=0.27.0
safetensors>=0.4.0
sentencepiece>=0.1.99

# vLLM 0.19.1 fails to boot with fastapi>=0.116 or
# prometheus-fastapi-instrumentator>=8.0 ("'_IncludedRouter' object has no attribute 'path'").
fastapi>=0.115,<0.116
starlette>=0.46,<0.47
prometheus-fastapi-instrumentator>=7,<8

Reproduction Protocol

To reproduce our internal numbers exactly (organizers may differ on hardware/task set):

  1. Hardware: single NVIDIA A100 80GB. Engine: vLLM 0.19.1 (pip install -r requirements.txt).
  2. Serve: vllm serve --config vllm_config.yaml (k=3 routing is baked into config.json's num_experts_per_tok: 3 — no flag needed).
  3. Thinking is ON (default chat template). Sarvam-30B emits <think>…</think> before answers, so set generous max_tokens or reasoning traces truncate and answers go missing: Math max_tokens=8192, MCQ max_tokens=4096.
  4. Quality: GSM8K (Math), MedMCQA (Medical), ARC-Challenge (Questions) proxies; Writing via LLM-as-judge. Parse the final answer after the </think> tag.
  5. Energy: CodeCarbon (NVML), inference-only (model-load excluded). Self-measured Wh on non-Yotta hardware is not directly comparable to the organizers' single-A100 figure.
  6. Verify gate protection: run verify_gate_layers.py — all 36 mlp.gate tensors must carry no quantization params (genuine BF16).

Tools Used

Tool Version Purpose
Intel AutoRound 0.13.0 INT4 weight quantization (gradient-based rounding)
vLLM 0.19.1 Inference engine (GPTQ / compressed-tensors)
Hugging Face Transformers 4.55.4 Model loading
CodeCarbon ≥2.3.0 Energy measurement

References

  • AutoRound — Intel, Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMsgithub.com/intel/auto-round, arXiv:2309.05516
  • GPTQ: Accurate Post-Training Quantization — arXiv:2210.17323
  • Sarvam-30B: sarvamai/sarvam-30b

Acknowledgments

  • Paul Li — for recommending Intel's AutoRound quantization toolkit, which produced this final submission. His pointer to gradient-based rounding was the decisive step in reaching submission-grade 4-bit quality.
  • Intel AutoRound team — github.com/intel/auto-round
  • Sarvam AI — base model and mid-challenge technical guidance
  • Replit agentic AI — compression and evaluation pipeline development

About the Author

Dr Simon Wang
Lecturer in English and Innovation Officer
The Language Centre, Hong Kong Baptist University


It is my great pleasure to join the Resilient AI Challenge and I learned how to compress models from scratch while partnering with Replit agentic AI. I hope my work can contribute to the collective efforts of developing greener and more accessible large language models. Feel free to reach out via email if you have questions or wish to explore collaboration.

CompressED Team — UNESCO Resilient AI Challenge 2026

Downloads last month
235
Safetensors
Model size
33B params
Tensor type
I64
·
I32
·
F16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for CompressEDai4good/sarvam-30b-compressed

Quantized
(28)
this model

Papers for CompressEDai4good/sarvam-30b-compressed