CompressED: Sarvam-30B AutoRound INT4 W4A16 (top-k=3)
Team: CompressED | Challenge: UNESCO Resilient AI Challenge 2026 — Text-to-Text
Round 1 result: #1 — 390.999 Wh, 100% Quality Recovery
Submission repo: CompressEDai4good/sarvam-30b-compressed
Compression Technique
Method: Intel AutoRound INT4 (GPTQ-format W4A16) + MoE Top-k=3 Routing Reduction
- Quantizer: Intel AutoRound 0.13.0 — gradient-based (signed-SGD) weight-rounding optimizer
- Weight precision: 4-bit INT4, group_size=128, symmetric, desc_act=false, damp_percent=0.01
- Calibration: 512 samples (nsamples=512)
- Output format: GPTQ-format INT4 packed weights, loaded via compressed-tensors in vLLM
- Activation precision: 16-bit (W4A16 scheme)
- Top-k routing: reduced from 6 → 3 experts per token (num_experts_per_tok: 3)
- Compression ratio: 5.5× vs BF16 (120 GB → ~19.8 GB)
Key Innovation 1: MoE Gate Router Protection
Sarvam-30B is a Mixture-of-Experts model with 128 experts and top-6 routing. Quantizing the
routing gate layer causes routing-cascade errors that collapse quality on reasoning tasks.
AutoRound's dynamic rule keeps every mlp.gate layer in full precision (BF16):
"dynamic": { "-:.*mlp\\.gate.*": {} }
Verified post-quantization: all 36 gate tensors carry zero quantization parameters (no
weight_scale / weight_packed) — they remain genuine BF16. Most competitors quantize all
Linear layers; preserving the gate is the critical difference that keeps the routing
distribution intact and enables high quality recovery.
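The check described above can be sketched as a filter over tensor names. This is a minimal illustration, not the team's verify_gate_layers.py: the tensor names and the list of quantization suffixes are assumptions, and only the regular expression is taken from the dynamic rule.

```python
import re

# Same pattern as the "dynamic" exclusion rule above.
GATE_PATTERN = re.compile(r".*mlp\.gate.*")
# Assumed compressed-tensors suffixes that mark a quantized tensor.
QUANT_SUFFIXES = ("weight_packed", "weight_scale", "weight_zero_point")


def find_quantized_gates(tensor_names):
    """Return the gate tensor names that carry quantization parameters."""
    return sorted(
        name
        for name in tensor_names
        if GATE_PATTERN.match(name) and name.endswith(QUANT_SUFFIXES)
    )


# Hypothetical tensor names, for illustration only.
example_names = [
    "model.layers.1.mlp.gate.weight",                        # router gate, BF16
    "model.layers.1.mlp.experts.0.gate_proj.weight_packed",  # expert, quantized
    "model.layers.1.mlp.experts.0.gate_proj.weight_scale",
]
print(find_quantized_gates(example_names))  # an empty list means the gates are clean
```

Note that an expert's gate_proj is not matched by the pattern, because "mlp.gate" never appears contiguously in its name. On a sharded checkpoint, the names could be read from the keys of "weight_map" in model.safetensors.index.json, assuming the repository ships that index file.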
Key Innovation 2: Top-k Routing Reduction
We reduce the number of active experts per token from 6 → 3 (num_experts_per_tok: 3 in
config.json). This cuts expert-layer FLOPs by ~50%, reducing both inference latency and energy.
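To make the ~50% figure concrete, here is a toy top-k router in plain Python. It assumes softmax scoring followed by renormalization over the selected experts; Sarvam-30B's actual router may score or normalize differently, and the logits are made up.

```python
import math


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    # Softmax over all experts (shifted by the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept = sum(probs[i] for i in top)
    return {i: probs[i] / kept for i in top}


logits = [0.1, 2.0, -1.0, 1.5, 0.3, 0.9, -0.5, 1.1]  # toy router output, 8 experts
six = route_top_k(logits, k=6)
three = route_top_k(logits, k=3)
print(sorted(three))          # [1, 3, 7]
print(len(three) / len(six))  # 0.5
```

Each token runs half as many expert MLPs with k=3 as with k=6, which is where the halving of expert-layer FLOPs comes from; attention and router cost are unchanged.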
Why AutoRound
AutoRound applies a few hundred steps of gradient-based (signed-SGD) optimization to the
rounding of each weight block, recovering accuracy that naïve round-to-nearest (RTN)
quantization loses at 4-bit, at the same bit-width and inference cost. The output is the
standard vLLM-native GPTQ / compressed-tensors format, so there is no inference-time
overhead versus ordinary GPTQ.
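The idea of tuning the rounding with signed gradient steps can be illustrated on a single toy weight vector. This is a heavily simplified sketch, not AutoRound itself: the real method works block by block on a model, also tunes clipping ranges, and uses its own schedule. The step count, learning rate, and random data here are arbitrary assumptions.

```python
import math
import random


def quantize(weights, scale, offsets):
    """Symmetric INT4: round(w/scale + offset), clamp to [-8, 7], rescale."""
    out = []
    for w, v in zip(weights, offsets):
        q = math.floor(w / scale + v + 0.5)
        out.append(max(-8, min(7, q)) * scale)
    return out


def output_error(weights, quantized, inputs):
    """Squared error of the layer output x.w over the calibration inputs."""
    total = 0.0
    for x in inputs:
        diff = sum(xi * (q - w) for xi, q, w in zip(x, quantized, weights))
        total += diff * diff
    return total


def tune_rounding(weights, inputs, steps=200, lr=0.005):
    scale = max(abs(w) for w in weights) / 7
    offsets = [0.0] * len(weights)  # all zeros is plain round-to-nearest
    rtn_loss = output_error(weights, quantize(weights, scale, offsets), inputs)
    best_loss = rtn_loss
    for _ in range(steps):
        quantized = quantize(weights, scale, offsets)
        # Straight-through estimator: treat d(quantized)/d(offset) as `scale`.
        grads = [0.0] * len(weights)
        for x in inputs:
            diff = sum(xi * (q - w) for xi, q, w in zip(x, quantized, weights))
            for i, xi in enumerate(x):
                grads[i] += 2 * diff * xi * scale
        # Signed SGD: step by the gradient's sign only, keeping |offset| <= 0.5.
        offsets = [
            max(-0.5, min(0.5, v - lr * math.copysign(1.0, g))) if g else v
            for v, g in zip(offsets, grads)
        ]
        loss = output_error(weights, quantize(weights, scale, offsets), inputs)
        best_loss = min(best_loss, loss)  # remember the best rounding seen
    return rtn_loss, best_loss


rng = random.Random(0)
weights = [rng.uniform(-1, 1) for _ in range(16)]
inputs = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(32)]
rtn_loss, tuned_loss = tune_rounding(weights, inputs)
print(f"RTN error {rtn_loss:.4f} -> tuned error {tuned_loss:.4f}")
```

Because the search starts from round-to-nearest and keeps the best result, the tuned error can never exceed the RTN error. The learned offsets only change which integer each weight rounds to, so the stored format and inference cost are the same.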
Model Size
| Component | BF16 (baseline) | This model |
|---|---|---|
| Total size | ~120 GB | ~19.8 GB |
| VRAM required | ~120 GB | ~22 GB |
| Experts per token | 6 | 3 |
| Compression ratio | 1× | ~5.5× |
Quality Results (Internal Evaluation)
Measured on the exact uploaded weights (vLLM 0.19.1, k=3 active, thinking enabled,
max_tokens set high enough that reasoning traces are never truncated — Math 8192, MCQ 4096).
These are internal proxy benchmarks. The official Round-2 score is measured by the
organizers on their own single A100 and task set — that is the figure that counts.
| Benchmark | 150q/cat run | 50q/cat re-check (Jun 15) | Calibrated Official* |
|---|---|---|---|
| GSM8K (Math proxy) | 79.3% (119/150) | 74.0% (37/50) | ~0.75–0.92 |
| MedMCQA (Medical proxy) | 67.3% (101/150) | 60.0% (30/50) | ~0.57–0.65 |
| ARC-Challenge (Questions) | 90.0% (135/150) | 90.0% (45/50) | ~0.90 |
| Writing | — | — | 0.814 (carried from W8A16 official) |
| Mean Recovery | ~97% | ~90% | organizer-measured |
* In both runs all three re-measured categories clear the 80% official quality gate with margin. Proxy recovery brackets ~90–97% depending on sample size; we treat ~90% as the conservative floor. Writing is carried from the W8A16 official value (not re-evaluated). The calibration vs the Round-1 W8A16 reference is an estimate, not an exact official figure.
Energy (self-measured, inference-only, CodeCarbon NVML):
- ~226 Wh / 300-question run (A100 80GB SXM4); Jun-15 re-check: ~147 Wh / 150-question run (A100 80GB PCIe)
- vs Round 1 W8A16 baseline (390.999 Wh) and BF16 baseline (647 Wh): about 42% and 65% lower, respectively, for the ~226 Wh run on the same proxy set
- Note: the official Round 2 figure is measured by the organizers on their hardware and task set, and may differ.
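The arithmetic behind these comparisons is shown below as a small sketch. It uses only the figures quoted in the list above and assumes, as that list states, that the baselines refer to the same proxy set as the 300-question run.

```python
def wh_per_question(total_wh, questions):
    """Normalize a run's energy by the number of questions answered."""
    return total_wh / questions


def reduction_pct(new_wh, baseline_wh):
    """Percentage saved relative to a baseline."""
    return 100 * (1 - new_wh / baseline_wh)


main_run = wh_per_question(226, 300)  # A100 80GB SXM4
recheck = wh_per_question(147, 150)   # A100 80GB PCIe
print(f"{main_run:.2f} Wh/question vs {recheck:.2f} Wh/question")
print(f"vs W8A16: {reduction_pct(226, 390.999):.0f}% lower")
print(f"vs BF16:  {reduction_pct(226, 647):.0f}% lower")
```

The two runs used different A100 variants and question counts, so the per-question figures are indicative rather than a like-for-like comparison.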
How to Run
pip install -r requirements.txt
vllm serve --config vllm_config.yaml
vllm_config.yaml
model: "CompressEDai4good/sarvam-30b-compressed"
served_model_name: "sarvam-30b"
trust_remote_code: true
gpu_memory_utilization: 0.92
max_model_len: 32768
max_num_seqs: 4
max_num_batched_tokens: 16384
quantization: "compressed-tensors"
enable_prefix_caching: true
enable_chunked_prefill: true
requirements.txt
vllm==0.19.1
torch>=2.1.0
transformers>=4.40.0
accelerate>=0.27.0
safetensors>=0.4.0
sentencepiece>=0.1.99
# vLLM 0.19.1 fails to boot with fastapi>=0.116 or
# prometheus-fastapi-instrumentator>=8.0 ("'_IncludedRouter' object has no attribute 'path'").
fastapi>=0.115,<0.116
starlette>=0.46,<0.47
prometheus-fastapi-instrumentator>=7,<8
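Once the server is up, a request can be built with the standard library alone. This sketch assumes the server listens on vLLM's default port 8000 on localhost. The category keys are hypothetical labels, while the model name and token budgets are taken from this card.

```python
import json
import urllib.request

# Token budgets quoted in this card; the category keys are hypothetical labels.
MAX_TOKENS = {"math": 8192, "mcq": 4096}


def build_request(question, category, base_url="http://localhost:8000"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": "sarvam-30b",  # served_model_name from vllm_config.yaml
        "messages": [{"role": "user", "content": question}],
        "max_tokens": MAX_TOKENS[category],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_request("What is 17 * 23?", "math")
# With the server running, send it with:
#   with urllib.request.urlopen(request) as response:
#       reply = json.load(response)
#       print(reply["choices"][0]["message"]["content"])
```

Building the request separately from sending it keeps the payload easy to inspect before any GPU time is spent.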
Reproduction Protocol
To reproduce our internal numbers exactly (organizers may differ on hardware/task set):
- Hardware: single NVIDIA A100 80GB. Engine: vLLM 0.19.1 (pip install -r requirements.txt).
- Serve: vllm serve --config vllm_config.yaml (k=3 routing is baked into config.json's num_experts_per_tok: 3; no flag needed).
- Thinking is ON (default chat template). Sarvam-30B emits <think>…</think> before answers, so set generous max_tokens or reasoning traces truncate and answers go missing: Math max_tokens=8192, MCQ max_tokens=4096.
- Quality: GSM8K (Math), MedMCQA (Medical), ARC-Challenge (Questions) proxies; Writing via LLM-as-judge. Parse the final answer after the </think> tag.
- Energy: CodeCarbon (NVML), inference-only (model-load excluded). Self-measured Wh on non-Yotta hardware is not directly comparable to the organizers' single-A100 figure.
- Verify gate protection: run verify_gate_layers.py; all 36 mlp.gate tensors must carry no quantization params (genuine BF16).
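The answer-parsing step in the protocol above could be sketched as follows. This is an illustrative helper, not the team's evaluation code. It assumes the raw completion text still contains the think tags and treats a missing closing tag as a truncated response.

```python
def final_answer(completion):
    """Return the text after the last </think> tag, or None if it is missing."""
    marker = "</think>"
    head, found, tail = completion.rpartition(marker)
    if not found:
        return None  # reasoning was cut off before the tag closed
    return tail.strip()


print(final_answer("<think>6 * 7 = 42</think>\nThe answer is 42."))   # The answer is 42.
print(final_answer("<think>still reasoning when max_tokens ran out"))  # None
```

Returning None for truncated traces lets the scorer count them separately from wrong answers, which makes an undersized max_tokens easy to spot.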
Tools Used
| Tool | Version | Purpose |
|---|---|---|
| Intel AutoRound | 0.13.0 | INT4 weight quantization (gradient-based rounding) |
| vLLM | 0.19.1 | Inference engine (GPTQ / compressed-tensors) |
| Hugging Face Transformers | 4.55.4 | Model loading |
| CodeCarbon | ≥2.3.0 | Energy measurement |
References
- AutoRound — Intel, Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs — github.com/intel/auto-round, arXiv:2309.05516
- GPTQ: Accurate Post-Training Quantization — arXiv:2210.17323
- Sarvam-30B: sarvamai/sarvam-30b
Acknowledgments
- Paul Li — for recommending Intel's AutoRound quantization toolkit, which produced this final submission. His pointer to gradient-based rounding was the decisive step in reaching submission-grade 4-bit quality.
- Intel AutoRound team — github.com/intel/auto-round
- Sarvam AI — base model and mid-challenge technical guidance
- Replit agentic AI — compression and evaluation pipeline development
About the Author
Dr Simon Wang
Lecturer in English and Innovation Officer
The Language Centre, Hong Kong Baptist University
It was my great pleasure to join the Resilient AI Challenge, where I learned how to compress models from scratch in partnership with Replit agentic AI. I hope my work contributes to the collective effort to develop greener and more accessible large language models. Feel free to reach out via email if you have questions or wish to explore collaboration.
CompressED Team — UNESCO Resilient AI Challenge 2026