You need to agree to share your contact information to access this model

This repository is publicly accessible, but you have to accept the conditions to access its files and content.

sarvam-30b-compressed — CompressED @ ARRC 2026

Compressed version of sarvamai/sarvam-30b submitted by team CompressED for the Resilient AI Challenge 2026 Text-to-Text category (Sarvam-30B track).

Compression Summary

Property	Value
Method	AWQ W4A16 (MoE experts) + FP8 Dynamic (attention + layer 0)
Original size	~60 GB (BF16)
Compressed size	~24 GB
Compression ratio	~2.5×
Format	compressed-tensors (mixed-precision)
Tool	llm-compressor
Speculative decoding	Eagle3 (sulabhkatiyar/eagle3-sarvam-30b) — 2.75× overall speedup, up to 3.59× on Indic languages

Method

Two-stage mixed-precision quantization applied in a single oneshot() pass:

Stage 0 — AWQ W4A16 on MoE expert layers

Targets: all Linear layers (128 MoE experts, layers 1–18)
Ignore: lm_head, layer 0 (dense), all attention layers, MoE router gates
Group size: 128, symmetric quantization
SmoothQuant activation balancing: migrates outliers from activations into weights before quantization, preserving reasoning and mathematical quality
Calibration: sarvamai/indivibe (512 samples) + cais/mmlu (256 samples)

MoE expert layers tolerate INT4 well due to the natural redundancy across 128 experts (top-6 active per token). SmoothQuant is critical for maintaining quality on logical reasoning and mathematical benchmarks.

Stage 1 — FP8 Dynamic on attention + dense layer 0

Targets: attention.query_key_value, attention.dense, layer-0 MLP projections
Scheme: FP8_DYNAMIC (per-token dynamic activation scaling, no calibration needed)
Rationale: Attention with only 4 KV heads is more sensitive to quantization; FP8 preserves quality while reducing memory bandwidth vs BF16

What stays at BF16

lm_head (output projection)
MoE router gate weights (protecting expert routing decisions)

Precision Map

Component	Precision	Why
MoE experts (layers 1–18)	INT4 AWQ	128 experts tolerate 4-bit; ~4× bandwidth gain
Attention (all layers)	FP8 Dynamic	4 KV heads are sensitive; FP8 preserves quality
Layer 0 MLP (dense)	FP8 Dynamic	Dense layer, more sensitive than MoE experts
Router gates	BF16	Expert routing — must remain at full precision
lm_head	BF16	Output layer — always full precision

Usage with vLLM

vllm serve --config vllm_config.yaml

vllm_config.yaml is included in this repository. Key settings:

model: CompressEDai4good/sarvam-30b-compressed
tensor_parallel_size: 1
max_model_len: 65536
quantization: compressed-tensors

Energy Efficiency

Three compounding optimisations drive energy reduction:

Optimisation	Mechanism	Contribution
AWQ W4A16 on MoE experts	~4× memory bandwidth reduction	~2.5× throughput gain
FP8 Dynamic on attention	~2× bandwidth reduction on attention	Additional ~10–15% gain
Eagle3 speculative decoding	7 draft tokens verified per step	2.2–2.75× additional speedup

Combined effect on CodeCarbon wall-clock energy measurement (A100 80GB, single GPU):

Model size: ~~24 GB vs ~60 GB baseline (~~60% weight reduction)
Throughput gain: ~5–7× over BF16 baseline (compression × Eagle3)
Estimated energy reduction: ~80–86% vs BF16 baseline

Eagle3 per-task speedups (measured by sulabhkatiyar, 60-prompt eval):

Task type	Speedup
Indic language generation (Hindi, Bengali, Tamil)	3.19–3.59×
English generation	2.63×
Mathematical reasoning	2.22×
Long context / Code	1.71–1.75×

Hardware Requirements

Minimum: 1× NVIDIA A100 40GB (model fits in 24 GB; KV cache uses remaining VRAM)
Recommended eval hardware: 1× NVIDIA A100 80GB (as per challenge specification)
FP8 inference: requires NVIDIA Ampere or newer (A100, A10G, H100)

Files

File	Description
`model-*.safetensors`	Compressed model weights (mixed INT4/FP8/BF16)
`config.json`	Model config with quantization_config
`vllm_config.yaml`	vLLM serving configuration for evaluation
`recipe.yaml`	Full llm-compressor recipe used for compression
`chat_template.jinja`	Sarvam-30B chat template

License

Apache License 2.0 — same as sarvamai/sarvam-30b.

Downloads last month: 56

Safetensors

Model size

10B params

Tensor type

I64

I32

BF16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for CompressEDai4good/sarvam-30b-compressed

Base model

sarvamai/sarvam-30b

Quantized

(20)

this model