sarvam-30b-compressed – CompressED @ ARRC 2026

Compressed version of sarvamai/sarvam-30b submitted by team CompressED for the Resilient AI Challenge 2026 Text-to-Text category (Sarvam-30B track).

Compression Summary

| Property | Value |
| --- | --- |
| Method | AWQ W4A16 (MoE experts) + FP8 Dynamic (attention + layer 0) |
| Original size | ~60 GB (BF16) |
| Compressed size | ~24 GB |
| Compression ratio | ~2.5× |
| Format | compressed-tensors (mixed-precision) |
| Tool | llm-compressor |
| Speculative decoding | Eagle3 (sulabhkatiyar/eagle3-sarvam-30b) – 2.75× overall speedup, up to 3.59× on Indic languages |

Method

Two-stage mixed-precision quantization applied in a single oneshot() pass:

Stage 0 – AWQ W4A16 on MoE expert layers

  • Targets: all Linear layers (128 MoE experts, layers 1–18)
  • Ignore: lm_head, layer 0 (dense), all attention layers, MoE router gates
  • Group size: 128, symmetric quantization
  • SmoothQuant activation balancing: migrates outliers from activations into weights before quantization, preserving reasoning and mathematical quality
  • Calibration: sarvamai/indivibe (512 samples) + cais/mmlu (256 samples)

MoE expert layers tolerate INT4 well due to the natural redundancy across 128 experts (top-6 active per token). SmoothQuant is critical for maintaining quality on logical reasoning and mathematical benchmarks.
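For illustration, a minimal llm-compressor sketch of this stage could look like the block below. The regex module patterns and the smoothing strength are assumptions made for readability; the exact values live in the released recipe.yaml.

```python
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Stage 0: migrate activation outliers into the weights (SmoothQuant), then
# apply AWQ W4A16 (group size 128, symmetric) to the MoE expert projections.
# Module regexes and smoothing_strength are illustrative assumptions.
stage0 = [
    SmoothQuantModifier(smoothing_strength=0.8),
    AWQModifier(
        targets=["re:.*experts.*"],          # MoE expert Linear layers only
        ignore=["lm_head", "re:.*gate.*"],   # router gates and lm_head stay BF16
        scheme="W4A16",                      # 4-bit weights, group_size=128, symmetric
    ),
]
```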

Stage 1 – FP8 Dynamic on attention + dense layer 0

  • Targets: attention.query_key_value, attention.dense, layer-0 MLP projections
  • Scheme: FP8_DYNAMIC (per-token dynamic activation scaling, no calibration needed)
  • Rationale: Attention with only 4 KV heads is more sensitive to quantization; FP8 preserves quality while reducing memory bandwidth vs BF16
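The FP8 stage and the single oneshot() call that applies both stages together might then be sketched as follows, continuing the Stage 0 snippet above; module patterns, the merged calibration set, and the sequence length are again illustrative assumptions rather than values copied from recipe.yaml.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Stage 1: FP8 dynamic quantization (per-token activation scales, no
# calibration required) on attention projections and the dense layer-0 MLP.
stage1 = QuantizationModifier(
    targets=["re:.*attention.*", "re:model.layers.0.mlp.*"],
    ignore=["lm_head"],
    scheme="FP8_DYNAMIC",
)

# Both stages run in one oneshot() pass; `stage0` is the modifier list from
# the Stage 0 sketch. The two calibration sets are shown as one for brevity.
oneshot(
    model="sarvamai/sarvam-30b",
    recipe=stage0 + [stage1],
    dataset="sarvamai/indivibe",       # plus cais/mmlu (256 samples) in the actual run
    num_calibration_samples=768,       # 512 + 256
    max_seq_length=2048,               # assumed, not stated in the card
    output_dir="sarvam-30b-compressed",
)
```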

What stays at BF16

  • lm_head (output projection)
  • MoE router gate weights (protecting expert routing decisions)

Precision Map

| Component | Precision | Why |
| --- | --- | --- |
| MoE experts (layers 1–18) | INT4 AWQ | 128 experts tolerate 4-bit; ~4× bandwidth gain |
| Attention (all layers) | FP8 Dynamic | 4 KV heads are sensitive; FP8 preserves quality |
| Layer 0 MLP (dense) | FP8 Dynamic | Dense layer, more sensitive than MoE experts |
| Router gates | BF16 | Expert routing must remain at full precision |
| lm_head | BF16 | Output layer, always full precision |

Usage with vLLM

vllm serve --config vllm_config.yaml

vllm_config.yaml is included in this repository. Key settings:

model: CompressEDai4good/sarvam-30b-compressed
tensor_parallel_size: 1
max_model_len: 65536
quantization: compressed-tensors
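Once the server is running, it exposes vLLM's standard OpenAI-compatible API. The sketch below assumes the default port 8000 and uses the openai Python client; neither is part of this repository.

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible endpoint; 8000 is the default port.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="CompressEDai4good/sarvam-30b-compressed",
    messages=[{"role": "user", "content": "Write two sentences about the monsoon in Hindi."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```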

Energy Efficiency

Three compounding optimisations drive energy reduction:

| Optimisation | Mechanism | Contribution |
| --- | --- | --- |
| AWQ W4A16 on MoE experts | ~4× memory bandwidth reduction | ~2.5× throughput gain |
| FP8 Dynamic on attention | ~2× bandwidth reduction on attention | Additional ~10–15% gain |
| Eagle3 speculative decoding | 7 draft tokens verified per step | 2.2–2.75× additional speedup |

Combined effect on CodeCarbon wall-clock energy measurement (A100 80GB, single GPU):

  • Model size: 24 GB vs ~60 GB baseline (60% weight reduction)
  • Throughput gain: ~5–7× over BF16 baseline (compression × Eagle3)
  • Estimated energy reduction: ~80–86% vs BF16 baseline
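The energy estimate assumes that wall-clock energy on a fixed GPU scales roughly inversely with end-to-end throughput, so an x-fold speedup saves about 1 - 1/x of the baseline energy; the snippet below just reproduces that arithmetic from the speedup range quoted above.

```python
# Energy relative to the BF16 baseline, assuming wall-clock energy on a
# fixed GPU scales inversely with end-to-end throughput.
for speedup in (5.0, 7.0):               # combined compression x Eagle3 range
    reduction = 1.0 - 1.0 / speedup
    print(f"{speedup:.0f}x speedup -> ~{reduction:.0%} energy reduction")
# Prints ~80% and ~86%, matching the estimate above.
```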

Eagle3 per-task speedups (measured by sulabhkatiyar, 60-prompt eval):

| Task type | Speedup |
| --- | --- |
| Indic language generation (Hindi, Bengali, Tamil) | 3.19–3.59× |
| English generation | 2.63× |
| Mathematical reasoning | 2.22× |
| Long context / Code | 1.71–1.75× |

Hardware Requirements

  • Minimum: 1× NVIDIA A100 40GB (model fits in 24 GB; KV cache uses remaining VRAM)
  • Recommended eval hardware: 1× NVIDIA A100 80GB (as per challenge specification)
  • FP8 inference: requires NVIDIA Ampere or newer (A100, A10G, H100)

Files

| File | Description |
| --- | --- |
| model-*.safetensors | Compressed model weights (mixed INT4/FP8/BF16) |
| config.json | Model config with quantization_config |
| vllm_config.yaml | vLLM serving configuration for evaluation |
| recipe.yaml | Full llm-compressor recipe used for compression |
| chat_template.jinja | Sarvam-30B chat template |

License

Apache License 2.0 – same as sarvamai/sarvam-30b.
