---
language:
  - en
  - hi
  - bn
  - ta
  - te
  - mr
  - gu
  - kn
  - ml
  - pa
  - or
  - as
  - ur
  - sa
  - ne
  - sd
  - kok
  - mai
  - doi
  - mni
  - sat
  - ks
  - bo
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
---

![image](https://cdn-uploads.huggingface.co/production/uploads/60270a7c32856987162c641a/SivoCJWJqex41oprnwyuK.png)

Want a bigger model? Download [Sarvam-105B](https://huggingface.co/sarvamai/sarvam-105b-fp8)!

## Index

1. [Introduction](#introduction)  
2. [Architecture](#architecture)  
3. [Inference](#inference)  
   - [SGLang](https://github.com/sgl-project/sglang)
   - [vLLM](https://github.com/vllm-project/vllm)
4. [Citation](#citation)  

## Introduction

**Sarvam-30B** is an advanced Mixture-of-Experts (MoE) model with 2.4B non-embedding active parameters, designed primarily for practical deployment. It combines strong reasoning, reliable coding ability, and best-in-class conversational quality across Indian languages. Sarvam-30B is built to run reliably in resource-constrained environments and can handle multilingual voice calls while performing tool calls.

This repository provides **FP8 quantized weights** for Sarvam-30B, enabling efficient deployment with reduced memory footprint and faster inference while preserving model quality. The weights are quantized using FP8 (E4M3) format via NVIDIA's ModelOpt toolkit.

A major focus during training was the Indian context and languages, resulting in **state-of-the-art performance across 22 Indian languages** for its model size.

Sarvam-30B is open-sourced under the **Apache License**. For more details, see our [blog](https://www.sarvam.ai/blogs/sarvam-30b-105b).

## Architecture

The 30B MoE model is designed for throughput and memory efficiency. It uses 19 layers, a dense FFN `intermediate_size` of 8192, `moe_intermediate_size` of 1024, top-6 routing, grouped KV heads (`num_key_value_heads=4`), and an extremely high rope_theta (`8e6`) for long-context stability without RoPE scaling. It has 128 experts with a shared expert, a routed scaling factor of 2.5, and auxiliary-loss-free router balancing. The 30B model focuses on throughput and memory efficiency through fewer layers, grouped KV attention, and smaller experts.

## Inference

<details>
  <summary>SGLang</summary>

**Install latest SGLang from source**

```bash
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```

**Launch Server**

```bash
sglang serve --model-path sarvamai/sov_30b_fp8 \
  --port 3002 --host 0.0.0.0 \
  --mem-fraction-static 0.70 \
  --trust-remote-code \
  --tp 2 \
  --enable-dp-attention --dp 2 \
  --prefill-attention-backend fa3 \
  --decode-attention-backend fa3 \
  --ep 2 \
  --tool-call-parser glm45 \
  --reasoning-parser glm45 \
  --quantization modelopt_fp8 \
  --kv-cache-dtype fp8_e4m3
```
</details>

<details>
  <summary>vLLM</summary>

Note: currently a PR is open for native support for the Sarvam models in vLLM ([link](https://github.com/vllm-project/vllm/pull/33942)). Therefore, we have 2 options here.

#### Option 1: install from source (hard)

* Use the custom fork here: [link](https://github.com/rahul-sarvam/vllm)
* Follow the instructions here to install from source: [link](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source)

#### Option 2: hot-patch (easy)

* Run [hotpatch_vllm.py](./hotpatch_vllm.py)
* This will do the following:
  * install vllm=0.15.0
  * add 2 model entries to `registry.py`
  * download the model executors for `sarvam-105b` and `sarvam-30b`

Once this is done, you can launch the vLLM server.

> **Important**: You must set `VLLM_USE_FLASHINFER_MOE_FP8=0` as an environment variable, otherwise the server will get stuck during compilation and crash.

```bash
VLLM_USE_FLASHINFER_MOE_FP8=0 vllm serve sarvamai/sarvam-30b-fp8 \
  --trust-remote-code \
  --tensor-parallel-size 2 \
  --quantization modelopt \
  --kv-cache-dtype fp8 \
  --port 3002
```
</details>

## Citation
```
@misc{sarvam_sovereign_models,
  title        = {Introducing Sarvam's Sovereign Models},
  author       = {{Sarvam Foundation Models Team}},
  year         = {2026},
  howpublished = {\url{https://www.sarvam.ai/blogs/sarvam-30b-105b}},
  note         = {Accessed: 2026-03-03}
}
```