---
language:
- en
- hi
- bn
- ta
- te
- mr
- gu
- kn
- ml
- pa
- or
- as
- ur
- sa
- ne
- sd
- kok
- mai
- doi
- mni
- sat
- ks
- bo
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
---

Want a bigger model? Download [Sarvam-105B](https://huggingface.co/sarvamai/sarvam-105b-fp8)!
## Index
1. [Introduction](#introduction)
2. [Architecture](#architecture)
3. [Inference](#inference)
- [SGLang](https://github.com/sgl-project/sglang)
- [vLLM](https://github.com/vllm-project/vllm)
4. [Citation](#citation)
## Introduction
**Sarvam-30B** is an advanced Mixture-of-Experts (MoE) model with 2.4B non-embedding active parameters, designed primarily for practical deployment. It combines strong reasoning, reliable coding ability, and best-in-class conversational quality across Indian languages. Sarvam-30B is built to run reliably in resource-constrained environments and can handle multilingual voice calls while performing tool calls.
This repository provides **FP8 quantized weights** for Sarvam-30B, enabling efficient deployment with reduced memory footprint and faster inference while preserving model quality. The weights are quantized using FP8 (E4M3) format via NVIDIA's ModelOpt toolkit.
A major focus during training was the Indian context and languages, resulting in **state-of-the-art performance across 22 Indian languages** for its model size.
Sarvam-30B is open-sourced under the **Apache License**. For more details, see our [blog](https://www.sarvam.ai/blogs/sarvam-30b-105b).
## Architecture
The 30B MoE model is designed for throughput and memory efficiency. It uses 19 layers, a dense FFN `intermediate_size` of 8192, `moe_intermediate_size` of 1024, top-6 routing, grouped KV heads (`num_key_value_heads=4`), and an extremely high rope_theta (`8e6`) for long-context stability without RoPE scaling. It has 128 experts with a shared expert, a routed scaling factor of 2.5, and auxiliary-loss-free router balancing. The 30B model focuses on throughput and memory efficiency through fewer layers, grouped KV attention, and smaller experts.
## Inference
SGLang
**Install latest SGLang from source**
```bash
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```
**Launch Server**
```bash
sglang serve --model-path sarvamai/sov_30b_fp8 \
--port 3002 --host 0.0.0.0 \
--mem-fraction-static 0.70 \
--trust-remote-code \
--tp 2 \
--enable-dp-attention --dp 2 \
--prefill-attention-backend fa3 \
--decode-attention-backend fa3 \
--ep 2 \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--quantization modelopt_fp8 \
--kv-cache-dtype fp8_e4m3
```
vLLM
Note: currently a PR is open for native support for the Sarvam models in vLLM ([link](https://github.com/vllm-project/vllm/pull/33942)). Therefore, we have 2 options here.
#### Option 1: install from source (hard)
* Use the custom fork here: [link](https://github.com/rahul-sarvam/vllm)
* Follow the instructions here to install from source: [link](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source)
#### Option 2: hot-patch (easy)
* Run [hotpatch_vllm.py](./hotpatch_vllm.py)
* This will do the following:
* install vllm=0.15.0
* add 2 model entries to `registry.py`
* download the model executors for `sarvam-105b` and `sarvam-30b`
Once this is done, you can launch the vLLM server.
> **Important**: You must set `VLLM_USE_FLASHINFER_MOE_FP8=0` as an environment variable, otherwise the server will get stuck during compilation and crash.
```bash
VLLM_USE_FLASHINFER_MOE_FP8=0 vllm serve sarvamai/sarvam-30b-fp8 \
--trust-remote-code \
--tensor-parallel-size 2 \
--quantization modelopt \
--kv-cache-dtype fp8 \
--port 3002
```
## Citation
```
@misc{sarvam_sovereign_models,
title = {Introducing Sarvam's Sovereign Models},
author = {{Sarvam Foundation Models Team}},
year = {2026},
howpublished = {\url{https://www.sarvam.ai/blogs/sarvam-30b-105b}},
note = {Accessed: 2026-03-03}
}
```