--- language: - en - hi - bn - ta - te - mr - gu - kn - ml - pa - or - as - ur - sa - ne - sd - kok - mai - doi - mni - sat - ks - bo library_name: transformers license: apache-2.0 pipeline_tag: text-generation --- ![image](https://cdn-uploads.huggingface.co/production/uploads/60270a7c32856987162c641a/SivoCJWJqex41oprnwyuK.png) Want a bigger model? Download [Sarvam-105B](https://huggingface.co/sarvamai/sarvam-105b-fp8)! ## Index 1. [Introduction](#introduction) 2. [Architecture](#architecture) 3. [Inference](#inference) - [SGLang](https://github.com/sgl-project/sglang) - [vLLM](https://github.com/vllm-project/vllm) 4. [Citation](#citation) ## Introduction **Sarvam-30B** is an advanced Mixture-of-Experts (MoE) model with 2.4B non-embedding active parameters, designed primarily for practical deployment. It combines strong reasoning, reliable coding ability, and best-in-class conversational quality across Indian languages. Sarvam-30B is built to run reliably in resource-constrained environments and can handle multilingual voice calls while performing tool calls. This repository provides **FP8 quantized weights** for Sarvam-30B, enabling efficient deployment with reduced memory footprint and faster inference while preserving model quality. The weights are quantized using FP8 (E4M3) format via NVIDIA's ModelOpt toolkit. A major focus during training was the Indian context and languages, resulting in **state-of-the-art performance across 22 Indian languages** for its model size. Sarvam-30B is open-sourced under the **Apache License**. For more details, see our [blog](https://www.sarvam.ai/blogs/sarvam-30b-105b). ## Architecture The 30B MoE model is designed for throughput and memory efficiency. It uses 19 layers, a dense FFN `intermediate_size` of 8192, `moe_intermediate_size` of 1024, top-6 routing, grouped KV heads (`num_key_value_heads=4`), and an extremely high rope_theta (`8e6`) for long-context stability without RoPE scaling. It has 128 experts with a shared expert, a routed scaling factor of 2.5, and auxiliary-loss-free router balancing. The 30B model focuses on throughput and memory efficiency through fewer layers, grouped KV attention, and smaller experts. ## Inference
SGLang **Install latest SGLang from source** ```bash git clone https://github.com/sgl-project/sglang.git cd sglang pip install -e "python[all]" ``` **Launch Server** ```bash sglang serve --model-path sarvamai/sov_30b_fp8 \ --port 3002 --host 0.0.0.0 \ --mem-fraction-static 0.70 \ --trust-remote-code \ --tp 2 \ --enable-dp-attention --dp 2 \ --prefill-attention-backend fa3 \ --decode-attention-backend fa3 \ --ep 2 \ --tool-call-parser glm45 \ --reasoning-parser glm45 \ --quantization modelopt_fp8 \ --kv-cache-dtype fp8_e4m3 ```
vLLM Note: currently a PR is open for native support for the Sarvam models in vLLM ([link](https://github.com/vllm-project/vllm/pull/33942)). Therefore, we have 2 options here. #### Option 1: install from source (hard) * Use the custom fork here: [link](https://github.com/rahul-sarvam/vllm) * Follow the instructions here to install from source: [link](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source) #### Option 2: hot-patch (easy) * Run [hotpatch_vllm.py](./hotpatch_vllm.py) * This will do the following: * install vllm=0.15.0 * add 2 model entries to `registry.py` * download the model executors for `sarvam-105b` and `sarvam-30b` Once this is done, you can launch the vLLM server. > **Important**: You must set `VLLM_USE_FLASHINFER_MOE_FP8=0` as an environment variable, otherwise the server will get stuck during compilation and crash. ```bash VLLM_USE_FLASHINFER_MOE_FP8=0 vllm serve sarvamai/sarvam-30b-fp8 \ --trust-remote-code \ --tensor-parallel-size 2 \ --quantization modelopt \ --kv-cache-dtype fp8 \ --port 3002 ```
## Citation ``` @misc{sarvam_sovereign_models, title = {Introducing Sarvam's Sovereign Models}, author = {{Sarvam Foundation Models Team}}, year = {2026}, howpublished = {\url{https://www.sarvam.ai/blogs/sarvam-30b-105b}}, note = {Accessed: 2026-03-03} } ```