---
language:
- en
- hi
- bn
- ta
- te
- mr
- gu
- kn
- ml
- pa
- or
- as
- ur
- sa
- ne
- sd
- kok
- mai
- doi
- mni
- sat
- ks
- bo
library_name: transformers
license: apache-2.0
pipeline_tag: text-generation
---

Want a bigger model? Download [Sarvam-105B](https://huggingface.co/sarvamai/sarvam-105b-fp8)!

## Index

1. [Introduction](#introduction)
2. [Architecture](#architecture)
3. [Inference](#inference)
   - [SGLang](https://github.com/sgl-project/sglang)
   - [vLLM](https://github.com/vllm-project/vllm)
4. [Citation](#citation)

## Introduction

**Sarvam-30B** is a Mixture-of-Experts (MoE) model with 2.4B non-embedding active parameters, designed primarily for practical deployment. It combines strong reasoning, reliable coding ability, and best-in-class conversational quality across Indian languages. Sarvam-30B is built to run reliably in resource-constrained environments and can handle multilingual voice calls while making tool calls.

This repository provides **FP8 quantized weights** for Sarvam-30B, which reduce the memory footprint and speed up inference while preserving model quality. The weights are quantized to the FP8 (E4M3) format with NVIDIA's ModelOpt toolkit.
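
As a rough illustration of the memory saving, the sketch below compares weight storage at one byte per parameter (FP8) against a 16-bit baseline. The total parameter count of 30e9 is an assumption read off the model name, and the BF16 baseline is also an assumption; the estimate ignores quantization scales, any layers left unquantized, activations, and the KV cache.

```python
# Rough weight-memory estimate. All inputs are illustrative assumptions,
# not figures published for this checkpoint.
BYTES_PER_PARAM = {"bf16": 2, "fp8_e4m3": 1}

# Assumption: about 30e9 total parameters, inferred from the model name only.
ASSUMED_TOTAL_PARAMS = 30e9


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Return approximate weight storage in GiB for a parameter count and dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for dtype in ("bf16", "fp8_e4m3"):
        gib = weight_memory_gib(ASSUMED_TOTAL_PARAMS, dtype)
        print(f"{dtype}: ~{gib:.1f} GiB of weights")
```

Under these assumptions FP8 halves the weight storage relative to a 16-bit format; actual usage on a GPU will be higher once runtime buffers are included.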

A major focus during training was the Indian context and languages, resulting in **state-of-the-art performance across 22 Indian languages** for its model size.

Sarvam-30B is open-sourced under the **Apache License 2.0**. For more details, see our [blog](https://www.sarvam.ai/blogs/sarvam-30b-105b).

## Architecture

The 30B MoE model is designed for throughput and memory efficiency, which it achieves through fewer layers, grouped KV attention, and smaller experts. It uses 19 layers, a dense FFN `intermediate_size` of 8192, a `moe_intermediate_size` of 1024, top-6 routing, grouped KV heads (`num_key_value_heads=4`), and a very high `rope_theta` (`8e6`) for long-context stability without RoPE scaling. It has 128 experts plus a shared expert, a routed scaling factor of 2.5, and auxiliary-loss-free router balancing.
## Inference

<details>
<summary>SGLang</summary>

**Install the latest SGLang from source**

```bash
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```

**Launch the server**

```bash
# Serve the FP8 checkpoint on port 3002 with tensor, data, and expert parallelism of 2
sglang serve --model-path sarvamai/sarvam-30b-fp8 \
    --port 3002 --host 0.0.0.0 \
    --mem-fraction-static 0.70 \
    --trust-remote-code \
    --tp 2 \
    --enable-dp-attention --dp 2 \
    --prefill-attention-backend fa3 \
    --decode-attention-backend fa3 \
    --ep 2 \
    --tool-call-parser glm45 \
    --reasoning-parser glm45 \
    --quantization modelopt_fp8 \
    --kv-cache-dtype fp8_e4m3
```

</details>
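
Once the SGLang server is up, it can be queried through its OpenAI-compatible chat endpoint. The following is a minimal client sketch using only the Python standard library. The host, the default model name, and the helper names `build_chat_request` and `chat` are assumptions; the model name should match whatever was passed to `--model-path`, and the port should match `--port`.

```python
import json
import urllib.request

# Assumption: must match the --model-path given to the server.
DEFAULT_MODEL = "sarvamai/sarvam-30b-fp8"


def build_chat_request(prompt, host="localhost", port=3002, model=DEFAULT_MODEL):
    """Build a POST request for the OpenAI-compatible /v1/chat/completions route."""
    url = f"http://{host}:{port}/v1/chat/completions"
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt to a running server and return the reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With a server running locally, `chat("What is the capital of France?")` returns the assistant's reply as a string.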

<details>
<summary>vLLM</summary>

Note: native support for the Sarvam models in vLLM is still pending in an open PR ([link](https://github.com/vllm-project/vllm/pull/33942)). Until it is merged, there are two ways to run the model.

#### Option 1: install from source (hard)

* Use the custom fork here: [link](https://github.com/rahul-sarvam/vllm)
* Follow the instructions here to install from source: [link](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source)

#### Option 2: hot-patch (easy)

* Run [hotpatch_vllm.py](./hotpatch_vllm.py)
* This will do the following:
  * install `vllm==0.15.0`
  * add 2 model entries to `registry.py`
  * download the model executors for `sarvam-105b` and `sarvam-30b`

Once this is done, you can launch the vLLM server.

> **Important**: You must set the environment variable `VLLM_USE_FLASHINFER_MOE_FP8=0`; otherwise the server will get stuck during compilation and crash.

```bash
VLLM_USE_FLASHINFER_MOE_FP8=0 vllm serve sarvamai/sarvam-30b-fp8 \
    --trust-remote-code \
    --tensor-parallel-size 2 \
    --quantization modelopt \
    --kv-cache-dtype fp8 \
    --port 3002
```

</details>
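
If the vLLM server is started from a Python script instead of a shell, the required environment variable is easy to forget. The sketch below mirrors the shell command above and always sets `VLLM_USE_FLASHINFER_MOE_FP8=0`. The helper names `build_vllm_launch` and `launch_vllm` are hypothetical, and the sketch assumes the `vllm` executable is on `PATH` after one of the two install options.

```python
import os
import subprocess


def build_vllm_launch(model="sarvamai/sarvam-30b-fp8", tensor_parallel_size=2, port=3002):
    """Return (command, environment) equivalent to the documented shell invocation."""
    env = dict(os.environ)
    # Required per the note above; without it the server hangs during compilation.
    env["VLLM_USE_FLASHINFER_MOE_FP8"] = "0"
    command = [
        "vllm", "serve", model,
        "--trust-remote-code",
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--quantization", "modelopt",
        "--kv-cache-dtype", "fp8",
        "--port", str(port),
    ]
    return command, env


def launch_vllm(**kwargs):
    """Start the server as a child process and return the Popen handle."""
    command, env = build_vllm_launch(**kwargs)
    return subprocess.Popen(command, env=env)
```

Keeping command construction separate from process launch makes it possible to inspect or log the exact invocation before any GPU memory is allocated.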

## Citation

```
@misc{sarvam_sovereign_models,
  title        = {Introducing Sarvam's Sovereign Models},
  author       = {{Sarvam Foundation Models Team}},
  year         = {2026},
  howpublished = {\url{https://www.sarvam.ai/blogs/sarvam-30b-105b}},
  note         = {Accessed: 2026-03-03}
}
```