Instructions for using mygitphase/guhan-30b-fp8 with libraries and local apps.

Libraries

How to use mygitphase/guhan-30b-fp8 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="mygitphase/guhan-30b-fp8", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("mygitphase/guhan-30b-fp8", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use mygitphase/guhan-30b-fp8 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "mygitphase/guhan-30b-fp8"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mygitphase/guhan-30b-fp8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
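
The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server started above is listening on the default port 8000, and the helper names `build_chat_request` and `chat` are illustrative, not part of vLLM.

```python
import json
import urllib.request

# Assumed values: they mirror the curl example above (default vLLM port 8000).
BASE_URL = "http://localhost:8000/v1"
MODEL = "mygitphase/guhan-30b-fp8"


def build_chat_request(prompt, model=MODEL, base_url=BASE_URL):
    """Build (without sending) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=60):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-compatible servers return the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

Once the server is running, `print(chat("What is the capital of France?"))` should print the model's answer. For the SGLang server below, pass `base_url="http://localhost:30000/v1"` to `build_chat_request` instead.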

Use Docker

docker model run hf.co/mygitphase/guhan-30b-fp8

SGLang

How to use mygitphase/guhan-30b-fp8 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "mygitphase/guhan-30b-fp8" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mygitphase/guhan-30b-fp8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "mygitphase/guhan-30b-fp8" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "mygitphase/guhan-30b-fp8",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use mygitphase/guhan-30b-fp8 with Docker Model Runner:
```
docker model run hf.co/mygitphase/guhan-30b-fp8
```

guhan-30b-fp8 / README.md

mygitphase

Duplicate from sarvamai/sarvam-30b-fp8

3f7d43c 12 days ago

	---
	language:
	- en
	- hi
	- bn
	- ta
	- te
	- mr
	- gu
	- kn
	- ml
	- pa
	- or
	- as
	- ur
	- sa
	- ne
	- sd
	- kok
	- mai
	- doi
	- mni
	- sat
	- ks
	- bo
	library_name: transformers
	license: apache-2.0
	pipeline_tag: text-generation
	---

	![image](https://cdn-uploads.huggingface.co/production/uploads/60270a7c32856987162c641a/SivoCJWJqex41oprnwyuK.png)

	Want a bigger model? Download [Sarvam-105B](https://huggingface.co/sarvamai/sarvam-105b-fp8)!

	## Index

	1. [Introduction](#introduction)
	2. [Architecture](#architecture)
	3. [Inference](#inference)
	   - [SGLang](https://github.com/sgl-project/sglang)
	   - [vLLM](https://github.com/vllm-project/vllm)
	4. [Citation](#citation)

	## Introduction

	Sarvam-30B is an advanced Mixture-of-Experts (MoE) model with 2.4B non-embedding active parameters, designed primarily for practical deployment. It combines strong reasoning, reliable coding ability, and best-in-class conversational quality across Indian languages. Sarvam-30B is built to run reliably in resource-constrained environments and can handle multilingual voice calls while performing tool calls.

	This repository provides FP8-quantized weights for Sarvam-30B, enabling efficient deployment with a reduced memory footprint and faster inference while preserving model quality. The weights are quantized to the FP8 (E4M3) format with NVIDIA's ModelOpt toolkit.
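
To make the E4M3 format concrete, here is a small sketch that decodes a single FP8 byte into a Python float. It assumes the common "E4M3FN" variant (1 sign bit, 4 exponent bits, 3 mantissa bits, exponent bias 7, no infinities, one NaN pattern per sign); the README does not state which variant the checkpoint uses, and `decode_e4m3` is an illustrative helper, not a ModelOpt API.

```python
import math


def decode_e4m3(byte):
    """Decode one FP8 E4M3 byte (1 sign, 4 exponent, 3 mantissa bits, bias 7)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected an integer in the range 0..255")
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0xF and mantissa == 0x7:
        # Assumed E4M3FN convention: this is the only NaN pattern, no infinities.
        return math.nan
    if exponent == 0:
        # Subnormal numbers: no implicit leading 1, fixed exponent of 2**-6.
        return sign * (mantissa / 8) * 2.0 ** -6
    # Normal numbers: implicit leading 1 and a biased exponent.
    return sign * (1 + mantissa / 8) * 2.0 ** (exponent - 7)


# Largest finite magnitude under the assumed variant (bit pattern 0111 1110).
E4M3_MAX = decode_e4m3(0x7E)
```

With only three mantissa bits and a small finite range, FP8 tensors are normally stored together with scale factors, so quantization is a trade between memory and precision rather than a lossless conversion.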

	A major focus during training was the Indian context and languages, resulting in state-of-the-art performance across 22 Indian languages for its model size.

	Sarvam-30B is open-sourced under the Apache License 2.0. For more details, see our [blog](https://www.sarvam.ai/blogs/sarvam-30b-105b).

	## Architecture

	The 30B MoE model is designed for throughput and memory efficiency, which it achieves through fewer layers, grouped KV attention, and smaller experts. It uses 19 layers, a dense FFN `intermediate_size` of 8192, a `moe_intermediate_size` of 1024, top-6 routing, grouped KV heads (`num_key_value_heads=4`), and a very high `rope_theta` (`8e6`) for long-context stability without RoPE scaling. It has 128 experts plus a shared expert, a routed scaling factor of 2.5, and auxiliary-loss-free router balancing.
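
As a rough illustration of what top-6 routing with a routed scaling factor means for a single token, consider the sketch below. Only the three constants come from the paragraph above. The rest is an assumption made for illustration: the function `route`, the choice to renormalize the selected scores so they sum to 1 before scaling, and the omission of the shared expert and of router balancing. The model's actual router may differ.

```python
# Values quoted in the Architecture paragraph above.
NUM_EXPERTS = 128
TOP_K = 6
ROUTED_SCALING_FACTOR = 2.5


def route(scores, top_k=TOP_K, scaling=ROUTED_SCALING_FACTOR):
    """Pick the top_k experts for one token and return {expert_id: weight}.

    `scores` holds one non-negative affinity per routed expert. The selected
    scores are renormalized to sum to 1 and then multiplied by `scaling`
    (an assumed convention, not confirmed by the model card).
    """
    if len(scores) < top_k:
        raise ValueError("need at least top_k expert scores")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = ranked[:top_k]
    total = sum(scores[i] for i in chosen)
    if total <= 0:
        raise ValueError("selected scores must have a positive sum")
    return {i: scaling * scores[i] / total for i in chosen}


# Fraction of routed experts that run for each token.
ACTIVE_FRACTION = TOP_K / NUM_EXPERTS
```

Because only 6 of the 128 routed experts run per token, the compute per token is far smaller than the total parameter count suggests, which is how a 30B model can have about 2.4B active non-embedding parameters.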

	## Inference

	<details>
	<summary>SGLang</summary>

	Install latest SGLang from source

	```bash
	git clone https://github.com/sgl-project/sglang.git
	cd sglang
	pip install -e "python[all]"
	```

	Launch Server

	```bash
	sglang serve --model-path sarvamai/sarvam-30b-fp8 \
	--port 3002 --host 0.0.0.0 \
	--mem-fraction-static 0.70 \
	--trust-remote-code \
	--tp 2 \
	--enable-dp-attention --dp 2 \
	--prefill-attention-backend fa3 \
	--decode-attention-backend fa3 \
	--ep 2 \
	--tool-call-parser glm45 \
	--reasoning-parser glm45 \
	--quantization modelopt_fp8 \
	--kv-cache-dtype fp8_e4m3
	```
	</details>
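
Since the SGLang server above is launched with a tool-call parser, requests can carry OpenAI-style tool definitions. The sketch below builds such a request body and reads tool calls out of a response. It assumes the server exposes the OpenAI-compatible `/v1/chat/completions` route on port 3002 and returns tool calls in the standard `tool_calls` field; the `get_weather` tool and both helper names are hypothetical.

```python
import json


def build_tool_call_payload(model, question):
    """Build a chat request body that offers one (hypothetical) tool."""
    return {
        # Use the same name that was passed to --model-path.
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool for illustration
                    "description": "Look up the current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
    }


def extract_tool_calls(response):
    """Return [(tool_name, arguments_dict), ...] from a chat completion body."""
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    # In the OpenAI format, arguments arrive as a JSON-encoded string.
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]
```

The payload can be serialized with `json.dumps` and posted to `http://localhost:3002/v1/chat/completions`; an empty list from `extract_tool_calls` means the model answered in plain text instead of calling a tool.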

	<details>
	<summary>vLLM</summary>

	Note: native support for the Sarvam models in vLLM is still pending in an open PR ([link](https://github.com/vllm-project/vllm/pull/33942)). Until it is merged, there are two options.

	#### Option 1: install from source (hard)

	* Use the custom fork here: [link](https://github.com/rahul-sarvam/vllm)
	* Follow the instructions here to install from source: [link](https://docs.vllm.ai/en/latest/getting_started/installation/gpu/index.html#build-wheel-from-source)

	#### Option 2: hot-patch (easy)

	* Run [hotpatch_vllm.py](./hotpatch_vllm.py), which does the following:
	  * installs `vllm==0.15.0`
	  * adds 2 model entries to `registry.py`
	  * downloads the model executors for `sarvam-105b` and `sarvam-30b`

	Once this is done, you can launch the vLLM server.

	> Important: You must set the environment variable `VLLM_USE_FLASHINFER_MOE_FP8=0`; otherwise the server gets stuck during compilation and crashes.

	```bash
	VLLM_USE_FLASHINFER_MOE_FP8=0 vllm serve sarvamai/sarvam-30b-fp8 \
	--trust-remote-code \
	--tensor-parallel-size 2 \
	--quantization modelopt \
	--kv-cache-dtype fp8 \
	--port 3002
	```
	</details>

	## Citation
	```
	@misc{sarvam_sovereign_models,
	title = {Introducing Sarvam's Sovereign Models},
	author = {{Sarvam Foundation Models Team}},
	year = {2026},
	howpublished = {\url{https://www.sarvam.ai/blogs/sarvam-30b-105b}},
	note = {Accessed: 2026-03-03}
	}
	```