
Want a smaller model? Download Sarvam-30B!

Index

  1. Introduction
  2. Architecture
  3. Benchmarks
    • Knowledge & Coding
    • Reasoning & Math
    • Agentic
  4. Inference
    • Hugging Face
    • SGLang
    • vLLM
  5. Footnote
  6. Citation

Introduction

Sarvam-105B is an advanced Mixture-of-Experts (MoE) model with 10.3B active parameters, designed for superior performance across a wide range of complex tasks. It is highly optimized for complex reasoning, with particular strength in agentic tasks, mathematics, and coding.

Sarvam-105B is a top-tier performer, consistently matching or surpassing several major closed-source models and staying within a narrow margin of frontier models across diverse reasoning and agentic benchmarks. It demonstrates exceptional agentic and reasoning capabilities in real-world applications such as web search and technical troubleshooting.

A major focus during training was the Indian context and languages, resulting in state-of-the-art performance across 22 Indian languages for its model size.

Sarvam-105B is open-sourced under the Apache License. For more details, see our blog.

Architecture

The 105B model adopts an MLA-style attention stack with decoupled QK head dimensions (q_head_dim=192, split into RoPE and NoPE components; v_head_dim=128) and a large effective head_dim of 576, enabling higher representational bandwidth per head while keeping the hidden size at 4096. This improves attention expressivity and long-context extrapolation (via YaRN scaling with a factor of 40, for a 128K context). A dense intermediate_size of 16384 and a moe_intermediate_size of 2048, combined with top-8 routing over 128 experts, increase per-token active capacity while keeping activation cost manageable. The model also uses one shared expert, a routed scaling factor of 2.5, and auxiliary-loss-free router balancing.
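As a rough sketch, the MoE sizing described above can be written out as a config fragment. The field names below follow the DeepSeek-style naming convention these hyperparameters resemble and are an assumption, as is the 131072 value for max positions; only the numbers quoted in the paragraph come from the model description.

```python
# Hypothetical config excerpt; field names are assumed, values are from the text above.
config = {
    "hidden_size": 4096,
    "qk_head_dim": 192,                 # split into RoPE and NoPE components
    "v_head_dim": 128,
    "intermediate_size": 16384,         # dense FFN width
    "moe_intermediate_size": 2048,      # per-expert FFN width
    "n_routed_experts": 128,
    "num_experts_per_tok": 8,           # top-8 routing
    "n_shared_experts": 1,
    "routed_scaling_factor": 2.5,
    "rope_scaling": {"type": "yarn", "factor": 40},
    "max_position_embeddings": 131072,  # 128K context (assumed exact value)
}

# Per token, each MoE layer activates 8 routed experts plus 1 shared expert,
# so the active expert FFN width is (8 + 1) * 2048 = 18432 hidden units:
# comparable to the dense intermediate_size, but drawn from a 128-expert pool.
active_ffn_width = (
    config["num_experts_per_tok"] + config["n_shared_experts"]
) * config["moe_intermediate_size"]
print(active_ffn_width)  # 18432
```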

Benchmarks

Knowledge & Coding
| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
|---|---|---|---|---|
| Math500 | 98.6 | 97.2 | 97.0 | 98.2 |
| Live Code Bench v6 | 71.7 | 59.5 | 72.3 | 68.7 |
| MMLU | 90.6 | 87.3 | 90.0 | 90.0 |
| MMLU Pro | 81.7 | 81.4 | 80.8 | 82.7 |
| Writing Bench | 80.5 | 83.8 | 86.5 | 84.6 |
| Arena Hard v2 | 71.0 | 68.1 | 88.5 | 68.2 |
| IF Eval | 84.8 | 83.5 | 85.4 | 88.9 |
Reasoning & Math
| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
|---|---|---|---|---|
| GPQA Diamond | 78.7 | 75.0 | 80.1 | 77.2 |
| AIME 25 (w/ Tools) | 88.3 (96.7) | 83.3 | 90.0 | 87.8 |
| Beyond AIME | 69.1 | 61.5 | 51.0 | 68.0 |
| HMMT (Feb 25) | 85.8 | 69.2 | 90.0 | 73.9 |
| HMMT (Nov 25) | 85.8 | 75.0 | 90.0 | 80.0 |
Agentic
| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
|---|---|---|---|---|
| BrowseComp | 49.5 | 21.3 | - | 38.0 |
| SWE Bench Verified (SWE-Agent Harness) | 45.0 | 57.6 | 50.6 | 60.9 |
| τ² Bench (avg.) | 68.3 | 53.2 | 65.8 | 55.0 |

See footnote for evaluation details.

Inference

Hugging Face
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "sarvamai/sarvam-105b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, device_map="auto")

def generate_text(
    prompt: str,
    max_new_tokens: int = 2048,
    temperature: float = 0.8,
    top_p: float = 0.95,
    repetition_penalty: float = 1.0,
) -> str:
    # Use model.device so this works with device_map="auto" placement.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    generation_config = GenerationConfig(
        max_new_tokens=max_new_tokens,
        repetition_penalty=repetition_penalty,
        temperature=temperature,
        top_p=top_p,
        do_sample=True,
    )

    with torch.no_grad():
        output_ids = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            generation_config=generation_config,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

prompts = [
    "Which country won the FIFA World Cup in 2012?",
]

for prompt in prompts:
    templated_prompt = tokenizer.apply_chat_template(
      [{"role": "user", "content": prompt}],
      tokenize=False,
      add_generation_prompt=True,
      enable_thinking=True
    )
    output = generate_text(templated_prompt, max_new_tokens=512)
    print("Prompt: ", prompt)
    print("Generated text: ", output)
    print("=" * 100)
SGLang

Install the latest SGLang from source:

git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"

Instantiate the model and run:

import sglang as sgl
from transformers import AutoTokenizer

model_path = "sarvamai/sarvam-105b"
tokenizer = AutoTokenizer.from_pretrained(model_path)
engine = sgl.Engine(
    model_path=model_path,
    tp_size=4,
    mem_fraction_static=0.70,
    trust_remote_code=True,
    dtype="bfloat16",
    moe_runner_backend="flashinfer_cutedsl",
    prefill_attention_backend="fa3",
    decode_attention_backend="flashmla",
    disable_radix_cache=False,
)

sampling_params = {
    "temperature": 0.8,
    "max_new_tokens": 2048,
    "repetition_penalty": 1.0,
}

prompts = [
    "Which band released the album Dark Side of the Moon in 1973?",
]

outputs = engine.generate([
    tokenizer.apply_chat_template([
        {"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True)
        for prompt in prompts], 
    sampling_params)
for p, o in zip(prompts, outputs):
    print("Prompt: ", p)
    print("Generated text: ", o['text'])
    print("=" * 100)
vLLM

Note: a PR adding native support for the Sarvam models to vLLM is currently open (link), so there are two options.

Option 1: install from source (hard)

  • Use the custom fork here: link
  • Follow the instructions here to install from source: link

Option 2: hot-patch (easy)

  • Run hotpatch_vllm.py
  • This will do the following:
    • install vllm==0.15.0
    • add 2 model entries to registry.py
    • download the model executors for sarvam-105b and sarvam-30b

Once this is done, you can run vLLM as usual:

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_path = "sarvamai/sarvam-105b"
tokenizer = AutoTokenizer.from_pretrained(model_path)
llm = LLM(model=model_path, 
            trust_remote_code=True, 
            max_model_len=2048, 
            tensor_parallel_size=8, 
            max_num_seqs=16,
        )
sampling_params = SamplingParams(
                    temperature=0.8, 
                    max_tokens=2048, 
                    repetition_penalty=1.0,
                    spaces_between_special_tokens=True
                )

prompts = [
    "Which artist painted The Persistence of Memory (the melting clocks)?",
]

outputs = llm.generate([
    tokenizer.apply_chat_template([
        {"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True)
        for prompt in prompts], 
    sampling_params)
for p, o in zip(prompts, outputs):
    print("Prompt: ", p)
    print("Generated text: ", o.outputs[0].text)
    print("=" * 100)

Footnote

  • General settings: All benchmarks are evaluated with a maximum context length of 65,536 tokens.
  • Reasoning & Math benchmarks (Math500, MMLU, MMLU Pro, GPQA Diamond, AIME 25, Beyond AIME, HMMT): Evaluated with temperature=1.0, top_p=1.0, max_new_tokens=65536.
  • Coding & Knowledge benchmarks (Live Code Bench v6, Arena Hard v2, IF Eval): Evaluated with temperature=1.0, top_p=1.0, max_new_tokens=65536.
  • Writing Bench: Responses generated using official Writing-Bench parameters: temperature=0.7, top_p=0.8, top_k=20, max_length=16000. Scoring performed using the official Writing-Bench critic model with: temperature=1.0, top_p=0.95, max_length=2048.
  • Agentic benchmarks (BrowseComp, SWE Bench Verified, τ² Bench): Evaluated with temperature=0.5, top_p=1.0, max_new_tokens=32768.
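For convenience, the per-family decoding settings above can be collected into a single mapping. The key names follow Hugging Face `generate()` kwargs and the grouping mirrors the bullets above; this is a transcription sketch, not the official evaluation harness, and Writing Bench is omitted because it uses its own official parameters.

```python
# Decoding settings per benchmark family, transcribed from the footnote above.
# Key names follow Hugging Face generate() kwargs (an assumption for illustration).
EVAL_SETTINGS = {
    "reasoning_math":   {"temperature": 1.0, "top_p": 1.0, "max_new_tokens": 65536},
    "coding_knowledge": {"temperature": 1.0, "top_p": 1.0, "max_new_tokens": 65536},
    "agentic":          {"temperature": 0.5, "top_p": 1.0, "max_new_tokens": 32768},
}

# All families respect the global 65,536-token context ceiling.
MAX_CONTEXT_LEN = 65536
for family, params in EVAL_SETTINGS.items():
    assert params["max_new_tokens"] <= MAX_CONTEXT_LEN
```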

Citation

@misc{sarvam_sovereign_models,
  title        = {Introducing Sarvam's Sovereign Models},
  author       = {{Sarvam Foundation Models Team}},
  year         = {2026},
  howpublished = {\url{https://www.sarvam.ai/blogs/sarvam-30b-105b}},
  note         = {Accessed: 2026-03-03}
}