The original model was converted from FP32 to mixed precision, with most weights cast to BF16 for memory and inference efficiency while keeping the MoE router (MoEGate) in FP32 to preserve routing stability and avoid precision-related issues.
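The selective cast described above can be sketched as follows. This is an illustrative sketch only, not the actual conversion script: the substring `"gate"` used to identify MoEGate router parameters is an assumption about the checkpoint's key names.

```python
import torch

def cast_mixed_precision(state_dict, keep_fp32_substrings=("gate",)):
    """Cast floating-point weights to BF16, keeping router weights in FP32.

    `keep_fp32_substrings` is a hypothetical way to match MoEGate parameters
    by key name; the real checkpoint's naming may differ.
    """
    out = {}
    for name, tensor in state_dict.items():
        if not tensor.is_floating_point():
            out[name] = tensor  # e.g. integer buffers stay untouched
        elif any(s in name for s in keep_fp32_substrings):
            out[name] = tensor.float()  # router stays in FP32 for stability
        else:
            out[name] = tensor.to(torch.bfloat16)
    return out
```

Keeping the router in FP32 costs little memory (the gate is tiny relative to the experts) while avoiding precision-sensitive routing flips.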
Want a smaller model? Download Sarvam-30B!
Index
- Introduction
- Architecture
- Benchmarks
- Knowledge & Coding
- Reasoning & Math
- Agentic
- Inference
- Hugging Face
- SGLang
- vLLM
- Footnote
- Citation
Introduction
Sarvam-105B is an advanced Mixture-of-Experts (MoE) model with 10.3B active parameters, designed for superior performance across a wide range of complex tasks. It is highly optimized for complex reasoning, with particular strength in agentic tasks, mathematics, and coding.
Sarvam-105B is a top-tier performer, consistently matching or surpassing several major closed-source models and staying within a narrow margin of frontier models across diverse reasoning and agentic benchmarks. It demonstrates exceptional agentic and reasoning capabilities in real-world applications such as web search and technical troubleshooting.
A major focus during training was the Indian context and languages, resulting in state-of-the-art performance across 22 Indian languages for its model size.
Sarvam-105B is open-sourced under the Apache License. For more details, see our blog.
Architecture
The 105B model adopts an MLA-style attention stack with decoupled QK head dimensions (q_head_dim=192, split into RoPE and NoPE components; v_head_dim=128) and a large effective head dimension of 576, enabling higher representational bandwidth per head while keeping the hidden size at 4096. This improves attention expressivity and long-context extrapolation (via YaRN scaling with a factor of 40 and a 128K context window). The dense intermediate_size (16384) and moe_intermediate_size (2048), combined with top-8 routing over 128 experts, increase per-token active capacity while keeping activation cost manageable. The model also uses one shared expert, a routed scaling factor of 2.5, and auxiliary-loss-free router balancing.
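The routing scheme above (top-8 over 128 experts, one always-active shared expert, routed output scaled by 2.5) can be sketched in NumPy. This is a minimal illustration of the technique, not the model's implementation; the function and argument names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(hidden, router_w, experts, shared_expert,
                top_k=8, routed_scaling_factor=2.5):
    """Sketch of top-k MoE routing with a shared expert.

    hidden: [tokens, d], router_w: [n_experts, d],
    experts / shared_expert: callables mapping [d] -> [d].
    """
    logits = hidden @ router_w.T                       # [tokens, n_experts]
    probs = softmax(logits)
    topk_idx = np.argsort(-probs, axis=-1)[:, :top_k]  # pick top-k experts
    out = np.zeros_like(hidden)
    for t in range(hidden.shape[0]):
        w = probs[t, topk_idx[t]]
        w = w / w.sum()                                # renormalize selected weights
        for wi, ei in zip(w, topk_idx[t]):
            out[t] += wi * experts[ei](hidden[t])
    # The shared expert is always applied; routed output is scaled (2.5 here).
    return shared_expert(hidden) + routed_scaling_factor * out
```

Per token, only `top_k` of the 128 expert FFNs run, which is what keeps the active-parameter count at 10.3B despite the much larger total.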
Benchmarks
Knowledge & Coding
| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
|---|---|---|---|---|
| Math500 | 98.6 | 97.2 | 97.0 | 98.2 |
| Live Code Bench v6 | 71.7 | 59.5 | 72.3 | 68.7 |
| MMLU | 90.6 | 87.3 | 90.0 | 90.0 |
| MMLU Pro | 81.7 | 81.4 | 80.8 | 82.7 |
| Writing Bench | 80.5 | 83.8 | 86.5 | 84.6 |
| Arena Hard v2 | 71.0 | 68.1 | 88.5 | 68.2 |
| IF Eval | 84.8 | 83.5 | 85.4 | 88.9 |
Reasoning & Math
| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
|---|---|---|---|---|
| GPQA Diamond | 78.7 | 75.0 | 80.1 | 77.2 |
| AIME 25 (w/ Tools) | 88.3 (96.7) | 83.3 | 90.0 | 87.8 |
| Beyond AIME | 69.1 | 61.5 | 51.0 | 68.0 |
| HMMT (Feb 25) | 85.8 | 69.2 | 90.0 | 73.9 |
| HMMT (Nov 25) | 85.8 | 75.0 | 90.0 | 80.0 |
Agentic
| Benchmark | Sarvam-105B | GLM-4.5-Air | GPT-OSS-120B | Qwen3-Next-80B-A3B-Thinking |
|---|---|---|---|---|
| BrowseComp | 49.5 | 21.3 | - | 38.0 |
| SWE Bench Verified (SWE-Agent Harness) | 45.0 | 57.6 | 50.6 | 60.9 |
| τ² Bench (avg.) | 68.3 | 53.2 | 65.8 | 55.0 |
See footnote for evaluation details.
Inference
Hugging Face
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig

model_name = "sarvamai/sarvam-105b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, device_map="auto"
)

def generate_text(
    prompt: str,
    max_new_tokens: int = 2048,
    temperature: float = 0.8,
    top_p: float = 0.95,
    repetition_penalty: float = 1.0,
) -> str:
    # Move inputs to the same device the model was placed on by device_map.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    generation_config = GenerationConfig(
        max_new_tokens=max_new_tokens,
        repetition_penalty=repetition_penalty,
        temperature=temperature,
        top_p=top_p,
        do_sample=True,
    )
    with torch.no_grad():
        output_ids = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            generation_config=generation_config,
        )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

prompts = [
    "Which country won the FIFA World Cup in 2012?",
]

for prompt in prompts:
    templated_prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=True,
    )
    output = generate_text(templated_prompt, max_new_tokens=512)
    print("Prompt: ", prompt)
    print("Generated text: ", output)
    print("=" * 100)
```
SGLang
Install the latest SGLang from source:

```shell
git clone https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
```
Instantiate the model and run:
```python
import sglang as sgl
from transformers import AutoTokenizer

model_path = "abhinand/sarvam-105b-bf16"

tokenizer = AutoTokenizer.from_pretrained(model_path)

engine = sgl.Engine(
    model_path=model_path,
    tp_size=4,
    mem_fraction_static=0.70,
    trust_remote_code=True,
    dtype="bfloat16",
    moe_runner_backend="flashinfer_cutedsl",
    prefill_attention_backend="fa3",
    decode_attention_backend="flashmla",
    disable_radix_cache=False,
)

sampling_params = {
    "temperature": 0.8,
    "max_new_tokens": 2048,
    "repetition_penalty": 1.0,
}

prompts = [
    "Which band released the album Dark Side of the Moon in 1973?",
]

outputs = engine.generate(
    [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=True,
        )
        for prompt in prompts
    ],
    sampling_params,
)

for p, o in zip(prompts, outputs):
    print("Prompt: ", p)
    print("Generated text: ", o["text"])
    print("=" * 100)
```
vLLM
Note: a PR is currently open to add native support for the Sarvam models in vLLM (link). Until it lands, there are two options.

Option 1: install vLLM from source (hard)

Option 2: hot-patch (easy)
- Run `hotpatch_vllm.py`, which will:
  - install `vllm==0.15.0`
  - add 2 model entries to `registry.py`
  - download the model executors for `sarvam-105b` and `sarvam-30b`

Once this is done, you can run vLLM as usual:
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_path = "abhinand/sarvam-105b-bf16"

tokenizer = AutoTokenizer.from_pretrained(model_path)

llm = LLM(
    model=model_path,
    trust_remote_code=True,
    max_model_len=2048,
    tensor_parallel_size=8,
    max_num_seqs=16,
)

sampling_params = SamplingParams(
    temperature=0.8,
    max_tokens=2048,
    repetition_penalty=1.0,
    spaces_between_special_tokens=True,
)

prompts = [
    "Which artist painted The Persistence of Memory (the melting clocks)?",
]

outputs = llm.generate(
    [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": prompt}],
            tokenize=False,
            add_generation_prompt=True,
            enable_thinking=True,
        )
        for prompt in prompts
    ],
    sampling_params,
)

for p, o in zip(prompts, outputs):
    print("Prompt: ", p)
    print("Generated text: ", o.outputs[0].text)
    print("=" * 100)
```
Footnote
- General settings: all benchmarks are evaluated with a maximum context length of 65,536 tokens.
- Reasoning & Math benchmarks (Math500, MMLU, MMLU Pro, GPQA Diamond, AIME 25, Beyond AIME, HMMT): evaluated with `temperature=1.0, top_p=1.0, max_new_tokens=65536`.
- Coding & Knowledge benchmarks (Live Code Bench v6, Arena Hard v2, IF Eval): evaluated with `temperature=1.0, top_p=1.0, max_new_tokens=65536`.
- Writing Bench: responses generated using the official Writing-Bench parameters `temperature=0.7, top_p=0.8, top_k=20, max_length=16000`; scoring performed using the official Writing-Bench critic model with `temperature=1.0, top_p=0.95, max_length=2048`.
- Agentic benchmarks (BrowseComp, SWE Bench Verified, τ² Bench): evaluated with `temperature=0.5, top_p=1.0, max_new_tokens=32768`.
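For reproducing the evaluations, the per-category settings above can be collected into a single mapping. The dict layout and key names here are just one convenient organization (an assumption, not an official config); the values are copied from the list above.

```python
# Evaluation sampling settings from the footnote, grouped by benchmark category.
EVAL_SETTINGS = {
    "reasoning_math":    {"temperature": 1.0, "top_p": 1.0, "max_new_tokens": 65536},
    "coding_knowledge":  {"temperature": 1.0, "top_p": 1.0, "max_new_tokens": 65536},
    "writing_bench_gen": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "max_length": 16000},
    "writing_bench_critic": {"temperature": 1.0, "top_p": 0.95, "max_length": 2048},
    "agentic":           {"temperature": 0.5, "top_p": 1.0, "max_new_tokens": 32768},
}
```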
Citation
```bibtex
@misc{sarvam_sovereign_models,
  title        = {Introducing Sarvam's Sovereign Models},
  author       = {{Sarvam Foundation Models Team}},
  year         = {2026},
  howpublished = {\url{https://www.sarvam.ai/blogs/sarvam-30b-105b}},
  note         = {Accessed: 2026-03-03}
}
```