Use with the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

# any filename from "Available Quantizations" below works here
llm = Llama.from_pretrained(
    repo_id="Sumitc13/sarvam-30b-GGUF",
    filename="sarvam-30B-Q4_K_M.gguf",
)

response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ]
)
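
create_chat_completion returns an OpenAI-style completion dict; the generated reply is at response["choices"][0]["message"]["content"].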

sarvam-30b-GGUF

GGUF quantizations of sarvamai/sarvam-30b for use with llama.cpp.

Compatibility: These GGUFs use the bailingmoe2 architecture. Make sure you're on a recent version of llama.cpp (or any tool that bundles llama.cpp like LM Studio / Ollama / koboldcpp) for full tokenizer support across English + 22 Indian languages.
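
If you just want to fetch a single file without loading it, the standard huggingface_hub download call works (a minimal sketch; substitute any filename from the table below):

from huggingface_hub import hf_hub_download

# downloads one GGUF into the local Hugging Face cache and returns its path
path = hf_hub_download(
    repo_id="Sumitc13/sarvam-30b-GGUF",
    filename="sarvam-30B-Q6_K.gguf",
)
print(path)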

Available Quantizations

| File | Quant | Size | BPW | Description |
|------|-------|------|-----|-------------|
| sarvam-30B-full-BF16.gguf | BF16 | ~64 GB | 16.00 | Full precision, no quantization |
| sarvam-30B-Q8_0.gguf | Q8_0 | ~34 GB | 8.50 | Highest-quality quantization |
| sarvam-30B-Q6_K.gguf | Q6_K | ~26 GB | 6.57 | Great quality, fits in 32 GB VRAM |
| sarvam-30B-Q4_K_M.gguf | Q4_K_M | ~19 GB | 4.87 | Good balance of quality and size |

Model Details

  • Architecture: SarvamMoEForCausalLM, converted as bailingmoe2 (the architectures are equivalent; Sarvam uses full rotary and zero-mean normalized expert bias, handled at conversion time)
  • Parameters: ~30B total
  • Layers: 19 (1 dense FFN + 18 MoE)
  • Experts: 128 routed (top-6 routing) + 1 shared expert
  • Gating: Sigmoid with zero-mean normalized expert bias, routed_scaling_factor=2.5 (see the sketch after this list)
  • Attention: GQA with 64 heads, 4 KV heads, head_dim=64, combined QKV with QK RMSNorm
  • Activation: SwiGLU
  • Normalization: RMSNorm (eps=1e-6)
  • Vocab size: 262,144
  • Context length: up to 131,072 tokens (the model's trained context); the chat template defaults to 16,384
  • RoPE theta: 8,000,000
  • Tokenizer: SentencePiece-style BPE, supports English + 22 Indian languages with byte fallback
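
As a mental model for the gating above, here is a minimal NumPy sketch, under the assumption that it follows the DeepSeek-style sigmoid router that bailingmoe2 implements (expert selection uses bias-adjusted scores, mixture weights use the raw sigmoid scores; this is an illustration, not llama.cpp's actual code):

import numpy as np

def route_token(h, gate_w, expert_bias, top_k=6, scaling=2.5):
    # h: (hidden,), gate_w: (n_experts, hidden), expert_bias: (n_experts,)
    logits = gate_w @ h
    scores = 1.0 / (1.0 + np.exp(-logits))       # sigmoid gating
    bias = expert_bias - expert_bias.mean()      # zero-mean normalized expert bias
    chosen = np.argsort(scores + bias)[-top_k:]  # top-6 selection on biased scores
    weights = scores[chosen] / scores[chosen].sum()
    return chosen, weights * scaling             # routed_scaling_factor = 2.5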

Tokenizer parity

All 50/50 probes were verified to match the Hugging Face reference tokenizer exactly across English, Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia, Assamese, Urdu, Sindhi, Nepali, Sanskrit, Maithili, Konkani, Manipuri, Bodo, Santali, Kashmiri, Dogri, plus mixed-script and edge-case probes.
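
A check of this kind can be reproduced with llama-cpp-python's vocab_only mode (a sketch, not the exact probe harness used here; it assumes access to the original sarvamai/sarvam-30b tokenizer):

from llama_cpp import Llama
from transformers import AutoTokenizer

hf_tok = AutoTokenizer.from_pretrained("sarvamai/sarvam-30b")
gguf = Llama.from_pretrained(
    repo_id="Sumitc13/sarvam-30b-GGUF",
    filename="sarvam-30B-Q4_K_M.gguf",
    vocab_only=True,  # load only the tokenizer, skip the weights
)

probe = "नमस्ते, how are you?"  # mixed Hindi/English probe
hf_ids = hf_tok.encode(probe, add_special_tokens=False)
gguf_ids = gguf.tokenize(probe.encode("utf-8"), add_bos=False)
assert hf_ids == gguf_ids, (hf_ids, gguf_ids)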

Usage

# Interactive chat
llama-cli -m sarvam-30B-Q6_K.gguf -p "Hello, how are you?" -n 512 -ngl 99

# Server mode (use --jinja for the embedded chat template)
llama-server -m sarvam-30B-Q6_K.gguf -ngl 99 -c 16384 --jinja
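
Since llama-server exposes an OpenAI-compatible API, the running server can also be queried from Python (a sketch using the openai client; the model name is arbitrary because the server hosts a single model):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="sarvam-30b",  # ignored by llama-server
    messages=[{"role": "user", "content": "Hello, how are you?"}],
)
print(resp.choices[0].message.content)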

Thinking mode

The model supports enable_thinking via the chat template. When using llama-server --jinja:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [{"role": "user", "content": "What is 25 * 37?"}],
  "max_tokens": 1024,
  "chat_template_kwargs": {"enable_thinking": true}
}'
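
The same request from Python, passing chat_template_kwargs in the JSON body (a sketch using the requests library):

import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What is 25 * 37?"}],
        "max_tokens": 1024,
        "chat_template_kwargs": {"enable_thinking": True},
    },
)
print(resp.json()["choices"][0]["message"]["content"])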

Note: enable_thinking=false adds the <|nothink|> token but the base model may still emit <think>...</think> (upstream model issue, see sarvamai/sarvam-30b#11). reasoning_effort is not in the public chat template and is silently ignored.

VRAM Requirements

| Quant | VRAM (full offload) | Offload notes |
|-------|---------------------|---------------|
| Q4_K_M | ~19 GB | All layers on GPU |
| Q6_K | ~26 GB | All layers on GPU (32 GB cards) |
| Q8_0 | ~34 GB | ~70% of layers on GPU (32 GB cards) |
| BF16 | ~64 GB | ~50% of layers on GPU (32 GB cards) |
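
With llama-cpp-python, partial offload is controlled by n_gpu_layers (a sketch; 13 of the 19 layers is a rough 70% starting point for Q8_0 on a 32 GB card, tune to your VRAM):

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Sumitc13/sarvam-30b-GGUF",
    filename="sarvam-30B-Q8_0.gguf",
    n_gpu_layers=13,  # raise or lower until the model fits
)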

Tested On

  • NVIDIA RTX 5090 (32GB VRAM), CUDA 13.0
  • All quantizations produce coherent output across English and Indian languages
  • Q6_K runs at ~305 tokens/sec generation speed

Credits

  • Original model by Sarvam AI
  • Quantized by Sumitc13
  • Architecture reuses bailingmoe2 already in llama.cpp (PR #20275 adds the Sarvam-specific tokenizer)