sarvam-30b-GGUF

GGUF quantizations of sarvamai/sarvam-30b for use with llama.cpp.

Note: This model requires a custom build of llama.cpp with support for the sarvam_moe architecture. See llama.cpp PR #20275, or build from the add-sarvam-moe branch.

Available Quantizations

| File | Quant | Size | BPW | Description |
|------|-------|------|-----|-------------|
| sarvam-30B-full-BF16.gguf | BF16 | ~64 GB | 16.00 | Full precision, no quantization |
| sarvam-30B-Q8_0.gguf | Q8_0 | ~34 GB | 8.50 | Highest quality quantization |
| sarvam-30B-Q6_K.gguf | Q6_K | ~26 GB | 6.57 | Great quality, fits in 32 GB VRAM |
| sarvam-30B-Q4_K_M.gguf | Q4_K_M | ~19 GB | 4.87 | Good balance of quality and size |
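As a rough sanity check, the sizes above follow from bits-per-weight times parameter count. A minimal sketch (real GGUF files also carry metadata, and not every tensor is stored at the headline BPW, so actual files run slightly larger):

```python
def gguf_size_gb(n_params: float, bpw: float) -> float:
    """Approximate GGUF file size: parameters x bits-per-weight, in GB."""
    return n_params * bpw / 8 / 1e9

# ~30B parameters at Q4_K_M's 4.87 BPW -> a bit over 18 GB,
# in line with the listed ~19 GB file
print(f"{gguf_size_gb(30e9, 4.87):.1f} GB")
```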

Model Details

  • Architecture: SarvamMoEForCausalLM (extension of BailingMoeForCausalLM)
  • Parameters: ~30B total
  • Layers: 19 (1 dense FFN + 18 MoE)
  • Experts: 128 routed (top-6 routing) + 1 shared expert
  • Gating: Sigmoid with zero-mean normalized expert bias, routed_scaling_factor=2.5
  • Attention: GQA with 64 heads, 4 KV heads, head_dim=64, combined QKV with QK RMSNorm
  • Activation: SwiGLU
  • Normalization: RMSNorm (eps=1e-6)
  • Vocab size: 262,144
  • Context length: 4,096 (base)
  • RoPE theta: 8,000,000
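The gating scheme above can be sketched as follows. This is a minimal NumPy illustration of sigmoid top-k routing with a selection-only expert bias, in the style of DeepSeek-V3/BailingMoe routers; the function name, the bias handling, and the weight normalization are assumptions, and the actual Sarvam implementation may differ in detail:

```python
import numpy as np

def route(router_logits: np.ndarray, expert_bias: np.ndarray,
          top_k: int = 6, scaling: float = 2.5):
    """Pick top_k of 128 experts via sigmoid scores plus a load-balancing bias.

    The bias influences which experts are *selected* but not the mixing
    weights applied to their outputs (assumed, as in similar MoE designs).
    """
    scores = 1.0 / (1.0 + np.exp(-router_logits))      # sigmoid gate scores
    selected = np.argsort(scores + expert_bias)[-top_k:]  # biased selection
    weights = scores[selected]
    weights = weights / weights.sum()                  # normalize over top-k
    return selected, weights * scaling                 # routed_scaling_factor
```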

Usage

# Interactive chat
llama-cli -m sarvam-30B-Q6_K.gguf -p "Hello, how are you?" -n 512 -ngl 99

# Server mode
llama-server -m sarvam-30B-Q6_K.gguf -ngl 99 -c 4096

VRAM Requirements

| Quant | Size | GPU offload (32 GB card) |
|-------|------|--------------------------|
| Q4_K_M | ~19 GB | All layers |
| Q6_K | ~26 GB | All layers |
| Q8_0 | ~34 GB | ~70% of layers |
| BF16 | ~64 GB | ~50% of layers |
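A back-of-the-envelope way to pick `-ngl` for partial offload: divide the file size evenly across the 19 layers and fill whatever VRAM remains after a reserve for KV cache, activations, and CUDA overhead. Both the even-split assumption and the reserve size are rough assumptions, so treat the result as a starting point and adjust:

```python
import math

def layers_on_gpu(model_gb: float, n_layers: int = 19,
                  vram_gb: float = 32.0, reserve_gb: float = 6.0) -> int:
    """Estimate how many layers fit on the GPU (a heuristic, not a guarantee)."""
    budget = vram_gb - reserve_gb          # VRAM left for weights (assumed reserve)
    per_layer = model_gb / n_layers        # assumes layers are equally sized
    return min(n_layers, math.floor(budget / per_layer))

# e.g. the ~34 GB Q8_0 file on a 32 GB card -> pass the result as -ngl
print(layers_on_gpu(34.0))
```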

Tested On

  • NVIDIA RTX 5090 (32GB VRAM), CUDA 13.0
  • All quantizations produce coherent output

Credits

  • Original model by Sarvam AI
  • Quantized by Sumitc13
  • llama.cpp architecture support based on BailingMoe implementation