# sarvam-30b-GGUF

GGUF quantizations of sarvamai/sarvam-30b for use with llama.cpp.
> **Note:** This model requires a custom build of llama.cpp with `sarvam_moe` architecture support. See PR #20275 or build from the `add-sarvam-moe` branch.
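Getting the custom branch built might look like the sketch below. This assumes the standard llama.cpp CMake workflow; `<fork-url>` is a placeholder for whichever remote hosts the `add-sarvam-moe` branch, since the fork is not specified here.

```shell
# Sketch: build llama.cpp with sarvam_moe support (assumptions noted above).
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Fetch and check out the branch from the fork that carries it.
git fetch <fork-url> add-sarvam-moe
git checkout add-sarvam-moe

# Standard CUDA build; drop -DGGML_CUDA=ON for a CPU-only build.
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```

The resulting `llama-cli` and `llama-server` binaries land under `build/bin/`.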
## Available Quantizations

| File | Quant | Size | BPW | Description |
|---|---|---|---|---|
| sarvam-30B-full-BF16.gguf | BF16 | ~64 GB | 16.00 | Full precision, no quantization |
| sarvam-30B-Q8_0.gguf | Q8_0 | ~34 GB | 8.50 | Highest quality quantization |
| sarvam-30B-Q6_K.gguf | Q6_K | ~26 GB | 6.57 | Great quality, fits in 32GB VRAM |
| sarvam-30B-Q4_K_M.gguf | Q4_K_M | ~19 GB | 4.87 | Good balance of quality and size |
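The size and BPW columns are related by `file size ≈ n_params × BPW / 8` bytes. A quick sanity check, estimating the parameter count from the BF16 row (~64 GB at 16 bpw gives roughly 32B parameters, consistent with the "~30B total" figure; the exact count is an assumption here):

```python
# Sanity-check the quantization table: size ≈ n_params * BPW / 8 bytes.
GB = 1e9
n_params = 64 * GB * 8 / 16  # ~3.2e10, estimated from the BF16 row

for quant, bpw, size_gb in [("Q8_0", 8.50, 34), ("Q6_K", 6.57, 26), ("Q4_K_M", 4.87, 19)]:
    est = n_params * bpw / 8 / GB
    print(f"{quant}: estimated {est:.2f} GB (table says ~{size_gb} GB)")
```

Each estimate lands within about half a gigabyte of the listed size, so the table is internally consistent.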
## Model Details

- Architecture: `SarvamMoEForCausalLM` (extension of `BailingMoeForCausalLM`)
- Parameters: ~30B total
- Layers: 19 (1 dense FFN + 18 MoE)
- Experts: 128 routed (top-6 routing) + 1 shared expert
- Gating: sigmoid with zero-mean normalized expert bias, `routed_scaling_factor=2.5`
- Attention: GQA with 64 heads, 4 KV heads, head_dim=64, combined QKV with QK RMSNorm
- Activation: SwiGLU
- Normalization: RMSNorm (eps=1e-6)
- Vocab size: 262,144
- Context length: 4,096 (base)
- RoPE theta: 8,000,000
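The gating bullet above can be sketched roughly as follows. This is an illustrative reading of the description (sigmoid scores, a zero-mean bias that steers expert *selection* but not expert *weights*, top-6 routing, then scaling), not the reference implementation; the exact placement of the bias and normalization is an assumption.

```python
import numpy as np

N_EXPERTS, TOP_K, SCALING = 128, 6, 2.5  # values from the model card

rng = np.random.default_rng(0)
logits = rng.normal(size=N_EXPERTS)       # router logits for one token
expert_bias = rng.normal(size=N_EXPERTS)
expert_bias -= expert_bias.mean()         # zero-mean normalized expert bias

scores = 1.0 / (1.0 + np.exp(-logits))    # sigmoid gating, not softmax

# Assumption: bias affects which experts are selected, but the mixing
# weights come from the unbiased scores (DeepSeek-V3-style convention).
topk = np.argsort(scores + expert_bias)[-TOP_K:]
weights = scores[topk]
weights = weights / weights.sum()         # normalize over the selected experts
weights = weights * SCALING               # routed_scaling_factor = 2.5
```

Note that with sigmoid gating the selected weights do not sum to 1 before normalization, which is why an explicit normalization and scaling step appears here.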
## Usage

```bash
# Interactive chat
llama-cli -m sarvam-30B-Q6_K.gguf -p "Hello, how are you?" -n 512 -ngl 99

# Server mode
llama-server -m sarvam-30B-Q6_K.gguf -ngl 99 -c 4096
```
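Once `llama-server` is running, it exposes an OpenAI-compatible HTTP API that can be queried like this (assuming the default port 8080):

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
    "max_tokens": 128
  }'
```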
## VRAM Requirements

| Quant | Full GPU Offload | Notes |
|---|---|---|
| Q4_K_M | ~19 GB | All layers fit on a 24GB card |
| Q6_K | ~26 GB | All layers fit on a 32GB card |
| Q8_0 | ~34 GB | ~70% of layers on GPU (32GB card) |
| BF16 | ~64 GB | ~50% of layers on GPU (32GB card) |
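Beyond the weights, the KV cache also consumes VRAM, but with this model's geometry (19 layers, 4 KV heads, head_dim 64) it is small. A sketch of the estimate, assuming an f16 KV cache at 2 bytes per element:

```python
# KV cache ≈ 2 (K and V) * layers * kv_heads * head_dim * ctx * bytes_per_elem
layers, kv_heads, head_dim, ctx, bytes_f16 = 19, 4, 64, 4096, 2
kv_bytes = 2 * layers * kv_heads * head_dim * ctx * bytes_f16
print(f"{kv_bytes / 2**20:.0f} MiB")  # ≈ 76 MiB at the 4,096 base context
```

So at the base context length the table's weight sizes dominate, and the KV cache can largely be ignored when picking a quant.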
## Tested On

- NVIDIA RTX 5090 (32GB VRAM), CUDA 13.0
- All quantizations produce coherent output