sarvam-105b-AWQ

Model Overview

  • Model Architecture: sarvamai/sarvam-105b
    • Input: Text
    • Output: Text
  • Model Optimizations:
    • Weight quantization: AWQ
  • Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws).
  • Version: 1.0
  • Model Developers: QuantTrio

This model was quantized with llm-compressor, using sarvamai/indivibe as the calibration dataset.
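AWQ stores weights as low-bit integers in small groups, each with its own scale. As a rough illustration of the idea (a simplified sketch, not the llm-compressor/AWQ implementation, which also rescales salient channels using activation statistics from the calibration set):

```python
# Illustrative group-wise symmetric int4 weight quantization, in the
# spirit of AWQ. Simplified sketch only -- NOT the actual AWQ algorithm.

def quantize_group(weights, bits=4):
    """Quantize one group of weights to signed ints with one shared scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for int4
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def dequantize_group(q, scale):
    """Recover approximate float weights from ints and the group scale."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.30, -0.08]
q, scale = quantize_group(weights)
restored = dequantize_group(q, scale)
errs = [abs(a - b) for a, b in zip(weights, restored)]
```

The per-group rounding error is bounded by half the scale, which is why AWQ-style 4-bit weights retain most of the full-precision model's quality.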

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend.

1: Hot-patch (easy)

Run hotpatch_vllm.py. The script will:

  • install vllm==0.15.0
  • add 2 model entries to vLLM's registry.py
  • download the model executors for sarvam-105b
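The exact contents of hotpatch_vllm.py ship with this repo; as a rough sketch of the registry step, vLLM's model registry maps an architecture name to a (module, class) pair, so adding entries amounts to something like the following (the Sarvam entry names below are placeholder assumptions, not the real identifiers):

```python
# Hypothetical sketch of the "add 2 model entries to registry.py" step.
# vLLM keeps a dict mapping architecture name -> (module, class); the
# Sarvam names here are illustrative placeholders.

registry = {
    # ...existing vLLM entries...
    "LlamaForCausalLM": ("llama", "LlamaForCausalLM"),
}

def register_model(registry, arch, module, cls):
    """Add one architecture -> executor mapping, refusing to clobber."""
    if arch in registry:
        raise ValueError(f"{arch} already registered")
    registry[arch] = (module, cls)

# The two entries the hot-patch would add (names are placeholders):
register_model(registry, "SarvamForCausalLM", "sarvam", "SarvamForCausalLM")
register_model(registry, "SarvamMoeForCausalLM", "sarvam_moe", "SarvamMoeForCausalLM")
```

Refusing to overwrite an existing entry keeps a re-run of the patch from silently shadowing a model vLLM already supports.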

2: Run vLLM

export OMP_NUM_THREADS=4

vllm serve \
    __YOUR_PATH__/QuantTrio/sarvam-105b-AWQ \
    --served-model-name MY_MODEL \
    --swap-space 16 \
    --max-num-seqs 32 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 4 \
    --enable-auto-tool-choice \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 8000
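Once serving, the endpoint speaks the OpenAI-compatible chat API. A minimal request payload looks like this; "MY_MODEL" must match the --served-model-name flag, and the URL assumes the --host/--port values from the command above:

```python
import json

# Minimal OpenAI-compatible chat-completions payload for the server above.
url = "http://0.0.0.0:8000/v1/chat/completions"
payload = {
    "model": "MY_MODEL",                 # matches --served-model-name
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128,
    "temperature": 0.7,
}
body = json.dumps(payload)
# POST `body` to `url` with e.g. urllib.request or the openai client.
```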

Model Files

File size: 74 GiB
Last updated: 2026-03-12

Logs

2026-03-12
1. Initial commit
Safetensors

  • Model size: 19B params
  • Tensor types: F32, I64, I32

Model tree for QuantTrio/sarvam-105b-AWQ

  • Quantized from: sarvamai/sarvam-105b