---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: transformers
base_model: sarvamai/sarvam-30b
model_type: causal-lm
tags:
- llm
- mixture-of-experts
- vllm
- inference-optimization
- runtime-optimization
- efficient-ai
- production-ai
---
# Sarvam-30B Runtime-Optimized Inference System
## 1. Overview
This project presents a **runtime-optimized deployment system for Sarvam-30B**, a large-scale Mixture-of-Experts (MoE) language model, using vLLM.
The objective is to **improve inference efficiency, stability, and output quality** without modifying the model weights, making it suitable for real-world deployment scenarios.
This work focuses on **system-level optimization rather than model-level compression**, demonstrating a practical and reliable approach to handling large LLMs under constrained environments.
---
## 2. Base Model
- Model: `sarvamai/sarvam-30b`
- Architecture: Mixture-of-Experts (MoE)
- Task: Text Generation
- Inference Engine: vLLM
- Hardware: Multi-GPU (Tensor Parallelism)
---
## 3. Problem Statement
During experimentation, two critical challenges were identified:
### 3.1 Reasoning Leakage
The model generates internal reasoning traces such as `<think>` tokens, which:
- Reduce readability
- Break structured output requirements
- Affect downstream usability
---
### 3.2 High Resource Consumption
Due to the MoE architecture:
- High GPU memory utilization (~45GB per GPU baseline)
- Large KV-cache growth with sequence length
- Reduced inference efficiency under default settings
---
## 4. Approach
### 4.1 Inference-Time Optimization (Core Contribution)
Instead of modifying weights (quantization/pruning), this system applies **runtime-level optimization**:
- `gpu-memory-utilization = 0.85`
- `max-model-len = 1024`
- `max-num-seqs = 4`
- `tensor-parallel-size = 4`
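For reference, a minimal offline-inference sketch of how these values map onto vLLM's Python engine arguments (the actual deployment launches the OpenAI-compatible server via `run.sh`; the prompt and sampling values here are illustrative):
```python
from vllm import LLM, SamplingParams

# Engine arguments mirroring the runtime flags listed above.
llm = LLM(
    model="sarvamai/sarvam-30b",
    tensor_parallel_size=4,       # shard the model across 4 GPUs
    gpu_memory_utilization=0.85,  # leave headroom below vLLM's 0.90 default
    max_model_len=1024,           # cap context length to bound KV-cache growth
    max_num_seqs=4,               # limit concurrently batched sequences
)

params = SamplingParams(temperature=0.2, max_tokens=200)
outputs = llm.generate(["Explain AI system resilience clearly."], params)
print(outputs[0].outputs[0].text)
```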
### Impact:
- Reduced KV-cache pressure
- Improved GPU memory utilization
- Stable multi-GPU execution
- Consistent latency performance
---
### 4.2 Output Governance Pipeline
A deterministic **postprocessing layer** (`postprocess.py`) is introduced to control model outputs.
This module:
- Removes internal reasoning traces (`<think>...</think>`)
- Extracts final answer segments
- Reformats output into structured bullet points
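A minimal illustrative sketch of such a cleaning step is shown below (the shipped `postprocess.py` may apply different rules; the `Final Answer:` marker is a hypothetical example of an answer delimiter):
```python
import re

def postprocess(raw: str) -> str:
    """Strip reasoning traces and reformat the answer as bullet points."""
    # Remove <think>...</think> blocks, including an unterminated trailing one.
    text = re.sub(r"<think>.*?(?:</think>|$)", "", raw, flags=re.DOTALL)
    # Keep only the final answer segment if an explicit marker is present (assumed marker).
    if "Final Answer:" in text:
        text = text.split("Final Answer:", 1)[1]
    # Reformat non-empty lines into a deterministic bullet list.
    lines = [ln.strip(" -*") for ln in text.splitlines() if ln.strip()]
    return "\n".join(f"- {ln}" for ln in lines)
```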
### Impact:
- Clean, production-ready responses
- Improved readability
- Deterministic output format
---
## 5. Compression Strategies Evaluated
The following approaches were tested and rejected:
### Quantization (AWQ / GPTQ)
- Compatibility issues with MoE architecture
- Output instability and degradation
### Pruning
- Severe degradation in generation quality
- Early stopping and incomplete outputs
### Distillation
- Not feasible due to dataset and compute constraints
---
### Final Decision
Runtime optimization was selected because it:
- Preserves original model accuracy
- Avoids architectural incompatibility
- Provides stable and reproducible results
---
## 6. System Architecture
User Input → vLLM Inference Engine → Raw Model Output → Postprocessing Layer → Clean Structured Output
This forms an **Inference Optimization + Output Governance Pipeline**.
---
## 7. Performance Results
| Metric | Observation |
|------|------------|
| Latency | ~0.4s – 1.5s |
| GPU Memory | ~8% reduction |
| Stability | Consistent across runs |
| Output Quality | Clean and structured after postprocessing |
---
## 8. How to Run
Start the server with the provided script, which launches vLLM with the optimized runtime configuration described in Section 4.1:
```bash
bash run.sh
```
---
## 9. API Example
```bash
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "sarvam-30b",
"messages": [{"role": "user", "content": "Explain AI system resilience clearly."}],
"max_tokens": 200,
"temperature": 0.2
}'
```
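The same request can be issued from Python and combined with the output-governance step from Section 4.2 (this sketch uses the `requests` library and assumes `postprocess.py` exposes a `postprocess()` function):
```python
import requests
from postprocess import postprocess  # output-governance step from Section 4.2 (assumed function name)

payload = {
    "model": "sarvam-30b",
    "messages": [{"role": "user", "content": "Explain AI system resilience clearly."}],
    "max_tokens": 200,
    "temperature": 0.2,
}
resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json=payload,
    timeout=60,
)
raw = resp.json()["choices"][0]["message"]["content"]
print(postprocess(raw))  # clean, structured output
```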
---
## 10. Files Included
* `run.sh` → server startup script
* `vllm_config.yaml` → optimized runtime configuration
* `postprocess.py` → output cleaning pipeline
* `examples/` → raw vs. cleaned outputs
* `models/` → Sarvam-30B weights
---
## 11. Practical Impact
This system is designed for real-world AI deployments where:
* Large models must operate under GPU constraints
* Outputs must be clean and user-facing
* Internal reasoning traces are not acceptable
The approach demonstrates:
* Runtime optimization instead of weight modification
* Output governance instead of prompt engineering
* System-level control instead of model-level changes
---
## 12. Key Insight
System-level optimization can outperform traditional compression techniques in maintaining output quality while improving efficiency, because it:
* Preserves model accuracy (weights remain untouched)
* Improves inference efficiency
* Ensures stable, reproducible deployment
---
## 13. Conclusion
This work delivers a **deployment-ready, reproducible, and efficient inference system** for large-scale MoE models.
It demonstrates that combining **runtime optimization with output control** provides a practical and scalable alternative to conventional model compression approaches.
---
## 14. Limitations
* Does not reduce model size (weights remain unchanged)
* Requires multi-GPU setup
* Postprocessing is rule-based (not learned)
---
## 15. Future Work
* MoE-aware quantization techniques
* KV-cache compression methods
* Adaptive decoding strategies
* Edge-device compatible distillation
---
## 16. Real-World Relevance
As outlined in Section 11, the target deployments are those where large language models must operate under strict GPU constraints and outputs must be clean, user-facing, and free of internal reasoning traces. More broadly, the solution demonstrates a shift from model-centric optimization to system-centric optimization, which is critical for scaling AI systems in real-world environments.
---