---
license: apache-2.0
language:
  - en
pipeline_tag: text-generation
library_name: transformers
base_model: sarvamai/sarvam-30b
model_type: causal-lm
tags:
  - llm
  - mixture-of-experts
  - vllm
  - inference-optimization
  - runtime-optimization
  - efficient-ai
  - production-ai
---

Sarvam-30B Runtime-Optimized Inference System

1. Overview

This project presents a runtime-optimized deployment system for Sarvam-30B, a large-scale Mixture-of-Experts (MoE) language model, using vLLM.

The objective is to improve inference efficiency, stability, and output quality without modifying the model weights, making it suitable for real-world deployment scenarios.

This work focuses on system-level optimization rather than model-level compression, demonstrating a practical and reliable approach to serving large LLMs in resource-constrained environments.


2. Base Model

  • Model: sarvamai/sarvam-30b
  • Architecture: Mixture-of-Experts (MoE)
  • Task: Text Generation
  • Inference Engine: vLLM
  • Hardware: Multi-GPU (Tensor Parallelism)

3. Problem Statement

During experimentation, two critical challenges were identified:

3.1 Reasoning Leakage

The model generates internal reasoning traces wrapped in <think>...</think> tags, which:

  • Reduce readability
  • Break structured output requirements
  • Affect downstream usability

3.2 High Resource Consumption

Due to the MoE architecture:

  • High GPU memory utilization (~45 GB per GPU at baseline)
  • Large KV-cache growth with sequence length (see the sizing sketch after this list)
  • Reduced inference efficiency under default settings
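
For intuition, here is a rough, back-of-envelope sizing sketch of the KV cache. The layer, head, and dimension values are illustrative placeholders, not Sarvam-30B's actual configuration; only the formula itself is standard.

# Approximate KV-cache footprint: 2 (keys and values) x layers x KV heads
# x head dimension x bytes per element, per token per sequence.
# All model dimensions below are placeholder assumptions for illustration.
num_layers   = 48
num_kv_heads = 8
head_dim     = 128
bytes_fp16   = 2

def kv_cache_bytes(seq_len: int, num_seqs: int) -> int:
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_fp16
    return per_token * seq_len * num_seqs

# Bounding the sequence length and the number of concurrent sequences
# caps the cache, which is what the runtime settings in Section 4.1 exploit.
print(f"{kv_cache_bytes(1024, 4) / 1e9:.2f} GB upper bound")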

4. Approach

4.1 Inference-Time Optimization (Core Contribution)

Instead of modifying the weights (quantization or pruning), this system applies runtime-level optimization through the following vLLM settings (see the sketch after this list):

  • gpu-memory-utilization = 0.85
  • max-model-len = 1024
  • max-num-seqs = 4
  • tensor-parallel-size = 4
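
For reference, a minimal offline-inference sketch that applies the same runtime parameters through vLLM's Python API; the deployed system starts a server via run.sh, so treat this as an equivalent illustration rather than the production entry point.

from vllm import LLM, SamplingParams

# Same runtime settings as the serving configuration listed above.
llm = LLM(
    model="sarvamai/sarvam-30b",
    tensor_parallel_size=4,       # shard the model across 4 GPUs
    gpu_memory_utilization=0.85,  # leave headroom for activations
    max_model_len=1024,           # cap context length to bound KV-cache growth
    max_num_seqs=4,               # limit concurrent sequences per scheduling step
)

params = SamplingParams(temperature=0.2, max_tokens=200)
outputs = llm.generate(["Explain AI system resilience clearly."], params)
print(outputs[0].outputs[0].text)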

Impact:

  • Reduced KV-cache pressure
  • Improved GPU memory utilization
  • Stable multi-GPU execution
  • Consistent latency performance

4.2 Output Governance Pipeline

A deterministic postprocessing layer (postprocess.py) is introduced to control model outputs.

This module (see the sketch after this list):

  • Removes internal reasoning traces (<think>...</think>)
  • Extracts final answer segments
  • Reformats output into structured bullet points
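
A minimal sketch of the kind of logic postprocess.py applies; the function name and the sentence-to-bullet heuristic are illustrative assumptions rather than the exact shipped code.

import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def clean_output(raw: str) -> str:
    """Strip reasoning traces and reformat the remainder into bullet points."""
    # Removing the <think>...</think> trace leaves the final answer segment.
    text = THINK_BLOCK.sub("", raw).strip()
    # Reformat the answer into structured bullet points, one per sentence.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return "\n".join(f"- {s}" for s in sentences)

if __name__ == "__main__":
    raw = "<think>internal reasoning</think>Resilience means graceful degradation. It also means fast recovery."
    print(clean_output(raw))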

Impact:

  • Clean, production-ready responses
  • Improved readability
  • Deterministic output format

5. Compression Strategies Evaluated

The following approaches were tested and rejected:

Quantization (AWQ / GPTQ)

  • Compatibility issues with MoE architecture
  • Output instability and degradation

Pruning

  • Severe degradation in generation quality
  • Early stopping and incomplete outputs

Distillation

  • Not feasible due to dataset and compute constraints

Final Decision

Runtime optimization was selected because:

  • Preserves original model accuracy
  • Avoids architectural incompatibility
  • Provides stable and reproducible results

6. System Architecture

User Input
→ vLLM Inference Engine
→ Raw Model Output
→ Postprocessing Layer
→ Clean Structured Output

This forms an Inference Optimization + Output Governance Pipeline.


7. Performance Results

  • Latency: ~0.4 s – 1.5 s
  • GPU memory: ~8% reduction
  • Stability: Consistent across runs
  • Output quality: Clean and structured after postprocessing

8. How to Run

bash run.sh

9. API Example

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sarvam-30b",
    "messages": [{"role": "user", "content": "Explain AI system resilience clearly."}],
    "max_tokens": 200,
    "temperature": 0.2
  }'
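
The same request from Python, chained with the output-governance step. The requests usage targets the standard OpenAI-compatible route exposed by vLLM; importing clean_output from postprocess assumes the illustrative interface sketched in Section 4.2.

import requests
from postprocess import clean_output  # assumed entry point of postprocess.py

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "sarvam-30b",
        "messages": [{"role": "user", "content": "Explain AI system resilience clearly."}],
        "max_tokens": 200,
        "temperature": 0.2,
    },
    timeout=60,
)
raw = resp.json()["choices"][0]["message"]["content"]
print(clean_output(raw))  # strip reasoning traces, return structured bullets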

10. Files Included

  • run.sh → server startup script
  • vllm_config.yaml → optimized configuration
  • postprocess.py → output cleaning pipeline
  • examples/ → raw vs. cleaned outputs
  • models/ → Sarvam-30B weights

11. Practical Impact

This system is designed for real-world AI deployments where:

  • Large models must operate under GPU constraints
  • Outputs must be clean and user-facing
  • Internal reasoning traces are not acceptable

The approach demonstrates:

  • Runtime optimization instead of weight modification
  • Output governance instead of prompt engineering
  • System-level control instead of model-level changes

12. Key Insight

System-level optimization can outperform traditional model compression techniques by:

  • Preserving model accuracy and output quality
  • Improving inference efficiency
  • Ensuring stable deployment

13. Conclusion

This work delivers a deployment-ready, reproducible, and efficient inference system for large-scale MoE models.

It demonstrates that combining runtime optimization with output control provides a practical and scalable alternative to conventional model compression approaches.


14. Limitations

  • Does not reduce model size (weights remain unchanged)
  • Requires multi-GPU setup
  • Postprocessing is rule-based (not learned)

15. Future Work

  • MoE-aware quantization techniques
  • KV-cache compression methods
  • Adaptive decoding strategies
  • Edge-device compatible distillation

16. Real-World Relevance

The deployment constraints outlined in Section 11 (strict GPU budgets, clean user-facing outputs, and no exposed reasoning traces) are typical of production systems. For such scenarios, the solution demonstrates a shift from model-centric optimization to system-centric optimization, which is critical for scaling AI systems in real-world environments.