---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: transformers
base_model: sarvamai/sarvam-30b
model_type: causal-lm
tags:
- llm
- mixture-of-experts
- vllm
- inference-optimization
- runtime-optimization
- efficient-ai
- production-ai
---
# Sarvam-30B Runtime-Optimized Inference System

## 1. Overview

This project presents a **runtime-optimized deployment system for Sarvam-30B**, a large-scale Mixture-of-Experts (MoE) language model, using vLLM.

The objective is to **improve inference efficiency, stability, and output quality** without modifying the model weights, making the system suitable for real-world deployment scenarios.

This work focuses on **system-level optimization rather than model-level compression**, demonstrating a practical and reliable approach to serving large LLMs in constrained environments.

---

## 2. Base Model

- Model: `sarvamai/sarvam-30b`
- Architecture: Mixture-of-Experts (MoE)
- Task: Text Generation
- Inference Engine: vLLM
- Hardware: Multi-GPU (Tensor Parallelism)

---

## 3. Problem Statement

During experimentation, two critical challenges were identified.

### 3.1 Reasoning Leakage

The model emits internal reasoning traces such as `<think>` tokens, which:

- Reduce readability
- Break structured output requirements
- Affect downstream usability

---

### 3.2 High Resource Consumption

Due to the MoE architecture:

- GPU memory utilization is high (~45 GB per GPU at baseline)
- The KV cache grows large with sequence length
- Inference efficiency is reduced under default settings

---
## 4. Approach

### 4.1 Inference-Time Optimization (Core Contribution)

Instead of modifying the weights (quantization/pruning), this system applies **runtime-level optimization** through the following vLLM settings (see the sketch after this list):

- `gpu-memory-utilization = 0.85`
- `max-model-len = 1024`
- `max-num-seqs = 4`
- `tensor-parallel-size = 4`
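For reference, the same settings can be expressed through vLLM's offline Python API. This is a minimal sketch, not the shipped launcher (the repository starts the server via `run.sh` and `vllm_config.yaml`):

```python
from vllm import LLM, SamplingParams

# Runtime-level settings only; the model weights are untouched.
llm = LLM(
    model="sarvamai/sarvam-30b",
    tensor_parallel_size=4,       # shard the MoE layers across 4 GPUs
    gpu_memory_utilization=0.85,  # reserve headroom below the 0.90 default
    max_model_len=1024,           # caps per-sequence KV-cache growth
    max_num_seqs=4,               # limits concurrent sequences per batch
)

outputs = llm.generate(
    ["Explain AI system resilience clearly."],
    SamplingParams(temperature=0.2, max_tokens=200),
)
print(outputs[0].outputs[0].text)
```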
### Impact:

- Reduced KV-cache pressure
- Improved GPU memory utilization
- Stable multi-GPU execution
- Consistent latency performance
---

### 4.2 Output Governance Pipeline

A deterministic **postprocessing layer** (`postprocess.py`) is introduced to control model outputs. This module (sketched below):

- Removes internal reasoning traces (`<think>...</think>`)
- Extracts the final answer segment
- Reformats the output into structured bullet points
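A minimal sketch of this kind of cleaning step, assuming a simple regex-based rule set (the shipped `postprocess.py` is the authoritative implementation; the bullet-reformatting rule here is illustrative):

```python
import re

def clean_output(raw: str) -> str:
    """Strip reasoning traces and reformat the remainder as bullets."""
    # Drop <think>...</think> spans, including multi-line ones.
    text = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    # Keep non-empty lines and render each as a bullet point.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(f"- {line}" for line in lines)

raw = "<think>chain of thought...</think>\nResilience means graceful degradation.\nIt also means fast recovery."
print(clean_output(raw))
# - Resilience means graceful degradation.
# - It also means fast recovery.
```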
### Impact:

- Clean, production-ready responses
- Improved readability
- Deterministic output format

---

## 5. Compression Strategies Evaluated

The following approaches were tested and rejected:

### Quantization (AWQ / GPTQ)

- Compatibility issues with the MoE architecture
- Output instability and degradation

### Pruning

- Severe degradation in generation quality
- Early stopping and incomplete outputs

### Distillation

- Not feasible due to dataset and compute constraints

---
### Final Decision

Runtime optimization was selected because it:

- Preserves the original model's accuracy
- Avoids architectural incompatibilities
- Provides stable and reproducible results

---
## 6. System Architecture

```
User Input
    ↓
vLLM Inference Engine
    ↓
Raw Model Output
    ↓
Postprocessing Layer
    ↓
Clean Structured Output
```

This forms an **Inference Optimization + Output Governance Pipeline**.
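As a sketch, the pipeline can be driven end to end in a few lines of Python, assuming the server is running locally on port 8000 (as in the API example below) and a cleaning function like the `clean_output` sketch from Section 4.2:

```python
import requests

def generate_clean(prompt: str) -> str:
    # Step 1: inference via the vLLM OpenAI-compatible endpoint.
    resp = requests.post(
        "http://127.0.0.1:8000/v1/chat/completions",
        json={
            "model": "sarvam-30b",
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 200,
            "temperature": 0.2,
        },
        timeout=120,
    )
    raw = resp.json()["choices"][0]["message"]["content"]
    # Step 2: output governance (clean_output from the Section 4.2 sketch).
    return clean_output(raw)
```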
---

## 7. Performance Results

| Metric | Observation |
|--------|-------------|
| Latency | ~0.4–1.5 s per request |
| GPU memory | ~8% reduction |
| Stability | Consistent across runs |
| Output quality | Clean and structured after postprocessing |

---
## 8. How to Run

```bash
bash run.sh
```

---

## 9. API Example

```bash
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sarvam-30b",
    "messages": [{"role": "user", "content": "Explain AI system resilience clearly."}],
    "max_tokens": 200,
    "temperature": 0.2
  }'
```
---

## 10. Files Included

* `run.sh` – server startup script
* `vllm_config.yaml` – optimized configuration
* `postprocess.py` – output cleaning pipeline
* `examples/` – raw vs. cleaned outputs
* `models/` – Sarvam-30B weights

---

## 11. Practical Impact

This system is designed for real-world AI deployments where:

* Large models must operate under GPU constraints
* Outputs must be clean and user-facing
* Internal reasoning traces are not acceptable
The approach demonstrates:

* Runtime optimization instead of weight modification
* Output governance instead of prompt engineering
* System-level control instead of model-level changes

Together, these choices mark a shift from model-centric to system-centric optimization, which is critical for scaling AI systems in real-world environments.
---

## 12. Key Insight

This work highlights that system-level optimization can outperform traditional model compression techniques by:

* Preserving model accuracy
* Improving inference efficiency
* Ensuring stable deployment
---

## 13. Conclusion

This work delivers a **deployment-ready, reproducible, and efficient inference system** for large-scale MoE models.

It demonstrates that combining **runtime optimization with output control** provides a practical and scalable alternative to conventional model compression approaches.

---

## 14. Limitations

* Does not reduce model size (the weights remain unchanged)
* Requires a multi-GPU setup
* Postprocessing is rule-based (not learned)

---

## 15. Future Work

* MoE-aware quantization techniques
* KV-cache compression methods
* Adaptive decoding strategies
* Edge-device-compatible distillation

---