---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
library_name: transformers
base_model: sarvamai/sarvam-30b
model_type: causal-lm
tags:
- llm
- mixture-of-experts
- vllm
- inference-optimization
- runtime-optimization
- efficient-ai
- production-ai
---
# Sarvam-30B Runtime-Optimized Inference System
## 1. Overview
This project presents a **runtime-optimized deployment system for Sarvam-30B**, a large-scale Mixture-of-Experts (MoE) language model, using vLLM.
The objective is to **improve inference efficiency, stability, and output quality** without modifying the model weights, making it suitable for real-world deployment scenarios.
This work focuses on **system-level optimization rather than model-level compression**, demonstrating a practical and reliable approach to handling large LLMs under constrained environments.
---
## 2. Base Model
- Model: `sarvamai/sarvam-30b`
- Architecture: Mixture-of-Experts (MoE)
- Task: Text Generation
- Inference Engine: vLLM
- Hardware: Multi-GPU (Tensor Parallelism)
---
## 3. Problem Statement
During experimentation, two critical challenges were identified:
### 3.1 Reasoning Leakage
The model generates internal reasoning traces such as `<think>` tokens, which:
- Reduce readability
- Break structured output requirements
- Affect downstream usability
---
### 3.2 High Resource Consumption
Due to the MoE architecture:
- High GPU memory utilization (~45GB per GPU baseline)
- Large KV-cache growth with sequence length
- Reduced inference efficiency under default settings
---
## 4. Approach
### 4.1 Inference-Time Optimization (Core Contribution)
Instead of modifying weights (quantization/pruning), this system applies **runtime-level optimization**:
- `gpu-memory-utilization = 0.85`
- `max-model-len = 1024`
- `max-num-seqs = 4`
- `tensor-parallel-size = 4`
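For reference, a minimal offline-inference sketch of how these values map onto vLLM's Python engine arguments (the actual deployment launches the OpenAI-compatible server via `run.sh`; the prompt and sampling values here are illustrative):
```python
from vllm import LLM, SamplingParams

# Engine arguments mirroring the runtime flags listed above.
llm = LLM(
    model="sarvamai/sarvam-30b",
    tensor_parallel_size=4,       # shard the model across 4 GPUs
    gpu_memory_utilization=0.85,  # leave headroom below vLLM's 0.90 default
    max_model_len=1024,           # cap context length to bound KV-cache growth
    max_num_seqs=4,               # limit concurrently batched sequences
)

params = SamplingParams(temperature=0.2, max_tokens=200)
outputs = llm.generate(["Explain AI system resilience clearly."], params)
print(outputs[0].outputs[0].text)
```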
### Impact:
- Reduced KV-cache pressure
- Improved GPU memory utilization
- Stable multi-GPU execution
- Consistent latency performance
---
### 4.2 Output Governance Pipeline
A deterministic **postprocessing layer** (`postprocess.py`) is introduced to control model outputs.
This module:
- Removes internal reasoning traces (`<think>...</think>`)
- Extracts final answer segments
- Reformats output into structured bullet points
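A minimal illustrative sketch of such a cleaning step is shown below (the shipped `postprocess.py` may apply different rules; the `Final Answer:` marker is a hypothetical example of an answer delimiter):
```python
import re

def postprocess(raw: str) -> str:
    """Strip reasoning traces and reformat the answer as bullet points."""
    # Remove <think>...</think> blocks, including an unterminated trailing one.
    text = re.sub(r"<think>.*?(?:</think>|$)", "", raw, flags=re.DOTALL)
    # Keep only the final answer segment if an explicit marker is present (assumed marker).
    if "Final Answer:" in text:
        text = text.split("Final Answer:", 1)[1]
    # Reformat non-empty lines into a deterministic bullet list.
    lines = [ln.strip(" -*") for ln in text.splitlines() if ln.strip()]
    return "\n".join(f"- {ln}" for ln in lines)
```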
### Impact:
- Clean, production-ready responses
- Improved readability
- Deterministic output format
---
## 5. Compression Strategies Evaluated
The following approaches were tested and rejected:
### Quantization (AWQ / GPTQ)
- Compatibility issues with MoE architecture
- Output instability and degradation
### Pruning
- Severe degradation in generation quality
- Early stopping and incomplete outputs
### Distillation
- Not feasible due to dataset and compute constraints
---
### Final Decision
Runtime optimization was selected because it:
- Preserves original model accuracy
- Avoids architectural incompatibility
- Provides stable and reproducible results
---
## 6. System Architecture
User Input → vLLM Inference Engine → Raw Model Output → Postprocessing Layer → Clean Structured Output
This forms an **Inference Optimization + Output Governance Pipeline**.
---
## 7. Performance Results
| Metric | Observation |
|------|------------|
| Latency | ~0.4s – 1.5s |
| GPU Memory | ~8% reduction |
| Stability | Consistent across runs |
| Output Quality | Clean and structured after postprocessing |
---
## 8. How to Run
Start the server with the provided script, which launches vLLM with the optimized runtime configuration described in Section 4.1:
```bash
bash run.sh
```
---
## 9. API Example
```bash
curl -s http://127.0.0.1:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "sarvam-30b",
"messages": [{"role": "user", "content": "Explain AI system resilience clearly."}],
"max_tokens": 200,
"temperature": 0.2
}'
```
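The same request can be issued from Python and combined with the output-governance step from Section 4.2 (this sketch uses the `requests` library and assumes `postprocess.py` exposes a `postprocess()` function):
```python
import requests
from postprocess import postprocess  # output-governance step from Section 4.2 (assumed function name)

payload = {
    "model": "sarvam-30b",
    "messages": [{"role": "user", "content": "Explain AI system resilience clearly."}],
    "max_tokens": 200,
    "temperature": 0.2,
}
resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json=payload,
    timeout=60,
)
raw = resp.json()["choices"][0]["message"]["content"]
print(postprocess(raw))  # clean, structured output
```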
---
## 10. Files Included
* `run.sh` → server startup script
* `vllm_config.yaml` → optimized runtime configuration
* `postprocess.py` → output cleaning pipeline
* `examples/` → raw vs. cleaned outputs
* `models/` → Sarvam-30B weights
---
## 11. Practical Impact
This system is designed for real-world AI deployments where:
* Large models must operate under GPU constraints
* Outputs must be clean and user-facing
* Internal reasoning traces are not acceptable
The approach demonstrates:
* Runtime optimization instead of weight modification
* Output governance instead of prompt engineering
* System-level control instead of model-level changes
---
## 12. Key Insight
System-level optimization can outperform traditional compression techniques in maintaining output quality while improving efficiency, because it:
* Preserves model accuracy (weights remain untouched)
* Improves inference efficiency
* Ensures stable, reproducible deployment
---
## 13. Conclusion
This work delivers a **deployment-ready, reproducible, and efficient inference system** for large-scale MoE models.
It demonstrates that combining **runtime optimization with output control** provides a practical and scalable alternative to conventional model compression approaches.
---
## 14. Limitations
* Does not reduce model size (weights remain unchanged)
* Requires multi-GPU setup
* Postprocessing is rule-based (not learned)
---
## 15. Future Work
* MoE-aware quantization techniques
* KV-cache compression methods
* Adaptive decoding strategies
* Edge-device compatible distillation
---
## 16. Real-World Relevance
As outlined in Section 11, the target deployments are those where large language models must operate under strict GPU constraints and outputs must be clean, user-facing, and free of internal reasoning traces. More broadly, the solution demonstrates a shift from model-centric optimization to system-centric optimization, which is critical for scaling AI systems in real-world environments.
---