---
license: apache-2.0
language:
  - en
pipeline_tag: text-generation
library_name: transformers
base_model: sarvamai/sarvam-30b
model_type: causal-lm
tags:
  - llm
  - mixture-of-experts
  - vllm
  - inference-optimization
  - runtime-optimization
  - efficient-ai
  - production-ai
---

Sarvam-30B Runtime-Optimized Inference System

1. Overview

This project presents a runtime-optimized deployment system for Sarvam-30B, a large-scale Mixture-of-Experts (MoE) language model, using vLLM.

The objective is to improve inference efficiency, stability, and output quality without modifying the model weights, making it suitable for real-world deployment scenarios.

This work focuses on system-level optimization rather than model-level compression, demonstrating a practical and reliable approach to serving large LLMs in resource-constrained environments.


2. Base Model

  • Model: sarvamai/sarvam-30b
  • Architecture: Mixture-of-Experts (MoE)
  • Task: Text Generation
  • Inference Engine: vLLM
  • Hardware: Multi-GPU (Tensor Parallelism)

3. Problem Statement

During experimentation, two critical challenges were identified:

3.1 Reasoning Leakage

The model generates internal reasoning traces wrapped in <think>...</think> tags, which:

  • Reduce readability
  • Break structured output requirements
  • Affect downstream usability

3.2 High Resource Consumption

Due to the MoE architecture:

  • High GPU memory utilization (~45 GB per GPU at baseline)
  • Large KV-cache growth with sequence length (see the sizing sketch after this list)
  • Reduced inference efficiency under default settings
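
For intuition, here is a rough, back-of-envelope sizing sketch of the KV cache. The layer, head, and dimension values are illustrative placeholders, not Sarvam-30B's actual configuration; only the formula itself is standard.

# Approximate KV-cache footprint: 2 (keys and values) x layers x KV heads
# x head dimension x bytes per element, per token per sequence.
# All model dimensions below are placeholder assumptions for illustration.
num_layers   = 48
num_kv_heads = 8
head_dim     = 128
bytes_fp16   = 2

def kv_cache_bytes(seq_len: int, num_seqs: int) -> int:
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_fp16
    return per_token * seq_len * num_seqs

# Bounding the sequence length and the number of concurrent sequences
# caps the cache, which is what the runtime settings in Section 4.1 exploit.
print(f"{kv_cache_bytes(1024, 4) / 1e9:.2f} GB upper bound")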

4. Approach

4.1 Inference-Time Optimization (Core Contribution)

Instead of modifying the weights (quantization or pruning), this system applies runtime-level optimization through the following vLLM settings (see the sketch after this list):

  • gpu-memory-utilization = 0.85
  • max-model-len = 1024
  • max-num-seqs = 4
  • tensor-parallel-size = 4
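
For reference, a minimal offline-inference sketch that applies the same runtime parameters through vLLM's Python API; the deployed system starts a server via run.sh, so treat this as an equivalent illustration rather than the production entry point.

from vllm import LLM, SamplingParams

# Same runtime settings as the serving configuration listed above.
llm = LLM(
    model="sarvamai/sarvam-30b",
    tensor_parallel_size=4,       # shard the model across 4 GPUs
    gpu_memory_utilization=0.85,  # leave headroom for activations
    max_model_len=1024,           # cap context length to bound KV-cache growth
    max_num_seqs=4,               # limit concurrent sequences per scheduling step
)

params = SamplingParams(temperature=0.2, max_tokens=200)
outputs = llm.generate(["Explain AI system resilience clearly."], params)
print(outputs[0].outputs[0].text)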

Impact:

  • Reduced KV-cache pressure
  • Improved GPU memory utilization
  • Stable multi-GPU execution
  • Consistent latency performance

4.2 Output Governance Pipeline

A deterministic postprocessing layer (postprocess.py) is introduced to control model outputs.

This module (see the sketch after this list):

  • Removes internal reasoning traces (<think>...</think>)
  • Extracts final answer segments
  • Reformats output into structured bullet points
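
A minimal sketch of the kind of logic postprocess.py applies; the function name and the sentence-to-bullet heuristic are illustrative assumptions rather than the exact shipped code.

import re

THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)

def clean_output(raw: str) -> str:
    """Strip reasoning traces and reformat the remainder into bullet points."""
    # Removing the <think>...</think> trace leaves the final answer segment.
    text = THINK_BLOCK.sub("", raw).strip()
    # Reformat the answer into structured bullet points, one per sentence.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return "\n".join(f"- {s}" for s in sentences)

if __name__ == "__main__":
    raw = "<think>internal reasoning</think>Resilience means graceful degradation. It also means fast recovery."
    print(clean_output(raw))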

Impact:

  • Clean, production-ready responses
  • Improved readability
  • Deterministic output format

5. Compression Strategies Evaluated

The following approaches were tested and rejected:

Quantization (AWQ / GPTQ)

  • Compatibility issues with MoE architecture
  • Output instability and degradation

Pruning

  • Severe degradation in generation quality
  • Early stopping and incomplete outputs

Distillation

  • Not feasible due to dataset and compute constraints

Final Decision

Runtime optimization was selected because:

  • Preserves original model accuracy
  • Avoids architectural incompatibility
  • Provides stable and reproducible results

6. System Architecture

User Input
→ vLLM Inference Engine
→ Raw Model Output
→ Postprocessing Layer
→ Clean Structured Output

This forms an Inference Optimization + Output Governance Pipeline.


7. Performance Results

  • Latency: ~0.4 s – 1.5 s
  • GPU memory: ~8% reduction
  • Stability: Consistent across runs
  • Output quality: Clean and structured after postprocessing

8. How to Run

bash run.sh

9. API Example

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sarvam-30b",
    "messages": [{"role": "user", "content": "Explain AI system resilience clearly."}],
    "max_tokens": 200,
    "temperature": 0.2
  }'
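
The same request from Python, chained with the output-governance step. The requests usage targets the standard OpenAI-compatible route exposed by vLLM; importing clean_output from postprocess assumes the illustrative interface sketched in Section 4.2.

import requests
from postprocess import clean_output  # assumed entry point of postprocess.py

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "sarvam-30b",
        "messages": [{"role": "user", "content": "Explain AI system resilience clearly."}],
        "max_tokens": 200,
        "temperature": 0.2,
    },
    timeout=60,
)
raw = resp.json()["choices"][0]["message"]["content"]
print(clean_output(raw))  # strip reasoning traces, return structured bullets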

10. Files Included

  • run.sh → server startup script
  • vllm_config.yaml → optimized configuration
  • postprocess.py → output cleaning pipeline
  • examples/ → raw vs. cleaned outputs
  • models/ → Sarvam-30B weights

11. Practical Impact

This system is designed for real-world AI deployments where:

  • Large models must operate under GPU constraints
  • Outputs must be clean and user-facing
  • Internal reasoning traces are not acceptable

The approach demonstrates:

  • Runtime optimization instead of weight modification
  • Output governance instead of prompt engineering
  • System-level control instead of model-level changes

12. Key Insight

System-level optimization can outperform traditional model compression techniques by:

  • Preserving model accuracy and output quality
  • Improving inference efficiency
  • Ensuring stable deployment

13. Conclusion

This work delivers a deployment-ready, reproducible, and efficient inference system for large-scale MoE models.

It demonstrates that combining runtime optimization with output control provides a practical and scalable alternative to conventional model compression approaches.


14. Limitations

  • Does not reduce model size (weights remain unchanged)
  • Requires multi-GPU setup
  • Postprocessing is rule-based (not learned)

15. Future Work

  • MoE-aware quantization techniques
  • KV-cache compression methods
  • Adaptive decoding strategies
  • Edge-device compatible distillation

16. Real-World Relevance

The deployment constraints outlined in Section 11 (strict GPU budgets, clean user-facing outputs, and no exposed reasoning traces) are typical of production systems. For such scenarios, the solution demonstrates a shift from model-centric optimization to system-centric optimization, which is critical for scaling AI systems in real-world environments.