---
license: apache-2.0
tags:
- ross
- llm-serving
- simulation
- xgboost
- performance-prediction
---
# ROSS Stage-Wise Regression Models
Pre-trained XGBoost regression models for [ROSS](https://github.com/scitix/ross) -- a dual-plane simulator for LLM serving systems.
These models power ROSS's **data plane**: given a batch descriptor (request IDs, sequence lengths, model architecture features, and platform performance features), they predict per-iteration latency by decomposing each serving iteration into **pre-forward**, **forward**, and **post-forward** stages, explicitly capturing CPU-GPU pipeline overlap.
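To make the effect of CPU-GPU overlap concrete, the sketch below turns three per-stage predictions into a lower and an upper bound on iteration latency. It is not ROSS's actual combination rule, which this README does not describe. It assumes that pre-forward and post-forward are CPU work and forward is GPU work; `StageLatency`, `iteration_latency_bounds`, and the example numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageLatency:
    """Hypothetical container for per-stage predictions, in milliseconds."""
    pre_forward_ms: float
    forward_ms: float
    post_forward_ms: float


def iteration_latency_bounds(s: StageLatency) -> tuple[float, float]:
    """Return (lower, upper) bounds on one iteration's latency.

    Assumption: pre-/post-forward run on the CPU, forward runs on the GPU.
    upper: no overlap, the three stages run back to back.
    lower: perfect overlap, the slower of CPU and GPU work hides the other.
    """
    cpu_ms = s.pre_forward_ms + s.post_forward_ms
    upper = cpu_ms + s.forward_ms
    lower = max(cpu_ms, s.forward_ms)
    return lower, upper


# Made-up stage latencies, for illustration only.
lower, upper = iteration_latency_bounds(StageLatency(1.5, 10.0, 0.5))
print(f"iteration latency between {lower:.1f} and {upper:.1f} ms")
```

The gap between the two bounds is the part of the iteration that overlap can hide, which is why modeling the stages separately matters.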
## Model Overview
| Component | Description |
|-----------|-------------|
| Algorithm | XGBoost regressor |
| Training data | Sparse profiling traces collected on NVIDIA H200 and B200 GPUs |
| Prediction target | Per-stage iteration latency (ms) for each of pre-forward, forward, and post-forward |
| Input features | Batch shape, model architecture features, platform performance features |
| Supported frameworks | vLLM, SGLang |
## Directory Structure
```
sgl/                                          # SGLang backend models
  dense/
    prefill/
      pre_forward_trained_models/xgboost_model/
      forward_trained_models/xgboost_model/
    decode/
      pre_forward_trained_models/xgboost_model/
      forward_trained_models/xgboost_model/
      post_forward_trained_models/xgboost_model/
  moe_foward/
    prefill/
      forward_trained_models/xgboost_model/
    decode/
      forward_trained_models/xgboost_model/
vllm/                                         # vLLM backend models
  dense/
    pre_forward_trained_models/xgboost_model/
    forward_trained_models/xgboost_model/
    post_forward_trained_models/xgboost_model/
  moe/
    forward_trained_models/xgboost_model/
```
Each `xgboost_model/` directory contains:
- `model.json` -- the serialized XGBoost model
- `model_metadata.json` -- feature names, training metadata
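After downloading, a quick sanity check is to confirm that every `xgboost_model/` directory holds both files. The following is a minimal sketch using only the Python standard library; the helper names `find_model_dirs` and `read_metadata` are hypothetical and not part of ROSS, and the keys inside `model_metadata.json` are not documented here, so the script only lists them.

```python
import json
from pathlib import Path


def find_model_dirs(modeling_dir):
    """Return every xgboost_model/ directory that holds both expected files."""
    root = Path(modeling_dir)
    found = []
    for model_file in sorted(root.rglob("xgboost_model/model.json")):
        model_dir = model_file.parent
        if (model_dir / "model_metadata.json").is_file():
            found.append(model_dir)
    return found


def read_metadata(model_dir):
    """Load model_metadata.json; its exact keys are not assumed here."""
    with open(Path(model_dir) / "model_metadata.json", encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    # "modeling" is the --local-dir used in the download step of this README.
    for d in find_model_dirs("modeling"):
        print(d, sorted(read_metadata(d)))
```

The `model.json` files themselves should be loadable with XGBoost's standard `Booster.load_model`, although ROSS normally handles that step internally.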
## Supported Platforms
| GPU | Status |
|-----|--------|
| NVIDIA H200 | Pre-trained models included |
| NVIDIA B200 | Pre-trained models included |
New platforms can be added by running the profiling scripts in the ROSS repository's `collector/` directory.
## Validated LLM Families
| Family | Variants |
|--------|----------|
| Llama-3.1 | 8B, 70B |
| Qwen2.5 | 72B-Instruct |
| Qwen3 | 32B, 30B-A3B (MoE), 235B-A22B (MoE), QwQ 32B |
| DeepSeek-V3 | 671B (MoE) |
| gpt-oss | 20b (MoE), 120b (MoE) |
The stage-wise regressor takes model configuration features as input instead of relying on per-model kernel calibration, so new models within the supported families generally work out of the box.
## Usage
### 1. Download
```bash
# Using huggingface-cli
huggingface-cli download CharlesCAOO/ross --local-dir modeling
```
### 2. Point ROSS to the downloaded models
In your ROSS config JSON:
```json
{
"modeling_dir": "/path/to/modeling",
...
}
```
Or via CLI:
```bash
python ross/ross_predict.py --modeling-dir /path/to/modeling --config my_config.json
```
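If you prefer the config-file route, the `modeling_dir` key can be set from a script instead of by hand. This is a small sketch using only the standard library; `set_modeling_dir` is a hypothetical helper, and the only config key it assumes is the `modeling_dir` key shown above. All other keys are left untouched.

```python
import json
from pathlib import Path


def set_modeling_dir(config_path, modeling_dir):
    """Write an absolute "modeling_dir" into a ROSS config JSON, keeping other keys."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Store an absolute path so the config works from any working directory.
    config["modeling_dir"] = str(Path(modeling_dir).resolve())
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `set_modeling_dir("my_config.json", "modeling")` updates the config used in the commands in this section.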
### 3. Run simulation
```bash
python ross/ross_predict.py --config my_config.json --record-path results.csv
```
ROSS achieves median prediction errors below 6% for end-to-end (E2E) latency and time per output token (TPOT) across the validated models and platforms, while running more than 11x faster than on-hardware benchmarking.
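To check prediction accuracy on your own workload, you can compare simulated values against measured ones. The sketch below assumes the error is defined as the absolute percentage error per run, which this README does not spell out; `median_abs_pct_error` is a hypothetical helper and the numbers are made up for illustration.

```python
from statistics import median


def median_abs_pct_error(predicted, measured):
    """Median of |predicted - measured| / |measured|, expressed in percent."""
    if not predicted or len(predicted) != len(measured):
        raise ValueError("need two non-empty sequences of equal length")
    errors = [100.0 * abs(p - m) / abs(m) for p, m in zip(predicted, measured)]
    return median(errors)


# Made-up E2E latencies in milliseconds: simulated vs. measured on hardware.
simulated = [95.0, 210.0, 48.0]
on_hardware = [100.0, 200.0, 50.0]
print(f"median error: {median_abs_pct_error(simulated, on_hardware):.1f}%")
```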
## License
Apache 2.0