| --- |
| license: apache-2.0 |
| tags: |
| - ross |
| - llm-serving |
| - simulation |
| - xgboost |
| - performance-prediction |
| --- |
| |
| # ROSS Stage-Wise Regression Models |
|
|
| Pre-trained XGBoost regression models for [ROSS](https://github.com/scitix/ross), a dual-plane simulator for LLM serving systems. |
|
|
| These models power ROSS's **data plane**. Given a batch descriptor (request IDs, sequence lengths, model architecture features, and platform performance features), they predict per-iteration latency by decomposing each serving iteration into **pre-forward**, **forward**, and **post-forward** stages, explicitly capturing CPU-GPU pipeline overlap. |
|
|
| ## Model Overview |
|
|
| | Component | Description | |
| |-----------|-------------| |
| | Algorithm | XGBoost regressor | |
| | Training data | Sparse profiling traces collected on NVIDIA H200 and B200 GPUs | |
| | Prediction target | Per-stage iteration latency (ms) for each of pre-forward, forward, and post-forward | |
| | Input features | Batch shape, model architecture features, platform performance features | |
| | Supported frameworks | vLLM, SGLang | |
|
|
| ## Directory Structure |
|
|
| ``` |
| sgl/                          # SGLang backend models |
|   dense/ |
|     prefill/ |
|       pre_forward_trained_models/xgboost_model/ |
|       forward_trained_models/xgboost_model/ |
|     decode/ |
|       pre_forward_trained_models/xgboost_model/ |
|       forward_trained_models/xgboost_model/ |
|       post_forward_trained_models/xgboost_model/ |
|   moe_foward/ |
|     prefill/ |
|       forward_trained_models/xgboost_model/ |
|     decode/ |
|       forward_trained_models/xgboost_model/ |
| vllm/                         # vLLM backend models |
|   dense/ |
|     pre_forward_trained_models/xgboost_model/ |
|     forward_trained_models/xgboost_model/ |
|     post_forward_trained_models/xgboost_model/ |
|   moe/ |
|     forward_trained_models/xgboost_model/ |
| ``` |
|
|
| Each `xgboost_model/` directory contains: |
| - `model.json`: the serialized XGBoost model |
| - `model_metadata.json`: feature names and training metadata |
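
After downloading, it can be useful to confirm that every `xgboost_model/` directory actually holds both files listed above. The following is a minimal sketch using only the Python standard library; the helper name `find_model_dirs` is hypothetical and is not part of ROSS.

```python
from pathlib import Path

# File names taken from the list above.
REQUIRED_FILES = ("model.json", "model_metadata.json")


def find_model_dirs(root):
    """Return (complete, incomplete) lists of xgboost_model/ directories under root.

    A directory counts as complete when it contains every file in REQUIRED_FILES.
    This helper is illustrative only; it is not part of the ROSS code base.
    """
    complete, incomplete = [], []
    for model_dir in sorted(Path(root).rglob("xgboost_model")):
        if not model_dir.is_dir():
            continue
        missing = [name for name in REQUIRED_FILES
                   if not (model_dir / name).is_file()]
        if missing:
            incomplete.append(model_dir)
        else:
            complete.append(model_dir)
    return complete, incomplete
```

Calling `find_model_dirs("modeling")` on the downloaded directory returns two lists of paths; a non-empty second list points at directories where the download may be incomplete.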
|
|
| ## Supported Platforms |
|
|
| | GPU | Status | |
| |-----|--------| |
| | NVIDIA H200 | Pre-trained models included | |
| | NVIDIA B200 | Pre-trained models included | |
|
|
| New platforms can be added by running the profiling scripts in the ROSS repository's `collector/` directory. |
|
|
| ## Validated LLM Families |
|
|
| | Family | Variants | |
| |--------|----------| |
| | Llama-3.1 | 8B, 70B | |
| | Qwen2.5 | 72B-Instruct | |
| | Qwen3 | 32B, 30B-A3B (MoE), 235B-A22B (MoE), QwQ 32B | |
| | DeepSeek-V3 | 671B (MoE) | |
| | gpt-oss | 20b (MoE), 120b (MoE) | |
|
|
| The stage-wise regressor takes model configuration features as input instead of relying on per-model kernel calibration, so new models within the supported families generally work out of the box. |
|
|
| ## Usage |
|
|
| ### 1. Download |
|
|
| ```bash |
| # Using huggingface-cli |
| huggingface-cli download CharlesCAOO/ross --local-dir modeling |
| ``` |
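
The same download can also be done from Python. This is a sketch that assumes the `huggingface_hub` package is installed; it uses `snapshot_download` with the repository ID from the command above, and the wrapper name `download_models` is hypothetical.

```python
def download_models(local_dir="modeling", repo_id="CharlesCAOO/ross"):
    """Download the model repository into local_dir and return the local path.

    Assumes `pip install huggingface_hub`; the import is deferred so that
    defining this function does not require the package.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_models()` needs network access and writes the files under `modeling/`, matching the `--local-dir` value used with the CLI.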
|
|
| ### 2. Point ROSS to the downloaded models |
|
|
| In your ROSS config JSON: |
|
|
| ```json |
| { |
| "modeling_dir": "/path/to/modeling", |
| ... |
| } |
| ``` |
|
|
| Or via CLI: |
|
|
| ```bash |
| python ross/ross_predict.py --modeling-dir /path/to/modeling --config my_config.json |
| ``` |
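
If the config file is generated or edited by a script, the `modeling_dir` key can be set programmatically. The sketch below assumes the config is a plain JSON object and touches only that one key; the helper name `set_modeling_dir` is hypothetical and not part of ROSS.

```python
import json
from pathlib import Path


def set_modeling_dir(config_path, modeling_dir):
    """Set `modeling_dir` in a ROSS config JSON file, leaving other keys untouched.

    The path is stored as an absolute path so the config works from any
    working directory. Returns the updated config as a dict.
    """
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["modeling_dir"] = str(Path(modeling_dir).expanduser().resolve())
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For example, `set_modeling_dir("my_config.json", "modeling")` rewrites `my_config.json` in place with the resolved path of the downloaded directory.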
|
|
| ### 3. Run simulation |
|
|
| ```bash |
| python ross/ross_predict.py --config my_config.json --record-path results.csv |
| ``` |
|
|
| ROSS achieves median prediction errors below 6% for end-to-end (E2E) latency and time per output token (TPOT) across the validated models and platforms, while running more than 11x faster than on-hardware benchmarking. |
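
To compare simulated results against your own hardware measurements, a median error of this kind can be computed as follows. This is a sketch under one assumption: error is defined here as the absolute relative difference between predicted and measured values, which may differ from the exact metric used in the ROSS evaluation. The function name is hypothetical.

```python
from statistics import median


def median_relative_error(predicted, measured):
    """Median of |predicted - measured| / measured, expressed as a percentage.

    Both inputs are equal-length sequences of positive numbers, for example
    per-run E2E latencies or TPOT values in the same unit.
    """
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured must have the same length")
    if not predicted:
        raise ValueError("need at least one sample")
    errors = [100.0 * abs(p - m) / m for p, m in zip(predicted, measured)]
    return median(errors)
```

The median is used rather than the mean so that a few outlier runs do not dominate the summary.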
|
|
|
|
| ## License |
|
|
| Apache 2.0 |
|
|