
	---
	license: apache-2.0
	tags:
	- ross
	- llm-serving
	- simulation
	- xgboost
	- performance-prediction
	---

	# ROSS Stage-Wise Regression Models

	Pre-trained XGBoost regression models for [ROSS](https://github.com/scitix/ross) -- a dual-plane simulator for LLM serving systems.

	These models power ROSS's data plane: given a batch descriptor (request IDs, sequence lengths, model architecture features, and platform performance features), they predict per-iteration latency by decomposing each serving iteration into pre-forward, forward, and post-forward stages, explicitly capturing CPU-GPU pipeline overlap.
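To see why decomposing an iteration into stages matters, here is a toy sketch comparing a fully serial iteration with one where CPU-side work is hidden behind the GPU forward pass. The `StageLatency` class and both functions are hypothetical names, and the `max(...)` overlap rule is an illustrative assumption, not the composition rule ROSS actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageLatency:
    """Per-stage latency for one serving iteration, in milliseconds."""
    pre_forward_ms: float
    forward_ms: float
    post_forward_ms: float


def serial_iteration_ms(stages: StageLatency) -> float:
    """No overlap: the three stages run back to back."""
    return stages.pre_forward_ms + stages.forward_ms + stages.post_forward_ms


def overlapped_iteration_ms(stages: StageLatency) -> float:
    """Toy overlap model (assumption): CPU pre/post work runs concurrently
    with the GPU forward pass, so the slower of the two sides dominates."""
    cpu_side_ms = stages.pre_forward_ms + stages.post_forward_ms
    return max(stages.forward_ms, cpu_side_ms)


stages = StageLatency(pre_forward_ms=1.5, forward_ms=12.0, post_forward_ms=0.5)
print(serial_iteration_ms(stages))      # 14.0
print(overlapped_iteration_ms(stages))  # 12.0
```

Under this simplified rule, a single end-to-end regressor could not tell the two cases apart, whereas per-stage predictions can be recombined according to how the serving framework actually schedules CPU and GPU work.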

	## Model Overview

	| Component | Description |
	|-----------|-------------|
	| Algorithm | XGBoost regressor |
	| Training data | Sparse profiling traces collected on NVIDIA H200 and B200 GPUs |
	| Prediction target | Per-stage iteration latency (ms) for each of pre-forward, forward, and post-forward |
	| Input features | Batch shape, model architecture features, platform performance features |
	| Supported frameworks | vLLM, SGLang |

	## Directory Structure

	```
	sgl/                                              # SGLang backend models
	  dense/
	    prefill/
	      pre_forward_trained_models/xgboost_model/
	      forward_trained_models/xgboost_model/
	    decode/
	      pre_forward_trained_models/xgboost_model/
	      forward_trained_models/xgboost_model/
	      post_forward_trained_models/xgboost_model/
	  moe_foward/
	    prefill/
	      forward_trained_models/xgboost_model/
	    decode/
	      forward_trained_models/xgboost_model/
	vllm/                                             # vLLM backend models
	  dense/
	    pre_forward_trained_models/xgboost_model/
	    forward_trained_models/xgboost_model/
	    post_forward_trained_models/xgboost_model/
	  moe/
	    forward_trained_models/xgboost_model/
	```

	Each `xgboost_model/` directory contains:
	- `model.json` -- the serialized XGBoost model
	- `model_metadata.json` -- feature names, training metadata
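As a quick sanity check after downloading, the layout above can be walked with a short script. This is a minimal sketch: `find_stage_models` and `load_booster` are hypothetical helper names, the metadata is treated as opaque JSON because its exact keys are not documented here, and `load_booster` assumes the `xgboost` Python package is installed.

```python
import json
from pathlib import Path


def find_stage_models(modeling_dir):
    """Map each xgboost_model/ directory (path relative to modeling_dir)
    to its parsed model_metadata.json, or None if the metadata is missing."""
    root = Path(modeling_dir)
    found = {}
    for model_file in sorted(root.rglob("model.json")):
        model_dir = model_file.parent
        if model_dir.name != "xgboost_model":
            continue  # skip unrelated model.json files
        meta_file = model_dir / "model_metadata.json"
        metadata = json.loads(meta_file.read_text()) if meta_file.is_file() else None
        found[model_dir.relative_to(root).as_posix()] = metadata
    return found


def load_booster(model_dir):
    """Load the serialized model with XGBoost's native loader.
    Imported lazily so directory discovery works without xgboost installed."""
    import xgboost as xgb

    booster = xgb.Booster()
    booster.load_model(str(Path(model_dir) / "model.json"))
    return booster
```

Calling `find_stage_models("modeling")` on a complete download should return one entry per `xgboost_model/` directory listed in the tree above, which makes a missing or partially downloaded stage easy to spot.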

	## Supported Platforms

	| GPU | Status |
	|-----|--------|
	| NVIDIA H200 | Pre-trained models included |
	| NVIDIA B200 | Pre-trained models included |

	New platforms can be added by running the profiling scripts in the ROSS repository's `collector/` directory.

	## Validated LLM Families

	| Family | Variants |
	|--------|----------|
	| Llama-3.1 | 8B, 70B |
	| Qwen2.5 | 72B-Instruct |
	| Qwen3 | 32B, 30B-A3B (MoE), 235B-A22B (MoE), QwQ 32B |
	| DeepSeek-V3 | 671B (MoE) |
	| gpt-oss | 20b (MoE), 120b (MoE) |

	The stage-wise regressor takes model configuration features as input instead of relying on per-model kernel calibration, so new models within the supported families generally work out of the box.

	## Usage

	### 1. Download

	```bash
	# Using huggingface-cli
	huggingface-cli download CharlesCAOO/ross --local-dir modeling
	```

	### 2. Point ROSS to the downloaded models

	In your ROSS config JSON:

	```json
	{
	  "modeling_dir": "/path/to/modeling",
	  ...
	}
	```

	Or via CLI:

	```bash
	python ross/ross_predict.py --modeling-dir /path/to/modeling --config my_config.json
	```

	### 3. Run simulation

	```bash
	python ross/ross_predict.py --config my_config.json --record-path results.csv
	```

	ROSS achieves median prediction errors below 6% for E2E latency and TPOT across the validated models and platforms, while sustaining >11x simulation speedup over on-hardware benchmarking.
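For readers who want to compute a comparable figure on their own runs, the sketch below calculates a median relative error between predicted and measured latencies. The function name and the sample values are hypothetical, and the exact error metric ROSS reports is not defined in this card, so treat the formula as an assumption.

```python
from statistics import median


def median_relative_error_pct(predicted, measured):
    """Median of |predicted - measured| / measured, expressed in percent."""
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured must have the same length")
    if not predicted:
        raise ValueError("at least one sample is required")
    errors = [100.0 * abs(p - m) / m for p, m in zip(predicted, measured)]
    return median(errors)


# Made-up latencies in milliseconds, for illustration only.
predicted_ms = [95.0, 210.0, 330.0]
measured_ms = [100.0, 200.0, 300.0]
print(median_relative_error_pct(predicted_ms, measured_ms))  # 5.0
```

The median is less sensitive than the mean to a few badly predicted outlier requests, which is one reason it is a common summary for latency predictors.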


	## License

	Apache 2.0