# LLM-Prof Dataset Description
Dataset link: https://huggingface.co/buckets/zk521/LLM-Prof-Dataset/
## Overview
The LLM-Prof dataset contains 207 GPU kernel traces collected for the paper "LLM-Prof: A Hierarchical Cross-Stack Performance Profiling Framework for Production LLM Inference Services". The dataset supports cross-layer analysis of LLM inference services at the service, model, and operator levels.
The traces cover three inference frameworks:
| Framework | Cases | Source |
|---|---:|---|
| RTP-LLM | 51 | Production MaaS services with real runtime telemetry |
| SGLang | 73 | Controlled benchmark traces across models, GPUs, and workloads |
| vLLM | 83 | Controlled benchmark traces across models, GPUs, and workloads |
| Total | 207 | Production plus controlled benchmark traces |
## Dataset Files
The released dataset is organized as three compressed archives:
| Archive | Content |
|---|---|
| `original_trace_RTP-LLM_51.tar.gz` | RTP-LLM production traces and related runtime metadata |
| `original_trace_SGLang_73.tar.gz` | SGLang benchmark traces |
| `original_trace_vLLM_83.tar.gz` | vLLM benchmark traces |
All traces follow the Chrome Trace Event JSON format and are stored either as gzip-compressed `.json.gz` files or as framework-specific trace files. A typical trace contains GPU kernels, memory copies, CUDA runtime calls, CUDA driver calls, synchronization events, and related metadata.
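The following is a minimal sketch for reading traces directly out of one of the archives without extracting it to disk. It assumes that trace members end in `.json.gz` or `.json` and skips everything else (including any framework-specific trace files). The internal directory layout of the archives is not documented here, so members are matched by suffix only. `iter_traces` is a hypothetical helper, not part of the dataset release.

```python
import gzip
import json
import tarfile
from typing import Iterator, Tuple


def iter_traces(archive_path: str) -> Iterator[Tuple[str, dict]]:
    """Yield (member_name, parsed_trace) for each JSON trace in a .tar.gz archive.

    Assumption: traces are stored as `.json.gz` or plain `.json` members.
    Other members (metadata, framework-specific files) are skipped.
    """
    with tarfile.open(archive_path, mode="r:gz") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            name = member.name
            if not (name.endswith(".json.gz") or name.endswith(".json")):
                continue
            fileobj = tar.extractfile(member)
            if fileobj is None:
                continue
            raw = fileobj.read()
            # Inner gzip layer for `.json.gz` members; plain JSON is used as-is.
            if name.endswith(".gz"):
                raw = gzip.decompress(raw)
            yield name, json.loads(raw)
```

For example, `for name, trace in iter_traces("original_trace_SGLang_73.tar.gz"): ...` visits each SGLang trace in turn. Each member is fully decompressed into memory, so large traces may need a streaming JSON parser instead.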
## Framework-Specific Data
### RTP-LLM
The RTP-LLM subset contains traces from real production services. Each case may include:
- GPU kernel trace files collected from production inference execution.
- `prefill_metrics_with_config.txt`, which records model configuration, hardware information, QPS time series, GPU utilization, token-size information, and service metadata.
- Profiling metadata and optional analysis outputs.
This subset reflects real deployment diversity, including varying business workloads, request rates, token sizes, GPU utilization, model sizes, and hardware configurations.
### SGLang and vLLM
The SGLang and vLLM subsets are controlled benchmark traces. They systematically vary:
- Model family and model size.
- Batch size.
- Input length and output length.
- GPU type.
- Inference framework.
These traces provide controlled comparisons across frameworks and hardware platforms, complementing the production diversity of the RTP-LLM subset.
## Hardware and Model Coverage
The dataset covers six NVIDIA GPU types:
- A10
- A100
- A800
- H20
- H800
- L20
The evaluated services include 21 model variants from three major model families: Qwen, LLaMA, and the BERT family. Model sizes range from sub-billion-parameter models to 70B-scale deployments and cover both dense and MoE-style architectures.
## Trace Format
Each trace follows the Chrome Trace Event format. The top-level structure usually contains:
```json
{
"schemaVersion": 1,
"deviceProperties": [],
"traceEvents": [],
"traceName": "...",
"displayTimeUnit": "ns"
}
```
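Under the Chrome Trace Event format specification, `ts` and `dur` are expressed in microseconds. `displayTimeUnit` only controls how a viewer renders them. The minimal sketch below computes the time span a trace covers, in microseconds. It assumes that the activities of interest are complete events (`"ph": "X"`), which is how Kineto-style GPU traces typically record kernels, copies, and API calls. `trace_span_us` is a hypothetical helper.

```python
from typing import Iterable, Optional, Tuple


def trace_span_us(events: Iterable[dict]) -> Optional[Tuple[float, float]]:
    """Return (start_us, end_us) covered by complete ("ph": "X") events.

    Per the Chrome Trace Event format, `ts` and `dur` are in microseconds,
    independent of `displayTimeUnit`. Metadata, flow, and instant events
    are ignored. Returns None if there are no complete events.
    """
    start: Optional[float] = None
    end: Optional[float] = None
    for ev in events:
        if ev.get("ph") != "X" or "ts" not in ev:
            continue
        ts = float(ev["ts"])
        finish = ts + float(ev.get("dur", 0))
        start = ts if start is None else min(start, ts)
        end = finish if end is None else max(end, finish)
    if start is None:
        return None
    return start, end
```

For example, `start, end = trace_span_us(trace["traceEvents"])` followed by `(end - start) / 1e3` gives the covered span in milliseconds.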
The `traceEvents` array is the main analysis input. Each event's category is stored in its `cat` field. Important categories include:
| Event category | Description |
|---|---|
| `kernel` | GPU kernel execution events used for operator-level analysis |
| `gpu_memcpy` | Host-device and device-device memory copy events used for iteration anchoring |
| `gpu_memset` | GPU memory initialization events |
| `cuda_runtime` | CUDA Runtime API calls |
| `cuda_driver` | CUDA Driver API calls |
| `synchronization` | CUDA synchronization events |
| `cuda_event` | CUDA event records |
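A quick first look at a trace is to count events and sum their durations per category. The minimal sketch below assumes that the category is stored in the standard `cat` field and that durations are in microseconds. The summed duration is busy time added across all streams and threads, not wall-clock time, because events on different streams can overlap. `summarize_by_category` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def summarize_by_category(events: Iterable[dict]) -> Dict[str, Tuple[int, float]]:
    """Map each event category (`cat`) to (event count, summed `dur` in µs).

    Events without a `cat` field (e.g. some metadata events) are skipped.
    Overlapping events on different streams are simply added together.
    """
    counts: Dict[str, int] = defaultdict(int)
    totals: Dict[str, float] = defaultdict(float)
    for ev in events:
        cat = ev.get("cat")
        if cat is None:
            continue
        counts[cat] += 1
        totals[cat] += float(ev.get("dur", 0))
    return {cat: (counts[cat], totals[cat]) for cat in counts}
```

Comparing the `kernel` total against `cuda_runtime` and `gpu_memcpy` gives a rough indication of whether a case is dominated by GPU compute, launch overhead, or data movement.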
For GPU kernel events, important fields include:
| Field | Description |
|---|---|
| `name` | Compiled CUDA kernel name |
| `ts` | Start timestamp (µs) |
| `dur` | Duration (µs) |
| `tid` | CUDA stream or thread identifier |
| `args.grid` | Grid configuration |
| `args.block` | Block configuration |
| `args.shared memory` | Shared memory usage |
| `args.registers per thread` | Register count per thread |
| `args.est. achieved occupancy %` | Estimated occupancy |
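The kernel fields listed above can be pulled into typed records for operator-level analysis. The minimal sketch below assumes that kernel events carry `"cat": "kernel"` and that `args.grid` and `args.block` are stored as `[x, y, z]` integer lists, as in Kineto-style traces. Missing `args` entries become `None`. `KernelRecord` and `kernel_records` are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class KernelRecord:
    """One GPU kernel launch, mirroring the kernel fields in the table above."""
    name: str
    ts_us: float
    dur_us: float
    stream: object                      # `tid`: CUDA stream or thread identifier
    grid: Tuple[int, ...]
    block: Tuple[int, ...]
    shared_mem: Optional[int]
    regs_per_thread: Optional[int]
    est_occupancy_pct: Optional[float]

    @property
    def threads_per_block(self) -> Optional[int]:
        # Product of the block dimensions; None if no block was recorded.
        return math.prod(self.block) if self.block else None


def kernel_records(events: Iterable[dict]) -> List[KernelRecord]:
    """Collect events with cat == "kernel" into KernelRecord objects."""
    records: List[KernelRecord] = []
    for ev in events:
        if ev.get("cat") != "kernel":
            continue
        args = ev.get("args", {})
        records.append(KernelRecord(
            name=ev["name"],
            ts_us=float(ev["ts"]),
            dur_us=float(ev.get("dur", 0)),
            stream=ev.get("tid"),
            grid=tuple(args.get("grid", ())),
            block=tuple(args.get("block", ())),
            shared_mem=args.get("shared memory"),
            regs_per_thread=args.get("registers per thread"),
            est_occupancy_pct=args.get("est. achieved occupancy %"),
        ))
    return records
```

Sorting the result with `sorted(recs, key=lambda r: r.dur_us, reverse=True)` surfaces the longest individual launches. Their launch configuration and occupancy can then be inspected alongside their duration.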
## Analysis Layers Supported by the Dataset
The dataset supports the three LLM-Prof analysis layers:
| Layer | Purpose | Key metrics |
|---|---|---|
| SEA | Service-level hotspot detection | QPS, FPR, token size, GPU utilization |
| MEA | Model-level iteration analysis | IIPS, MIE, iteration duration |
| OEA | Operator-level bottleneck analysis | operator efficiency, BottleScore, time proportion, Roofline position |
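The exact definitions of the OEA metrics are not given in this description. As an illustration only, the minimal sketch below approximates a kernel's time proportion as its summed duration divided by the summed duration of all kernels, grouped by kernel name. This is an assumed simplification, not necessarily LLM-Prof's definition. It ignores overlap between concurrent streams and does not undo kernel fusion. `kernel_time_share` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def kernel_time_share(events: Iterable[dict], top_k: int = 10) -> List[Tuple[str, float]]:
    """Return up to top_k (kernel name, share of total kernel time) pairs.

    Assumed approximation: share = sum(dur for this name) / sum(dur of all
    kernels). Overlap between concurrent streams is not accounted for.
    """
    per_name = Counter()
    for ev in events:
        if ev.get("cat") == "kernel":
            per_name[ev["name"]] += float(ev.get("dur", 0))
    total = sum(per_name.values())
    if total == 0:
        return []
    return [(name, t / total) for name, t in per_name.most_common(top_k)]
```

Kernel names in CUDA traces are often mangled C++ symbols, so piping them through a demangler such as `c++filt` can make the ranking easier to read.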
## Notes
- RTP-LLM traces reflect real production workloads and may contain more heterogeneous runtime behavior than controlled benchmarks.
- SGLang and vLLM traces are better suited for controlled model-framework-hardware comparisons.
- Fine-grained operator attribution may be affected by framework-specific kernel fusion, custom kernels, and CUDA Graph-based execution.
- Profiling support currently covers NVIDIA GPUs only, using CUPTI-collected traces.