LLM-Prof Dataset Description
Dataset link: https://huggingface.co/buckets/zk521/LLM-Prof-Dataset/
Overview
The LLM-Prof dataset contains 207 GPU kernel traces collected for the paper "LLM-Prof: A Hierarchical Cross-Stack Performance Profiling Framework for Production LLM Inference Services". The dataset supports cross-layer analysis of LLM inference services at the service, model, and operator levels.
The traces cover three inference frameworks:
| Framework | Cases | Source |
|---|---|---|
| RTP-LLM | 51 | Production MaaS services with real runtime telemetry |
| SGLang | 73 | Controlled benchmark traces across models, GPUs, and workloads |
| vLLM | 83 | Controlled benchmark traces across models, GPUs, and workloads |
| Total | 207 | Production plus controlled benchmark traces |
Dataset Files
The released dataset is organized as three compressed archives:
| Archive | Content |
|---|---|
| original_trace_RTP-LLM_51.tar.gz | RTP-LLM production traces and related runtime metadata |
| original_trace_SGLang_73.tar.gz | SGLang benchmark traces |
| original_trace_vLLM_83.tar.gz | vLLM benchmark traces |
All traces use the Chrome Trace Event JSON format and are stored either as compressed .json.gz files or as framework-specific trace files. A typical trace contains GPU kernels, memory copies, CUDA runtime calls, CUDA driver calls, synchronization events, and related metadata.
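As a minimal sketch of reading one such trace in Python, assuming the file has already been extracted from its archive: the helper names `load_trace` and `trace_events` are hypothetical and are not part of the dataset or of LLM-Prof. The fallback for a bare JSON array reflects the Chrome Trace Event format in general, not a confirmed property of these files.

```python
import gzip
import json
from pathlib import Path


def load_trace(path):
    """Load a Chrome Trace Event JSON file, handling .gz transparently."""
    path = Path(path)
    # Choose the opener from the file suffix; both accept text mode.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


def trace_events(trace):
    """Return the list of events from a loaded trace."""
    # The Chrome Trace Event format allows either a bare JSON array of
    # events or an object that holds them under "traceEvents".
    if isinstance(trace, list):
        return trace
    return trace.get("traceEvents", [])
```

Calling `trace_events(load_trace(p))` on an extracted trace path `p` would then yield the event list used in the rest of the analysis. Large traces are read fully into memory with this approach, so a streaming JSON parser may be a better fit for very large files.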
Framework-Specific Data
RTP-LLM
The RTP-LLM subset contains traces from real production services. Each case may include:
- GPU kernel trace files collected from production inference execution.
- prefill_metrics_with_config.txt, which records model configuration, hardware information, QPS time series, GPU utilization, token-size information, and service metadata.
- Profiling metadata and optional analysis outputs.
This subset reflects real deployment diversity, including varying business workloads, request rates, token sizes, GPU utilization, model sizes, and hardware configurations.
SGLang and vLLM
The SGLang and vLLM subsets are controlled benchmark traces. They systematically vary:
- Model family and model size.
- Batch size.
- Input length and output length.
- GPU type.
- Inference framework.
These traces provide controlled comparisons across frameworks and hardware platforms, complementing the production diversity of the RTP-LLM subset.
Hardware and Model Coverage
The dataset covers six NVIDIA GPU types:
- A10
- A100
- A800
- H20
- H800
- L20
The evaluated services include 21 model variants from three major model families: Qwen, LLaMA, and BERT. Model sizes range from sub-billion scale to 70B scale and include both dense and MoE-style models.
Trace Format
Each trace follows the Chrome Trace Event format. The top-level structure usually contains:
{
"schemaVersion": 1,
"deviceProperties": [],
"traceEvents": [],
"traceName": "...",
"displayTimeUnit": "ns"
}
The traceEvents array is the main analysis input. Important event categories include:
| Event category | Description |
|---|---|
| kernel | GPU kernel execution events used for operator-level analysis |
| gpu_memcpy | Host-device and device-device memory copy events used for iteration anchoring |
| gpu_memset | GPU memory initialization events |
| cuda_runtime | CUDA Runtime API calls |
| cuda_driver | CUDA Driver API calls |
| synchronization | CUDA synchronization events |
| cuda_event | CUDA event records |
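A quick way to see which of these categories a given trace contains is to count events per category. The sketch below assumes the category is stored in each event's `cat` field, as the Chrome Trace Event format defines it; the function name `count_by_category` and the placeholder label for events without a category are hypothetical.

```python
from collections import Counter


def count_by_category(events):
    """Count trace events per category (the "cat" field)."""
    counts = Counter()
    for ev in events:
        if not isinstance(ev, dict):
            continue  # skip anything that is not an event object
        # Metadata events may carry no category, so group them separately.
        counts[ev.get("cat", "<no category>")] += 1
    return counts
```

For example, `count_by_category(events).most_common()` lists the categories from most to least frequent, which gives a first impression of whether a trace is dominated by kernel events or by runtime API calls.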
For GPU kernel events, important fields include:
| Field | Description |
|---|---|
| name | Compiled CUDA kernel name |
| ts | Start timestamp |
| dur | Duration |
| tid | CUDA stream or thread identifier |
| args.grid | Grid configuration |
| args.block | Block configuration |
| args.shared memory | Shared memory usage |
| args.registers per thread | Register count per thread |
| args.est. achieved occupancy % | Estimated occupancy |
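These fields are enough for a first operator-level summary. The sketch below totals `dur` per kernel `name` and reports each kernel's share of the summed kernel time. It is an illustration only, not the LLM-Prof implementation: the function name `kernel_time_by_name` and its `top_k` parameter are hypothetical, and it assumes kernel events are marked with `cat` equal to `kernel`.

```python
from collections import defaultdict


def kernel_time_by_name(events, top_k=5):
    """Return (name, total_dur, share) for the top_k kernels by total duration."""
    totals = defaultdict(float)
    for ev in events:
        # Keep only GPU kernel events that carry a duration.
        if ev.get("cat") != "kernel" or "dur" not in ev:
            continue
        totals[ev.get("name", "<unnamed>")] += float(ev["dur"])

    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # share is relative to the summed kernel time, not to wall-clock time.
    return [
        (name, dur, dur / grand_total if grand_total else 0.0)
        for name, dur in ranked
    ]
```

Durations are reported in whatever unit the trace uses for `dur`; the Chrome Trace Event format defines `ts` and `dur` in microseconds, with `displayTimeUnit` affecting only how viewers display them. Because kernels on different streams can overlap, the summed kernel time may exceed the elapsed time of the trace.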
Analysis Layers Supported by the Dataset
The dataset supports the three LLM-Prof analysis layers:
| Layer | Purpose | Key metrics |
|---|---|---|
| SEA | Service-level hotspot detection | QPS, FPR, token size, GPU utilization |
| MEA | Model-level iteration analysis | IIPS, MIE, iteration duration |
| OEA | Operator-level bottleneck analysis | operator efficiency, BottleScore, time proportion, Roofline position |
Notes
- RTP-LLM traces reflect real production workloads and may contain more heterogeneous runtime behavior than controlled benchmarks.
- SGLang and vLLM traces are better suited for controlled model-framework-hardware comparisons.
- Fine-grained operator attribution may be affected by framework-specific kernel fusion, custom kernels, and CUDA Graph-based execution.
- Current profiling support is based on NVIDIA/CUPTI traces.