
LLM-Prof Dataset Description

Dataset link: https://huggingface.co/buckets/zk521/LLM-Prof-Dataset/

Overview

The LLM-Prof dataset contains 207 GPU kernel traces collected for the paper "LLM-Prof: A Hierarchical Cross-Stack Performance Profiling Framework for Production LLM Inference Services". The dataset supports cross-layer analysis of LLM inference services at the service, model, and operator levels.

The traces cover three inference frameworks:

| Framework | Cases | Source |
|-----------|-------|--------|
| RTP-LLM | 51 | Production MaaS services with real runtime telemetry |
| SGLang | 73 | Controlled benchmark traces across models, GPUs, and workloads |
| vLLM | 83 | Controlled benchmark traces across models, GPUs, and workloads |
| Total | 207 | Production plus controlled benchmark traces |

Dataset Files

The released dataset is organized as three compressed archives:

| Archive | Content |
|---------|---------|
| original_trace_RTP-LLM_51.tar.gz | RTP-LLM production traces and related runtime metadata |
| original_trace_SGLang_73.tar.gz | SGLang benchmark traces |
| original_trace_vLLM_83.tar.gz | vLLM benchmark traces |
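
Before unpacking a large archive, it can help to see which trace files it contains. The sketch below lists the regular files in one of the archives using Python's standard `tarfile` module. The function name `list_trace_members` is a hypothetical helper, and the internal directory layout of the archives is not documented here, so no particular member names should be assumed.

```python
import tarfile


def list_trace_members(archive_path):
    """Return the names of regular files in a .tar.gz archive without extracting it.

    `list_trace_members` is an illustrative helper, not part of LLM-Prof.
    """
    with tarfile.open(archive_path, mode="r:gz") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


# Example call (assumes the archive has been downloaded to the working directory):
# names = list_trace_members("original_trace_vLLM_83.tar.gz")
# print(len(names), names[:5])
```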

All traces use the Chrome Trace Event JSON format and are stored either as gzip-compressed .json.gz files or as framework-specific trace files. A typical trace contains GPU kernels, memory copies, CUDA runtime calls, CUDA driver calls, synchronization events, and related metadata.
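
A minimal sketch of reading one trace into memory, assuming the file is either plain JSON or gzip-compressed JSON with a `.gz` suffix. The helper name `load_trace` and the example path are hypothetical; framework-specific trace files that are not JSON would need their own reader.

```python
import gzip
import json


def load_trace(path):
    """Load a Chrome Trace Event JSON file, transparently handling .gz compression.

    `load_trace` is an illustrative helper, not part of LLM-Prof.
    """
    # Choose the opener from the file suffix; both accept text mode.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Example call (the path is a placeholder for an extracted trace file):
# trace = load_trace("some_case/trace.json.gz")
# events = trace["traceEvents"]
```

Very large traces may not fit comfortably in memory when parsed this way; a streaming JSON parser would be an alternative in that case.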

Framework-Specific Data

RTP-LLM

The RTP-LLM subset contains traces from real production services. Each case may include:

  • GPU kernel trace files collected from production inference execution.
  • prefill_metrics_with_config.txt, which records model configuration, hardware information, QPS time series, GPU utilization, token-size information, and service metadata.
  • Profiling metadata and optional analysis outputs.

This subset reflects real deployment diversity, including varying business workloads, request rates, token sizes, GPU utilization, model sizes, and hardware configurations.

SGLang and vLLM

The SGLang and vLLM subsets are controlled benchmark traces. They systematically vary:

  • Model family and model size.
  • Batch size.
  • Input length and output length.
  • GPU type.
  • Inference framework.

These traces provide controlled comparisons across frameworks and hardware platforms, complementing the production diversity of the RTP-LLM subset.

Hardware and Model Coverage

The dataset covers six NVIDIA GPU types:

  • A10
  • A100
  • A800
  • H20
  • H800
  • L20

The evaluated services include 21 model variants from three major model families: Qwen, LLaMA, and BERT. Model sizes range from sub-billion parameters to 70B-scale deployments and include both dense and MoE-style models.

Trace Format

Each trace follows the Chrome Trace Event format. The top-level structure usually contains:

{
  "schemaVersion": 1,
  "deviceProperties": [],
  "traceEvents": [],
  "traceName": "...",
  "displayTimeUnit": "ns"
}

The traceEvents array is the main analysis input. Important event categories include:

| Event category | Description |
|----------------|-------------|
| `kernel` | GPU kernel execution events used for operator-level analysis |
| `gpu_memcpy` | Host-device and device-device memory copy events used for iteration anchoring |
| `gpu_memset` | GPU memory initialization events |
| `cuda_runtime` | CUDA Runtime API calls |
| `cuda_driver` | CUDA Driver API calls |
| `synchronization` | CUDA synchronization events |
| `cuda_event` | CUDA event records |
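
As a quick sanity check on a loaded trace, the events can be counted per category. The sketch below assumes each event stores its category in the `cat` field, as is conventional in the Chrome Trace Event format; events without that field (for example metadata records) are grouped under a placeholder key. The function name is hypothetical.

```python
from collections import Counter


def count_events_by_category(trace_events):
    """Count trace events per category.

    Assumes the category lives in the `cat` field; events lacking it
    are counted under the placeholder key "<none>".
    """
    return Counter(event.get("cat", "<none>") for event in trace_events)


# Example with a tiny hand-written event list (not real dataset content):
sample_events = [
    {"name": "gemm_kernel", "cat": "kernel"},
    {"name": "Memcpy HtoD", "cat": "gpu_memcpy"},
    {"name": "softmax_kernel", "cat": "kernel"},
    {"name": "process_name"},
]
category_counts = count_events_by_category(sample_events)
```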

For GPU kernel events, important fields include:

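| Field | Description |
|-------|-------------|
| `name` | Compiled CUDA kernel name |
| `ts` | Start timestamp (microseconds, per the Chrome Trace Event format) |
| `dur` | Duration (microseconds, per the Chrome Trace Event format) |
| `tid` | CUDA stream or thread identifier |
| `args.grid` | Grid configuration |
| `args.block` | Block configuration |
| `args.shared memory` | Shared memory usage |
| `args.registers per thread` | Register count per thread |
| `args.est. achieved occupancy %` | Estimated occupancy |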
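
Several of the `args` keys listed above contain spaces and punctuation, so they must be read by exact string key. The following sketch flattens kernel events into plain dictionaries under that assumption. The function name and the output key names are illustrative choices, and any field missing from an event is returned as `None` rather than guessed.

```python
def kernel_records(trace_events):
    """Flatten kernel events into simple dictionaries.

    Assumes kernel events carry cat == "kernel" and the field names listed
    in the table above; missing fields become None.
    """
    records = []
    for event in trace_events:
        if event.get("cat") != "kernel":
            continue
        args = event.get("args", {})
        records.append({
            "name": event.get("name"),
            "ts": event.get("ts"),
            "dur": event.get("dur"),
            "tid": event.get("tid"),
            "grid": args.get("grid"),
            "block": args.get("block"),
            "shared_memory": args.get("shared memory"),
            "registers_per_thread": args.get("registers per thread"),
            "est_occupancy_pct": args.get("est. achieved occupancy %"),
        })
    return records


# Example with a hand-written event (values are made up for illustration):
example = kernel_records([
    {"name": "gemm_kernel", "cat": "kernel", "ts": 100, "dur": 25, "tid": 7,
     "args": {"grid": [8, 1, 1], "block": [256, 1, 1],
              "registers per thread": 64, "est. achieved occupancy %": 50}},
    {"name": "cudaLaunchKernel", "cat": "cuda_runtime", "ts": 90, "dur": 5},
])
```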

Analysis Layers Supported by the Dataset

The dataset supports the three LLM-Prof analysis layers:

| Layer | Purpose | Key metrics |
|-------|---------|-------------|
| SEA | Service-level hotspot detection | QPS, FPR, token size, GPU utilization |
| MEA | Model-level iteration analysis | IIPS, MIE, iteration duration |
| OEA | Operator-level bottleneck analysis | Operator efficiency, BottleScore, time proportion, Roofline position |
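
To give a feel for the operator-level view, the sketch below computes each kernel's share of total kernel time from the `dur` field. This is only one plausible reading of "time proportion"; the paper's exact definition, and the definitions of the other metrics in the table, are not given here and are not reproduced. The function name is hypothetical, and overlapping kernels on different streams are simply summed.

```python
from collections import defaultdict


def kernel_time_proportion(trace_events):
    """Return {kernel name: share of summed kernel duration}, largest first.

    Assumption: "time proportion" is taken as a kernel's total `dur`
    divided by the summed `dur` of all kernel events. This is an
    illustrative definition, not necessarily the one used by LLM-Prof.
    """
    totals = defaultdict(float)
    for event in trace_events:
        if event.get("cat") == "kernel":
            totals[event.get("name", "<unnamed>")] += event.get("dur", 0)
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {name: duration / grand_total for name, duration in ranked}


# Example with made-up durations:
shares = kernel_time_proportion([
    {"name": "gemm_kernel", "cat": "kernel", "dur": 20},
    {"name": "softmax_kernel", "cat": "kernel", "dur": 10},
    {"name": "gemm_kernel", "cat": "kernel", "dur": 10},
    {"name": "Memcpy HtoD", "cat": "gpu_memcpy", "dur": 100},
])
```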

Notes

  • RTP-LLM traces reflect real production workloads and may contain more heterogeneous runtime behavior than controlled benchmarks.
  • SGLang and vLLM traces are better suited for controlled model-framework-hardware comparisons.
  • Fine-grained operator attribution may be affected by framework-specific kernel fusion, custom kernels, and CUDA Graph-based execution.
  • Current profiling support is based on NVIDIA/CUPTI traces.
