
	# LLM-Prof Dataset Description

	Dataset link: https://huggingface.co/buckets/zk521/LLM-Prof-Dataset/

	## Overview

	The LLM-Prof dataset contains 207 GPU kernel traces collected for the paper "LLM-Prof: A Hierarchical Cross-Stack Performance Profiling Framework for Production LLM Inference Services". The dataset supports cross-layer analysis of LLM inference services at the service, model, and operator levels.

	The traces cover three inference frameworks:

	| Framework | Cases | Source |
	|---|---:|---|
	| RTP-LLM | 51 | Production MaaS services with real runtime telemetry |
	| SGLang | 73 | Controlled benchmark traces across models, GPUs, and workloads |
	| vLLM | 83 | Controlled benchmark traces across models, GPUs, and workloads |
	| Total | 207 | Production plus controlled benchmark traces |

	## Dataset Files

	The released dataset is organized as three compressed archives:

	| Archive | Content |
	|---|---|
	| `original_trace_RTP-LLM_51.tar.gz` | RTP-LLM production traces and related runtime metadata |
	| `original_trace_SGLang_73.tar.gz` | SGLang benchmark traces |
	| `original_trace_vLLM_83.tar.gz` | vLLM benchmark traces |
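Before unpacking an archive, it can help to see which trace files it holds. The following is a minimal sketch using only the Python standard library; the helper name `list_trace_members` and the suffix filter are illustrative assumptions, since the internal directory layout of the archives is not documented here.

```python
import tarfile


def list_trace_members(archive_path, suffixes=(".json.gz", ".json")):
    """Return sorted names of regular files in a .tar.gz whose names end with `suffixes`.

    `suffixes` is an assumption about how trace files are named inside the archive.
    """
    with tarfile.open(archive_path, mode="r:gz") as tar:
        return sorted(
            member.name
            for member in tar.getmembers()
            if member.isfile() and member.name.endswith(suffixes)
        )


if __name__ == "__main__":
    import os

    path = "original_trace_vLLM_83.tar.gz"  # archive name from the table above
    if os.path.exists(path):
        for name in list_trace_members(path):
            print(name)
```

Listing members first avoids extracting a large archive just to find a single case.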

	All traces are stored in Chrome Trace Event JSON format, either as compressed `.json.gz` files or as framework-specific trace files. A typical trace contains GPU kernels, memory copies, CUDA runtime calls, CUDA driver calls, synchronization events, and related metadata.
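A trace file can be read with the Python standard library alone. The sketch below assumes a trace is either a gzip-compressed or a plain JSON file. The helper names `load_trace` and `get_events` are hypothetical, and the bare-array fallback reflects the general Chrome Trace Event format rather than anything specific to this dataset.

```python
import gzip
import json


def load_trace(path):
    """Load a Chrome Trace Event JSON file, using gzip when the name ends in .gz."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return json.load(f)


def get_events(trace):
    """Return the event list from a trace.

    The Chrome Trace Event format allows either an object with a
    `traceEvents` array or a bare array of events; handle both.
    """
    if isinstance(trace, list):
        return trace
    return trace.get("traceEvents", [])
```

For example, `get_events(load_trace("some_case.json.gz"))` would return the list of event dictionaries, where `some_case.json.gz` is a placeholder file name.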

	## Framework-Specific Data

	### RTP-LLM

	The RTP-LLM subset contains traces from real production services. Each case may include:

	- GPU kernel trace files collected from production inference execution.
	- `prefill_metrics_with_config.txt`, which records model configuration, hardware information, QPS time series, GPU utilization, token-size information, and service metadata.
	- Profiling metadata and optional analysis outputs.

	This subset reflects real deployment diversity, including varying business workloads, request rates, token sizes, GPU utilization, model sizes, and hardware configurations.

	### SGLang and vLLM

	The SGLang and vLLM subsets are controlled benchmark traces. They systematically vary:

	- Model family and model size.
	- Batch size.
	- Input length and output length.
	- GPU type.
	- Inference framework.

	These traces provide controlled comparisons across frameworks and hardware platforms, complementing the production diversity of the RTP-LLM subset.

	## Hardware and Model Coverage

	The dataset covers six NVIDIA GPU types:

	- A10
	- A100
	- A800
	- H20
	- H800
	- L20

	The evaluated services include 21 model variants from three major model families: Qwen, LLaMA, and BERT. Model sizes range from sub-billion scale to 70B scale and include both dense and MoE-style models.

	## Trace Format

	Each trace follows the Chrome Trace Event format. The top-level structure usually contains:

	```json
	{
	  "schemaVersion": 1,
	  "deviceProperties": [],
	  "traceEvents": [],
	  "traceName": "...",
	  "displayTimeUnit": "ns"
	}
	```

	The `traceEvents` array is the main analysis input. Important event categories include:

	| Event category | Description |
	|---|---|
	| `kernel` | GPU kernel execution events used for operator-level analysis |
	| `gpu_memcpy` | Host-device and device-device memory copy events used for iteration anchoring |
	| `gpu_memset` | GPU memory initialization events |
	| `cuda_runtime` | CUDA Runtime API calls |
	| `cuda_driver` | CUDA Driver API calls |
	| `synchronization` | CUDA synchronization events |
	| `cuda_event` | CUDA event records |
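A quick way to check which of these categories a trace actually contains is to count events per category. The sketch below assumes the category is stored in each event's `cat` field, which is the standard Chrome Trace Event field name. The dataset description does not state the field name explicitly, so treat that as an assumption. The sample events are made up for illustration.

```python
from collections import Counter


def count_by_category(events):
    """Count events per category, reading the (assumed) `cat` field of each event."""
    return Counter(
        event.get("cat", "<uncategorized>")
        for event in events
        if isinstance(event, dict)
    )


if __name__ == "__main__":
    # Illustrative events only; real traces carry many more fields.
    sample = [
        {"name": "gemm", "cat": "kernel"},
        {"name": "softmax", "cat": "kernel"},
        {"name": "Memcpy HtoD", "cat": "gpu_memcpy"},
        {"name": "process_name"},  # metadata event without a category
    ]
    for category, count in count_by_category(sample).most_common():
        print(f"{category}: {count}")
```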

	For GPU kernel events, important fields include:

	| Field | Description |
	|---|---|
	| `name` | Compiled CUDA kernel name |
	| `ts` | Start timestamp (microseconds, per the Chrome Trace Event format) |
	| `dur` | Duration (microseconds, per the Chrome Trace Event format) |
	| `tid` | CUDA stream or thread identifier |
	| `args.grid` | Grid configuration |
	| `args.block` | Block configuration |
	| `args.shared memory` | Shared memory usage |
	| `args.registers per thread` | Register count per thread |
	| `args.est. achieved occupancy %` | Estimated occupancy |
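As a starting point for operator-level analysis, the `name` and `dur` fields can be aggregated to find where GPU time is spent. This is a minimal sketch, not the LLM-Prof method: it assumes kernel events are marked with `cat == "kernel"`, and the helper name `kernel_time_by_name` and the sample events are hypothetical.

```python
from collections import defaultdict


def kernel_time_by_name(events):
    """Sum `dur` per kernel name and return rows of (name, total_dur, share_of_kernel_time).

    Rows are sorted by total duration, largest first. Only events whose
    (assumed) `cat` field equals "kernel" and that carry a `dur` are counted.
    """
    totals = defaultdict(float)
    for event in events:
        if event.get("cat") == "kernel" and "dur" in event:
            totals[event.get("name", "<unnamed>")] += float(event["dur"])
    grand_total = sum(totals.values())
    rows = [
        (name, total, total / grand_total if grand_total else 0.0)
        for name, total in totals.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    # Illustrative events only; durations are in the trace's own time unit.
    sample = [
        {"name": "gemm", "cat": "kernel", "ts": 0, "dur": 30},
        {"name": "softmax", "cat": "kernel", "ts": 30, "dur": 10},
        {"name": "gemm", "cat": "kernel", "ts": 40, "dur": 10},
        {"name": "Memcpy HtoD", "cat": "gpu_memcpy", "ts": 50, "dur": 5},
    ]
    for name, total, share in kernel_time_by_name(sample):
        print(f"{name}: total={total} share={share:.2f}")
```

The share computed here is relative to summed kernel time only, so overlapping kernels on different streams are counted independently rather than against wall-clock time.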

	## Analysis Layers Supported by the Dataset

	The dataset supports the three LLM-Prof analysis layers:

	| Layer | Purpose | Key metrics |
	|---|---|---|
	| SEA | Service-level hotspot detection | QPS, FPR, token size, GPU utilization |
	| MEA | Model-level iteration analysis | IIPS, MIE, iteration duration |
	| OEA | Operator-level bottleneck analysis | Operator efficiency, BottleScore, time proportion, Roofline position |

	## Notes

	- RTP-LLM traces reflect real production workloads and may contain more heterogeneous runtime behavior than controlled benchmarks.
	- SGLang and vLLM traces are better suited for controlled model-framework-hardware comparisons.
	- Fine-grained operator attribution may be affected by framework-specific kernel fusion, custom kernels, and CUDA Graph-based execution.
	- Current profiling support is based on NVIDIA/CUPTI traces.
