# LLM-Prof Dataset Description

Dataset link: https://huggingface.co/buckets/zk521/LLM-Prof-Dataset/

## Overview

The LLM-Prof dataset contains 207 GPU kernel traces collected for the paper "LLM-Prof: A Hierarchical Cross-Stack Performance Profiling Framework for Production LLM Inference Services". The dataset supports cross-layer analysis of LLM inference services at the service, model, and operator levels.

The traces cover three inference frameworks:

| Framework | Cases | Source |
|---|---:|---|
| RTP-LLM | 51 | Production MaaS services with real runtime telemetry |
| SGLang | 73 | Controlled benchmark traces across models, GPUs, and workloads |
| vLLM | 83 | Controlled benchmark traces across models, GPUs, and workloads |
| Total | 207 | Production plus controlled benchmark traces |

## Dataset Files

The released dataset is organized as three compressed archives:

| Archive | Content |
|---|---|
| `original_trace_RTP-LLM_51.tar.gz` | RTP-LLM production traces and related runtime metadata |
| `original_trace_SGLang_73.tar.gz` | SGLang benchmark traces |
| `original_trace_vLLM_83.tar.gz` | vLLM benchmark traces |

All traces use the Chrome Trace Event JSON format and are stored either as gzip-compressed `.json.gz` files or as framework-specific trace files. A typical trace contains GPU kernels, memory copies, CUDA runtime calls, CUDA driver calls, synchronization events, and related metadata.

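
As a minimal sketch of reading one trace after extracting an archive, the helper below loads either a `.json.gz` file or a plain JSON file. The function name `load_trace` is hypothetical, and the sketch assumes that compressed traces carry a `.gz` suffix; framework-specific trace files that are not JSON would need their own reader.

```python
import gzip
import json
from pathlib import Path


def load_trace(path):
    """Load one Chrome Trace Event file, gzip-compressed or plain JSON.

    Assumption: a ".gz" suffix means gzip compression; any other
    suffix is read as uncompressed JSON text.
    """
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    # Text mode ("rt") lets json.load consume the decompressed stream directly.
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

A call would look like `trace = load_trace(some_path)`, where `some_path` points at a trace file inside one of the extracted archives. Large traces are read fully into memory, which may matter for long production captures.
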
## Framework-Specific Data

### RTP-LLM

The RTP-LLM subset contains traces from real production services. Each case may include:

- GPU kernel trace files collected from production inference execution.
- `prefill_metrics_with_config.txt`, which records model configuration, hardware information, QPS time series, GPU utilization, token-size information, and service metadata.
- Profiling metadata and optional analysis outputs.

This subset reflects real deployment diversity, including varying business workloads, request rates, token sizes, GPU utilization, model sizes, and hardware configurations.

### SGLang and vLLM

The SGLang and vLLM subsets are controlled benchmark traces. They systematically vary:

- Model family and model size.
- Batch size.
- Input length and output length.
- GPU type.
- Inference framework.

These traces provide controlled comparisons across frameworks and hardware platforms, complementing the production diversity of the RTP-LLM subset.

## Hardware and Model Coverage

The dataset covers six NVIDIA GPU types:

- A10
- A100
- A800
- H20
- H800
- L20

The evaluated services include 21 model variants from three major model families: Qwen, LLaMA, and BERT. Model sizes range from sub-billion parameters to 70B-scale deployments and include both dense and MoE-style models.

## Trace Format

Each trace follows the Chrome Trace Event format. The top-level structure usually contains:

```json
{
  "schemaVersion": 1,
  "deviceProperties": [],
  "traceEvents": [],
  "traceName": "...",
  "displayTimeUnit": "ns"
}
```

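
Because the structure above is described as usual rather than guaranteed, a quick check before analysis can help. The sketch below reports which of the listed top-level keys are absent from a loaded trace and how many events it holds; the helper name `describe_top_level` is hypothetical.

```python
# Top-level keys taken from the structure shown above.
EXPECTED_KEYS = (
    "schemaVersion",
    "deviceProperties",
    "traceEvents",
    "traceName",
    "displayTimeUnit",
)


def describe_top_level(trace):
    """Report missing top-level keys and the number of trace events."""
    missing = [key for key in EXPECTED_KEYS if key not in trace]
    # A trace without "traceEvents" is treated as having zero events.
    events = trace.get("traceEvents", [])
    return {"missing_keys": missing, "event_count": len(events)}


sample = {
    "schemaVersion": 1,
    "deviceProperties": [],
    "traceEvents": [],
    "traceName": "...",
    "displayTimeUnit": "ns",
}
print(describe_top_level(sample))
```

A missing key is reported rather than raised as an error, so the same check can run over all three subsets without stopping at the first irregular file.
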
The `traceEvents` array is the main analysis input. Important event categories include:

| Event category | Description |
|---|---|
| `kernel` | GPU kernel execution events used for operator-level analysis |
| `gpu_memcpy` | Host-device and device-device memory copy events used for iteration anchoring |
| `gpu_memset` | GPU memory initialization events |
| `cuda_runtime` | CUDA Runtime API calls |
| `cuda_driver` | CUDA Driver API calls |
| `synchronization` | CUDA synchronization events |
| `cuda_event` | CUDA event records |

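
To see how a given trace is distributed over these categories, the events can be tallied. This is a minimal sketch that assumes the category is stored in each event's `cat` field, as in the Chrome Trace Event format; the function name `count_by_category` and the sample events are illustrative only.

```python
from collections import Counter


def count_by_category(trace_events):
    """Count events per category.

    Assumption: the category lives in the "cat" field. Events without
    one (for example metadata records) are grouped under "<none>".
    """
    return Counter(event.get("cat", "<none>") for event in trace_events)


# Illustrative events, not taken from the dataset.
sample_events = [
    {"cat": "kernel", "name": "example_kernel_a"},
    {"cat": "kernel", "name": "example_kernel_b"},
    {"cat": "gpu_memcpy", "name": "example_copy"},
    {"name": "process_name"},
]
for category, count in count_by_category(sample_events).most_common():
    print(category, count)
```
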
For GPU kernel events, important fields include:

| Field | Description |
|---|---|
| `name` | Compiled CUDA kernel name |
| `ts` | Start timestamp |
| `dur` | Duration |
| `tid` | CUDA stream or thread identifier |
| `args.grid` | Grid configuration |
| `args.block` | Block configuration |
| `args.shared memory` | Shared memory usage |
| `args.registers per thread` | Register count per thread |
| `args.est. achieved occupancy %` | Estimated occupancy |

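
The `name` and `dur` fields are enough for a first operator-level view. The sketch below sums `dur` per kernel name and reports each kernel's share of the total summed kernel time. It is only an illustration, not the LLM-Prof implementation: the function name `kernel_time_share` and the sample kernel names are hypothetical, events are assumed to carry their category in `cat`, and the share is relative to summed kernel time rather than wall-clock time, so overlapping streams are not accounted for.

```python
from collections import defaultdict


def kernel_time_share(trace_events):
    """Return (kernel name, total dur, share of summed kernel time), largest first."""
    totals = defaultdict(float)
    for event in trace_events:
        # Assumption: kernel events are marked with cat == "kernel".
        if event.get("cat") == "kernel" and "dur" in event:
            totals[event.get("name", "<unnamed>")] += event["dur"]

    grand_total = sum(totals.values())
    if grand_total == 0:
        return []  # no kernel time recorded, so no shares to report

    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(name, dur, dur / grand_total) for name, dur in ranked]


# Illustrative events, not taken from the dataset.
sample_events = [
    {"cat": "kernel", "name": "example_gemm", "ts": 0, "dur": 30},
    {"cat": "kernel", "name": "example_softmax", "ts": 30, "dur": 10},
    {"cat": "kernel", "name": "example_gemm", "ts": 40, "dur": 10},
    {"cat": "gpu_memcpy", "name": "example_copy", "ts": 50, "dur": 5},
]
for name, dur, share in kernel_time_share(sample_events):
    print(f"{name}: dur={dur} share={share:.2f}")
```

Grouping by the raw compiled kernel name keeps the sketch simple. Mapping kernels to logical operators would need framework-specific rules that this dataset description does not specify.
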
## Analysis Layers Supported by the Dataset

The dataset supports the three LLM-Prof analysis layers:

| Layer | Purpose | Key metrics |
|---|---|---|
| SEA | Service-level hotspot detection | QPS, FPR, token size, GPU utilization |
| MEA | Model-level iteration analysis | IIPS, MIE, iteration duration |
| OEA | Operator-level bottleneck analysis | Operator efficiency, BottleScore, time proportion, Roofline position |

## Notes

- RTP-LLM traces reflect real production workloads and may contain more heterogeneous runtime behavior than controlled benchmarks.
- SGLang and vLLM traces are better suited for controlled model-framework-hardware comparisons.
- Fine-grained operator attribution may be affected by framework-specific kernel fusion, custom kernels, and CUDA Graph-based execution.
- Current profiling support is based on NVIDIA/CUPTI traces.