# PyTorch Profiling in verl
Last updated: 01/13/2026.
This guide explains how to use the native [PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html) for profiling verl training runs.
## Configuration
Profiling in verl can be configured through parameters in the trainer configuration file (e.g., `ppo_trainer.yaml`).
### Global Profiling Control
In `global_profiler`, you can control when and how profiling occurs globally:
* **`global_profiler.steps`**: List of step numbers to profile. E.g., `[1, 2, 5]` profiles steps 1, 2, and 5. Set to `null` to disable.
* **`global_profiler.save_path`**: Directory to save the profiling results. Default is `outputs/profile`.
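For instance, a minimal global section using the two options above (the step numbers are illustrative) might look like:

```yaml
global_profiler:
  steps: [1, 2, 5]            # profile training steps 1, 2 and 5; null disables profiling
  save_path: outputs/profile  # where trace files are written
```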
### Role Profiling Control
Each RL role (Actor, Critic, etc.) has its own `profiler` configuration:
* **`enable`**: Whether to enable profiling for this role.
* **`all_ranks`**: If `True`, profiles all ranks.
* **`ranks`**: List of specific ranks to profile if `all_ranks` is `False`.
* **`tool_config.torch`**: Configuration specific to the PyTorch Profiler.
#### PyTorch Profiler Options (`tool_config.torch`)
You can customize the PyTorch Profiler behavior using the following fields under `tool_config.torch`:
* **`contents`**: List of data categories to collect. Valid entries:
  * **`cpu`**: Profile CPU activities.
  * **`cuda`**: Profile CUDA activities.
  * **`memory`**: Track tensor memory allocations and frees.
  * **`shapes`**: Record the shapes of operator inputs.
  * **`stack`**: Record source file and line number for each op.
* **`discrete`**: If `True`, save a separate trace file for each profiled step instead of a single end-to-end trace (see the examples below).
* **`schedule`**: (Advanced) Configuration of the profiler's `wait`, `warmup`, `active`, and `repeat` cycle.
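Putting these options together, a `tool_config.torch` block with an explicit schedule could look like the sketch below. The `schedule` subkeys follow the `wait`/`warmup`/`active`/`repeat` fields named above; the exact YAML shape is an assumption, chosen to mirror the parameters of `torch.profiler.schedule`:

```yaml
tool_config:
  torch:
    discrete: False
    contents: [cpu, cuda, memory, shapes, stack]
    schedule:       # assumed shape; mirrors torch.profiler.schedule
      wait: 1       # skip 1 step before each cycle
      warmup: 1     # warm up for 1 step (traced but discarded)
      active: 3     # record 3 steps
      repeat: 2     # repeat the wait/warmup/active cycle twice
```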
## Examples
### 1. End-to-End Collection
Collects performance data for all steps in a single trace file.
```yaml
global_profiler:
  steps: [1, 2, 5]
  save_path: ./outputs/profile

actor_rollout_ref:
  actor:
    profiler:
      enable: True
      all_ranks: True
      tool_config:
        torch:
          discrete: False
          contents: [cpu, cuda]
# rollout & ref follow actor settings
```
### 2. Discrete Mode Collection
Discrete mode saves separate trace files for each step. This is useful for detailed analysis and is **mandatory** when using Agent Loop.
**Configuration Example**
This configuration supports profiling both Training (Actor) and Inference (Rollout). You can enable/disable them independently.
```yaml
actor_rollout_ref:
  actor:
    profiler:
      enable: True        # Set to True to profile training
      all_ranks: False
      ranks: [0]          # Global Rank 0
      tool_config:
        torch:
          discrete: True
          contents: [cpu, cuda]
  rollout:
    profiler:
      enable: True        # Set to True to profile inference
      all_ranks: False
      ranks: [0]          # In Agent Loop, this is the Replica Rank (e.g. the 0-th instance)
      tool_config:
        torch:
          discrete: True  # REQUIRED
# ref follows actor settings
```
**Agent Loop Mode Description**
When Rollout runs in [Agent Loop](../advance/agent_loop.rst) mode, performance data for the Rollout phase **must be collected using discrete mode**. In this case, the Profiler is triggered by the inference engine backend.
1. **Rank definition**: `ranks` in the Rollout configuration refers to the Replica Rank (the inference instance index), not the Global Rank.
2. **Inference engine support**: The vLLM and SGLang engines are currently supported without additional settings:
   * **vLLM Engine**: Automatically collects AsyncLLM scheduling stacks and inference-process performance data.
   * **SGLang Engine**: Automatically collects inference-process performance data; does not support the `memory` option in `contents`.
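For example, to profile only the first two inference replicas in Agent Loop mode (the replica indices here are illustrative), a rollout-only sketch could look like:

```yaml
actor_rollout_ref:
  rollout:
    profiler:
      enable: True
      all_ranks: False
      ranks: [0, 1]        # Replica Ranks, not Global Ranks
      tool_config:
        torch:
          discrete: True   # mandatory in Agent Loop mode
```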
## Visualization
Collected trace files (usually `.json` or `.json.gz`) are stored in the configured `save_path`.
You can visualize them using:
1. **Chrome Tracing**: Open `chrome://tracing` in a Chrome browser and load the JSON file.
2. **Perfetto**: Open [ui.perfetto.dev](https://ui.perfetto.dev/) and load the file (recommended for large traces).
3. **TensorBoard**: If the PyTorch Profiler TensorBoard plugin (`torch-tb-profiler`) is installed, point TensorBoard at the trace directory.