# PyTorch Profiling in verl
Last updated: 01/13/2026.
This guide explains how to use the native [PyTorch Profiler](https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html) for profiling verl training runs.
## Configuration
Profiling in verl can be configured through parameters in the trainer configuration file (e.g., `ppo_trainer.yaml`).
### Global Profiling Control
In `global_profiler`, you can control when and how profiling occurs globally:
* **`global_profiler.steps`**: List of step numbers to profile. E.g., `[1, 2, 5]` profiles steps 1, 2, and 5. Set to `null` to disable.
* **`global_profiler.save_path`**: Directory to save the profiling results. Default is `outputs/profile`.
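For instance, a minimal global section using the two options above (the step numbers are illustrative) might look like:

```yaml
global_profiler:
  steps: [1, 2, 5]            # profile training steps 1, 2 and 5; null disables profiling
  save_path: outputs/profile  # where trace files are written
```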
### Role Profiling Control
Each RL role (Actor, Critic, etc.) has its own `profiler` configuration:
* **`enable`**: Whether to enable profiling for this role.
* **`all_ranks`**: If `True`, profiles all ranks.
* **`ranks`**: List of specific ranks to profile if `all_ranks` is `False`.
* **`tool_config.torch`**: Configuration specific to the PyTorch Profiler.
#### PyTorch Profiler Options (`tool_config.torch`)
You can customize the PyTorch Profiler behavior using the following fields under `tool_config.torch`:
* **`contents`**: List of data categories to collect. Valid entries:
  * **`cpu`**: Profile CPU activities.
  * **`cuda`**: Profile CUDA activities.
  * **`memory`**: Track tensor memory allocations and frees.
  * **`shapes`**: Record the shapes of operator inputs.
  * **`stack`**: Record source file and line number for each op.
* **`discrete`**: If `True`, save a separate trace file for each profiled step instead of a single end-to-end trace (see the examples below).
* **`schedule`**: (Advanced) Configuration of the profiler's `wait`, `warmup`, `active`, and `repeat` cycle.
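Putting these options together, a `tool_config.torch` block with an explicit schedule could look like the sketch below. The `schedule` subkeys follow the `wait`/`warmup`/`active`/`repeat` fields named above; the exact YAML shape is an assumption, chosen to mirror the parameters of `torch.profiler.schedule`:

```yaml
tool_config:
  torch:
    discrete: False
    contents: [cpu, cuda, memory, shapes, stack]
    schedule:       # assumed shape; mirrors torch.profiler.schedule
      wait: 1       # skip 1 step before each cycle
      warmup: 1     # warm up for 1 step (traced but discarded)
      active: 3     # record 3 steps
      repeat: 2     # repeat the wait/warmup/active cycle twice
```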
## Examples
### 1. End-to-End Collection
Collects performance data for all steps in a single trace file.
```yaml
global_profiler:
  steps: [1, 2, 5]
  save_path: ./outputs/profile

actor_rollout_ref:
  actor:
    profiler:
      enable: True
      all_ranks: True
      tool_config:
        torch:
          discrete: False
          contents: [cpu, cuda]
# rollout & ref follow actor settings
```
### 2. Discrete Mode Collection
Discrete mode saves separate trace files for each step. This is useful for detailed analysis and is **mandatory** when using Agent Loop.
**Configuration Example**
This configuration supports profiling both Training (Actor) and Inference (Rollout). You can enable/disable them independently.
```yaml
actor_rollout_ref:
  actor:
    profiler:
      enable: True        # Set to True to profile training
      all_ranks: False
      ranks: [0]          # Global Rank 0
      tool_config:
        torch:
          discrete: True
          contents: [cpu, cuda]
  rollout:
    profiler:
      enable: True        # Set to True to profile inference
      all_ranks: False
      ranks: [0]          # In Agent Loop, this is the Replica Rank (e.g. the 0-th instance)
      tool_config:
        torch:
          discrete: True  # REQUIRED
# ref follows actor settings
```
**Agent Loop Mode Description**
When Rollout runs in [Agent Loop](../advance/agent_loop.rst) mode, performance data for the Rollout phase **must be collected using discrete mode**. In this case, the Profiler is triggered by the inference engine backend.
1. **Rank definition**: `ranks` in the Rollout configuration refers to the Replica Rank (the inference instance index), not the Global Rank.
2. **Inference engine support**: The vLLM and SGLang engines are currently supported without additional settings:
   * **vLLM Engine**: Automatically collects AsyncLLM scheduling stacks and inference-process performance data.
   * **SGLang Engine**: Automatically collects inference-process performance data; does not support the `memory` option in `contents`.
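For example, to profile only the first two inference replicas in Agent Loop mode (the replica indices here are illustrative), a rollout-only sketch could look like:

```yaml
actor_rollout_ref:
  rollout:
    profiler:
      enable: True
      all_ranks: False
      ranks: [0, 1]        # Replica Ranks, not Global Ranks
      tool_config:
        torch:
          discrete: True   # mandatory in Agent Loop mode
```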
## Visualization
Collected trace files (usually `.json` or `.json.gz`) are stored in the configured `save_path`.
You can visualize them using:
1. **Chrome Tracing**: Open `chrome://tracing` in a Chrome browser and load the JSON file.
2. **Perfetto**: Open [ui.perfetto.dev](https://ui.perfetto.dev/) and load the file (recommended for large traces).
3. **TensorBoard**: If the PyTorch Profiler TensorBoard plugin (`torch-tb-profiler`) is installed, point TensorBoard at the trace directory.