# GRPO Internal Mode Execution Scripts
## Known Issues

The following bugs affect vLLM >= 0.8:

- DeepSpeed ZeRO-3 mode: when using DeepSpeed's ZeRO-3 configuration, gradients may become zero during training.
- Async mode: in certain scenarios, asynchronous mode may hang, causing the program to become unresponsive.

To ensure stability and compatibility, it is recommended to use vLLM 0.7.3 to avoid the above issues.
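One way to pin the recommended version, assuming a pip-based environment, is:

```shell
# Pin vLLM to 0.7.3 to avoid the ZeRO-3 zero-gradient and async-hang issues above
pip install "vllm==0.7.3"
```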
## Introduction
The GRPO (Group Relative Policy Optimization) training framework supports integrating high-performance inference engines like vLLM to accelerate the sampling process. In Internal Mode, the inference service is launched directly within the Trainer, reducing external dependencies and simplifying deployment.
This folder contains scripts and instructions for running GRPO in Internal Mode, where the model training and inference are tightly integrated with flexible resource allocation strategies.
## Resource Allocation Strategies

GRPO provides two resource allocation strategies in Internal Mode:
### 1. Colocate Mode

- Description: training and inference share the same GPU resources.
- Recommended setting: set `sleep_level=1` to release vLLM memory during training steps.
- Resource allocation rules:

  ```
  NPROC_PER_NODE    = total number of GPUs
  num_infer_workers = total number of GPUs
  ```
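For example, on a single node with 8 GPUs, colocate mode assigns all GPUs to both roles. The launch command below is illustrative only; the exact CLI and flag names depend on your framework version:

```shell
# Colocate mode on an 8-GPU node: training and vLLM inference share all GPUs.
TOTAL_GPUS=8
NPROC_PER_NODE=$TOTAL_GPUS        # every GPU runs a training process
num_infer_workers=$TOTAL_GPUS     # the same GPUs also host inference workers
echo "training procs: $NPROC_PER_NODE, inference workers: $num_infer_workers"
# Illustrative launch (flag names are assumptions; check your CLI's docs):
# NPROC_PER_NODE=$NPROC_PER_NODE swift rlhf --rlhf_type grpo \
#     --num_infer_workers $num_infer_workers --sleep_level 1 ...
```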
### 2. Async Mode

- Description: training and inference use separate, dedicated GPU resources.
- Recommended setting: set `sleep_level=1` to release vLLM memory during training steps.
- Resource allocation rules:

  ```
  NPROC_PER_NODE    = number of training GPUs
  num_infer_workers = number of inference GPUs
  ```

  Must satisfy: number of training GPUs + number of inference GPUs = total GPU count.
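As a sketch, an 8-GPU node could dedicate 6 GPUs to training and 2 to inference. The split and the launch command are illustrative; only the constraint that the two counts sum to the total GPU count comes from the rules above:

```shell
# Async mode on an 8-GPU node: disjoint GPU sets for training and inference.
TOTAL_GPUS=8
NPROC_PER_NODE=6                  # e.g. GPUs 0-5 run training
num_infer_workers=2               # e.g. GPUs 6-7 host vLLM inference
# The split must cover the node exactly:
if [ $((NPROC_PER_NODE + num_infer_workers)) -eq "$TOTAL_GPUS" ]; then
    echo "valid split"
fi
# Illustrative launch (flag names are assumptions; check your CLI's docs):
# NPROC_PER_NODE=$NPROC_PER_NODE swift rlhf --rlhf_type grpo \
#     --num_infer_workers $num_infer_workers ...
```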