# GRPO Internal Mode Execution Scripts
## Known Issues

The following bugs affect vLLM >= 0.8:

- DeepSpeed ZeRO-3 mode: when using DeepSpeed's ZeRO-3 configuration, gradients may become zero during training.
- Async mode: in certain scenarios, asynchronous mode may hang, causing the program to become unresponsive.

To ensure stability and compatibility, it is recommended to use vLLM 0.7.3 to avoid the above issues.
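One way to pin the recommended version, assuming a pip-based environment, is:

```shell
# Pin vLLM to 0.7.3 to avoid the ZeRO-3 zero-gradient and async-hang issues above
pip install "vllm==0.7.3"
```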
## Introduction
The GRPO (Group Relative Policy Optimization) training framework supports integrating high-performance inference engines like vLLM to accelerate the sampling process. In Internal Mode, the inference service is launched directly within the Trainer, reducing external dependencies and simplifying deployment.
This folder contains scripts and instructions for running GRPO in Internal Mode, where the model training and inference are tightly integrated with flexible resource allocation strategies.
## Resource Allocation Strategies

GRPO provides two resource allocation strategies in Internal Mode:
### 1. Colocate Mode

- Description: training and inference share the same GPU resources.
- Recommended setting: set `sleep_level=1` to release vLLM memory during training steps.
- Resource allocation rules:

  ```
  NPROC_PER_NODE    = total number of GPUs
  num_infer_workers = total number of GPUs
  ```
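For example, on a single node with 8 GPUs, colocate mode assigns all GPUs to both roles. The launch command below is illustrative only; the exact CLI and flag names depend on your framework version:

```shell
# Colocate mode on an 8-GPU node: training and vLLM inference share all GPUs.
TOTAL_GPUS=8
NPROC_PER_NODE=$TOTAL_GPUS        # every GPU runs a training process
num_infer_workers=$TOTAL_GPUS     # the same GPUs also host inference workers
echo "training procs: $NPROC_PER_NODE, inference workers: $num_infer_workers"
# Illustrative launch (flag names are assumptions; check your CLI's docs):
# NPROC_PER_NODE=$NPROC_PER_NODE swift rlhf --rlhf_type grpo \
#     --num_infer_workers $num_infer_workers --sleep_level 1 ...
```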
### 2. Async Mode

- Description: training and inference use separate, dedicated GPU resources.
- Recommended setting: set `sleep_level=1` to release vLLM memory during training steps.
- Resource allocation rules:

  ```
  NPROC_PER_NODE    = number of training GPUs
  num_infer_workers = number of inference GPUs
  ```

  Must satisfy: number of training GPUs + number of inference GPUs = total GPU count.
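As a sketch, an 8-GPU node could dedicate 6 GPUs to training and 2 to inference. The split and the launch command are illustrative; only the constraint that the two counts sum to the total GPU count comes from the rules above:

```shell
# Async mode on an 8-GPU node: disjoint GPU sets for training and inference.
TOTAL_GPUS=8
NPROC_PER_NODE=6                  # e.g. GPUs 0-5 run training
num_infer_workers=2               # e.g. GPUs 6-7 host vLLM inference
# The split must cover the node exactly:
if [ $((NPROC_PER_NODE + num_infer_workers)) -eq "$TOTAL_GPUS" ]; then
    echo "valid split"
fi
# Illustrative launch (flag names are assumptions; check your CLI's docs):
# NPROC_PER_NODE=$NPROC_PER_NODE swift rlhf --rlhf_type grpo \
#     --num_infer_workers $num_infer_workers ...
```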