Agent Loop
==========

Last updated: 07/17/2025.

.. versionadded:: 0.4.2
    [status: alpha]

.. warning::
    Agent Loop is ready for use, but the API may change in future releases.

Agent Loop is designed as a general interface for multi-turn rollout and agentic reinforcement learning.

**Design goals**:

- Pluggable user-defined agent loops
- A standard request/generate API across different inference frameworks
- Request-level load balancing across multiple inference servers

**Non-goals**:

- How tools are defined and how tools are called

At a high level, the agent loop is given a prompt and runs a user-defined loop: call the LLM generate API, call tools, ...,
then return the final output. The final output is then scored with a reward and used as a trajectory for RL training.

.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/agent_loop_overview.svg?raw=true

API Design
----------

The ``AgentLoopBase`` class is the abstraction of an agent loop, and ``run`` is the only method users need to implement.
Given prompt messages in the format ``[{"role": "user", "content": "..."}]`` and additional sampling params,
the ``run`` method can do whatever the user wants, such as:

- call the LLM generate API
- call tools: web search, database query, code sandbox, ...
- interact with an environment
- reflect
- ...

.. code:: python

    class AgentLoopBase(ABC):
        @abstractmethod
        async def run(self, sampling_params: dict[str, Any], **kwargs) -> AgentLoopOutput:
            """Run agent loop to interact with LLM server and environment.

            Args:
                sampling_params (Dict[str, Any]): LLM sampling params.
                **kwargs: dataset fields from `verl.utils.dataset.RLHFDataset`.

            Returns:
                AgentLoopOutput: Agent loop output.
            """
            raise NotImplementedError

After running the user-defined loop, the ``run`` method should return an ``AgentLoopOutput`` containing the prompt token ids,
response token ids, and response mask.

.. code:: python

    class AgentLoopOutput(BaseModel):
        """Agent loop output."""

        prompt_ids: list[int]
        """Prompt token ids."""
        response_ids: list[int]
        """Response token ids, including LLM-generated tokens and tool response tokens."""
        response_mask: list[int]
        """Response mask: 1 for LLM-generated tokens, 0 for tool response tokens."""

.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/agent_loop_output.svg?raw=true

.. note:: AgentLoopOutput only outputs one trajectory for a given prompt; outputting multiple trajectories per prompt is still under discussion.

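To make the interface concrete, below is a minimal sketch of a user-defined agent loop with one tool call between two LLM turns.
It is illustrative only: attribute names such as ``self.server_manager`` and ``self.tokenizer``, the ``raw_prompt`` dataset field, and the ``my_tool`` helper are assumptions for the example, not part of the API described above.

.. code:: python

    import uuid
    from typing import Any

    class ToolAgentLoopExample(AgentLoopBase):
        """Hypothetical agent loop: LLM turn -> tool call -> final LLM turn."""

        async def run(self, sampling_params: dict[str, Any], **kwargs) -> AgentLoopOutput:
            messages = list(kwargs["raw_prompt"])  # assumed dataset field holding chat messages
            request_id = uuid.uuid4().hex

            # Encode the prompt once; all later turns operate purely on token ids.
            prompt_ids = self.tokenizer.apply_chat_template(
                messages, add_generation_prompt=True, tokenize=True
            )

            # Turn 1: ask the LLM.
            response_ids = await self.server_manager.generate(
                request_id=request_id, prompt_ids=prompt_ids, sampling_params=sampling_params
            )
            response_mask = [1] * len(response_ids)

            # Call a hypothetical tool and append its result as plain tokens (mask 0).
            tool_text = my_tool(self.tokenizer.decode(response_ids))
            tool_ids = self.tokenizer.encode(tool_text, add_special_tokens=False)
            response_ids += tool_ids
            response_mask += [0] * len(tool_ids)

            # Turn 2: let the LLM read the tool result and produce the final answer (mask 1).
            final_ids = await self.server_manager.generate(
                request_id=request_id,
                prompt_ids=prompt_ids + response_ids,
                sampling_params=sampling_params,
            )
            response_ids += final_ids
            response_mask += [1] * len(final_ids)

            return AgentLoopOutput(
                prompt_ids=prompt_ids,
                response_ids=response_ids,
                response_mask=response_mask,
            )

Note that the loop works token-in-token-out: response_ids and response_mask grow together, so downstream training can distinguish LLM-generated tokens from tool tokens.
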
Architecture Design
-------------------

.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/agent_loop_architecture.png?raw=true

A single PPO step contains two phases: rollout and train. In the rollout phase:

1. PPOTrainer samples a batch from the dataset and calls ``AgentLoopManager.generate_sequences``.
2. AgentLoopManager calls ``wake_up`` on all async LLM server instances, which syncs weights between the inference engine (vLLM/SGLang) and the training engine (FSDP/Megatron-LM).
3. AgentLoopManager splits the batch into chunks and sends each chunk to an ``AgentLoopWorker``.
4. Each AgentLoopWorker receives a chunk and, for each prompt, spawns a user-defined ``AgentLoopBase`` instance, then runs the ``run`` coroutine to completion to get an ``AgentLoopOutput``.

.. tip::
    An AgentLoopWorker schedules multiple coroutines concurrently. If the number of AgentLoopWorkers equals the batch size, then each worker is responsible for exactly one prompt.

Inside the agent loop, whenever the user needs the LLM to generate a response:

5. The agent loop calls ``AsyncLLMServerManager.generate`` with prompt_ids.
6. AsyncLLMServerManager selects the server instance with the fewest requests on the first turn and sends the request to it (on following turns, the request is sent to the same server instance).
7. The AsyncLLMServer receives the request, issues an IPC/RPC call to the model runner, and generates the response (there are slight differences between vLLM and SGLang, see below).

When all prompts in all AgentLoopWorkers finish, AgentLoopManager gathers the results and returns them to PPOTrainer:

8. AgentLoopManager calls ``sleep`` on all server instances, which frees the KV cache and offloads weights to CPU memory.

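Schematically, the rollout phase can be pictured with the simplified, self-contained sketch below. It only mirrors the control flow of steps 1-8; the real AgentLoopManager and AgentLoopWorker are distributed workers, and weight sync / KV-cache management are omitted here.

.. code:: python

    import asyncio

    async def run_agent_loop(prompt: str) -> str:
        """Stand-in for one user-defined AgentLoopBase.run coroutine."""
        await asyncio.sleep(0)  # pretend to call the LLM server and tools here
        return f"trajectory({prompt})"

    async def agent_loop_worker(chunk: list[str]) -> list[str]:
        # Step 4: each worker runs one agent loop coroutine per prompt, concurrently.
        return await asyncio.gather(*(run_agent_loop(p) for p in chunk))

    async def generate_sequences(batch: list[str], num_workers: int = 2) -> list[str]:
        # Step 2: wake_up() would sync weights to the inference engine (omitted here).
        # Step 3: split the batch into chunks, one per worker.
        chunks = [batch[i::num_workers] for i in range(num_workers)]
        results = await asyncio.gather(*(agent_loop_worker(c) for c in chunks))
        # Step 8: sleep() would free the KV cache and offload weights (omitted here).
        return [trajectory for worker_output in results for trajectory in worker_output]

    print(asyncio.run(generate_sequences(["p0", "p1", "p2", "p3"])))
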
AsyncLLMServer
~~~~~~~~~~~~~~

AsyncLLMServer is the abstraction of an LLM server with two types of generation API:

- `OpenAI chat completion <https://platform.openai.com/docs/api-reference/chat>`_: generate a response for the given chat conversation.
- Token-in-token-out: generate response ids for the given token ids.

We officially support vLLM and SGLang AsyncLLMServers; both implement the two APIs and are well tested.
Other inference engines should be easy to plug in by implementing the ``AsyncServerBase`` class, as illustrated after the abstract class below.

.. code:: python

    class AsyncServerBase(ABC):
        @abstractmethod
        async def chat_completion(self, raw_request: Request) -> JSONResponse:
            """OpenAI chat completion API.

            Args:
                raw_request (Request): raw json request

            Returns:
                JSONResponse: json response

            API reference: https://platform.openai.com/docs/api-reference/chat/create
            """
            raise NotImplementedError

        @abstractmethod
        async def generate(self, prompt_ids: list[int], sampling_params: dict[str, Any], request_id: str) -> list[int]:
            """Generate response ids given prompt ids.

            Args:
                prompt_ids (List[int]): prompt ids
                sampling_params (Dict[str, Any]): sampling params
                request_id (str): request id

            Returns:
                List[int]: response ids
            """
            raise NotImplementedError

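As a rough illustration of the plug-in point, a hypothetical subclass might wire both APIs to some engine client as below; the ``engine`` object and its ``chat`` / ``generate_ids`` methods are made up for the example and do not correspond to a real inference engine API.

.. code:: python

    from typing import Any

    from fastapi import Request
    from fastapi.responses import JSONResponse

    class DummyAsyncServer(AsyncServerBase):
        """Hypothetical server plug-in backed by a made-up engine client."""

        def __init__(self, engine):
            self.engine = engine  # stand-in for a real inference engine client

        async def chat_completion(self, raw_request: Request) -> JSONResponse:
            # Parse the OpenAI-style request body and forward the messages to the engine.
            body = await raw_request.json()
            text = await self.engine.chat(body["messages"])  # made-up engine method
            return JSONResponse({"choices": [{"message": {"role": "assistant", "content": text}}]})

        async def generate(self, prompt_ids: list[int], sampling_params: dict[str, Any], request_id: str) -> list[int]:
            # Token-in-token-out path: raw token ids in, raw token ids out.
            return await self.engine.generate_ids(prompt_ids, sampling_params, request_id)  # made-up engine method
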
Chat completion vs Token-in-token-out
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. warning::
    The following conclusion is based on our recent experience and is still open to investigation and discussion.

Almost all agent frameworks (LangGraph, CrewAI, LlamaIndex, etc.) call the LLM with the OpenAI chat completion API and
keep the chat history as messages, so users may expect the chat completion API to be used in multi-turn rollout.
But based on our recent experience with single-turn training on DAPO and multi-turn training on `retool <https://github.com/volcengine/verl-recipe/tree/main/retool>`_,
we found that the token ids obtained by applying the chat template to the final messages may not equal the token ids obtained by concatenating prompt_ids and response_ids from each turn.

.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/multi_turn.png?raw=true

**Where does this inconsistency happen?**

First, the tool parser may alter the content. For example:

.. code:: json

    {"role": "assistant", "content": "Let me call a <tool_call>...</tool_call> and get the result"}

After tool_calls extraction, the message looks like this:

.. code:: json

    {"role": "assistant", "content": "Let me call a and get the result", "tool_calls": [{"name": "foo", "arguments": "{}"}]}

Encoding the extracted message back does not reproduce the original LLM-generated response_ids.

Second, the decode-encode round trip may also lead to inconsistency: `Agent-R1 issue#30 <https://github.com/0russwest0/Agent-R1/issues/30#issuecomment-2826155367>`_.

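One way to check for this inconsistency is to compare the two encodings directly. The sketch below assumes a Hugging Face tokenizer (the model name is only an example) and per-turn ids collected by a token-in-token-out agent loop.

.. code:: python

    from transformers import AutoTokenizer

    # Example model; any chat model with a chat template works here.
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

    def is_consistent(prompt_ids: list[int], response_ids: list[int], final_messages: list[dict]) -> bool:
        """Compare per-turn token ids against re-encoding the final chat history."""
        rollout_ids = prompt_ids + response_ids  # what the policy actually saw and produced
        reencoded_ids = tokenizer.apply_chat_template(final_messages, tokenize=True)
        # If these differ, training on reencoded_ids would be off-policy w.r.t. the rollout.
        return rollout_ids == reencoded_ids
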
**What is the impact of this inconsistency?**

This inconsistency is not a big problem for a serving/agent system, but it is critical for RL training:
it causes the trajectory to deviate from the policy model distribution. We have observed that applying the chat template
to the final chat history messages makes PPO training fail to converge even in the single-turn case.

vLLM
^^^^

.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/async_vllm.png?raw=true

For vLLM, the Async LLM Engine runs in the same process as the server, while the ModelRunner runs in the same process as the FSDP/Megatron-LM workers.
The Async LLM Engine communicates with the ModelRunner through ZeroMQ. When the server receives a request, it directly calls the engine to generate response_ids.

SGLang
^^^^^^

.. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/async_sglang.png?raw=true

For SGLang, the Async LLM Engine runs in the same process as FSDP/Megatron-LM worker-0, and it spawns multiple subprocesses as ModelRunners.
The Async LLM Engine also communicates with the ModelRunners through ZeroMQ. When the server receives a request, it remotely calls worker-0 to get response_ids.

AsyncLLMServerManager
~~~~~~~~~~~~~~~~~~~~~

AsyncLLMServerManager serves as a proxy to multiple AsyncLLMServer instances and provides:

- load balancing: on the first turn, the server instance with the fewest requests is selected and the request is sent to it; see the sketch after this list.
- sticky sessions: each request_id is bound to a server instance, so the same request_id is sent to the same server instance on following turns.

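The routing policy can be sketched as follows (an illustration with a made-up server list and counters, not the actual implementation):

.. code:: python

    class StickyLeastRequestRouter:
        """Illustrative routing policy: least-request selection plus sticky sessions."""

        def __init__(self, servers: list[str]):
            self.servers = servers                  # server handles (plain strings here)
            self.load = {s: 0 for s in servers}     # in-flight request count per server
            self.session = {}                       # request_id -> server

        def pick(self, request_id: str) -> str:
            if request_id not in self.session:
                # First turn: choose the server with the fewest in-flight requests.
                self.session[request_id] = min(self.servers, key=lambda s: self.load[s])
            return self.session[request_id]  # following turns reuse the same server
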
AsyncLLMServerManager is passed to ``AgentLoopBase.__init__``; whenever users want to interact with the LLM inside the agent loop,
they can call ``AsyncLLMServerManager.generate`` to generate response_ids.

.. code:: python

    class AsyncLLMServerManager:
        async def generate(
            self,
            request_id,
            *,
            prompt_ids: list[int],
            sampling_params: dict[str, Any],
        ) -> list[int]:
            """Generate tokens from prompt ids.

            Args:
                request_id (str): request id for sticky session.
                prompt_ids (List[int]): List of prompt token ids.
                sampling_params (Dict[str, Any]): Sampling parameters for the chat completion.

            Returns:
                List[int]: List of generated token ids.
            """
            ...

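For example, a multi-turn loop might reuse one request_id across turns like this (a hypothetical helper; the tokenizer argument is an assumption for the example):

.. code:: python

    import uuid

    async def two_turn_generate(server_manager, tokenizer, messages, sampling_params):
        """Two LLM turns sharing one request_id, so both land on the same server instance."""
        request_id = uuid.uuid4().hex
        prompt_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True)

        first = await server_manager.generate(
            request_id=request_id, prompt_ids=prompt_ids, sampling_params=sampling_params
        )
        # The same request_id triggers the sticky session: this request goes to the
        # same server instance that handled the first turn.
        second = await server_manager.generate(
            request_id=request_id, prompt_ids=prompt_ids + first, sampling_params=sampling_params
        )
        return first + second
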
Next
----

- :doc:`Agentic RL Training<../start/agentic_rl>`: Quick start for agentic RL training with the GSM8K dataset.
- `LangGraph MathExpression <https://github.com/volcengine/verl-recipe/tree/main/langgraph_agent/example>`_: Demonstrates how to use LangGraph to build an agent loop.
- `Retool <https://github.com/volcengine/verl-recipe/tree/main/retool>`_: End-to-end retool paper reproduction using a tool agent.