SkyPilot Examples
=================
Last updated: 09/04/2025.
This guide provides examples of running VERL reinforcement learning training on Kubernetes clusters or cloud platforms with GPU nodes using `SkyPilot <https://github.com/skypilot-org/skypilot>`_.
Installation and Configuration
-------------------------------
Step 1: Install SkyPilot
~~~~~~~~~~~~~~~~~~~~~~~~~
Choose the installation based on your target platform:
.. code-block:: bash

   # For Kubernetes only
   pip install "skypilot[kubernetes]"

   # For AWS
   pip install "skypilot[aws]"

   # For Google Cloud Platform
   pip install "skypilot[gcp]"

   # For Azure
   pip install "skypilot[azure]"

   # For multiple platforms
   pip install "skypilot[kubernetes,aws,gcp,azure]"

Step 2: Configure Your Platform
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Follow the `SkyPilot installation guide <https://docs.skypilot.co/en/latest/getting-started/installation.html>`_ to configure access to your chosen platform.
Step 3: Set Up Environment Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Export the API keys needed for experiment tracking and for access to gated models:
.. code-block:: bash

   # For Weights & Biases tracking
   export WANDB_API_KEY="your-wandb-api-key"

   # For HuggingFace gated models (if needed)
   export HF_TOKEN="your-huggingface-token"

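
A missing key is easiest to catch before a cluster is provisioned. The following is a minimal sketch of a preflight check, assuming the launch commands read these values from the current shell environment. The helper name ``require_env`` is hypothetical and is not part of SkyPilot or VERL.

```shell
# Hypothetical helper (not part of SkyPilot or VERL).
# Reports every listed variable that is not exported or is empty.
# Returns 0 when all are present, 1 otherwise.
require_env() {
  _missing=0
  for _name in "$@"; do
    # printenv only sees exported variables, which is what a child
    # process started from this shell would inherit.
    if [ -z "$(printenv "$_name")" ]; then
      echo "missing or empty: $_name" >&2
      _missing=1
    fi
  done
  return "$_missing"
}
```

For example, ``require_env WANDB_API_KEY HF_TOKEN && echo "ready to launch"`` prints the confirmation only when both variables are exported and non-empty.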
Examples
--------
All example configurations are available in the `examples/skypilot/ <https://github.com/volcengine/verl/tree/main/examples/skypilot>`_ directory on GitHub. See the `README <https://github.com/volcengine/verl/blob/main/examples/skypilot/README.md>`_ for additional details.
PPO Training
~~~~~~~~~~~~
.. code-block:: bash

   sky launch -c verl-ppo verl-ppo.yaml --secret WANDB_API_KEY -y

Runs PPO training on the GSM8K dataset using the Qwen2.5-0.5B-Instruct model across 2 nodes with H100 GPUs. Based on the examples in ``examples/ppo_trainer/``.
`View verl-ppo.yaml on GitHub <https://github.com/volcengine/verl/blob/main/examples/skypilot/verl-ppo.yaml>`_
GRPO Training
~~~~~~~~~~~~~
.. code-block:: bash

   sky launch -c verl-grpo verl-grpo.yaml --secret WANDB_API_KEY -y

Runs GRPO (Group Relative Policy Optimization) training on the MATH dataset using the Qwen2.5-7B-Instruct model, with a memory-optimized configuration for 2 nodes. Based on the examples in ``examples/grpo_trainer/``.
`View verl-grpo.yaml on GitHub <https://github.com/volcengine/verl/blob/main/examples/skypilot/verl-grpo.yaml>`_
Multi-turn Tool Usage Training
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash

   sky launch -c verl-multiturn verl-multiturn-tools.yaml \
     --secret WANDB_API_KEY --secret HF_TOKEN -y

Runs single-node training on 8 H100 GPUs for multi-turn tool usage with Qwen2.5-3B-Instruct. Includes tool and interaction configurations for GSM8K. Based on the examples in ``examples/sglang_multiturn/``, but uses vLLM instead of SGLang.
`View verl-multiturn-tools.yaml on GitHub <https://github.com/volcengine/verl/blob/main/examples/skypilot/verl-multiturn-tools.yaml>`_
Configuration
-------------
The example YAML files are pre-configured with:
- **Infrastructure**: Kubernetes clusters (``infra: k8s``); change this to ``infra: aws``, ``infra: gcp``, or another supported platform as needed
- **Docker Image**: VERL's official Docker image with CUDA 12.6 support
- **Setup**: Automatically clones and installs VERL from source
- **Datasets**: Downloads required datasets during setup phase
- **Ray Cluster**: Configures distributed training across nodes
- **Logging**: Supports Weights & Biases via ``--secret WANDB_API_KEY``
- **Models**: Supports gated HuggingFace models via ``--secret HF_TOKEN``
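
Switching the infrastructure target only requires changing the ``infra:`` value in the task YAML. The sketch below writes a modified copy instead of editing the example in place. The helper name ``set_infra`` is hypothetical, and the sketch assumes the file contains a single ``infra:`` key on its own line; anything after the key on that line, including a trailing comment, is replaced.

```shell
# Hypothetical helper (not part of SkyPilot or VERL).
# Usage: set_infra <source.yaml> <new-infra> <dest.yaml>
# Copies the YAML to <dest.yaml>, replacing the value of any `infra:` line
# and preserving that line's indentation. The source file is left untouched.
set_infra() {
  # '#' is used as the sed delimiter so values containing '/' are safe.
  sed -E "s#^([[:space:]]*infra:[[:space:]]*).*#\1$2#" "$1" > "$3"
}
```

For example, ``set_infra verl-ppo.yaml aws verl-ppo-aws.yaml`` produces a copy targeting AWS, which can then be passed to ``sky launch`` in place of the original file.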
Launch Command Options
----------------------
- ``-c <name>``: Cluster name for managing the job
- ``--secret KEY``: Pass secrets for API keys (can be used multiple times)
- ``-y``: Skip confirmation prompt
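
To illustrate how these options compose, the sketch below assembles a launch command from a cluster name, a task YAML, and any number of secret names. It only prints the command and never runs ``sky``; the helper name ``launch_cmd`` is hypothetical.

```shell
# Hypothetical helper (not part of SkyPilot or VERL).
# Usage: launch_cmd <cluster-name> <task.yaml> [SECRET_NAME ...]
# Prints the corresponding `sky launch` command line; nothing is executed.
launch_cmd() {
  _cluster=$1
  _yaml=$2
  shift 2
  _cmd="sky launch -c $_cluster $_yaml"
  # Each remaining argument becomes its own --secret flag.
  for _secret in "$@"; do
    _cmd="$_cmd --secret $_secret"
  done
  echo "$_cmd -y"
}
```

Calling ``launch_cmd verl-multiturn verl-multiturn-tools.yaml WANDB_API_KEY HF_TOKEN`` prints the same command shown in the multi-turn example, on a single line.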
Monitoring Your Jobs
--------------------
Check Cluster Status
~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash

   sky status

View Logs
~~~~~~~~~
.. code-block:: bash

   sky logs verl-ppo  # View logs for the PPO job

SSH into Head Node
~~~~~~~~~~~~~~~~~~
.. code-block:: bash

   ssh verl-ppo

Access Ray Dashboard
~~~~~~~~~~~~~~~~~~~~
.. code-block:: bash

   sky status --endpoint 8265 verl-ppo  # Get dashboard URL

Tear Down a Cluster
~~~~~~~~~~~~~~~~~~~

.. code-block:: bash

   sky down verl-ppo  # Terminates the cluster and releases its resources