<meta charset="utf-8" /><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Logging&quot;,&quot;local&quot;:&quot;logging&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;PPO Logging&quot;,&quot;local&quot;:&quot;ppo-logging&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;Crucial values&quot;,&quot;local&quot;:&quot;crucial-values&quot;,&quot;sections&quot;:[],&quot;depth&quot;:3}],&quot;depth&quot;:2},{&quot;title&quot;:&quot;GRPO Logging&quot;,&quot;local&quot;:&quot;grpo-logging&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;Crucial GRPO values&quot;,&quot;local&quot;:&quot;crucial-grpo-values&quot;,&quot;sections&quot;:[],&quot;depth&quot;:3}],&quot;depth&quot;:2}],&quot;depth&quot;:1}">
<link href="/docs/trl/pr_3582/en/_app/immutable/assets/0.e3b0c442.css" rel="modulepreload">
<link rel="modulepreload" href="/docs/trl/pr_3582/en/_app/immutable/entry/start.0f0f318c.js">
<link rel="modulepreload" href="/docs/trl/pr_3582/en/_app/immutable/chunks/scheduler.d627b047.js">
<link rel="modulepreload" href="/docs/trl/pr_3582/en/_app/immutable/chunks/singletons.affb0d47.js">
<link rel="modulepreload" href="/docs/trl/pr_3582/en/_app/immutable/chunks/index.a57a1c33.js">
<link rel="modulepreload" href="/docs/trl/pr_3582/en/_app/immutable/chunks/paths.15dc14db.js">
<link rel="modulepreload" href="/docs/trl/pr_3582/en/_app/immutable/entry/app.b27a462f.js">
<link rel="modulepreload" href="/docs/trl/pr_3582/en/_app/immutable/chunks/index.73c51727.js">
<link rel="modulepreload" href="/docs/trl/pr_3582/en/_app/immutable/nodes/0.8cd8e450.js">
<link rel="modulepreload" href="/docs/trl/pr_3582/en/_app/immutable/chunks/each.e59479a4.js">
<link rel="modulepreload" href="/docs/trl/pr_3582/en/_app/immutable/nodes/27.7132fe90.js">
<link rel="modulepreload" href="/docs/trl/pr_3582/en/_app/immutable/chunks/CodeBlock.5f78c87f.js">
<link rel="modulepreload" href="/docs/trl/pr_3582/en/_app/immutable/chunks/getInferenceSnippets.256dfbf1.js"><!-- HEAD_svelte-u9bgzb_START --><meta name="hf:doc:metadata" content="{&quot;title&quot;:&quot;Logging&quot;,&quot;local&quot;:&quot;logging&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;PPO Logging&quot;,&quot;local&quot;:&quot;ppo-logging&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;Crucial values&quot;,&quot;local&quot;:&quot;crucial-values&quot;,&quot;sections&quot;:[],&quot;depth&quot;:3}],&quot;depth&quot;:2},{&quot;title&quot;:&quot;GRPO Logging&quot;,&quot;local&quot;:&quot;grpo-logging&quot;,&quot;sections&quot;:[{&quot;title&quot;:&quot;Crucial GRPO values&quot;,&quot;local&quot;:&quot;crucial-grpo-values&quot;,&quot;sections&quot;:[],&quot;depth&quot;:3}],&quot;depth&quot;:2}],&quot;depth&quot;:1}"><!-- HEAD_svelte-u9bgzb_END --> <p></p> <h1 class="relative group"><a id="logging" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#logging"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Logging</span></h1> <p data-svelte-h="svelte-epstor">As reinforcement learning algorithms are historically 
challenging to debug, it’s important to pay careful attention to logging.
By default, TRL trainers like <a href="/docs/trl/pr_3582/en/ppo_trainer#trl.PPOTrainer">PPOTrainer</a> and <a href="/docs/trl/pr_3582/en/grpo_trainer#trl.GRPOTrainer">GRPOTrainer</a> save a lot of relevant information to supported experiment trackers like Weights &amp; Biases (wandb) or TensorBoard.</p> <p data-svelte-h="svelte-1ja4qyt">Upon initialization, pass the <code>report_to</code> argument to the respective configuration object (e.g., <a href="/docs/trl/pr_3582/en/ppo_trainer#trl.PPOConfig">PPOConfig</a> for <code>PPOTrainer</code>, or <a href="/docs/trl/pr_3582/en/grpo_trainer#trl.GRPOConfig">GRPOConfig</a> for <code>GRPOTrainer</code>):</p> <div class="code-block relative "><div class="absolute top-2.5 right-4"><button class="inline-flex items-center relative text-sm focus:text-green-500 cursor-pointer focus:outline-none transition duration-200 ease-in-out opacity-0 mx-0.5 text-gray-600 " title="code excerpt" type="button"><svg class="" xmlns="http://www.w3.org/2000/svg" aria-hidden="true" fill="currentColor" focusable="false" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 32 32"><path d="M28,10V28H10V10H28m0-2H10a2,2,0,0,0-2,2V28a2,2,0,0,0,2,2H28a2,2,0,0,0,2-2V10a2,2,0,0,0-2-2Z" transform="translate(0)"></path><path d="M4,18H2V4A2,2,0,0,1,4,2H18V4H4Z" transform="translate(0)"></path><rect fill="none" width="32" height="32"></rect></svg> <div class="absolute pointer-events-none transition-opacity bg-black text-white py-1 px-2 leading-tight rounded font-normal shadow left-1/2 top-full transform -translate-x-1/2 translate-y-2 opacity-0"><div class="absolute bottom-full left-1/2 transform -translate-x-1/2 w-0 h-0 border-black border-4 border-t-0" style="border-left-color: transparent; border-right-color: transparent; "></div> Copied</div></button></div> <pre class=""><!-- HTML_TAG_START --><span class="hljs-comment"># For PPOTrainer</span>
<span class="hljs-keyword">from</span> trl <span class="hljs-keyword">import</span> PPOConfig

ppo_config = PPOConfig(
    <span class="hljs-comment"># ...,</span>
    report_to=<span class="hljs-string">&quot;wandb&quot;</span>,  <span class="hljs-comment"># or &quot;tensorboard&quot;</span>
)
<span class="hljs-comment"># For GRPOTrainer</span>
<span class="hljs-keyword">from</span> trl <span class="hljs-keyword">import</span> GRPOConfig

grpo_config = GRPOConfig(
    <span class="hljs-comment"># ...,</span>
    report_to=<span class="hljs-string">&quot;wandb&quot;</span>,  <span class="hljs-comment"># or &quot;tensorboard&quot;</span>
)<!-- HTML_TAG_END --></pre></div> <p data-svelte-h="svelte-1q9avyo">If you want to log with TensorBoard, you might also need to specify logging directories, for example, by adding <code>logging_dir=PATH_TO_LOGS</code> to the configuration object (e.g., <code>PPOConfig</code> or <code>GRPOConfig</code>).</p> <h2 class="relative group"><a id="ppo-logging" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#ppo-logging"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>PPO Logging</span></h2> <p data-svelte-h="svelte-50qxhy">Here’s a brief explanation for the logged metrics provided in the data:</p> <ul data-svelte-h="svelte-1q7noi2"><li><code>eps</code>: Tracks the number of episodes per second.</li> <li><code>objective/kl</code>: The mean Kullback-Leibler (KL) divergence between the current policy and reference policy.</li> <li><code>objective/entropy</code>: The mean entropy of the policy, indicating the randomness of the actions chosen by the policy.</li> <li><code>objective/non_score_reward</code>: The mean reward from non-score-related sources, basically <code>beta * kl.sum(1)</code>, where <code>beta</code> is the KL penalty 
coefficient and <code>kl</code> is the per-token KL divergence.</li> <li><code>objective/rlhf_reward</code>: The mean RLHF reward, which is <code>score - non_score_reward</code>.</li> <li><code>objective/scores</code>: The mean scores returned by the reward model / environment.</li> <li><code>policy/approxkl_avg</code>: The average approximate KL divergence between consecutive PPO policies. Note that this is not the same as <code>objective/kl</code>.</li> <li><code>policy/clipfrac_avg</code>: The average fraction of policy updates that are clipped, indicating how often the policy updates are constrained to prevent large changes.</li> <li><code>loss/policy_avg</code>: The average policy loss, indicating how well the policy is performing.</li> <li><code>loss/value_avg</code>: The average value loss, indicating the difference between the predicted value and the actual reward.</li> <li><code>val/clipfrac_avg</code>: The average fraction of value function updates that are clipped, similar to <code>policy/clipfrac_avg</code> but for the value function.</li> <li><code>policy/entropy_avg</code>: The average entropy of the policy during training, indicating how diverse the policy’s actions are.</li> <li><code>val/ratio</code>: The mean ratio of the current policy probability to the old policy probability, providing a measure of how much the policy has changed.</li> <li><code>val/ratio_var</code>: The variance of the <code>val/ratio</code>, indicating the variability in policy changes.</li> <li><code>val/num_eos_tokens</code>: The number of end-of-sequence (EOS) tokens generated, which can indicate the number of complete responses.</li> <li><code>lr</code>: The current learning rate used by the optimizer.</li> <li><code>episode</code>: The current episode count in the training process.</li></ul> <h3 class="relative group"><a id="crucial-values" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 
with-hover:group-hover:opacity-100 with-hover:right-full" href="#crucial-values"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Crucial values</span></h3> <p data-svelte-h="svelte-189bxez">Many values are logged during training. Here are the most important ones:</p> <ol data-svelte-h="svelte-7tziyu"><li><code>objective/scores</code>: The mean scores returned by the reward model / environment.</li> <li><code>objective/rlhf_reward</code>: The mean RLHF reward. This is the ultimate objective of the RLHF training. If training works as intended, this metric should keep going up.</li> <li><code>objective/non_score_reward</code>: The mean reward from non-score-related sources (e.g., KL penalty).</li></ol> <p data-svelte-h="svelte-1yomxw6">Here are some parameters that are useful to monitor for stability (if they diverge or collapse to 0, try tuning hyperparameters):</p> <ol data-svelte-h="svelte-q4qqka"><li><code>loss/value_avg</code>: The average value loss. It spikes or becomes NaN when training is going badly.</li> <li><code>val/ratio</code>: The mean ratio of the current policy probability to the old policy probability. This number should hover around 1.0. 
If this <code>ratio</code> is too high (e.g., 2.0 or 1000.0) or too small (e.g., 0.1), it means the updates between consecutive policies are too drastic.</li> <li><code>policy/clipfrac_avg</code> and <code>policy/approxkl_avg</code>: If <code>val/ratio</code> is too high, the <code>ratio</code> is going to get clipped, resulting in high <code>policy/clipfrac_avg</code> and high <code>policy/approxkl_avg</code> as well.</li> <li><code>objective/kl</code>: The mean KL divergence. It should stay positive and ideally not too large, so that the policy is not too far away from the reference policy.</li></ol> <h2 class="relative group"><a id="grpo-logging" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#grpo-logging"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>GRPO Logging</span></h2> <p data-svelte-h="svelte-1drzntr">Here’s a brief explanation for the logged metrics provided in the data for the GRPO trainer:</p> <ul data-svelte-h="svelte-2jchqz"><li><code>num_tokens</code>: Total number of input tokens processed during training so far.</li></ul> <p data-svelte-h="svelte-1pr4i50"><strong>Completions:</strong></p> <ul 
data-svelte-h="svelte-i223a7"><li><code>completions/mean_length</code>: Mean length of all generated completions (including those not ending with an EOS token).</li> <li><code>completions/min_length</code>: Minimum length among all generated completions.</li> <li><code>completions/max_length</code>: Maximum length among all generated completions.</li> <li><code>completions/clipped_ratio</code>: The ratio of completions that did not end with an EOS token before reaching the maximum generation length (i.e., they were truncated).</li> <li><code>completions/mean_terminated_length</code>: Mean length of only those completions that successfully ended with an EOS token.</li> <li><code>completions/min_terminated_length</code>: Minimum length among completions that ended with an EOS token.</li> <li><code>completions/max_terminated_length</code>: Maximum length among completions that ended with an EOS token.</li></ul> <p data-svelte-h="svelte-2jceyj"><strong>Rewards:</strong></p> <ul data-svelte-h="svelte-1rkm922"><li><code>rewards/{reward_func_name}/mean</code>: The mean reward obtained from a specific, named reward function (e.g., <code>rewards/my_custom_reward/mean</code>). This is logged for each reward function used.</li> <li><code>rewards/{reward_func_name}/std</code>: The standard deviation of rewards from a specific, named reward function.</li> <li><code>reward</code>: The overall mean of the (potentially weighted) rewards, computed <em>before</em> the group-wise normalization that produces the advantages.</li> <li><code>reward_std</code>: The standard deviation of the (potentially weighted) rewards <em>before</em> group-wise normalization for advantages.</li></ul> <p data-svelte-h="svelte-tho2jc"><strong>Policy and Loss Metrics:</strong></p> <ul data-svelte-h="svelte-k1vzio"><li><code>kl</code>: The mean Kullback-Leibler (KL) divergence between the current policy and the reference policy. 
This is logged only if <code>beta</code> (the KL coefficient in <code>GRPOConfig</code>) is non-zero.</li> <li>If Liger GRPOLoss is used (<code>use_liger_loss: True</code> in <code>GRPOConfig</code>):<ul><li><code>clip_ratio</code>: The fraction of policy updates where the probability ratio was clipped according to the GRPO loss’s epsilon bounds.</li></ul></li> <li>If standard GRPOLoss is used (<code>use_liger_loss: False</code>):<ul><li><code>clip_ratio/low_mean</code>: The mean fraction of instances where the probability ratio <code>r_t(θ)</code> was clipped at the lower bound <code>1 - epsilon_low</code> (occurs when advantage is negative and ratio is below the bound).</li> <li><code>clip_ratio/low_min</code>: The minimum observed fraction for <code>clip_ratio/low_mean</code> across batches/processes.</li> <li><code>clip_ratio/high_mean</code>: The mean fraction of instances where the probability ratio <code>r_t(θ)</code> was clipped at the upper bound <code>1 + epsilon_high</code> (occurs when advantage is positive and ratio is above the bound).</li> <li><code>clip_ratio/high_max</code>: The maximum observed fraction for <code>clip_ratio/high_mean</code> across batches/processes.</li> <li><code>clip_ratio/region_mean</code>: The mean fraction of instances where the probability ratio was clipped at either the lower or upper bound.</li></ul></li></ul> <h3 class="relative group"><a id="crucial-grpo-values" class="header-link block pr-1.5 text-lg no-hover:hidden with-hover:absolute with-hover:p-1.5 with-hover:opacity-0 with-hover:group-hover:opacity-100 with-hover:right-full" href="#crucial-grpo-values"><span><svg class="" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" aria-hidden="true" role="img" width="1em" height="1em" preserveAspectRatio="xMidYMid meet" viewBox="0 0 256 256"><path d="M167.594 88.393a8.001 8.001 0 0 1 0 11.314l-67.882 67.882a8 8 0 1 1-11.314-11.315l67.882-67.881a8.003 8.003 0 0 1 11.314 0zm-28.287 84.86l-28.284 
28.284a40 40 0 0 1-56.567-56.567l28.284-28.284a8 8 0 0 0-11.315-11.315l-28.284 28.284a56 56 0 0 0 79.196 79.197l28.285-28.285a8 8 0 1 0-11.315-11.314zM212.852 43.14a56.002 56.002 0 0 0-79.196 0l-28.284 28.284a8 8 0 1 0 11.314 11.314l28.284-28.284a40 40 0 0 1 56.568 56.567l-28.285 28.285a8 8 0 0 0 11.315 11.314l28.284-28.284a56.065 56.065 0 0 0 0-79.196z" fill="currentColor"></path></svg></span></a> <span>Crucial GRPO values</span></h3> <p data-svelte-h="svelte-1k05nj2">During GRPO training, monitor these values for insights into performance and stability:</p> <ol data-svelte-h="svelte-59bahe"><li><code>reward</code>: This is the primary objective. It reflects the mean (potentially weighted) reward the policy is achieving. It should generally increase during successful training.</li> <li><code>kl</code>: If <code>beta &gt; 0</code>, this tracks the divergence from the reference model. Keep an eye on it to ensure the policy doesn’t stray too far, which can lead to instability.</li> <li><code>clip_ratio/*</code> (either <code>clip_ratio</code> for Liger loss or the more detailed <code>clip_ratio/...</code> metrics for standard loss): These indicate how often the policy updates are being constrained by the GRPO clipping mechanism. Very high values might suggest that the policy is trying to change too drastically (potentially due to large advantages or a learning rate that’s too high) or that the epsilon clipping range is too restrictive.</li> <li><code>completions/clipped_ratio</code>: A high ratio here indicates that the model is frequently generating completions that are cut off by <code>max_completion_length</code> rather than naturally ending with an EOS token. 
This might suggest issues with learning sequence termination or that <code>max_completion_length</code> is too short.</li> <li><code>rewards/{reward_func_name}/mean</code>: Monitoring the mean of individual reward functions can help diagnose which aspects of the desired behavior the model is learning or struggling with, especially when using multiple reward sources.</li></ol> <a class="!text-gray-400 !no-underline text-sm flex items-center not-prose mt-4" href="https://github.com/huggingface/trl/blob/main/docs/source/logging.md" target="_blank"><span data-svelte-h="svelte-1kd6by1">&lt;</span> <span data-svelte-h="svelte-x0xyl0">&gt;</span> <span data-svelte-h="svelte-1dajgef"><span class="underline ml-1.5">Update</span> on GitHub</span></a> <p></p>
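<p>The clipping conditions behind the <code>clip_ratio/*</code> metrics described above can be sketched in plain Python. This is a minimal illustration, not TRL's implementation: the function name <code>clip_fractions</code>, the returned keys, and the default epsilon values of 0.2 are assumptions made for the example. It counts a sample as clipped low when its ratio is below <code>1 - epsilon_low</code> and its advantage is negative, and clipped high when its ratio is above <code>1 + epsilon_high</code> and its advantage is positive.</p>

```python
def clip_fractions(ratios, advantages, epsilon_low=0.2, epsilon_high=0.2):
    """Return the fraction of samples clipped low, clipped high, or either.

    `ratios` and `advantages` are equal-length sequences of floats.
    The epsilon defaults are illustrative values, not library defaults.
    """
    if len(ratios) != len(advantages):
        raise ValueError("ratios and advantages must have the same length")
    if not ratios:
        raise ValueError("at least one sample is required")

    n = len(ratios)
    # Lower-bound clipping: ratio below 1 - epsilon_low with a negative advantage.
    low = sum(
        1 for r, a in zip(ratios, advantages) if r < 1 - epsilon_low and a < 0
    )
    # Upper-bound clipping: ratio above 1 + epsilon_high with a positive advantage.
    high = sum(
        1 for r, a in zip(ratios, advantages) if r > 1 + epsilon_high and a > 0
    )
    # The two conditions are mutually exclusive, so "either" is their sum.
    return {"low": low / n, "high": high / n, "region": (low + high) / n}
```

<p>For example, <code>clip_fractions([0.5, 1.0, 1.5, 1.5], [-1.0, 1.0, 1.0, -1.0])</code> counts one sample at the lower bound and one at the upper bound. The last sample has a ratio above the upper bound but a negative advantage, so it is not counted as clipped.</p>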
<script>
{
__sveltekit_4tczb2 = {
assets: "/docs/trl/pr_3582/en",
base: "/docs/trl/pr_3582/en",
env: {}
};
const element = document.currentScript.parentElement;
const data = [null,null];
Promise.all([
import("/docs/trl/pr_3582/en/_app/immutable/entry/start.0f0f318c.js"),
import("/docs/trl/pr_3582/en/_app/immutable/entry/app.b27a462f.js")
]).then(([kit, app]) => {
kit.start(app, element, {
node_ids: [0, 27],
data,
form: null,
error: null
});
});
}
</script>
