Trackio Integration for TRL Training
Trackio is an experiment tracking library that provides real-time metrics visualization for remote training on Hugging Face Jobs infrastructure.
⚠️ IMPORTANT: For Jobs training (remote cloud GPUs):
- Training happens on ephemeral cloud runners (not your local machine)
- Trackio syncs metrics to a Hugging Face Space for real-time monitoring
- Without a Space, metrics are lost when the job completes
- The Space dashboard persists your training metrics permanently
Setting Up Trackio for Jobs
Step 1: Add trackio dependency
# /// script
# dependencies = [
# "trl>=0.12.0",
# "trackio", # Required!
# ]
# ///
Step 2: Create a Trackio Space (one-time setup)
Option A: Let Trackio auto-create (Recommended)
Pass a space_id to trackio.init() and Trackio will automatically create the Space if it doesn't exist.
Option B: Create manually
- Create Space via Hub UI at https://huggingface.co/new-space
- Select Gradio SDK
- OR use command:
huggingface-cli repo create my-trackio-dashboard --type space --space_sdk gradio
Step 3: Initialize Trackio with space_id
import trackio
trackio.init(
project="my-training",
space_id="username/trackio", # CRITICAL for Jobs! Replace 'username' with your HF username
config={
"model": "Qwen/Qwen2.5-0.5B",
"dataset": "trl-lib/Capybara",
"learning_rate": 2e-5,
}
)
Step 4: Configure TRL to use Trackio
SFTConfig(
report_to="trackio",
# ... other config
)
Step 5: Finish tracking
trainer.train()
trackio.finish() # Ensures final metrics are synced
What Trackio Tracks
Trackio automatically logs:
- ✅ Training loss
- ✅ Learning rate
- ✅ GPU utilization
- ✅ Memory usage
- ✅ Training throughput
- ✅ Custom metrics
How It Works with Jobs
- Training runs → Metrics logged to local SQLite DB
- Every 5 minutes → Trackio syncs DB to HF Dataset (Parquet)
- Space dashboard → Reads from Dataset, displays metrics in real-time
- Job completes → Final sync ensures all metrics persisted
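The flow above can be illustrated with a small standard-library sketch. This is not Trackio's implementation: the table schema, the function names, and the use of JSON in place of a Parquet upload are all assumptions made for illustration. It only shows the pattern of appending metrics to a local SQLite database and periodically exporting the rows added since the last sync.

```python
import json
import sqlite3


def log_metric(conn, run, step, name, value):
    """Append one metric row to the local database (hypothetical schema)."""
    conn.execute(
        "INSERT INTO metrics (run, step, name, value) VALUES (?, ?, ?, ?)",
        (run, step, name, value),
    )
    conn.commit()


def export_unsynced(conn, last_synced_id):
    """Return rows added since the last sync and the new high-water mark."""
    rows = conn.execute(
        "SELECT id, run, step, name, value FROM metrics WHERE id > ? ORDER BY id",
        (last_synced_id,),
    ).fetchall()
    records = [
        {"run": r[1], "step": r[2], "name": r[3], "value": r[4]} for r in rows
    ]
    new_mark = rows[-1][0] if rows else last_synced_id
    return records, new_mark


# In-memory database stands in for the local SQLite file on the job runner.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE metrics ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "run TEXT, step INTEGER, name TEXT, value REAL)"
)

log_metric(conn, "baseline-run", 1, "train/loss", 2.31)
log_metric(conn, "baseline-run", 2, "train/loss", 2.07)
batch, mark = export_unsynced(conn, last_synced_id=0)  # periodic sync

log_metric(conn, "baseline-run", 3, "train/loss", 1.95)
final_batch, mark = export_unsynced(conn, mark)  # final sync at job end

# JSON is a stand-in for the Parquet data a dashboard would read.
print(json.dumps(batch + final_batch, indent=2))
```

Tracking a high-water mark means each sync only ships new rows, which is why a final sync at job completion is enough to capture anything logged after the last periodic sync.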
Default Configuration Pattern
Use sensible defaults for Trackio configuration unless the user requests otherwise.
Recommended Defaults
import trackio
trackio.init(
project="qwen-capybara-sft",
name="baseline-run", # Descriptive name user will recognize
space_id="username/trackio", # Default space: {username}/trackio
config={
# Keep config minimal - hyperparameters and model/dataset info only
"model": "Qwen/Qwen2.5-0.5B",
"dataset": "trl-lib/Capybara",
"learning_rate": 2e-5,
"num_epochs": 3,
}
)
Key principles:
- Space ID: Use {username}/trackio, with "trackio" as the default Space name
- Run naming: Unless otherwise specified, name the run in a way the user will recognize
- Config: Keep minimal - don't automatically capture job metadata unless requested
- Grouping: Optional - only use if user requests organizing related experiments
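One way to apply these principles consistently is a small helper that assembles the arguments before they are passed to trackio.init(). The helper name `default_trackio_kwargs` and its parameters are hypothetical and not part of Trackio; the keys it returns mirror the recommended defaults shown earlier in this section.

```python
def default_trackio_kwargs(username, project, run_name, model, dataset,
                           learning_rate, num_epochs):
    """Build keyword arguments following the default pattern (hypothetical helper)."""
    return {
        "project": project,
        "name": run_name,                    # descriptive, user-recognizable
        "space_id": f"{username}/trackio",   # default Space: {username}/trackio
        "config": {
            # Minimal config: hyperparameters and model/dataset info only
            "model": model,
            "dataset": dataset,
            "learning_rate": learning_rate,
            "num_epochs": num_epochs,
        },
        # No "group" key: grouping is added only when the user asks for it
    }


kwargs = default_trackio_kwargs(
    username="username",
    project="qwen-capybara-sft",
    run_name="baseline-run",
    model="Qwen/Qwen2.5-0.5B",
    dataset="trl-lib/Capybara",
    learning_rate=2e-5,
    num_epochs=3,
)
print(kwargs)
```

The resulting dictionary could then be unpacked in the training script, for example as trackio.init(**kwargs).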
Grouping Runs (Optional)
The group parameter organizes related runs together in the dashboard sidebar. This is useful when the user is running multiple experiments with different configurations and wants to compare them:
# Example: Group runs by experiment type
trackio.init(project="my-project", name="baseline-run-1", group="baseline")
trackio.init(project="my-project", name="augmented-run-1", group="augmented")
trackio.init(project="my-project", name="tuned-run-1", group="tuned")
Runs with the same group name can be grouped together in the sidebar, making it easier to compare related experiments. You can group by any configuration parameter:
# Hyperparameter sweep - group by learning rate
trackio.init(project="hyperparam-sweep", name="lr-0.001-run", group="lr_0.001")
trackio.init(project="hyperparam-sweep", name="lr-0.01-run", group="lr_0.01")
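For a larger sweep, the run and group labels can be generated rather than typed by hand. The sketch below is a hypothetical helper, not a Trackio feature; it only produces label pairs in the same style as the sweep example above, which a training script would then pass to trackio.init().

```python
def sweep_labels(learning_rates):
    """Return (run label, group label) pairs for a learning-rate sweep."""
    return [(f"lr-{lr}-run", f"lr_{lr}") for lr in learning_rates]


labels = sweep_labels([0.001, 0.01])
for run_label, group_label in labels:
    print(run_label, group_label)
```

Very small values are formatted by Python in scientific notation (for example 1e-05), so a sweep over such values may call for an explicit format string to keep labels readable.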
Environment Variables for Jobs
You can configure trackio using environment variables instead of passing parameters to trackio.init(). This is useful for managing configuration across multiple jobs.
HF_TOKEN
Required for creating Spaces and writing to datasets (passed via secrets):
hf_jobs("uv", {
"script": "...",
"secrets": {
"HF_TOKEN": "$HF_TOKEN" # Enables Space creation and Hub push
}
})
Example with Environment Variables
hf_jobs("uv", {
"script": """
# Training script - trackio config from environment
import trackio
from datetime import datetime
# Auto-generate run name
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M")
run_name = f"sft_qwen25_{timestamp}"
# Project and space_id can come from environment variables
trackio.init(name=run_name, group="SFT")
# ... training code ...
trackio.finish()
""",
"flavor": "a10g-large",
"timeout": "2h",
"secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
When to use environment variables:
- Managing multiple jobs with same configuration
- Keeping training scripts portable across projects
- Separating configuration from code
When to use direct parameters:
- Single job with specific configuration
- When clarity in code is preferred
- When each job has different project/space
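The two approaches can also be combined: read configuration from the environment, but let explicit arguments win. The sketch below is an assumption-laden illustration. The variable names TRACKIO_PROJECT and TRACKIO_SPACE_ID, the helper `resolve_trackio_settings`, and the fallback project name are not confirmed by this document, so check which variables your Trackio and TRL versions actually read.

```python
import os


def resolve_trackio_settings(env=None, project=None, space_id=None):
    """Prefer explicit arguments, then environment values (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        # TRACKIO_PROJECT / TRACKIO_SPACE_ID are assumed variable names
        "project": project or env.get("TRACKIO_PROJECT", "my-training"),
        "space_id": space_id or env.get("TRACKIO_SPACE_ID"),
    }


# A plain dict stands in for the job's environment in this example.
job_env = {
    "TRACKIO_PROJECT": "qwen-capybara-sft",
    "TRACKIO_SPACE_ID": "username/trackio",
}
from_env = resolve_trackio_settings(env=job_env)
overridden = resolve_trackio_settings(env=job_env, project="one-off-test")
print(from_env)
print(overridden)
```

Resolving settings in one place keeps the training script portable while still allowing a single job to override the shared configuration.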
Viewing the Dashboard
After starting training:
- Navigate to the Space: https://huggingface.co/spaces/username/trackio
- The Gradio dashboard shows all tracked experiments
- Filter by project, compare runs, view charts with smoothing
Recommendation
- Trackio: Best for real-time monitoring during long training runs
- Weights & Biases: Best for team collaboration, requires account