
Trackio Integration for TRL Training

Trackio is an experiment tracking library that provides real-time metrics visualization for remote training on Hugging Face Jobs infrastructure.

⚠️ IMPORTANT: For Jobs training (remote cloud GPUs):

  • Training happens on ephemeral cloud runners (not your local machine)
  • Trackio syncs metrics to a Hugging Face Space for real-time monitoring
  • Without a Space, metrics are lost when the job completes
  • The Space dashboard persists your training metrics permanently

Setting Up Trackio for Jobs

Step 1: Add trackio dependency

# /// script
# dependencies = [
#     "trl>=0.12.0",
#     "trackio",  # Required!
# ]
# ///

Step 2: Create a Trackio Space (one-time setup)

Option A: Let Trackio auto-create (Recommended)

Pass a space_id to trackio.init() and Trackio will automatically create the Space if it doesn't exist.

Option B: Create manually

  • Create Space via Hub UI at https://huggingface.co/new-space
  • Select Gradio SDK
  • OR use command: huggingface-cli repo create my-trackio-dashboard --type space --space_sdk gradio

Step 3: Initialize Trackio with space_id

import trackio

trackio.init(
    project="my-training",
    space_id="username/trackio",  # CRITICAL for Jobs! Replace 'username' with your HF username
    config={
        "model": "Qwen/Qwen2.5-0.5B",
        "dataset": "trl-lib/Capybara",
        "learning_rate": 2e-5,
    }
)

Step 4: Configure TRL to use Trackio

from trl import SFTConfig

training_args = SFTConfig(
    report_to="trackio",  # Send Trainer logs to Trackio
    # ... other config
)

Step 5: Finish tracking

trainer.train()
trackio.finish()  # Ensures final metrics are synced
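Steps 2 to 5 can be combined into one job script, with the dependency header from Step 1 placed at the top of the file. The following is a minimal sketch, assuming trl, datasets, and trackio are installed in the job environment. The output_dir value and the try/finally wrapper are illustrative choices that are not part of the steps above, and "username" remains a placeholder.

```python
# Values taken from the steps above; "username" is a placeholder.
PROJECT = "my-training"
SPACE_ID = "username/trackio"
RUN_CONFIG = {
    "model": "Qwen/Qwen2.5-0.5B",
    "dataset": "trl-lib/Capybara",
    "learning_rate": 2e-5,
}


def main():
    # Heavy imports stay inside main() so the constants above can be
    # inspected without the training stack installed.
    import trackio
    from datasets import load_dataset
    from trl import SFTConfig, SFTTrainer

    # Step 3: initialize tracking with a space_id so metrics reach the Space.
    trackio.init(project=PROJECT, space_id=SPACE_ID, config=RUN_CONFIG)

    # Step 4: report_to="trackio" routes the trainer's logs to Trackio.
    trainer = SFTTrainer(
        model=RUN_CONFIG["model"],
        train_dataset=load_dataset(RUN_CONFIG["dataset"], split="train"),
        args=SFTConfig(
            output_dir="outputs",  # assumption: any writable directory
            learning_rate=RUN_CONFIG["learning_rate"],
            report_to="trackio",
        ),
    )

    # Step 5: call finish() even if training raises.
    try:
        trainer.train()
    finally:
        trackio.finish()
```

Call main() at the bottom of the script when submitting it as a job.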

What Trackio Tracks

Trackio automatically logs:

  • ✅ Training loss
  • ✅ Learning rate
  • ✅ GPU utilization
  • ✅ Memory usage
  • ✅ Training throughput
  • ✅ Custom metrics (logged explicitly with trackio.log())
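Custom metrics have to be supplied by the training script. A minimal sketch, assuming an evaluation loss is available and that trackio.init() has already been called. The helper names and the "eval/perplexity" key are hypothetical.

```python
import math


def perplexity_metrics(eval_loss):
    """Derive a custom metric (perplexity) from a cross-entropy loss."""
    return {"eval/perplexity": math.exp(eval_loss)}


def log_custom_metrics(metrics):
    """Send a dict of custom metrics to the active Trackio run."""
    import trackio  # imported lazily; requires trackio.init() beforehand

    trackio.log(metrics)
```

For example, log_custom_metrics(perplexity_metrics(eval_loss)) could be called after each evaluation pass.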

How It Works with Jobs

  1. Training runs → Metrics logged to local SQLite DB
  2. Every 5 minutes → Trackio syncs DB to HF Dataset (Parquet)
  3. Space dashboard → Reads from Dataset, displays metrics in real-time
  4. Job completes → Final sync ensures all metrics persisted
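The log-locally-then-sync pattern in these steps can be illustrated with the standard library alone. This is a simplified sketch of the idea, not Trackio's actual implementation: the table layout and every function name below are hypothetical, and the upload to a Dataset is left out.

```python
import sqlite3

SYNC_INTERVAL_SECONDS = 5 * 60  # the "every 5 minutes" from the steps above


def open_metrics_db(path=":memory:"):
    """Open a local SQLite database with a simple metrics table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metrics (step INTEGER, name TEXT, value REAL)"
    )
    return conn


def log_metric(conn, step, name, value):
    """Step 1: write a metric locally, which is fast and needs no network."""
    conn.execute("INSERT INTO metrics VALUES (?, ?, ?)", (step, name, value))
    conn.commit()


def sync_due(last_sync, now, interval=SYNC_INTERVAL_SECONDS):
    """Step 2: decide whether enough time has passed for another sync."""
    return now - last_sync >= interval


def export_rows(conn):
    """Rows that a sync would serialize and upload."""
    return conn.execute(
        "SELECT step, name, value FROM metrics ORDER BY step"
    ).fetchall()
```

Because writes go to the local database first, a slow or failed upload does not block training. The final sync at job completion is what carries over any rows logged since the last interval.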

Default Configuration Pattern

Use sensible defaults for Trackio configuration unless the user requests otherwise.

Recommended Defaults

import trackio

trackio.init(
    project="qwen-capybara-sft",
    name="baseline-run",             # Descriptive name user will recognize
    space_id="username/trackio",     # Default space: {username}/trackio
    config={
        # Keep config minimal - hyperparameters and model/dataset info only
        "model": "Qwen/Qwen2.5-0.5B",
        "dataset": "trl-lib/Capybara",
        "learning_rate": 2e-5,
        "num_epochs": 3,
    }
)

Key principles:

  • Space ID: Use {username}/trackio with "trackio" as default space name
  • Run naming: Unless otherwise specified, name the run in a way the user will recognize
  • Config: Keep minimal - don't automatically capture job metadata unless requested
  • Grouping: Optional - only use if user requests organizing related experiments
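These principles can be captured in a small helper that builds the arguments for trackio.init(). A minimal sketch: the function names are hypothetical, and the keys mirror the parameters used in the Recommended Defaults example.

```python
def default_space_id(username):
    """Default Space: {username}/trackio."""
    return f"{username}/trackio"


def default_init_kwargs(username, project, run_name, hyperparameters):
    """Build keyword arguments for trackio.init() following the defaults.

    No group is set and no job metadata is captured; both are opt-in.
    """
    return {
        "project": project,
        "name": run_name,
        "space_id": default_space_id(username),
        "config": dict(hyperparameters),  # copy so the caller's dict is untouched
    }
```

The result could then be passed as trackio.init(**default_init_kwargs(...)), adding group only when the user asks for it.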

Grouping Runs (Optional)

The group parameter organizes related runs together in the dashboard sidebar. This is useful when the user is running multiple experiments with different configurations and wants to compare them side by side:

# Example: Group runs by experiment type
trackio.init(project="my-project", name="baseline-run-1", group="baseline")
trackio.init(project="my-project", name="augmented-run-1", group="augmented")
trackio.init(project="my-project", name="tuned-run-1", group="tuned")

Runs that share a group name appear together in the sidebar, making related experiments easier to compare. You can group by any configuration parameter:

# Hyperparameter sweep - group by learning rate
trackio.init(project="hyperparam-sweep", name="lr-0.001-run", group="lr_0.001")
trackio.init(project="hyperparam-sweep", name="lr-0.01-run", group="lr_0.01")
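For a larger sweep, the run names and groups can be generated instead of typed by hand. A minimal sketch, assuming one run per learning rate; the function name is hypothetical, and the "name" key follows the parameter used in the Recommended Defaults example.

```python
def sweep_init_kwargs(project, learning_rates):
    """Return one trackio.init() kwargs dict per learning rate."""
    runs = []
    for lr in learning_rates:
        runs.append(
            {
                "project": project,
                "name": f"lr-{lr}-run",
                "group": f"lr_{lr}",  # runs sharing a rate land in one group
                "config": {"learning_rate": lr},
            }
        )
    return runs
```

Each job in the sweep would pass its own entry to trackio.init(**kwargs). Very small rates are formatted in scientific notation by Python (2e-5 becomes "2e-05"), so the group names will follow that form.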

Environment Variables for Jobs

You can configure trackio using environment variables instead of passing parameters to trackio.init(). This is useful for managing configuration across multiple jobs.

HF_TOKEN is required for creating Spaces and writing to datasets, and is passed to the job via secrets:

hf_jobs("uv", {
    "script": "...",
    "secrets": {
        "HF_TOKEN": "$HF_TOKEN"  # Enables Space creation and Hub push
    }
})

Example with Environment Variables

hf_jobs("uv", {
    "script": """
# Training script - trackio config from environment
import trackio
from datetime import datetime

# Auto-generate run name
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M")
run_name = f"sft_qwen25_{timestamp}"

# Project and space_id can come from environment variables
trackio.init(name=run_name, group="SFT")

# ... training code ...
trackio.finish()
""",
    "flavor": "a10g-large",
    "timeout": "2h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})

When to use environment variables:

  • Managing multiple jobs with same configuration
  • Keeping training scripts portable across projects
  • Separating configuration from code

When to use direct parameters:

  • Single job with specific configuration
  • When clarity in code is preferred
  • When each job has different project/space
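One way to keep a script portable is to resolve the configuration from the environment explicitly and fall back to direct parameters when a variable is unset. A minimal sketch: the variable names TRACKIO_PROJECT and TRACKIO_SPACE_ID are assumptions made for illustration, as is the helper itself, so check the Trackio documentation for the names it actually reads.

```python
import os
from datetime import datetime


def init_kwargs_from_env(env=None, now=None):
    """Build trackio.init() kwargs from environment variables.

    `env` and `now` are injectable so the function can be tested
    without touching the real environment or clock.
    """
    env = os.environ if env is None else env
    now = datetime.now() if now is None else now

    timestamp = now.strftime("%Y-%m-%d_%H-%M")
    kwargs = {"name": f"sft_qwen25_{timestamp}", "group": "SFT"}

    # Assumed variable names; only set a key when the variable is present.
    if env.get("TRACKIO_PROJECT"):
        kwargs["project"] = env["TRACKIO_PROJECT"]
    if env.get("TRACKIO_SPACE_ID"):
        kwargs["space_id"] = env["TRACKIO_SPACE_ID"]
    return kwargs
```

Keeping the lookup in one function makes it easy to see which settings come from the job definition and which are hard-coded in the script.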

Viewing the Dashboard

After starting training:

  1. Navigate to the Space: https://huggingface.co/spaces/username/trackio
  2. The Gradio dashboard shows all tracked experiments
  3. Filter by project, compare runs, view charts with smoothing

Recommendation

  • Trackio: Best for real-time monitoring during long training runs
  • Weights & Biases: Best for team collaboration, requires account