| # TraceOpt |
|
|
| **Runtime observability and failure attribution for PyTorch training: step-aware, low-overhead, and always-on.** |
|
|
| TraceOpt builds **TraceML**, a lightweight runtime observability layer that makes PyTorch training behavior visible *while it runs*. |
|
|
TraceML helps answer questions that are hard to answer with infrastructure metrics or heavyweight profilers alone:
|
|
| - Which layer caused a CUDA OOM? |
| - Why did this training step suddenly slow down? |
| - Is the bottleneck data loading, forward, backward, or the optimizer? |
| - Where did a memory spike actually occur? |
|
|
| ## What TraceML focuses on |
|
|
| TraceML provides **semantic, step-level signals** that bridge the gap between system metrics and model-level behavior: |
|
|
| - Step-level timing (dataloader → forward → backward → optimizer) |
| - Step-level GPU memory tracking with peak attribution |
| - Optional deep-dive per-layer memory and compute diagnostics |
| - Designed to run continuously with low overhead during real training jobs |
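The step-level timing and peak-attribution ideas above can be sketched with a minimal, framework-agnostic tracker. This is an illustrative sketch only: the `StepTracker` name and its methods are hypothetical, not TraceML's actual API.

```python
import time
from contextlib import contextmanager


class StepTracker:
    """Illustrative sketch of step-phase timing with peak attribution.

    Hypothetical API for illustration -- not TraceML's real interface.
    """

    def __init__(self):
        self.steps = []          # one dict per step: {phase_name: seconds}
        self.peak_bytes = 0      # largest memory sample seen so far
        self.peak_phase = None   # phase active when that peak occurred
        self._current = None
        self._phase = None

    def start_step(self):
        self._current = {}

    @contextmanager
    def phase(self, name):
        """Time a named phase (e.g. dataloader, forward, backward, optimizer)."""
        self._phase = name
        t0 = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - t0
            self._current[name] = self._current.get(name, 0.0) + elapsed
            self._phase = None

    def record_memory(self, nbytes):
        """Attribute a new memory peak to whichever phase is active."""
        if nbytes > self.peak_bytes:
            self.peak_bytes = nbytes
            self.peak_phase = self._phase

    def end_step(self):
        self.steps.append(self._current)
        self._current = None


# Usage sketch: in real training, record_memory would be fed from a source
# such as torch.cuda.max_memory_allocated() sampled after each phase.
tracker = StepTracker()
tracker.start_step()
with tracker.phase("forward"):
    tracker.record_memory(1_000)
with tracker.phase("backward"):
    tracker.record_memory(5_000)
tracker.end_step()
print(tracker.peak_phase)  # backward
```

The design choice this illustrates is the core of step-aware attribution: instead of a post-hoc trace, the tracker knows which semantic phase is active at every moment, so a memory spike or slow step can be attributed to a phase the moment it happens.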
|
|
| ## What TraceML is (and is not) |
|
|
| **What it is** |
| - Always-on runtime observability |
| - Step-aware attribution, not post-hoc profiling |
| - Focused on training-time behavior |
|
|
| **What it is not** |
| - Not a replacement for PyTorch Profiler or Nsight |
| - Not a training framework |
| - Not an orchestration or deployment tool |
|
|
| ## Project status |
|
|
| - Actively developed |
| - Single-GPU PyTorch training supported |
| - Multi-GPU (DDP / FSDP) support in progress |
| - APIs may evolve as abstractions are validated |
|
|
| ## Links |
|
|
| - GitHub: https://github.com/traceopt-ai/traceml |
| - PyPI: https://pypi.org/project/traceml-ai/ |
| - Website: https://traceopt.ai |
|
|
We’re especially interested in feedback from people training real models who are hitting performance or memory pathologies.
|
|