# TraceOpt

**Runtime observability and failure attribution for PyTorch training: step-aware, low-overhead, and always-on.**

TraceOpt builds **TraceML**, a lightweight runtime observability layer that makes PyTorch training behavior visible *while it runs*. TraceML helps answer questions that are hard to debug with infrastructure metrics or heavyweight profilers:

- Which layer caused a CUDA OOM?
- Why did this training step suddenly slow down?
- Is the bottleneck data loading, forward, backward, or the optimizer?
- Where did a memory spike actually occur?

## What TraceML focuses on

TraceML provides **semantic, step-level signals** that bridge the gap between system metrics and model-level behavior:

- Step-level timing (dataloader → forward → backward → optimizer)
- Step-level GPU memory tracking with peak attribution
- Optional deep-dive per-layer memory and compute diagnostics
- Designed to run continuously with low overhead during real training jobs

## What TraceML is (and is not)

**What it is**

- Always-on runtime observability
- Step-aware attribution, not post-hoc profiling
- Focused on training-time behavior

**What it is not**

- Not a replacement for PyTorch Profiler or Nsight
- Not a training framework
- Not an orchestration or deployment tool

## Project status

- Actively developed
- Single-GPU PyTorch training supported
- Multi-GPU (DDP / FSDP) support in progress
- APIs may evolve as abstractions are validated

## Links

- GitHub: https://github.com/traceopt-ai/traceml
- PyPI: https://pypi.org/project/traceml-ai/
- Website: https://traceopt.ai

We're especially interested in feedback from people training real models who are hitting performance or memory pathologies.
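To make the idea of step-level phase attribution concrete, here is a minimal, self-contained sketch of the kind of signal TraceML surfaces automatically. This is **not** TraceML's API — `StepTimer` and `train_step` are hypothetical names, and the sleeps stand in for real dataloader/forward/backward/optimizer work:

```python
# Hypothetical sketch of step-phase timing attribution.
# Not TraceML's API: StepTimer and train_step are illustrative names only.
import time
from collections import defaultdict


class StepTimer:
    """Accumulate wall-clock time per training-step phase."""

    def __init__(self):
        self.totals = defaultdict(float)
        self._phase = None
        self._start = None

    def begin(self, phase):
        self._phase = phase
        self._start = time.perf_counter()

    def end(self):
        self.totals[self._phase] += time.perf_counter() - self._start


def train_step(timer):
    # Stand-ins for real phase work; sleep durations simulate relative cost.
    for phase, cost in [("dataloader", 0.001), ("forward", 0.002),
                        ("backward", 0.003), ("optimizer", 0.001)]:
        timer.begin(phase)
        time.sleep(cost)
        timer.end()


timer = StepTimer()
for _ in range(3):
    train_step(timer)

# Attribute the bottleneck to the phase with the largest accumulated time.
slowest = max(timer.totals, key=timer.totals.get)
print(f"slowest phase: {slowest}")
```

A real tool would additionally correlate these phases with GPU memory counters (e.g. PyTorch's `torch.cuda.max_memory_allocated`) to attribute peaks to a specific phase or layer, which is the harder part that TraceML handles.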