# TraceOpt

**Runtime observability and failure attribution for PyTorch training: step-aware, low-overhead, and always-on.**

TraceOpt builds **TraceML**, a lightweight runtime observability layer that makes PyTorch training behavior visible *while it runs*.

TraceML helps answer questions that are hard to debug with infrastructure metrics or heavyweight profilers:

- Which layer caused a CUDA OOM?
- Why did this training step suddenly slow down?
- Is the bottleneck data loading, forward, backward, or the optimizer?
- Where did a memory spike actually occur?

## What TraceML focuses on

TraceML provides **semantic, step-level signals** that bridge the gap between system metrics and model-level behavior:

- Step-level timing (dataloader → forward → backward → optimizer)
- Step-level GPU memory tracking with peak attribution
- Optional deep-dive per-layer memory and compute diagnostics
- Overhead low enough to run continuously during real training jobs
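
The step-level timing signal above can be pictured as a per-phase wall-clock accumulator that attributes each training step's time to dataloader, forward, backward, and optimizer phases. The sketch below is purely illustrative of the idea; the `StepTimer` class and its methods are assumptions, not TraceML's actual API:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class StepTimer:
    """Accumulate wall-clock time per training phase within a step.

    Illustrative sketch only -- not TraceML's real API.
    """

    def __init__(self):
        self.phases = defaultdict(float)

    @contextmanager
    def phase(self, name):
        # Time the enclosed block and charge it to the named phase.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.phases[name] += time.perf_counter() - start

    def summary(self):
        # Fraction of the step spent in each phase, so a sudden
        # slowdown can be attributed to dataloader vs. compute.
        total = sum(self.phases.values())
        return {name: t / total for name, t in self.phases.items()}

timer = StepTimer()
with timer.phase("dataloader"):
    time.sleep(0.01)  # stand-in for fetching a batch
with timer.phase("forward"):
    time.sleep(0.02)  # stand-in for the forward pass
fractions = timer.summary()
```

In a real training loop the same idea extends to the backward and optimizer phases, and GPU phases would additionally need device synchronization before reading the clock so that asynchronous CUDA kernels are counted in the right phase.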

## What TraceML is (and is not)

**What it is**
- Always-on runtime observability
- Step-aware attribution, not post-hoc profiling
- Focused on training-time behavior

**What it is not**
- Not a replacement for PyTorch Profiler or Nsight
- Not a training framework
- Not an orchestration or deployment tool

## Project status

- Actively developed
- Single-GPU PyTorch training supported
- Multi-GPU (DDP / FSDP) support in progress
- APIs may evolve as abstractions are validated

## Links

- GitHub: https://github.com/traceopt-ai/traceml
- PyPI: https://pypi.org/project/traceml-ai/
- Website: https://traceopt.ai

We’re especially interested in feedback from people training real models who are hitting performance or memory pathologies.