TraceOpt

Runtime observability and failure attribution for PyTorch training: step-aware, low-overhead, and always-on.

TraceOpt builds TraceML, a lightweight runtime observability layer that makes PyTorch training behavior visible while it runs.

TraceML helps answer questions that are hard to pin down with infrastructure metrics or heavyweight profilers:

  • Which layer caused a CUDA OOM?
  • Why did this training step suddenly slow down?
  • Is the bottleneck data loading, forward, backward, or the optimizer?
  • Where did a memory spike actually occur?
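Answering the "which phase slowed down" questions above comes down to timing each phase of a training step separately. The following is a minimal, stdlib-only sketch of that idea, not TraceML's actual API (the `StepTimer` name is hypothetical); accurate GPU timing in real PyTorch training would additionally need CUDA events, since kernel launches are asynchronous:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class StepTimer:
    """Accumulates wall-clock time per training-step phase."""
    def __init__(self):
        self.phases = defaultdict(float)

    @contextmanager
    def phase(self, name):
        # Time the enclosed block and attribute it to `name`.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.phases[name] += time.perf_counter() - start

    def slowest_phase(self):
        return max(self.phases, key=self.phases.get)

timer = StepTimer()
with timer.phase("dataloader"):
    time.sleep(0.01)   # stand-in for batch loading
with timer.phase("forward"):
    time.sleep(0.05)   # stand-in for the forward pass
print(timer.slowest_phase())  # prints "forward"
```

Wrapping each phase in a context manager keeps the instrumentation out of the training logic itself, which is what makes this pattern cheap enough to leave always-on.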

What TraceML focuses on

TraceML provides semantic, step-level signals that bridge the gap between system metrics and model-level behavior:

  • Step-level timing (dataloader → forward → backward → optimizer)
  • Step-level GPU memory tracking with peak attribution
  • Optional deep-dive per-layer memory and compute diagnostics
  • Continuous, low-overhead operation during real training jobs
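Peak attribution, as listed above, means remembering not just the memory high-water mark but which phase was active when it was set. The sketch below illustrates only the attribution idea with a simulated byte counter (the `PeakAttributor` name and the numbers are hypothetical); in real PyTorch training the counter would come from sampling `torch.cuda.memory_allocated()` / `torch.cuda.max_memory_allocated()` around each phase:

```python
class PeakAttributor:
    """Tracks a running counter and remembers which label set its peak."""
    def __init__(self):
        self.current = 0
        self.peak = 0
        self.peak_label = None

    def record(self, label, delta):
        # Apply an allocation (+) or free (-) attributed to `label`,
        # updating the high-water mark and its owner if exceeded.
        self.current += delta
        if self.current > self.peak:
            self.peak = self.current
            self.peak_label = label

tracker = PeakAttributor()
tracker.record("forward", 512)     # activations allocated
tracker.record("backward", 768)    # gradients push memory to its peak
tracker.record("backward", -768)   # gradients freed
tracker.record("optimizer", 256)   # optimizer state
print(tracker.peak, tracker.peak_label)  # prints "1280 backward"
```

The point of attributing the peak, rather than only reporting it, is that a CUDA OOM report can then name the phase (or layer) responsible instead of just the final allocation that failed.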

What TraceML is (and is not)

What it is

  • Always-on runtime observability
  • Step-aware attribution, not post-hoc profiling
  • Focused on training-time behavior

What it is not

  • Not a replacement for PyTorch Profiler or Nsight
  • Not a training framework
  • Not an orchestration or deployment tool

Project status

  • Actively developed
  • Single-GPU PyTorch training supported
  • Multi-GPU (DDP / FSDP) support in progress
  • APIs may evolve as abstractions are validated

Feedback

We’re especially interested in feedback from people training real models who are hitting performance or memory pathologies.
