# TraceOpt

**Runtime observability and failure attribution for PyTorch training: step-aware, low-overhead, and always-on.**

TraceOpt builds **TraceML**, a lightweight runtime observability layer that makes PyTorch training behavior visible *while it runs*.

TraceML helps answer questions that are hard to debug with infrastructure metrics or heavyweight profilers:

- Which layer caused a CUDA OOM?
- Why did this training step suddenly slow down?
- Is the bottleneck data loading, forward, backward, or the optimizer?
- Where did a memory spike actually occur?
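
The bottleneck question, for example, comes down to attributing wall-clock time to the phases of a single training step. Here is a minimal pure-Python sketch of that idea; the `StepTimer` class and the stand-in workloads are hypothetical illustrations, not TraceML's API:

```python
import time
from contextlib import contextmanager


class StepTimer:
    """Records wall-clock durations for named phases of one training step."""

    def __init__(self):
        self.phases = {}  # phase name -> duration in seconds

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.phases[name] = time.perf_counter() - start


timer = StepTimer()
with timer.phase("dataloader"):
    batch = [float(i) for i in range(1000)]       # stand-in for loading a batch
with timer.phase("forward"):
    loss = sum(x * x for x in batch)              # stand-in for the forward pass
with timer.phase("backward"):
    grads = [2.0 * x for x in batch]              # stand-in for autograd
with timer.phase("optimizer"):
    batch = [x - 0.01 * g for x, g in zip(batch, grads)]  # stand-in SGD update

# The phase with the largest share of the step is the bottleneck candidate.
slowest = max(timer.phases, key=timer.phases.get)
```

Doing this by hand per step is exactly the kind of bookkeeping an always-on observability layer is meant to take off your hands.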

## What TraceML focuses on

TraceML provides **semantic, step-level signals** that bridge the gap between system metrics and model-level behavior:

- Step-level timing (dataloader → forward → backward → optimizer)
- Step-level GPU memory tracking with peak attribution
- Optional deep-dive per-layer memory and compute diagnostics
- Designed to run continuously with low overhead during real training jobs
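
Peak attribution, in particular, means remembering not just the memory high-water mark but which phase (or layer) was active when it was set. A toy illustration of that bookkeeping follows; the `PeakTracker` class is hypothetical, not TraceML's implementation, which would read real allocator statistics (e.g. `torch.cuda.memory_allocated`) instead of a simulated counter:

```python
class PeakTracker:
    """Tracks a running byte count and attributes the peak to a phase."""

    def __init__(self):
        self.current = 0        # bytes "allocated" right now
        self.peak = 0           # highest value seen so far
        self.peak_phase = None  # phase that was active when the peak was set
        self.phase = None

    def set_phase(self, name):
        self.phase = name

    def alloc(self, nbytes):
        self.current += nbytes
        if self.current > self.peak:
            self.peak = self.current
            self.peak_phase = self.phase

    def free(self, nbytes):
        self.current -= nbytes


tracker = PeakTracker()
tracker.set_phase("forward")
tracker.alloc(4 * 2**20)   # activations: 4 MiB
tracker.set_phase("backward")
tracker.alloc(6 * 2**20)   # gradients: 6 MiB -> peak of 10 MiB set here
tracker.free(4 * 2**20)    # activations released
tracker.set_phase("optimizer")
tracker.alloc(2 * 2**20)   # optimizer state: 8 MiB total, below the peak
```

After the step, `tracker.peak_phase` points at the backward pass, which is the kind of answer a raw GPU-utilization graph cannot give.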

## What TraceML is (and is not)

**What it is**
- Always-on runtime observability
- Step-aware attribution, not post-hoc profiling
- Focused on training-time behavior

**What it is not**
- Not a replacement for PyTorch Profiler or Nsight
- Not a training framework
- Not an orchestration or deployment tool

## Project status

- Actively developed
- Single-GPU PyTorch training supported
- Multi-GPU (DDP / FSDP) support in progress
- APIs may evolve as abstractions are validated

## Links

- GitHub: https://github.com/traceopt-ai/traceml
- PyPI: https://pypi.org/project/traceml-ai/
- Website: https://traceopt.ai

We’re especially interested in feedback from people who are training real models and hitting performance or memory pathologies.