abhinavsriva committed · Commit dda4245 · verified · 1 Parent(s): 0643043

Update README.md

Files changed (1)
  1. README.md +48 -10
README.md CHANGED
@@ -1,10 +1,48 @@
- ---
- title: README
- emoji: 🐨
- colorFrom: gray
- colorTo: indigo
- sdk: static
- pinned: false
- ---
-
- Edit this `README.md` markdown file to author your organization card.
+ # TraceOpt
+
+ **Runtime observability and failure attribution for PyTorch training: step-aware, low-overhead, and always-on.**
+
+ TraceOpt builds **TraceML**, a lightweight runtime observability layer that makes PyTorch training behavior visible *while it runs*.
+
+ TraceML helps answer questions that are hard to debug with infrastructure metrics or heavyweight profilers:
+
+ - Which layer caused a CUDA OOM?
+ - Why did this training step suddenly slow down?
+ - Is the bottleneck data loading, forward, backward, or the optimizer?
+ - Where did a memory spike actually occur?
+
+ ## What TraceML focuses on
+
+ TraceML provides **semantic, step-level signals** that bridge the gap between system metrics and model-level behavior:
+
+ - Step-level timing (dataloader → forward → backward → optimizer)
+ - Step-level GPU memory tracking with peak attribution
+ - Optional deep-dive per-layer memory and compute diagnostics
+ - Designed to run continuously with low overhead during real training jobs
+
+ ## What TraceML is (and is not)
+
+ **What it is**
+ - Always-on runtime observability
+ - Step-aware attribution, not post-hoc profiling
+ - Focused on training-time behavior
+
+ **What it is not**
+ - Not a replacement for PyTorch Profiler or Nsight
+ - Not a training framework
+ - Not an orchestration or deployment tool
+
+ ## Project status
+
+ - Actively developed
+ - Single-GPU PyTorch training supported
+ - Multi-GPU (DDP / FSDP) support in progress
+ - APIs may evolve as abstractions are validated
+
+ ## Links
+
+ - GitHub: https://github.com/traceopt-ai/traceml
+ - PyPI: https://pypi.org/project/traceml-ai/
+ - Website: https://traceopt.ai
+
+ We’re especially interested in feedback from people training real models hitting performance or memory pathologies.
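
The step-level timing that the README describes (dataloader → forward → backward → optimizer) can be illustrated with a minimal, framework-free sketch. This is not TraceML's API: `StepTimer`, `phase`, `end_step`, `slowest_phase`, and `run_demo` are hypothetical names chosen for illustration, and the `time.sleep` calls stand in for real training work. It only shows the general idea of bracketing each phase of a step and attributing wall-clock time to it.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StepTimer:
    """Hypothetical helper: accumulates wall-clock time per named phase, per step."""

    def __init__(self):
        self.steps = []                     # one {phase: seconds} dict per finished step
        self._current = defaultdict(float)  # phases of the step in progress

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the phase raises, so a failing
            # step can still be attributed to the phase that failed.
            self._current[name] += time.perf_counter() - start

    def end_step(self):
        # Snapshot the finished step and reset for the next one.
        self.steps.append(dict(self._current))
        self._current.clear()

    def slowest_phase(self, step_index=-1):
        step = self.steps[step_index]
        return max(step, key=step.get)


def run_demo(num_steps=3):
    """Simulated training loop; sleeps stand in for real work (assumption)."""
    timer = StepTimer()
    for _ in range(num_steps):
        with timer.phase("dataloader"):
            time.sleep(0.001)
        with timer.phase("forward"):
            time.sleep(0.002)
        with timer.phase("backward"):
            time.sleep(0.05)  # deliberately the slowest phase
        with timer.phase("optimizer"):
            time.sleep(0.001)
        timer.end_step()
    return timer


if __name__ == "__main__":
    timer = run_demo()
    for i, step in enumerate(timer.steps):
        print(i, {k: f"{v * 1e3:.1f} ms" for k, v in step.items()})
    print("slowest phase in last step:", timer.slowest_phase())
```

In a real PyTorch job on a GPU, CUDA kernels are launched asynchronously, so wall-clock brackets like these would need `torch.cuda.synchronize()` or CUDA events around each phase to give meaningful numbers. The same bracketing pattern could also wrap `torch.cuda.reset_peak_memory_stats()` and `torch.cuda.max_memory_allocated()` to attribute peak memory to a phase, though how TraceML itself does this is not stated in the README.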