# TraceOpt

**Runtime observability and failure attribution for PyTorch training: step-aware, low-overhead, and always-on.**

TraceOpt builds **TraceML**, a lightweight runtime observability layer that makes PyTorch training behavior visible *while it runs*.

TraceML helps answer questions that are hard to debug with infrastructure metrics or heavyweight profilers:

- Which layer caused a CUDA OOM?
- Why did this training step suddenly slow down?
- Is the bottleneck data loading, forward, backward, or the optimizer?
- Where did a memory spike actually occur?
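
The bottleneck question, for example, comes down to attributing wall-clock time to the phases of a single training step. Here is a minimal pure-Python sketch of that idea; the `StepTimer` class and the stand-in workloads are hypothetical illustrations, not TraceML's API:

```python
import time
from contextlib import contextmanager


class StepTimer:
    """Records wall-clock durations for named phases of one training step."""

    def __init__(self):
        self.phases = {}  # phase name -> duration in seconds

    @contextmanager
    def phase(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.phases[name] = time.perf_counter() - start


timer = StepTimer()
with timer.phase("dataloader"):
    batch = [float(i) for i in range(1000)]       # stand-in for loading a batch
with timer.phase("forward"):
    loss = sum(x * x for x in batch)              # stand-in for the forward pass
with timer.phase("backward"):
    grads = [2.0 * x for x in batch]              # stand-in for autograd
with timer.phase("optimizer"):
    batch = [x - 0.01 * g for x, g in zip(batch, grads)]  # stand-in SGD update

# The phase with the largest share of the step is the bottleneck candidate.
slowest = max(timer.phases, key=timer.phases.get)
```

Doing this by hand per step is exactly the kind of bookkeeping an always-on observability layer is meant to take off your hands.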

## What TraceML focuses on

TraceML provides **semantic, step-level signals** that bridge the gap between system metrics and model-level behavior:

- Step-level timing (dataloader → forward → backward → optimizer)
- Step-level GPU memory tracking with peak attribution
- Optional deep-dive per-layer memory and compute diagnostics
- Designed to run continuously with low overhead during real training jobs
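
Peak attribution, in particular, means remembering not just the memory high-water mark but which phase (or layer) was active when it was set. A toy illustration of that bookkeeping follows; the `PeakTracker` class is hypothetical, not TraceML's implementation, which would read real allocator statistics (e.g. `torch.cuda.memory_allocated`) instead of a simulated counter:

```python
class PeakTracker:
    """Tracks a running byte count and attributes the peak to a phase."""

    def __init__(self):
        self.current = 0        # bytes "allocated" right now
        self.peak = 0           # highest value seen so far
        self.peak_phase = None  # phase that was active when the peak was set
        self.phase = None

    def set_phase(self, name):
        self.phase = name

    def alloc(self, nbytes):
        self.current += nbytes
        if self.current > self.peak:
            self.peak = self.current
            self.peak_phase = self.phase

    def free(self, nbytes):
        self.current -= nbytes


tracker = PeakTracker()
tracker.set_phase("forward")
tracker.alloc(4 * 2**20)   # activations: 4 MiB
tracker.set_phase("backward")
tracker.alloc(6 * 2**20)   # gradients: 6 MiB -> peak of 10 MiB set here
tracker.free(4 * 2**20)    # activations released
tracker.set_phase("optimizer")
tracker.alloc(2 * 2**20)   # optimizer state: 8 MiB total, below the peak
```

After the step, `tracker.peak_phase` points at the backward pass, which is the kind of answer a raw GPU-utilization graph cannot give.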

## What TraceML is (and is not)

**What it is**
- Always-on runtime observability
- Step-aware attribution, not post-hoc profiling
- Focused on training-time behavior

**What it is not**
- Not a replacement for PyTorch Profiler or Nsight
- Not a training framework
- Not an orchestration or deployment tool

## Project status

- Actively developed
- Single-GPU PyTorch training supported
- Multi-GPU (DDP / FSDP) support in progress
- APIs may evolve as abstractions are validated

## Links

- GitHub: https://github.com/traceopt-ai/traceml
- PyPI: https://pypi.org/project/traceml-ai/
- Website: https://traceopt.ai

We’re especially interested in feedback from people who are training real models and hitting performance or memory pathologies.