| # TraceOpt |
|
|
| **Runtime observability and failure attribution for PyTorch training: step-aware, low-overhead, and always-on.** |
|
|
| TraceOpt builds **TraceML**, a lightweight runtime observability layer that makes PyTorch training behavior visible *while it runs*. |
|
|
TraceML helps answer questions that are hard to answer with infrastructure metrics or heavyweight profilers alone:
|
|
| - Which layer caused a CUDA OOM? |
| - Why did this training step suddenly slow down? |
| - Is the bottleneck data loading, forward, backward, or the optimizer? |
| - Where did a memory spike actually occur? |
|
|
| ## What TraceML focuses on |
|
|
| TraceML provides **semantic, step-level signals** that bridge the gap between system metrics and model-level behavior: |
|
|
| - Step-level timing (dataloader → forward → backward → optimizer) |
| - Step-level GPU memory tracking with peak attribution |
| - Optional deep-dive per-layer memory and compute diagnostics |
| - Designed to run continuously with low overhead during real training jobs |
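The step-level timing and peak-attribution ideas above can be sketched with a minimal, framework-agnostic tracker. This is an illustrative sketch only: the `StepTracker` name and its methods are hypothetical, not TraceML's actual API.

```python
import time
from contextlib import contextmanager


class StepTracker:
    """Illustrative sketch of step-phase timing with peak attribution.

    Hypothetical API for illustration -- not TraceML's real interface.
    """

    def __init__(self):
        self.steps = []          # one dict per step: {phase_name: seconds}
        self.peak_bytes = 0      # largest memory sample seen so far
        self.peak_phase = None   # phase active when that peak occurred
        self._current = None
        self._phase = None

    def start_step(self):
        self._current = {}

    @contextmanager
    def phase(self, name):
        """Time a named phase (e.g. dataloader, forward, backward, optimizer)."""
        self._phase = name
        t0 = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - t0
            self._current[name] = self._current.get(name, 0.0) + elapsed
            self._phase = None

    def record_memory(self, nbytes):
        """Attribute a new memory peak to whichever phase is active."""
        if nbytes > self.peak_bytes:
            self.peak_bytes = nbytes
            self.peak_phase = self._phase

    def end_step(self):
        self.steps.append(self._current)
        self._current = None


# Usage sketch: in real training, record_memory would be fed from a source
# such as torch.cuda.max_memory_allocated() sampled after each phase.
tracker = StepTracker()
tracker.start_step()
with tracker.phase("forward"):
    tracker.record_memory(1_000)
with tracker.phase("backward"):
    tracker.record_memory(5_000)
tracker.end_step()
print(tracker.peak_phase)  # backward
```

The design choice this illustrates is the core of step-aware attribution: instead of a post-hoc trace, the tracker knows which semantic phase is active at every moment, so a memory spike or slow step can be attributed to a phase the moment it happens.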
|
|
| ## What TraceML is (and is not) |
|
|
| **What it is** |
| - Always-on runtime observability |
| - Step-aware attribution, not post-hoc profiling |
| - Focused on training-time behavior |
|
|
| **What it is not** |
| - Not a replacement for PyTorch Profiler or Nsight |
| - Not a training framework |
| - Not an orchestration or deployment tool |
|
|
| ## Project status |
|
|
| - Actively developed |
| - Single-GPU PyTorch training supported |
| - Multi-GPU (DDP / FSDP) support in progress |
| - APIs may evolve as abstractions are validated |
|
|
| ## Links |
|
|
| - GitHub: https://github.com/traceopt-ai/traceml |
| - PyPI: https://pypi.org/project/traceml-ai/ |
| - Website: https://traceopt.ai |
|
|
We’re especially interested in feedback from people training real models who are hitting performance or memory pathologies.
|
|