Why GPU placement — not compute — is the bottleneck in distributed inference
#1 · pinned · by dystrio
I built Dystrio as an interactive demo to make GPU placement decisions visible
and explainable for distributed inference on Kubernetes.
This Space:
- Ingests real PyTorch / NCCL profiler traces
- Reconstructs rank-to-rank communication graphs
- Detects whether patterns are stable across runs
- Emits Kubernetes-native podAffinity YAML only when it’s provably safe
- Explains why a recommendation is strong, weak, or downgraded
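To make the pipeline above concrete, here is a minimal sketch of the control flow — aggregate trace events into a rank-to-rank graph, check stability across runs, and emit podAffinity YAML only when the pattern holds. The event schema, stability criterion, and label names are my own illustrative assumptions, not Dystrio's actual internals:

```python
# Hypothetical sketch of the trace -> graph -> YAML pipeline.
# Event format, stability rule, and labels are assumptions for illustration.
from collections import Counter

def comm_graph(events):
    """Aggregate per-run events into a rank-to-rank edge weight map.

    Each event is assumed to look like {"src": 0, "dst": 1, "bytes": 4096};
    real NCCL/PyTorch profiler traces carry much richer records.
    """
    g = Counter()
    for e in events:
        g[(e["src"], e["dst"])] += e["bytes"]
    return g

def stable(runs):
    """Call a pattern 'stable' if every run produced the same edge set."""
    edge_sets = [frozenset(comm_graph(r)) for r in runs]
    return all(s == edge_sets[0] for s in edge_sets)

def pod_affinity_yaml(app_label):
    """Emit a podAffinity stanza co-locating chatty ranks on one node.

    `app_label` and the `app` key are placeholders; `topologyKey` uses the
    standard kubernetes.io/hostname node label.
    """
    return (
        "affinity:\n"
        "  podAffinity:\n"
        "    requiredDuringSchedulingIgnoredDuringExecution:\n"
        "      - labelSelector:\n"
        "          matchLabels:\n"
        f"            app: {app_label}\n"
        "        topologyKey: kubernetes.io/hostname\n"
    )

def recommend(runs, app_label):
    """Only emit YAML when the pattern is stable; otherwise downgrade to None."""
    if stable(runs):
        return pod_affinity_yaml(app_label)
    return None  # weak/downgraded: communication pattern varies across runs

# Two runs with identical all-to-all traffic between ranks 0 and 1:
run_a = [{"src": 0, "dst": 1, "bytes": 4096}, {"src": 1, "dst": 0, "bytes": 4096}]
run_b = [{"src": 0, "dst": 1, "bytes": 4096}, {"src": 1, "dst": 0, "bytes": 4096}]
print(recommend([run_a, run_b], "tgi-shard") is not None)  # stable -> YAML emitted
```

The key design point mirrored here is the gate in `recommend`: no YAML leaves the pipeline unless the communication pattern reproduced across runs.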
The UI here is static by design — all real analysis runs in backend services —
so it’s safe to explore without exposing models or cluster internals.
Primary focus:
- Hugging Face TGI
- Tensor-parallel / multi-GPU inference
- Latency-sensitive distributed workloads
I’m actively looking for feedback and design partners running real inference
systems on Kubernetes.
Happy to answer questions or walk through specific workloads.