Why GPU placement — not compute — is the bottleneck in distributed inference

by dystrio - opened

I built Dystrio as an interactive demo to make GPU placement decisions visible
and explainable for distributed inference on Kubernetes.

This Space:

  • Ingests real PyTorch / NCCL profiler traces
  • Reconstructs rank-to-rank communication graphs
  • Detects whether patterns are stable across runs
  • Emits Kubernetes-native podAffinity YAML only when it’s provably safe
  • Explains why a recommendation is strong, weak, or downgraded
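To make the last two bullets concrete, here is the shape of output the tool aims at: a Kubernetes `podAffinity` rule that co-locates pods from the same tensor-parallel group on one node. This is an illustrative sketch only — the label names and values are hypothetical, not Dystrio's actual output schema:

```yaml
# Hypothetical example of an emitted placement rule.
# Labels (app, tp-group) are illustrative, not Dystrio's real schema.
apiVersion: v1
kind: Pod
metadata:
  name: tgi-shard-1
  labels:
    app: tgi
    tp-group: "0"
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              tp-group: "0"          # co-schedule with peers in the same TP group
          topologyKey: kubernetes.io/hostname   # "same node" granularity
```

`requiredDuringSchedulingIgnoredDuringExecution` is the strict form — exactly the kind of rule you only want emitted when the communication pattern is stable across runs, since a wrong hard constraint can leave pods unschedulable.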

The UI here is static by design — all real analysis runs in backend services —
so it’s safe to explore without exposing models or cluster internals.
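The graph reconstruction and stability check described above can be sketched in a few lines. Everything here is an assumption for illustration — the event fields (`src`, `dst`, `bytes`) and the top-k stability criterion are stand-ins, not Dystrio's actual trace schema or heuristic:

```python
# Minimal sketch, assuming a simplified trace format: rebuild a
# rank-to-rank communication graph and check that the heaviest edges
# (co-location candidates) agree across two runs.
from collections import defaultdict

def comm_graph(events):
    """Sum bytes exchanged per (src_rank, dst_rank) pair."""
    g = defaultdict(int)
    for e in events:
        g[(e["src"], e["dst"])] += e["bytes"]
    return dict(g)

def top_edges(graph, k=2):
    """The k heaviest edges: the pairs worth pinning together."""
    return set(sorted(graph, key=graph.get, reverse=True)[:k])

def stable(run_a, run_b, k=2):
    """Only call a placement hint 'safe' if hot edges match across runs."""
    return top_edges(run_a, k) == top_edges(run_b, k)

run1 = comm_graph([
    {"src": 0, "dst": 1, "bytes": 900},
    {"src": 1, "dst": 0, "bytes": 850},
    {"src": 2, "dst": 3, "bytes": 40},
])
run2 = comm_graph([
    {"src": 0, "dst": 1, "bytes": 880},
    {"src": 1, "dst": 0, "bytes": 870},
    {"src": 2, "dst": 3, "bytes": 55},
])
print(stable(run1, run2))  # both runs rank (0,1) and (1,0) heaviest -> True
```

If the hot edges differ between runs, a recommendation would be downgraded or withheld rather than emitted as a hard affinity rule.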

Primary focus:

  • Hugging Face TGI
  • Tensor-parallel / multi-GPU inference
  • Latency-sensitive distributed workloads

I’m actively looking for feedback and design partners running real inference
systems on Kubernetes.

Happy to answer questions or walk through specific workloads.
