Updating to Pytorch 2.4

#5
by chultquist0 - opened
Haichen Wang Research Group org
No description provided.
chultquist0 changed pull request title from Updating to Pytorch 2.8 to Updating to Pytorch 2.4
Haichen Wang Research Group org

2.0 -> 2.1:

  • torch.sparse now includes prototype support for semi-structured (2:4) sparsity on NVIDIA® GPUs
  • New CPU performance features include inductor improvements (e.g. bfloat16 support and dynamic shapes), AVX512 kernel support, and scaled-dot-product-attention kernels
  • torch.compile can now compile NumPy operations by translating them into PyTorch-equivalent operations
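As a minimal sketch of the NumPy-compilation point above (function and array names are my own, not from the changelog), torch.compile can wrap a function written purely in NumPy and trace it into PyTorch ops:

```python
import numpy as np
import torch

# A plain NumPy function; since PyTorch 2.1, torch.compile can trace it
# and translate the NumPy calls into equivalent PyTorch operations.
def numpy_fn(x, y):
    return np.sum(x * y, axis=0)

compiled_fn = torch.compile(numpy_fn)

x = np.random.rand(4, 3).astype(np.float32)
y = np.random.rand(4, 3).astype(np.float32)

# The compiled version returns NumPy arrays and matches eager NumPy.
out = compiled_fn(x, y)
assert np.allclose(out, numpy_fn(x, y), atol=1e-5)
```

Note the compiled function still takes and returns NumPy arrays; the translation to PyTorch happens under the hood.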

2.1 -> 2.2:

  • scaled_dot_product_attention (SDPA) now supports FlashAttention-2, yielding around 2x speedups compared to previous versions
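For reference, SDPA is a drop-in fused attention call; it dispatches to the fastest available backend (FlashAttention-2 on supported GPUs since 2.2, a math fallback on CPU). A small shape-only sketch with made-up dimensions:

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) -- illustrative sizes only.
q = torch.randn(2, 8, 128, 64)
k = torch.randn(2, 8, 128, 64)
v = torch.randn(2, 8, 128, 64)

# Backend selection (FlashAttention-2, memory-efficient, or math)
# is automatic based on hardware, dtype, and input shapes.
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 8, 128, 64])
```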

2.2 -> 2.3:

  • Tensor Parallelism improves the experience for training Large Language Models using native PyTorch functions
  • Semi-structured sparsity is now implemented as a Tensor subclass, with observed speedups of up to 1.6x over dense matrix multiplication

2.3 -> 2.4:

  • Introduced a new default server backend for TCPStore built with libuv, which brings significantly lower initialization times and better scalability
  • PyTorch users can now see improved quality and performance with the beta BF16 symbolic shape support
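To illustrate the TCPStore point, here is a single-process sketch (the key names and the free-port helper are my own; in 2.4 the libuv server backend is the default, so no extra flag is needed):

```python
import socket
from datetime import timedelta

import torch.distributed as dist

# Grab a free TCP port so this sketch does not collide with other services.
with socket.socket() as s:
    s.bind(("127.0.0.1", 0))
    port = s.getsockname()[1]

# Since PyTorch 2.4 the TCPStore server is backed by libuv by default,
# which is what lowers rendezvous/initialization time at large scale.
store = dist.TCPStore("127.0.0.1", port, world_size=1, is_master=True,
                      timeout=timedelta(seconds=30))

store.set("greeting", "hello")
print(store.get("greeting"))  # values come back as bytes
```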
Haichen Wang Research Group org

image.png

Based on a small setup with 4 training runs each for torch 2.0 and torch 2.4, training time decreased by about 8%.

chultquist0 changed pull request status to open
chultquist0 changed pull request status to merged
Haichen Wang Research Group org
edited Jul 18, 2025

image.png

Using 600k events total and a batch size of 1024, the speed-up is less dramatic but still noticeable (about 3%):

Torch 2.4
Batch: 809.5060 s
Train: 417.8602 s
Eval: 790.6683 s

Torch 2.0
Batch: 829.1964 s
Train: 432.0877 s
Eval: 811.7211 s
