| license: apache-2.0 | |
| Key Result | |
| O(N) learned causal convolution beats O(N²) softmax attention on both perplexity AND throughput, with the advantage growing at longer sequences: | |

| Model | PPL | PPL change | TPS (seq len 128) | TPS (seq len 2048) | Speedup (at 2048) |
|---|---|---|---|---|---|
| Learned Conv O(N) | 8.08 | -3.2% | 378,066 | 1,009,622 | 5.5x |
| Standard QKV O(N²) | 8.34 | baseline | 317,968 | 183,408 | 1.0x |

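| At 2048 tokens, the O(N) model is 5.5x faster than the attention baseline while also achieving lower (better) perplexity. The throughput advantage grows from about 1.2x at 128 tokens to 5.5x at 2048 tokens, because the convolution's cost grows linearly with sequence length while attention's grows quadratically. | |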
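
To make the scaling argument concrete, here is a minimal sketch of a causal convolution in plain Python. It is not the repository's implementation: it assumes a fixed-length kernel, a single scalar channel, and hand-picked weights standing in for learned ones. The names `causal_conv1d`, `conv_ops`, and `attention_ops` are hypothetical, and the operation counts are rough illustrations, not the measured throughput reported above.

```python
def causal_conv1d(x, kernel):
    """Causal 1-D convolution: y[t] = sum_j kernel[j] * x[t - j].

    Positions before the start of the sequence are treated as zero,
    so y[t] depends only on x[0..t] and never on future tokens.
    """
    n, k = len(x), len(kernel)
    y = [0.0] * n
    for t in range(n):
        acc = 0.0
        # Look back at most k positions, and never before index 0.
        for j in range(min(k, t + 1)):
            acc += kernel[j] * x[t - j]
        y[t] = acc
    return y


def conv_ops(n, k):
    """Rough multiply-add count for a length-k kernel over n tokens (assumed fixed k)."""
    return n * k


def attention_ops(n):
    """Rough count of pairwise scores in full softmax attention over n tokens."""
    return n * n


# Hand-picked weights standing in for a learned kernel (illustrative only).
x = [1.0, 2.0, 3.0, 4.0]
kernel = [0.5, 0.25]
y = causal_conv1d(x, kernel)

# Causality: changing the last input leaves all earlier outputs unchanged.
y_changed = causal_conv1d([1.0, 2.0, 3.0, 100.0], kernel)
same_prefix = y[:3] == y_changed[:3]

# With a fixed kernel length (8 is an arbitrary assumption), the ratio of
# attention work to convolution work grows in proportion to sequence length.
ratio_128 = attention_ops(128) // conv_ops(128, 8)
ratio_2048 = attention_ops(2048) // conv_ops(2048, 8)
```

Under these assumptions the work ratio is `n / k`, so going from 128 to 2048 tokens multiplies it by 16. Measured throughput will not track this ratio exactly, since it also depends on hardware utilization and the rest of the model.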
| https://github.com/MikeyBeez/DifferentialLR | |
| https://medium.com/p/6659a3793322 | |
| https://doi.org/10.5281/zenodo.18498944 | |