Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers Paper • 2601.17367 • Published 11 days ago • 33
Small-scale proxies for large-scale Transformer training instabilities Paper • 2309.14322 • Published Sep 25, 2023 • 21
Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training Paper • 2602.00747 • Published 4 days ago • 8