view article Article Shipping a Trillion Parameters With a Hub Bucket: Delta Weight Sync in TRL +6 aminediroHF, qgallouedec, kashif, lewtun, edbeeching, albertvillanova, lvwerra, sergiopaniego • 30 days ago • 42
COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training Paper • 2604.26687 • Published Apr 29 • 2
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models Paper • 2604.04707 • Published Apr 6 • 204
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression Paper • 2604.04921 • Published Apr 6 • 116
LinguDistill: Recovering Linguistic Ability in Vision- Language Models via Selective Cross-Modal Distillation Paper • 2604.00829 • Published Apr 1 • 8
Fast and Accurate Causal Parallel Decoding using Jacobi Forcing Paper • 2512.14681 • Published Dec 16, 2025 • 44
Ministral 3 Collection Mistral Ministral 3: new multimodal models in Base, Instruct, and Reasoning variants, available in 3B, 8B, and 14B sizes. • 36 items • Updated 10 days ago • 35
view article Article Tricks from OpenAI gpt-oss YOU 🫵 can use with transformers +5 ariG23498, sergiopaniego, reach-vb, pcuenq, ArthurZ, SaylorTwift, cyrilvallez • Sep 11, 2025 • 188
view article Article Continuous batching from first principles +1 ror, ArthurZ, mcpotato • Nov 25, 2025 • 411
Hala Technical Report: Building Arabic-Centric Instruction & Translation Models at Scale Paper • 2509.14008 • Published Sep 17, 2025 • 90
Reasoning Vectors: Transferring Chain-of-Thought Capabilities via Task Arithmetic Paper • 2509.01363 • Published Sep 1, 2025 • 62
Predicting the Order of Upcoming Tokens Improves Language Modeling Paper • 2508.19228 • Published Aug 26, 2025 • 23
view article Article Prefill and Decode for Concurrent Requests - Optimizing LLM Performance tngtech • Apr 16, 2025 • 81
Indonesian Text Similarity Dataset Collection This collection contains currated text similarity datasets that are available in huggingface dataset • 16 items • Updated Jul 11, 2025 • 6
view article Article No GPU left behind: Unlocking Efficiency with Co-located vLLM in TRL +4 toslali-ibm, mirinflim, qgallouedec, esnible, rganti, mudhakar • Jun 3, 2025 • 101
Gemma 3n Collection Google Gemma 3n models, all versions including Dynamic GGUF, 4-bit, 16-bit and formats! • 10 items • Updated 10 days ago • 28
Softpick: No Attention Sink, No Massive Activations with Rectified Softmax Paper • 2504.20966 • Published Apr 29, 2025 • 31
The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs Paper • 2504.17768 • Published Apr 24, 2025 • 14