NVIDIA Nemotron v3 Collection Open, Production-ready Enterprise Models • 6 items • Updated about 23 hours ago • 120
Cerebras REAP Collection Sparse MoE models compressed using REAP (Router-weighted Expert Activation Pruning) method • 22 items • Updated about 13 hours ago • 81
Article: MLA: Redefining KV-Cache Through Low-Rank Projections and On-Demand Decompression • Feb 4, 2025 • 19
Compressing KV Cache for Long-Context LLM Inference with Inter-Layer Attention Similarity Paper • 2412.02252 • Published Dec 3, 2024 • 2
TransMLA: Multi-head Latent Attention Is All You Need Paper • 2502.07864 • Published Feb 11, 2025 • 57
Kimi k1.5: Scaling Reinforcement Learning with LLMs Paper • 2501.12599 • Published Jan 22, 2025 • 126
Hibiki fr-en Collection Hibiki is a model for streaming speech translation, which can run on-device! See https://github.com/kyutai-labs/hibiki. • 7 items • Updated 20 days ago • 55