igormolybog's Collection: Inference speed
FlashDecoding++: Faster Large Language Model Inference on GPUs (arXiv:2311.01282)
Co-training and Co-distillation for Quality Improvement and Compression of Language Models (arXiv:2311.02849)
Prompt Cache: Modular Attention Reuse for Low-Latency Inference (arXiv:2311.04934)
Exponentially Faster Language Modelling (arXiv:2311.10770)
SparQ Attention: Bandwidth-Efficient LLM Inference (arXiv:2312.04985)
Transformers are Multi-State RNNs (arXiv:2401.06104)
DeepSpeed-FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference (arXiv:2401.08671)
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads (arXiv:2401.10774)
BiTA: Bi-Directional Tuning for Lossless Acceleration in Large Language Models (arXiv:2401.12522)
SubGen: Token Generation in Sublinear Time and Memory (arXiv:2402.06082)
Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models (arXiv:2402.07033)
Speculative Streaming: Fast LLM Inference without Auxiliary Models (arXiv:2402.11131)
Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters (arXiv:2406.16758)