Collections
Discover the best community collections!
Collections trending this week
- Experts Weights Averaging: A New General Training Scheme for Vision Transformers
  Paper • 2308.06093 • Published • 2
- Platypus: Quick, Cheap, and Powerful Refinement of LLMs
  Paper • 2308.07317 • Published • 25
- Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers
  Paper • 2211.11315 • Published • 1
- LoraHub: Efficient Cross-Task Generalization via Dynamic LoRA Composition
  Paper • 2307.13269 • Published • 34

- S^{3}: Increasing GPU Utilization during Generative Inference for Higher Throughput
  Paper • 2306.06000 • Published • 1
- PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference
  Paper • 2405.12532 • Published
- SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget
  Paper • 2404.04793 • Published • 1
- MiniCache: KV Cache Compression in Depth Dimension for Large Language Models
  Paper • 2405.14366 • Published • 3

- Experts Weights Averaging: A New General Training Scheme for Vision Transformers
  Paper • 2308.06093 • Published • 2
- Weight Averaging Improves Knowledge Distillation under Domain Shift
  Paper • 2309.11446 • Published • 1
- SWAMP: Sparse Weight Averaging with Multiple Particles for Iterative Magnitude Pruning
  Paper • 2305.14852 • Published • 1
- Sparse Model Soups: A Recipe for Improved Pruning via Model Averaging
  Paper • 2306.16788 • Published • 1

- S^{3}: Increasing GPU Utilization during Generative Inference for Higher Throughput
  Paper • 2306.06000 • Published • 1
- Fast Distributed Inference Serving for Large Language Models
  Paper • 2305.05920 • Published • 1
- Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline
  Paper • 2305.13144 • Published • 1
- Towards MoE Deployment: Mitigating Inefficiencies in Mixture-of-Expert (MoE) Inference
  Paper • 2303.06182 • Published • 1