Nurmukhamed's Collections: llm-performance
QLoRA: Efficient Finetuning of Quantized LLMs (arXiv:2305.14314)
Training Transformers with 4-bit Integers (arXiv:2306.11987)
FasterViT: Fast Vision Transformers with Hierarchical Attention (arXiv:2306.06189)
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models (arXiv:2309.14509)
VeRA: Vector-based Random Matrix Adaptation (arXiv:2310.11454)
LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models (arXiv:2310.08659)
Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time (arXiv:2310.17157)
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones (arXiv:2312.16862)
SparQ Attention: Bandwidth-Efficient LLM Inference (arXiv:2312.04985)
PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (arXiv:2312.12456)
LLM in a flash: Efficient Large Language Model Inference with Limited Memory (arXiv:2312.11514)
LLM Augmented LLMs: Expanding Capabilities through Composition (arXiv:2401.02412)
SliceGPT: Compress Large Language Models by Deleting Rows and Columns (arXiv:2401.15024)
Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding (arXiv:2507.19427)