LLM - a ChunjiangGe Collection

ChunjiangGe's Collections

LLM

updated Apr 7

Post-LayerNorm Is Back: Stable, ExpressivE, and Deep

Paper • 2601.19895 • Published Jan 27 • 27
Elastic Attention: Test-time Adaptive Sparsity Ratios for Efficient Transformers

Paper • 2601.17367 • Published Jan 24 • 33
Small-scale proxies for large-scale Transformer training instabilities

Paper • 2309.14322 • Published Sep 25, 2023 • 22
Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training

Paper • 2602.00747 • Published Jan 31 • 9
HySparse: A Hybrid Sparse Attention Architecture with Oracle Token Selection and KV Cache Sharing

Paper • 2602.03560 • Published Feb 3 • 49
FineRMoE: Dimension Expansion for Finer-Grained Expert with Its Upcycling Approach

Paper • 2603.13364 • Published Mar 9 • 9
When Does Sparsity Mitigate the Curse of Depth in LLMs

Paper • 2603.15389 • Published Mar 16 • 5
Spectral Condition for μP under Width-Depth Scaling

Paper • 2603.00541 • Published Feb 28 • 15
Attention Sinks Are Provably Necessary in Softmax Transformers: Evidence from Trigger-Conditional Tasks

Paper • 2603.11487 • Published Mar 12 • 2
Extending the Context of Pretrained LLMs by Dropping Their Positional Embeddings

Paper • 2512.12167 • Published Dec 13, 2025 • 5
CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs

Paper • 2602.05258 • Published Feb 5 • 7
Mixture-of-Depths Attention

Paper • 2603.15619 • Published Mar 16 • 81
Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data

Paper • 2510.25804 • Published Oct 29, 2025 • 1
Mixture of Lookup Key-Value Experts

Paper • 2512.09723 • Published Dec 10, 2025 • 1
Value Residual Learning For Alleviating Attention Concentration In Transformers

Paper • 2410.17897 • Published Oct 23, 2024 • 9
Multi-Token Prediction via Self-Distillation

Paper • 2602.06019 • Published Feb 5 • 1