Latest hf-mem release added a breakdown of Mixture-of-Experts (MoE) memory usage!
TL;DR: MoEs can be misleading to reason about from active parameters alone, since each token only activates a subset of experts, while the serving setup still needs to account for the full resident memory footprint.
- hf-mem now splits MoE memory into base model weights, routed experts, and KV cache
- Dense models usually load and use most weights every forward pass, while MoEs load many experts but only route each token to a few of them
- Active params isn't the same as memory footprint, especially for sparse architectures
- Runtime memory is about what is used per request/token, while loading memory also includes the expert weights that need to be resident
- KV cache can still dominate depending on context length, batch size, and concurrency
- Expert Parallelism (EP) helps shard experts across accelerators when expert weights dominate
- Data Parallelism (DP) + EP is often a good fit for throughput-oriented MoE serving
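To make the resident vs. active distinction concrete, here is a minimal Python sketch. It is not how hf-mem computes its breakdown; the `MoEConfig` fields, the helper names, and all the numbers are hypothetical and chosen only to keep the arithmetic easy to follow. It also assumes the base weights are replicated on every device under EP and that experts divide evenly across devices.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    # All values are hypothetical, for illustration only.
    base_params: int         # shared weights: attention, embeddings, routers, ...
    params_per_expert: int   # one routed expert, summed over all MoE layers
    num_experts: int         # routed experts available to the router
    experts_per_token: int   # experts each token is actually routed to
    bytes_per_param: int = 2  # assumes 16-bit weights (e.g. bf16)


def resident_bytes(cfg: MoEConfig) -> int:
    """Weights that must be loaded: base model plus every routed expert."""
    total = cfg.base_params + cfg.params_per_expert * cfg.num_experts
    return total * cfg.bytes_per_param


def active_bytes(cfg: MoEConfig) -> int:
    """Weights touched for a single token: base model plus the routed few."""
    active = cfg.base_params + cfg.params_per_expert * cfg.experts_per_token
    return active * cfg.bytes_per_param


def per_device_bytes_with_ep(cfg: MoEConfig, ep_size: int) -> int:
    """Per-device weights when experts are sharded evenly over ep_size devices.

    Assumes the base weights are replicated on every device.
    """
    if cfg.num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size")
    local_experts = cfg.num_experts // ep_size
    local = cfg.base_params + cfg.params_per_expert * local_experts
    return local * cfg.bytes_per_param


cfg = MoEConfig(
    base_params=2_000_000_000,
    params_per_expert=250_000_000,
    num_experts=128,
    experts_per_token=8,
)
print(f"resident: {resident_bytes(cfg) / 1e9} GB")                  # 68.0 GB
print(f"active:   {active_bytes(cfg) / 1e9} GB")                    # 8.0 GB
print(f"per device (EP=8): {per_device_bytes_with_ep(cfg, 8) / 1e9} GB")  # 12.0 GB
```

With these made-up numbers, the "active" view suggests about 8 GB of weights, while the weights that actually have to be resident come to 68 GB. Sharding the experts over 8 devices brings the per-device share down to 12 GB, before any KV cache is counted.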
Learn how to deploy Microsoft Research VibeVoice ASR on Microsoft Azure Foundry with Hugging Face to generate rich audio transcriptions with Who, When, and What!
> 60-minute single-pass processing, no chunking or stitching
> Customized hotwords to guide recognition on domain-specific content
> Rich transcription: joint ASR + diarization + timestamping in one pass
> 50+ languages with automatic detection and code-switching support
> Deployed on Microsoft Foundry via an OpenAI-compatible Chat Completions API
- 30B-A3B MoE, the strongest model in the 30B class. It excels at coding tasks, agentic workflows, and reasoning.
- Lighter version of its 358B big brother, balancing performance and efficiency.
hf-mem v0.4.1 now also estimates KV cache memory requirements for any context length and batch size with the --experimental flag!
uvx hf-mem --model-id ... --experimental will automatically pull the required information from the Hugging Face Hub to include the KV cache estimation, when applicable.
Alternatively, you can also set the --max-model-len, --batch-size and --kv-cache-dtype arguments (à la vLLM) manually if preferred.
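For intuition on why context length and batch size matter so much, here is a back-of-the-envelope sketch of the usual KV cache sizing formula. It is not hf-mem's implementation. It assumes standard multi-head or grouped-query attention with full attention in every layer; the function name, the `DTYPE_BYTES` table, and the example dimensions are all hypothetical. The parameter names only mirror the CLI arguments mentioned above.

```python
# Bytes per cached element for a few dtypes (illustrative subset, assumed names).
DTYPE_BYTES = {"float16": 2, "bfloat16": 2, "fp8": 1}

GIB = 1024**3


def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    max_model_len: int,
    batch_size: int,
    kv_cache_dtype: str = "bfloat16",
) -> int:
    """Estimate KV cache size for batch_size sequences of max_model_len tokens."""
    # Each token stores one key and one value vector (hence the factor 2)
    # per KV head, in every layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * DTYPE_BYTES[kv_cache_dtype]
    return per_token * max_model_len * batch_size


# Hypothetical model: 32 layers, 8 KV heads, head dimension 128.
one_seq = kv_cache_bytes(32, 8, 128, max_model_len=8192, batch_size=1)
print(one_seq / GIB)  # 1.0
```

In this made-up configuration a single 8192-token sequence already takes 1 GiB of cache in bf16. The estimate scales linearly with both `batch_size` and `max_model_len`, and a 1-byte cache dtype halves it.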
We prepared the 2025 version of the HF AI Timeline Grid, highlighting open vs API-based model releases, and allowing you to browse and filter by access, modality, and release type!
1️⃣ Q1: Learning to Reason
DeepSeek not only releases a top-notch reasoning model, but also shows how to train one and compete with closed frontier models. OpenAI debuts Deep Research.
Significant milestones: DeepSeek R1 & R1-Zero, Qwen 2.5 VL, OpenAI Deep Research, Gemini 2.5 Pro (experimental)
2️⃣ Q2: Multimodality and Coding
More LLMs embrace multimodality by default, and there's a surge in coding agents. Strong vision, audio, and generative models emerge.
Significant milestones: Llama 4, Qwen 3, Imagen 4, OpenAI Codex, Google Jules, Claude 4
3️⃣ Q3: "Gold" rush, OpenAI opens up, the community goes bananas
Flagship models get gold in math olympiads and hard benchmarks. OpenAI releases strong open-source models and Google releases the much-anticipated nano-banana for image generation and editing. Agentic workflows become commonplace.
Significant milestones: Gemini and OpenAI IMO Gold, gpt-oss, Gemini 2.5 Flash Image, Grok 4, Claude Sonnet 4.5
4️⃣ Q4: Mistral returns, leaderboard hill-climbing
Mistral is back with updated model families. All labs release impressive models to wrap up the year!
Significant milestones: Claude Opus 4.5, DeepSeek Math V2, FLUX 2, GPT 5.1, Kimi K2 Thinking, Nano Banana Pro, GLM 4.7, Gemini 3, Mistral 3, MiniMax M2.1
deepseek-ai/DeepSeek-OCR is out! my take ⤵️
> pretty insane it can parse and re-render charts in HTML
> it uses CLIP and SAM features concatenated, so better grounding
> very efficient vision-token-to-performance ratio
> covers 100 languages
IBM just released a small Swiss Army knife for document models: granite-docling-258M on Hugging Face
> not only a document converter: it can also do document question answering and understand multiple languages
> best part: released with Apache 2.0 license, use it with your commercial projects!
> it supports transformers, vLLM and MLX from the get-go!
> built on SigLIP2 & granite-165M
first vision language model built off openai/gpt-oss-20b just dropped!
InternVL3.5 comes with 32 models: pre-trained, fine-tuned, and aligned, in various sizes
OpenGVLab/internvl35-68ac87bd52ebe953485927fb
comes with gpt-oss or Qwen3 for the LLM part