Model,Lab,Playground,"Params (total, B)","Params (active, B)",Arch,Tokens trained (B),Data ratio (total),H100 cost to train,ALScore,MMLU,MMLU -Pro,GPQA,HLE,Training dataset,Announced ▼,Public?,Disclosure score,Paper / Repo,Tags,Notes,Count (rough),Audit,Params total confidence,Params active confidence,Tokens confidence,License,Context window,Country,Training hardware,Compute (FLOPs),Compute (Log FLOPs),Compute (ZettaFLOPs),Frontier compute %,Compute percentile,H100 hours,H100 energy (MWh),H100 CO2 Emissions (tonnes),B200 hours,B200 cost to train,B200 energy (MWh),B200 CO2 Emissions (tonnes),Rubin hours (hold),Rubin cost to train (hold),"Rubin energy (MWh, hold)","Rubin CO2 Emissions (tonnes, hold)" Glimmer-1-Base,Glint Research,https://huggingface.co/Glint-Research/Glimmer-1-Base,0.0000119,,Dense,0,43:1,███,0.000,,,,,web-scale,Jun/2026,🟢,A,███,,"11.9K-parameter (0.0000119B) experimental micro-model trained on 500K tokens of FineWeb-Edu. Llama-style transformer exploring the lower bound of useful language model scale. Base only, no SFT. Trained on FineWeb-Edu on a single RTX 4070 SUPER.",886,███,███,███,███,MIT,███,,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GLM-5.2,Z.AI,https://huggingface.co/zai-org/GLM-5.2,744,40,MoE,"28,500",39:1,███,15.3,███,,91.2,54.7,"synthetic, web-scale",Jun/2026,🟢,A,https://arxiv.org/abs/2602.15763,Reasoning,1M-token context (up from 200K in GLM-5.1); 131K max output. Trained entirely on Huawei Ascend 910B; no NVIDIA hardware. Two thinking modes (High; Max).,885,███,███,███,███,███,"1,000,000",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ VibeThinker-3B,WeiboAI (Sina Weibo),███,3,,Dense,"5,500","1,834:1",███,0.4,,,70.2,,"synthetic, web-scale",Jun/2026,🟢,C,https://arxiv.org/abs/2606.16140,Reasoning,3B dense reasoning model scoring 94.3 on AIME26 and 80.2 on LiveCodeBench v6; matches 100x+ larger models on verifiable reasoning tasks. 
Built on Qwen2.5-Coder-3B via curriculum SFT + multi-domain RL + offline self-distillation.,884,███,███,███,███,MIT,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Rio-3.5-Open-397B,IplanRIO,https://huggingface.co/prefeitura-rio/Rio-3.5-Open-397B,397,17,MoE,███,91:1,███,12.6,,88,90.9,36.5,"synthetic, web-scale",Jun/2026,🟢,C,https://huggingface.co/prefeitura-rio/Rio-3.5-Open-397B,Reasoning,Merge of Qwen 3.5 397B + Nex-N2-Pro. IplanRIO is Rio de Janeiro's municipal IT company. Features SwiReasoning: dynamic latent/explicit reasoning via entropy-based confidence signals. SWE-Bench Verified=80.2. https://github.com/nex-agi/Nex-N2/issues/4,883,███,███,███,███,MIT,"1,010,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ openPangu 2.0 Pro,Huawei,pending 30/jun,505,18,MoE,"19,000",38:1,███,10.3,,,,,"synthetic, web-scale",Jun/2026,███,C,https://gitcode.com/ascend-tribe,,Open-source MoE with record 28:1 sparsity ratio. DSA+SWA hybrid attention architecture. Optimized for Ascend NPU; 2x single-card throughput vs mainstream open-source models. Open-sourcing 7 components from 30/Jun/2026. Dataset: Predecessor openPangu-Ultra-MoE-718B (718B/39B active) trained on ~19T tokens.,882,███,███,███,███,Other,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Kimi-K2.7-Code,Moonshot AI,https://huggingface.co/moonshotai/Kimi-K2.7-Code,1000,32,MoE,"30,500",31:1,███,18.4,,,,,"synthetic, web-scale",Jun/2026,🟢,A,https://huggingface.co/moonshotai/Kimi-K2.7-Code,Reasoning,Coding-focused agentic model built upon Kimi K2.6. Reduces thinking-token usage by ~30% compared to K2.6. 
15.5T is verified base pretraining only; K2.5→K2.6→K2.7 continued training adds undisclosed tokens,███,███,███,███,███,Other,"262,144",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Nex-N2-Pro,Nex AGI,https://huggingface.co/nex-agi/Nex-N2-Pro,397,17,███,"36,000",91:1,███,12.6,,,90.7,,"synthetic, web-scale",Jun/2026,🟢,C,https://github.com/nex-agi/Nex-N2,Reasoning,"Post-trained on Qwen3.5-397B-A17B. ""An agentic model with Agentic Thinking."" GPQA Diamond up from 88.4 (base Qwen3.5) to 90.7 (+2.3 from post-training). SWE-Bench Verified=80.8, Terminal-Bench 2.1=75.3, SWE-Bench Pro=58.8. Competitive with GPT-5.5 and Opus 4.7 on coding and agentic benchmarks.",880,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DiffusionGemma 26B A4B IT,Google DeepMind,https://huggingface.co/google/diffusiongemma-26B-A4B-it,25.2,3.8,MoE,"14,000",556:1,███,2.0,,77.6,73.2,11.9,web-scale,Jun/2026,🟢,C,https://huggingface.co/google/diffusiongemma-26B-A4B-it,"Reasoning, Diffusion",███,879,███,███,███,███,███,"256,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Apodex-1.0-H,Apodex AI,https://apodex.ai,███,17,MoE,"36,000",91:1,███,12.6,,,,60.8,"synthetic, web-scale",Jun/2026,🟢,D,https://www.apodex.com/blog/apodex-1.0,Reasoning,"Verification-centric deep-research agent team on Qwen3.5 base. Heavy-duty mode coordinates up to 150 sub-agents over 15,000 steps. SOTA on BrowseComp (90.3), DeepSearchQA (94.4), FrontierScience-Research (46.7).",878,███,███,███,███,███,"262,144",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Claude Fable 5,Anthropic,https://claude.ai/,6000,400,MoE,"250,000",42:1,███,129.1,,,94.1,64.5,"synthetic, web-scale",Jun/2026,🟢,D,███,"Reasoning, SOTA","Mythos-class model made safe for general use. Same underlying model as Claude Mythos 5 with safety classifiers (fallback to Opus 4.8 in <5% of sessions for cyber, bio/chem, distillation). 
Pricing $10/$50 per Mtok. API: claude-fable-5.",877,███,███,███,███,Proprietary,"200,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ North-Mini-Code-1.0,Cohere,https://huggingface.co/CohereLabs/North-Mini-Code-1.0,███,3,MoE,"12,000",400:1,███,2.0,,,,,"synthetic, web-scale",Jun/2026,🟢,C,https://huggingface.co/blog/CohereLabs/introducing-north-mini-code,,"30B-A3B MoE (128 experts, 8 active per token) optimized for agentic software engineering. First model in Cohere's North family. Artificial Analysis Coding Index: 33.4. 256K context, 64K output.",876,███,███,███,███,███,"256,000",Canada,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ AFM 3 Core Advanced,Apple,https://machinelearning.apple.com/research/introducing-third-generation-of-apple-foundation-models,20,4,MoE,"25,000","1,250:1",███,2.4,,,,,"synthetic, web-scale",Jun/2026,🟢,███,https://machinelearning.apple.com/research/introducing-third-generation-of-apple-foundation-models,,"Most powerful Apple on-device model. 20B params stored in flash (NAND); 1–4B activated per prompt via Instruction-Following Pruning (IFP). Natively multimodal (text, image, audio). Built with Google on cloud TPUs. 2026 blog states 'we significantly scaled pre-training on the latest generation of cloud TPU accelerators' and 'all models shared a common initial foundation.' Conservative 15T estimate accounts for scaling over 14T+ base + multimodal tokens.",875,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ AFM 3 Cloud Pro,Apple,https://machinelearning.apple.com/research/introducing-third-generation-of-apple-foundation-models,1200,60,MoE,"60,000",50:1,███,28.3,███,,,,"synthetic, web-scale",Jun/2026,🟢,D,https://machinelearning.apple.com/research/introducing-third-generation-of-apple-foundation-models,,"Apple–Google–NVIDIA collaboration. Based on custom 1.2T-parameter Gemini model with Apple's own pre-training and post-training. 
Runs on NVIDIA GPUs in Google Cloud via extended Private Cloud Compute. 'Our most capable server-based model, which powers our most demanding use cases, like agentic tool use and complex reasoning.' Tech report planned for summer 2026. Bloomberg, Mark Gurman, Nov 5 2025: ""Apple Inc. is planning to pay about $1 billion a year for an ultrapowerful 1.2 trillion parameter artificial intelligence model developed by Alphabet Inc.'s Google"" URI: https://www.bloomberg.com/news/articles/2025-11-05/apple-plans-to-use-1-2-trillion-parameter-google-gemini-model-to-power-new-siri ""Based on Gemini foundation… Apple did their own pre-training, post-training"" — Max Weinbach tweet, Jun 8 2026, quoted in wccftech: ""Apple just clarified AFM Cloud is Apple's own model, trained with Gemini outputs / AFM local models are entirely Apple models / AFM Cloud Pro seems to be based on Gemini foundation and data, but Apple did their own pre-training, post-training, RL, etc"" URI: https://wccftech.com/apple-removes-the-fog-around-its-new-cloud-based-and-20-billion-parameter-on-device-ai-models-brushes-aside-googles-contributions-while-hyping-nvidias/.",874,███,███,███,███,███,,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Macaron-V1-Preview-749B,Mind Lab,https://macaron-model-previews.macaron.im/,749,41,███,"28,500",39:1,███,15.4,,,,,"synthetic, web-scale",Jun/2026,🟢,A,https://macaron.im/mindlab/research/macaron-v1-preview,,"749B Mixture-of-LoRA agent model post-trained from GLM-5.1 (744B frozen base + 5 × 1B specialist LoRAs for chat, personal-life, coding, Generative UI, and OpenClaw tasks). Router Tool routes between adapters. 
SWE-bench Verified=78.1.",873,███,███,███,███,MIT,"202,752",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemma 4 12B,Google DeepMind,https://huggingface.co/google/gemma-4-12B-it,12,,Dense,"14,000","1,167:1",███,1.4,,77.2,78.8,5.2,███,Jun/2026,🟢,C,https://huggingface.co/google/gemma-4-12B-it,Reasoning,"Encoder-free multimodal (text, image, audio) dense model with configurable thinking mode and 256K context.",872,███,███,███,███,███,"256,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Aion-1.0-Plan,Microsoft,,14,,Dense,"14,000",███,███,1.5,,,,,"synthetic, web-scale",Jun/2026,🟢,D,https://blogs.windows.com/windowsdeveloper/2026/06/02/build-2026-furthering-windows-as-the-trusted-platform-for-development/,Reasoning,"On-device reasoning and tool-calling SLM that ships in-box as part of Windows on capable devices, enabling fully local agentic workflows. ""Enables applications to reason over user intent, invoke tools, manage files and orchestrate sub-agents, bringing fully agentic workflows onto the device."" Announced at Build 2026; available in the coming months.",871,███,███,███,███,Proprietary,"32,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Aion-1.0-Instruct,Microsoft,https://microsoftedge.github.io/Demos/built-in-ai/playgrounds/prompt-api/,2,,Dense,"8,000","4,000:1",███,0.4,,,,,███,Jun/2026,🟢,D,https://blogs.windows.com/msedgedev/2026/06/02/expanding-on-device-ai-in-microsoft-edge-new-models-and-apis-for-the-web/,,"Pre-release small language model for on-device AI in Microsoft Edge (Canary/Dev), powering the Prompt and Writing Assistance APIs. Successor to Phi-4-mini (4B); ""smaller, faster, and more efficient,"" supports CPU inference for devices without a GPU. 
Planned open-source release on Hugging Face in July 2026.",870,███,███,███,███,MIT,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MAI-Code-1-Flash,Microsoft,https://github.blog/changelog/2026-06-02-mai-code-1-flash-is-now-available-for-github-copilot/,30,,Dense,"15,000",500:1,███,2.2,,,,,web-scale,Jun/2026,🟢,D,https://microsoft.ai/news/introducingmai-code-1-flash/,███,"Lightweight agentic coding model from Microsoft AI, built end-to-end on clean and appropriately licensed data, trained directly with GitHub Copilot harnesses. Adaptive solution-length control: solves harder problems with up to 60% fewer tokens. Outperforms Claude Haiku 4.5 across SWE-Bench Verified, SWE-Bench Pro (51.2% vs 35.2%), SWE-Bench Multilingual, and Terminal Bench 2. Available in VS Code GitHub Copilot.",869,███,███,███,███,███,,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MAI-Thinking-1,Microsoft,███,1000,35,MoE,"33,500",34:1,███,19.3,,85,84.2,,web-scale,Jun/2026,🟢,A,https://microsoft.ai/wp-content/uploads/2026/06/main_20260602_2.pdf,Reasoning,"Microsoft AI's reasoning model. 35B-active, ~1T-total parameters sparse MoE. Trained from the ground up without distillation from third-party models, on clean and commercially licensed data. Matches Claude Opus 4.6 on SWE-Bench Pro and preferred over Claude Sonnet 4.6 in blind human side-by-side evaluations. AIME 2025=97.0, AIME 2026=94.5.",868,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ KeyLM-75M-Instruct,Independent,https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct,0.0753,,Dense,18,240:1,███,0.004,24,,,,web-scale,Jun/2026,🟢,A,https://huggingface.co/Eclipse-Senpai/KeyLM-75M-Instruct,███,"75M-param from-scratch small LM; competitive on IFEval vs SmolLM-135M-Instruct at half the size. 
""trained completely on kaggle (tpu v5e-8)"" Announce: https://www.reddit.com/r/LocalLLaMA/comments/1tuyb8s/i_trained_a_75m_parameter_llm_from_scratch_on_18b/",867,███,███,███,███,Apache 2.0,"2,048",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Cosmos 3 Super,NVIDIA,https://huggingface.co/nvidia/Cosmos3-Super,64,32,MoE,200,4:1,███,0.4,,,,,special,Jun/2026,🟢,███,https://research.nvidia.com/labs/cosmos-lab/cosmos3/technical-report.pdf,SOTA,"Omnimodal world model for Physical AI; dual-tower mixture-of-transformers (reasoner + generator) initialized from Qwen3-VL-32B. Dataset: ‘two epochs over the full pre-training mixture’ with sequences ‘at most 16k tokens.’ Conservative avg ~4K tokens/sample × 22M × 2 epochs ≈ 176B pretrain + ~9B SFT ≈ ~185B; rounded to 200B. Excludes generator-pathway vision/audio/action tokens (hundreds of millions of images and videos, not directly token-comparable to LLM corpora) and the prior Qwen3-VL-32B initialization tokens (not disclosed).",866,███,███,███,███,███,"262,144",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mellum2-12B-A2.5B-Thinking,JetBrains,https://huggingface.co/JetBrains/Mellum2-12B-A2.5B-Thinking,12,2.5,MoE,"10,600",884:1,███,1.2,,,57.6,,███,Jun/2026,🟢,A,https://arxiv.org/abs/2605.31268,Reasoning,"Open-weight 12B MoE (64 experts, 8 active) language model specialised in software engineering; successor to the 4B dense Mellum.",865,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen3.7-Plus,Alibaba,https://chat.qwen.ai/,480,35,MoE,"36,000",75:1,███,13.9,,88.5,90.3,34.7,"synthetic, web-scale",Jun/2026,🟢,D,███,Reasoning,Multimodal agent model unifying vision and language; operates GUI and CLI within a single agent loop,864,███,███,███,███,Proprietary,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Nemotron 3 
Ultra,NVIDIA,https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16#nvidia-nemotron-3-ultra-550b-a55b-bf16,███,55,MoE,"25,000",46:1,███,12.4,89.1,86.8,87,37.4,"synthetic, web-scale",Jun/2026,🟢,C,https://arxiv.org/abs/2512.20856,,NVIDIA’s largest open model: 550B total parameters with up to 55B active per token via a hybrid Mamba-Transformer MoE architecture. Most intelligent US open weights model per Artificial Analysis (Intelligence Index 48).,863,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MiniMax-M3,MiniMax,https://huggingface.co/MiniMaxAI/MiniMax-M3,428,23,MoE,"100,000",███,███,21.8,,,,,"synthetic, web-scale",Jun/2026,🟢,C,https://www.minimax.io/blog/minimax-m3,"Reasoning, SOTA","""M3 is a model that has undergone mixed-modality training from Step 0... After rebuilding the entire data pipeline for this data, we are now able to scale the training data to the order of 100 trillion tokens."" First open-weight model combining frontier coding, 1M-token MSA context, and native multimodality. SWE-Bench Pro: 59.0; BrowseComp: 83.5 (surpasses Opus 4.7 at 79.3). Tech report and weights to be released within 10 days of launch.",862,███,███,███,███,Proprietary,"1,000,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Step 3.7 Flash,StepFun,https://huggingface.co/stepfun-ai/Step-3.7-Flash,198,11,MoE,"24,000",122:1,███,7.3,,,,49.7,"synthetic, web-scale",May/2026,🟢,A,███,Reasoning,A high-efficiency Flash model for real-world agents.,861,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ LFM2.5-8B-A1B,Liquid AI,https://huggingface.co/LiquidAI/LFM2.5-8B-A1B,8.3,1.5,MoE,"38,000","4,579:1",███,███,,,,,"synthetic, web-scale",May/2026,🟢,A,https://www.liquid.ai/blog/lfm2-5-8b-a1b,Reasoning,"Edge MoE for fast on-device tool calling; 128K context, reasoning-only model. 
Highlights: IFEval 91.84, MATH500 88.76, BFCLv3 64.79, Tau2-Telecom 88.07.",860,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Claude Opus 4.8,Anthropic,https://claude.ai/,███,250,MoE,"80,000",16:1,███,66.7,,,93.6,57.9,"synthetic, web-scale",May/2026,🟢,D,https://www.anthropic.com/claude-opus-4-8-system-card,"Reasoning, SOTA",Announce: https://www.anthropic.com/news/claude-opus-4-8 HLE=with tools (49.8 no tools). Same price as Opus 4.7 ($5/$25). Params/tokens carried from Opus-class estimate (Opus 4.7).,859,███,███,███,███,███,"200,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ESMC 6B,Biohub,https://huggingface.co/biohub/ESMC-6B,6,,███,"6,600","1,100:1",███,0.7,,,,,special,May/2026,🟢,A,https://biohub.ai/papers/esm_protein.pdf,,"“Language Modeling Materializes a World Model of Protein Biology”. Protein language model, 80 layers, 2.37e23 training FLOPs. Trained on ~2.8B protein sequences (UniRef + MGnify + JGI, clustered at 70% identity). Tokens back-calculated from disclosed compute: training FLOPs of 2.37e23 reported on the HF model card, divided by 6N (Kaplan/Chinchilla rule for transformer training), gives 2.37e23 ÷ (6 × 6e9) ≈ 6.6T tokens. Released alongside ESMFold2 and ESM Atlas (6.8B sequences / 1.1B predicted structures). MIT license.",858,███,███,███,███,Non-commercial research,"2,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MiniCPM5-1B,OpenBMB,https://huggingface.co/spaces/openbmb/MiniCPM5-1B-Demo,1.08,,Dense,"8,000",███,███,0.3,,48.85,26.26,,"synthetic, web-scale",May/2026,🟢,A,https://huggingface.co/openbmb/MiniCPM5-1B,Reasoning,"the first model in the MiniCPM5 series. It is a dense 1B Transformer built for on-device, local deployment, and resource-constrained scenarios, reaching 1B-class open-source SOTA. Hybrid reasoning with template. 1,080,632,832 params, 24 layers, GQA 16Q/2KV, 131K context. 
Post-training: 200B deep-thinking SFT + 200B hybrid-thinking SFT + RL + OPD. Pretraining token count not disclosed; estimate aligned with MiniCPM4 paper's 8T-token series budget (UltraClean/Ultra-FineWeb data-efficient pipeline). MMLU-Redux=70.06, MATH-500=91.6, AIME-2025=40.42, BBH=71.89, IFEval=80.41 (Thinking mode).",857,███,███,███,███,Apache 2.0,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gated DeltaNet-2,NVIDIA,https://github.com/NVlabs/GatedDeltaNet-2,1.3,,Dense,100,77:1,███,0.04,,,,,web-scale,May/2026,🟢,A,https://github.com/NVlabs/GatedDeltaNet-2/blob/main/paper/GDN2_paper.pdf,,███,856,███,███,███,███,Other,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Command A+,Cohere,https://huggingface.co/CohereLabs/command-a-plus-05-2026-bf16,218,25,MoE,"20,000",92:1,███,7.0,,,,,"synthetic, web-scale",May/2026,███,C,https://huggingface.co/CohereLabs/command-a-plus-05-2026-bf16,,"""open source model with 25 billion active parameters and 218B total parameters model optimized for agentic, multilingual, and reasoning-heavy tasks with a focus on enterprise performance, while also providing support for vision inputs"". 128 experts, 8 active per token + 1 shared expert. 128K context. 48 languages. Announce: https://cohere.com/blog/command-a-plus",855,███,███,███,███,Apache 2.0,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen3.7-Max,Alibaba,https://chat.qwen.ai/,2000,100,MoE,"40,000",20:1,███,29.8,,89.6,92.4,53.5,"synthetic, web-scale",May/2026,🟢,D,https://qwen.ai/blog?id=qwen3.7,Reasoning,"""Qwen3.7-Max, our latest proprietary model designed for the agent era."" 35-hour autonomous kernel optimization run with 1,000+ tool calls; 10.0x geomean speedup over Triton reference. 
Available soon via Alibaba Cloud Model Studio.",███,███,███,███,███,███,"131,072",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ HRM-Text-1B,Sapient Intelligence,https://huggingface.co/sapientinc/HRM-Text-1B,1,,Dense,160,160:1,███,0.04,60.7,,,,"synthetic, web-scale",May/2026,🟢,A,https://github.com/sapientinc/HRM-Text,███,"""1B text generation model based on the HRM architecture, strengthened by task completion and latent space reasoning."" Pretraining cost ~$1472 on 16 H100s. 40B unique tokens × 4 epochs = 160B total.",853,███,███,███,███,███,"4,096",Singapore,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Nemotron-Labs-Diffusion-14B,NVIDIA,███,14,,Dense,"4,345",311:1,███,0.8,82.51,,54.55,,"synthetic, web-scale",May/2026,🟢,D,https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_Diffusion_Tech_Report_v1.pdf,Diffusion,"Tri-mode LM unifying AR, diffusion, and self-speculation decoding within a single architecture. Explicit Nemotron training is disclosed at 1T (Stage 1 AR) + 300B (Stage 2 joint) + 45B (SFT) = 1,345B. Base init from Ministral3-14B (Liu et al., arXiv 2601.08584, 2026)≈3T (disclosed).",852,███,███,███,███,███,"262,144",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemini 3.5 Flash,Google DeepMind,https://gemini.google.com/,500,25,MoE,"100,000",200:1,███,23.6,,,,40.2,"synthetic, web-scale",May/2026,🟢,A,███,,"“Gemini 3.5 Flash delivers intelligence that rivals large flagship models on multiple dimensions, at the speeds you have come to expect from the Flash series.” Beats Gemini 3.1 Pro on Terminal-Bench 2.1 (76.2%), MCP Atlas (83.6%), CharXiv Reasoning (84.2%). Pro coming next month. 
Announce: https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-5/",851,███,███,███,███,Proprietary,"1,000,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ZAYA1-8B-Diffusion-Preview,Zyphra,https://huggingface.co/Zyphra/ZAYA1-8B,8.4,0.76,MoE,"15,100","1,798:1",███,1.2,,,94,,"synthetic, web-scale",May/2026,🟢,A,███,Diffusion,"First MoE diffusion model converted from an autoregressive LLM, and first diffusion-LM trained on AMD. Built from ZAYA1-8B base via TiDAR-style conversion: 600B tokens at 32k + 500B context extension to 128k + diffusion SFT. 4.6x decoding speedup (lossless sampler), 7.7x (mixed-logits sampler). Diffuses blocks of 16 tokens simultaneously.",850,███,███,███,███,Apache 2.0,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Intern-S2-Preview,Shanghai AI Laboratory/SenseTime,https://huggingface.co/internlm/Intern-S2-Preview,35,3,MoE,"41,000","1,172:1",███,4.0,,███,,18.07,"synthetic, web-scale",May/2026,🟢,A,https://huggingface.co/internlm/Intern-S2-Preview,Reasoning,"""an efficient 35B scientific multimodal foundation model... continued pretrained from Qwen3.5."" 35B-A3B MoE. Tokens estimate: Qwen3.5 base ~36T + ~5T scientific continued pretraining (matching Intern-S1-Pro pattern) ≈ 41T. Announce: https://github.com/InternLM/Intern-S1",849,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Ring-2.6-1T,Inclusion AI,https://huggingface.co/inclusionAI/Ring-2.6-1T,1000,50,███,"20,750",21:1,███,15.2,,,88.27,,"synthetic, web-scale",May/2026,🟢,C,https://huggingface.co/inclusionAI/Ring-2.6-1T,Reasoning,"Reasoning sibling of Ling-2.6-1T. Async RL training + IcePop algorithm. Two reasoning effort levels: high and xhigh. xhigh: AIME 26=95.83, ARC-AGI-V2=66.18, GPQA Diamond=88.27. 
high: PinchBench=87.60, ClawEval=63.82, Tau2-Bench Telecom=95.32.",848,███,███,███,███,MIT,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ TML-Interaction-Small,Thinking Machines Lab,,███,12,MoE,"30,000",109:1,███,9.6,,,,,"web-scale, audio, video",May/2026,🟡,D,https://thinkingmachines.ai/blog/interaction-models/,,"Native multimodal (audio, video, text) interaction model with time-aligned 200ms micro-turns. “TML-Interaction-Small dominates interaction quality while being more intelligent than any non thinking model.” Research preview; larger models planned later in 2026.",847,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Needle,Cactus-Compute,https://huggingface.co/Cactus-Compute/needle,0.026,,Dense,202,"7,770:1",███,0.008,,,███,,"synthetic, web-scale",May/2026,🟢,A,https://github.com/cactus-compute/needle,,"Distilled from Gemini 3.1 into a 26M-parameter ""Simple Attention Network"" (attention + gating, no MLPs/FFNs). 6000 tok/s prefill, 1200 tok/s decode on consumer devices. Pretrained on 16 TPU v6e for 200B tokens (27 hrs), post-trained on 2B tokens of single-shot function-calling data (45 min) synthesized via Gemini across 15 tool categories. MIT licensed. Weights: https://huggingface.co/Cactus-Compute/needle",846,███,███,███,███,MIT,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B,NVIDIA,https://huggingface.co/nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16,30,3.6,MoE,"25,160",839:1,███,2.9,,78.63,72.1,,"synthetic, web-scale",May/2026,🟢,A,https://arxiv.org/abs/2511.16664,,"""NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16 is a 3-in-1 elastic large language model (LLM) developed by NVIDIA. 
It contains three nested model variants (30B, 23B, and 12B parameters) within a single BF16 checkpoint, all sharing the same parameter space.""",███,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ZAYA1-74B-Preview,Zyphra,https://huggingface.co/Zyphra/ZAYA1-74B-preview,74,4,MoE,███,244:1,███,3.8,,84.4,83.8,,"synthetic, web-scale",May/2026,🟢,A,https://www.zyphra.com/zaya1-8b-technical-report,Reasoning,"Pre-RL reasoning base checkpoint (no instruction or RL post-training). Trained end-to-end on AMD MI300x hardware. Uses CCA attention with sliding window attention hybrid and 256k context. ""ZAYA1-74B-Preview is a pre-RL reasoning-base checkpoint, released under an Apache 2.0 license.""",844,███,███,███,███,███,"262,144",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ SubQ 1M-Preview,Subquadratic,https://subq.ai/request-early-access,70,,Dense,"12,000",172:1,███,3.1,,,,,"synthetic, web-scale",May/2026,🟢,███,https://subq.ai/how-ssa-makes-long-context-practical,,"First fully subquadratic LLM. SSA (Subquadratic Sparse Attention). 12M token research context, 1M production. 
SWE-Bench Verified=81.8.",843,███,███,███,███,███,"1,000,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Llama 3.1 8B + CUA (quantum),Multiverse Computing,,8,,Dense,"15,000",███,███,1.2,,,,,"synthetic, web-scale",May/2026,🟡,B,https://arxiv.org/abs/2605.05914,Quantum,"First end-to-end quantum enhancement of a production-scale LLM on real superconducting hardware (156-qubit IBM Quantum System Two); Cayley unitary adapters add 6,000 params and reduce Llama 3.1 8B perplexity by 1.4%.",842,███,███,███,███,███,"131,072",Spain,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ZAYA1-8B,Zyphra,https://huggingface.co/Zyphra/ZAYA1-8B,8.4,0.76,MoE,"14,000","1,667:1",███,1.1,,74.2,71,███,"synthetic, web-scale",May/2026,🟢,A,https://www.zyphra.com/zaya1-8b-technical-report,Reasoning,"First MoE model pretrained, midtrained, and SFT’d entirely on AMD Instinct MI300; 760M active params; Apache-2.0. Announce: https://www.zyphra.com/post/zaya1-8b",841,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GENE-26.5,Genesis AI,https://www.genesis.ai/blog/gene-26-5-advancing-robotic-manipulation-to-human-level,30,███,Dense,"10,000",334:1,███,1.8,,,,,robotics,May/2026,🟡,D,https://www.genesis.ai/blog/gene-26-5-advancing-robotic-manipulation-to-human-level,,"""our first robotic foundation model system and the initial public release in the GENE family… designed to push general-purpose robotic manipulation towards human-level capability."" 200,000+ hours of multimodal data (vision, hand state, language, tactile, robot controls). 
200,000 hours × ~2M tokens/hour ≈ 400B multimodal tokens, + internet language/video pretraining priors ≈ 10T total.",840,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-5.5 Instant,OpenAI,https://chatgpt.com/,300,15,MoE,"114,000",380:1,███,███,,,85.6,,"synthetic, web-scale",May/2026,🟢,D,https://openai.com/index/gpt-5-5-instant/,"Reasoning, SOTA","""smarter and more accurate, with clearer, more concise answers that feel better tailored to you. Because Instant is the daily driver for hundreds of millions of people...""",839,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MAMMAL,IBM,https://huggingface.co/ibm/biomed.omics.bl.sm.ma-ted-458m,0.458,,Dense,700,"1,529:1",███,0.06,,,,,special,May/2026,🟢,A,https://www.nature.com/articles/s44386-026-00047-4,███,"""MAMMAL (Molecular Aligned Multi Modal Architecture and Language), a foundation model for cross-modal learning, designed to address the challenges associated with drug discovery tasks."" 2 billion total samples × ~350 average tokens/sample ≈ 700 billion tokens.",838,███,███,███,███,███,"2,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ERNIE-5.1-Preview,Baidu,https://ernie.baidu.com/,███,60,MoE,"100,000",125:1,███,29.8,,84.3,91,,"synthetic, web-scale",Apr/2026,🟢,C,https://ernie.baidu.com/blog/posts/ernie-5.1-preview-0430-release-on-lmarena/,Reasoning,"""ERNIE-5.1-Preview builds on the pre-training foundation of ERNIE-5.0 while compressing total parameters to approximately 1/3 and active parameters to approximately 1/2, achieving leading performance at its model scale using only about 6% of the pre-training cost of comparable models.""",837,███,███,███,███,Proprietary,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Granite-4.1-30B,IBM,https://huggingface.co/ibm-granite/granite-4.1-30b,30,,Dense,"15,000",500:1,███,2.3,80.16,64.09,45.76,,"synthetic, 
web-scale",Apr/2026,🟢,███,https://huggingface.co/blog/ibm-granite/granite-4-1,Reasoning,"""Granite 4.1 is a family of dense, decoder‑only LLMs (3B, 8B, and 30B) trained on ~15T tokens using a multi‑stage pre‑training pipeline, including long‑context extension of up to 512K tokens.""",836,███,███,███,███,███,"131,072",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mistral Medium 3.5,Mistral,https://chat.mistral.ai/chat,128,,Dense,"12,000",94:1,███,███,,,,,"synthetic, web-scale",Apr/2026,🟢,C,https://mistral.ai/news/vibe-remote-agents-mistral-medium-3-5,,"""a dense 128B model with a 256k context window, handling instruction-following, reasoning, and coding in a single set of weights""",835,███,███,███,███,Other,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Nemotron 3 Nano Omni,NVIDIA,https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16,30,3,MoE,"25,700",857:1,███,2.9,,77.3,72.2,,███,Apr/2026,🟢,A,https://arxiv.org/abs/2604.24954,Reasoning,"Built on the highly efficient Nemotron 3 Nano 30B-A3B backbone, Nemotron 3 Nano Omni natively supports audio inputs alongside text, images, and video. ~717B tokens (multimodal post-training) on top of ~25T base LLM pretraining, ~25.7T total.",834,███,███,███,███,Other,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Laguna XS.2,Poolside,https://huggingface.co/poolside/Laguna-XS.2,33,3,MoE,"10,000",304:1,███,1.9,,,,,"synthetic, web-scale",Apr/2026,🟢,███,https://poolside.ai/blog/introducing-laguna-xs2-m1,,"""Laguna XS.2, as open-weights. 33B total parameters, 3B active. 
Apache 2.0.""",833,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Laguna M.1,Poolside,https://platform.poolside.ai/,225,23,MoE,"10,000",███,███,5.0,,,,,"synthetic, web-scale",Apr/2026,🟢,C,https://poolside.ai/blog/introducing-laguna-xs2-m1,,"""Laguna M.1 is a 225B total parameter model with 23B activated parameters, built for agentic coding and long-horizon work.""",832,███,███,███,███,███,"256,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeepSeek-V4-Pro,DeepSeek-AI,https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro,1600,49,MoE,"33,000",21:1,███,24.2,90.1,87.5,90.1,37.7,"synthetic, web-scale",Apr/2026,🟢,A,███,"SOTA, Reasoning","MoE with 1.6T total / 49B active parameters, 1M-token context, trained on 32T+ tokens with FP4/FP8 mixed precision; introduces Compressed Sparse Attention (CSA) using only 27% single-token FLOPs vs. V3.2 and 10% KV cache; scores 90.1 on GPQA Diamond and 80.6% on SWE-bench Verified. Open-weight.",831,███,███,███,███,MIT,"1,048,576",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ talkie-1930-13b,Independent,https://talkie-lm.com/chat,13,,Dense,260,20:1,███,0.2,,,███,,history only,Apr/2026,🟢,A,https://talkie-lm.com/introducing-talkie,,"Alec Radford (GPT-1, GPT-2, GPT-3). talkie-1930-13b is a 13b language model trained on pre-1931 English-language text, instruction-tuned using a novel instruction-following dataset built from pre-1931 reference works including etiquette manuals, letter-writing manuals, encyclopedias, and poetry collections. 
It has also undergone reinforcement learning using online DPO to improve instruction-following capabilities.",830,███,███,███,███,Proprietary,███,,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Hy3 preview,Tencent,https://huggingface.co/tencent/Hy3-preview,295,21,MoE,"40,000",136:1,███,11.5,87.42,65.76,,30,"synthetic, web-scale",Apr/2026,███,C,https://github.com/Tencent-Hunyuan/Hy3-preview,Reasoning,"""Hy3 preview is the first model trained on our rebuilt infrastructure, and the strongest we've shipped so far. It improves significantly on complex reasoning, instruction following, context learning, coding, and agent tasks.""",829,███,███,███,███,███,"262,144",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Ling-2.6-1T,Inclusion AI,https://huggingface.co/inclusionAI/Ling-2.6-1T,1000,50,MoE,"20,750",21:1,███,15.2,,,,,"synthetic, web-scale",Apr/2026,🟢,C,https://huggingface.co/inclusionAI/Ling-2.6-1T,SOTA,███,828,███,███,███,███,MIT,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-5.5,OpenAI,https://chatgpt.com/,3000,150,MoE,"114,000",38:1,███,61.6,,,93.6,57.2,"synthetic, web-scale",Apr/2026,🟢,███,https://deploymentsafety.openai.com/gpt-5-5/gpt-5-5.pdf,"Reasoning, SOTA",Announce: https://openai.com/index/introducing-gpt-5-5/ HLE result is for GPT-5.5 Pro/x-high.,827,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Marul V7,Independent,https://marulai.com.tr/,0.258,,███,"1,000","3,876:1",███,0.05,,,,,web-scale,Apr/2026,🟢,C,https://www.reddit.com/r/LocalLLaMA/comments/1sshwtu/s%C4%B1f%C4%B1rdan_e%C4%9Fitilmi%C5%9F_258m_parametre_t%C3%BCrk%C3%A7e_llm/,,Turkish.,826,███,███,███,███,Proprietary,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen3.6-27B,Alibaba,https://huggingface.co/Qwen/Qwen3.6-27B,27,,Dense,███,"1,334:1",███,3.3,,86.1,87.8,24.3,"synthetic, 
web-scale",Apr/2026,🟢,C,https://huggingface.co/Qwen/Qwen3.6-27B,Reasoning,"""the first open-weight variant of Qwen3.6. Built on direct feedback from the community, Qwen3.6 prioritizes stability and real-world utility, offering developers a more intuitive, responsive, and genuinely productive coding experience.""",825,███,███,███,███,Apache 2.0,"262,144",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MiMo-V2.5-Pro,Xiaomi,https://huggingface.co/XiaomiMiMo/MiMo-V2.5-Pro,1020,███,MoE,"27,000",27:1,███,17.5,89.4,68.5,66.7,48,"synthetic, web-scale",Apr/2026,🟢,A,https://mimo.xiaomi.com/mimo-v2-5-pro,Reasoning,"Uses a 7:1 Hybrid Attention mechanism and supports a 1M-token context window. ""significant improvements over its predecessor, MiMo-V2-Pro, in general agentic capabilities, complex software engineering, and long-horizon tasks.""",824,███,███,███,███,MIT,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Ling-2.6-Flash,Inclusion AI,https://huggingface.co/inclusionAI/Ling-2.6-flash,104,7.4,MoE,"20,750",200:1,███,4.9,,,,███,"synthetic, web-scale",Apr/2026,🟢,C,https://x.com/AntLingAGI/status/2046660999491858521,Reasoning,MoE with 104B total / 7.4B active params using a 1:7 MLA + Lightning Linear hybrid attention architecture for up to 4x throughput vs. comparable models; supports 262K context; achieves 61.2% on SWE-bench Verified and 73.85% on MathArena AIME 2026. 
Open-weight.,823,███,███,███,███,MIT,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Granite-4.1-8B,IBM,https://huggingface.co/ibm-granite/granite-4.1-8b,8,,Dense,"15,000","1,875:1",███,1.2,73.84,55.99,███,,"synthetic, web-scale",Apr/2026,🟢,A,https://huggingface.co/ibm-granite/granite-4.1-8b,Reasoning,"""improved post-training pipeline, including supervised finetuning and reinforcement learning alignment, resulting in enhanced tool calling, instruction following, and chat capabilities.""",822,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ OpenMythos,Independent,https://github.com/kyegomez/OpenMythos,0.77,0.04,███,30,39:1,███,0.02,,,,,web-scale,Apr/2026,🟢,A,https://github.com/kyegomez/OpenMythos,,"770M trained, up to 1T available. ""OpenMythos is an open-source, theoretical implementation of the Claude Mythos model. It implements a Recurrent-Depth Transformer (RDT) with three stages: Prelude (transformer blocks), a looped Recurrent Block (up to max_loop_iters), and a final Coda. 
Attention is switchable between MLA and GQA, and the feed-forward uses a sparse MoE with routed and shared experts ideal for exploring compute-adaptive, depth-variable reasoning.""",821,███,███,███,███,███,"8,000",,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Kimi K2.6,Moonshot AI,https://huggingface.co/moonshotai/Kimi-K2.6,1000,32,MoE,"30,500",31:1,███,18.4,,,90.5,54,"synthetic, web-scale",Apr/2026,🟢,███,https://www.kimi.com/blog/kimi-k2-6,"Reasoning, SOTA","""Kimi K2.6 is an open-source, native multimodal agentic model""",820,███,███,███,███,Other,"262,144",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen3.6-Max-Preview,Alibaba,https://chat.qwen.ai/,███,100,MoE,"36,000",18:1,███,20.0,,,,,"synthetic, web-scale",Apr/2026,🟢,D,https://qwen.ai/blog?id=qwen3.6-max-preview,Reasoning,"""Qwen3.6-Max-Preview is an early preview of our next proprietary model, delivering meaningful improvements over Qwen3.6-Plus in agentic coding, world knowledge, and instruction following. It achieves the top score on six major coding benchmarks — SWE-bench Pro, Terminal-Bench 2.0, SkillsBench, QwenClawBench, QwenWebBench, and SciCode — with substantial gains over its predecessor. 
It also demonstrates stronger knowledge (SuperGPQA, QwenChineseBench) and better instruction following (ToolcallFormatIFBench).""",819,███,███,███,███,███,"131,072",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen3.6-35B-A3B,Alibaba,https://huggingface.co/Qwen/Qwen3.6-35B-A3B,35,███,MoE,"36,000","1,029:1",███,3.7,,85.2,86,21.4,"synthetic, web-scale",Apr/2026,🟢,C,https://qwen.ai/blog?id=qwen3.6-35b-a3b,Reasoning,"""Following the February release of the Qwen3.5 series, we're pleased to share the first open-weight variant of Qwen3.6.""",818,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Grok 4.3,xAI,https://grok.com/,███,25,MoE,"80,000",160:1,███,21.1,,,,,"synthetic, web-scale",Apr/2026,🟢,C,https://grok.com/release-notes,Reasoning,"""Grok 4.3 is a new pre-trained model matching the scale of Grok 4.20 with an improved architecture and a December 2025 knowledge cutoff."" ""0.5T total. Current Grok [4.2] is half the size of Sonnet and 1/10th the size of Opus."" https://x.com/elonmusk/status/2042123561666855235 ""The public facing v4.2 is based on foundation model v8, trained on Hoppers, with significant shortfalls in training data quality, comprehensiveness and proportionality. 
It is also only 0.5T in size."" https://x.com/elonmusk/status/2055298325994164377",817,███,███,███,███,███,"262,144",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Claude Opus 4.7,Anthropic,https://claude.ai/ ,5000,250,MoE,"80,000",16:1,███,66.7,,,94.2,54.7,"synthetic, web-scale",Apr/2026,🟢,D,███,"Reasoning, SOTA",Announce: https://www.anthropic.com/news/claude-opus-4-7,816,███,███,███,███,Proprietary,"200,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-Rosalind,OpenAI,,3000,150,MoE,"114,000",38:1,███,61.6,███,,,,"synthetic, web-scale",Apr/2026,🔴,F,https://openai.com/index/introducing-gpt-rosalind/,Reasoning,"""our frontier reasoning model built to support research across biology, drug discovery, and translational medicine. The life sciences model series is optimized for scientific workflows, combining improved tool use with deeper understanding across chemistry, protein engineering, and genomics.""",815,███,███,███,███,███,"128,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-5.4-Cyber,OpenAI,,3000,150,MoE,"114,000",38:1,███,61.6,,,,███,"synthetic, web-scale",Apr/2026,🔴,F,https://openai.com/index/scaling-trusted-access-for-cyber-defense/,Reasoning,"""In preparation for increasingly more capable models from OpenAI over the next few months, we are fine-tuning our models specifically to enable defensive cybersecurity use cases, starting today with a variant of GPT‑5.4 trained to be cyber-permissive: GPT‑5.4‑Cyber. 
""",814,███,███,███,███,Proprietary,"128,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Marco-Mini,Alibaba,https://huggingface.co/AIDC-AI/Marco-Mini-Instruct,17.3,0.86,MoE,"36,000","2,081:1",███,2.6,83.4,70.7,50.3,,"synthetic, web-scale",Apr/2026,🟢,C,https://huggingface.co/AIDC-AI/Marco-Mini-Instruct,Reasoning,███,813,███,███,███,███,Apache 2.0,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ EXAONE 4.5,LG,https://huggingface.co/LGAI-EXAONE/EXAONE-4.5-33B,33,,Dense,"14,000",425:1,███,███,,83.3,80.5,13.6,web-scale,Apr/2026,🟢,C,https://arxiv.org/abs/2604.08644,Reasoning,"“EXAONE”=“EXpert AI for EveryONE”. Training tokens/ratio: EXAONE-3 7.8B=8T tokens (Aug/2024) -> EXAONE-3.5 7.8B=9T -> EXAONE-3.5 32B=6.5T tokens -> EXAONE 4.0 32B=14T tokens. ""We introduce EXAONE 4.5, the first open-weight vision language model developed by LG AI Research. Integrating a dedicated visual encoder... EXAONE 4.5 features 33 billion parameters in total.""",812,███,███,███,███,███,"131,072",South Korea,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Muse Spark,Meta AI,https://meta.ai/,70,,Dense,"40,000",███,███,5.6,,,89.5,58.4,"synthetic, web-scale",Apr/2026,🟢,D,https://ai.meta.com/blog/introducing-muse-spark-msl/,"Reasoning, SOTA","""This initial model is small and fast by design, yet capable enough to reason through complex questions in science, math, and health."" Announce: https://about.fb.com/news/2026/04/introducing-muse-spark-meta-superintelligence-labs/ ""we can reach the same capabilities with over an order of magnitude less compute than our previous model, Llama 4 Maverick. 
This improvement also makes Muse Spark significantly more efficient than the leading base models available for comparison.""",811,███,███,███,███,Other,"128,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Horus 1.0 4B,TokenAI,https://huggingface.co/tokenaii/horus,███,,Dense,"3,000",750:1,███,0.4,85,60,20,,"synthetic, web-scale",Apr/2026,🟢,C,https://tokenai.cloud/horus,Reasoning,First open-source AI model from Egypt.,810,███,███,███,███,MIT,███,Egypt,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Ternary Bonsai 8B,PrismML,███,8.19,,Dense,"36,000","4,396:1",███,1.8,,,,,"synthetic, web-scale",Apr/2026,🟢,A,https://github.com/PrismML-Eng/Bonsai-demo/blob/main/ternary-bonsai-8b-whitepaper.pdf,,"1.58-bit ternary {-1,0,+1} weights across embeddings, attention, MLP, and LM head; quantization-aware retrain of Qwen3-8B. 9.4x smaller than FP16; 27 tok/s on iPhone 17 Pro Max.",809,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Claude Mythos Preview,Anthropic,,10000,500,MoE,"250,000",███,███,166.7,,,94.5,64.7,"synthetic, web-scale",Apr/2026,🔴,F,https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7a0b4484f84.pdf,"Reasoning, SOTA","Claude Mythos is suspected to be a Recurrent-Depth Transformer (RDT) — also called a Looped Transformer (LT). 
https://github.com/kyegomez/OpenMythos ""Claude Mythos Preview’s large increase in capabilities has led us to decide not to make it generally available.""",808,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen3-4B (In-Place TTT),Alibaba,https://chat.qwen.ai/,4,,Dense,"36,000","9,000:1",███,1.3,37.42,,,,"synthetic, web-scale",Apr/2026,🟢,███,https://arxiv.org/abs/2604.06169,,"""Through relatively cheap continual training, our In-Place TTT enables Qwen3-4B-Base to achieve superior performance on tasks with contexts up to 128k.""",807,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemma 4 31B,Google DeepMind,https://huggingface.co/google/gemma-4-31B-it,31,,Dense,"14,000",452:1,███,2.2,,85.25,84.3,26.5,web-scale,Apr/2026,███,C,https://ai.google.dev/gemma/docs/core/model_card_4,Reasoning,"""Gemma 4 31B IT is an open multimodal model... designed to deliver frontier-level performance for reasoning, agentic workflows, coding, and multimodal understanding.""",806,███,███,███,███,███,"1,000,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GEN-1,Generalist,https://generalistai.com/blog/apr-02-2026-GEN-1,30,,Dense,"10,000",334:1,███,1.8,,,,,robotics,Apr/2026,🟡,D,https://generalistai.com/blog/apr-02-2026-GEN-1,SOTA,"""First general-purpose AI model to master simple physical tasks. 
Completes tasks roughly 3x faster than the prior SOTA, requiring only 1 hour of robot data per result.""",███,███,███,███,███,Proprietary,"128,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen3.6-Plus,Alibaba,https://chat.qwen.ai/,1000,50,MoE,"36,000",███,███,20.0,,88.5,90.4,50.6,"synthetic, web-scale",Apr/2026,🟢,D,https://qwen.ai/blog?id=qwen3.6,Reasoning,1M context window by default.,804,███,███,███,███,Proprietary,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Trinity-Large-Thinking,Arcee AI,https://huggingface.co/arcee-ai/Trinity-Large-Thinking,400,13,MoE,"17,000",43:1,███,8.7,,83.4,76.3,,"synthetic, web-scale",Apr/2026,🟢,A,https://arxiv.org/abs/2602.17004,Reasoning,███,803,███,███,███,███,Other,"262,144",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ TweetyBERT,University of Oregon,https://github.com/georgevenven/tweety_bert,0.0025,,Dense,1,400:1,███,0.000,,,,,canary song audio,Mar/2026,🟢,C,https://doi.org/10.1016/j.patter.2025.101491,,███,802,███,███,███,███,███,"1,000",,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Machina Mirabilis,Independent,https://huggingface.co/mhla/gpt1900-d34-22btok,3.3,,Dense,22,7:1,███,0.03,,███,,,history only,Mar/2026,🟢,A,https://michaelhla.com/blog/machina-mirabilis.html,,"""A 3.29B parameter language model trained exclusively on pre-1900 English text. GPT-1900 knows nothing of the 20th century — no relativity, no quantum mechanics, no world wars. It thinks like a Victorian-era scholar, grounded in the science, literature, and worldview of its time. 
Trained on ~22B tokens from digitized books and newspapers published before 1900, sourced from HathiTrust, Internet Archive, the British Library, and historical American newspapers.""",801,███,███,███,███,███,"4,000",,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ 1-bit Bonsai 8B,PrismML,https://huggingface.co/prism-ml/Bonsai-8B-gguf,8.2,,Dense,"36,000","4,391:1",███,1.8,65.7,,30,,"synthetic, web-scale",Mar/2026,███,C,https://github.com/PrismML-Eng/Bonsai-demo/blob/main/1-bit-bonsai-8b-whitepaper.pdf,,"End-to-end true 1-bit model, with all embedding, attention, MLP, and output head weights represented solely as +1 or -1. Base likely Qwen3-8B dense. Showing MMLU Redux. Announce: https://prismml.com/news/bonsai-8b",800,███,███,███,███,███,"8,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ LFM2.5-350M,Liquid AI,https://huggingface.co/LiquidAI/LFM2.5-350M,0.35,,Dense,"28,000","80,000:1",███,0.3,,20.01,30.64,,web-scale,Mar/2026,🟢,A,http://www.liquid.ai/blog/lfm2-5-350m-no-size-left-behind,███,"New record ratio. ""LFM2.5 is a new family of hybrid models designed for on-device deployment... Training budget: 28T tokens.""",799,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Holo3-122B-A10B,H Company,https://hcompany.ai/holo-models-api,122,10,MoE,"20,014",165:1,███,5.2,,,,███,"synthetic, web-scale",Mar/2026,🟢,C,https://hcompany.ai/holo3,Reasoning,"Base: Qwen3.5. 122B-A10B. 
""With a score of 78.85% on the OSWorld-Verified benchmark, Holo3-122B-A10B establishes a new state of the art for the industry on the leading desktop computer use benchmark.""",798,███,███,███,███,Proprietary,"256,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen3.5-Omni-Plus,Alibaba,https://huggingface.co/spaces/Qwen/Qwen3.5-Omni-Online-Demo,35,3,MoE,"36,000","1,029:1",███,3.7,,███,83.9,,"synthetic, web-scale",Mar/2026,🟢,D,https://qwen.ai/blog?id=qwen3.5-omni,Reasoning,"""Qwen3.5-Omni is Qwen's latest generation of fully omnimodal LLM, supporting the understanding of text, images, audio, and audio-visual content.""",797,███,███,███,███,Proprietary,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ mr_chatterbox,Independent,https://huggingface.co/spaces/tventurella/mr_chatterbox,0.34,,Dense,3,9:1,███,0.003,███,,,,history only,Mar/2026,🟢,D,https://huggingface.co/tventurella/mr_chatterbox_model,,"""Mr. Chatterbox is a language model trained entirely from scratch on a corpus of over 28,000 Victorian-era British texts published between 1837 and 1899, drawn from a dataset made available by the British Library.""",796,███,███,███,███,MIT,███,,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GLM-5.1,Z.AI,https://chat.z.ai/,744,40,MoE,"28,500",39:1,███,███,,,86.2,52.3,"synthetic, web-scale",Mar/2026,🟢,A,https://huggingface.co/zai-org/GLM-5.1,Reasoning,Announce: https://x.com/Zai_org/status/2037490078126084514,795,███,███,███,███,MIT,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Nemotron-Cascade-2-30B-A3B,NVIDIA,https://huggingface.co/nvidia/Nemotron-Cascade-2-30B-A3B,30,3,MoE,"25,540",852:1,███,███,,79.8,76.1,17.7,"synthetic, web-scale",Mar/2026,🟢,A,https://arxiv.org/abs/2603.19220,Reasoning,Gold medal performance in both the 2025 IMO and the IOI. 
HLE=no tools.,794,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MiMo-V2-Pro,Xiaomi,https://aistudio.xiaomimimo.com/,1000,42,MoE,"27,000",27:1,███,17.3,,86.3,87,28.3,"synthetic, web-scale",Mar/2026,███,A,https://mimo.xiaomi.com/mimo-v2-pro,Reasoning,Uses a 7:1 Hybrid Attention mechanism and supports a 1M-token context window.,793,███,███,███,███,███,"262,144",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MDM-Prime-v2,UToronto,https://chen-hao-chao.github.io/mdm-prime-v2/,1.1,,Dense,168,153:1,███,0.05,,,,███,web-scale,Mar/2026,🟢,A,https://arxiv.org/abs/2603.16077,,"""Our scaling analysis reveals that MDM-Prime-v2 is 21.8× more compute-efficient than autoregressive models (ARM).""",792,███,███,███,███,Non-commercial research,███,Canada,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mamba-3,CMU,https://github.com/state-spaces/mamba,1.5,,Dense,100,67:1,███,0.04,34.9,,,,web-scale,Mar/2026,███,A,https://arxiv.org/abs/2603.15569,,"""with architectural refinements, our Mamba-3 model achieves significant gains across retrieval, state-tracking, and downstream language modeling tasks.""",791,███,███,███,███,Apache 2.0,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MiniMax-M2.7,MiniMax,https://huggingface.co/MiniMaxAI/MiniMax-M2.7,███,10,MoE,"7,200",32:1,███,4.3,,87.7,,,web-scale,Mar/2026,🟢,D,https://www.minimax.io/news/minimax-m27-en,Reasoning,"Early RSI. ""M2.7 is our first model deeply participating in its own evolution…"" https://lifearchitect.ai/asi/ ""M2.7's parameter size the same as 2.5. 
We will opensource in 2 weeks."" https://huggingface.co/MiniMaxAI/MiniMax-M2.5/discussions/53#69c238efda3cb4a28dbccf9b",790,███,███,███,███,Other,"204,800",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Holotron-12B,H Company,https://huggingface.co/Hcompany/Holotron-12B,12,,Dense,███,"1,668:1",███,1.6,,,,,"synthetic, web-scale",Mar/2026,🟢,A,https://hcompany.ai/holotron-12b,,"""Holotron-12B is a high-throughput, multimodal Vision-Language Model (VLM) designed specifically as a policy model for computer-use agents.""",789,███,███,███,███,Other,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ OLMo Hybrid,Allen AI,███,7,,Hybrid,"5,650",808:1,███,0.7,,41.7,,,"synthetic, web-scale",Mar/2026,🟢,A,https://arxiv.org/abs/2604.03444,,7B hybrid model interleaving transformer attention with Gated DeltaNet layers in a 3:1 pattern; matches Olmo 3 MMLU using 49% fewer tokens.,788,███,███,███,███,Apache 2.0,"65,536",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mistral Small 4,Mistral,https://huggingface.co/mistralai/Mistral-Small-4-119B-2603,119,███,MoE,"30,000",253:1,███,6.3,,78,71.2,,"synthetic, web-scale",Mar/2026,🟢,C,https://mistral.ai/news/mistral-small-4,Reasoning,"""unifies the capabilities of three different model families—Instruct, Reasoning (previously called Magistral), and Devstral—into a single, unified model.""",787,███,███,███,███,███,"1,000,000",France,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MiroThinker-H1,MiroMindAI,https://huggingface.co/miromind-ai/MiroThinker-1.7,235,11.75,MoE,███,154:1,███,9.7,,,,47.7,"synthetic, web-scale",Mar/2026,🟢,A,https://github.com/MiroMindAI/MiroThinker,Reasoning,"""Our proprietary agent, MiroThinker-H1 provides promising evidence for long-chain verifiable reasoning [based on new model, MiroThinker-1.7]""",786,███,███,███,███,Apache 2.0,███,Singapore,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ 
Covenant-72B,1Covenant,███,72,,Dense,"1,100",16:1,███,0.9,67.11,40.9,,,"synthetic, web-scale",Mar/2026,🟢,A,https://arxiv.org/abs/2603.08163,,"""largest permissionless collaboratively trained language model"" ~20 distinct peers, each running 8xB200 GPUs.",785,███,███,███,███,Apache 2.0,███,,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Nemotron 3 Super,NVIDIA,https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8?ncid=ref-inor-231713,120,12,MoE,"25,000",209:1,███,5.8,86.01,75.65,███,22.82,"synthetic, web-scale",Mar/2026,🟢,A,https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Super-Technical-Report.pdf,Reasoning,Announce: https://blogs.nvidia.com/blog/nemotron-3-super-agentic-ai/,784,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Sarvam 105B,Sarvam AI,https://huggingface.co/sarvamai/sarvam-105b,105,10.3,███,"12,000",115:1,███,3.7,90.6,81.7,78.7,11.2,"synthetic, web-scale",Mar/2026,🟢,A,https://www.sarvam.ai/blogs/sarvam-30b-105b,Reasoning,"""22 Indian languages""",783,███,███,███,███,Apache 2.0,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ FINGERS-7B,MIT,https://fingerprint.bio/,7,,███,"8,000","1,143:1",███,0.8,,,,,special,Mar/2026,🟢,A,https://openreview.net/forum?id=fVqvRQ6XRV,,"""Mamba–Transformer hybrid. First AI foundation model for Alzheimer’s prevention. 
‘a 7-billion-parameter multi-omic foundation model pretrained on 8 trillion quality-aware, hierarchically structured, and semantically meaningful tokens and 300K metabolite profiles curated from public gut-brain-relevant metagenomic archives, then fine-tuned on clinical data from the WW-FINGERS network.’ Achieves AUC=0.92 for preclinical AD detection.""",782,███,███,███,███,███,"300,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-5.4,OpenAI,https://chatgpt.com/,3000,150,MoE,"114,000",38:1,███,61.6,,,███,58.7,"synthetic, web-scale",Mar/2026,🟢,D,https://deploymentsafety.openai.com/gpt-5-4-thinking,"Reasoning, SOTA","""most capable and efficient frontier model for professional work."" Announce: https://openai.com/index/introducing-gpt-5-4/",781,███,███,███,███,███,"1,000,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Yuan3.0-Ultra,YuanLabAI,https://huggingface.co/YuanLabAI/Yuan3.0-Ultra,1515,68.8,MoE,"2,200",2:1,███,6.1,87.8,71.9,,,"synthetic, web-scale",Mar/2026,🟢,A,███,Reasoning,Poor performance due to low training data/ratio.,780,███,███,███,███,Other,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-5.3 Instant,OpenAI,https://chatgpt.com/,300,15,MoE,"114,000",380:1,███,19.5,,,78.5,,"synthetic, web-scale",Mar/2026,🟢,D,https://deploymentsafety.openai.com/gpt-5-3-instant/introduction,Reasoning,███,779,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ STATIC,Google,https://github.com/youtube/static-constraint-decoding,3,,Dense,"6,000","2,000:1",███,0.4,,,,,"synthetic, web-scale, video",Feb/2026,🟢,D,███,,"YouTube (Google). STATIC (Sparse Transition Matrix-Accelerated Trie Index for Constrained Decoding). ""The model is a Gemini-based generative retrieval model similar to PLUM [8], served with a batch size of 2 (per chip) and a beam size of 𝑀 = 70. 
The model is based on a non-Mixture-of-Experts (MoE) architecture with 3 billion dense parameters. All benchmark experiments are conducted on Google TPU v6e accelerators.""",778,███,███,███,███,Apache 2.0,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Arrow 1.0,Quiver,https://app.quiver.ai/,32,,Dense,███,13:1,███,0.4,,,,,SVG,Feb/2026,🟢,D,https://docs.quiver.ai/getting-started/overview,,"""A first of it's kind SVG AI model."" Announce: https://x.com/QuiverAI/status/2026792057893708072",777,███,███,███,███,███,"32,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen3.5-27B,Alibaba,https://huggingface.co/Qwen/Qwen3.5-27B,27,,Dense,"36,000","1,334:1",███,3.3,,86.1,85.5,48.5,"synthetic, web-scale",Feb/2026,🟢,C,https://huggingface.co/Qwen/Qwen3.5-27B,Reasoning,"""Qwen3.5 represents a significant leap forward, integrating breakthroughs in multimodal learning, architectural efficiency, reinforcement learning scale, and global accessibility""",███,███,███,███,███,Apache 2.0,"262,144",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ LFM2-24B-A2B,Liquid AI,https://huggingface.co/LiquidAI/LFM2-24B-A2B,24,,Dense,"17,000",709:1,███,2.1,,,,,web-scale,Feb/2026,███,A,https://www.liquid.ai/blog/lfm2-24b-a2b,,"""a traditional instruct model without reasoning traces.""",775,███,███,███,███,███,"128,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mercury 2,Inception Labs,https://chat.inceptionlabs.ai/,180,███,Dense,"16,000",89:1,███,5.7,,,74,,"synthetic, web-scale",Feb/2026,🟢,D,https://www.inceptionlabs.ai/blog/introducing-mercury-2,"Diffusion, Reasoning",Diffusion large language model (dLLM).,774,███,███,███,███,███,"32,768",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemini 3.1 Pro,Google DeepMind,https://gemini.google.com/,3000,150,███,"100,000",34:1,███,57.7,,,94.3,51.4,"synthetic, 
web-scale",Feb/2026,🟢,D,https://deepmind.google/models/model-cards/gemini-3-1-pro/,"Reasoning, SOTA",Knowledge cutoff still=January 2025. Announce: https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-1-pro/,773,███,███,███,███,Proprietary,"1,000,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ZUNA,Zyphra,https://huggingface.co/Zyphra/ZUNA,0.38,███,Dense,324,853:1,███,0.04,,,,,EEG,Feb/2026,🟢,A,https://www.zyphra.com/zuna-technical-paper,,"For BCI, 'thought-to-text'. Training dataset calcs: (2M hours * 3,600 seconds/hour * 256 samples/second ) / 32 samples/token = 57.6B tokens (refined to 45.1B after rigorous filtering ); 150,000 steps * 2.16M tokens/batch = 324B total tokens seen during training. Announce: https://www.zyphra.com/post/zuna",772,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Grok 4.2,xAI,https://grok.com/,███,25,MoE,"80,000",160:1,███,21.1,,,,,"synthetic, web-scale",Feb/2026,🟢,C,https://docs.x.ai/developers/models/grok-4.20-0309-reasoning,Reasoning,"No details provided. Announce: https://x.com/elonmusk/status/2023829664318583105 ""0.5T total. Current Grok [4.2] is half the size of Sonnet and 1/10th the size of Opus."" https://x.com/elonmusk/status/2042123561666855235 ""The public facing v4.2 is based on foundation model v8, trained on Hoppers, with significant shortfalls in training data quality, comprehensiveness and proportionality. 
It is also only 0.5T in size."" https://x.com/elonmusk/status/2055298325994164377",771,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ INTELLECT-3.1,Prime Intellect,https://chat.primeintellect.ai/ ,106,███,MoE,"22,000",208:1,███,5.1,,,,,"synthetic, web-scale",Feb/2026,🟢,A,https://huggingface.co/PrimeIntellect/INTELLECT-3.1,Reasoning,"Base: GLM-4.5-Air-Base, INTELLECT-3 model.",770,███,███,███,███,MIT,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Claude Sonnet 4.6,Anthropic,https://claude.ai/ ,1000,20,MoE,"80,000",80:1,███,29.8,88.7,,89.9,49,"synthetic, web-scale",Feb/2026,🟢,D,https://www-cdn.anthropic.com/78073f739564e986ff3e28522761a7a0b4484f84.pdf,Reasoning,1M context. Announce: https://www.anthropic.com/news/claude-sonnet-4-6 Showing GMMLU (Global MMLU by Cohere).,███,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Tiny Aya,Cohere,https://huggingface.co/CohereLabs/tiny-aya-base,3.35,,Dense,"8,000","2,389:1",███,0.5,44.9,,,,"synthetic, web-scale",Feb/2026,🟢,███,https://github.com/Cohere-Labs/tiny-aya-tech-report/blob/main/tiny_aya_tech_report.pdf,,70+ languages. 
Showing GMMLU (Global MMLU by Cohere).,768,███,███,███,███,CC-BY-NC 4.0,"8,192",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ gpt-oss-puzzle-88B,NVIDIA,https://huggingface.co/nvidia/gpt-oss-puzzle-88B,88,5.1,MoE,"30,000",341:1,███,5.4,,79.32,75.25,17.52,███,Feb/2026,🟢,C,https://arxiv.org/abs/2602.11937,Reasoning,"gpt-oss-puzzle-88B is a deployment-optimized large language model developed by NVIDIA, derived from OpenAI's gpt-oss-120b.",767,███,███,███,███,Other,"224,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen3.5-397B-A17B,Alibaba,https://huggingface.co/Qwen/Qwen3.5-397B-A17B,397,17,MoE,"36,000",91:1,███,12.6,88.61,76.01,88.4,48.3,"synthetic, web-scale",Feb/2026,🟢,C,https://qwen.ai/blog?id=qwen3.5,Reasoning,███,766,███,███,███,███,███,"262,144",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ JoyAI-LLM Flash,JD Open Source,https://huggingface.co/jdopensource/JoyAI-LLM-Flash,48,3,MoE,"20,000",417:1,███,3.3,,███,74.43,,web-scale,Feb/2026,🟢,A,https://huggingface.co/jdopensource/JoyAI-LLM-Flash,Reasoning,"MoE with 48B total / 3B active params (256 experts, 8 selected), Multi-head Latent Attention (MLA), 128K context, trained on 20T tokens with the Muon optimizer; introduces Fibration Policy Optimization (FiberPO) for heterogeneous agent RL; scores 74.43 on GPQA Diamond and 60.60% on SWE-bench Verified. Open-weight.",765,███,███,███,███,MIT,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Ring-2.5-1T,Inclusion AI,https://huggingface.co/inclusionAI/Ring-2.5-1T,1000,███,MoE,"29,000",29:1,███,18.0,,,,,"synthetic, web-scale",Feb/2026,🟢,A,https://huggingface.co/inclusionAI/Ring-2.5-1T,"Reasoning, SOTA","World’s first open-source trillion-parameter thinking model based on hybrid linear attention. 
IMO 2025: 35/42 (gold), CMO 2025: 105/126.",764,███,███,███,███,MIT,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Ling-2.5-1T,Inclusion AI,https://huggingface.co/inclusionAI/Ling-2.5-1T,1000,63,MoE,"29,000",29:1,███,18.0,,,,,"synthetic, web-scale",Feb/2026,🟢,A,https://huggingface.co/inclusionAI/Ling-2.5-1T,SOTA,"Non-thinking flagship of Ling 2.5 series; hybrid linear attention. Pre-training corpus expanded from 20T to 29T tokens. Matches frontier thinking models on AIME 2026 at ~5,890 output tokens.",███,███,███,███,███,MIT,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MiniMax-M2.5,MiniMax,https://huggingface.co/MiniMaxAI/MiniMax-M2.5,230,10,MoE,"7,200",32:1,███,4.3,,,85.2,19.4,web-scale,Feb/2026,🟢,███,https://www.minimax.io/news/minimax-m25,Reasoning,HLE showing without tools.,762,███,███,███,███,Other,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GLM-5,Z.AI,https://huggingface.co/zai-org/GLM-5,744,40,MoE,"28,500",39:1,███,15.3,,,86,50.4,"synthetic, web-scale",Feb/2026,🟢,A,███,Reasoning,Announce: https://z.ai/blog/glm-5,761,███,███,███,███,███,"200,000",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Nanbeige4.1-3B,Nanbeige,https://huggingface.co/Nanbeige/Nanbeige4.1-3B,3,,Dense,"23,000","7,667:1",███,0.9,,,83.8,22.29,███,Feb/2026,🟢,A,https://huggingface.co/Nanbeige/Nanbeige4.1-3B,"Reasoning, SOTA",SOTA for size (3B),760,███,███,███,███,███,"262,144",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ RynnBrain-30B-A3B,Alibaba,https://huggingface.co/Alibaba-DAMO-Academy/RynnBrain-Nav-8B,30,3,MoE,"36,000","1,200:1",███,3.5,,███,,,"synthetic, web-scale",Feb/2026,🟢,A,https://alibaba-damo-academy.github.io/RynnBrain.github.io/,Reasoning,"Base: Qwen3-VL-30B-A3B-Instruct. 
""an embodied foundation model grounded in physical reality.""",759,███,███,███,███,Apache 2.0,"262,144",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Claude Opus 4.6,Anthropic,https://claude.ai/ ,███,250,MoE,"100,000",20:1,███,74.5,,,91.3,53.1,"synthetic, web-scale",Feb/2026,🟢,D,https://www-cdn.anthropic.com/0dd865075ad3132672ee0ab40b05a53f14cf5288.pdf,"Reasoning, SOTA","Anthropic's most capable model (mid-2026); 200K context standard (1M beta), up to 128K output tokens; leads on Terminal-Bench 2.0, Humanity's Last Exam, and BrowseComp; scores 76% on MRCR v2 1M vs. Sonnet 4.5's 18.5%; supports adaptive extended thinking. API-only, closed weights.",758,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Intern-S1-Pro,Shanghai AI Laboratory/SenseTime,https://huggingface.co/internlm/Intern-S1-Pro,1000,22,MoE,"41,000",41:1,███,21.3,,86.6,███,,"synthetic, web-scale",Feb/2026,🟢,C,https://arxiv.org/abs/2603.25040,"Reasoning, SOTA","Assumes base model of Qwen3. 
""Built upon a 235B MoE language model and a 6B Vision encoder, Intern-S1 has been further pretrained on 5 trillion tokens of multimodal data""",757,███,███,███,███,███,"262,144",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Step 3.5 Flash,StepFun,https://huggingface.co/stepfun-ai/Step-3.5-Flash,196,11,MoE,"18,000",92:1,███,6.3,,,,,███,Feb/2026,🟢,C,https://static.stepfun.com/blog/step-3.5-flash/,Reasoning,MoE with 196B total / 11B active params; hybrid SWA + Full Attention at 3:1 ratio with 256K context; 3-way Multi-Token Prediction (MTP-3) enables 100-350 tok/s; scores 97.3 on AIME 2025 (99.9 with Parallel Thinking) and 74.4% on SWE-bench Verified.,756,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Assistant_Pepe_8B,Independent,https://huggingface.co/SicariusSicariiStuff/Assistant_Pepe_8B,8,,Dense,"15,600","1,950:1",███,1.2,███,,,,"synthetic, web-scale",Jan/2026,🟢,A,https://www.reddit.com/r/LocalLLaMA/comments/1qsrscu/can_4chan_data_really_improve_a_model_turns_out/,,"Warning for inappropriate content. Base: Llama-3.1-Nemotron-8B. ""trained it on an extended 4chan dataset"" ""the original, gpt4chan (by Yannic Kilcher) scored especially high in truthfulness (that was b4 benchmaxxing)... outperformed the base tune (the unabliterated one), it also changed its political alignment... People were initially joking about the ""alignment tax"", I think there's a none trivial substance in all of this. It seems to me just above a marginal error or statistical noise.""",755,███,███,███,███,███,"1,073,152",,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Trinity-Large,Arcee AI,https://www.arcee.ai/trinity#trinity-large-preview,400,13,MoE,███,43:1,███,8.7,87.2,75.2,63.3,,"synthetic, web-scale",Jan/2026,🟢,A,https://www.arcee.ai/trinity,Reasoning,"""we worked closely with Prime Intellect. 
They not only served the H100 clusters Datology used to generate synthetic data, they have been deeply involved in helping scale our training setup to the GPU footprint required for a fully frontier sized model, including the current 2048 B300 GPU configuration for Trinity Large.""",754,███,███,███,███,Other,"128,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ SERA,Allen AI,███,32,,Dense,"36,000","1,125:1",███,3.6,,,,,"synthetic, web-scale",Jan/2026,🟢,A,https://allenai.org/papers/opencodingagents,Reasoning,"Base: Qwen3-32B. SERA=Soft-verified Efficient Repository Agents. ""SERA was built largely by a single Ai2 researcher."" https://allenai.org/blog/open-coding-agents ""SERA-32B was trained using Soft Verified Generation (SVG), a simple and efficient method that is 26x cheaper than reinforcement learning and 57x cheaper than previous synthetic data methods to reach equivalent performance. The total cost for data generation and training is approximately $2,000 (40 GPU-days).""",753,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Kimi K2.5,Moonshot AI,https://huggingface.co/moonshotai/Kimi-K2.5,1000,███,MoE,"30,500",31:1,███,18.4,,87.1,87.6,50.2,"synthetic, web-scale",Jan/2026,🟢,A,https://www.kimi.com/blog/kimi-k2-5.html,"Reasoning, SOTA","1T parameters and 384 experts. Open source SOTA. ""Kimi K2.5 builds on Kimi K2 [15.5T tokens] with continued pretraining over approximately 15T mixed visual and text tokens. 
[+ 15T=30.5T]""",752,███,███,███,███,███,"262,144",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GLM-4.7-Flash,Z.AI,https://huggingface.co/zai-org/GLM-4.7-Flash,30,3,MoE,"22,000",734:1,███,2.7,,,75.2,14.4,"synthetic, web-scale",Jan/2026,🟢,███,https://huggingface.co/zai-org/GLM-4.7-Flash,Reasoning,"MoE with 30B total / 3B active params, 128K context; achieves 91.6 on AIME 2025, 75.2 on GPQA, and 59.2% on SWE-bench Verified, outperforming Qwen3-30B-A3B (22.0%) and GPT-OSS-20B (34.0%) on SWE-bench. MIT licensed, open-weight.",751,███,███,███,███,MIT,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MedGemma 1.5 4B,Google DeepMind,https://huggingface.co/google/medgemma-1.5-4b-it,4,,Dense,"14,000","3,500:1",███,0.8,67.2,,,,web-scale,Jan/2026,███,C,https://developers.google.com/health-ai-developer-foundations/medgemma/model-card,,Lower MMLU score compared to previous MedGemma 1 27B (67.2 v 87). Announce: https://research.google/blog/next-generation-medical-image-interpretation-with-medgemma-15-and-medical-speech-to-text-with-medasr/,750,███,███,███,███,Other,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ FrogBoss,Microsoft,https://huggingface.co/microsoft/FrogBoss-32B-2510,32,,Dense,"36,000","1,125:1",███,3.6,,,,,"synthetic, web-scale",Jan/2026,🟢,███,https://arxiv.org/abs/2510.19898,,Base: Qwen3-32B.,749,███,███,███,███,MIT,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ EDEN,NVIDIA,,28,,Dense,"9,700",347:1,███,1.7,,,,,nucleotide tokens,Jan/2026,🔴,B,███,,"""EDEN (environmentally-derived evolutionary network) family of metagenomic foundation models, including a 28 billion parameter model trained on 9.7 trillion nucleotide tokens from BaseData1 . 
This dataset, at the time of training, contained more than 10 billion novel genes from over 1 million new species, and is intentionally enriched for environmental and host-associated metagenomes, phage sequences, and mobile genetic elements, enabling the model to learn from diverse and novel cross-species evolutionary mechanisms and apply them to key challenges in human health.""",748,███,███,███,███,Other,"8,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Baichuan-M3,Baichuan,https://huggingface.co/baichuan-inc/Baichuan-M3-235B,235,,███,"20,000",86:1,███,7.2,,,,,"synthetic, web-scale",Jan/2026,🟢,A,https://www.baichuan-ai.com/blog/baichuan-M3,Reasoning,"""new-generation medical-enhanced large language model""",747,███,███,███,███,███,"131,072",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Engram,DeepSeek-AI,https://github.com/deepseek-ai/Engram,39.5,3.8,MoE,262,7:1,███,0.3,60.6,31.3,,,"synthetic, web-scale",Jan/2026,🟡,A,███,,"""we explore conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic N-gram embeddings for O(1) lookup.""",746,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ SleepFM,Stanford,https://github.com/zou-group/sleepfm-clinical/tree/sleepfm_release?tab=readme-ov-file,0.091,,Dense,13,139:1,███,0.004,,,,,PSG recordings,Jan/2026,███,C,https://www.nature.com/articles/s41591-025-04133-4,,"Uses a leave-one-out contrastive learning approach to align brain activity (EEG), heart activity (ECG), and respiratory signals. 130+ disease categories and 19–20+ clinical PSG channels. 
Dataset: ~1.26B tokens per epoch (calculated from 585,000 hours of data across 3 modality groups using 5-second window tokens) x 10 epochs = ~12.63B tokens.",745,███,███,███,███,Non-commercial research,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ TimeCapsuleLLM-v2-1800-1875,Independent,https://huggingface.co/haykgrigorian/TimeCapsuleLLM-v2-1800-1875,1.2,,Dense,15,13:1,███,0.01,,,,,history only,Jan/2026,🟢,C,https://github.com/haykgrigo3/TimeCapsuleLLM,███,112GB dataset=30B tokens x 0.5 epochs = 15B tokens.,744,███,███,███,███,MIT,███,,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Jamba2,AI21,https://huggingface.co/collections/ai21labs/jamba2,52,12,MoE,"1,700",33:1,███,1.0,███,,,,web-scale,Jan/2026,🟢,C,https://www.ai21.com/blog/introducing-jamba2/,Reasoning,Pre-training tokens from Jamba=1.2T + 500B mid.,743,███,███,███,███,Apache 2.0,"262,144",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ LFM2.5,Liquid AI,https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct,1.2,,Dense,"28,000","23,334:1",███,0.6,,44.35,38.89,,web-scale,Jan/2026,🟢,███,https://huggingface.co/LiquidAI/LFM2.5-1.2B-Instruct,,"For on-device agentic applications. ""Extended pre-training from 10T to 28T tokens and large-scale multi-stage reinforcement learning.""",742,███,███,███,███,Other,"128,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MiroThinker v1.5,MiroMindAI,https://huggingface.co/miromind-ai/MiroThinker-v1.5-235B,235,22,MoE,"36,000",154:1,███,9.7,,,,39.2,"synthetic, web-scale",Jan/2026,███,A,https://github.com/MiroMindAI/MiroThinker,Reasoning,Base: Qwen3 235B-A22B. 
Official demo: https://dr.miromind.ai,741,███,███,███,███,███,"262,144",Singapore,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Falcon-H1R,TII,https://huggingface.co/tiiuae/Falcon-H1-34B-Instruct-GGUF,7,,Dense,"18,000","2,572:1",███,1.2,,72.1,70.2,11.1,███,Jan/2026,🟢,A,https://github.com/tiiuae/falcon-h1r/blob/main/tech_report.pdf,Reasoning,Base model: Falcon-H1 (May/2025). Announce: https://huggingface.co/blog/tiiuae/falcon-h1r-7b,740,███,███,███,███,███,"262,144",UAE,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ mHC 27B,DeepSeek-AI,,27,4.14,MoE,262,10:1,███,0.3,63.4,,,,"synthetic, web-scale",Dec/2025,🔴,B,https://arxiv.org/abs/2512.24880,,███,739,███,███,███,███,MIT,"128,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MiniMax-M2.1,MiniMax,https://agent.minimax.io/,229,10,MoE,"10,000",44:1,███,5.0,███,88,83,22.2,"synthetic, web-scale",Dec/2025,🟢,D,https://huggingface.co/MiniMaxAI/MiniMax-M2.1,Reasoning,"229B MoE (10B active) optimized for agentic coding, tool use, instruction following, and long-horizon planning; matches or beats Claude Sonnet 4.5 on multilingual SWE-bench.",738,███,███,███,███,Other,"204,800",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ IQuest-Coder-V1,IQuestLab,https://huggingface.co/IQuestLab/IQuest-Coder-V1-40B-Instruct,40,,███,"1,000",25:1,███,0.7,,,,,web-scale,Dec/2025,🟢,C,https://github.com/IQuestLab/IQuest-Coder-V1/blob/main/papers/IQuest_Coder_Technical_Report.pdf,Reasoning,"""IQuest-Coder-V1 captures the dynamic evolution of software logic, delivering state-of-the-art performance across critical dimensions"" https://github.com/IQuestLab/IQuest-Coder-V1",737,███,███,███,███,███,"131,072",India,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ A.X K1,SK Telecom,https://huggingface.co/skt/A.X-K1,519,33,MoE,"10,000",20:1,███,7.6,,███,74,8.6,"synthetic, web-scale",Dec/2025,🟢,A,https://arxiv.org/abs/2601.09200,Reasoning,MoE with 
519B parameters pretrained on ~10T tokens; introduces a Think-Fusion training recipe enabling user-controlled switching between thinking and non-thinking modes in a single model; designed to bridge the gap between reasoning capability and inference efficiency; distinctive advantage on Korean-language benchmarks. Open-weight.,736,███,███,███,███,███,"131,072",South Korea,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ K-EXAONE,LG,https://huggingface.co/LGAI-EXAONE/K-EXAONE-236B-A23B,236,23,MoE,"14,000",60:1,███,6.1,,83.9,80,███,"synthetic, web-scale",Dec/2025,🟢,C,https://huggingface.co/LGAI-EXAONE/K-EXAONE-236B-A23B,Reasoning,“EXAONE”=“EXpert AI for EveryONE”.,735,███,███,███,███,███,"262,144",South Korea,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Ranke-4B,UZH,,4,,Dense,80,20:1,███,0.06,,,,,history only,Dec/2025,🔴,B,https://github.com/DGoettlich/history-llms?tab=readme-ov-file,,"Base Model: Qwen 3. 600B tokens of pre-(1913, 1929, 1933, 1939, 1946) data only.",███,███,███,███,███,Non-commercial research,███,Switzerland,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ WeDLM,Tencent,https://huggingface.co/tencent/WeDLM-8B-Instruct,8,,Dense,"18,000","2,250:1",███,1.3,75.14,,44.95,,"synthetic, web-scale",Dec/2025,🟢,C,███,Diffusion,"Project page: https://wedlm.github.io/ ""WeDLM, a diffusion decoding framework built entirely on standard causal attention to make parallel generation prefix-cache friendly. The core idea is to let each masked position condition on all currently observed tokens while keeping a strict causal mask, achieved by Topological Reordering that moves observed tokens to the physical prefix while preserving their logical positions.. 
We instantiate WeDLM on both Qwen2.5-7B and Qwen3-8B, utilizing 100B tokens for continued training and 10B tokens for SFT.""",733,███,███,███,███,███,"16,384",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ SOLAR Open,Upstage AI,https://huggingface.co/upstage/Solar-Open-100B,102,12,MoE,"19,700",194:1,███,4.7,88.2,███,68.1,10.5,"synthetic, web-scale",Dec/2025,🟢,A,https://huggingface.co/upstage/Solar-Open-100B,,South Korean. Released 31/Dec.,732,███,███,███,███,███,"131,072",South Korea,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GLM-4.7,Z.AI,https://huggingface.co/zai-org/GLM-4.7,355,32,MoE,"22,000",62:1,███,9.3,,84.3,85.7,42.8,"synthetic, web-scale",Dec/2025,███,A,https://z.ai/blog/glm-4.7,Reasoning,"""context window has been expanded from 128K to 200K tokens""",731,███,███,███,███,MIT,"200,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ NitroGen,NVIDIA,https://huggingface.co/nvidia/NitroGen,0.493,,Dense,"2,000","4,057:1",███,0.1,,,,,video,Dec/2025,🟢,C,https://nitrogen.minedojo.org/assets/documents/nitrogen.pdf,,"""NitroGen is a unified vision-to-action model designed to play video games directly from raw frames. It takes video game footage as input and outputs gamepad actions... 
trained on 40,000 hours of gameplay videos across more than 1,000 games.""",███,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MiMo-V2-Flash,Xiaomi,https://github.com/XiaomiMiMo/MiMo-V2-Flash/tree/main,309,15,MoE,"27,000",88:1,███,9.6,86.7,84.9,83.7,22.1,"synthetic, web-scale",Dec/2025,🟢,███,https://github.com/XiaomiMiMo/MiMo-V2-Flash/blob/main/paper.pdf,Reasoning,"MoE with 309B total / 15B active params, trained on 27T tokens with FP8 mixed precision; 256K context via SWA + Global Attention (5:1 ratio, 128-token window) reducing KV cache ~6x; lightweight MTP module triples inference speed; RL post-training achieves 94.1 on AIME 2025 and 73.4% on SWE-bench Verified. Open-weight.",729,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ FunctionGemma,Google DeepMind,https://huggingface.co/google/functiongemma-270m-it,0.27,,Dense,"6,000","22,223:1",███,0.1,,,,,"synthetic, web-scale",Dec/2025,🟢,A,https://blog.google/technology/developers/functiongemma/,,"""FunctionGemma, a specialized version of our Gemma 3 270M model tuned for function calling. It is designed as a strong base for further training into custom, fast, private, local agents that translate natural language into executable API actions.""",███,███,███,███,███,Gemma,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ T5Gemma 2,Google DeepMind,https://huggingface.co/collections/google/t5gemma-2,4,,Dense,"6,000","1,500:1",███,0.5,,44.4,27.8,,"synthetic, web-scale",Dec/2025,🟢,A,https://arxiv.org/abs/2512.14856,███,Base model: Gemma 3. 
Dataset: Gemma 3 4B checkpoint (4T) + pretraining (2T)=6T.,727,███,███,███,███,Gemma,512,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemini 3 Flash,Google DeepMind,https://gemini.google.com/,200,10,MoE,"100,000",500:1,███,14.9,,,90.4,43.5,"synthetic, web-scale",Dec/2025,🟢,D,https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Flash-Model-Card.pdf,Reasoning,Announce: https://deepmind.google/models/gemini/flash/,███,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Nemotron 3 Nano 30B-A3B,NVIDIA,https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16,31.6,3.2,MoE,"25,000",792:1,███,3.0,,78.3,75,15.5,"synthetic, web-scale",Dec/2025,🟢,███,https://research.nvidia.com/labs/nemotron/files/NVIDIA-Nemotron-3-Nano-Technical-Report.pdf,Reasoning,"Hybrid Mamba-2 + Transformer MoE backbone. 31.6B total / 3.2B active params, 128 routed + 1 shared expert (6 active/token), 52 layers (23 MoE + 23 Mamba-2 + 6 GQA). Pretrained on 25T tokens (23.5T diverse + 1.5T high-quality, WSD schedule). 1M context window.",725,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Bolmo,Allen AI,https://huggingface.co/allenai/Bolmo-7B,███,,Dense,"6,000",858:1,███,0.7,65.1,,,,"synthetic, web-scale",Dec/2025,🟢,A,https://www.datocms-assets.com/64837/1765814974-bolmo.pdf,Reasoning,Base Model: Olmo 3 7B. 
Announce: https://allenai.org/blog/bolmo,724,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ EuroLLM-22B,Consortium,https://huggingface.co/utter-project/EuroLLM-22B-Instruct-2512,22,,Dense,"4,100",███,███,1.0,69.81,50.85,26.77,,"synthetic, web-scale",Dec/2025,🟢,A,https://huggingface.co/blog/eurollm-team/eurollm-22b,,A fully open language model developed in Europe.,723,███,███,███,███,Apache 2.0,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ LLaDA2.0 Flash,Inclusion AI,https://github.com/inclusionAI/LLaDA2.0,103,6.1,MoE,"20,750",202:1,███,4.9,███,73.4,61.98,,"synthetic, web-scale",Dec/2025,🟢,A,https://github.com/inclusionAI/LLaDA2.0/blob/main/tech_report.pdf,Diffusion,"Base Model: Ling-flash-2.0: 103B total parameters with 6.1B activated. ""largest diffusion language model to date""",722,███,███,███,███,Apache 2.0,"8,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-5.2,OpenAI,https://chatgpt.com/,███,150,MoE,"114,000",38:1,███,61.6,91.3,,93.2,50,"synthetic, web-scale",Dec/2025,🟢,D,https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf,"Reasoning, SOTA","""GPT‑5.2 sets a new state of the art across many benchmarks, including GDPval, where it outperforms industry professionals at well-specified knowledge work tasks spanning 44 occupations."" Announce: https://openai.com/index/introducing-gpt-5-2/ MMLU is for Spanish.",721,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Apriel-1.6-15B-Thinker,ServiceNow,https://huggingface.co/blog/ServiceNow-AI/apriel-1p6-15b-thinker,███,,Dense,"6,000",400:1,███,1.0,,,73,10,"synthetic, web-scale",Dec/2025,🟢,C,https://huggingface.co/blog/ServiceNow-AI/apriel-1p6-15b-thinker,Reasoning,"Dense 15B reasoning model from ServiceNow built via depth-upscaling and two-stage continual pretraining; RL with GSPO reduces output token usage 
30%+ vs. v1.5; 49K text context; scores 88 on AIME 2025 and 79 on MMLU-Pro, competitive with Qwen3-235B. Open-weight.",720,███,███,███,███,███,"49,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Motif 2 12.7B,Motif-Technologies,https://huggingface.co/Motif-Technologies/Motif-2-12.7B-Reasoning,12.7,,Dense,"5,500",434:1,███,0.9,86.11,70,,███,"synthetic, web-scale",Dec/2025,🟢,A,https://arxiv.org/abs/2511.07464,,"Dense 12.7B model trained on 5.5T tokens across linguistic, math, scientific, and programming domains; introduces Grouped Differential Attention (GDA) to disentangle signal and noise-control attention pathways; uses MuonClip optimizer, fused PolyNorm activations, and a curriculum-driven data scheduler. Open-weight.",719,███,███,███,███,███,"131,072",South Korea,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Devstral 2,Mistral,https://console.mistral.ai/,123,,Dense,"30,000",244:1,███,6.4,,,███,,"synthetic, web-scale",Dec/2025,🟢,C,https://mistral.ai/news/devstral-2-vibe-cli,,SWE-bench Verified=72.2%.,718,███,███,███,███,MIT,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Nanbeige4-3B-Base,Nanbeige LLM Lab,https://huggingface.co/Nanbeige/Nanbeige4-3B-Base,3,,Dense,"23,000","7,667:1",███,0.9,,,82.2,,"synthetic, web-scale",Dec/2025,🟢,A,https://arxiv.org/abs/2512.06266,,"Dense 3B model pretrained on 23T high-quality tokens with a Fine-Grained Warmup-Stable-Decay (FG-WSD) training scheduler; post-training uses Dual Preference Distillation (DPD), chain-of-thought reconstruction, and RL with verifiable rewards; significantly outperforms models of comparable parameter scale and rivals much larger models. 
Open-weight.",███,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ HY 2.0,Tencent,https://hunyuan.tencent.com/,406,32,MoE,"40,000",99:1,███,13.4,,,,18.8,"synthetic, web-scale",Dec/2025,🟢,███,https://x.com/TencentHunyuan/status/1996948083377332614,Reasoning,"Tencent Hunyuan MoE reasoning model (406B total); supports slow- and fast-thinking modes, 256K context; achieves 87.3 on AIME 2024, 94.3 on MATH, and 78.3 on BFCL v3 tool use; available via Tencent Cloud API.",716,███,███,███,███,Other,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ K2-V2,MBZUAI,https://huggingface.co/LLM360/K2-V2,70,███,Dense,"12,000",172:1,███,3.1,75.2,59.8,69.3,,"synthetic, web-scale",Dec/2025,🟢,A,https://www.llm360.ai/reports/K2_V2_report.pdf,Reasoning,8.5x more tokens trained than K2 (1.4T v 12T). Project page: https://ifm.ai/k2/,715,███,███,███,███,Apache 2.0,███,UAE,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Trinity-Mini,Arcee AI,https://huggingface.co/arcee-ai/Trinity-Mini,26,3,███,"20,000",770:1,███,2.4,84.95,,58.55,,"synthetic, web-scale",Dec/2025,🟢,A,https://www.arcee.ai/blog/the-trinity-manifesto,Reasoning,"""we worked closely with Prime Intellect. 
They not only served the H100 clusters Datology used to generate synthetic data, they have been deeply involved in helping scale our training setup to the GPU footprint required for a fully frontier sized model, including the current 2048 B300 GPU configuration for Trinity Large.""",714,███,███,███,███,Other,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Nova 2 Pro,Amazon,https://nova.amazon.com/chat,200,,███,"20,000",100:1,███,6.7,,81.6,81.4,,"synthetic, web-scale",Dec/2025,🟢,D,https://www.aboutamazon.com/news/aws/aws-agentic-ai-amazon-bedrock-nova-models,Reasoning,"""Nova 2 Pro is Amazon's most intelligent reasoning model that can process text, images, video, and speech to generate text.""",713,███,███,███,███,███,"1,000,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mistral Large 3,Mistral,https://huggingface.co/collections/mistralai/mistral-large-3,675,41,MoE,"20,000",30:1,███,12.2,,███,43.9,,"synthetic, web-scale",Dec/2025,🟢,C,https://mistral.ai/news/mistral-3,Reasoning,"""Mistral Large 3 joins the ranks of frontier instruction-fine-tuned open-source models."" EU tech doc: https://legal.cms.mistral.ai/assets/1e37fffd-7ea5-469b-822f-05dcfbb43623",712,███,███,███,███,Apache 2.0,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeepSeek-V3.2-Speciale,DeepSeek-AI,https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale,685,37,MoE,"15,640",23:1,███,10.9,,,85.7,30.6,"synthetic, web-scale",Dec/2025,🟢,C,https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf,███,"The word 'Speciale' may be a reference to Ferrari. 
""It shows gold-medal performance in the IOI 2025, ICPC World Final 2025, IMO 2025, and CMO 2025."" API: https://api-docs.deepseek.com/news/news251201",711,███,███,███,███,MIT,"163,840",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ZAYA1-base,Zyphra,https://huggingface.co/Zyphra/ZAYA1-base,8.3,0.76,MoE,"14,000","1,687:1",███,1.1,███,40.43,30.7,,"synthetic, web-scale",Nov/2025,🟢,A,https://arxiv.org/abs/2511.17127,,First large-scale MoE foundation model trained entirely on an integrated AMD platform (Instinct MI300X + Pensando Pollara + ROCm); 760M active params.,710,███,███,███,███,███,"32,768",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeepSeek-Math-V2,DeepSeek-AI,https://huggingface.co/deepseek-ai/DeepSeek-Math-V2,685,37,MoE,"15,640",23:1,███,10.9,,,,,"synthetic, web-scale",Nov/2025,🟢,███,https://github.com/deepseek-ai/DeepSeek-Math-V2/blob/main/DeepSeekMath_V2.pdf,"SOTA, Reasoning","""DeepSeekMath-V2, demonstrates strong theorem-proving capabilities, achieving gold-level scores on IMO 2025 and CMO 2024 and a near-perfect 118/120 on Putnam 2024 with scaled testtime compute. """,709,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Orchestrator-8B,NVIDIA,https://huggingface.co/nvidia/Orchestrator-8B,8,,Dense,"36,000",███,███,1.8,,,,37.1,"synthetic, web-scale",Nov/2025,🟢,A,https://arxiv.org/abs/2511.21689,Reasoning,Base Model: Qwen3-8B,708,███,███,███,███,███,"131,072",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ INTELLECT-3,Prime Intellect,https://chat.primeintellect.ai/ ,███,12,MoE,"22,000",208:1,███,5.1,,81.9,74.4,14.6,"synthetic, web-scale",Nov/2025,🟢,A,https://storage.googleapis.com/intellect-3-paper/INTELLECT_3_Technical_Report.pdf,Reasoning,Base: GLM-4.5-Air-Base model. 
Announce: https://www.primeintellect.ai/blog/intellect-3,707,███,███,███,███,███,"32,768",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Fara-7B,Microsoft,https://huggingface.co/microsoft/Fara-7B,7,███,Dense,"18,000","2,572:1",███,1.2,,,,,"synthetic, web-scale",Nov/2025,🟢,A,https://www.microsoft.com/en-us/research/wp-content/uploads/2025/11/Fara-7B-An-Efficient-Agentic-Model-for-Computer-Use.pdf,,"""Fara-7B is Microsoft's first agentic small language model (SLM) designed specifically for computer use. With only 7 billion parameters, Fara-7B is an ultra-compact Computer Use Agent (CUA)...Current production baselines leverage Qwen 2.5-VL (7B).""",706,███,███,███,███,MIT,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Claude Opus 4.5,Anthropic,https://claude.ai/ ,5000,250,MoE,"100,000",20:1,███,74.5,,,86.95,███,"synthetic, web-scale",Nov/2025,🟢,D,https://assets.anthropic.com/m/64823ba7485345a7/Claude-Opus-4-5-System-Card.pdf,"Reasoning, SOTA","""the best model in the world for coding, agents, and computer use."" Announce: https://www.anthropic.com/news/claude-opus-4-5",705,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Nemotron Elastic,NVIDIA,https://huggingface.co/nvidia/Nemotron-Elastic-12B,12,,Dense,110,10:1,███,0.1,,76.2,63.25,,"synthetic, web-scale",Nov/2025,🟢,A,███,Reasoning,"""Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. 
Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning...We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens""",704,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GeoVista,Tencent,https://github.com/ekonwang/GeoVista,7,,Dense,"18,000","2,572:1",███,1.2,,,,,"synthetic, web-scale",Nov/2025,🟢,A,https://arxiv.org/abs/2511.15705,███,"Base model: Qwen2.5-VL-7B-Instruct. ""GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. "" Project page: https://ekonwang.github.io/geo-vista/",703,███,███,███,███,███,"4,000",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ OLMo 3,Allen AI,https://huggingface.co/collections/allenai/olmo-3,32,,Dense,"6,000",188:1,███,1.5,85.4,,58.1,,"synthetic, web-scale",Nov/2025,🟢,A,https://www.datocms-assets.com/64837/1763662397-1763646865-olmo_3_technical_report-1.pdf,Reasoning,███,702,███,███,███,███,Apache 2.0,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemini 3 Pro,Google DeepMind,https://gemini.google.com/,3000,150,MoE,"100,000",███,███,57.7,,90.1,93.8,45.8,"synthetic, web-scale",Nov/2025,🟢,D,https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf,"Reasoning, SOTA","""The knowledge cutoff date for Gemini 3 Pro was January 2025.""",701,███,███,███,███,███,"1,000,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Grok 4.1,xAI,https://grok.com/,3000,150,███,"80,000",27:1,███,51.6,,,,,"synthetic, web-scale",Nov/2025,🟢,D,https://x.ai/news/grok-4-1,Reasoning,"xAI MoE reasoning model; API/closed-weight only; focuses on extended chain-of-thought 
reasoning for math, science, and coding; positioned as an incremental improvement over Grok 4 with the same API interface. Specific benchmark details not publicly disclosed.",700,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Baguettotron,PleIAs,https://huggingface.co/PleIAs/Baguettotron,0.321,,Dense,200,624:1,███,0.03,40,,,███,"synthetic, web-scale",Nov/2025,🟢,A,https://huggingface.co/PleIAs/Baguettotron,Reasoning,"""The name is both a nod to French origins and to the unusual shape of the model: with 80 layers, Baguettotron is currently the deepest SLM in its size range.""",699,███,███,███,███,Apache 2.0,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ERNIE-5.0-Preview-1022,Baidu,https://ernie.baidu.com/,2400,120,MoE,"100,000",42:1,███,███,,,,,"synthetic, web-scale",Nov/2025,🟢,C,https://arxiv.org/abs/2602.04705,Reasoning,Very low performance on ALPrompt. 2.4T params confirmed: https://global.chinadaily.com.cn/a/202511/13/WS691571bda310d6866eb29500.html,698,███,███,███,███,███,"131,072",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-5.1,OpenAI,https://chatgpt.com/,3000,150,MoE,"114,000",38:1,███,61.6,91,,88.1,,"synthetic, web-scale",Nov/2025,███,D,https://openai.com/index/gpt-5-1/,"Reasoning, SOTA",Personality change via fine-tuning. GPQA (no tools) increased from GPT-5=85.7 to GPT-5.1=88.1. MMLU is for Spanish.,697,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ TiDAR,NVIDIA,https://tidarlm.github.io/,8,,Dense,"36,150","4,519:1",███,1.8,76.57,,,,"synthetic, web-scale",Nov/2025,███,A,https://arxiv.org/abs/2511.08923,"Reasoning, Diffusion","Base model: Qwen3-8B (36T) + 150B continual training. 
""TiDAR, a sequence-level hybrid architecture that drafts tokens (Thinking) in Diffusion and samples final outputs (Talking) AutoRegressively - all within a single forward pass using specially designed structured attention masks""",696,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ SONIC,NVIDIA,https://nvlabs.github.io/GEAR-SONIC/,0.04,,Dense,"1,000","25,000:1",███,0.02,███,,,,video,Nov/2025,🟢,A,https://arxiv.org/abs/2511.07820,,"Supersizing mOtion tracking for Natural humanoId Control (SONIC). Training dataset calcs: (700 hours * 3,600 seconds/hour * 50 frames/second ) / 1 frame/token = 126M tokens (refined to 100M+ after rigorous filtering ); 150,000 steps * 6.67M tokens/batch = 1.0T total tokens seen during training.",695,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ JustRL-Nemotron-1.5B,Tsinghua,https://huggingface.co/hbx/JustRL-Nemotron-1.5B,1.5,,Dense,"9,000","6,000:1",███,0.4,███,,,,"synthetic, web-scale",Nov/2025,🟢,C,https://relieved-cafe-fe1.notion.site/JustRL-Scaling-a-1-5B-LLM-with-a-Simple-RL-Recipe-24f6198b0b6b80e48e74f519bfdaf0a8,Reasoning,"""JustRL, a simple recipe with fixed hyperparameters, achieves state-of-the-art performance on two different 1.5B base models (54.5% and 64.3% across 9 math benchmarks) while using 2× less compute than sophisticated approaches. The same hyperparameters transfer across both models without tuning, and training remains stable over thousands of steps without intervention. 
This suggests the field may be adding complexity to solve problems that disappear with a stable, scaled-up baseline.""",694,███,███,███,███,███,"131,072",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ERNIE-4.5-VL-28B-A3B-Thinking,Baidu,https://huggingface.co/baidu/ERNIE-4.5-VL-28B-A3B-Thinking,28,3,MoE,"15,000",536:1,███,2.2,78.9,66,,,"synthetic, web-scale",Nov/2025,🟢,C,https://github.com/PaddlePaddle/ERNIE,███,Open-sourced 12/Nov/2025 from Jun/2025 release.,693,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ HOPE,Google DeepMind,,1.3,,Dense,100,77:1,███,0.04,,,,,"synthetic, web-scale",Nov/2025,🟡,B,https://abehrouz.github.io/files/NL.pdf,███,"""Combining our self-modifying sequence model with the continuum memory system, we present a learning module, called HOPE, showing promising results in language modeling, continual learning, and long-context reasoning tasks."" Announce: https://research.google/blog/introducing-nested-learning-a-new-ml-paradigm-for-continual-learning/ May be released after paper is public.",692,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Kimi K2 Thinking,Moonshot AI,https://kimi.com/,1000,32,MoE,"15,500",16:1,███,13.1,94.4,84.6,███,44,"synthetic, web-scale",Nov/2025,🟢,A,https://moonshotai.github.io/Kimi-K2/thinking.html,"Reasoning, SOTA","1T parameters and 384 experts. Open source SOTA. 
HLE=51.0 on text-only subset, compare to Grok-4 HLE=50.7 also on text-only, but Grok-4 HLE=44.4 on HLE full, ∴ Kimi K2 Thinking HLE≈44 full (estimated).",691,███,███,███,███,Proprietary,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Ling-1T,Inclusion AI,https://huggingface.co/inclusionAI/Ling-1T,1000,50,MoE,"20,750",21:1,███,15.2,86.03,82.04,72.98,,"synthetic, web-scale",Nov/2025,🟢,A,https://arxiv.org/abs/2510.22115,,"MoE with 1T parameters at high sparsity; trained with full-scale FP8, reasoning-oriented data, mid-training CoT activation, and DFT/Evo-CoT reinforcement fine-tuning; achieves up to 7-fold active-compute efficiency compared with dense counterparts and establishes a new Pareto frontier of reasoning accuracy versus computational efficiency. Open-weight.",███,███,███,███,███,MIT,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GEN-0,Generalist,https://generalistai.com/blog/nov-04-2025-GEN-0,10,,███,"10,000","1,000:1",███,1.1,,,,,robotics,Nov/2025,🟡,C,https://generalistai.com/blog/nov-04-2025-GEN-0,SOTA,"""GEN-0, a new class of embodied foundation models built for multimodal training directly on high-fidelity raw physical interaction. Its architecture builds on the strengths of vision and language models while also going beyond them—natively designed to capture human-level reflexes and physical commonsense. One core feature is Harmonic Reasoning, in which the models are trained to simultaneously think and act seamlessly... 
GEN-0 is pretrained on our in-house robotics dataset, which includes over 270,000 hours of real-world diverse manipulation data, growing at a rate of 10,000 hours a week and accelerating.""",689,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Ouro,ByteDance,https://huggingface.co/ByteDance/Ouro-2.6B,2.6,,Dense,"7,700","2,962:1",███,0.5,,55.73,38.4,,"synthetic, web-scale",Oct/2025,🟢,A,https://arxiv.org/abs/2510.25741,███,"""We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective... and (iii) scaling to 7.7T tokens.""",688,███,███,███,███,Apache 2.0,"65,536",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ CALM,Wechat,https://github.com/shaochenze/calm,1.82,,Dense,230,127:1,███,0.07,,,,,web-scale,Oct/2025,🟢,A,███,,"""Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9% accuracy... We train our models on the Pile uncopyrighted dataset (Gao et al., 2020). 
The raw text is processed with the Llama 3 tokenizer (Grattafiori et al., 2024), resulting in a training set of ∼230B tokens.""",687,███,███,███,███,███,"4,000",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Kimi-Linear,Moonshot AI,https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct,48,3,MoE,"5,700",119:1,███,1.7,,51,,,"synthetic, web-scale",Oct/2025,███,A,https://github.com/MoonshotAI/Kimi-Linear?tab=readme-ov-file,,"""Kimi Linear is a hybrid linear attention architecture that outperforms traditional full attention methods across various contexts, including short, long, and reinforcement learning (RL) scaling regimes. At its core is Kimi Delta Attention (KDA)—a refined version of Gated DeltaNet that introduces a more efficient gating mechanism to optimize the use of finite-state RNN memory.""",686,███,███,███,███,███,"1,000,000",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MiniMax-M2,MiniMax,https://huggingface.co/MiniMaxAI/MiniMax-M2,230,10,MoE,"7,200",32:1,███,4.3,,82,78,31.8,web-scale,Oct/2025,███,C,https://platform.minimax.io/docs/guides/text-generation,Reasoning,"MoE with 229.9B total / 9.8B active params, designed end-to-end for agentic deployment; trained with agent-driven data pipelines and Forge, a scalable agent-native RL system with windowed-FIFO scheduling and prefix-tree merging; M2.7 checkpoint demonstrates self-evolution capabilities including modifying its own architectural scaffolding. Open-weight.",685,███,███,███,███,Other,"204,800",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MACE-MH-1,Cambridge/LBNL,https://huggingface.co/mace-foundations/mace-mh-1,0.025,0.025,Dense,0,9:1,███,0.000,,,,,materials chemistry,Oct/2025,🟢,A,███,,"MACE-MH-1 (Multi-Head 1). 
Features Multiple Heads (OMAT PBE, OMOL r2scan, OC20) to maintain high accuracy across domains",684,███,███,███,███,Other,███,UK,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeepSeek-OCR,DeepSeek-AI,https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf,3,0.57,MoE,"6,000","2,000:1",███,███,,,,,special,Oct/2025,🟢,C,https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf,,"2D vision tokens for 1D text achieves huge compression. Encoder/Decoder: DeepEncoder 380M (80M SAM-base + 300M CLIP-large), DeepSeek-3B-MoE (A570M).",683,███,███,███,███,MIT,"8,192",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ UserLM-8b,Microsoft,https://huggingface.co/microsoft/UserLM-8b,8,,Dense,"1,000",125:1,███,0.3,,,,,WildChat,Oct/2025,🟢,C,https://huggingface.co/microsoft/UserLM-8b,,"""we trained UserLM-8b to simulate the “user” role in conversation (by training it to predict user turns in a large corpus of conversations called WildChat).""",███,███,███,███,███,MIT,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ CoDA,Salesforce,https://huggingface.co/Salesforce/CoDA-v0-Instruct,1.7,,Dense,180,106:1,███,0.06,,,,,"synthetic, web-scale",Oct/2025,🟢,A,https://github.com/SalesforceAIResearch/CoDA/blob/main/technical_report.pdf,Diffusion,"""diffusion coder trained on TPU [Google TPU v4-1024 VM]""",███,███,███,███,███,CC-BY-NC 4.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ TRM,Samsung,https://github.com/SamsungSAILMontreal/TinyRecursiveModels,0.007,,Dense,0,15:1,███,0.000,,,,,Mazes (ARC-AGI),Oct/2025,🟢,███,https://arxiv.org/abs/2510.04871v1,,"""Tiny Recursive Model (TRM), a much simpler recursive reasoning approach that achieves significantly higher generalization than HRM, while using a single tiny network with only 2 layers""",680,███,███,███,███,MIT,███,South Korea,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Granite-4.0 
Small,IBM,https://huggingface.co/ibm-granite/granite-4.0-h-small,32,9,MoE,"15,000",469:1,███,2.3,78.33,55.47,40.63,,"synthetic, web-scale",Oct/2025,🟢,C,https://www.ibm.com/granite/docs/models/granite,Reasoning,Announce: https://www.ibm.com/new/announcements/ibm-granite-4-0-hyper-efficient-high-performance-hybrid-models,███,███,███,███,███,Apache 2.0,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GLM-4.6,Z.AI,https://huggingface.co/zai-org/GLM-4.6,355,32,MoE,"22,000",███,███,9.3,,,82.9,30.4,"synthetic, web-scale",Sep/2025,🟢,A,https://z.ai/blog/glm-4.6,Reasoning,"""context window has been expanded from 128K to 200K tokens""",678,███,███,███,███,MIT,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Ring-1T-preview,Inclusion AI,https://huggingface.co/inclusionAI/Ring-1T-preview,1000,48.5,MoE,"20,000",20:1,███,14.9,,███,,,"synthetic, web-scale",Sep/2025,🟢,C,https://huggingface.co/inclusionAI/Ring-1T-preview,Reasoning,"MoE with 1T parameters on the Ling 2.0 architecture, pretrained on 20T tokens and post-trained via RLVR with the AReaL framework and icepop efficient RL method; scores 92.6 on AIME 2025 and solved IMO 2025 Problem 3 on first attempt; preview release with known language-mixing and repetitive-reasoning artifacts. Open-weight.",677,███,███,███,███,MIT,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Claude Sonnet 4.5,Anthropic,https://claude.ai/ ,1000,20,MoE,"80,000",80:1,███,29.8,,,83.4,,"synthetic, web-scale",Sep/2025,🟢,D,https://assets.anthropic.com/m/12f214efcc2f457a/original/Claude-Sonnet-4-5-System-Card.pdf,███,"The Claude Sonnet 4.5 ""system card"" is an absolute farce. 
Announce: https://www.anthropic.com/news/claude-sonnet-4-5",676,███,███,███,███,Proprietary,"200,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemini Robotics 1.5,Google DeepMind,,200,10,MoE,"20,000",100:1,███,6.7,,,59.6,,"synthetic, web-scale",Sep/2025,🟢,F,https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf,███,"2. ""vision-language-action (VLA) model turns visual information and instructions into motor commands for a robot to perform a task."" Available to select partners.",675,███,███,███,███,███,"8,192",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemini Robotics-ER 1.5,Google DeepMind,https://aistudio.google.com/?model=gemini-robotics-er-1.5-preview,30,1.5,███,"30,000","1,000:1",███,3.2,,,83.3,,"synthetic, web-scale",Sep/2025,🟢,D,https://storage.googleapis.com/deepmind-media/gemini-robotics/Gemini-Robotics-1-5-Tech-Report.pdf,Reasoning,"1. ""vision-language model (VLM) reasons about the physical world, natively calls digital tools and creates detailed, multi-step plans to complete a mission."" Available to all devs.",674,███,███,███,███,Proprietary,"8,192",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ TimesFM-ICF,Google,https://huggingface.co/collections/google/timesfm-release-66e4be5fdb56e960c1e482a6,0.2,,Dense,███,500:1,███,0.01,,,,,special,Sep/2025,🔴,A,https://research.google/blog/time-series-foundation-models-can-be-few-shot-learners/,,TimesFM-ICF is 6.8% more accurate than TimesFM (Base). Time-series forecasting only. 
'a large pretraining corpus of 100B real world time-points' may be more than 100B tokens.,673,███,███,███,███,Apache 2.0,512,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen3-Max,Alibaba,https://chat.qwen.ai/,███,50,MoE,"36,000",36:1,███,20.0,,,85.4,,"synthetic, web-scale",Sep/2025,🟢,D,https://qwen.ai/blog?id=241398b9cd6353de490b0f82806c7848c5d2777d&from=research.latest-advancements-list,Reasoning,"""Qwen3-Max-Thinking — still under active training — is already demonstrating remarkable potential. When augmented with tool usage and scaled test-time compute, the Thinking variant has achieved 100% on challenging reasoning benchmarks such as AIME 25 and HMMT. """,672,███,███,███,███,███,"131,072",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen3-Omni,Alibaba,https://github.com/QwenLM/Qwen3-Omni?tab=readme-ov-file,30,1.5,MoE,"17,000",567:1,███,2.4,88.8,,73.1,███,"synthetic, web-scale",Sep/2025,🟢,A,https://github.com/QwenLM/Qwen3-Omni/blob/main/assets/Qwen3_Omni.pdf,Reasoning,"""Qwen3-Omni is a unified end-to-end model capable of processing multiple modalities, such as text, audio, image and video, and generating real-time text or speech response.""... ""pretraining utilizes a large-scale dataset containing approximately 2 trillion tokens, with the following distribution across modalities: text (0.57 trillion), audio (0.77 trillion), image (0.82 trillion), video (0.05 trillion), and video-audio (0.05 trillion).""",671,███,███,███,███,███,"131,072",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeepSeek-V3.1-Terminus,DeepSeek-AI,https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Terminus,685,37,MoE,"15,640",███,███,10.9,,85,80.7,21.7,"synthetic, web-scale",Sep/2025,🟢,C,https://api-docs.deepseek.com/news/news250922,"SOTA, Reasoning",Hybrid reasoning. 
Dataset tokens: https://x.com/deepseek_ai/status/1958417072536608952 HLE: https://x.com/deepseek_ai/status/1958417068568481854/photo/2,670,███,███,███,███,MIT,"163,840",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Isaac 0.1,Perceptron,https://huggingface.co/PerceptronAI/Isaac-0.1,2,,Dense,"2,000","1,000:1",███,0.2,███,,,,"synthetic, web-scale",Sep/2025,🟢,C,https://www.perceptron.inc/blog/introducing-isaac-0-1,,"""perceptive-language model...delivering capabilities that meet or exceed those of models over 50 times its size. Founded by the team behind Meta's Chameleon multimodal models, Perceptron is tackling a fundamental challenge: bringing the power of physical AI to the dynamic, multimodal, and real-time environments we live and work in.""",669,███,███,███,███,CC-BY-NC 4.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Grok 4 Fast,xAI,https://grok.com/,3000,150,MoE,"20,000",███,███,25.8,,,85.7,20,"synthetic, web-scale",Sep/2025,🟢,D,https://x.ai/news/grok-4-fast,"Reasoning, SOTA","""2M token context window, and a unified architecture that blends reasoning and non-reasoning modes in one model.""",668,███,███,███,███,███,"2,000,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ VaultGemma,Google DeepMind,https://huggingface.co/google/vaultgemma-1b,1,,Dense,"13,000","13,000:1",███,0.4,,,,,web-scale,Sep/2025,🟢,C,https://services.google.com/fh/files/blogs/vaultgemma_tech_report.pdf,███,"""Differential Privacy (DP) has emerged as the gold standard, providing a rigorous, mathematical framework to limit the influence of any single example in the training data on the resulting model. 
A model trained with DP provably bounds the reconstruction or leakage of information tied to individual data points."" Announce: https://research.google/blog/vaultgemma-the-worlds-most-capable-differentially-private-llm/",667,███,███,███,███,███,"1,024",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen3-Next-80B-A3B,Alibaba,███,80,3,MoE,"15,000",188:1,███,3.7,84.72,66.05,43.43,,"synthetic, web-scale",Sep/2025,🟢,A,https://qwen.ai/blog?id=4074cca80393150c248e508aa62983f9cb7d27cd&from=research.latest-advancements-list,Reasoning,"""Qwen3-Next introduces several key improvements: a hybrid attention mechanism, a highly sparse Mixture-of-Experts (MoE) structure, training-stability-friendly optimizations, and a multi-token prediction mechanism for faster inference.""",666,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ K2-Think,MBZUAI,https://www.k2think.ai/,32,,███,"18,000",563:1,███,2.5,,,71.08,9.95,"synthetic, web-scale",Sep/2025,🟢,A,https://arxiv.org/abs/2509.07604,Reasoning,"""Built on the Qwen2.5 base model, our system shows that smaller models can compete at the highest levels by combining advanced post-training and test-time computation techniques. The approach is based on six key technical pillars: Long Chain-of-thought Supervised Finetuning, Reinforcement Learning with Verifiable Rewards (RLVR), Agentic planning prior to reasoning, Test-time Scaling, Speculative Decoding, and Inference-optimized Hardware, all using publicly available open-source datasets.""",665,███,███,███,███,Apache 2.0,"128,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ mmBERT,JHU,https://huggingface.co/collections/jhu-clsp/mmbert-a-modern-multilingual-encoder-68b725831d7c6e3acc435ed4,0.307,,Dense,"3,000",███,███,0.1,,,,,"synthetic, web-scale",Sep/2025,🟢,A,https://arxiv.org/abs/2509.06888,,"""a modern multilingual encoder trained on 3T tokens and 1833 languages. 
We introduce several novel elements in training: an inverse masking schedule and a cascading annealed language learning schedule for multilingual data"" Announce: https://huggingface.co/blog/mmbert",664,███,███,███,███,███,512,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ERNIE X1.1,Baidu,███,424,47,MoE,"30,000",71:1,███,11.9,,,,,"synthetic, web-scale",Sep/2025,🟢,D,https://www.prnewswire.com/news-releases/baidu-unveils-reasoning-model-ernie-x1-1-with-upgrades-in-key-capabilities-302551170.html,Reasoning,Dataset: Estimate only. Params: ERNIE 4.5 family MoE (424/47); X1.1 same arch,663,███,███,███,███,███,"131,072",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ERNIE-4.5-21B-A3B-Thinking,Baidu,https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking,21,3,MoE,"15,000",715:1,███,1.9,,,,███,"synthetic, web-scale",Sep/2025,🟢,C,https://huggingface.co/baidu/ERNIE-4.5-21B-A3B-Thinking,Reasoning,"MoE with 21B total / 3B active params (64 experts, 6 active + 2 shared), 128K context; post-trained for extended chain-of-thought reasoning with enhanced tool use and function calling; significantly improved performance on reasoning tasks in logic, mathematics, science, and coding. Apache 2.0, open-weight.",662,███,███,███,███,Apache 2.0,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Klear-46B-A2.5B,Kuaishou,https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-Instruct,46,2.5,MoE,"22,000",479:1,███,3.4,80.5,███,35.3,,"synthetic, web-scale",Sep/2025,🟢,A,https://huggingface.co/Kwai-Klear/Klear-46B-A2.5B-Instruct,,"MoE with 46B total / 2.5B active params (256 experts), trained on 22T+ tokens via three-stage progressive curriculum (foundational 12T, complexity 8T, reasoning+long-context 2T); 64K context; scores 86.4 on MATH500 and 63.61 on MMLU-Pro. 
Open-weight from Kuaishou.",661,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ TildeOpen-30b,Tilde AI,███,30,,Dense,"2,000",67:1,███,0.8,,,,,"synthetic, web-scale",Sep/2025,🟢,A,https://tilde.ai/lv/tildeopen-llm/,,"""language data from across Europe""",660,███,███,███,███,███,"65,536",Latvia,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen3-Max-Preview,Alibaba,https://chat.qwen.ai/,1000,50,MoE,"36,000",███,███,20.0,,,64.6,,"synthetic, web-scale",Sep/2025,🟢,D,https://modelstudio.console.alibabacloud.com/?tab=doc#/doc/?type=model&url=2840914_2&modelId=qwen3-max-preview,,"GPQA score is SuperGPQA. ""our biggest model yet, with over 1 trillion parameters""",659,███,███,███,███,Proprietary,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Kimi K2-Instruct-0905,Moonshot AI,https://huggingface.co/moonshotai/Kimi-K2-Instruct ,1000,32,MoE,"15,500",███,███,13.1,,,,,"synthetic, web-scale",Sep/2025,🟢,A,https://huggingface.co/moonshotai/Kimi-K2-Instruct-0905,"Reasoning, SOTA",1T parameters and 384 experts. Open source SOTA.,658,███,███,███,███,Other,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Apertus,ETH Zürich,https://huggingface.co/swiss-ai/Apertus-70B-Instruct-2509,70,,Dense,"15,000",215:1,███,3.4,65.2,,30.6,,"synthetic, web-scale",Sep/2025,🟢,A,https://github.com/swiss-ai/apertus-tech-report/blob/main/Apertus_Tech_Report.pdf,███,"""Apertus – Latin for “open”"" 1,811 languages. 
Announce: https://ethz.ch/en/news-and-events/eth-news/news/2025/09/press-release-apertus-a-fully-open-transparent-multilingual-language-model.html",657,███,███,███,███,Apache 2.0,"65,536",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ LongCat-Flash,Meituan,https://longcat.ai/,560,18.6,MoE,"20,000",36:1,███,11.2,89.71,82.68,73.23,,"synthetic, web-scale",Sep/2025,🟢,A,https://github.com/meituan-longcat/LongCat-Flash-Chat/blob/main/tech_report.pdf,"Reasoning, SOTA",███,656,███,███,███,███,MIT,"1,000,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Baichuan-M2,Baichuan,https://github.com/baichuan-inc/Baichuan-M2-32B,32,,Dense,"20,000",625:1,███,2.7,,,,,"synthetic, web-scale",Sep/2025,🟢,A,███,Reasoning,"Base: Qwen2.5. ""medical augmented reasoning model""",655,███,███,███,███,███,"131,072",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MAI-1-preview,Microsoft,https://microsoft.ai/news/two-new-in-house-models/,500,25,MoE,"10,000",20:1,███,7.5,,,███,,"synthetic, web-scale",Aug/2025,🟢,D,https://microsoft.ai/news/two-new-in-house-models/,,"MAI=Microsoft artificial intelligence. ""MAI’s first foundation model trained end-to-end... MAI-1-preview is an in-house mixture-of-experts model, pre-trained and post-trained on ~15,000 NVIDIA H100 GPUs. This model is designed to provide powerful capabilities to consumers seeking to benefit from models that specialize in following instructions and providing helpful responses to everyday queries. 
We will be rolling MAI-1-preview out for certain text use cases within Copilot""",654,███,███,███,███,Proprietary,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ grok-code-fast-1,xAI,https://github.com/features/copilot,800,40,MoE,"10,000",13:1,███,9.4,,,,,"synthetic, web-scale",Aug/2025,███,D,https://data.x.ai/2025-08-26-grok-code-fast-1-model-card.pdf,,"""We built grok-code-fast-1 from scratch, starting with a brand-new model architecture. To lay a robust foundation, we carefully assembled a pre-training corpus rich with programming-related content. For post-training, we curated high-quality datasets that reflect real-world pull requests and coding tasks."" Announce: https://x.ai/news/grok-code-fast-1",653,███,███,███,███,Proprietary,"262,144",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Hermes 4,Nous Research,https://huggingface.co/NousResearch/Hermes-4-405B-FP8,405,,Dense,"15,656",███,███,8.4,87.2,80.5,70.5,,"synthetic, web-scale",Aug/2025,🟢,A,https://arxiv.org/abs/2508.18255,Reasoning,Based on Llama 3. Announce: https://hermes4.nousresearch.com/,652,███,███,███,███,Llama 3,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Jet-Nemotron-4B,NVIDIA,https://github.com/NVlabs/Jet-Nemotron,4,,Dense,400,100:1,███,0.1,65.2,44.2,,███,"synthetic, web-scale",Aug/2025,🟢,A,https://arxiv.org/abs/2508.15884v1,Reasoning,"""pre-training corpus and train Jet-Nemotron models for 50B tokens. This is also the setting in Section 2 where we perform PostNAS. At the second stage, we include more high-quality data from math [65] and coding [66, 67] domains into our data mixture. 
The models are then trained on 350B tokens.""",651,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeepSeek-V3.1-Base,DeepSeek-AI,https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base,685,37,MoE,███,23:1,███,10.6,93.7,84.8,80.1,29.8,"synthetic, web-scale",Aug/2025,🟢,C,https://huggingface.co/deepseek-ai/DeepSeek-V3.1-Base,"SOTA, Reasoning",Hybrid reasoning. Dataset tokens: https://x.com/deepseek_ai/status/1958417072536608952 HLE: https://x.com/deepseek_ai/status/1958417068568481854/photo/2,650,███,███,███,███,███,"163,840",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Nemotron Nano 2,NVIDIA,https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-12B-v2-Base,12.31,,Dense,"20,000","1,625:1",███,1.7,78.24,63.98,64.48,,"synthetic, web-scale",Aug/2025,🟢,A,███,Reasoning,Announce: https://research.nvidia.com/labs/adlr/NVIDIA-Nemotron-Nano-2/,649,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemma 3 270M,Google DeepMind,https://huggingface.co/google/gemma-3-270m-it,0.27,,Dense,"6,000","22,223:1",███,0.1,,,,,web-scale,Aug/2025,███,A,https://developers.googleblog.com/en/introducing-gemma-3-270m/,,"Dense 270M-param model (170M embeddings for 256K-token vocabulary, 100M transformer blocks); ships with QAT INT4 weights consuming just 0.75% battery per 25 conversations on Pixel 9 Pro; designed for fine-tuning on high-volume tasks such as sentiment analysis, entity extraction, and query routing rather than general chat. Open-weight.",648,███,███,███,███,Gemma,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-5,OpenAI,https://poe.com/GPT-5,3000,150,MoE,"114,000",38:1,███,61.6,91,███,89.4,42,"synthetic, web-scale",Aug/2025,🟢,D,https://openai.com/index/gpt-5-system-card/,"SOTA, Reasoning",Announce: https://openai.com/index/introducing-gpt-5/. 
MMLU is based on ES and PT translated from EN.,647,███,███,███,███,███,"1,000,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ gpt-oss-120b,OpenAI,https://huggingface.co/openai/gpt-oss-120b,116.8,5.1,MoE,"30,000",257:1,███,6.2,90,,80.1,19,"synthetic, web-scale",Aug/2025,🟢,C,https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf,"Reasoning, SOTA",116.8B total parameters and 5.1B “active” parameters per token per forward pass. https://openai.com/index/introducing-gpt-oss/,███,███,███,███,███,███,"128,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ gpt-oss-20b,OpenAI,https://huggingface.co/openai/gpt-oss-20b,20.9,3.6,MoE,"13,000",623:1,███,1.7,███,,71.5,17.3,"synthetic, web-scale",Aug/2025,🟢,C,https://cdn.openai.com/pdf/419b6906-9da6-406c-a19d-1bb078ac7637/oai_gpt-oss_model_card.pdf,"Reasoning, SOTA",20.9B total and 3.6B active parameters. https://openai.com/index/introducing-gpt-oss/,645,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Claude Opus 4.1,Anthropic,https://claude.ai/ ,5000,250,MoE,"100,000",███,███,74.5,,,80.9,,"synthetic, web-scale",Aug/2025,🟢,D,https://www.anthropic.com/news/claude-opus-4-1,"Reasoning, SOTA","Achieves 74.5% on SWE-bench Verified (with extended thinking); one standard deviation improvement over Opus 4 on Windsurf's junior developer benchmark; strong multi-file code refactoring and agentic search; pricing unchanged from Opus 4. 
API-only via claude.ai, Claude Code, Bedrock, and Vertex AI; closed weights.",644,███,███,███,███,███,"200,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GLM-4.5,Z.AI,https://huggingface.co/zai-org/GLM-4.5,355,32,MoE,"22,000",62:1,███,9.3,,84.6,79.1,14.4,"synthetic, web-scale",Jul/2025,███,A,https://z.ai/blog/glm-4.5,Reasoning,"MoE with 355B total / 32B active params; 128K context; hybrid thinking/non-thinking modes with MTP layers for efficient inference; scores 63.2 across 12 industry benchmarks (3rd among open and proprietary models); GLM-4.5-Air variant is 106B total / 12B active. MIT licensed, open-weight.",643,███,███,███,███,MIT,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ T1,China Telecom Artificial Intelligence Research Institute,https://github.com/Tele-AI/T1,115,,Dense,"10,000",87:1,███,3.6,,,,,███,Jul/2025,🟢,A,https://arxiv.org/abs/2507.18013,Reasoning,"Dense 115B transformer from China Telecom AI Research Institute; pretrained on 10T tokens, then SFT, DPO, domain continual pretraining, and RL for code/math; designed for complex reasoning with long Chain-of-Thought support; claims to outperform o1-mini and GPT-4o on its evaluation suite. Open-weight.",642,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Intern-S1,Shanghai AI Laboratory/SenseTime,https://huggingface.co/internlm/Intern-S1,235,11.75,MoE,"41,000",175:1,███,10.3,███,83.5,77.3,,"synthetic, web-scale",Jul/2025,🟢,C,https://huggingface.co/internlm/Intern-S1,"Reasoning, SOTA","41T tokens assumes base model of Qwen3. 
""Built upon a 235B MoE language model and a 6B Vision encoder, Intern-S1 has been further pretrained on 5 trillion tokens of multimodal data""",641,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Step 3,StepFun,https://www.stepfun.com/,321,38,MoE,"18,000",57:1,███,8.0,,,███,,web-scale,Jul/2025,🟢,C,https://github.com/stepfun-ai/Step3/blob/main/Step3-Sys-Tech-Report.pdf,,https://x.com/CyouSakura/status/1948767450751009227,640,███,███,███,███,Apache 2.0,"262,144",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen3-235B-A22B-Thinking-2507,Alibaba,https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507,235,22,MoE,"36,000",154:1,███,9.7,93.8,84.4,81.1,,"synthetic, web-scale",Jul/2025,🟢,A,https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507,Reasoning,"""Qwen3 is pre-trained on 36 trillion tokens across 119 languages"" MMLU score is MMLU-Redux.",███,███,███,███,███,███,"262,144",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ KAT-V1-200B,Kuaishou,,200,40,MoE,███,25:1,███,3.3,,82.3,78.2,,"synthetic, web-scale",Jul/2025,🔴,D,https://arxiv.org/abs/2507.08297,Reasoning,"In training as of Jul/2025. 
""to address the overthinking problem in reasoning-intensive tasks""",638,███,███,███,███,Proprietary,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ KAT-V1-40B,Kuaishou,https://huggingface.co/Kwaipilot/KAT-V1-40B,40,,Dense,"5,000",125:1,███,1.5,███,77.8,75.1,,"synthetic, web-scale",Jul/2025,🟢,C,https://arxiv.org/abs/2507.08297,Reasoning,"""to address the overthinking problem in reasoning-intensive tasks""",637,███,███,███,███,Other,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen3-Coder-480B-A35B-Instruct,Alibaba,https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct,480,35,MoE,"36,000",75:1,███,13.9,███,,,,"synthetic, web-scale",Jul/2025,🟢,A,https://qwenlm.github.io/blog/qwen3-coder/,,"MoE with 480B total / 35B active params; 256K native context (1M with extrapolation); pretrained on 7.5T tokens with 70% code ratio; post-trained with execution-feedback code RL and Long-Horizon RL for multi-turn agent trajectories; state-of-the-art among open models on Agentic Coding, Browser-Use, and Tool-Use, comparable to Claude Sonnet 4.",636,███,███,███,███,███,"262,144",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen3-235B-A22B-Instruct-2507,Alibaba,https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507,235,███,MoE,"36,000",154:1,███,9.7,93.1,83,77.5,,"synthetic, web-scale",Jul/2025,🟢,A,https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507,SOTA,"""Qwen3 is pre-trained on 36 trillion tokens across 119 languages"" MMLU score is MMLU-Redux.",635,███,███,███,███,Apache 2.0,"262,144",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ FlexOlmo,Allen AI,https://huggingface.co/allenai/FlexOlmo-7x7B-1T,37,20,MoE,"4,150",113:1,███,1.3,60.4,30.9,███,,"synthetic, web-scale",Jul/2025,🟢,A,https://arxiv.org/abs/2507.07024v1,,"""We adopt the OLMo-2 7B setup, starting from a checkpoint pre-trained on 4T tokens and annealed for 50B tokens to produce a public expert.
We then train two additional experts on math and code, each for 50B tokens, and combine them with the public expert to form a three-expert version of FLEXOLMO.""",634,███,███,███,███,Apache 2.0,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ EXAONE 4.0,LG,https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B,32,,Dense,"14,000",438:1,███,2.2,92.3,81.8,75.4,,web-scale,Jul/2025,🟢,███,https://www.lgresearch.ai/data/cdn/upload/EXAONE_4_0.pdf,Reasoning,"“EXAONE”=“EXpert AI for EveryONE”. Training tokens/ratio: EXAONE-3 7.8B=8T tokens (Aug/2024) -> EXAONE-3.5 7.8B=9T -> EXAONE-3.5 32B=6.5T tokens -> EXAONE 4.0 32B=14T tokens. MMLU score is MMLU-Redux. Interesting: ""To focus [RL] training on more informative data samples, we perform accuracy-based filtering by generating eight responses from the SFT model and excluding samples where all eight responses are correct, a pre-filtering step that removes problems that are easy for the model to avoid inefficient training.""",633,███,███,███,███,Other,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Kimi K2,Moonshot AI,https://huggingface.co/moonshotai/Kimi-K2-Instruct,1000,32,MoE,"15,500",16:1,███,13.1,███,81.1,75.1,4.7,"synthetic, web-scale",Jul/2025,🟢,A,https://moonshotai.github.io/Kimi-K2/,"Reasoning, SOTA",1T parameters and 384 experts. Open source SOTA.,632,███,███,███,███,███,"131,072",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Reka Flash 3.1,Reka AI,https://huggingface.co/RekaAI/reka-flash-3.1,21,,Dense,"5,000",239:1,███,1.1,,,,███,web-scale,Jul/2025,🟢,C,https://www.reka.ai/news/introducing-reka-flash,Reasoning,Dense 21B reasoning model with 32K context (35% fewer params than QwQ-32B); SFT + RLOO reinforcement learning targeting general improvements rather than narrow math/coding; scores 65.0 on MMLU-Pro and compresses to 11GB at 4-bit quantization; performs competitively with OpenAI o1-mini. 
Open-weight.,631,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Devstral Medium,Mistral,https://chat.mistral.ai/chat,50,,Dense,"12,000",240:1,███,2.6,,,,,"synthetic, web-scale",Jul/2025,🟢,D,https://mistral.ai/news/devstral-2507,,███,630,███,███,███,███,Proprietary,███,France,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Grok 4,xAI,https://grok.com/,3000,150,MoE,"80,000",27:1,███,51.6,,,88.9,44.4,"synthetic, web-scale",Jul/2025,🟢,C,https://lifearchitect.ai/grok/,"Reasoning, SOTA","2.4T? https://x.com/kalomaze/status/1942996555088134592 ""The smartest AI in the world, 100% on SAT, etc, questions that it's never seen before.""",███,███,███,███,███,███,"262,144",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Phi-4-mini-flash-reasoning,Microsoft,https://huggingface.co/microsoft/Phi-4-mini-flash-reasoning,3.8,,Dense,"5,150","1,356:1",███,███,,,,,"synthetic, web-scale",Jul/2025,🟢,A,https://azure.microsoft.com/en-us/blog/reasoning-reimagined-introducing-phi-4-mini-flash-reasoning/,,"""Pre-training: 5T tokens; Reasoning training: 150B tokens"" ""At the core of Phi-4-mini-flash-reasoning is the newly introduced decoder-hybrid-decoder architecture, SambaY, whose central innovation is the Gated Memory Unit (GMU), a simple yet effective mechanism for sharing representations between layers. The architecture includes a self-decoder that combines Mamba (a State Space Model) and Sliding Window Attention (SWA), along with a single layer of full attention. The architecture also involves a cross-decoder that interleaves expensive cross-attention layers with the new, efficient GMUs. This new architecture with GMU modules drastically improves decoding efficiency, boosts long-context retrieval performance and enables the architecture to deliver exceptional performance across a wide range of tasks. 
""",628,███,███,███,███,███,"262,144",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ T5Gemma,Google DeepMind,https://huggingface.co/google/t5gemma-9b-9b-ul2-it,9,,███,"10,000","1,112:1",███,1.0,76.7,55.7,40.4,,web-scale,Jul/2025,🟢,A,https://developers.googleblog.com/en/t5gemma/,,Related paper: https://arxiv.org/abs/2504.06225. Dataset was Gemma 2 9B on 8T tokens + 2T tokens adapted.,627,███,███,███,███,Gemma,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MedGemma 1 27B,Google DeepMind,https://huggingface.co/google/medgemma-27b-it,27,,Dense,"14,000",519:1,███,2.0,87,███,,,web-scale,Jul/2025,🟢,A,https://arxiv.org/abs/2507.05201,,Multimodal model. Text MMLU score for med only=87.0.,626,███,███,███,███,Other,"8,192",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ R1T2 Chimera,TNG,https://huggingface.co/tngtech/DeepSeek-TNG-R1T2-Chimera,685,37,MoE,"14,800",22:1,███,10.6,,,███,,"synthetic, web-scale",Jul/2025,🟢,A,https://arxiv.org/abs/2506.14794,,"Assembly of Experts-method of V3-0324, R1, R1-0528. Announce: https://x.com/tngtech/status/1940531045432283412?s=46",625,███,███,███,███,███,"131,072",Germany,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Spectra 1.1,Consortium,,3.6,███,Dense,"1,200",334:1,███,0.2,36.12,,,,"synthetic, web-scale",Jun/2025,🟢,B,https://arxiv.org/abs/2506.23025,,"""Spectra-1.1, an open suite of TriLMs trained on up to 1.2 trillion tokens, demonstrating sustained performance gains at scale. 
Furthermore, to improve inference efficiency, we propose novel 2-bit and 1.6-bit packing schemes for ternary weights""",624,███,███,███,███,Apache 2.0,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DiffuCoder,Apple,https://github.com/apple/ml-diffucoder,7,███,Dense,"5,630",805:1,███,0.7,,,,,"code, The Stack",Jun/2025,🟢,A,https://arxiv.org/abs/2506.20639,Diffusion,"""We adapt our model from Qwen2.5-Coder (Hui et al., 2024) as the base model to perform continual pre-training using the adaptation approach from Gong et al. (2025). During this pre-training, we use a 400B-token code pre-training corpus from RefineCode (Huang et al., 2024) and Stackv2 (Lozhkov et al., 2024).""",623,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Hunyuan-A13B,Tencent,https://huggingface.co/tencent/Hunyuan-A13B-Instruct,80,13,MoE,"7,000",88:1,███,2.5,88.17,67.23,71.2,,"synthetic, web-scale",Jun/2025,🟢,C,███,,"""We have open-sourced Hunyuan-A13B-Pretrain, Hunyuan-A13B-Instruct, Hunyuan-A13B-Instruct-FP8, Hunyuan-A13B-Instruct-GPTQ-Int4 on Hugging Face.""",622,███,███,███,███,Other,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mercury,Inception Labs,https://chat.inceptionlabs.ai/,90,,Dense,"8,000",███,███,2.8,,69,51,3.4,"synthetic, web-scale",Jun/2025,🟢,D,https://www.inceptionlabs.ai/introducing-mercury-our-general-chat-model,Diffusion,Diffusion large language model (dLLM).,621,███,███,███,███,███,"32,768",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mu,Microsoft,https://blogs.windows.com/windows-insider/2025/06/13/announcing-windows-11-insider-preview-build-26200-5651-dev-channel/,0.33,,Dense,███,"1,516:1",███,0.04,,,,,"synthetic, web-scale",Jun/2025,🟢,C,https://blogs.windows.com/windowsexperience/2025/06/23/introducing-mu-language-model-and-how-it-enabled-the-agent-in-windows-settings/,,"""distillation from Microsoft’s Phi models...Mu is an efficient 
330M encoder–decoder language model optimized for small-scale deployment, particularly on the NPUs on Copilot+ PCs. It follows a transformer encoder–decoder architecture""",620,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemini Robotics On-Device,Google DeepMind,https://docs.google.com/forms/u/0/d/1sM5GqcVMWv-KmKY3TOMpVtQ-lDFeAftQ-d9xQn92jCE/viewform?ts=67cef986&edit_requested=true,20,1,MoE,"10,000",500:1,███,1.5,,,,,███,Jun/2025,🟢,D,https://deepmind.google/discover/blog/gemini-robotics-on-device-brings-ai-to-local-robotic-devices/,,See Mar/2025 Gemini Robotics-ER model for comparison. Announce: https://deepmind.google/discover/blog/gemini-robotics-on-device-brings-ai-to-local-robotic-devices/,619,███,███,███,███,███,"8,192",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ICONN-1,ICONNAI,https://huggingface.co/collections/ICONNAI/iconn-1,88,4.4,MoE,"10,000",███,███,3.1,,,,,"synthetic, web-scale",Jun/2025,🟢,C,https://huggingface.co/blog/Enderchef/iconn,,"""ICONN-1 (this version) is optimized for natural, emotionally resonant, and conversational interactions. ICONN-e1 is a specialized variant of the model fine-tuned for advanced reasoning, critical analysis, and complex problem-solving.""",618,███,███,███,███,Apache 2.0,███,,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MiniMax-M1,MiniMax,https://huggingface.co/MiniMaxAI/MiniMax-M1-80k,456,45.9,MoE,"7,200",16:1,███,6.0,,81.1,70,8.4,web-scale,Jun/2025,███,C,https://arxiv.org/abs/2506.13585,Reasoning,Announce: https://www.minimax.io/news/minimaxm1,617,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Magistral Medium,Mistral,https://chat.mistral.ai/chat,50,,Dense,"12,000",240:1,███,2.6,,,70.8,,"synthetic, web-scale",Jun/2025,███,D,https://mistral.ai/static/research/magistral.pdf,Reasoning,Magistral Small=24B. 
Announce: https://mistral.ai/news/magistral,616,███,███,███,███,███,"131,072",France,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Comma v0.1-2T,EleutherAI,https://huggingface.co/common-pile/comma-v0.1-2t,7,,Dense,"2,000",286:1,███,0.4,49.8,,,,web-scale,Jun/2025,🟢,A,https://arxiv.org/abs/2506.05209,███,"""Comma v0.1-2T is a decoder-only transformer that uses the same architecture as Llama 3. Training was done in two stages: first on 1.93 trillion tokens with a cosine learning rate schedule, and second a ""cool-down"" training phase on 75.5 billion tokens from high-quality sources. The final model is the average of 10 checkpoints during this cool-down phase. Both training phases use a batch size of 8.3 million tokens per step. Training was performed using lingua on 512 AMD MI300A GPUs.""",615,███,███,███,███,Apache 2.0,"16,384",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ dots.llm1,Xiaohongshu/RedNote,https://huggingface.co/rednote-hilab/dots.llm1.base,142,14,MoE,"11,200",79:1,███,4.2,83.2,61.9,52.6,,web-scale,Jun/2025,███,A,https://github.com/rednote-hilab/dots.llm1/blob/main/dots1_tech_report.pdf,,"""dots.llm1, a large-scale MoE model that activates 14 billion parameters out of a total of 142 billion parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on 11.2T high-quality tokens and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. 
To foster further research, we open-source intermediate training checkpoints at every one trillion tokens, providing valuable insights into the learning dynamics of large language models.""",614,███,███,███,███,███,"32,768",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemini 2.5 Pro 06-05,Google DeepMind,https://deepmind.google/models/gemini-diffusion/,400,,Dense,"80,000",200:1,███,███,,,86.4,21.6,"synthetic, web-scale",Jun/2025,🟢,D,https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf,"Reasoning, SOTA","""an upgraded preview of Gemini 2.5 Pro, our most intelligent model yet. Building on the version we released in May and showed at I/O, this model will be the generally available, stable version starting in a couple of weeks, ready for enterprise-scale applications.""",613,███,███,███,███,███,"1,000,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MiMo-7B-RL-0530,Xiaomi,███,7,,Dense,"25,000","3,572:1",███,1.4,,58.6,60.6,,"synthetic, web-scale",May/2025,🟢,A,https://arxiv.org/abs/2505.07608,Reasoning,"""[2025.05.30] During the RL training, by continuously expanding the training window size (from 32K to 48K), the performance of MiMo-7B-RL-0530 on AIME24 can be continuously improved and eventually surpass that of DeepSeek R1... MiMo-7B-Base is pre-trained on approximately 25 trillion tokens.""",612,███,███,███,███,MIT,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeepTransformers,Google DeepMind,,1.3,,Dense,100,77:1,███,0.04,███,,,,"synthetic, web-scale",May/2025,🔴,B,https://arxiv.org/abs/2505.23735,,"""Atlas, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. 
Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture.""",611,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Atlas,Google DeepMind,,1.3,,Dense,100,77:1,███,0.04,,,,,"synthetic, web-scale",May/2025,🔴,B,███,,"Atlas, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture.",610,███,███,███,███,Proprietary,"1,024",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeepSeek-R1-0528,DeepSeek-AI,https://chat.deepseek.com/,685,37,MoE,"14,800",22:1,███,10.6,93.4,85,81,███,"synthetic, web-scale",May/2025,🟢,D,https://huggingface.co/deepseek-ai/DeepSeek-R1-0528,"Reasoning, SOTA","Censorship increased significantly. ""overall performance is now approaching that of leading models, such as o3 and Gemini 2.5 Pro."" MMLU shows MMLU-Redux score with lower error rate.",609,███,███,███,███,MIT,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Fathom-R1-14B,Fractal Analytics,███,14,,Dense,"18,000","1,286:1",███,1.7,,,66.16,,"synthetic, web-scale",May/2025,🟢,A,https://huggingface.co/FractalAIResearch/Fathom-R1-14B,Reasoning,"Base R1-distilled-14B model, based on Qwen 14B. 
Media release.",608,███,███,███,███,MIT,███,India,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ QwenLong-L1-32B,Alibaba,https://huggingface.co/Tongyi-Zhiwen/QwenLong-L1-32B,32,,Dense,"18,000",563:1,███,2.5,,,,,"synthetic, web-scale",May/2025,🟢,A,███,Reasoning,"""the first long-context LRM trained with reinforcement learning for long-context reasoning.""",607,███,███,███,███,Apache 2.0,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Claude Opus 4,Anthropic,https://claude.ai/,6000,250,MoE,"100,000",17:1,███,81.6,███,,83.3,,"synthetic, web-scale",May/2025,🟢,D,https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf,"Reasoning, SOTA","""Claude Opus 4 is our most intelligent model to date, pushing the frontier in coding, agentic search, and creative writing. With advanced reasoning and powerful collaboration capabilities…Both models can also alternate between reasoning and tool use—like web search—to improve responses…Claude Opus 4 can work continuously for hours on complex, long-running tasks""",606,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Falcon-H1,TII,https://huggingface.co/tiiuae/Falcon-H1-34B-Instruct-GGUF,34,,Dense,"18,000",███,███,2.6,84.05,58.73,49.66,,"synthetic, web-scale",May/2025,🟢,A,https://huggingface.co/papers/2507.22448,,"""hybrid architecture that combines the strengths of the classical Transformer-based attention mechanism with the State Space Model (SSM), known for its superior long-context memory and computational efficiency.""",605,███,███,███,███,Other,███,UAE,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemini Diffusion,Google DeepMind,https://deepmind.google/models/gemini-diffusion/,███,,Dense,"16,000",400:1,███,2.7,,,40.4,,"synthetic, web-scale",May/2025,🟢,D,https://deepmind.google/models/gemini-diffusion/,Diffusion,"""Gemini Diffusion’s external benchmark performance is comparable to 
much larger models [like Gemini-2.0-Flash-Lite], whilst also being faster.""",604,███,███,███,███,███,"1,000,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemma 3n,Google DeepMind,https://ai.google.dev/gemma/docs/gemma-3n,███,,MatFormer,"8,000","2,000:1",███,0.6,62.1,,,,"synthetic, web-scale",May/2025,🟢,C,https://developers.googleblog.com/en/introducing-gemma-3n/,,Matryoshka Transformer or MatFormer model architecture. 850M (696M / 620M / 582M).,603,███,███,███,███,███,"32,768",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ParScale,Alibaba,https://huggingface.co/ParScale/ParScale-4.7B-P8-Python,4.7,,Dense,"1,000",213:1,███,0.2,35.1,,,,"synthetic, web-scale",May/2025,🟢,███,https://arxiv.org/abs/2505.10475,,"""We introduce the third scaling paradigm for scaling LLMs: leverages parallel computation during both training and inference time (Parallel Scaling, or ParScale)... ParScale can use up to 22× less memory increase and 6× less latency increase compared to parameter scaling that achieves the same performance improvement. It can also recycle an off-the-shelf pre-trained model into a parallelly scaled one by post-training on a small amount of tokens, further reducing the training budget."" MMLU shows for 1.8B models, not the 4.7B models.",602,███,███,███,███,███,"4,096",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ codex-1,OpenAI,https://chatgpt.com/codex,600,30,MoE,"100,000",167:1,███,███,,,,,"synthetic, web-scale",May/2025,🟢,D,https://openai.com/index/introducing-codex/,"Reasoning, SOTA","o3 base. ""codex-1, a version of OpenAI o3 optimized for software engineering. 
It was trained using reinforcement learning on real-world coding tasks in a variety of environments to generate code that closely mirrors human style and PR preferences, adheres precisely to instructions, and can iteratively run tests until it receives a passing result.""",601,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Falcon-Edge,TII,https://huggingface.co/tiiuae/Falcon-E-3B-Instruct,3,,Dense,███,500:1,███,0.2,55.7,27.16,23.59,,"synthetic, web-scale",May/2025,🟢,A,https://huggingface.co/blog/tiiuae/falcon-edge,,"""Falcon-Edge series - a collection of powerful, universal, and fine-tunable language models available in ternary format, based on the BitNet architecture.""",600,███,███,███,███,Other,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ SWE-1,Windsurf,https://windsurf.com/blog/windsurf-wave-9-swe-1,50,,Dense,"8,000",███,███,2.1,,,,,"synthetic, web-scale",May/2025,🟢,D,https://windsurf.com/blog/windsurf-wave-9-swe-1,,"""SWE-1, optimized for the entire software engineering process, not just the task of coding.""",599,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ INTELLECT-2,Prime Intellect,https://chat.primeintellect.ai/,32,,Dense,"18,000",563:1,███,2.5,,,66.8,,web-scale,May/2025,🟢,███,https://storage.googleapis.com/public-technical-paper/INTELLECT_2_Technical_Report.pdf,Reasoning,QwQ-32B base. 
Announce: https://www.primeintellect.ai/blog/intellect-2-release Finished training 30/Apr/2025: https://app.primeintellect.ai/intelligence/intellect-2,598,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Pangu Ultra MoE,Huawei,https://github.com/pangu-tech/pangu-ultra,718,39,MoE,"13,000",19:1,███,10.2,91.5,83.5,75.3,,"synthetic, web-scale",May/2025,🔴,C,https://arxiv.org/abs/2505.04519,Reasoning,"Trained on 6,000 Ascend NPUs (Kunpeng 920 processors in Huawei Atlas 800T A2 servers).",███,███,███,███,███,███,"131,072",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mistral Medium 3,Mistral,https://chat.mistral.ai/chat,50,,Dense,"12,000",240:1,███,2.6,,77.2,57.1,,"synthetic, web-scale",May/2025,███,D,https://mistral.ai/news/mistral-medium-3,,"Multimodal. 50B param estimate based on ""Mistral Medium 3 can also be deployed on any cloud, including self-hosted environments of four GPUs and above."". Note: ""With the launches of Mistral Small in March and Mistral Medium today, it’s no secret that we’re working on something ‘large’ over the next few weeks. With even our medium-sized model being resoundingly better than flagship open source models such as Llama 4 Maverick, we’re excited to ‘open’ up what’s to come :) """,596,███,███,███,███,Proprietary,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Granite-4.0-Tiny-Preview,IBM,https://huggingface.co/ibm-granite/granite-4.0-tiny-preview,7,1,███,"2,500",358:1,███,0.4,60.4,,,,"synthetic, web-scale",May/2025,🟢,A,https://www.ibm.com/new/announcements/ibm-granite-4-0-tiny-preview-sneak-peek,Reasoning,"""the model is only partially trained—it has only seen 2.5T of a planned 15T or more training tokens...Granite 4.0 Tiny-Preview, specifically, is a fine-grained hybrid mixture of experts (MoE) model, with 7B total parameters and only 1B active parameters at inference time... 
Like its predecessors in Granite 3.2 and Granite 3.3, Granite 4.0 Tiny Preview offers toggleable thinking on and thinking off functionality (though its reasoning-focused post-training is very much incomplete).""",595,███,███,███,███,███,"131,072",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Nova Premier,Amazon,https://aws.amazon.com/bedrock/,470,,Dense,"10,000",22:1,███,7.2,87.4,,57.1,,web-scale,Apr/2025,🟢,D,https://assets.amazon.science/e5/e6/ccc5378c42dca467d1abe1628ec9/amazon-nova-premier-technical-report-and-model-card.pdf,,███,594,███,███,███,███,███,"1,000,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Phi-4-reasoning-plus,Microsoft,https://huggingface.co/microsoft/Phi-4-reasoning-plus,14,,Dense,"10,016",716:1,███,1.2,,76,69.3,,"synthetic, web-scale",Apr/2025,🟢,A,https://arxiv.org/abs/2504.21318,,███,593,███,███,███,███,MIT,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Bamba-9B-v2,IBM,https://huggingface.co/ibm-ai-platform/Bamba-9B-v2,9,███,Dense,"3,000",334:1,███,0.5,67.92,25.41,5.93,,"synthetic, web-scale",Apr/2025,🟢,A,https://huggingface.co/blog/ibm-ai-platform/bamba-9b-v2,,"""During Christmas of 2024, IBM, Princeton, CMU, and UIUC released, Bamba v1, a performant Mamba2 based pretrained model with full data lineage trained to 2T tokens. Since then, we have been busy cooking an update with new datasets. Today, we are excited to release Bamba v2, trained for an additional 1T tokens that significantly improves on Bamba v1. The L1 and L2 leaderboard scores outperform Llama 3.1 8B, which was trained with nearly 5x the amount of data. 
All of this with the inference speedup that we get from Mamba2 based architecture, which with the latest vLLM is 2-2.5x faster than similar sized transformer models.""",592,███,███,███,███,███,"262,144",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen3-235B-A22B,Alibaba,https://huggingface.co/Qwen/Qwen3-235B-A22B,235,22,MoE,"36,000",154:1,███,9.7,87.81,68.18,47.47,███,"synthetic, web-scale",Apr/2025,🟢,A,https://github.com/QwenLM/Qwen3/blob/main/Qwen3_Technical_Report.pdf,Reasoning,"Qwen3-235B-A22B. Qwen3-30B-A3B. ""Qwen3 is pre-trained on 36 trillion tokens across 119 languages""",591,███,███,███,███,███,"131,072",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen3-0.6B,Alibaba,https://huggingface.co/Qwen/Qwen3-0.6B,0.6,,Dense,"36,000","60,000:1",███,0.5,,,,,"synthetic, web-scale",Apr/2025,🟢,A,https://github.com/QwenLM/Qwen3/blob/main/Qwen3_Technical_Report.pdf,Reasoning,"Record data ratio 60,000:1. ""Qwen3 is pre-trained on 36 trillion tokens across 119 languages""",███,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ERNIE X1 Turbo,Baidu,https://huggingface.co/spaces/PaddlePaddle/ernie_x1_turbo_demo,200,22,MoE,"30,000",150:1,███,8.2,███,,69,,"synthetic, web-scale",Apr/2025,🟢,D,https://www.prnewswire.com/news-releases/baidu-launches-ernie-4-5-turbo-ernie-x1-turbo-and-new-suite-of-ai-tools-to-empower-developers-and-supercharge-ai-innovation-302438584.html,Reasoning,Announce: https://x.com/Baidu_Inc/status/1915603080336597310 Dataset: Estimate only. 
Params: Turbo = distilled/smaller; ~half of 4.5 family,589,███,███,███,███,Proprietary,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ERNIE 4.5 Turbo,Baidu,https://huggingface.co/spaces/PaddlePaddle/ernie_4.5_turbo_demo,200,22,███,"30,000",150:1,███,8.2,90,,,,"synthetic, web-scale",Apr/2025,🟢,D,https://www.prnewswire.com/news-releases/baidu-launches-ernie-4-5-turbo-ernie-x1-turbo-and-new-suite-of-ai-tools-to-empower-developers-and-supercharge-ai-innovation-302438584.html,,Announce: https://x.com/Baidu_Inc/status/1915603080336597310 Dataset: Estimate only. Turbo = distilled/smaller of ERNIE 4.5,588,███,███,███,███,███,"131,072",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MAI-DS-R1,Microsoft,https://huggingface.co/microsoft/MAI-DS-R1,685,37,MoE,███,22:1,███,10.6,86.8,,,,"synthetic, web-scale",Apr/2025,🟢,A,https://techcommunity.microsoft.com/blog/machinelearningblog/introducing-mai-ds-r1/4405076,Reasoning,"DeepSeek-R1 base. ""MAI-DS-R1, a new open weights DeepSeek R1 model variant... post-trained by the Microsoft AI team to improve its responsiveness on blocked topics and its risk profile, while maintaining its reasoning capabilities and competitive performance.""",587,███,███,███,███,███,"163,840",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemini 2.5 Flash Preview,Google DeepMind,███,80,4,MoE,"20,000",250:1,███,4.2,,,78.3,12.1,"synthetic, web-scale",Apr/2025,🟢,D,https://deepmind.google/technologies/gemini/flash/,Reasoning,"Context in=1M, out=64k. Knowledge cutoff Jan/2025. Codename 'nebula'. Note: Gemini outputs are watermarked. I do not use GDM models. 
https://lifearchitect.ai/watermarking/",586,███,███,███,███,███,"1,000,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ o4-mini,OpenAI,https://chatgpt.com/?model=o4-mini-high,200,10,MoE,"40,000",200:1,███,9.4,88,,81.4,███,"synthetic, web-scale",Apr/2025,🟢,D,https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf,"Reasoning, SOTA",https://openai.com/index/introducing-o3-and-o4-mini/ MMLU shows a translated LOTE.,585,███,███,███,███,███,"200,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ o3,OpenAI,https://chatgpt.com/?model=o3,600,30,MoE,"100,000",167:1,███,25.8,91.2,,83.3,24.9,"synthetic, web-scale",Apr/2025,🟢,D,███,"Reasoning, SOTA",https://openai.com/index/introducing-o3-and-o4-mini/ MMLU shows a translated LOTE.,584,███,███,███,███,███,"200,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ BitNet b1.58 2B4T,Microsoft,███,2,,Dense,"4,000","2,000:1",███,0.3,53.17,,,,"synthetic, web-scale",Apr/2025,🟢,A,https://arxiv.org/abs/2504.12285,,"""the first open-source, native 1-bit Large Language Model (LLM) at the 2-billion parameter scale. 
Trained on a corpus of 4 trillion tokens""",583,███,███,███,███,███,"4,096",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Granite 3.3 8B Instruct,IBM,https://huggingface.co/ibm-granite/granite-3.3-8b-instruct,8,,Dense,"12,000","1,500:1",███,1.0,65.54,,,███,"synthetic, web-scale",Apr/2025,🟢,C,https://www.ibm.com/new/announcements/ibm-granite-3-3-speech-recognition-refined-reasoning-rag-loras,Reasoning,"""Built on top of an updated Granite 3.3 base model and fine-tuned through multi-stage reinforcement learning using TPO and Group Relative Policy Optimization (GRPO), both Granite 3.3 Instruct models demonstrated significant improvement on the highly technical benchmarks conventionally associated with “reasoning” capabilities.""",582,███,███,███,███,Apache 2.0,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GLM-4-0414,Zhipu AI (Tsinghua),https://huggingface.co/THUDM/GLM-Z1-32B-0414,32,,Dense,"15,000",469:1,███,2.3,,,66.1,,"synthetic, web-scale",Apr/2025,🟢,A,https://github.com/THUDM/GLM-4/tree/main?tab=readme-ov-file,Reasoning,"Family: GLM-4-32B-Base-0414, GLM-4-32B-0414, GLM-Z1-32B-0414 (reasoning), GLM-Z1-Rumination-32B-0414 (reasoning + deep research).",███,███,███,███,███,MIT,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ SEA-LION v3.5 70B R,AI Singapore,https://huggingface.co/aisingapore/Llama-SEA-LION-v3.5-70B-R,70,,Dense,"15,000",215:1,███,3.4,,███,,,"synthetic, web-scale",Apr/2025,🟢,A,https://sea-lion.ai/sea-lion-v3-5-and-updated-v3-enhanced-language-models-for-southeast-asia/,Reasoning,"""Based on Llama 3.1 70B. SEA-LION v3.5, our first set of hybrid reasoning models trained on Southeast Asian data. 
Mode selection is managed through the tokenizer’s chat template and offers versatile functionality, handling both complex reasoning tasks and general text generation.""",580,███,███,███,███,Llama 3.1,███,Singapore,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-4.1,OpenAI,https://platform.openai.com/playground/p/HqaxY9MEZ8Ta0zFbzfASn5bJ?mode=chat,300,15,MoE,"20,000",67:1,███,8.2,90.2,███,66.3,5.4,"synthetic, web-scale",Apr/2025,🟢,D,https://openai.com/index/gpt-4-1/,SOTA,"Outperforms GPT‑4o ""across the board, with major gains in coding and instruction following. They also have larger context windows—supporting up to 1 million tokens of context—and are able to better use that context with improved long-context comprehension. They feature a refreshed knowledge cutoff of June 2024.""",579,███,███,███,███,███,"1,000,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DolphinGemma,Google DeepMind,https://blog.google/technology/ai/dolphingemma/,0.4,,Dense,"2,000","5,000:1",███,0.09,,,,,"synthetic, web-scale",Apr/2025,🟢,███,https://blog.google/technology/ai/dolphingemma/,,"""trained on Atlantic spotted dolphin sounds, we anticipate its potential utility for researchers studying other cetacean species, like bottlenose or spinner dolphins... Developed by Google, this AI model makes use of specific Google audio technologies: the SoundStream tokenizer efficiently represents dolphin sounds, which are then processed by a model architecture suited for complex sequences. 
This ~400M parameter model is optimally-sized to run directly on the Pixel phones WDP uses in the field.""",578,███,███,███,███,Gemma,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Apriel-5B,ServiceNow,https://huggingface.co/ServiceNow-AI/Apriel-5B-Instruct,5,,Dense,"4,500",900:1,███,0.5,61.3,29.19,28.36,,"synthetic, web-scale",Apr/2025,🟢,███,https://huggingface.co/ServiceNow-AI/Apriel-5B-Instruct,,"SLAM - ServiceNow Language Models Lab. The first release in the Apriel model family, designed to support research on foundation models.",577,███,███,███,███,MIT,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Seed-Thinking-v1.5,ByteDance,https://github.com/ByteDance-Seed/Seed-Thinking-v1.5,200,20,MoE,"15,000",75:1,███,5.8,,███,77.3,,"synthetic, web-scale",Apr/2025,🟢,C,https://github.com/ByteDance-Seed/Seed-Thinking-v1.5,Reasoning,"""Seed-Thinking-v1.5, capable of reasoning through thinking before responding, resulting in improved performance on a wide range of benchmarks. Seed-Thinking-v1.5 achieves 86.7 on AIME 2024, 55.0 on Codeforces and 77.3 on GPQA, demonstrating excellent reasoning abilities in STEM and coding.""",576,███,███,███,███,Apache 2.0,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Dream 7B,Huawei,https://github.com/HKUNLP/Dream,7,,Dense,580,83:1,███,0.2,69.5,43.3,33,,web-scale,Apr/2025,🟢,A,https://hkunlp.github.io/blog/2025/dream/,███,"""with Huawei Noah’s Ark Lab, we [Hong Kong University] release Dream 7B (Diffusion reasoning model), the most powerful open diffusion large language model to date.""",575,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ UltraLong-8B,NVIDIA,https://huggingface.co/nvidia/Llama-3.1-8B-UltraLong-4M-Instruct,8,,Dense,"15,000","1,875:1",███,1.2,67.31,43.28,,,"synthetic, web-scale",Apr/2025,🟢,A,███,,Llama-3.1-8B-Instruct base. 
4M context window.,574,███,███,███,███,CC-BY-NC 4.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Deepcoder-14B-Preview,Together,https://www.together.ai/blog/deepcoder,14,,Dense,"14,800","1,058:1",███,1.5,,,,,web-scale,Apr/2025,███,C,https://www.together.ai/blog/deepcoder,Reasoning,Base DeepSeek-R1-Distill-Qwen-14B.,573,███,███,███,███,███,"32,768",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Pangu Ultra,Huawei,https://x.com/hbouammar/status/1911370093516185771,135,,Dense,"13,200",98:1,███,4.4,85.4,84.4,74.2,,"synthetic, web-scale",Apr/2025,🟢,███,https://arxiv.org/abs/2504.07866,,"Trained on 8,192 Ascend NPUs (Kunpeng 920 processors in Huawei Atlas 800T A2 servers).",572,███,███,███,███,Proprietary,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Nemotron-H-56B-Base,NVIDIA,https://huggingface.co/nvidia/Nemotron-H-56B-Base-8K,56,,Dense,"20,000",358:1,███,3.5,84.2,███,,,"synthetic, web-scale",Apr/2025,🟢,A,https://arxiv.org/abs/2504.03624,Reasoning,https://research.nvidia.com/labs/adlr/nemotronh/,571,███,███,███,███,███,"8,192",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Llama-3.1-Nemotron-Ultra-253B,NVIDIA,https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1,253,,Dense,"15,600",62:1,███,6.6,88.1,,76.01,,"synthetic, web-scale",Apr/2025,🟢,A,███,Reasoning,"Llama 3.1 405B base. ""Llama-3.1-Nemotron-Ultra-253B-v1 is a large language model (LLM) which is a derivative of Meta Llama-3.1-405B-Instruct (AKA the reference model). It is a reasoning model that is post trained for reasoning, human chat preferences, and tasks, such as RAG and tool calling. The model supports a context length of 128K tokens. 
This model fits on a single 8xH100 node for inference.""",570,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Llama 4 Behemoth,Meta AI,,2000,288,MoE,"30,000",15:1,███,25.8,███,82.2,73.7,,"synthetic, web-scale",Apr/2025,🔴,B,https://ai.meta.com/blog/llama-4-multimodal-intelligence/,SOTA,"Announced Apr/2025, abandoned Jul/2025. ""We also trained a teacher model, Llama 4 Behemoth, that outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM-focused benchmarks such as MATH-500 and GPQA Diamond... 288B active parameters, 16 experts, and nearly two trillion total parameters.""",569,███,███,███,███,Llama 4,"1,000,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Llama 4 Maverick,Meta AI,https://ai.meta.com/blog/llama-4-multimodal-intelligence/,400,███,MoE,"22,000",55:1,███,9.9,,80.5,69.8,,"synthetic, web-scale",Apr/2025,🟢,A,https://ai.meta.com/blog/llama-4-multimodal-intelligence/,SOTA,"""Our most powerful open source multimodal model. 17B active params x 128 experts, 400B total params""",568,███,███,███,███,███,"1,000,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Llama 4 Scout,Meta AI,https://ai.meta.com/blog/llama-4-multimodal-intelligence/,109,17,MoE,███,367:1,███,7.0,,74.3,57.2,,"synthetic, web-scale",Apr/2025,🟢,A,https://ai.meta.com/blog/llama-4-multimodal-intelligence/,,"200 languages, ""includes diverse text, image, and video datasets.""",567,███,███,███,███,Llama 4,"1,000,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Sec-Gemini v1,Google DeepMind,https://security.googleblog.com/2025/04/google-launches-sec-gemini-v1-new.html,███,20,MoE,"20,000",50:1,███,9.4,,,,,"synthetic, web-scale",Apr/2025,🔴,D,https://blog.google/security/google-launches-sec-gemini-v1-new/,,"""Sec-Gemini v1 achieves this by combining Gemini’s advanced capabilities with near real-time cybersecurity knowledge and tooling. 
This combination allows it to achieve superior performance on key cybersecurity workflows, including incident root cause analysis, threat analysis, and vulnerability impact understanding.""",566,███,███,███,███,███,"128,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeepSeek-GRM-27B ,DeepSeek-AI,to be released,27,,Dense,"14,000",519:1,███,2.0,,68.1,56.9,,web-scale,Apr/2025,🟢,A,https://arxiv.org/abs/2504.02495,SOTA,███,565,███,███,███,███,Gemma,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwerky-72B,Featherless AI,https://featherless.ai/models/featherless-ai/Qwerky-72B,72,,Dense,"18,000",250:1,███,3.8,77.46,,,███,"synthetic, web-scale",Apr/2025,🟢,A,https://featherless.ai/models/featherless-ai/Qwerky-72B/readme,,"""As demonstrated with our Qwerky-72B-Preview and prior models such as QRWKV6-32B Instruct Preview, we have successfully converted Qwen 2.5 72B into a RWKV variant without requiring a pretrain on the base model or retraining the model from scratch. 
Enabling us to test and validate the more efficient RWKV Linear attention"" Dataset from Qwen2.5=18,000 tokens.",564,███,███,███,███,███,"18,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Cogito 70B,Deep Cogito,https://huggingface.co/deepcogito/cogito-v1-preview-llama-70B,70,███,Dense,"15,000",215:1,███,3.4,91,78.47,60.61,,"synthetic, web-scale",Apr/2025,🟢,A,https://www.deepcogito.com/research/cogito-v1-preview,,"""We are releasing early checkpoints of models in sizes 3B, 8B, 14B, 32B and 70B trained using this methodology, starting from pretrained Llama / Qwen base checkpoints.""",563,███,███,███,███,Llama 3.1,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Agentic-Tx,Google DeepMind,https://github.com/google-gemini/gemma-cookbook/blob/main/TxGemma/%5BTxGemma%5DAgentic_Demo_with_Hugging_Face.ipynb,200,10,MoE,"20,000",100:1,███,6.7,,,62.4,14.5,"synthetic, web-scale",Mar/2025,🟢,D,https://storage.googleapis.com/research-media/txgemma/txgemma-report.pdf,,"""a therapeutics-focused agentic system powered by Gemini 2.0 Pro. Agentic-Tx is equipped with 18 tools, including: TxGemma as a tool for multi-step reasoning""",███,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ TxGemma,Google DeepMind,https://huggingface.co/google/txgemma-27b-chat,27,,Dense,"14,000",519:1,███,2.0,,,,,"synthetic, web-scale",Mar/2025,███,A,https://storage.googleapis.com/research-media/txgemma/txgemma-report.pdf,Reasoning,"""a suite of efficient, generalist large language models (LLMs) capable of therapeutic property prediction as well as interactive reasoning and explainability. 
Unlike task-specific models, TxGemma synthesizes information from diverse sources, enabling broad application across the therapeutic development pipeline.""",561,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemini 2.5 Pro Preview,Google DeepMind,https://aistudio.google.com/prompts/new_chat?model=gemini-2.0-pro-exp-02-05,400,20,███,"80,000",200:1,███,18.9,,,84,18.8,"synthetic, web-scale",Mar/2025,🟢,D,https://blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025/#building-on-best-gemini,"Reasoning, SOTA","Context in=1M, out=64k. Knowledge cutoff Jan/2025. HLE SOTA. Codename 'nebula'. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/",560,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeepSeek-V3 0324,DeepSeek-AI,https://chat.deepseek.com/,685,37,MoE,"14,800",22:1,███,10.6,,81.2,68.4,,███,Mar/2025,🟢,A,https://huggingface.co/deepseek-ai/DeepSeek-V3-0324,SOTA,"Non-reasoning. Significant increase in benchmark performance compared to original V3 from Dec/2024: MMLU-Pro: 75.9 ➜ 81.2, GPQA: 59.1 ➜ 68.4. 37B active.",559,███,███,███,███,███,"131,072",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Llama-3.3-Nemotron-Super-49B-v1,NVIDIA,https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1,49,,Dense,███,307:1,███,2.9,,,66.67,,web-scale,Mar/2025,🟢,C,https://build.nvidia.com/nvidia/llama-3_3-nemotron-super-49b-v1/modelcard,Reasoning,"Meta Llama-3.3-70B-Instruct derivative ""that is post trained for reasoning, human chat preferences, and tasks, such as RAG and tool calling. 
The model supports a context length of 128K tokens.""",558,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ EXAONE Deep,LG,███,32,,Dense,"6,500",204:1,███,1.5,83,74,66.1,,web-scale,Mar/2025,🟢,A,https://arxiv.org/abs/2503.12524,Reasoning,“EXAONE”=“EXpert AI for EveryONE”. Training tokens/ratio dropped from EXAONE-3 7.8B with 8T (Aug/2024) to 3.5 (Dec/2024) 7.8B with 9T to 32B (also Deep) with 6.5T. Announce: https://www.lgresearch.ai/news/view?seq=543,557,███,███,███,███,███,"32,768",South Korea,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mistral Small 3.1,Mistral,https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503,24,,███,"8,000",334:1,███,1.5,81.01,56.03,37.5,,web-scale,Mar/2025,🟢,C,https://mistral.ai/news/mistral-small-3-1,,"""Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long context capabilities up to 128k tokens without compromising text performance.""",556,███,███,███,███,Apache 2.0,███,France,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ERNIE 4.5,Baidu,https://huggingface.co/baidu/ERNIE-4.5-VL-424B-A47B-PT,424,47,MoE,"16,000",38:1,███,8.7,,79,55,,"synthetic, web-scale",Mar/2025,🟢,C,https://www.prnewswire.com/news-releases/baidu-unveils-ernie-4-5-and-reasoning-model-ernie-x1--makes-ernie-bot-free-ahead-of-schedule-302402490.html,Reasoning,███,555,███,███,███,███,███,"131,072",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ X1,Baidu,https://yiyan.baidu.com/,424,47,MoE,"30,000",71:1,███,11.9,,,,,"synthetic, web-scale",Mar/2025,🟢,D,███,Reasoning,"Params, dataset: Estimate only. 
Same MoE topology as ERNIE 4.5 (Baidu unified arch)",554,███,███,███,███,Proprietary,"32,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ OLMo 2 32B,Allen AI,https://playground.allenai.org/?model=olmo-2-0325-32b-instruct,32,,Dense,███,200:1,███,1.5,78,,,,"synthetic, web-scale",Mar/2025,🟢,A,https://allenai.org/blog/olmo2-32B,,"""the first fully-open model (all data, code, weights, and details are freely available) to outperform GPT3.5-Turbo and GPT-4o mini on a suite of popular, multi-skill academic benchmarks. It is comparable to the leading open-weight models while requiring only a fraction of training compute.""",553,███,███,███,███,███,"4,096",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Command A,Cohere,https://dashboard.cohere.com/playground/chat,111,,███,"8,000",73:1,███,3.1,85,,,,"synthetic, web-scale",Mar/2025,🟢,C,https://huggingface.co/CohereForAI/c4ai-command-a-03-2025,,"Context=256k. ""Command A is an open weights research release of a 111 billion parameter model optimized for demanding enterprises that require fast, secure, and high-quality AI. Compared to other leading proprietary and open-weights models Command A delivers maximum performance with minimum hardware costs, excelling on business-critical agentic and multilingual tasks while being deployable on just two GPUs.""",552,███,███,███,███,███,"262,144",Canada,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemini Robotics,Google DeepMind,,200,10,MoE,"20,000",100:1,███,6.7,,79.1,64.7,,robotics,Mar/2025,🔴,F,███,,"Gemini 2.0 Pro (cloud). ""The second model is Gemini Robotics, a state-of-the-art Vision-Language-Action (VLA) model that connects strong embodied reasoning priors to dexterous low-level control of real-world robots to solve challenging manipulation tasks. 
As a generalist VLA, Gemini Robotics can perform a wide array of diverse and complicated tasks, while also closely following language guidance and generalizing to distribution shifts in instructions, visuals, and motions. To emphasize the flexibility and generality of the Gemini Robotics models, we also introduce an optional specialization stage, which demonstrates how Gemini Robotics can be adapted for extreme dexterity, for advanced reasoning in difficult generalization settings, and for controlling completely new robot embodiments.""",551,███,███,███,███,Proprietary,"8,192",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemini Robotics-ER,Google DeepMind,,30,1.5,MoE,"30,000","1,000:1",███,3.2,███,76.4,62.1,,"synthetic, web-scale",Mar/2025,🔴,F,https://storage.googleapis.com/deepmind-media/gemini-robotics/gemini_robotics_report.pdf,,"Gemini 2.0 Flash (on device). ""The first model is Gemini Robotics-ER, a VLM with strong embodied reasoning capabilities at its core, exhibiting generalization across a wide range of embodied reasoning tasks while also maintaining its core foundation model capabilities. Gemini Robotics-ER exhibits strong performance on multiple capabilities critical for understanding the physical world, ranging from 3D perception to detailed pointing to robot state estimation and affordance prediction via code.""",550,███,███,███,███,███,"8,192",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemma 3,Google DeepMind,https://huggingface.co/ggml-org/gemma-3-27b-it-GGUF,27,,Dense,"14,000",519:1,███,2.0,78.6,67.5,42.4,,web-scale,Mar/2025,🟢,A,https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf,███,"Trained on 1T more tokens than Gemma 2. 
""introduces vision understanding abilities, a wider coverage of languages and longer context – at least 128K tokens.""",549,███,███,███,███,Gemma,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Reka Flash 3,Reka AI,https://huggingface.co/RekaAI/reka-flash-3,21,███,Dense,"5,000",239:1,███,1.1,,65,61.1,,web-scale,Mar/2025,🟢,C,https://www.reka.ai/news/introducing-reka-flash,Reasoning,"""performs competitively with proprietary models such as OpenAI o1-mini, making it a good foundation to build applications that require low latency or on-device deployment. It is currently the best open model in its size category.""",548,███,███,███,███,Apache 2.0,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ QwQ-32B,Alibaba,https://huggingface.co/Qwen/QwQ-32B,32,,Dense,"18,000",563:1,███,2.5,,,,,"synthetic, web-scale",Mar/2025,███,A,https://qwenlm.github.io/blog/qwq-32b/,Reasoning,Update to QwQ-32B-Preview released Nov/2024. Scores 1/5 on latest ALPrompt 2024 H2. Qwen with Question=QwQ,547,███,███,███,███,███,"131,072",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Jamba 1.6,AI21,███,398,94,MoE,"1,200",4:1,███,2.3,81.2,53.5,36.9,,web-scale,Mar/2025,🟢,C,https://huggingface.co/ai21labs/AI21-Jamba-Large-1.6,,"""The AI21 Jamba 1.6 family of models is state-of-the-art, hybrid SSM-Transformer instruction following foundation models. The Jamba models are the most powerful & efficient long-context models on the market, which deliver up to 2.5X faster inference than leading models of comparable sizes.""",546,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Instella-3B,AMD,https://huggingface.co/amd/Instella-3B,3,,Dense,"4,160","1,387:1",███,0.4,58.31,███,30.13,,web-scale,Mar/2025,🟢,A,https://rocm.blogs.amd.com/artificial-intelligence/introducing-instella-3B/README.html,,"""trained from scratch on AMD Instinct™ MI300X GPUs. 
Instella models outperform existing fully open models of similar sizes and achieve competitive performance compared to state-of-the-art open-weight models such as Llama-3.2-3B, Gemma-2-2B, and Qwen-2.5-3B, including their instruction-tuned counterparts.""",545,███,███,███,███,Other,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Babel-83B,Alibaba,https://huggingface.co/Tower-Babel/Babel-83B,83,███,Dense,"15,000",181:1,███,3.7,,,,,"synthetic, web-scale",Mar/2025,🟢,C,https://arxiv.org/abs/2503.00865,,"""top 25 languages by number of speakers, including English, Chinese, Hindi, Spanish, Arabic, French, Bengali, Portuguese, Russian, Urdu, Indonesian, German, Japanese, Swahili, Filipino, Tamil, Vietnamese, Turkish, Italian, Javanese, Korean, Hausa, Persian, Thai, and Burmese. These 25 languages support over 90% of the global population...""",544,███,███,███,███,Other,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Granite-3.2-8B-Instruct,IBM,https://huggingface.co/collections/ibm-granite/granite-32-language-models-67b3bc8c13508f6d064cff9a,8,,Dense,"12,000","1,500:1",███,1.0,66.79,,,,web-scale,Feb/2025,🟢,███,https://www.ibm.com/new/announcements/ibm-granite-3-2-open-source-reasoning-and-vision,Reasoning,"""The new Granite 3.2 8B Instruct [offers] experimental chain-of-thought reasoning capabilities """,543,███,███,███,███,Apache 2.0,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ C4AI Command R7B Arabic,Cohere,https://huggingface.co/CohereForAI/c4ai-command-r7b-arabic-02-2025,7,,Dense,"2,000",286:1,███,0.4,,29.4,7.9,███,web-scale,Feb/2025,🟢,C,https://huggingface.co/CohereForAI/c4ai-command-r7b-arabic-02-2025,,"""C4AI Command R7B Arabic is an open weights research release of a 7 billion parameter custom model with advanced capabilities optimized for the Arabic language (MSA dialect) along with English. 
The model excels at tasks that enterprises care about: instruction following, length control, RAG, and responding in the correct language. It also demonstrates excellent general purpose knowledge and understanding of Arabic language and cultures.""",542,███,███,███,███,CC-BY-NC 4.0,███,Canada,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-4.5,OpenAI,https://chat.com/,4500,225,MoE,"114,000",26:1,███,75.5,89.6,,71.4,6.4,"synthetic, web-scale",Feb/2025,🟢,D,███,SOTA,"""Our largest and best model for chat"" https://openai.com/index/introducing-gpt-4-5/ ""GPT-4.5 is not a frontier model, but it is OpenAI’s largest LLM, improving on GPT-4’s computational efficiency by more than 10x. While GPT-4.5 demonstrates increased world knowledge, improved writing ability, and refined personality over previous models, it does not introduce net-new frontier capabilities compared to previous reasoning releases, and its performance is below that of o1, o3-mini, and deep research on most preparedness evaluations.""",541,███,███,███,███,███,"128,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Hunyuan T1,Tencent,https://cloud.tencent.com/apply/p/i2zophus2x8,389,19.45,MoE,"7,000",18:1,███,███,,87.2,69.3,,"synthetic, web-scale",Feb/2025,🟢,D,https://mp.weixin.qq.com/s/BwQkXpEitOm1Piz60SE-4A,Reasoning,"""Based on Turbo S, by introducing technologies such as long thinking chains, retrieval enhancement and reinforcement learning, Hunyuan also launched the reasoning model T1 with deep thinking. This model has been fully launched on Tencent Yuanbao ( Tencent Hunyuan T1 model is open to all users ) , users can choose Deepseek R1 or Tencent Hunyuan T1 model to answer. 
The official version of Tencent Hunyuan T1 model will be launched soon, providing API access and other services to the outside world.""",540,███,███,███,███,Other,"262,144",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Hunyuan Turbo S,Tencent,https://cloud.tencent.com/apply/p/i2zophus2x8,389,19.45,███,"7,000",18:1,███,5.5,89.5,79,57.5,,"synthetic, web-scale",Feb/2025,🟢,D,https://mp.weixin.qq.com/s/BwQkXpEitOm1Piz60SE-4A,,"Fast thinking (""Instant reply""). ""This is also the first time in the industry that the Mamba architecture has been successfully applied losslessly to a very large MoE model.""",539,███,███,███,███,███,"262,144",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Phi-4-multimodal,Microsoft,https://huggingface.co/microsoft/Phi-4-multimodal-instruct,5.6,,Dense,"6,100","1,090:1",███,0.6,,███,,,"synthetic, web-scale",Feb/2025,🟢,A,https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/phi_4_mm.tech_report.02252025.pdf,,"""Training data: 5T tokens, 2.3M speech hours, and 1.1T image-text tokens"" Announce: https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/",538,███,███,███,███,███,"131,072",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Phi-4-mini,Microsoft,https://huggingface.co/microsoft/Phi-4-mini-instruct,3.8,,Dense,"5,000","1,316:1",███,0.5,67.3,52.8,30.4,,"synthetic, web-scale",Feb/2025,🟢,A,https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/phi_4_mm.tech_report.02252025.pdf,,"""Phi-4-mini’s training data includes a wide variety of sources, totaling 5 trillion tokens, and is a combination of publicly available documents filtered for quality, selected high-quality educational data, and code newly created synthetic, “textbook-like” data for the purpose of teaching math, coding, common sense reasoning, general knowledge of the world (e.g., science, daily activities, theory of mind, etc.) 
high quality chat format supervised data covering various topics to reflect human preferences"" Announce: https://azure.microsoft.com/en-us/blog/empowering-innovation-the-next-generation-of-the-phi-family/",███,███,███,███,███,███,"131,072",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mercury Coder Small,Inception Labs,https://chat.inceptionlabs.ai/,40,,Dense,"5,000",125:1,███,1.5,,,███,,"synthetic, web-scale",Feb/2025,🟢,D,https://www.inceptionlabs.ai/news,Diffusion,"Diffusion large language model (dLLM). Very low 'IQ' performance (0/5 on all ALPrompts). Fast: 1,000tok/s. https://x.com/inceptionailabs/status/1894847921474150456",536,███,███,███,███,Proprietary,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ QwQ-Max-Preview,Alibaba,https://chat.qwen.ai/,325,,Dense,"20,000",62:1,███,8.5,,,,,"synthetic, web-scale",Feb/2025,🟢,A,https://qwenlm.github.io/blog/qwq-max-preview/,Reasoning,███,535,███,███,███,███,Proprietary,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Claude 3.7 Sonnet,Anthropic,https://claude.ai/,400,,Dense,███,50:1,███,9.4,,82.7,84.8,8.9,"synthetic, web-scale",Feb/2025,🟢,D,https://assets.anthropic.com/m/785e231869ea8b3b/original/claude-3-7-sonnet-system-card.pdf,"Reasoning, SOTA","Knowledge cutoff now November 2024 (was April 2024). 
""the first hybrid reasoning model on the market."" https://www.anthropic.com/news/claude-3-7-sonnet",534,███,███,███,███,Proprietary,"200,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Moonlight,Moonshot AI,https://huggingface.co/moonshotai/Moonlight-16B-A3B,███,3,MoE,"5,700",357:1,███,1.0,70,42.4,,,"synthetic, web-scale",Feb/2025,🟢,A,https://github.com/MoonshotAI/Moonlight/blob/master/Moonlight.pdf,,"""Scaling law experiments indicate that Muon achieves ∼ 2× computational efficiency compared to AdamW with compute optimal training."" https://github.com/MoonshotAI/Moonlight?tab=readme-ov-file",533,███,███,███,███,███,"8,192",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ S2,Figure,,7,,Dense,"2,000",286:1,███,0.4,,,,,"synthetic, web-scale",Feb/2025,🔴,D,https://www.figure.ai/news/helix,,███,532,███,███,███,███,Proprietary,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ S1,Figure,,0.08,,Dense,1,13:1,███,0.001,,,,,special,Feb/2025,🔴,███,https://www.figure.ai/news/helix,,"""high quality, multi-robot, multi-operator dataset of diverse teleoperated behaviors, ~500 hours in total. To generate natural language-conditioned training pairs, we use an auto-labeling VLM to generate hindsight instructions. The VLM processes segmented video clips from the onboard robot cameras, prompted with: ""What instruction would you have given the robot to get the action seen in this video?"" All items handled during training are excluded from evaluations to prevent contamination. Architecture Our system comprises two main components: S2, a VLM backbone, and S1, a latent-conditional visuomotor transformer. S2 is built on a 7B-parameter open-source, open-weight VLM pretrained on internet-scale data. It processes monocular robot images and robot state information (consisting of wrist pose and finger positions) after projecting them into vision-language embedding space. 
Combined with natural language commands specifying desired behaviors, S2 distills all semantic task-relevant information into a single continuous latent vector, passed to S1 to condition its low-level actions. S1, an 80M parameter cross-attention encoder-decoder transformer, handles low-level control. It relies on a fully convolutional, multi-scale vision backbone for visual processing, initialized from pretraining done entirely in simulation. While S1 receives the same image and state inputs as S2, it processes them at a higher frequency to enable more responsive closed-loop control. The latent vector from S2 is projected into S1's token space and concatenated with visual features from S1's vision backbone along the sequence dimension, providing task conditioning. S1 outputs full upper body humanoid control at 200hz, including desired wrist poses, finger flexion and abduction control, and torso and head orientation targets. We append to the action space a synthetic ""percentage task completion"" action, allowing Helix to predict its own termination condition, which makes it easier to sequence multiple learned behaviors.""",531,███,███,███,███,Proprietary,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Baichuan-M1-14B,Baichuan,https://huggingface.co/baichuan-inc/Baichuan-M1-14B-Base,14,,Dense,"20,000","1,429:1",███,1.8,,,,,███,Feb/2025,🟢,A,https://arxiv.org/abs/2502.12671,,Medical LLM. Huge increase to 20T tokens for 14B params standard.,530,███,███,███,███,Other,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Evo 2,Arc Institute,https://github.com/arcinstitute/evo2,40,,Dense,"8,800",220:1,███,2.0,,,,,special,Feb/2025,🟢,A,https://github.com/arcinstitute/evo2,███,"""Evo 2 is a state of the art DNA language model for long context modeling and design. Evo 2 models DNA sequences at single-nucleotide resolution at up to 1 million base pair context length using the StripedHyena 2 architecture. 
Evo 2 was pretrained using Savanna. Evo 2 was trained autoregressively on OpenGenome2, a dataset containing 8.8 trillion tokens from all domains of life."" Greg Brockman co-author.",529,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ R1 1776,Perplexity,https://huggingface.co/perplexity-ai/r1-1776,685,37,MoE,"14,800",22:1,███,10.6,███,,,8.6,"synthetic, web-scale",Feb/2025,🟢,A,https://www.perplexity.ai/hub/blog/open-sourcing-r1-1776,Reasoning,"Censorship reduced, based on DeepSeek-R1.",528,███,███,███,███,███,"131,072",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Grok-3,xAI,https://grok.com/,3000,150,MoE,"50,000",17:1,███,40.8,,79.9,███,,"synthetic, web-scale",Feb/2025,🟢,C,https://x.ai/blog/grok-3,"Reasoning, SOTA",https://x.ai/blog/grok-3 My full analysis: https://lifearchitect.ai/whats-in-grok/,527,███,███,███,███,Proprietary,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mistral Saba,Mistral,https://console.mistral.ai/,24,███,Dense,"8,000",334:1,███,1.5,81,,,,web-scale,Feb/2025,🟢,C,https://mistral.ai/en/news/mistral-saba,,"""Mistral Saba is a 24B parameter model trained on meticulously curated datasets from across the Middle East and South Asia.""",526,███,███,███,███,Proprietary,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Salamandra,Barcelona Supercomputing Center,https://github.com/langtech-bsc/salamandra,40,,Dense,"9,000",225:1,███,2.0,,,,,web-scale,Feb/2025,🟢,A,https://arxiv.org/abs/2502.08489,,███,525,███,███,███,███,Apache 2.0,███,Spain,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeepHermes 3 Preview,Nous Research,https://huggingface.co/NousResearch/DeepHermes-3-Llama-3-8B-Preview,8,,Dense,"15,200","1,900:1",███,1.2,,███,38,,"synthetic, web-scale",Feb/2025,🟢,A,https://huggingface.co/NousResearch/DeepHermes-3-Llama-3-8B-Preview,Reasoning,"Based on Llama 3 8B. 
GPQA score based on GPT-4o's analysis of the chart :-/ ""one of the first models in the world to unify Reasoning (long chains of thought that improve answer accuracy) and normal LLM response modes into one model."" https://x.com/NousResearch/status/1890148004029759612",524,███,███,███,███,Llama 3,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ OREAL-32B,Shanghai AI Laboratory/SenseTime,https://huggingface.co/internlm/OREAL-32B,32,,Dense,"4,000",███,███,1.2,,,,,"synthetic, web-scale",Feb/2025,🟢,C,https://arxiv.org/abs/2502.06781,Reasoning,OREAL=Outcome REwArd-based reinforcement Learning.,523,███,███,███,███,Apache 2.0,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemini 2.0 Pro,Google DeepMind,https://aistudio.google.com/prompts/new_chat?model=gemini-2.0-pro-exp-02-05,200,10,MoE,"20,000",100:1,███,6.7,,███,64.7,,"synthetic, web-scale",Feb/2025,🟢,D,https://blog.google/technology/google-deepmind/gemini-model-updates-february-2025/,,"Context=2M. Disappointing benchmarks, this is the 'pro' (medium) not 'ultra' (large) model. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/",522,███,███,███,███,███,"1,000,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ s1-32B,Stanford,https://github.com/simplescaling/s1,32,,Dense,"18,000",563:1,███,2.5,,,59.6,,"synthetic, web-scale",Feb/2025,🟢,A,https://arxiv.org/abs/2501.19393,Reasoning,"Based on Qwen2.5-32B-Instruct. ""we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending “Wait” multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. 
After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24).""",███,███,███,███,███,Apache 2.0,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ o3-mini,OpenAI,https://chatgpt.com/?model=o3-mini,70,███,Dense,"13,000",186:1,███,3.2,,,79.7,14,"synthetic, web-scale",Jan/2025,🟢,D,https://openai.com/index/o3-mini-system-card/,Reasoning,"GPQA=79.7 for 'high' thinking. ALPrompt 2025H1=1/5. My analysis is that this model’s performance is very poor, with responses often becoming disordered and illogical. OpenAI compared o3-mini to OpenAI’s software engineers, and it performed very poorly (o3-mini=0%, o1=12%). ""o3-mini models have the lowest performance, with scores of 0%… We suspect o3-mini’s low performance is due to poor instruction following and confusion about specifying tools in the correct format. The model often attempts to use a hallucinated bash tool rather than python despite constant, multi-shot prompting and feedback that this format is incorrect. This resulted in long conversations that likely hurt its performance."" (o3-mini paper, p31)",520,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mistral Small 3,Mistral,https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501,24,,Dense,"8,000",334:1,███,1.5,███,54.37,45.3,,web-scale,Jan/2025,🟢,A,https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501,,"MMLU=base, -Pro=base, GPQA=instruct. 
""When quantized, Mistral Small 3 can be run privately on a single RTX 4090 or a Macbook with 32GB RAM."" ""Mistral Small 3 is neither trained with RL nor synthetic data""",519,███,███,███,███,███,"32,768",France,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Llama-3.1-Tulu-3-405B,Allen AI,https://playground.allenai.org/,405,,Dense,"15,600",39:1,███,8.4,87,,,,"synthetic, web-scale",Jan/2025,🟢,A,https://huggingface.co/allenai/Llama-3.1-Tulu-3-405B,,Lower MMLU score than Llama 3.1 405B base.,███,███,███,███,███,Llama 3.1,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen2.5-Max,Alibaba,https://chat.qwenlm.ai/,325,16.25,MoE,"20,000",62:1,███,8.5,87.9,69,60.1,,"synthetic, web-scale",Jan/2025,███,A,https://qwenlm.github.io/blog/qwen2.5-max/,,"""Qwen2.5-Max emerges as a milestone in MoE development, featuring an impressive 325 billion parameters. The model has been pretrained on over 20 trillion tokens and further refined with advanced post-training methodologies such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF)."" https://wandb.ai/byyoung3/ml-news/reports/Qwen2-5-Max-Advancing-Large-Scale-Mixture-of-Expert-Models---VmlldzoxMTEyMjUyNg",517,███,███,███,███,Proprietary,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ EvaByte,SambaNova,https://huggingface.co/EvaByte/EvaByte,6.5,,Dense,"1,500",231:1,███,0.3,50.6,,,,web-scale,Jan/2025,🟢,███,https://hkunlp.github.io/blog/2025/evabyte/,,"""efficient byte-level processing at scale... [compared to tokenizer-based LMs:] 5x less training data, excelling in coding tasks, and decoding up to 2x faster. 
Its token-free design also brings added flexibility, avoiding tokenizer quirks while naturally extending to multimodal applications without any architecture tweaks.""",516,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ UI-TARS-72B,ByteDance,https://github.com/bytedance/UI-TARS-desktop?tab=readme-ov-file,72,,Dense,"9,000",125:1,███,███,,,,,"synthetic, web-scale",Jan/2025,🟢,C,https://arxiv.org/abs/2501.12326,,VLM. SoTA agent 'computer use' model to 23/Jan/2025.,515,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Doubao-1.5-pro,ByteDance,https://www.volcengine.com/docs/82379/1330310#474f7dec,200,20,MoE,"9,000",45:1,███,4.5,88.6,80.1,65,,"synthetic, web-scale",Jan/2025,🟢,B,https://team.doubao.com/en/special/doubao_1_5_pro,Reasoning,"Includes 2.4B param ViT. ""Doubao-1.5-pro uses a sparse MoE architecture. In the pre-training stage, the performance of the MoE model activated with only a small number of parameters can exceed that of ultra-large dense pre-trained models such as Llama3.1-405B. Through the study of the sparsity scaling law, the team determined the sparse ratio that balances performance and efficiency, and determined based on the MoE scaling law that a model activated with a small number of parameters can achieve the performance of a world-class model.""",███,███,███,███,███,Proprietary,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Kimi k1.5,Moonshot AI,https://github.com/MoonshotAI/kimi-k1.5?tab=readme-ov-file,500,,Dense,"15,000",███,███,9.1,87.4,,51.5,,"synthetic, web-scale",Jan/2025,🟢,D,https://arxiv.org/abs/2501.12599,Reasoning,"""our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities---e.g., 77.5 on AIME, 96.2 on MATH 500, 94-th percentile on Codeforces, 74.9 on MathVista---matching OpenAI's o1"". 
GPQA score is my estimate from pp13–14, noting that ""the scores above come from an internal long-cot model with much smaller model size than k1.5 long-CoT model.""",513,███,███,███,███,Proprietary,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeepSeek-R1,DeepSeek-AI,https://chat.deepseek.com/,685,37,MoE,"14,800",22:1,███,10.6,90.8,84,71.5,███,web-scale,Jan/2025,🟢,D,https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf,"Reasoning, SOTA","""DeepSeek-R1 achieves performance comparable to OpenAI-o1 across math, code, and reasoning tasks. To support the research community, we have open-sourced DeepSeek-R1-Zero, DeepSeek-R1, and six dense models distilled from DeepSeek-R1 based on Llama and Qwen. DeepSeek-R1-Distill-Qwen-32B outperforms OpenAI-o1-mini across various benchmarks""",512,███,███,███,███,███,"131,072",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-4b,OpenAI,,8,███,Dense,"4,000",500:1,███,0.6,,,,,special,Jan/2025,🔴,F,https://www.technologyreview.com/2025/01/17/1110086/openai-has-created-an-ai-model-for-longevity-science/,,"Protein sequence model. ""The model was trained on examples of protein sequences from many species, as well as information on which proteins tend to interact with one another. 
While that’s a lot of data, it’s just a fraction of what OpenAI’s flagship chatbots were trained on, making GPT-4b an example of a “small language model” that works with a focused data set."" https://www.technologyreview.com/2025/01/17/1110086/openai-has-created-an-ai-model-for-longevity-science/",511,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Helium-1,Kyutai,https://huggingface.co/kyutai/helium-1-preview-2b,2,,Dense,"2,500","1,250:1",███,0.2,51.2,,,███,web-scale,Jan/2025,🟢,A,https://kyutai.org/2025/01/13/helium.html,,"""Helium-1 preview, an initial version of our new backbone language model with 2B parameters, targeting edge and mobile devices... We use token level distillation of a 7B parameters model to train Helium-1 preview.""",510,███,███,███,███,███,"4,096",France,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ InternLM3,Shanghai AI Laboratory/SenseTime,https://huggingface.co/internlm/internlm3-8b-instruct,8,,Dense,"4,000",500:1,███,0.6,76.6,57.6,37.4,,web-scale,Jan/2025,🟢,A,https://huggingface.co/internlm/internlm3-8b-instruct,███,"""InternLM3 is trained on only 4 trillion high-quality tokens, saving more than 75% of the training cost compared to other LLMs of similar scale."" Playground: https://internlm-chat.intern-ai.org.cn/",509,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MiniMax-Text-01,MiniMax,https://github.com/MiniMax-AI/MiniMax-01,456,45.9,MoE,"7,200",16:1,███,6.0,88.5,75.7,54.4,███,web-scale,Jan/2025,🟢,C,https://arxiv.org/abs/2501.08313,,"""The pre-training corpus for MiniMax-Text-01 encompasses a comprehensive and meticulously curated dataset, incorporating diverse sources including academic literature, books, web content, and programming code... repeatedly training high-quality documents can lead to enhanced downstream performance, with certain high-quality domains being trained up to 50 times... 
Our findings indicate that low-quality data suffer a substantial decrease in performance after training for more than two epochs, while high-quality data can be effectively trained for up to four epochs"" Login playground: https://www.hailuo.ai/",508,███,███,███,███,Other,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Sky-T1-32B-Preview,Berkeley,https://huggingface.co/NovaSky-AI/Sky-T1-32B-Preview,32,,███,"18,000",563:1,███,2.5,,,56.8,,"synthetic, web-scale",Jan/2025,🟢,A,https://novasky-ai.github.io/posts/sky-t1/,,"""To generate our training data we use QwQ-32B-Preview, an open-source model with reasoning capabilities comparable to o1-preview. We curate the data mixture (see later section) to cover diverse domains that require reasoning, and a reject sampling procedure to improve the data quality. We then rewrite QwQ traces with GPT-4o-mini into a well-formatted version, inspired by Still-2, to improve data quality and ease parsing... Rejection Sampling: We discard QwQ samples if they are incorrect according to the solutions provided in datasets.""",507,███,███,███,███,███,"32,768",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Cosmos Nemotron 34B,NVIDIA,https://build.nvidia.com/nvidia/cosmos-nemotron-34b,34,,Dense,200,6:1,███,0.3,,,,,special,Jan/2025,🟢,C,https://research.nvidia.com/publication/2025-01_cosmos-world-foundation-model-platform-physical-ai,,"VLM. MMMU=47.33. 
""VILA project becomes part of Cosmos Nemotron family"" https://github.com/NVlabs/Cosmos-Nemotron Vision Encoder: SigLIP-400M, Language Encoder: Yi-34B https://blogs.nvidia.com/blog/nemotron-model-families/",███,███,███,███,███,███,"131,072",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Cosmos 1.0,NVIDIA,https://huggingface.co/nvidia/Cosmos-1.0-Diffusion-14B-Video2World,14,███,Dense,200,15:1,███,0.2,,,,,special,Jan/2025,🟢,C,https://research.nvidia.com/publication/2025-01_cosmos-world-foundation-model-platform-physical-ai,,"WFM (world foundation model). ""The models range in size from 4 billion to 14 billion parameters, with Nano being the smallest and Ultra being the largest... ""Cosmos WFM models, were trained on 9,000 trillion tokens [9,000T] from 20 million hours of real-world human interactions, environment, industrial, robotics, and driving data..."" https://techcrunch.com/2025/01/06/nvidia-releases-its-own-brand-of-world-models/ Actual working: https://lifearchitect.ai/cosmos/",505,███,███,███,███,███,"32,768",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ METAGENE-1,Prime Intellect,https://huggingface.co/metagene-ai,7,,Dense,370,53:1,███,0.2,,,,,special,Jan/2025,🟢,A,https://metagene.ai/metagene-1-paper.pdf,,███,504,███,███,███,███,███,"8,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Sonus-1 Reasoning,Rubik's AI,https://chat.sonus.ai/sonus/,405,,Dense,"15,000",38:1,███,8.2,90.15,███,67.3,,web-scale,Jan/2025,🟢,D,https://sonus.ai/blog/sonus-1,,"Likely a Llama 3.1 405B wrapper. ALPrompt 2024H1=5/5. ALPrompt 2024H2=2/5. ALPrompt 2025H1=1/5. This is a strange model: slow and smart, but not as performant as o1. 
No arch details at all.",503,███,███,███,███,Proprietary,███,Mongolia,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ YuLan-Mini,Renmin,https://github.com/RUC-GSAI/YuLan-Mini,2.4,███,Dense,"1,080",450:1,███,0.2,51.79,,,,web-scale,Dec/2024,🟢,A,https://arxiv.org/abs/2412.17743,,"""1.08T tokens for training. Among them are 481B English web data, 138B general English knowledge, 227B code pre-training data, 16.7B code instruction data, 93.8B mathematics pre-training data, 15.5B mathematics instruction data, and 108B Chinese data.""",502,███,███,███,███,███,"131,072",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeepSeek-V3,DeepSeek-AI,https://chat.deepseek.com/,685,███,MoE,"14,800",22:1,███,10.6,87.1,64.4,59.1,,"synthetic, web-scale",Dec/2024,🟢,A,https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf,SOTA,37B active. Explain: https://threadreaderapp.com/thread/1872318161883959485.html Announce: https://github.com/deepseek-ai/DeepSeek-V3?tab=readme-ov-file,501,███,███,███,███,MIT,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ EON-8B,LinkedIn,https://www.linkedin.com/blog/engineering/generative-ai/how-we-built-domain-adapted-foundation-genai-models-to-power-our-platform,8,,Dense,"15,000",███,███,1.2,,,,,web-scale,Dec/2024,🔴,A,https://www.linkedin.com/blog/engineering/generative-ai/how-we-built-domain-adapted-foundation-genai-models-to-power-our-platform,,"""We found the EON-8B model (a domain-adapted Llama 3.1-8B variant) to be 75x and 6x cost effective in comparison to GPT-4 and GPT-4o respectively (Figure 4)... 
On tasks seen during training, the EON-8B model outperformed base Llama-3-8B-Instruct and its performance was comparable to SOTA GPT models.""",500,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ o3-preview,OpenAI,https://lifearchitect.ai/o3/,600,30,MoE,"100,000",167:1,███,25.8,,,87.7,,███,Dec/2024,🟢,D,https://lifearchitect.ai/o3/,"Reasoning, SOTA",SoTA model for Dec/2024. Parameter estimate is very rough centrepoint for range 400B-52T.,499,███,███,███,███,Proprietary,"200,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ RWKV-7 Goose,RWKV,https://github.com/BlinkDL/RWKV-LM/tree/main/RWKV-v7,0.4,,Dense,332,830:1,███,0.04,,,,,web-scale,Dec/2024,🟢,A,███,,"RWKV (pronounced RwaKuv) is an RNN: ""multilingual, supporting over 100 languages and code."". Full run is 332B tokens of 3.1T dataset.",498,███,███,███,███,███,"4,096",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ModernBERT,International,https://huggingface.co/blog/modernbert,0.395,,Dense,"2,000","5,064:1",███,0.09,,,,,web-scale,Dec/2024,🟢,A,███,,"""a proper workhorse model, for retrieval, classification, etc."" https://bsky.app/profile/howard.fm/post/3ldod2afps62x",497,███,███,███,███,███,"8,192",International,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Granite 3.1 8B,IBM,https://huggingface.co/ibm-granite/granite-3.1-8b-instruct,8,,Dense,"12,000","1,500:1",███,1.0,,,,,web-scale,Dec/2024,🟢,A,https://github.com/ibm-granite/granite-3.1-language-models?tab=readme-ov-file,███,"Dense IBM enterprise model trained on 12T tokens, 128K context via progressive training on ~500B additional tokens; Apache 2.0 licensed; outperforms models of similar parameter sizes on the HuggingFace OpenLLM Leaderboard at release; data curated to governance, risk, and compliance (GRC) criteria for enterprise use. 
Open-weight.",496,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Bamba-9B,IBM,https://huggingface.co/blog/bamba,9,,Dense,"2,200",245:1,███,0.5,60.77,17.53,4.14,███,web-scale,Dec/2024,🟢,A,https://huggingface.co/blog/bamba,,"""trained by IBM, Princeton, CMU, and UIUC on completely open data. At inference time, the model demonstrates 2.5x throughput improvement and 2x latency speedup compared to standard transformers in vLLM.""",495,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ o1-2024-12-17,OpenAI,https://chatgpt.com/,200,10,MoE,"20,000",100:1,███,6.7,91.8,,75.7,8.8,web-scale,Dec/2024,🟢,D,███,,"""o1-2024-12-17 sets new state-of-the-art results on several benchmarks, improving cost-efficiency and performance.""",494,███,███,███,███,Proprietary,"200,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Falcon 3,TII,https://huggingface.co/tiiuae/Falcon3-10B-Base,10,███,Dense,"16,000","1,600:1",███,1.3,73.1,42.5,34.1,,"synthetic, web-scale",Dec/2024,🟢,A,https://huggingface.co/blog/falcon3,,"""We conducted a single large-scale pretraining run on the 7B model, using 1024 H100 GPU chips, leveraging 14 trillion tokens... 
upscaled the 7B model to a 10B parameters model by duplicating the redundant layers and continuing pre-training with 2 trillion tokens of high-quality data.""",493,███,███,███,███,Other,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Command R7B,Cohere,https://cohereforai-c4ai-command.hf.space/models/command-r7b-12-2024,7,,Dense,"2,000",286:1,███,███,,28.5,7.7,,web-scale,Dec/2024,🟢,C,https://huggingface.co/CohereForAI/c4ai-command-r7b-12-2024,,"Dense 7B model with 128K context; hybrid sliding-window (4096) + global attention with RoPE; supports 23 languages; fine-tuned for RAG, tool use, and function calling; ranked 1st among similarly-sized open-weights models on the HuggingFace Open LLM Leaderboard at release. CC-BY-NC licensed.",492,███,███,███,███,CC-BY-NC 4.0,███,Canada,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Maya,Cohere,https://huggingface.co/maya-multimodal/maya,8,,Dense,"4,800",600:1,███,0.7,,,,,"synthetic, web-scale",Dec/2024,███,C,https://arxiv.org/abs/2412.07112,,VLM.,491,███,███,███,███,███,"4,000",Canada,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ BLT,Meta AI,https://github.com/facebookresearch/blt,8,,Dense,"4,500",563:1,███,0.6,███,,,,web-scale,Dec/2024,🟢,A,https://ai.meta.com/research/publications/byte-latent-transformer-patches-scale-better-than-tokens/,,"Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance",490,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Large Concept Model,Meta AI,https://github.com/facebookresearch/large_concept_model?tab=readme-ov-file,7,,Dense,"2,700",386:1,███,0.5,███,,,,web-scale,Dec/2024,🟢,A,https://ai.meta.com/research/publications/large-concept-models-language-modeling-in-a-sentence-representation-space/,,"""autoregressive sentence prediction in an embedding space."" 7.7T tokens is a misprint, should be 
2.2T as in paper.",489,███,███,███,███,███,"1,024",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Phi-4,Microsoft,https://huggingface.co/microsoft/phi-4,14,,Dense,███,715:1,███,1.2,84.8,70.4,56.1,,"synthetic, web-scale",Dec/2024,🟢,A,https://arxiv.org/abs/2412.08905,SOTA,Use unsloth: https://huggingface.co/unsloth/phi-4-GGUF & https://www.reddit.com/r/singularity/comments/1i0kso4/i_fixed_4_bugs_in_microsofts_opensource_phi4_model/,488,███,███,███,███,MIT,"16,384",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemini 2.0 Flash exp,Google DeepMind,https://console.cloud.google.com/vertex-ai/generative/multimodal/create/text?model=gemini-2.0-flash-exp,30,1.5,MoE,"30,000","1,000:1",███,███,87,76.4,62.1,,"synthetic, web-scale",Dec/2024,🟢,D,https://cloud.google.com/vertex-ai/generative-ai/docs/gemini-v2,SOTA,"Gemini 2.0 Flash was first model released, 11/Dec/2024. ""New Modalities: Gemini 2.0 introduces native image generation and controllable text-to-speech capabilities"" Announce: https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/ Note: Gemini outputs are watermarked. I do not use GDM models. 
https://lifearchitect.ai/watermarking/",487,███,███,███,███,Proprietary,"1,000,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Moxin-7B,International,https://github.com/moxin-org/Moxin-LLM,7,,Dense,"2,000",286:1,███,0.4,60.97,,,,web-scale,Dec/2024,🟢,A,https://arxiv.org/abs/2412.06845,,███,486,███,███,███,███,Apache 2.0,"32,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ 1T,Cerebras,███,1000,,Dense,"20,000",20:1,███,14.9,,,,,web-scale,Dec/2024,🔴,C,https://cerebras.ai/press-release/cerebras-demonstrates-trillion-parameter-model-training-on-a-single-cs-3-system,,"""For Sandia’s trillion parameter training run, Cerebras configured a 55 terabyte MemoryX device.""",485,███,███,███,███,Proprietary,"2,048",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ InternVL 2.5,Shanghai AI Laboratory/SenseTime,https://huggingface.co/spaces/OpenGVLab/InternVL,78,,Dense,"18,120",233:1,███,4.0,86.1,71.1,49,,"synthetic, web-scale",Dec/2024,🟢,███,https://arxiv.org/abs/2412.05271,Reasoning,"Benchmarks are estimates based on Qwen2.5 72B Instruct as the base LLM (InternVL 2.5=InternViT-6B-448px-V2.5 5.5B + Qwen2.5-72B-Instruct). ""Notably, Qwen2-VL processed a cumulative total of 1.4T tokens, while our InternVL2.5-78B is trained on just ∼120B tokens [of vision]."" Dataset: ""... we identify repetitive generation as one of the most detrimental issues. In many open-source or synthetic datasets, a small number of repetitive samples—comprising merely thousands of examples in our Stage 2 data mixture—can cause the model to spiral into repetitive loops, particularly in long-form outputs or CoT reasoning tasks. This phenomenon undermines the effectiveness of test-time scaling strategies. 
To address this challenge and support future research, we designed an efficient data filtering pipeline to remove low-quality samples, thereby minimizing the risk of repetitive generation."" Repo: https://github.com/OpenGVLab/InternVL",484,███,███,███,███,MIT,"8,192",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Llama 3.3,Meta AI,https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct,███,,Dense,"15,000",215:1,███,3.4,86,68.9,50.5,,"synthetic, web-scale",Dec/2024,🟢,A,https://github.com/meta-llama/llama-models/blob/main/models/llama3_3/MODEL_CARD.md,SOTA,"Drop-in replacement for Llama 3.1 70B, comparable performance to Llama 3.1 405B.",483,███,███,███,███,Llama 3.3,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ EXAONE-3.5,LG,https://huggingface.co/collections/LGAI-EXAONE/exaone-35-674d0e1bb3dcd2ab6f39dbb4,32,,Dense,"6,500",204:1,███,1.5,78.3,███,39.7,,web-scale,Dec/2024,🟢,A,https://arxiv.org/abs/2412.04862,,“EXAONE”=“EXpert AI for EveryONE”. Training tokens/ratio dropped from EXAONE-3 7.8B with 8T (Aug/2024) to this (Dec/2024) 7.8B with 9T to 32B with 6.5T.,482,███,███,███,███,Other,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Deepthought-8B,Ruliad,https://chat.ruliad.co/,8,,Dense,"15,000","1,875:1",███,1.2,,,,,web-scale,Dec/2024,🟢,A,https://huggingface.co/ruliad/deepthought-8b-llama-v0.01-alpha,███,No evals. Llama 3.1 8B base.,481,███,███,███,███,Llama 3.1,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Sailor2,SAIL,https://huggingface.co/spaces/sail/Sailor2-20B-Chat,20,███,Dense,"18,510",926:1,███,2.0,,,,,web-scale,Dec/2024,🟢,A,https://github.com/sail-sg/sailor2,,SEA languages. Continual pretraining based on Qwen2.5. 
Project page: https://sea-sailor.github.io/blog/sailor2/,480,███,███,███,███,Apache 2.0,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Pleias 1.0,PleIAs,https://huggingface.co/PleIAs/Pleias-3b-Preview,3,,Dense,"1,086",362:1,███,0.2,,,,,███,Dec/2024,🟢,A,https://huggingface.co/blog/Pclanglais/common-models,,"Trained on the Jean Zay supercomputer, 192x H100s for 20 days. Dataset is new CC + Synthetic: https://huggingface.co/datasets/PleIAs/common_corpus",479,███,███,███,███,███,"4,096",Netherlands,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ o1,OpenAI,https://chatgpt.com/,200,███,MoE,"20,000",100:1,███,6.7,92.3,91,79,8.8,web-scale,Dec/2024,🟢,D,https://openai.com/index/introducing-chatgpt-pro/,"Reasoning, SOTA","""a version of our most intelligent model that thinks longer for the most reliable responses"" System card about safety only: https://cdn.openai.com/o1-system-card-20241205.pdf",478,███,███,███,███,███,"200,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Nova Pro,Amazon,https://aws.amazon.com/bedrock/,90,,Dense,"10,000",███,███,3.2,85.9,,46.9,,web-scale,Dec/2024,🟢,D,https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card,,"Multimodal, same performance as Llama 3.2 90B ∴ est 90B. 
Model card was hidden: https://assets.amazon.science/9f/a3/ae41627f4ab2bde091f1ebc6b830/the-amazon-nova-family-of-models-technical-report-and-model-card.pdf via https://www.amazon.science/publications/the-amazon-nova-family-of-models-technical-report-and-model-card",477,███,███,███,███,Proprietary,"1,000,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ EuroLLM,Consortium,https://huggingface.co/utter-project/EuroLLM-9B-Instruct,9,,Dense,"4,000",445:1,███,0.6,52.45,17.6,,,"synthetic, web-scale",Dec/2024,🟢,A,https://arxiv.org/abs/2506.04079,,"24 official languages are Bulgarian, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Hungarian, Irish, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese, Romanian, Slovak, Slovenian, Spanish, and Swedish. ""we use 400 Nvidia H100 GPUs of the Marenostrum 5 supercomputer"" Also: https://eurollm.io/",███,███,███,███,███,███,"32,000",International,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DisTrO 15B,Nous Research,https://distro.nousresearch.com/,15,,Dense,100,7:1,███,0.1,23.48,,,,web-scale,Dec/2024,🟢,A,https://github.com/NousResearch/DisTrO?tab=readme-ov-file,███,"""About 14 DGXes scattered around the globe. Sometimes more sometimes less, it varies depending on availability. 
On average, around 112 H100s."" https://x.com/bloc97_/status/1863675225810043331 ""we introduce DisTrO, a family of architecture-agnostic and network-agnostic distributed optimizers that reduces the inter-GPU communication requirements by four to five orders of magnitude without relying on amortized analysis, enabling low-latency training of large neural networks on slow internet bandwidths with heterogeneous networking hardware.""",475,███,███,███,███,Apache 2.0,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ INTELLECT-1,Prime Intellect,https://huggingface.co/PrimeIntellect/INTELLECT-1,10,,Dense,"1,000",███,███,0.3,49.89,,28.32,,web-scale,Nov/2024,🟢,A,https://github.com/PrimeIntellect-ai/prime/blob/main/INTELLECT_1_Technical_Report.pdf,,"Training complete 22/Nov/2024. Fully distributed training: ""the first decentralized training run of a 10-billion-parameter model, inviting anyone to contribute compute and participate. This brings us one step closer towards open source AGI.""",474,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ QwQ-32B-Preview,Alibaba,https://huggingface.co/spaces/Qwen/QwQ-32B-preview,32,,Dense,"18,000",███,███,2.5,,,65.2,,"synthetic, web-scale",Nov/2024,🟢,A,https://qwenlm.github.io/blog/qwq-32b-preview/,Reasoning,Scores 1/5 on latest ALPrompt 2024 H2. Qwen with Question=QwQ,473,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Teuken-7B,OpenGPT-X,https://huggingface.co/openGPT-X/Teuken-7B-instruct-research-v0.4,7,,Dense,"4,000",572:1,███,███,50,,,,"synthetic, web-scale",Nov/2024,🟢,A,https://arxiv.org/abs/2410.03730,,"24 EU languages (60% non-English): bg, cs, da, de, el, en, es, et, fi, fr, ga, hr, hu, it, lt, lv, mt, nl, pl, pt, ro, sk, sl, sv. 
https://opengpt-x.de/models/teuken-7b-de/ & paper date is Sep/2024.",472,███,███,███,███,Other,███,Germany,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ OLMo 2,Allen AI,https://huggingface.co/collections/allenai/olmo-2-674117b93ab84e98afc72edc,13,,Dense,"5,600",431:1,███,0.9,68.6,,,,"synthetic, web-scale",Nov/2024,🟢,A,https://arxiv.org/abs/2501.00656,,Open Language Model (OLMo) 2 Apache 2.0 license for research and educational use. Paper coming. Data: 5 trillion tokens (1.2 epochs of 4T tokens) + 100B tokens (3 runs) + 300B tokens (1 run) merged. https://huggingface.co/allenai/OLMo-2-1124-13B & playground: https://playground.allenai.org/,███,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Bi-Mamba,CMU,,2.7,,Dense,"1,260",467:1,███,0.2,,,,,███,Nov/2024,🔴,B,https://arxiv.org/abs/2411.11843,,"Unreleased, but will be replicated. ""a scalable and powerful 1-bit Mamba architecture designed for more efficient large language models""",470,███,███,███,███,Apache 2.0,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ k0-math,Moonshot AI,https://kimi.moonshot.cn/,100,,Dense,"2,000",20:1,███,███,,,,,web-scale,Nov/2024,🟢,C,https://www.globaltimes.cn/page/202411/1323248.shtml,Reasoning,"Reasoning, maths only. Very little info available. Chinese. Long context. No paper.",469,███,███,███,███,███,"4,000",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Marco-o1,Alibaba,https://huggingface.co/AIDC-AI/Marco-o1,7,,Dense,"7,000","1,000:1",███,0.7,,,,,"synthetic, web-scale",Nov/2024,🟢,███,https://arxiv.org/abs/2411.14405,Reasoning,"No evals. 
Qwen2-7B-Instruct with a combination of the filtered Open-O1 CoT dataset, Marco-o1 CoT dataset, and Marco-o1 Instruction dataset.",468,███,███,███,███,Apache 2.0,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ TÜLU 3,Allen AI,https://playground.allenai.org/,70,,Dense,"15,600",223:1,███,3.5,83.1,65.8,45.1,,"synthetic, web-scale",Nov/2024,🟢,A,https://allenai.org/papers/tulu-3-report.pdf,,"Llama 3.1 post-training, worse performance on most benchmarks. Post training methods include new Reinforcement Learning with Verifiable Rewards (RLVR). ""We perform supervised fine-tuning on new capability-focused synthetic data mixed with existing instruction datasets. We then perform preference tuning on on-policy synthetic preference data. We finish training Llama Tülu3 with our new method, Reinforcement Learning with Verifiable Rewards.""",███,███,███,███,███,Llama 3.1,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ gpt-4o-2024-11-20,OpenAI,https://chat.com/,200,10,MoE,"20,000",███,███,6.7,85.7,,46,3.1,web-scale,Nov/2024,🟢,D,https://platform.openai.com/docs/models#gpt-4o,,"Material decrease in benchmark scores (GPQA: -13.37%, MMLU: -3.38%) compared to Aug/2024. Pruned? Quantized? https://github.com/openai/simple-evals",466,███,███,███,███,Proprietary,"128,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeepSeek-R1-Lite,DeepSeek-AI,https://chat.deepseek.com/,67,,███,"2,000",30:1,███,1.2,,,58.5,,web-scale,Nov/2024,🟢,D,https://x.com/deepseek_ai/status/1859200141355536422,Reasoning,"Scores 0/5 on latest ALPrompt 2024 H2 ""DeepSeek-R1-Lite is currently still in the iterative development stage. It currently only supports web usage and does not support API calls. The base model used by DeepSeek-R1-Lite is also a relatively small model, unable to fully unleash the potential of long reasoning chains. At present, we are continuously iterating on the inference series models. 
In the future, the official DeepSeek-R1 model will be fully open-sourced. We will publicly release the technical report and deploy API services."" https://mp-weixin-qq-com.translate.goog/s/e1YnTxZlzFvjcmrLLTA8fw?_x_tr_sl=zh-CN&_x_tr_tl=en&_x_tr_hl=zh-TW",465,███,███,███,███,Proprietary,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Xmodel-LM,XiaoduoAI,https://github.com/XiaoduoAILab/XmodelLM,1.1,,Dense,"2,064","1,877:1",███,0.2,25.9,,,,███,Nov/2024,🟢,A,https://arxiv.org/abs/2411.10083,,SLM,464,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Pixtral Large,Mistral,https://huggingface.co/mistralai/Pixtral-Large-Instruct-2411,124,,Dense,"6,000",49:1,███,2.9,,,,███,web-scale,Nov/2024,🟢,C,https://mistral.ai/news/pixtral-large/,,Open-weights multimodal model built on top of Mistral Large 2.,463,███,███,███,███,███,"131,072",France,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ f1,Fireworks,███,405,,Compound,"15,600",39:1,███,8.4,,,42.4,,web-scale,Nov/2024,🟢,D,https://fireworks.ai/blog/fireworks-compound-ai-system-f1,,"""a compound AI model specialized in complex reasoning, that interweaves multiple open models at the inference layer."" Dataset: Placeholder only: Fireworks f1 is a compound AI system ""interweaving multiple open models"" at inference, not a single trained model; per-token training concept does not apply. Params: Largest at this time was Llama 3.1 405B.",462,███,███,███,███,███,"32,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen2.5-Coder,Alibaba,███,32.5,,Dense,"5,500",170:1,███,1.4,79.1,,,,"synthetic, web-scale",Nov/2024,🟢,A,https://arxiv.org/abs/2412.15115,,https://qwenlm.github.io/blog/qwen2.5-coder-family/ Jack Clark from Anthropic is saying it’s actually 18T tokens from Qwen2.5 + 5.5T tokens for a total of 23.5T tokens. 
That doesn’t seem right from my interpretation of the technical report.,461,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Fox-1,TensorOpera,https://huggingface.co/tensoropera/Fox-1-1.6B-Instruct-v0.1,1.6,,Dense,███,"1,879:1",███,0.2,44.99,,,,web-scale,Nov/2024,🟢,A,https://arxiv.org/abs/2411.05281,,Gold standard for dataset documentation,460,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Hunyuan-Large,Tencent,https://huggingface.co/tencent/Tencent-Hunyuan-Large,███,52,MoE,"7,000",18:1,███,5.5,89.9,60.2,42.4,,"synthetic, web-scale",Nov/2024,🟢,A,https://arxiv.org/abs/2411.02265,,"'Hunyuan-Large is pre-trained on 7T tokens, which contains nearly 1.5T tokens of high-quality and diverse synthetic data.' '389 billion parameters and 52 billion activation parameters, capable of handling up to 256K tokens'",459,███,███,███,███,Other,"262,144",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ SEA-LIONv3,AI Singapore,https://huggingface.co/aisingapore/gemma2-9b-cpt-sea-lionv3-base,9.24,,Dense,"8,200",888:1,███,0.9,,,,,web-scale,Nov/2024,🟢,A,https://www.linkedin.com/posts/leslieteo01_ai-machinelearning-nlp-activity-7258042808891027456-Tqab/,,███,458,███,███,███,███,Gemma,███,Singapore,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ AMD OLMo,AMD,https://huggingface.co/amd/AMD-OLMo,1,,Dense,"1,308","1,308:1",███,0.1,30.52,,,,web-scale,Nov/2024,███,A,https://www.amd.com/en/developer/resources/technical-articles/introducing-the-first-amd-1b-language-model.html,,1 billion parameter LMs trained from scratch using 1.3T tokens on a cluster of AMD Instinct MI250 GPUs.,457,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ SmolLM2,Hugging Face,https://huggingface.co/collections/HuggingFaceTB/smollm2-6723884218bcda64b34d7db9,1.7,,Dense,"1,000",589:1,███,0.1,42.3,,,,"synthetic, 
web-scale",Nov/2024,🟢,███,https://huggingface.co/collections/HuggingFaceTB/smollm2-6723884218bcda64b34d7db9,,"Base and instruct versions, with Apache 2.0 license",456,███,███,███,███,Apache 2.0,"8,192",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Aya-Expanse-32B,Cohere,https://huggingface.co/CohereForAI/aya-expanse-32b,32,,Dense,"8,000",250:1,███,1.7,,,,,"synthetic, web-scale",Oct/2024,🟢,C,https://cohere.com/blog/aya-expanse-connecting-our-world,,███,455,███,███,███,███,CC-BY-NC 4.0,"8,192",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Claude 3.5 Sonnet (new),Anthropic,https://claude.ai/,400,,Dense,"20,000",50:1,███,9.4,90.5,78,65,,web-scale,Oct/2024,███,D,https://assets.anthropic.com/m/61e7d27f8c8f5919/original/Claude-3-Model-Card.pdf#page=51,SOTA,Absurd naming scheme. Paper addendum pp51-64: https://assets.anthropic.com/m/61e7d27f8c8f5919/original/Claude-3-Model-Card.pdf#page=51,454,███,███,███,███,███,"200,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Granite 3.0 8B,IBM,https://huggingface.co/ibm-granite/granite-3.0-8b-base,8,,Dense,"12,000",███,███,1.0,65.54,33.27,32.13,,web-scale,Oct/2024,🟢,A,http://ibm.biz/granite-report,,Announce: https://www.ibm.com/new/ibm-granite-3-0-open-state-of-the-art-enterprise-models,453,███,███,███,███,Apache 2.0,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Granite-3.0-3B-A800M-Instruct,IBM,https://huggingface.co/ibm-granite/granite-3.0-3b-a800m-instruct,3,0.8,MoE,"10,000","3,334:1",███,0.6,50.16,20.51,26.85,,web-scale,Oct/2024,🟢,███,http://ibm.biz/granite-report,,Announce: https://www.ibm.com/new/ibm-granite-3-0-open-state-of-the-art-enterprise-models,452,███,███,███,███,███,"4,096",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ aiXcoder-7B,aiXcoder,https://github.com/aixcoder-plugin/aixcoder-7b,7,,Dense,"1,200",172:1,███,0.3,,,,,"code, The 
Stack",Oct/2024,🟢,A,https://arxiv.org/abs/2410.13187v1,,███,451,███,███,███,███,Apache 2.0,"8,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Llama-3.1-Nemotron-70B,NVIDIA,https://build.nvidia.com/nvidia/llama-3_1-nemotron-70b-instruct,70,,Dense,"15,000",███,███,3.4,,,,,web-scale,Oct/2024,🟢,C,https://build.nvidia.com/nvidia/llama-3_1-nemotron-70b-instruct/modelcard,,Related paper: https://arxiv.org/abs/2410.01257,450,███,███,███,███,███,"131,072",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Ministral 8B,Mistral,https://huggingface.co/mistralai/Ministral-8B-Instruct-2410,8,,Dense,"3,000",375:1,███,0.5,65,,,███,web-scale,Oct/2024,🟢,A,https://mistral.ai/news/ministraux/,,"""Introducing the world’s best edge models""",449,███,███,███,███,Other,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Yi-Lightning,01-ai,https://platform.lingyiwanwu.com/,200,10,MoE,"10,000",50:1,███,4.7,,,,,web-scale,Oct/2024,🟢,D,https://platform.lingyiwanwu.com/docs#%E6%A8%A1%E5%9E%8B%E4%B8%8E%E8%AE%A1%E8%B4%B9,,███,448,███,███,███,███,Proprietary,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Zamba2-7B,Zyphra,https://huggingface.co/Zyphra/Zamba2-7B,7,,Dense,"3,100",443:1,███,███,67.2,,,,web-scale,Oct/2024,🟢,A,https://www.zyphra.com/post/zamba2-7b,,"Mamba2 ""trained on 128 H100 GPUS for approximately 50 days using our internal training framework developed atop Megatron-LM""",447,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ nGPT,NVIDIA,https://github.com/lucidrains/nGPT-pytorch,1,,Dense,400,400:1,███,0.07,,,,,web-scale,Oct/2024,🟢,A,https://arxiv.org/abs/2410.01131,,███,446,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Inflection-3 Pi (3.0),Inflection 
AI,https://developers.inflection.ai/,1200,,Dense,"20,000",17:1,███,16.3,,███,,,web-scale,Oct/2024,🟢,D,https://developers.inflection.ai/docs,,"Inference via Intel Gaudi® 3 128 GB, on-premise available. Minimum spend $100 credits.",445,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Inflection-3 Productivity (3.0),Inflection AI,https://developers.inflection.ai/,1200,,Dense,"20,000",17:1,███,16.3,,,,,web-scale,Oct/2024,███,D,https://developers.inflection.ai/docs,,"Inference via Intel Gaudi® 3 128 GB, on-premise available. Minimum spend $100 credits.",444,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ LFM-40B,Liquid AI,███,40,12,MoE,"2,000",50:1,███,0.9,78.76,55.63,,,web-scale,Sep/2024,🟢,C,https://www.liquid.ai/liquid-foundation-models,,"Some controversy/concern over company. Liquid Foundation Models (LFM). ""Human preference optimization techniques have not been applied extensively to our models yet.""",443,███,███,███,███,███,"32,768",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ SFR-LLaMA-3.1-70B-Judge,Salesforce,https://blog.salesforceairesearch.com/sfr-judge/,70,,Dense,"15,000",215:1,███,3.4,,,███,,web-scale,Sep/2024,🔴,A,https://arxiv.org/abs/2409.14664,,"Code coming soon: https://github.com/SalesforceAIResearch/SFRJudge ""we opt to focus on datasets that evaluate modern (2023 and beyond) LLM responses, as older datasets likely contain lower quality responses from less capable models, with correspondingly stale annotations. 
We supplement human-annotated data with synthetically generated data to endow our judge models with specific capabilities (e.g., following fine-grained rubrics in evaluation)""",442,███,███,███,███,███,"128,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Emu3,BAAI,https://huggingface.co/BAAI/Emu3-Gen,8,,Dense,"1,000",125:1,███,0.3,███,,,,special,Sep/2024,🟢,C,https://arxiv.org/abs/2409.18869,,"VLM. Dataset estimates are based on the unrelated UW/Salesforce dataset MINT-1T (3.4B images, 927M documents) https://arxiv.org/abs/2406.11271v1",441,███,███,███,███,Apache 2.0,"9,216",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ NVLM 1.0,NVIDIA,███,72,,Dense,"18,000",250:1,███,3.8,82,,,,web-scale,Sep/2024,🟢,A,https://arxiv.org/abs/2409.11402,,"Flamingo clone. ""we use Qwen2-72B-Instruct as the default text-only LLM backbone. We also employ Nous-Hermes-2-Yi-34B for ablation study and faster experimentation... we use InternViT-6B as the default vision encoder""",440,███,███,███,███,CC-BY-NC 4.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Unnamed 1T,China Telecom Artificial Intelligence Research Institute,███,1000,,Dense,"20,000",20:1,███,14.9,,,,,web-scale,Sep/2024,🔴,D,https://www.scmp.com/tech/big-tech/article/3280588/china-telecom-say-ai-model-1-trillion-parameters-trained-chinese-chips,,"Trained on Chinese GPUs: ""Ascend Atlas 800T A2 training server – a Huawei product listed as supporting the Kunpeng 920 7265 or Kunpeng 920 5250 processors"" https://www.theregister.com/2024/10/02/china_telecom_model_trained_local_tech/",439,███,███,███,███,███,"4,000",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ TeleChat2-115B,China Telecom Artificial Intelligence Research Institute,https://modelscope.cn/models/TeleAI/TeleChat2-115B,115,,Dense,"10,000",87:1,███,3.6,80.9,,███,,web-scale,Sep/2024,🟢,A,https://arxiv.org/abs/2507.18013,,"Trained on Chinese GPUs: ""Ascend Atlas 
800T A2 training server – a Huawei product listed as supporting the Kunpeng 920 7265 or Kunpeng 920 5250 processors"" https://www.theregister.com/2024/10/02/china_telecom_model_trained_local_tech/",438,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ AMD-Llama-135m,AMD,https://huggingface.co/amd/AMD-Llama-135m,0.135,,Dense,670,"4,963:1",███,0.03,23.02,,,,books,Sep/2024,███,A,https://www.amd.com/en/developer/resources/technical-articles/introducing-amd-first-slm-135m-model-fuels-ai-advancements.html,,"Small language model (SLM). Trained on AMD Instinct™ MI250 accelerators. ""Pretrain Dataset: We employed the SlimPajama and Project Gutenberg dataset to pretrain the 135M model. Project Gutenberg is a library of over 70,000 free eBooks approximately. This sums up to 670B tokens""",437,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Llama 3.2 90B,Meta AI,https://www.llama.com/,90,,Dense,███,100:1,███,3.0,,,,,web-scale,Sep/2024,🟢,A,https://www.llama.com/,,Vision (VLM),436,███,███,███,███,███,"131,072",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Llama 3.2 3B,Meta AI,https://www.llama.com/,3.21,,Dense,"9,000","2,804:1",███,0.6,63.4,,32.8,,web-scale,Sep/2024,███,A,https://www.llama.com/,,"Text (LLM). ""Pre-training. [For Llama 3.2 3B] We prune the models from their 8B siblings and use logits from the 8B and 70B models as token-level targets (token-level distillation). We then use knowledge distillation to recover performance.""",435,███,███,███,███,Llama 3.2,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Molmo,Allen AI,https://molmo.allenai.org/,72,,Dense,"7,000",98:1,███,2.4,,███,,,web-scale,Sep/2024,🟢,A,https://molmo.allenai.org/paper.pdf,,ViT: Llava as Qwen2 (or Olmo) + CLIP. Multimodal Open Language Model built by Ai2. 
Announce: https://molmo.allenai.org/blog,434,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemini-1.5-Pro-002 ,Google DeepMind,https://aistudio.google.com/app/prompts/new_chat,200,10,MoE,"30,000",150:1,███,8.2,,75.8,59.1,,web-scale,Sep/2024,███,D,https://developers.googleblog.com/en/updated-production-ready-gemini-models-reduced-15-pro-pricing-increased-rate-limits-and-more/,SOTA,Sparse MoE. Context window=2M. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/,433,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen2.5,Alibaba,https://huggingface.co/Qwen/Qwen2.5-72B-Instruct,72,,Dense,"18,000",250:1,███,3.8,86.1,71.1,49,,web-scale,Sep/2024,🟢,███,https://arxiv.org/abs/2412.15115,,"Dense 72B (80 layers, GQA, SwiGLU, RoPE base 1M); 128K context via YARN + Dual Chunk Attention; pretrained on 18T tokens (up from 7T in Qwen2) with SFT on 1M+ examples and GRPO online RL; outperforms Llama-3-405B-Instruct on MMLU-redux (86.8 vs. 86.2) and MATH (83.1 vs. 73.8) while being ~5x smaller. Open-weight.",432,███,███,███,███,███,"131,072",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GRIN MoE,Microsoft,https://huggingface.co/microsoft/GRIN-MoE,███,6.6,MoE,"4,025",68:1,███,1.6,79.4,,,,"synthetic, web-scale",Sep/2024,🟢,A,https://huggingface.co/microsoft/GRIN-MoE/blob/main/GRIN_MoE.pdf,,"16x3.8B ""only 6.6B activate parameters"". GRIN=GRadient-INformed. ""GRIN MoE is pre-trained on 4T tokens as a Causal Language Model. 
The same training dataset has been used to train Phi-3 dense models""",431,███,███,███,███,███,"4,096",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Data-Gemma,Google DeepMind,https://huggingface.co/google/datagemma-rig-27b-it,27,,███,"13,000",482:1,███,2.0,,,,,web-scale,Sep/2024,🟢,A,https://docs.datacommons.org/papers/DataGemma-FullPaper.pdf,,"RAG/RIG: ""the LLM is fine-tuned to produce natural language Data Commons queries alongside statistics""",430,███,███,███,███,███,"8,192",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ o1-preview,OpenAI,https://chatgpt.com/,200,10,MoE,"20,000",100:1,███,6.7,92.3,91,78.3,8.8,web-scale,Sep/2024,🟢,███,https://openai.com/index/introducing-openai-o1-preview/,"Reasoning, SOTA","OpenAI's first public thinking model using chain-of-thought RL; reaches 89th percentile on competitive Codeforces problems, top-500 on a USAMO qualifier, and exceeds PhD-level accuracy on GPQA; preceded the full o1 release. API-only, closed weights.",429,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Reader-LM,Jina AI,https://huggingface.co/jinaai/reader-lm-1.5b,1.54,,Dense,███,2:1,███,0.007,,,,,special,Sep/2024,🟢,A,https://jina.ai/news/reader-lm-small-language-models-for-cleaning-and-converting-html-to-markdown/,,"HTML->Markdown. 
Specialist small model; outperforms GPT-4o general model, does not outperform Gemini Pro 1.5.",428,███,███,███,███,███,"256,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Pixtral-12b-240910,Mistral,https://huggingface.co/mistralai/Pixtral-12B-2409,12,███,Dense,"6,000",500:1,███,0.9,69.2,,,,web-scale,Sep/2024,🟢,C,https://mistral.ai/news/pixtral-12b/,,"""Pixtral was trained to be a drop-in replacement for Mistral Nemo 12B.""",427,███,███,███,███,Apache 2.0,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeepSeek-V2.5,DeepSeek-AI,https://huggingface.co/deepseek-ai/DeepSeek-V2.5,236,21,MoE,"10,200",44:1,███,███,,,,,web-scale,Sep/2024,🟢,A,https://github.com/deepseek-ai/DeepSeek-Coder-V2/blob/main/paper.pdf,,"""DeepSeek-V2.5 is an upgraded version that combines DeepSeek-V2-Chat and DeepSeek-Coder-V2-Instruct.""",426,███,███,███,███,DeepSeek,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Yi-Coder,01-ai,https://huggingface.co/collections/01-ai/yi-coder-66bdb00f5bdd611f9a008f30,9,,Dense,"6,200",689:1,███,0.8,,,,,web-scale,Sep/2024,🟢,A,███,,"6B=3T tokens, 9B=+0.8T tokens, 9B-Coder=+2.4T tokens=6.2T tokens. See Yi 1.5 34B in this table",425,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ OLMoE-1B-7B,Allen AI,https://huggingface.co/collections/allenai/olmoe-66cf678c047657a30c8cd3da,6.9,1,MoE,"5,900",856:1,███,0.7,███,,23,,web-scale,Sep/2024,🟢,A,https://arxiv.org/abs/2409.02060v1,,"Open Language (OL) Mixture of Experts (MoE). ""We train OLMoE-1B-7B for 5 trillion tokens, however, some recent dense models train significantly longer, such as Llama 3 with 15 trillion tokens. To the best of our knowledge, there has been no large MoE that has been overtrained as much as OLMoE-1B-7B. Specifically, taking the active parameters of OLMoE-1B-7B, our token multiplier is around 5,000 (5T / 1B). 
There are likely benefits to training even longer, but to what degree overtraining is effective for MoEs and how it differs from dense models still requires more research.""",424,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ PLLuM,Consortium,,20,,Dense,"2,000",100:1,███,0.7,,,,,web-scale,Aug/2024,🟢,F,https://opi.org.pl/en/the-launch-of-the-first-polish-open-large-language-model-pllum/,,Polish Large Language Model. Not yet available as of Sep/2024,███,███,███,███,███,███,"8,000",International,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ xLAM,Salesforce,https://huggingface.co/Salesforce/xLAM-8x22b-r,141,39,MoE,"8,000",57:1,███,3.5,,,,,web-scale,Aug/2024,🟢,C,https://huggingface.co/Salesforce/xLAM-8x22b-r,███,"64K sequence length. Released under Apache-2.0. Dataset: Mixtral-8x22B base (~8T tokens, Mistral-era norm) + xlam-function-calling-60k fine-tune (negligible) = ~8000B.",422,███,███,███,███,CC-BY-NC 4.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ LTM-2-mini,Magic,https://magic.dev/blog/100m-token-context-windows,20,,Dense,"2,000",100:1,███,███,,,,,web-scale,Aug/2024,🔴,D,https://magic.dev/blog/100m-token-context-windows,,Context=100M tokens equals ~10 million lines of code or ~750 novels.,421,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Rene,Cartesia,https://huggingface.co/cartesia-ai/Rene-v0.1-1.3b-pytorch,1.3,,Dense,"1,500","1,154:1",███,0.1,32.6,,,███,web-scale,Aug/2024,🟢,A,https://cartesia.ai/blog/2024-08-27-on-device,,"On-device. 
""hybrid architecture based on Mamba-2, with feedforward and sliding window attention layers interspersed""",420,███,███,███,███,Apache 2.0,"8,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemini 1.5 Flash-8B,Google DeepMind,https://ai.google.dev/,8,,███,"8,000","1,000:1",███,0.8,68.1,,30.8,,web-scale,Aug/2024,🟢,C,https://arxiv.org/abs/2403.05530,,Announce: https://x.com/OfficialLoganK/status/1828480085353234535 1M context for all modalities. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/,419,███,███,███,███,Proprietary,"1,000,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Pharia-1-LLM-7B,Aleph Alpha,https://huggingface.co/Aleph-Alpha/Pharia-1-LLM-7B-control,7,,███,"7,700","1,100:1",███,0.8,,,,,web-scale,Aug/2024,🟢,A,https://aleph-alpha.com/introducing-pharia-1-llm-transparent-and-compliant/,,"Aleph Alpha's 7B GPT-style decoder (27 layers, GQA, RoPE, 128K vocab) trained on 7.7T tokens across two phases; 8,192-token context; optimized for German, French, Spanish, and English with EU regulatory compliance; released under Open Aleph License in two variants: 7B-control (instruction-tuned) and 7B-control-aligned (KTO safety-aligned).",418,███,███,███,███,███,"8,192",Germany,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ TTT-Linear,Stanford,https://github.com/test-time-training/ttt-lm-jax,1.3,,Dense,26,20:1,███,0.02,,,,,███,Aug/2024,🟢,A,https://arxiv.org/abs/2407.04620,,"Test-Time Training (TTT) layers. Real-time learning by Stanford, UC, and Meta. 
Potential for frontier models in 2025+.",417,███,███,███,███,MIT,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Jamba 1.5,AI21,https://huggingface.co/collections/ai21labs/jamba-15-66c44befa474a917fcf55251,398,94,MoE,"1,200",4:1,███,2.3,81.2,53.5,36.9,,web-scale,Aug/2024,🟢,C,https://arxiv.org/abs/2408.12570,,"Jamba 1.5 Mini (12B active/52B total) and Jamba 1.5 Large (94B active/398B total) are also optimized for business use cases and capabilities such as function calling, structured output (JSON), and grounded generation.",███,███,███,███,███,███,"262,144",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ phi-3.5-MoE,Microsoft,https://huggingface.co/microsoft/Phi-3.5-MoE-instruct,42,6.6,MoE,"4,900",117:1,███,1.5,78.9,54.3,36.8,███,"synthetic, web-scale",Aug/2024,🟢,A,https://arxiv.org/abs/2404.14219,,"""Phi-3.5-MoE with 16 x 3.8B parameters and 6.6B active parameters achieves superior performance in language reasoning, math, and code compared to Llama 3.1 and Mixtral, and on par with Gemini-1.5-Flash and GPT-4o-mini."" Trained on 4.9T tokens (10% multilingual); 128K context; 22 languages; MMLU 78.9, HumanEval 70.7.",415,███,███,███,███,MIT,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ phi-3.5-mini,Microsoft,███,3.8,,Dense,"3,400",895:1,███,0.4,65.5,47.4,25.2,,"synthetic, web-scale",Aug/2024,🟢,A,https://arxiv.org/abs/2404.14219,,"3.8B dense decoder trained on 4.9T tokens with 128K context; achieves MMLU 69.0, ARC-C 84.6, GSM8K 86.2, HumanEval 62.8; competitive with 7B-12B models on reasoning and math. 
Recommended with RAG for knowledge-heavy tasks.",414,███,███,███,███,MIT,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Minitron-4B,NVIDIA,https://huggingface.co/nvidia/Minitron-4B-Base,4,,Dense,94,24:1,███,0.06,███,,,,web-scale,Aug/2024,🟢,C,https://arxiv.org/abs/2407.14679,,Pruned and distilled from Nemotron-4 15B: https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/,413,███,███,███,███,███,"4,096",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ sarvam-2b,Sarvam AI,https://huggingface.co/sarvamai/sarvam-2b-v0.5,2,,Dense,"4,000","2,000:1",███,0.3,,,,,web-scale,Aug/2024,🟢,A,https://huggingface.co/sarvamai/sarvam-2b-v0.5,███,"Indic languages supported are: Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.",412,███,███,███,███,Other,"8,192",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Grok-2,xAI,https://huggingface.co/xai-org/grok-2,400,,Dense,"15,000",38:1,███,8.2,87.5,75.5,███,3.9,web-scale,Aug/2024,🟢,D,https://x.ai/blog/grok-2,SOTA,"MMLU-Pro=75.5=SOTA. Claude 3.5S MMLU-Pro=72.83. 
""Grok-2 has been tested on the LMSYS leaderboard under the name ""sus-column-r."" At the time of this blog post, it is outperforming both Claude 3.5 Sonnet and GPT-4-Turbo."" [Alan: Grok is Heinlein, Sixth Column is also Heinlein: https://en.wikipedia.org/wiki/Sixth_Column ]",411,███,███,███,███,███,"131,072",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ EXAONE 3.0,LG,https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct,7.8,,Dense,"8,000","1,026:1",███,0.8,,27.4,10.1,,web-scale,Aug/2024,🟢,A,https://arxiv.org/abs/2408.03541,███,“EXAONE”=“EXpert AI for EveryONE”,410,███,███,███,███,Other,"8,192",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Falcon Mamba 7B,TII,https://falconllm.tii.ae/falcon-models.html,7,,Dense,"6,000",858:1,███,0.7,███,14.47,8.05,,web-scale,Aug/2024,🟢,C,https://falconllm.tii.ae/tii-releases-first-sslm-with-falcon-mamba-7b.html,,https://huggingface.co/spaces/tiiuae/falcon-mamba-playground,409,███,███,███,███,Other,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Palmyra-Med-70B,Writer,https://huggingface.co/Writer/Palmyra-Med-70B-32K,70,,Dense,"1,200",18:1,███,███,,,,,special,Jul/2024,🟢,C,https://writer.com/blog/palmyra-med-fin-models/,,Medical. MMLU Medical Genetics=94.0,408,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Palmyra-Fin-70B,Writer,https://huggingface.co/Writer/Palmyra-Fin-70B-32K,70,,Dense,"1,200",18:1,███,1.0,,,███,,special,Jul/2024,🟢,C,https://writer.com/blog/palmyra-med-fin-models/,,"Financial. ""across a variety of real-world financial use cases. 
It outperformed popular models like Claude 3.5 Sonnet, GPT-4o, and Mixtral-8x7b""",407,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Zamba2-small,Zyphra,https://huggingface.co/Zyphra/Zamba2-2.7B,2.7,,Dense,"3,100","1,149:1",███,███,55,,,,web-scale,Jul/2024,🟢,A,https://www.zyphra.com/post/zamba2-small,,Mamba2,406,███,███,███,███,███,"4,096",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Minitron-8B,NVIDIA,https://huggingface.co/nvidia/Mistral-NeMo-Minitron-8B-Base,8,,Dense,94,12:1,███,0.09,63.8,,,,web-scale,Jul/2024,🟢,C,https://blogs.nvidia.com/blog/mistral-nemo-minitron-8b-small-language-model/,███,Pruned and distilled from Nemotron-4 15B: https://developer.nvidia.com/blog/how-to-prune-and-distill-llama-3-1-8b-to-an-nvidia-llama-3-1-minitron-4b-model/,405,███,███,███,███,███,"8,192",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mistral Large 2,Mistral,https://huggingface.co/mistralai/Mistral-Large-Instruct-2407,███,,Dense,"8,000",66:1,███,3.3,84,,,,web-scale,Jul/2024,🟢,C,https://mistral.ai/news/mistral-large-2407/,,Fits on a single node for inference.,404,███,███,███,███,███,"131,072",France,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Llama 3.1 405B,Meta AI,https://www.meta.ai/,405,,Dense,"15,600",39:1,███,8.4,88.6,73.3,███,,web-scale,Jul/2024,🟢,A,https://ai.meta.com/research/publications/the-llama-3-herd-of-models/,SOTA,Announce: https://ai.meta.com/blog/meta-llama-3-1/ Model card: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/MODEL_CARD.md,403,███,███,███,███,███,"131,072",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-4o mini,OpenAI,https://chatgpt.com/,8,0.4,MoE,"13,000","1,625:1",███,1.1,82,,40.2,,web-scale,Jul/2024,🟢,C,https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/,,"Omnimodel. 
""OpenAI would not disclose exactly how large GPT-4o mini is, but said it’s roughly in the same tier as other small AI models, such as Llama 3 8b, Claude Haiku and Gemini 1.5 Flash."" https://techcrunch.com/2024/07/18/openai-unveils-gpt-4o-mini-a-small-ai-model-powering-chatgpt/ ""tested GPT-4o to identify potential risks, which we have addressed and plan to share the details of in the forthcoming GPT-4o system card and Preparedness scorecard."" And related paper about instruction hierarchy: https://arxiv.org/abs/2404.13208",███,███,███,███,███,Proprietary,"128,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ NeMo,Mistral,https://huggingface.co/mistralai/Mistral-Nemo-Base-2407,███,,Dense,"2,000",167:1,███,0.5,68,,,,web-scale,Jul/2024,🟢,C,https://mistral.ai/news/mistral-nemo/,,"With NVIDIA. ""Drop-in replacement of Mistral 7B"". ""trained using Megatron-LM, part of NVIDIA NeMo, with 3,072 H100 80GB Tensor Core GPUs"" https://blogs.nvidia.com/blog/mistral-nvidia-ai-model/",401,███,███,███,███,Apache 2.0,███,France,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Codestral Mamba,Mistral,https://huggingface.co/mistralai/mamba-codestral-7B-v0.1,7,███,Dense,"2,000",286:1,███,0.4,,,,,web-scale,Jul/2024,🟢,C,https://mistral.ai/news/codestral-mamba/,,"""Unlike Transformer models, Mamba models offer the advantage of linear time inference and the theoretical ability to model sequences of infinite length.""",400,███,███,███,███,Apache 2.0,███,France,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mathstral,Mistral,https://huggingface.co/mistralai/mathstral-7B-v0.1,7,,Dense,"2,000",286:1,███,0.4,███,,,,web-scale,Jul/2024,🟢,C,https://mistral.ai/news/mathstral/,,"""We’re contributing Mathstral to the science community to bolster efforts in advanced mathematical problems requiring complex, multi-step logical reasoning.""",399,███,███,███,███,Apache 
2.0,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ SpreadsheetLLM,Microsoft,,1760,88,MoE,"13,000",8:1,███,15.9,,,,,web-scale,Jul/2024,🔴,F,https://arxiv.org/abs/2407.09025v1,,"Notable finetune of GPT4-0125-preview ""outperforming the vanilla approach by 25.6% in GPT4’s in-context learning setting""",███,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Spectra,Consortium,https://huggingface.co/SpectraSuite,3.9,,Dense,300,77:1,███,0.1,32.8,,███,,"synthetic, web-scale",Jul/2024,🟢,A,https://arxiv.org/abs/2407.12327,,"AKA TriLM. ""Spectra LLM suite, the first open suite of LLMs spanning multiple bit-widths, including FloatLMs, QuantLMs, and TriLMs, ranging from 99M to 3.9B parameters trained on 300B tokens.""",397,███,███,███,███,Apache 2.0,███,International,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ next-gen,DeepL,https://www.deepl.com/en/translator,7,,Dense,"1,000",███,███,0.3,,,,,special,Jul/2024,🟢,D,https://www.deepl.com/en/blog/next-gen-language-model,,"""Built using our own groundbreaking, specialized LLM technology and proprietary training data, designed specifically for translation""",396,███,███,███,███,Proprietary,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ SmolLM,Hugging Face,https://huggingface.co/collections/HuggingFaceTB/smollm-6695016cad7167254ce15966,1.7,,Dense,"1,000",589:1,███,0.1,39.97,,,,web-scale,Jul/2024,🟢,A,███,,"Dataset includes new Cosmopedia v2 synthetic data. 135M and 360M models, each trained on 600B tokens from Smollm-Corpus. 
1.7B model trained on 1T tokens from Smollm-Corpus.",395,███,███,███,███,Apache 2.0,"2,048",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mockingbird,Vectara,https://vectara.com/platform/,9,,Dense,"1,000",112:1,███,0.3,███,,,,web-scale,Jul/2024,🟢,C,https://vectara.com/blog/mockingbird-a-rag-and-structured-output-focused-llm/,,"""At <10B parameters it's an LLM trained to provide optimal results for RAG and structured outputs.""",394,███,███,███,███,Apache 2.0,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ FLAMe,Google DeepMind,,24,,Dense,"1,000",42:1,███,███,,,,,dialogue,Jul/2024,🔴,D,https://arxiv.org/abs/2407.10817v1,,LLM-as-a-Judge autorater. Foundational Large Autorater Models (FLAMe). Uses an instruction-tuned PaLM-2-24B model. Unrelated to Microsoft FLAME Jan/2023.,393,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Step-2,StepFun,https://platform.stepfun.com/#language-step2,1000,50,███,"13,000",13:1,███,12.0,82.9,63,,,web-scale,Jul/2024,🟢,C,https://platform.stepfun.com/docs/llm/text,,"Launched early Jul/2024: https://pandaily.com/stepfun-releases-three-large-models-of-the-step-series/ ""StepFun, founded in April 2023 with the mission to “Scale-up possibilities for everyone,” unites top talent in artificial intelligence from both domestic and international backgrounds, and is dedicated to advancing toward AGI. 
The company has already launched the Step series of foundation models, which includes Step-2, a cutting-edge trillion-parameter Mixture of Experts (MoE) language model; Step-1.5V, a powerful multimodal large model; and Step-1V, an innovative image generation model, among others.""",392,███,███,███,███,Proprietary,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ H2O-Danube3-4B,H2O.ai,https://h2o.ai/platform/danube/personal-gpt/,4,███,Dense,"6,000","1,500:1",███,0.5,55.18,,,,"synthetic, web-scale",Jul/2024,🟢,A,https://arxiv.org/abs/2407.09276,,"Runs natively and fully offline on mobile phone. ""H2O-Danube3 is a family of decoder only LLM models that use the general Llama model architecture adopting core principles from Llama 2 and Mistral with custom parameters determining the shape of each layer and total parameter count. We use the Mistral tokenizer..."" MMLU for chat=54.74, base=55.18 via https://huggingface.co/h2oai/h2o-danube3-4b-base",391,███,███,███,███,Apache 2.0,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Causal Axioms,Microsoft,,0.067,,Dense,1,18:1,███,0.001,,,,,"synthetic, web-scale",Jul/2024,🔴,B,https://arxiv.org/abs/2407.07612v1,,"""the training dataset follows a specific structure, we develop a custom tokenizer. Alphanumeric node names are tokenized at a character level, while special terms such as ‘causes’, ‘Does’, ‘cause’, ‘Yes’, and ‘No’ are tokenized at the word level... Our training setup consists of around 175k instances of sequential chains with size of chains ranging from 3 to 6 nodes... All models are trained for 100 epochs. 
[LifeArchitect.ai estimate is 12 tokens per node x 6 nodes x 175,000 instances x 100 epochs = 1.26B tokens]"" Based on GPT-2 arch.",███,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ SenseNova 5.5,SenseTime,https://platform.sensenova.cn/home#/home,600,30,MoE,"10,000",17:1,███,8.2,,,███,,"synthetic, web-scale",Jul/2024,🟢,D,https://www.sensetime.com/en/news-detail/51168278?categoryId=1072,,"""The model training was based on over 10TB tokens [sic, taken as 10T tokens instead of 10TB=2T tokens] of high-quality training data, including a large amount of synthetically-generated reasoning chain data, which help to enhance its reasoning capabilities."" & ""The updates include SenseNova 5o, the first real-time multimodal model in China, which provides a new AI interaction model on par with GPT-4o’s streaming interaction capabilities""",389,███,███,███,███,Proprietary,"128,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Helium 7B,Kyutai,███,7,,Dense,"1,000",143:1,███,0.3,,,,,"synthetic, web-scale",Jul/2024,🟢,C,https://youtu.be/hm2IJSKcYvo,,"""1. The model is fine-tuned on 100K transcripts generated by Helium itself. 2. These transcripts are highly detailed, heavily annotated with emotion and style, and conversational. 3. 
Text to Speech Engine is further fine-tuned on 20 hours of audio recorded by Alice and licensed.""",388,███,███,███,███,Proprietary,███,France,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ InternLM2.5,Shanghai AI Laboratory/SenseTime,https://huggingface.co/internlm/internlm2_5-20b-chat,20,,Dense,"2,600",130:1,███,0.8,73.5,,38.4,,web-scale,Jul/2024,🟢,C,https://github.com/InternLM/InternLM/blob/main/model_cards/internlm2.5_7b.md,,"""The release of InternLM2.5 series contains 7B model size for now and we are going to release the 1.8B and 20B versions soon"" [20B released around 1/Aug/2024]",███,███,███,███,███,Other,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Tele-FLM-1T,BAAI,https://huggingface.co/CofeAI/Tele-FLM-1T,███,,Dense,"15,700",16:1,███,13.2,,,,,web-scale,Jul/2024,🟢,A,https://arxiv.org/abs/2407.02783,,"Technical arch testing only, ratio is too low for decent performance.",386,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ YuLan-Base-12B,Renmin,https://github-com.translate.goog/RUC-GSAI/YuLan-Chat?_x_tr_sl=zh-CN&_x_tr_tl=en&_x_tr_hl=en&_x_tr_pto=sc,12,,Dense,"1,700",142:1,███,0.5,55.7,,███,,web-scale,Jul/2024,🟢,A,https://arxiv.org/abs/2406.19853,,"""YuLan's training is finished on Jan, 2024 and has achieved performance on par with state-of-the-art LLMs across various English and Chinese benchmarks.""",385,███,███,███,███,MIT,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ERNIE 4.0 Turbo,Baidu,https://yiyan.baidu.com/,200,,Dense,"20,000",100:1,███,6.7,,,,,web-scale,Jun/2024,🟢,D,https://www.reuters.com/technology/artificial-intelligence/baidu-launches-upgraded-ai-model-says-user-base-hits-300-mln-2024-06-28/,,███,384,███,███,███,███,Proprietary,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemma 2,Google 
DeepMind,https://huggingface.co/google/gemma-2-27b-it,27,,Dense,"13,000",███,███,2.0,75.2,,,,web-scale,Jun/2024,🟢,A,https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf,,Announce: https://blog.google/technology/developers/google-gemma-2/,383,███,███,███,███,Gemma,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ CriticGPT,OpenAI,,3,,Dense,"1,000",334:1,███,0.2,,,,███,special,Jun/2024,🔴,F,https://cdn.openai.com/llm-critics-help-catch-llm-bugs-paper.pdf,,"""LLM Critics Help Catch LLM Bugs"" Announce: https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/",382,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ 4M-21,Apple,https://github.com/apple/ml-4m/,3,,Dense,"1,000",334:1,███,0.2,,,,,special,Jun/2024,🟢,C,https://arxiv.org/abs/2406.09406,███,"Vision model based on T5-XXL. Modalities: RGB, Caption, Bounding boxes, Semantic segmentation, Depth, Human poses, Surface normals, CLIP, DINOv2, ImageBind, Metadata, Canny edges, SAM edges, SAM instances, Color palette. 
Project page: https://4m.epfl.ch/",381,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ESM3,EvolutionaryScale,https://github.com/evolutionaryscale/esm,98,,Dense,771,8:1,███,███,,,,,special,Jun/2024,🟡,A,https://www.evolutionaryscale.ai/blog/esm3-release,,"Biology large language model: ""sequence, structure, and function are all masked and predicted during training, ESM3 can generate in all three modalities."" 1.4B only released.",380,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ PanGu 5.0 Super,Huawei,https://www.huaweicloud.com/intl/en-us/product/modelarts.html,1000,50,MoE,███,20:1,███,14.9,,,,,web-scale,Jun/2024,🟡,C,https://www.huaweicentral.com/huawei-cloud-unveils-pangu-large-model-5-0/,,https://x.com/faridofanani96/status/1804079517193113850/photo/1,379,███,███,███,███,Proprietary,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Claude 3.5 Sonnet,Anthropic,https://poe.com/Claude-3.5-Sonnet,400,,Dense,"15,000",38:1,███,8.2,88.7,███,67.2,4.8,web-scale,Jun/2024,🟢,D,https://www.anthropic.com/news/claude-3-5-sonnet,SOTA,MMLU=90.4 with prompting. 
Model card: https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf,378,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeepSeek-Coder-V2,DeepSeek-AI,https://chat.deepseek.com/coder,236,21,███,"10,200",44:1,███,5.2,79.2,63.63,,,web-scale,Jun/2024,🟢,A,https://github.com/deepseek-ai/DeepSeek-Coder-V2/blob/main/paper.pdf,,DeepSeek-V2 with additional 6 trillion tokens.,377,███,███,███,███,DeepSeek,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DCLM-Baseline 7B 2.6T,International,https://huggingface.co/apple/DCLM-Baseline-7B,7,,Dense,"2,600",372:1,███,0.4,63.7,,,,web-scale,Jun/2024,🟡,A,https://arxiv.org/abs/2406.11794,,███,376,███,███,███,███,███,"2,000",International,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Nemotron-4-340B,NVIDIA,https://build.nvidia.com/nvidia/nemotron-4-340b-instruct,340,,Dense,"9,000",27:1,███,5.8,81.1,,,,web-scale,Jun/2024,🟢,A,https://d1qx31qr3h6wln.cloudfront.net/publications/Nemotron_4_340B_8T.pdf,███,"Open-source equiv of Mar/2023 GPT-4 (1760MoE≈340B, 13T), same param count but 2x the tokens of May/2023 PaLM 2 (340B, 3.6T), competitor to Nov/2023 Grok-1 (314B, 6T). Trained on 6,144 H100s. ~1.3TB for inference. 50+ natural and 40+ coding languages. Trained between December 2023 and May 2024. MMLU 0-shot for instruct=78.7, 5-shot for base=81.1. Permalink for paper: https://research.nvidia.com/publication/2024-06_nemotron-4-340b",375,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Apple On-Device model Jun/2024,Apple,https://github.com/apple/corenet/tree/main/projects/openelm,3.04,,Dense,"1,500",494:1,███,0.2,26.76,,███,,web-scale,Jun/2024,🟢,A,https://arxiv.org/abs/2404.14619,,"https://lifearchitect.ai/apple/ Likely to be the Apple OpenELM model (Apr/2024). 
""two of these models — a ~3 billion parameter on-device language model, and a larger server-based language model available with Private Cloud Compute"". https://machinelearning.apple.com/research/introducing-apple-foundation-models The server-based model is possibly Ferret, although it is more properly called a multimodal model (not just language). It could also be Apple GPT based on their Ajax framework: https://archive.md/f3C0r",374,███,███,███,███,Other,"8,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MatMul-Free LM,UCSC,https://github.com/ridgerchu/matmulfreellm,2.7,,Dense,100,38:1,███,0.05,,,,,web-scale,Jun/2024,🟢,A,https://arxiv.org/abs/2406.02528,███,"""we explore alternative methods for mixing tokens without relying on matrix multiplications."" Compared with Transformer++ based on Llama-2, not to be confused with the pre-GPT-3 American Express Transformer++ paper from 2/Mar/2020. Instead, Transformer++ is defined in the Mamba paper: 'Transformer++: A Transformer with an improved architecture, namely rotary positional encodings (Su et al. 2021) and SwiGLU MLP (Shazeer 2020)'",373,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Luna,Galileo,https://www.rungalileo.io/blog/introducing-galileo-luna-a-family-of-evaluation-foundation-models,0.44,,Dense,162,███,███,0.03,,,,,web-scale,Jun/2024,🟢,C,https://arxiv.org/abs/2406.00975,,Based on DeBERTA-large (440M). RoBERTa=162B token dataset.,372,███,███,███,███,Proprietary,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen2,Alibaba,https://huggingface.co/spaces/Qwen/Qwen2-72B-Instruct,72,,Dense,"7,000",███,███,2.4,84.2,55.6,37.9,,web-scale,Jun/2024,🟢,A,https://arxiv.org/abs/2407.10671,,Instruct MMLU=82. Instruct GPQA=41.9. 
https://qwenlm.github.io/blog/qwen2/,371,███,███,███,███,Apache 2.0,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen2-57B-A14B,Alibaba,https://github.com/QwenLM/Qwen2?tab=readme-ov-file ,57,14,MoE,"4,500",79:1,███,███,76.5,43,34.3,,web-scale,Jun/2024,🟢,A,https://arxiv.org/abs/2407.10671,,https://qwenlm.github.io/blog/qwen2/,370,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Skywork MoE 16x13B,Kunlun Tech,https://huggingface.co/Skywork/Skywork-MoE-Base,146,███,MoE,"3,200",22:1,███,2.3,77.4,,,,web-scale,Jun/2024,🟢,C,https://github.com/SkyworkAI/Skywork-MoE/blob/main/skywork-moe-tech-report.pdf,,"CN + EN. ""(MoE) model with 146 billion parameters, 16 experts, and 22 billion activated parameters. This model is initialized from the pre-existing dense checkpoints of our Skywork-13B model."" Dataset: Skywork-MoE upcycled from Skywork-13B (3.2T tokens per Skywork tech report) + continued MoE training; HF PDF inaccessible via fetch but base dominates ~3200B.",369,███,███,███,███,Other,"8,192",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mamba-2,CMU,https://github.com/state-spaces/mamba,2.7,,Dense,300,112:1,███,0.09,███,,,,web-scale,May/2024,🟢,A,https://arxiv.org/abs/2405.21060,,Analysis: https://tridao.me/blog/2024/mamba2-part1-model/,368,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MAP-Neo,International,https://map-neo.github.io/,7,,Dense,"4,500",643:1,███,0.6,58.14,,,███,web-scale,May/2024,🟢,A,https://arxiv.org/abs/2405.19327,,"""first fully open-sourced bilingual LLM with comparable performance to existing state-of-the-art LLMs... 
we open-source all details to reproduce our MAP-Neo, where the cleaned pre-training corpus, data cleaning pipeline, checkpoints, and well-optimized training/evaluation framework are provided.""",367,███,███,███,███,███,"8,000",International,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ K2,LLM360,https://huggingface.co/LLM360/K2,65,,Dense,"1,400",22:1,███,1.0,64.8,███,,,web-scale,May/2024,🟢,A,https://www.llm360.ai/blog/several-new-releases-to-further-our-mission.html,,"""K2-65B is a fully reproducible LLM outperforming Llama 2 70B using 35% less compute.""",366,███,███,███,███,███,"8,192",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Codestral,Mistral,https://huggingface.co/mistralai/Codestral-22B-v0.1,22,,Dense,"2,000",91:1,███,0.7,,███,,,web-scale,May/2024,🟢,C,https://mistral.ai/news/codestral/,,Fluent in 80+ programming languages,365,███,███,███,███,Non-commercial research,███,France,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Aya-23-35B,Cohere,https://huggingface.co/spaces/CohereForAI/aya-23,35,,Dense,"4,800",138:1,███,1.4,,,,,web-scale,May/2024,🟢,C,███,,"""Aya 23 serves 23 languages"" with open weights at 8B and 35B sizes, built on a high-performance pretrained base plus the Aya collection; prioritizes depth over breadth vs. 
the prior Aya 101 (101 languages); outperforms Gemma, Mistral, and Mixtral on discriminative and generative multilingual tasks.",364,███,███,███,███,███,"8,192",Canada,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Yi-XLarge,01-ai,https://platform.01.ai/,2000,100,MoE,"20,000",10:1,███,21.1,85.1,,48.2,,web-scale,May/2024,🟢,D,https://www.aixinzhijie.com/article/6845768,███,"Still training as of May/2024: https://appserversrc.8btc.cn/FnDYlEC4STBhphu6M3NL4CKH43FW dead link, use: https://finance.china.com.cn/roll/20240513/6116857.shtml",363,███,███,███,███,███,"32,768",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Yi-Large,01-ai,https://platform.01.ai/,1000,,Dense,"15,000",15:1,███,12.9,83.8,58.1,43.5,,web-scale,May/2024,███,D,https://www.aixinzhijie.com/article/6845768,,API-only frontier model from 01.ai; exact parameter count undisclosed (reported ~1T); built on the Yi transformer architecture trained on English and Chinese data; strong on MMLU and Chatbot Arena. Closed/API-only; see Yi technical report (arXiv:2403.04652) for the open 6B/34B family.,362,███,███,███,███,Proprietary,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Chameleon,Meta AI,https://ai.meta.com/resources/models-and-libraries/chameleon-downloads/?gk_enable=chameleon_web_flow_is_live,34,,Dense,"9,200",271:1,███,1.9,65.8,,,,███,May/2024,🟢,A,https://arxiv.org/abs/2405.09818,,Multimodal,361,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ LearnLM,Google DeepMind,https://learning.google.com/experiments/learn-about/signup,1500,75,MoE,"30,000",20:1,███,22.4,72,███,,,web-scale,May/2024,🟡,D,https://storage.googleapis.com/deepmind-media/LearnLM/LearnLM_paper.pdf,,"Fine-tuned + prompted Gemini (Dec/2023). 
""The results of LearnLM-Tutor reproduce the performance of Gemini Pro, for example an MMLU score of 0.72 and MATH score of 0.33.""",360,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Sparse Llama 7B,Cerebras,https://huggingface.co/spaces/neuralmagic/llama-2-sparse-transfer-chat-deepsparse,7,7,Hybrid,145,21:1,███,0.1,,,███,,web-scale,May/2024,🟢,A,https://arxiv.org/abs/2405.03594,,"https://www.cerebras.net/blog/introducing-sparse-llama-70-smaller-3x-faster-full-accuracy ""For the 50% sparse model, we utilized 45 billion tokens of pretraining data, while an additional 100 billion tokens were used for the 70% model. This represents approximately 2% to 8% of the original 2 trillion tokens used to train the base Llama-2 model.""",359,███,███,███,███,███,"8,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemini 1.5 Flash,Google DeepMind,https://aistudio.google.com/app/prompts/new_chat,8,0.4,MoE,"10,000","1,250:1",███,0.9,78.9,59.1,39.5,,web-scale,May/2024,🟢,D,███,,1M context length. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/,358,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-4o,OpenAI,https://chatgpt.com/,200,10,MoE,"20,000",100:1,███,6.7,88.7,72.6,53.6,███,web-scale,May/2024,🔴,D,https://openai.com/index/gpt-4o-system-card/,SOTA,"gpt-4o-2024-05-13 no longer easily available, so hidden in the Model Table rankings. Omnimodel. ‘[GPT-4o is] likely an early checkpoint of GPT-5’. 
https://twitter.com/drjimfan/status/1790089671365767313 ELO: https://twitter.com/LiamFedus/status/1790064963966370209 Demo: https://youtu.be/DQacCB9tDaw",357,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Falcon 2 11B,TII,https://huggingface.co/tiiuae/falcon-11B,11,,Dense,"5,500",500:1,███,0.8,58.37,,,,web-scale,May/2024,🟢,A,https://www.tii.ae/news/falcon-2-uaes-technology-innovation-institute-releases-new-ai-model-series-outperforming-metas,,Announce: https://www.tii.ae/news/falcon-2-uaes-technology-innovation-institute-releases-new-ai-model-series-outperforming-metas,███,███,███,███,███,Other,"8,192",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Fugaku-LLM,Fujitsu,███,13,,Dense,380,30:1,███,0.2,,,,,web-scale,May/2024,🟢,A,https://www.fujitsu.com/global/about/resources/news/press-releases/2024/0510-01.html,,"Japanese. CPU trained: 158,976+ A64FX CPUs (7M+ cores), zero GPUs. https://en.wikipedia.org/wiki/Fugaku_(supercomputer)",355,███,███,███,███,███,"4,000",Japan,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Yi 1.5 34B,01-ai,https://huggingface.co/01-ai/Yi-1.5-34B-Chat,34.4,,Dense,"3,600",105:1,███,███,76.8,52.3,,,web-scale,May/2024,🟢,A,https://github.com/01-ai/Yi-1.5,,Uses 600B more training tokens than Yi 1.0 (Nov/2023).,354,███,███,███,███,Apache 2.0,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ YOCO,Microsoft,https://github.com/microsoft/unilm/tree/master/YOCO,3,,Dense,"1,600",534:1,███,0.2,,,,,web-scale,May/2024,🟢,A,https://arxiv.org/abs/2405.05254,███,"With Tsinghua. You Only Cache Once (YOCO). 
Long context ""1M context length with near-perfect needle retrieval accuracy""",353,███,███,███,███,███,"1,000,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeepSeek-V2,DeepSeek-AI,https://chat.deepseek.com/,236,21,MoE,"8,100",35:1,███,4.6,78.5,54.8,,,web-scale,May/2024,🟢,███,https://arxiv.org/abs/2405.04434,,"Huge dataset, 12% Chinese ""Therefore, we acknowledge that DeepSeek-V2 still has a slight gap in basic English capabilities with LLaMA3 70B"".",352,███,███,███,███,███,"131,072",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ChuXin,Independent,https://huggingface.co/chuxin-llm/Chuxin-1.6B-Base,1.6,,Dense,"2,300","1,438:1",███,0.2,41.07,,,,web-scale,May/2024,🟢,███,https://arxiv.org/abs/2405.04828,,"""results on the 'Needle In A Haystack' (NIAH) tests indicate that ChuXin-1M performs well across all context window lengths up to 1M.""",351,███,███,███,███,███,"4,096",,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ RWKV-v6 Finch,RWKV,https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-2,7.63,,Dense,"2,500",328:1,███,0.5,,,,,web-scale,May/2024,🟢,A,███,,RWKV (pronounced RwaKuv) is an RNN: https://twitter.com/BlinkDL_AI/status/1787834625211158562,350,███,███,███,███,███,"4,096",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ xLSTM,ELLIS,,2.7,,Dense,15,6:1,███,0.02,,,,,web-scale,May/2024,🔴,B,https://arxiv.org/abs/2405.04517,███,"New method LSTM to xLSTM, see also RNNs. Code/weights don't appear to have been released. https://github.com/AI-Guru/xlstm-resources",349,███,███,███,███,Other,███,Germany,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Granite Code,IBM,https://github.com/ibm-granite/granite-code-models,34,,Dense,"3,500",103:1,███,1.1,50,,,,code,May/2024,🟢,A,███,,"MMLU=50 for 8B model only. 
Dataset: publicly available datasets (e.g., GitHub Code Clean, Starcoder data), public code repositories, and issues from GitHub.",348,███,███,███,███,Apache 2.0,"8,192",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen-Max,Alibaba,https://chat.lmsys.org/,300,,Dense,"6,000",20:1,███,4.5,,███,,,web-scale,May/2024,🟢,D,https://help.aliyun.com/zh/dashscope/developer-reference/model-introduction,,https://twitter.com/JustinLin610/status/1787584325367529509,347,███,███,███,███,Proprietary,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Med-Gemini-L 1.0,Google DeepMind,https://twitter.com/alan_karthi/status/1785117450528264216,200,,Dense,"30,000",150:1,███,8.2,,,███,,web-scale,May/2024,🔴,D,https://arxiv.org/abs/2404.18416,,"Med-Gemini-M 1.0 and Med-Gemini-L 1.0 (Pro and Ultra finetunes) ""For language tasks that require less complex reasoning, such as summarizing medical notes and creating referral letters, we introduce Med-Gemini-M 1.0 by fine-tuning the Gemini 1.0 Pro model. For other tasks that require more advanced reasoning, we introduce Med-Gemini-L 1.0 by fine-tuning the Gemini 1.0 Ultra model using a self-training method to enable the models to efficiently use web search.""",346,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ TinyStories,Microsoft,https://huggingface.co/roneneldan/TinyStories-33M,0.033,,Dense,50,"1,516:1",███,0.004,,,,,web-scale,Apr/2024,🟢,C,https://arxiv.org/abs/2305.07759,███,Precursor to phi.,345,███,███,███,███,MIT,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Tele-FLM,BAAI,https://huggingface.co/CofeAI/Tele-FLM,52,,Dense,"2,000",39:1,███,1.1,64,,,,web-scale,Apr/2024,🟢,███,https://arxiv.org/abs/2404.16645,,"Also known as FLM-2. 
""We will open-source a 1T model checkpoint, namely Tele-FLM-1T, to advance further training and research."" Discussion paper Jul/2024: https://arxiv.org/abs/2407.02783",344,███,███,███,███,Apache 2.0,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen-1.5 110B,Alibaba,https://huggingface.co/spaces/Qwen/Qwen1.5-110B-Chat-demo,111,,Dense,"3,000",28:1,███,1.9,80.4,███,35.9,,web-scale,Apr/2024,🟢,C,https://qwenlm.github.io/blog/qwen1.5-110b/,,"Worse performance on GPQA (72B=36.3, 110B=35.9).",343,███,███,███,███,███,"32,768",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Arctic,Snowflake AI Research,https://arctic.streamlit.app/,480,17,Hybrid,"3,500",8:1,███,4.3,67.3,,,███,web-scale,Apr/2024,🟢,A,https://www.snowflake.com/blog/arctic-open-efficient-foundation-language-models-snowflake/,,"""Arctic uses a unique Dense-MoE Hybrid transformer architecture. It combines a 10B dense transformer model with a residual 128×3.66B MoE MLP resulting in 480B total and 17B active parameters chosen using a top-2 gating.""",342,███,███,███,███,Apache 2.0,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ SenseNova 5.0,SenseTime,,600,30,MoE,███,17:1,███,8.2,84.78,,42.93,,web-scale,Apr/2024,🟢,B,https://news.futunn.com/en/post/41290101/a-large-shangtang-multi-modal-model-with-600-billion-parameters,,GPT-4 scale; low media coverage; no demo in Western world. https://www.techinasia.com/sensetime-pauses-trading-stock-rises-30-model-launch,341,███,███,███,███,Proprietary,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ OpenELM,Apple,https://huggingface.co/apple/OpenELM-3B-Instruct,3.04,,Dense,███,494:1,███,0.2,26.76,,,,web-scale,Apr/2024,🟢,A,https://arxiv.org/abs/2404.14619,,"On-device model (laptop, phone). Open-source Efficient Language Models (OpenELM). 
https://venturebeat.com/ai/apple-releases-openelm-small-open-source-ai-models-designed-to-run-on-device/",340,███,███,███,███,███,"2,048",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ phi-3-medium,Microsoft,███,14,,Dense,"4,800",343:1,███,0.9,78.2,55.7,,,"synthetic, web-scale",Apr/2024,🟢,A,https://arxiv.org/abs/2404.14219,,"Preview only, benchmarks being investigated as of May/2024.",339,███,███,███,███,MIT,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ phi-3-mini,Microsoft,https://huggingface.co/microsoft/Phi-3-mini-128k-instruct,3.8,,Dense,"3,300",869:1,███,0.4,68.8,45.7,,███,"synthetic, web-scale",Apr/2024,🟢,A,https://arxiv.org/abs/2404.14219,,"""phi3-mini can be quantized to 4-bits so that it only occupies ≈ 1.8GB of memory. We tested the quantized model by deploying phi-3-mini on iPhone 14 with A16 Bionic chip running natively on-device and fully offline achieving more than 12 tokens per second.""",338,███,███,███,███,███,"131,072",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Llama 3 70B,Meta AI,https://meta.ai/,███,,Dense,"15,000",215:1,███,3.4,82,52.8,,,web-scale,Apr/2024,🟢,A,https://ai.meta.com/blog/meta-llama-3/,SOTA,Instruct MMLU-Pro=56.2,337,███,███,███,███,Llama 3,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Zamba 7B,Zyphra,https://huggingface.co/Zyphra/Zamba-7B-v1,7,,Dense,"1,050",150:1,███,0.3,57.72,,,,web-scale,Apr/2024,🟢,███,https://arxiv.org/html/2405.16712v1,,Mamba1,336,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ HLAT,Amazon,,7,,Dense,"1,800",258:1,███,0.4,41.318,,,,web-scale,Apr/2024,🔴,B,███,,HLAT=High-quality LLM pre-trained on AWS Trainium. Same arch as Llama 7B. The pre-training is performed up to 64 Amazon EC2 trn1.32xlarge instances with totalling up to 1024 AWS Trainium accelerators. 
Read more about Trainium: https://www.aboutamazon.com/news/aws/what-you-need-to-know-about-the-aws-ai-chips-powering-amazons-partnership-with-anthropic,335,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Idefics2,Hugging Face,https://huggingface.co/HuggingFaceM4/idefics2-8b,8.4,,Dense,███,953:1,███,0.9,,,,,web-scale,Apr/2024,🟢,C,https://huggingface.co/blog/idefics2,,"Clone of Flamingo now using Mistral 7B. Named after Asterix and Obelix's dog Idefix (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS) Dataset: Mistral-7B-v0.1 base (~8T tokens) + SigLIP-SO400M-384 vision (0.9B) + multimodal pretrain (OBELICS, LAION-COCO, PDFA, WebSight) + Cauldron fine-tune; total dominated by Mistral base ~8000B.",334,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Reka Core,Reka AI,https://poe.com/RekaCore,300,,Dense,"10,000",34:1,███,5.8,83.2,,38.2,,web-scale,Apr/2024,███,D,https://publications.reka.ai/reka-core-tech-report.pdf,,https://www.reka.ai/news/reka-core-our-frontier-class-multimodal-language-model,333,███,███,███,███,███,"131,072",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ WizardLM-2-8x22B,Microsoft,███,141,39,MoE,"2,000",15:1,███,1.8,,,,,web-scale,Apr/2024,🟢,C,https://wizardlm.github.io/WizardLM2/,,Base model = mistral-8x22b.,332,███,███,███,███,███,"65,536",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Pile-T5,EleutherAI,https://huggingface.co/EleutherAI/pile-t5-xxl,███,,Dense,"2,000",182:1,███,0.5,53.84,,,,web-scale,Apr/2024,🟢,A,https://blog.eleuther.ai/pile-t5/,,"T5 retrained from scratch on The Pile (800GB) for 2T tokens using the LLaMA tokenizer and umT5 transformer; Pile-T5-XXL achieves 90.08 on SuperGLUE vs. 82.43 for T5-v1.1-XXL; released with 154 intermediate checkpoints every 10K steps. 
Base, Large, XL, and XXL sizes available.",331,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Zephyr 141B-A35B,Hugging Face,https://huggingface.co/HuggingFaceH4/zephyr-orpo-141b-A35b-v0.1,141,35,MoE,"2,000",15:1,███,███,,,,,web-scale,Apr/2024,🟢,C,https://arxiv.org/abs/2403.07691,,mixtral-8x22b finetune using Odds Ratio Preference Optimization (ORPO).,330,███,███,███,███,███,"65,536",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Rerank 3,Cohere,https://docs.cohere.com/reference/rerank-1,104,,Dense,"4,000",39:1,███,2.1,,,,,web-scale,Apr/2024,🟢,C,https://txt.cohere.com/rerank-3/,,"RAG + semantic search, possibly backed by Command-R+.",███,███,███,███,███,Proprietary,███,Canada,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ gpt-4-turbo-2024-04-09,OpenAI,https://chat.openai.com/,70,3.5,MoE,"13,000",186:1,███,3.2,86.5,63.7,49.1,███,web-scale,Apr/2024,🟢,D,https://cdn.openai.com/papers/gpt-4.pdf,SOTA,"This is such a significantly better model that I've added it here. This GPQA=46.5%, old GPT-4 GPQA=36%. https://twitter.com/EpochAIResearch/status/1778463039932584205 MMLU scores are unclear, but may have improved by 1%: https://twitter.com/OpenAI/status/1778602770784002136. Final benchmarks are here: https://archive.md/6Cc0Z",328,███,███,███,███,███,"128,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MiniCPM-2.4B,Tsinghua,███,2.4,,Dense,"1,100",459:1,███,0.2,,,,,web-scale,Apr/2024,🟢,A,https://arxiv.org/abs/2404.06395,,MoE option=https://huggingface.co/openbmb/MiniCPM-MoE-8x2B,327,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Ferret-UI,Apple,https://github.com/apple/ml-ferret,13,,Dense,███,154:1,███,0.5,,,,,"web-scale, special",Apr/2024,🟢,A,https://arxiv.org/abs/2404.05719,,"Vicuna base, multimodal. 
Extension of Ferret from Oct/2023.",326,███,███,███,███,Other,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ mixtral-8x22b,Mistral,███,141,39,MoE,"2,000",15:1,███,1.8,77.75,,,,web-scale,Apr/2024,🟢,C,https://mistral.ai/news/mixtral-8x22b/,,"MoE=22Bx8, seq=65536.",325,███,███,███,███,Apache 2.0,███,France,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Sailor,SAIL,https://huggingface.co/sail,███,,Dense,200,29:1,███,0.1,,,,,web-scale,Apr/2024,🟢,A,https://arxiv.org/abs/2404.03608v1,,"SEA languages. Based on Qwen-1.5. https://github.com/sail-sg/sailor-llm ""Generally Sailor models consume around 200B tokens, completing a full pass through the SailCraft corpus once. However, the Sailor-0.5B model undergoes training with 400B tokens, equivalent to 2 epochs.""",324,███,███,███,███,███,"8,192",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ JetMoE-8B,MIT,https://www.lepton.ai/playground/chat?model=jetmoe-8b-chat,8,0.4,MoE,"1,250",157:1,███,███,49.2,,,,web-scale,Apr/2024,🟢,A,https://huggingface.co/jetmoe/jetmoe-8b,,"8B-total / 2.2B-active MoE with Mixture-of-Attention-heads (MoA) and Mixture-of-MLP-Experts per block; 8 experts, top-2 routing; trained on 1.25T tokens (RefinedWeb, Pile, GitHub) for under $0.1M on 96 H100s; MMLU 49.2, GSM8K 70.2, MT-Bench 6.681 -- outperforming LLaMA-2-7B and -13B-chat.",323,███,███,███,███,███,"32,768",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Eurus,Tsinghua,https://huggingface.co/collections/openbmb/eurus-660bc40bec5376b3adc9d1c5,70,,Dense,"2,000",29:1,███,1.2,,,,███,web-scale,Apr/2024,🟢,A,https://huggingface.co/collections/openbmb/eurus-660bc40bec5376b3adc9d1c5,,Fine-tune of Mistral-7B and CodeLlama-70B.,322,███,███,███,███,███,"4,000",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ 
Command-R+,Cohere,https://huggingface.co/spaces/CohereForAI/c4ai-command-r-plus,104,,Dense,"4,000",39:1,███,2.1,75.7,,,,web-scale,Apr/2024,🟢,C,https://huggingface.co/CohereForAI/c4ai-command-r-plus,,Purpose-built to excel at real-world enterprise use cases. Announce with no arch details: https://txt.cohere.com/command-r-plus-microsoft-azure/,███,███,███,███,███,CC-BY-NC 4.0,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Viking,Silo AI,,33,,Dense,"2,000",61:1,███,0.9,,,,,web-scale,Apr/2024,🟢,B,https://www.silo.ai/blog/viking-7b-13b-33b-sailing-the-nordic-seas-of-multilinguality,,"Viking uses an architecture similar to Llama 2, with flash attention, rotary embeddings, grouped query attention and supports a 4k sequence length",███,███,███,███,███,Apache 2.0,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ OLMo-Bitnet-1B,Nous Research,https://huggingface.co/NousResearch/OLMo-Bitnet-1B,1,,Dense,60,60:1,███,0.03,,,,,web-scale,Apr/2024,🟢,A,███,,1.58-bit quantized (ternary weights) means we can run a 70B model in ~14GB VRAM. 
See also BitNet b1.58,319,███,███,███,███,███,"2,048",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Aurora-M,International,https://huggingface.co/collections/aurora-m/aurora-m-models-65fdfdff62471e09812f5407,15.5,,Dense,███,132:1,███,0.6,,,,,web-scale,Mar/2024,🟢,A,https://arxiv.org/abs/2404.00399,,"15B continual pretraining of StarCoderPlus on 435B additional tokens (total >2T) in English, Finnish, Hindi, Japanese, Vietnamese, and code; first open-source multilingual model fine-tuned on human-reviewed safety instructions aligned with the Biden-Harris AI Executive Order; demonstrates robustness against catastrophic forgetting.",318,███,███,███,███,OpenRAIL-M,███,International,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ReALM-3B,Apple,,3,███,Dense,134,45:1,███,0.07,,,,,web-scale,Mar/2024,🔴,D,https://arxiv.org/abs/2403.20329,,FLAN-T5 (Oct/2022) finetune.,317,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen1.5-MoE-A2.7B,Alibaba,https://qwenlm.github.io/blog/qwen-moe/,14.3,2.7,MoE,"1,500",105:1,███,0.5,62.5,,,,web-scale,Mar/2024,🟢,C,███,,"MoE. ""Of particular significance is the fact that, through upcycling, the necessity for training an equivalent volume of tokens as in the original model has been eliminated."" I assumed half of the original 3T tokens",316,███,███,███,███,Qwen,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Grok-1.5,xAI,https://grok.x.ai/,180,9,MoE,"6,000",34:1,███,3.5,81.3,,,,web-scale,Mar/2024,🟢,D,https://x.ai/blog/grok-1.5,,Context=128k.,███,███,███,███,███,Proprietary,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Jamba 1,AI21,https://huggingface.co/ai21labs/Jamba-v0.1,52,12,MoE,"1,200",24:1,███,0.8,67.4,,,,web-scale,Mar/2024,███,C,https://arxiv.org/abs/2403.19887,,"MoE. Open weights, licensed under Apache 2.0. 
Announce: https://arxiv.org/abs/2403.19887",314,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DBRX,MosaicML,https://huggingface.co/spaces/databricks/dbrx-instruct,132,36,MoE,"12,000",91:1,███,4.2,73.7,███,,,web-scale,Mar/2024,🟢,A,https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm,,"MoE. Trained for $10M on 3,072 NVIDIA H100s connected by 3.2Tbps Infiniband.",313,███,███,███,███,███,"32,768",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Stable Code Instruct 3B,Stability AI,https://huggingface.co/stabilityai/stable-code-instruct-3b,2.7,,Dense,560,208:1,███,0.1,,,███,,"code, The Stack",Mar/2024,🟢,A,https://stability.ai/news/introducing-stable-code-instruct-3b,,"Context window=16,384. Trained on The Stack dataset.",312,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ EvoLLM-JP,Sakana AI,https://huggingface.co/SakanaAI/EvoLLM-JP-v1-10B,10,,Dense,800,80:1,███,0.3,,,███,,web-scale,Mar/2024,🟢,C,https://arxiv.org/abs/2403.13187,,"Japanese. Model merge 'our EvoLLM-JP-A is a merge of shisa-gamma-7b-v1, Arithmo2-Mistral-7B, and Abel7B-002' https://sakana.ai/evolutionary-model-merge/",311,███,███,███,███,Other,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ RakutenAI-7B,Rakuten Group,███,7,,Dense,"3,000",429:1,███,0.5,61.31,,,,web-scale,Mar/2024,🟢,C,https://arxiv.org/abs/2403.15484,,Japanese. 
Mistral 7B derivative.,310,███,███,███,███,Apache 2.0,███,Japan,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Parakeet,Independent,https://colab.research.google.com/drive/1gI8CM9Bz9ov0-E6aL2jF808rE56UtZyF?usp=sharing,0.378,,Dense,3,8:1,███,0.004,,,,,web-scale,Mar/2024,🟢,C,███,,Tiny model (378M) for testing,309,███,███,███,███,███,"4,000",,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ RWKV-v5 EagleX,RWKV,https://huggingface.co/recursal/EagleX_1-7T,7.52,,Dense,███,227:1,███,0.4,40.14,,,,web-scale,Mar/2024,🟢,A,https://substack.recursal.ai/p/eaglex-17t-soaring-past-llama-7b,,RWKV (pronounced RwaKuv) is an RNN: Built on the RWKV-v5 architecture (a linear transformer with 10-100x+ lower inference cost),308,███,███,███,███,███,"4,096",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MM1,Apple,,30,,Dense,███,67:1,███,0.8,,,,,special,Mar/2024,🔴,B,https://arxiv.org/abs/2403.09611,,"VLM, outperforms Flamingo 80B (Apr/2022) across benchmarks. 2T text tokens + ~10B+ other text (estimate). 
Unreleased.",307,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ RFM-1,Covariant,https://vimeo.com/921866765,8,,Dense,160,20:1,███,0.1,,,,,web-scale,Mar/2024,███,C,https://covariant.ai/insights/introducing-rfm-1-giving-robots-human-like-reasoning-capabilities/,,"Commercial, multimodal for robotics",306,███,███,███,███,Proprietary,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Command-R,Cohere,Cohere,35,,Dense,700,20:1,███,0.5,,███,,,web-scale,Mar/2024,🟢,C,https://txt.cohere.com/command-r/,,RAG and tool use,305,███,███,███,███,███,"131,072",Canada,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeepSeek-VL,DeepSeek-AI,https://github.com/deepseek-ai/DeepSeek-VL?tab=readme-ov-file,7,,Dense,"2,000",286:1,███,0.4,,,,,███,Mar/2024,🟢,A,https://arxiv.org/abs/2403.05525,,"Vision, based on DeepSeek-LLM-7B",304,███,███,███,███,DeepSeek,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ AnyGPT,Fudan University,https://junzhan2000.github.io/AnyGPT.github.io/,7,,Dense,"2,000",286:1,███,███,,,,,web-scale,Mar/2024,🟢,A,https://arxiv.org/abs/2402.12226,,Llama 2 7B backbone with new matrices ('reshaping the embedding matrix and prediction layer'),303,███,███,███,███,Llama 2,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Stable Beluga 2.5,Stability AI,,70,,Dense,"2,000",29:1,███,1.2,,,,,web-scale,Mar/2024,🟢,B,https://stability.ai/news/putting-the-ai-supercomputer-to-work,,███,302,███,███,███,███,███,"4,096",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Inflection-2.5,Inflection AI,https://inflection.ai/inflection-2,1200,,Dense,"20,000",17:1,███,16.3,85.5,,38.4,,███,Mar/2024,🟢,D,https://inflection.ai/inflection-2-5,,"Inflection AI's flagship model powering the Pi assistant; achieves >94% of GPT-4's average performance using only 40% of the compute for training; strong STEM, math, and coding 
gains; integrates real-time web search. Parameter count not publicly disclosed; closed/API-only via Pi.",301,███,███,███,███,Proprietary,"8,192",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Apollo,SRIBD/CUHK,https://apollo.llmzoo.com/,7,,███,"2,500",358:1,███,0.4,,,,,web-scale,Mar/2024,🟢,A,https://arxiv.org/abs/2403.03640,,Qwen 1.8B as base. Medical focus.,300,███,███,███,███,Apache 2.0,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Claude 3 Opus,Anthropic,https://claude.ai/,2500,125,MoE,"40,000",███,███,33.3,86.8,68.5,59.5,,web-scale,Mar/2024,🟢,D,https://www.anthropic.com/claude-3-model-card,SOTA,"Original MMLU=86.8 (GPT-4=86.4). MMLU=88.2 with CoT prompting. Original GPQA=50.4. 200k context, 1M for researchers.",299,███,███,███,███,███,"200,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Nemotron-4 15B,NVIDIA,,15,,Dense,"8,000",534:1,███,1.2,64.2,,,,web-scale,Feb/2024,███,B,https://arxiv.org/abs/2402.16819,,"""NVIDIA's 15B dense transformer trained on 8T multilingual text tokens; outperforms all existing similarly-sized open models on 4 out of 7 downstream evaluation areas and achieves best multilingual performance at its scale, surpassing models more than four times larger on multilingual tasks.""",298,███,███,███,███,███,"4,096",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ TowerLLM,Unbabel,https://unbabel.com/meet-towerllm/,7,,Dense,███,146:1,███,0.3,,,,,web-scale,Feb/2024,🟢,A,https://arxiv.org/abs/2402.17733,,"Commercial product, Llama-2 as base.",297,███,███,███,███,███,"8,000",Portugal,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Hawk,Google DeepMind,,7,,Dense,300,43:1,███,0.2,35,,,,web-scale,Feb/2024,🟢,███,https://arxiv.org/abs/2402.19427,,MMLU=35. 
RNN.,296,███,███,███,███,Proprietary,"8,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Griffin,Google DeepMind,,14,,Dense,300,22:1,███,0.2,49.5,,,███,web-scale,Feb/2024,🟢,B,https://arxiv.org/abs/2402.19427,,MMLU=49.5. RNN.,295,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ BitNet b1.58,Microsoft,https://huggingface.co/1bitLLM/bitnet_b1_58-xl,70,,Dense,"2,000",29:1,███,1.2,,,,,web-scale,Feb/2024,🟢,A,https://arxiv.org/abs/2402.17764,███,"""Every single parameter of the LLM is ternary {-1, 0, 1},"" matching full-precision FP16/BF16 Transformer performance at equivalent model size and token count while delivering superior latency, memory, throughput, and energy; defines a new scaling law for 1-bit LLMs and opens the door to dedicated 1-bit hardware accelerators.",294,███,███,███,███,MIT,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Samba-1,SambaNova,https://trysambanova.ai/,1400,1400,CoE,"20,000",15:1,███,███,,,,,web-scale,Feb/2024,🟡,C,https://sambanova.ai/press/secure-one-trillion-parameter-generative-ai-model-for-the-enterprise,,CoE: Collection of experts: Llama2 7B / 13B / 70B Mistral 7B DeepSeek Coder 1.3B / 6.7B / 33B Falcon 40B DePlot CLIP Llava,293,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Aya-101,Cohere,https://huggingface.co/CohereForAI/aya-101,13,███,Dense,"1,000",77:1,███,0.4,,,,,web-scale,Feb/2024,🟢,A,https://cohere.com/research/aya/aya-model-paper.pdf,,mT5 base.,292,███,███,███,███,███,"4,096",Canada,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Cosmo-1B,Hugging Face,https://huggingface.co/HuggingFaceTB/cosmo-1b,1.8,,Dense,180,100:1,███,0.06,,,,,synthetic,Feb/2024,███,A,https://huggingface.co/blog/cosmopedia,,Synthetic data (25B tokens of synthetic data for 6 epochs + code). 
MMLU=32.4,291,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Poro,Silo AI,https://huggingface.co/LumiOpen/Poro-34B,34.2,,Dense,"1,000",30:1,███,0.6,,,,███,web-scale,Feb/2024,🟢,A,https://arxiv.org/abs/2404.01856v1,,"Uses a BLOOM architecture with ALiBi embeddings to allow for context window extrapolation. While model architecture for the initial model has been kept simple, future models under progress will support additional capabilities, such as flash attention, rotary embeddings and grouped query attention.'",290,███,███,███,███,Apache 2.0,███,Finland,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ StarCoder 2,ServiceNow,,15,,Dense,███,287:1,███,0.8,,,,,"code, The Stack",Feb/2024,🟢,B,https://arxiv.org/abs/2402.19173,,"The Stack v2=900B tokens, 5 epochs to 4.3T tokens",289,███,███,███,███,OpenRAIL-M,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ 530B,ByteDance,,530,,Dense,300,███,███,1.3,,,,,web-scale,Feb/2024,🔴,D,https://arxiv.org/abs/2402.15627,,"Trained using 12,288 A100 GPUs, replicating MT-NLG size",288,███,███,███,███,Proprietary,"2,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ 175B,ByteDance,,175,,Dense,300,2:1,███,0.8,,,,,web-scale,Feb/2024,🔴,███,https://arxiv.org/abs/2402.15627,,"Trained using 12,288 A100 GPUs, replicating GPT-3 size",287,███,███,███,███,Proprietary,"2,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mistral Small,Mistral,https://chat.mistral.ai/chat,7,,Dense,"3,000",429:1,███,0.5,72.2,,,,███,Feb/2024,🟢,D,https://mistral.ai/news/mistral-large/,,Optimised for latency and cost.,286,███,███,███,███,Proprietary,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mistral 
Large,Mistral,https://poe.com/Mistral-Large,300,,Dense,"8,000",27:1,███,5.2,81.2,,,,web-scale,Feb/2024,🟢,D,https://mistral.ai/news/mistral-large/,SOTA,███,285,███,███,███,███,Proprietary,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Hanooman,Reliance,,40,,Dense,800,20:1,███,0.6,,,,,███,Feb/2024,🟢,D,https://www.hanooman.ai/,,"11 Indian languages like Hindi, Tamil, and Marathi Dataset: Estimate: Reliance Hanooman 40B, 11 Indian languages, no public training disclosure; assume undertrained ~20x = 800B (Indic-LLM era norm 2024).",284,███,███,███,███,███,"4,000",India,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Ask,Apple,,███,,Dense,400,20:1,███,0.3,,,,,web-scale,Feb/2024,🔴,F,https://www.macrumors.com/2024/02/22/applecare-advisors-testing-new-ask-tool/,,"Internal employee model only Dataset: Estimate: Apple internal employee tool (AppleCare), no public disclosure; Chinchilla-optimal 20x for 20B = 400B.",283,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Reka Edge,Reka AI,https://chat.reka.ai/,7,,Dense,"4,500",643:1,███,0.6,63.1,,,,web-scale,Feb/2024,🟢,███,https://publications.reka.ai/reka-core-tech-report.pdf,,"Reka's 7B multimodal model (part of the Edge/Flash/Core family) handling text, images, video, and audio inputs; designed for on-device/edge deployment with low latency; competitive with larger models on multimodal understanding benchmarks. 
Closed weights; available via Reka API.",282,███,███,███,███,Proprietary,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Reka Flash,Reka AI,███,21,,Dense,"5,000",239:1,███,1.1,73.5,,34,,web-scale,Feb/2024,🟢,A,https://publications.reka.ai/reka-core-tech-report.pdf,,My testing shows very poor performance equivalent to that of a tiny model,281,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemma,Google DeepMind,https://labs.pplx.ai/,7,,Dense,"6,000",858:1,███,0.7,64.3,33.7,███,,web-scale,Feb/2024,🟢,C,https://storage.googleapis.com/deepmind-media/gemma/gemma-report.pdf,,"MMLU=64.3 (Llama 2 70B=68.9, ChatGPT 20B=70). Text only. Probably dense. Largest trained dataset (6T) besides frontier models.",280,███,███,███,███,Gemma,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemini 1.5 Pro,Google DeepMind,https://aistudio.google.com/app/prompts/new_chat,200,10,MoE,"30,000",150:1,███,8.2,85.9,69,███,,web-scale,Feb/2024,🟢,D,https://goo.gle/GeminiV1-5,SOTA,Sparse MoE. Context window=1M and 10M for research. Note: Gemini outputs are watermarked. I do not use GDM models. https://lifearchitect.ai/watermarking/,279,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen-1.5 72B,Alibaba,https://huggingface.co/spaces/Qwen/Qwen1.5-72B-Chat,72,,Dense,"3,000",42:1,███,1.5,77.5,52.6,36.3,███,web-scale,Feb/2024,🟢,A,https://qwenlm.github.io/blog/qwen1.5/,,"Qwen 1.5 series largest dense model at 72B; 32,768-token context across all sizes; MMLU 77.5, GSM8K 79.5, HumanEval 41.5; chat variant scores 8.61 on MT-Bench and 27.18% win rate on AlpacaEval 2.0, surpassing Claude-2.1 and GPT-3.5-Turbo. 
Aligned with DPO and PPO; compatible with transformers 4.37+.",278,███,███,███,███,Qwen,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MobileLLM,Meta AI,https://huggingface.co/collections/facebook/mobilellm-6722be18cb86c20ebe113e95,1,,Dense,"1,000","1,000:1",███,0.1,,,███,,web-scale,Feb/2024,🟢,A,https://arxiv.org/abs/2402.14905,,Optimizing Sub-billion Parameter Language Models for On-Device Use Cases,277,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GOODY-2,BRAIN,https://www.goody2.ai/chat,70,,Dense,"2,000",29:1,███,1.2,,,,,web-scale,Feb/2024,🟢,D,https://www.goody2.ai/goody2-modelcard.pdf,███,"Satire (and hilarious). Probably Llama 2 with aggressive prompt. Wired interview: https://archive.md/toxHq Dataset: Satire model per modelcard (""Probably Llama 2 with aggressive prompt""); Llama-2 base ~2000B tokens, no actual additional training. Params: Likely Llama-2 (most likely 70B variant).",276,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Natural-SQL-7B,ChatDB,,7,,Dense,"2,000",286:1,███,0.4,███,,,,code,Feb/2024,🟢,B,https://huggingface.co/chatdb/natural-sql-7b,,Based on DeepSeek-Coder 6.7B.,275,███,███,███,███,Other,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Sea-Lion,AI Singapore,https://aisingapore.org/aiproducts/sea-lion/,7.5,,Dense,980,131:1,███,0.3,,,,,███,Feb/2024,🟢,A,https://huggingface.co/aisingapore/sealion7b,,"MPT base. MMLU=26.87. Southeast Asian languages like Thai, Vietnamese and Bahasa Indonesia. 
https://www.computerweekly.com/feature/Sea-Lion-explained-Southeast-Asias-first-large-language-model",274,███,███,███,███,MIT,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ TimesFM,Google,https://huggingface.co/collections/google/timesfm-release-66e4be5fdb56e960c1e482a6,0.2,,Dense,100,500:1,███,0.01,,,,███,special,Feb/2024,🟢,A,https://blog.research.google/2024/02/a-decoder-only-foundation-model-for.html,,Time-series forecasting only. 'a large pretraining corpus of 100B real world time-points' may be more than 100B tokens.,273,███,███,███,███,███,512,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ OLMo,Allen AI,https://huggingface.co/allenai/OLMo-7B,7,,Dense,"2,500",358:1,███,0.4,,,,,web-scale,Feb/2024,🟢,A,https://allenai.org/olmo/olmo-paper.pdf,,███,272,███,███,███,███,Apache 2.0,"2,048",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Audio Flamingo,NVIDIA,https://huggingface.co/spaces/nvidia/audio-flamingo-demo,1,,Dense,███,20:1,███,0.01,,,,,special,Feb/2024,🟡,C,https://arxiv.org/abs/2402.01831,,Project page: https://audioflamingo.github.io/,271,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ FLOR-6.3B,Cerebras,https://huggingface.co/projecte-aina/FLOR-6.3B,6.3,,███,481,77:1,███,0.2,,,,,web-scale,Jan/2024,🟢,A,https://www.cerebras.net/press-release/cerebras-systems-and-barcelona-supercomputing-center-train-industry-leading-multilingual-spanish-catalan-english-llm,,"Spanish, Catalan. Bloom-7.1B (341B tok) + continued pre-training on 140B tok. Trained on Cerebras hardware.",270,███,███,███,███,Apache 2.0,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Weaver,AIWaves.cn,https://www.wawawriter.com/,34,,Dense,"2,018",60:1,███,0.9,███,,,,books,Jan/2024,🟢,C,https://arxiv.org/abs/2401.17268,,Llama? 'All Weaver models are initialized from powerful open-source LLMs.' 
English waitlist: https://www.wawawriter.com/en/,269,███,███,███,███,Apache 2.0,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ miqu 70b,Mistral,https://huggingface.co/miqudev/miqu-1-70b,70,,Dense,"3,000",43:1,███,1.5,,,,,web-scale,Jan/2024,🟢,C,███,,"Leaked, proper version soon: https://venturebeat.com/ai/mistral-ceo-confirms-leak-of-new-open-source-ai-model-nearing-gpt-4-performance/",268,███,███,███,███,Other,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ iFlytekSpark-13B,iFlyTek,https://gitee.com/iflytekopensource/iFlytekSpark-13B,13,,Dense,"3,000",231:1,███,0.7,63.02,,,,web-scale,Jan/2024,🟢,A,https://www.ithome.com/0/748/030.htm,███,"pre-trained on a massive high-quality data set with a total of more than 3 trillion tokens, and then fine-tuned on fine-tuned diversified alignment data.'",267,███,███,███,███,Proprietary,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Xinghuo 3.5 (Spark),iFlyTek,,200,,Dense,"4,000",20:1,███,3.0,,███,,,web-scale,Jan/2024,🟢,F,https://www.laitimes.com/en/article/6f50u_6vhbm.html,,GPT-4 competitor. https://www.shine.cn/biz/tech/2401304331/,266,███,███,███,███,███,"8,000",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MGIE,Apple,https://github.com/tsujuifu/pytorch_mgie,7,███,Dense,"2,000",286:1,███,0.4,,,,,web-scale,Jan/2024,🟢,A,https://openreview.net/forum?id=S1RKWSyZ2Y,Diffusion,MLLM and diffusion model initialized from LLaVA-7B (Llama 2 + Vicuna) + StableDiffusion-v1.5.,265,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ CodeLlama-70B,Meta AI,https://huggingface.co/codellama/CodeLlama-70b-hf,███,,Dense,"2,000",29:1,███,1.2,,,,,web-scale,Jan/2024,🟢,A,https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/,,Paper link is to 34B from Aug/2023. 
This 70B model finished training Jan/2024.,264,███,███,███,███,Llama 2,"16,384",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ RWKV-v5 Eagle 7B,RWKV,https://huggingface.co/spaces/BlinkDL/RWKV-Gradio-2,7.52,,RNN,"1,100",147:1,███,0.3,33.21,,,,web-scale,Jan/2024,🟢,A,https://blog.rwkv.com/p/eagle-7b-soaring-past-transformers,,"RWKV (pronounced RwaKuv) is an RNN: Built on the RWKV-v5 architecture (a linear transformer with 10-100x+ lower inference cost), Trained on 1.1 Trillion Tokens across 100+ languages. Original paper: https://arxiv.org/abs/2305.13048",███,███,███,███,███,███,"4,096",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MaLA-500,LMU,https://huggingface.co/MaLA-LM/mala-500,10,,Dense,"2,000",███,███,0.5,,,,,web-scale,Jan/2024,🟢,A,https://arxiv.org/abs/2401.13303,,Extends Llama 2 7B to 10B using 534 languages.,262,███,███,███,███,Llama 2,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MambaByte,Cornell,https://github.com/kyegomez/MambaByte,0.972,,SSM,38,39:1,███,0.02,███,,,,"books, code",Jan/2024,🔴,A,https://arxiv.org/abs/2401.13660,,"Used bytes instead of tokens. 4 bytes≈1 token, so 150B bytes≈37.5B tokens",261,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeepSeek-Coder,DeepSeek-AI,https://coder.deepseek.com/,33,,Dense,"2,000",61:1,███,0.9,,███,,,web-scale,Jan/2024,🟢,A,https://arxiv.org/abs/2401.14196,,surpasses existing closed-source models like Codex and GPT-3.5... 
permissive license that allows for both research and unrestricted commercial use.',260,███,███,███,███,███,"16,384",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ FuseLLM,Tencent,https://github.com/fanqiwan/FuseLLM,7,,Dense,"2,000",286:1,███,0.4,,,,███,web-scale,Jan/2024,🟢,A,https://arxiv.org/abs/2401.10491,,"Fusion of Llama-2-7B (2T tok), OpenLLaMA-7B (2T tok), and MPT-7B (1T tok).",259,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Fuyu-Heavy,Adept,,120,,Dense,"5,000",42:1,███,2.6,,,███,,web-scale,Jan/2024,🟡,F,https://www.adept.ai/blog/adept-fuyu-heavy,,"Fuyu-Heavy is the world’s third-most-capable multimodal model, behind only GPT4-V and Gemini Ultra, which are 10-20 times bigger.' Token estimate is based on Adept Persimmon-8B using many more tokens.",258,███,███,███,███,Proprietary,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Orion-14B,OrionStar,https://github.com/OrionStarAI/Orion,14,,Dense,"2,500",███,███,0.6,69.6,,,,web-scale,Jan/2024,🟢,A,https://arxiv.org/abs/2401.12246,,"English, Chinese, Japanese, Korean, and other languages.",257,███,███,███,███,Other,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ InternLM2,Shanghai AI Laboratory/SenseTime,https://github.com/InternLM/InternLM,20,,Dense,"2,600",130:1,███,0.8,███,,,,web-scale,Jan/2024,🟢,A,https://arxiv.org/abs/2403.17297,,"""InternLM2 is pre-trained on diverse data types including text, code, and long-context data,"" with context scaling from 4K to 32K during training and 200K needle-in-a-haystack evaluation; aligned with COOL RLHF (Conditional Online RLHF); evaluated across 6 dimensions and 30 benchmarks. 
Released in 1.8B, 7B, and 20B sizes.",256,███,███,███,███,███,"200,000",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GLM-4,Zhipu AI (Tsinghua),https://open.bigmodel.cn/,200,███,Dense,"4,000",20:1,███,3.0,81.5,,,,web-scale,Jan/2024,🟢,D,https://pandaily.com/zhipu-ai-unveils-glm-4-model-with-advanced-performance-paralleling-gpt-4/,,Best Chinese model to date based on analysis. Follows OpenAI roadmap. MMLU=81.5. 'hundreds of billions of parameters' https://www.chatglm.cn/,255,███,███,███,███,Other,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeepSeekMoE,DeepSeek-AI,https://huggingface.co/deepseek-ai/deepseek-moe-16b-base,16,2.7,MoE,"2,000",125:1,███,0.6,,,,,███,Jan/2024,🟢,A,https://arxiv.org/abs/2401.06066,,"MoE activated parameters is 10-15% of dense, so I need to rethink ALScore for MoE. 'preliminary efforts to scale up DeepSeekMoE to 145B'",254,███,███,███,███,DeepSeek,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeepSeek,DeepSeek-AI,https://chat.deepseek.com/,67,,Dense,"2,000",30:1,███,1.2,71.3,,,,web-scale,Jan/2024,🟢,A,https://arxiv.org/abs/2401.02954,,Chinese/English. Outperforms Llama 2. MMLU=71.3 outperforms GPT-3.5.,███,███,███,███,███,DeepSeek,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ LLaMA Pro,Tencent,https://huggingface.co/TencentARC/LLaMA-Pro-8B,8.3,,Dense,"2,080",251:1,███,███,,,,,web-scale,Jan/2024,🟢,A,https://arxiv.org/abs/2401.02415,,We pre-train LLAMA PRO’s expanded blocks on 80B tokens using open-source code and math data for 2830 GPU Hours (16 NVIDIA H800 GPUs for about 7 days).,252,███,███,███,███,███,"4,096",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Palmyra X,Writer,,72,,Dense,"1,200",17:1,███,███,70.2,,,,special,Jan/2024,🟢,D,https://writer.com/blog/palmyra-helm-benchmark/,,"Palmyra X V2, Palmyra X V3, Palmyra X V4. 
https://venturebeat.com/ai/why-writers-palmyra-llm-is-the-little-ai-model-that-could-for-enterprises/",251,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ TinyLlama,SUTD/Independent,https://github.com/jzhang38/TinyLlama,1.1,,Dense,"3,000","2,728:1",███,0.2,,,,,web-scale,Jan/2024,🟢,A,https://arxiv.org/abs/2401.02385,███,"Overtrained' using 2,727 tokens per parameter. Dataset was 1T: 3 epochs to 3T seen. Singapore",250,███,███,███,███,Apache 2.0,███,Singapore,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DocLLM,JPMorgan,,7,,Dense,"2,000",286:1,███,0.4,███,,,,web-scale,Jan/2024,🔴,B,https://arxiv.org/abs/2401.00908,,Document spatial layout structure.,249,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MACE-MP-0,Cambridge,https://huggingface.co/mace-foundations/mace-mp-0,0.00469,0.00469,MACE,0,1:1,███,0.000,,,,,materials science,Dec/2023,🟢,███,https://arxiv.org/abs/2401.00096,,"""Uses 4-body equivariant messages; covers 89 elements; supports fine-tuning for ab initio accuracy with minimal data.""",248,███,███,███,███,MIT,"1,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Unified-IO 2,Allen AI,https://unified-io-2.allenai.org/,7,,Dense,"1,000",143:1,███,0.3,,,,,web-scale,Dec/2023,███,A,https://arxiv.org/abs/2312.17172,,"600TB dataset (plus 120+ fine-tuning datasets) includes '1B imagetext pairs, 1T text tokens, 180M video clips, 130M interleaved image & text, 3M 3D assets, and 1M agent trajectories.'",247,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ WaveCoder-DS-6.7B,Microsoft,,6.7,,Dense,"2,000",299:1,███,0.4,,,,,web-scale,Dec/2023,🔴,███,https://arxiv.org/abs/2312.14187,,"To obtain WaveCoder models, We choose StarCoder-15B, CodeLLaMa (7B and 13B), DeepseekCoder-6.7B as the base model and fine-tune all the base model for 3 epochs Dataset: DeepseekCoder 6.7B 
base (~2T tokens per Deepseek paper) + 20k CodeOcean instructions x 3 epochs fine-tune (negligible) = ~2000B.",246,███,███,███,███,MIT,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ YunShan,Huawei,,7,,Dense,"1,748",250:1,███,0.4,,,,,web-scale,Dec/2023,🔴,B,███,,Finance + law fine-tune of PanGu-π,245,███,███,███,███,Proprietary,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ PanGu-Pi,Huawei,,7,,Dense,"1,600",229:1,███,0.4,,,,,web-scale,Dec/2023,███,B,https://arxiv.org/abs/2312.17276,,"Dense, named PanGu-π",244,███,███,███,███,Proprietary,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ YAYI 2,Wenge,https://huggingface.co/wenge-research/yayi2-30b,30,,Dense,"2,650",89:1,███,0.9,80.5,,,,web-scale,Dec/2023,🟢,A,███,,Dataset=240TB filtered to 10.6TB for 2.65T tokens,243,███,███,███,███,Other,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Emu2,BAAI,https://baaivision.github.io/emu2/,37,███,Dense,4,1:1,███,0.04,,,,,web-scale,Dec/2023,🟢,A,https://arxiv.org/abs/2312.13286,,"VLM. Gemini clone. Outperforms Flamingo 80B. The Pile for text, but only sampled 3.6B tokens (1.4% of the dataset).",242,███,███,███,███,Apache 2.0,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MedLM,Google DeepMind,https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/medlm,340,,Dense,"3,600",11:1,███,3.7,,███,,,web-scale,Dec/2023,🟡,D,https://cloud.google.com/static/vertex-ai/docs/generative-ai/medlm/MedLM-model-card.pdf,,Available to 'white-listed' orgs only. 
Dataset: Estimate: MedLM is PaLM 2-based medical model (Med-PaLM 2 lineage); PaLM 2 reportedly trained on ~3.6T tokens; medical fine-tune small.,241,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ SOLAR-10.7B,Upstage AI,https://huggingface.co/upstage/SOLAR-10.7B-v1.0,███,,Dense,"8,000",748:1,███,1.0,,,,,web-scale,Dec/2023,🟢,C,https://arxiv.org/abs/2312.15166,,"South Korean. Llama-2 arch. SOTA for its size (Dec/2023). Dataset: Mistral 7B init (~8T tokens) + depth up-scaling to 10.7B + continued pretraining (Alpaca-GPT4, OpenOrca, math); total dominated by Mistral base ~8000B.",240,███,███,███,███,Apache 2.0,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeciLM-7B,Deci,https://console.deci.ai/infery-llm-demo,7.04,,Dense,200,29:1,███,0.1,,,███,,web-scale,Dec/2023,🟢,C,https://deci.ai/blog/introducing-DeciLM-7b-the-fastest-and-most-accurate-7b-large-language-model-to-date,,4.4x faster than Mistral. English only.,239,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mistral-medium,Mistral,https://poe.com/,180,,Dense,"3,500",20:1,███,2.6,███,,,,web-scale,Dec/2023,🟢,D,https://mistral.ai/news/la-plateforme/,,"MMLU=75.3% (GPT-3.5-turbo 20B=70%, Llama 2 70B=68.9%)",238,███,███,███,███,Proprietary,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ mixtral-8x7b-32kseqlen,Mistral,https://www.together.ai/blog/mixtral,46.7,███,MoE,"8,000",172:1,███,2.0,70.6,43.3,,,web-scale,Dec/2023,🟢,C,https://arxiv.org/abs/2401.04088,,"MoE=7Bx8, aka mistral-small. 'Concretely, Mixtral has 45B total parameters but only uses 12B parameters per token. 
It, therefore, processes input and generates output at the same speed and for the same cost as a 12B model.'",237,███,███,███,███,███,"32,768",France,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ StripedHyena 7B,Together,https://api.together.xyz/playground/language/togethercomputer/StripedHyena-Hessian-7B,7.65,,Hybrid,"2,000",262:1,███,0.4,,,,,███,Dec/2023,🟢,C,https://www.together.ai/blog/stripedhyena-7b,,"RedPajama (C4), new arch beyond just Transformers Dataset: Trained from scratch on RedPajama mix + long-context augmentation per Together blog; no token total; estimate ~2T (LLaMA-1/RedPajama era norm for 7B).",236,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ NexusRaven-V2 13B,Nexusflow.ai ,https://huggingface.co/spaces/Nexusflow/NexusRaven-V2-Demo,13,,Dense,"2,500",193:1,███,0.6,,,,,code,Dec/2023,🟢,C,https://github.com/nexusflowai/NexusRaven-V2/tree/master,,Based on CodeLlama. 'surpasses GPT-4 by up to 7% in function calling success rates in human-generated use cases involving nested and composite functions.' Dataset: CodeLlama-13B base (LLaMA-2 13B 2000B + 500B CodeLlama code continued = 2500B) + ~5k function-calling fine-tune (negligible); total ~2500B.,███,███,███,███,███,Other,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gemini Ultra 1.0,Google DeepMind,███,1500,,Dense,"30,000",20:1,███,22.4,83.7,,35.7,,web-scale,Dec/2023,🟢,D,https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf,SOTA,"Original MMLU=83.7. MMLU=90.04 with prompting. Chinchilla (20:1), dense, maybe 600B-2000T. Note: Gemini outputs are watermarked. I do not use GDM models. 
https://lifearchitect.ai/watermarking/",234,███,███,███,███,Proprietary,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mamba,CMU,https://huggingface.co/havenhq/mamba-chat,2.8,,SSM,300,108:1,███,0.10,26.2,,,,web-scale,Dec/2023,🟢,A,https://arxiv.org/abs/2312.00752,,"The Pile, new arch beyond just Transformers. 2.7B MMLU=26.2. 7B MMLU=33.3.",███,███,███,███,███,███,"2,048",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ LVM-3B,Berkeley/JHU,,3,,Dense,420,140:1,███,0.1,███,,,,special,Dec/2023,🔴,B,https://arxiv.org/abs/2312.00785,,Paper is 25MB. First Large Vision Model (LVM); no text. Based on Llama and LAION 5B (1.49B).,232,███,███,███,███,MIT,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ SeaLLM-13b,Alibaba,https://github.com/damo-nlp-sg/seallms,13,,Dense,"2,000",154:1,███,███,,,,,web-scale,Dec/2023,🟢,A,https://arxiv.org/abs/2312.00738,,"Llama 2 for Southeast Asian (SEA) languages: Vietnamese 🇻🇳, Indonesian 🇮🇩, Thai 🇹🇭, Malay 🇲🇾, Khmer🇰🇭, Lao🇱🇦, Tagalog🇵🇭 and Burmese🇲🇲",231,███,███,███,███,Other,"8,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ pplx-70b-online,Perplexity,https://labs.perplexity.ai/,70,,Dense,"2,000",29:1,███,1.2,,,███,,web-scale,Nov/2023,🟢,A,https://blog.perplexity.ai/blog/introducing-pplx-online-llms,,Web access. Higher 'freshness' and 'truth' scores.,230,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ SeamlessM4T-Large v2,Meta AI,https://seamless.metademolab.com/expressive/,2.3,,Dense,500,218:1,███,0.1,,,,,special,Nov/2023,🟢,C,███,,"Based on NLLB and older models. 
https://github.com/facebookresearch/seamless_communication Dataset: Paper: SeamlessAlign 114,800 hours + NLLB-200 translation backbone; no precise token total; estimate ~500B tokens-equiv for 2.3B speech-text model.",229,███,███,███,███,CC-BY-NC 4.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Q-Transformer,Google DeepMind,https://qtransformer.github.io/,0.06,,Dense,1,9:1,███,0.001,,,,,robotics,Nov/2023,🔴,D,https://qtransformer.github.io/assets/qtransformer.pdf,,███,228,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Yuan 2.0,IEIT,https://github.com/IEIT-Yuan/Yuan-2.0/blob/main/README-EN.md,102.6,,Dense,288,3:1,███,0.6,,,,,web-scale,Nov/2023,🟢,███,https://arxiv.org/abs/2311.15786,,"Chinese + EN dataset include The Pile: DM, arxiv, wikipedia, book3, stack exchange, Freelaw and medical",227,███,███,███,███,Other,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MEDITRON,EPFL,https://huggingface.co/epfl-llm/meditron-70b,70,,Dense,"2,000",29:1,███,1.2,,,,███,web-scale,Nov/2023,🟢,A,https://arxiv.org/abs/2311.16079,,"Llama 2 trained on med data using NVIDIA Megatron-LM. ""outperforms Llama-2-70B, GPT-3.5 (text-davinci-003, 8-shot), and Flan-PaLM on multiple medical reasoning tasks.""",226,███,███,███,███,Llama 2,███,Switzerland,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Transformers-Arithmetic,Microsoft,,0.1,,Dense,0,3:1,███,███,,,,,special,Nov/2023,🔴,B,https://arxiv.org/abs/2311.14737,,Proving maths is not memorized. Uses GPT-2-style model. 
Sébastien Bubeck,225,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Starling-7B,Berkeley,https://huggingface.co/berkeley-nest/Starling-LM-7B-alpha,7,,Dense,"2,000",286:1,███,0.4,,37.9,,,web-scale,Nov/2023,🟢,███,https://starling.cs.berkeley.edu/,,Llama 2 7B -> OpenChat 7B -> Starling-7B (RLAIF),224,███,███,███,███,███,"8,192",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Inflection-2,Inflection AI,https://inflection.ai/inflection-2,1200,███,Dense,"20,000",17:1,███,16.3,,,,,web-scale,Nov/2023,🟢,D,https://inflection.ai/inflection-2,,"“now the 2nd best LLM in the world”. Finished training 19/Nov/2023, waiting for fine-tuning and release.",223,███,███,███,███,███,"8,192",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Claude 2.1,Anthropic,https://claude.ai/,130,,Dense,"2,500",20:1,███,1.9,78.5,,███,,web-scale,Nov/2023,🟢,D,https://www.anthropic.com/index/claude-2-1,,"Fewer hallucinations, 200k context length, tool use",222,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ TÜLU 2,Allen AI,https://huggingface.co/allenai/tulu-2-dpo-70b,70,,Dense,"2,000",29:1,███,1.2,,,,,███,Nov/2023,🟢,A,https://arxiv.org/abs/2311.10702,,Llama 2 finetune with RLHF direct preference optimization (DPO).,221,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Nemotron-3 22B,NVIDIA,https://huggingface.co/nvidia/nemotron-3-8b-base-4k,22,,Dense,"3,800",173:1,███,1.0,54.4,,,,web-scale,Nov/2023,🟢,A,https://developer.nvidia.com/blog/nvidia-ai-foundation-models-build-custom-enterprise-chatbots-and-co-pilots-with-production-ready-llms/,,███,220,███,███,███,███,███,"4,096",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Nemotron-2 43B,NVIDIA,███,43,,Dense,"3,800",89:1,███,1.3,,,,,web-scale,Nov/2023,🔴,D,https://arxiv.org/abs/2311.09528,,Used to train HelpSteer (16/Nov/2023): 
https://arxiv.org/abs/2311.09528,219,███,███,███,███,Other,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Orca 2,Microsoft,,13,,Dense,"2,001",154:1,███,0.5,,███,,,web-scale,Nov/2023,🟡,B,https://arxiv.org/abs/2311.11045,,"Llama 2 13B (2T) -> Orca 2 (GPT-4 finetune). Still an imitation model, overhyped: The False Promise of Imitating Proprietary LLMs https://arxiv.org/abs/2305.15717",218,███,███,███,███,███,"4,096",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Phi-2,Microsoft,https://replicate.com/lucataco/phi-2,2.7,,Dense,"1,400",519:1,███,0.2,,███,,,"synthetic, web-scale",Nov/2023,🟢,A,https://huggingface.co/microsoft/phi-2,,https://twitter.com/SebastienBubeck/status/1724854157004190095,217,███,███,███,███,MIT,"2,048",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Florence-2,Microsoft,https://huggingface.co/microsoft/Florence-2-large,0.771,,Dense,100,130:1,███,0.03,,,,,web-scale,Nov/2023,🟢,C,https://arxiv.org/abs/2311.06242,███,"VLM, Flamingo alt Dataset: Paper: trained until ""3 billion effective training samples"" on FLD-5B (126M images, 5.4B annotations); estimate ~100B tokens equivalent.",216,███,███,███,███,MIT,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mirasol3B,Google DeepMind,,3,,Dense,50,17:1,███,0.04,,,,,web-scale,Nov/2023,🔴,D,███,,Combiner + autoregressive transformer for video/audio/text Dataset: Paper: ~12% of VTP video-text pairs used + AudioSet-2M audio pretraining; no token total; estimate ~50B for 3B model on video tokens.,215,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ OtterHD-8B,NTU,https://github.com/Luodian/Otter,8,,Dense,737,93:1,███,0.3,,,,,web-scale,Nov/2023,🟢,A,https://arxiv.org/abs/2311.04219,,Evolution of Persimmon-9.3B and Fuyu 8B,███,███,███,███,███,MIT,███,Singapore,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ 
Gauss,Samsung,https://koreajoongangdaily.joins.com/news/2023-11-08/business/tech/Samsung-unveils-generative-AI-model-Gauss/1908889,7,,Dense,"2,000",286:1,███,███,,,,,web-scale,Nov/2023,🟡,D,https://koreajoongangdaily.joins.com/news/2023-11-08/business/tech/Samsung-unveils-generative-AI-model-Gauss/1908889,,"Gauss Language specializing in generating texts, Gauss Code on software and code description and Gauss Image for image creation. Dataset: Estimate: Samsung Gauss Language 7B, no public disclosure; LLaMA-2 era contemporary ~2000B is norm for 7B from scratch.",213,███,███,███,███,Proprietary,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Grok-1,xAI,https://grok.x.ai/,314,86,MoE,"6,000",20:1,███,4.6,,,,,███,Nov/2023,🟢,C,https://github.com/xai-org/grok-1,,Context window=8192. UI: https://twitter.com/TobyPhln/status/1721053802235621734,212,███,███,███,███,Apache 2.0,"8,192",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Grok-0,xAI,https://grok.x.ai/,33,,Dense,"2,000",61:1,███,0.9,,███,,,web-scale,Nov/2023,🔴,C,https://web.archive.org/web/20231105051542/https://x.ai/,,"Announced Nov/2023, trained Jul/2023",211,███,███,███,███,Proprietary,"8,192",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Yi-34B,01-ai,https://huggingface.co/01-ai/Yi-34B,34.4,,Dense,"3,000",88:1,███,1.1,76.3,43,,,███,Nov/2023,🟢,A,https://github.com/01-ai/Yi,,Controversy about Llama 2 base. https://twitter.com/kaifulee/status/1724673131875377465 MMLU=76.3 (PaLM 2=78.3) Outperforms Llama 2. Chinese and English. 
https://www.bloomberg.com/news/articles/2023-11-05/kai-fu-lee-s-open-source-01-ai-bests-llama-2-according-to-hugging-face,210,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-4 Turbo,OpenAI,███,70,3.5,MoE,"13,000",186:1,███,3.2,86.4,,46.5,,web-scale,Nov/2023,🟢,D,https://cdn.openai.com/papers/gpt-4.pdf,SOTA,https://openai.com/blog/new-models-and-developer-products-announced-at-devday,209,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MatFormer,Google DeepMind,,0.85,0.85,MatFormer,80,95:1,███,0.03,,███,,,,Oct/2023,🟢,B,https://arxiv.org/abs/2310.07707,,"Matryoshka Transformer or MatFormer model architecture. 850M (696M / 620M / 582M). ""850M decoder-only MatFormer language model (MatLM) allows us to extract multiple smaller models spanning from 582M to 850M parameters, each exhibiting better validation loss and one-shot downstream evaluations than independently trained counterparts.""",208,███,███,███,███,███,"8,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Skywork-13B,Kunlun Tech,,13,,Dense,"3,200",247:1,███,0.7,62.7,,,███,web-scale,Oct/2023,🟢,B,https://arxiv.org/abs/2310.19341,,CN + EN.,207,███,███,███,███,███,"4,096",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Kimi Chat,Moonshot AI,https://kimi.moonshot.cn/,███,,Dense,"2,000",20:1,███,1.5,,,,,web-scale,Oct/2023,🟢,C,https://www.chinadaily.com.cn/a/202403/22/WS65fce476a31082fc043be1b1.html,,Chinese. Long context. No paper.,206,███,███,███,███,Proprietary,"131,072",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ jina-embeddings-v2,Jina AI,https://huggingface.co/jinaai/jina-embeddings-v2-base-en,0.435,███,Dense,30,69:1,███,0.01,,,,,web-scale,Oct/2023,🟢,C,https://jina.ai/news/jina-ai-launches-worlds-first-open-source-8k-text-embedding-rivaling-openai/,,Alternative to text-embedding-ada-002. 
Related v1 paper: https://arxiv.org/abs/2307.11224 Dataset: Estimate: 435M encoder embedding model trained on text pairs; no disclosure; encoder models typically ~30-100B tokens (BERT-style).,205,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Fuyu,Adept,https://huggingface.co/adept/fuyu-8b,8,,Dense,"1,000",███,███,0.3,,,,,web-scale,Oct/2023,🟢,C,https://www.adept.ai/blog/fuyu-8b,,"VLM. 8B available under open licence, Medium size is closed Dataset: Estimate: Adept Fuyu 8B trained from scratch on text+image, no disclosure; era-norm Chinchilla ~20x for 8B = ~1000B.",204,███,███,███,███,███,"16,384",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ERNIE 4.0,Baidu,https://yiyan.baidu.com/,1000,,Dense,"20,000",20:1,███,14.9,,,,,web-scale,Oct/2023,🟢,C,https://reuters.com/technology/chinas-baidu-unveils-latest-version-its-ernie-ai-model-2023-10-17/,SOTA,███,203,███,███,███,███,Proprietary,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Zephyr,Hugging Face,https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha,7.3,,Dense,800,110:1,███,0.3,,33,,,███,Oct/2023,🟢,C,https://huggingface.co/HuggingFaceH4/zephyr-7b-alpha,,Mistral with 'aligned' data removed from dataset,202,███,███,███,███,MIT,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ PaLI-3,Google DeepMind,███,5,,Dense,"1,000",200:1,███,0.2,,,,,special,Oct/2023,🔴,D,https://arxiv.org/abs/2310.09199,,VLM. Next iteration of PaLI via Pathways. 
https://lifearchitect.ai/pathways/ Dataset: UL2 3B language base (~1T C4 tokens) + SigLIP 2B vision encoder (WebLI image-text contrastive) + multimodal stages; total dominated by UL2 pretrain ~1000B.,201,███,███,███,███,Proprietary,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Retro 48B,NVIDIA,,48,,Dense,███,25:1,███,0.8,,,,,web-scale,Oct/2023,🟢,B,https://arxiv.org/abs/2310.07713,,'the largest LLM pretrained with retrieval before instruction tuning.',200,███,███,███,███,Other,"2,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Ferret,Apple,https://github.com/apple/ml-ferret,13,███,Dense,"2,000",154:1,███,0.5,,,,,"web-scale, special",Oct/2023,🟢,A,https://arxiv.org/abs/2310.07704,,"Vicuna base, multimodal",199,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Lemur,XLANG Lab,https://github.com/OpenLemur/Lemur,70,,Dense,"2,090",███,███,1.3,,,,,web-scale,Oct/2023,🟢,A,https://arxiv.org/abs/2310.06830,,https://arxiv.org/abs/2310.06830,198,███,███,███,███,Llama 2,███,Hong Kong,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ AceGPT,KAUST/Shenzhen,https://huggingface.co/FreedomIntelligence/AceGPT-13B,13,,Dense,"2,010",155:1,███,███,,,,,web-scale,Oct/2023,🟢,A,https://github.com/FreedomIntelligence/AceGPT/tree/main,,Arabic. Llama 2 + RLAIF,197,███,███,███,███,Apache 2.0,"2,048",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Yasa-1,Reka AI,https://reka.ai/announcing-our-multimodal-ai-assistant/,21,,Dense,700,34:1,███,0.4,,,███,,web-scale,Oct/2023,🟡,D,https://reka.ai/product/,,"Multi-modal. No public arch info. Researchers from DeepMind, Google, Baidu and Meta building enterprise models Dataset: Estimate: Reka Yasa-1 multimodal, no public disclosure; era-norm proprietary 2023 chat model ~700B (Chinchilla-style for ~20-30B class). 
Params: Reka multimodal pre-Core; ~21B per HF/community leaks",196,███,███,███,███,Proprietary,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ RT-X,Google DeepMind,https://robotics-transformer-x.github.io/,55,,███,"1,600",30:1,███,1.0,,,,,robotics,Oct/2023,🟢,C,https://robotics-transformer-x.github.io/paper.pdf,,"Robotics using UL2. 'RT-1 model trained using the robotic data mixture as RT-1-X, and the RT-2 model trained using the robotic data mixture as RT-2-X.' Dataset: Built on RT-2 (PaLI-X 55B base ~1500B tokens) + Open X-Embodiment robotics mixture (~1M episodes from 22 robots, ~100B token-equiv); total ~1600B.",195,███,███,███,███,Proprietary,"1,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MotionLM,Waymo,,0.09,,Dense,1,12:1,███,0.001,,,,,special,Sep/2023,🔴,D,https://arxiv.org/abs/2309.16534,███,"LLM for autonomous vehicle forecasting. https://youtu.be/jrMMNmN21I8?t=1560 Dataset: Tiny 90M autonomous-vehicle motion forecaster, trained on Waymo Open Motion Dataset trajectories; estimate ~1B token-equivalents.",194,███,███,███,███,Proprietary,"1,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GAIA-1,Wayve,https://wayve.ai/thinking/scaling-gaia-1/,9,,Dense,60,7:1,███,0.08,,,,,web-scale,Sep/2023,🔴,C,https://arxiv.org/abs/2309.17080,,"World model, generates video. Uses T5-large 770M for language + all vision parameters Dataset: Paper: 4700 hours driving video at 6.25Hz = ~105M frames x 576 image tokens + text/action tokens ~= 60B tokens; T5-Large 770M frozen for text.",███,███,███,███,███,███,"4,000",UK,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Qwen,Alibaba,███,72,,Dense,"3,000",42:1,███,1.5,,,,,web-scale,Sep/2023,🟢,A,https://arxiv.org/abs/2309.16609,,Chinese. Full name is 'Tongyi Qianwen' 通义千问. 'Lags behind both GPT-3.5 and GPT-4'. 
Originally 7B/14B params Apr/2023,192,███,███,███,███,███,"8,192",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Llama 2 Long,Meta AI,,70,,Dense,"2,400",35:1,███,1.4,,,,,web-scale,Sep/2023,🔴,B,https://arxiv.org/abs/2309.16039,,███,191,███,███,███,███,Llama 2,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ LeoLM,Hessian AI/LAION,https://huggingface.co/LeoLM/leo-hessianai-13b,13,███,Dense,"2,065",159:1,███,0.5,,,,,web-scale,Sep/2023,🟢,C,https://laion.ai/blog/leo-lm/,,Llama 2 'extended' and pretrained on 2000B Llama 2 tokens + 65B tokens of German,190,███,███,███,███,Llama 2,"8,192",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Mistral 7B,Mistral,https://huggingface.co/mistralai,7.3,███,Dense,800,110:1,███,0.3,60.1,30.9,,,web-scale,Sep/2023,🟢,C,https://arxiv.org/abs/2310.06825,,"Apache 2.0, Sliding Window Attention (SWA) to handle longer sequences at smaller cost",189,███,███,███,███,Apache 2.0,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Kosmos-2.5,Microsoft,,1.3,,Dense,150,116:1,███,0.05,,,,,web-scale,Sep/2023,🔴,D,███,,Dataset: Pix2Struct-Large vision encoder + 24-layer text decoder from scratch on 324.4M pages (layout PDFs + markdown); estimate ~150B tokens.,188,███,███,███,███,MIT,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Baichuan 2,Baichuan,https://github.com/baichuan-inc/Baichuan2/blob/main/README_EN.md,13,,Dense,"2,600",200:1,███,0.6,,,,,web-scale,Sep/2023,🟢,A,https://cdn.baichuan-ai.com/paper/Baichuan2-technical-report.pdf,███,Great paper. 
Chinese-English bilingual dataset,187,███,███,███,███,Other,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ BOLT2.5B,ThirdAI,https://huggingface.co/spaces/thirdai/BOLT2.5B,███,,Dense,40,16:1,███,0.03,,,,,special,Sep/2023,🟢,A,https://medium.com/thirdai-blog/introducing-the-worlds-first-generative-llm-pre-trained-only-on-cpus-meet-thirdai-s-bolt2-5b-10c0600e1af4,,CPU trained,186,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeciLM,Deci,https://huggingface.co/Deci/DeciLM-6b,5.7,,Dense,200,███,███,0.1,,,,,web-scale,Sep/2023,🟢,C,https://deci.ai/blog/decilm-15-times-faster-than-llama2-nas-generated-llm-with-variable-gqa/,,Faster inference (4.8× throughput of Llama 2),185,███,███,███,███,Llama 2,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MoLM,IBM,https://github.com/ibm/moduleformer,8,0.7,MoE,300,38:1,███,0.2,,,,,web-scale,Sep/2023,🟢,A,https://arxiv.org/abs/2306.04640,,ModuleFormer is based on the Sparse Mixture of Experts (MoE).,███,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ NExT-GPT,NUS,https://next-gpt.github.io/,7,,Dense,"1,000",143:1,███,0.3,,,,,web-scale,Sep/2023,🟢,A,███,,Multimodal. Vicuna 7B + other modalities,183,███,███,███,███,Other,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Phi-1.5,Microsoft,https://huggingface.co/microsoft/phi-1_5,1.3,,Dense,150,116:1,███,0.05,,███,,,"synthetic, web-scale",Sep/2023,🟢,A,https://arxiv.org/abs/2309.05463,,Textbooks only. 30B-token dataset,182,███,███,███,███,MIT,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ UniLM,Apple,███,0.034,,Dense,1,30:1,███,0.001,,,,,special,Sep/2023,🟢,C,https://github.com/jackcook/predictive-spy,,Apple's Transformer model for iOS 17 + macOS Sonoma. Announce is actually Jun/2023. GPT-2 base? 
128-token context window,181,███,███,███,███,Other,"2,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Persimmon-8B,Adept,https://www.adept.ai/blog/persimmon-8b,8,,Dense,737,93:1,███,0.3,,,,,web-scale,Sep/2023,🟢,A,https://github.com/persimmon-ai-labs/adept-inference,███,Open Apache license and publicly accessible weights.,180,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ FLM-101B,BAAI,https://huggingface.co/CofeAI/FLM-101B,101,,Dense,245,███,███,0.5,,,,,web-scale,Sep/2023,🟢,A,https://arxiv.org/abs/2309.03852,,Trained on a $100K compute budget (a cluster of 24 DGX-A800 GPU (8×80G) servers for 21 days),179,███,███,███,███,███,"2,048",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Falcon 180B,TII,https://huggingface.co/spaces/tiiuae/falcon-180b-demo,180,,Dense,"3,500",20:1,███,2.6,70.6,,,,web-scale,Sep/2023,🟢,A,https://arxiv.org/abs/2311.16867,███,Major milestone for open source models (largest open dense model to date).,178,███,███,███,███,Other,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Hunyuan,Tencent,https://www.tencent.com/en-us/articles/2201685.html,100,,Dense,"2,000",20:1,███,███,,,,,web-scale,Sep/2023,🟢,A,https://arxiv.org/abs/2402.01723v1,,"Tencent's 100B dense LLM with strong Chinese and English capabilities focused on industrial Chinese NLP; closed/API-only. The later open-source Hunyuan-Large (2024) is a 389B-total / 52B-active MoE with 256K context, MMLU 88.4, MATH 69.8, CMMLU 90.2.",177,███,███,███,███,███,"8,192",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ phi-CTNL,Independent,,0.1,,███,0,1:1,███,0.000,,,,,web-scale,Sep/2023,🟢,B,https://arxiv.org/abs/2309.08632,,Satire. MMLU=100. 
'phi-CTNL (pronounced “fictional”) that achieves perfect results across diverse academic benchmarks',176,███,███,███,███,Other,███,,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Granite,IBM,https://www.ibm.com/granite,13,,Dense,"2,500",193:1,███,0.6,57,,,███,web-scale,Sep/2023,🟢,A,https://www.ibm.com/downloads/cas/X9W4O6BM,,"Original trained on 1T tokens, update 15/Feb/2024 trained on 2.5T tokens: granite-13b-chat-v2 (v2.1.0). ""At IBM, we curated 6.48TB of data to train our LLM Granite.13B. This was reduced to 2.07 TB after pre-processing, a 68% decrease.""",175,███,███,███,███,Apache 2.0,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Jais,Inception AI,https://huggingface.co/inception-mbzuai,███,,Dense,395,31:1,███,0.2,,,,,web-scale,Aug/2023,🟢,A,https://arxiv.org/abs/2308.16149,,"Arabic, trained in Abu Dhabi, UAE using Cerebras.",174,███,███,███,███,███,"4,000",UAE,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Code Llama 34B,Meta AI,https://github.com/facebookresearch/codellama,34,,Dense,"2,600",77:1,███,1.0,,,,,███,Aug/2023,🟢,A,https://ai.meta.com/research/publications/code-llama-open-foundation-models-for-code/,,"Outperforms GPT-3.5. Llama 2 base (2T tokens) further trained on 500B tokens of code, then 100B tokens of Python",173,███,███,███,███,Llama 2,"16,384",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ IDEFICS,Hugging Face,https://huggingface.co/spaces/HuggingFaceM4/idefics_playground,80,,Dense,"1,515",19:1,███,1.2,,,,,web-scale,Aug/2023,🟢,C,███,,"Clone of Flamingo using Llama-1 65B. 
Named after Asterix and Obelix's dog Idefix (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS) Dataset: LLaMA-1 65B base (1400B tokens) + OBELICS 115B interleaved image-text tokens (141M docs, 353M images) = 1515B.",172,███,███,███,███,Other,"2,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Raven,NVIDIA,,11,,Dense,40,4:1,███,0.07,,,,,web-scale,Aug/2023,🔴,B,https://arxiv.org/abs/2308.07922,███,RAG Atlas,171,███,███,███,███,Proprietary,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DukunLM,AzaleAI,https://huggingface.co/azale-ai/DukunLM-13B-V1.0-Uncensored,13,,Dense,"1,500",116:1,███,0.5,███,,,,web-scale,Aug/2023,🟢,A,https://huggingface.co/azale-ai/DukunLM-13B-V1.0-Uncensored,,Indonesian fine-tune of WizardLM (which is a Llama fine-tune).,170,███,███,███,███,CC-BY-NC 4.0,███,Indonesia,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ WizardLM 70B,Microsoft,https://huggingface.co/WizardLM/WizardLM-70B-V1.0,70,,Dense,"2,000",███,███,1.2,,,,,web-scale,Aug/2023,🟢,A,https://github.com/nlpxucan/WizardLM,,Assume Llama-2 fine-tune. Outperforms text-davinci-003. 
May merge this entry with the Apr/2023 7B release,169,███,███,███,███,Llama 2,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Platypus,Boston University,https://platypus-llm.github.io/,70,,Dense,"2,000",███,███,1.2,,,,,web-scale,Aug/2023,🟢,A,https://platypus-llm.github.io/Platypus.pdf,,"Fine-tune of Llama 2, family includes merges with Beluga, Dolphin, and Camel fine-tunes.",168,███,███,███,███,CC-BY-NC 4.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Japanese StableLM Alpha 7B,Stability AI,https://huggingface.co/stabilityai/japanese-stablelm-base-alpha-7b,7,,Dense,750,108:1,███,0.2,,,███,,web-scale,Aug/2023,🟢,A,https://stability.ai/blog/stability-ai-new-jplm-japanese-language-model-stablelm,,Best-performing openly available language model for Japanese speakers.,167,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Stable Code 3B,Stability AI,https://huggingface.co/stabilityai/stablecode-completion-alpha-3b-4k,███,,Dense,560,208:1,███,0.1,,,,,"code, The Stack",Aug/2023,🟢,A,https://stability.ai/blog/stablecode-llm-generative-ai-coding,,"Context window=16,384. Trained on The Stack dataset.",166,███,███,███,███,Apache 2.0,"16,384",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Med-Flamingo,Stanford,https://github.com/snap-stanford/med-flamingo,8.3,,███,"1,000",121:1,███,0.3,,,,,medical,Jul/2023,🟢,A,https://arxiv.org/abs/2307.15189,,"Uses LAION OpenFlamingo 9B, based on LLaMA-7B text + 1.3B vision",165,███,███,███,███,CC-BY-NC-SA 4.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Alfred-40B-0723,LightOn,https://huggingface.co/lightonai/alfred-40b-0723,40,,Dense,███,25:1,███,0.7,,,,,web-scale,Jul/2023,🟢,A,https://www.lighton.ai/blog/lighton-s-blog-4/introducing-alfred-40b-0723-38,,First finetuned version of Falcon with RLHF. 
Enterprise: https://www.lighton.ai/paradigm,164,███,███,███,███,Apache 2.0,███,France,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ LLaMA-2-7B-32K,Together,https://huggingface.co/togethercomputer/LLaMA-2-7B-32K,7,,███,"2,000",286:1,███,0.4,,,,,web-scale,Jul/2023,🟢,A,https://together.ai/blog/llama-2-7b-32k,,32k context window instead of 4k (Llama 2),163,███,███,███,███,Llama 2,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Med-PaLM M,Google DeepMind,,562,,Dense,780,2:1,███,2.2,,,,███,web-scale,Jul/2023,🔴,B,https://arxiv.org/abs/2307.14334,,Uses PaLM 1. Already outperformed by Med-PaLM 2. Med-PaLM Multimodal (Med-PaLM M).,162,███,███,███,███,Proprietary,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ BTLM-3B-8K,Cerebras,https://huggingface.co/cerebras/btlm-3b-8k-base,███,,Dense,627,209:1,███,0.1,,,,,web-scale,Jul/2023,🟢,A,https://www.cerebras.net/blog/btlm-3b-8k-7b-performance-in-a-3-billion-parameter-model/,,"Runs on devices with as little as 3GB of memory [iPhone, Macbook] when quantized to 4-bit",161,███,███,███,███,Apache 2.0,"8,192",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Stable Beluga 2,Stability AI,https://huggingface.co/stabilityai/FreeWilly2,70,,Dense,"2,000",███,███,1.2,,,,,web-scale,Jul/2023,🟢,A,https://stability.ai/blog/stable-beluga-large-instruction-fine-tuned-models,,Fine-tuned Llama 2. Non-commercial use license. Codename was FreeWilly2,160,███,███,███,███,███,"4,096",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Stable Beluga 1,Stability AI,https://huggingface.co/stabilityai/FreeWilly1-Delta-SafeTensor,65,,Dense,"1,400",22:1,███,1.0,,,,███,web-scale,Jul/2023,🟢,A,https://stability.ai/blog/stable-beluga-large-instruction-fine-tuned-models,,Fine-tuned LLaMA-1. Non-commercial use license. 
Codename was FreeWilly1,159,███,███,███,███,███,"2,048",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Meta-Transformer,Shanghai AI Laboratory/CUHK,https://github.com/invictus717/MetaTransformer,███,,Dense,400,200:1,███,0.09,,,,,web-scale,Jul/2023,🟢,C,https://arxiv.org/abs/2307.10802,,"Proto-AGI. 12 modalities (text, image, point cloud, audio, video, infrared, hyperspectral, X-ray, time-series, tabular, Inertial Measurement Unit (IMU), and graph data). Dataset: Frozen ViT backbone pretrained contrastively on LAION-2B image-text pairs (note: ""2B"" in name = dataset, not params); ~400B token-equiv.",158,███,███,███,███,███,"4,000",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Llama 2,Meta AI,https://www.llama2.ai/,███,,Dense,"2,000",29:1,███,1.2,68.9,37.5,26.26,,web-scale,Jul/2023,🟢,A,https://ai.meta.com/research/publications/llama-2-open-foundation-and-fine-tuned-chat-models/,SOTA,"Context window=4096. MMLU=68.9 (GPT-3.5=70.0, GPT-4=86.4)",157,███,███,███,███,Llama 2,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ WormGPT,(Undisclosed),,6,,███,402,67:1,███,0.2,,,,,web-scale,Jul/2023,🟡,B,https://slashnext.com/blog/wormgpt-the-generative-ai-tool-cybercriminals-are-using-to-launch-business-email-compromise-attacks/,,GPT-J (2021) finetune/module.,156,███,███,███,███,Proprietary,███,,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Claude 2,Anthropic,https://claude.ai/,130,,███,"2,500",20:1,███,1.9,78.5,,,,web-scale,Jul/2023,🟢,D,https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf,,"More HHH, 100k context length",155,███,███,███,███,Proprietary,"100,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ LongLLaMA,IDEAS/DeepMind,https://github.com/CStanKonrad/long_llama,7,,Dense,"1,000",143:1,███,0.3,,,,,web-scale,Jul/2023,███,A,https://arxiv.org/abs/2307.03170,,256k context length,154,███,███,███,███,Apache 
2.0,"262,144",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ xTrimoPGLM,Tsinghua,,100,,Dense,"1,000",10:1,███,1.1,,,███,,special,Jul/2023,🔴,B,https://www.biorxiv.org/content/10.1101/2023.07.05.547496v1,,Protein language model,153,███,███,███,███,Non-commercial research,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ XGen,Salesforce,https://github.com/salesforce/xgen,7,,███,"1,500",215:1,███,0.3,,,,,web-scale,Jul/2023,🟢,A,https://blog.salesforceairesearch.com/xgen/,,8K sequence length. Released under Apache-2.0.,152,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Zhinao (Intellectual Brain),360 cn,https://ai.360.com/,100,,Dense,"2,000",███,███,1.5,,,,,web-scale,Jul/2023,🟢,D,https://arxiv.org/abs/2405.13386,,"360's Zhinao 7B model pretrained on 3.4T tokens with context lengths of 4K, 32K, and 360K via tailored continual pretraining; emphasizes data quality and composition ablations; aligned with SFT, reward models, and RLHF. The 100B figure refers to a larger internal version. Open-source at github.com/Qihoo360/360zhinao.",151,███,███,███,███,Apache 2.0,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Yasa,Reka AI,https://reka.ai/product/,7,███,Dense,500,72:1,███,0.2,,,,,web-scale,Jun/2023,🟡,D,https://reka.ai/product/,,"No public arch info. Researchers from DeepMind, Google, Baidu and Meta building enterprise models Dataset: Estimate: Reka Yasa, first proprietary model from Reka, no disclosure; likely 7-21B range with era norm of ~500B tokens. Params: Earlier Reka model, smaller iteration ~7B",150,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Kosmos-2,Microsoft,https://44e505515af066f4.gradio.app/,1.6,,Dense,360,225:1,███,0.08,,,,,web-scale,Jun/2023,🟢,A,https://arxiv.org/abs/2306.14824,,Proto-AGI. Multimodal large language model (MLLM). 
A multimodal large language model with grounding capability built upon KOSMOS-1,███,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ AudioPaLM,Google,https://google-research.github.io/seanet/audiopalm/examples/,8,,Dense,"4,600",575:1,███,0.6,,,,,speech,Jun/2023,🔴,C,https://arxiv.org/abs/2306.12925,,A unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. Dataset: PaLM 2 3.6T + 1T estimate,███,███,███,███,███,Proprietary,"2,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Inflection-1,Inflection AI,https://docs.google.com/forms/d/e/1FAIpQLScM9Iz1KzaRlfgDrYrldoPDnXbhO5LW3-hqmQCd56YpheEN7g/viewform,120,███,Dense,"2,000",17:1,███,1.6,,,,,web-scale,Jun/2023,🟢,D,https://inflection.ai/assets/Inflection-1_0622.pdf,,"Comparable with benchmarking results from InternLM 104B, 1-2% better. ‘Inflection-1 was trained using thousands of NVIDIA H100 GPUs on a very large dataset.’",147,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Phi-1,Microsoft,,1.3,,Dense,51,███,███,0.03,,,,,"synthetic, web-scale",Jun/2023,🔴,B,https://arxiv.org/abs/2306.11644,,"Code model. 
‘breaking existing scaling laws by training a 1.3B-parameter model, which we call phi-1, for roughly 8 passes over 7B tokens (slightly over 50B total tokens seen) followed by finetuning on less than 200M tokens.’",146,███,███,███,███,MIT,"2,048",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ InternLM,Shanghai AI Laboratory/SenseTime,https://internlm-org.translate.goog/?_x_tr_sl=zh&_x_tr_tl=en,104,,Dense,"1,600",16:1,███,1.4,,,,,web-scale,Jun/2023,🔴,A,https://github.com/InternLM/InternLM-techreport,,"Outperforms ChatGPT, LLaMA on RACE-h, Chinese + English",███,███,███,███,███,███,"4,096",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ BlenderBot 3x,Meta AI,███,175,,Dense,300,2:1,███,0.8,,,,,web-scale,Jun/2023,🟢,A,https://arxiv.org/abs/2306.04707,,OPT-175B with new dialogue data,144,███,███,███,███,CC-BY-NC 4.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Orca,Microsoft,███,13,,Dense,"1,000",77:1,███,0.4,,,,,web-scale,Jun/2023,🟡,A,https://arxiv.org/abs/2306.02707,,"LLaMA -> Vicuna -> Orca (GPT-4 finetune). Still an imitation model, overhyped: The False Promise of Imitating Proprietary LLMs https://arxiv.org/abs/2305.15717",143,███,███,███,███,Apache 2.0,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ PassGPT,ETH Zürich,,0.124,,███,0,3:1,███,0.001,,,,,special,Jun/2023,🔴,D,https://arxiv.org/abs/2306.01545,,GPT-2 trained on leaked passwords Dataset: GPT-2 small (124M) architecture trained from scratch on ~30M RockYou passwords (~0.24B tokens at ~8 chars/password) = ~0.3B.,142,███,███,███,███,CC-BY-NC-SA 4.0,███,Switzerland,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DIDACT,Google DeepMind,,5,,Dense,"37,900","7,580:1",███,1.5,,,,,special,Jun/2023,███,F,https://ai.googleblog.com/2023/05/large-sequence-models-for-software.html,,Iterative coding model trained on Google's monorepo. 
Jacob: https://twitter.com/jacobaustin132/status/1663972128176128002 Params: Typical of era.,141,███,███,███,███,Proprietary,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ LTM-1,Magic,https://magic.dev/blog/ltm-1,7,,███,30,5:1,███,0.05,,,,,special,Jun/2023,🔴,D,https://magic.dev/blog/ltm-1,,"Context window=5M Dataset: Estimate: Magic LTM-1 (5M context novelty model), no public params or tokens; conservative estimate for a research demo code model = ~30B.",140,███,███,███,███,Proprietary,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-4 MathMix,OpenAI,,1760,88,MoE,"13,000",8:1,███,15.9,,,,,web-scale,May/2023,🔴,D,https://arxiv.org/abs/2305.20050,,███,139,███,███,███,███,Proprietary,"32,768",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ PandaGPT,Cambridge/Tencent,https://panda-gpt.github.io/,13,,Dense,"1,000",77:1,███,███,,,,,web-scale,May/2023,🟢,A,https://github.com/yxuansu/PandaGPT/blob/main/PandaGPT.pdf,,"Proto-AGI. 6 modalities (text, image/video, audio, depth, thermal, and IMU/accelerometer/gyroscope/compass). 
Based on Vicuna.",138,███,███,███,███,CC-BY-NC 4.0,"2,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Falcon,TII,https://huggingface.co/tiiuae/falcon-40b,40,,Dense,"1,000",25:1,███,0.7,,,,,web-scale,May/2023,███,A,https://arxiv.org/abs/2311.16867,,Abu Dhabi,137,███,███,███,███,███,"2,048",UAE,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ 202305-refact2b-mqa-lion,Refact,https://refact.ai/blog/2023/applying-recent-innovations-to-train-model/,1.6,,Dense,30,███,███,0.02,,,,,web-scale,May/2023,🟡,C,https://refact.ai/blog/2023/applying-recent-innovations-to-train-model/,,"LiON vs Adam, code, RedPajama+The Stack Dataset: Estimate: Refact 1.6B, blog mentions RedPajama + Stack-Dedup-1.2 + internal diffs but no token total; LLaMA-style ~20-30x = ~30B for 1.6B model.",136,███,███,███,███,Other,███,UK,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Guanaco,UW,https://huggingface.co/spaces/uwnlp/guanaco-playground-tgi,65,,Dense,"1,400",22:1,███,1.0,,,,,web-scale,May/2023,🟢,A,https://arxiv.org/abs/2305.14314,,███,135,███,███,███,███,Non-commercial research,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ LIMA,Meta AI,,65,,Dense,"1,400",22:1,███,1.0,,,,███,web-scale,May/2023,🔴,D,https://arxiv.org/abs/2305.11206,,"LLaMA-65B with nearly no fine-tuning, no RLHF Dataset: LLaMA-65B base (1400B tokens) + LIMA fine-tune (1000 curated examples, no RLHF, negligible); total ~1400B.",134,███,███,███,███,███,"2,048",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Formosa (FFM),Asus/TWS,,176,,Dense,366,███,███,0.8,,,,,web-scale,May/2023,🟡,B,https://www.asus.com/news/xxifirl2s2tzesl0/,,"BLOOMZ finetune? Chinese, Taiwan's first LLM. 
Subscription hardware: https://archive.md/cVdJt",133,███,███,███,███,Other,███,Taiwan,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ CodeT5+,Salesforce,https://huggingface.co/Salesforce/codet5p-16b,16,,Dense,629,40:1,███,0.3,,,,,code,May/2023,🟢,███,https://arxiv.org/abs/2305.07922,,"'InstructCodeT5+ 16B sets new SoTA results of 35.0% pass@1 and 54.5% pass@10 against other open code LLMs, even surpassing the closed-source OpenAI code-cushman-001' Dataset: CodeGen-Mono 16B base (577B per CodeGen paper) + 51.5B CodeT5+ stage-1 GitHub code + ~1.9M CodeSearchNet pairs = 629B.",132,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ PaLM 2,Google,https://console.cloud.google.com/vertex-ai/generative/language/create/chat,███,,Dense,"3,600",11:1,███,3.7,,,,,web-scale,May/2023,🟢,D,https://arxiv.org/abs/2305.10403,SOTA,"“What we found in our work is that it’s not really the sort of size of model — that the larger is not always better,” DeepMind VP Zoubin Ghahramani said in a press briefing ahead of today’s announcement. “That’s why we’ve provided a family of models of different sizes. 
We think that actually parameter count is not really a useful way of thinking about the capabilities of models and capabilities are really to be judged by people using the models and finding out whether they’re useful in the tests that they try to achieve with these models.”",131,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ StarCoder,ServiceNow,https://huggingface.co/bigcode/starcoderbase,15.5,,Dense,"1,000",65:1,███,0.4,,,,███,code,May/2023,🟢,A,https://arxiv.org/abs/2305.06161,,"""StarCoderBase is a 15.5B parameter model trained on 1 trillion tokens sourced from The Stack"" across 80+ programming languages with PII redaction and opt-out mechanism; 8,192-token context with multi-query attention; StarCoder fine-tunes on 35B Python tokens and achieves 40% pass@1 on HumanEval, matching code-cushman-001.",130,███,███,███,███,███,"8,192",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MPT,MosaicML,https://huggingface.co/mosaicml/mpt-7b,7,,Dense,"1,000",143:1,███,0.3,,,,,web-scale,May/2023,███,A,https://twitter.com/NaveenGRao/status/1654496162492084227,,'Llongboi' -Apache 2.0 license suitable for commercial use. -Base 7B LLM trained on 1T tokens outperforms LLaMA and GPT3. -64K+ context length. -$200k to train from scratch.,129,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Pi,Inflection AI,https://pi.ai/talk,60,,Dense,"1,200",███,███,0.9,,,,,web-scale,May/2023,🟢,D,https://www-cnbc-com.cdn.ampproject.org/c/s/www.cnbc.com/amp/2022/03/08/reid-hoffman-has-set-up-a-new-ai-company-with-deepminds-co-founder.html,,"No indication of params/tokens. Devs from DeepMind.
Dataset: Estimate: Inflection Pi 60B (first release), proprietary, no disclosure; Chinchilla-optimal 20x = 1200B.",128,███,███,███,███,███,"8,192",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-2B-001,NVIDIA,https://huggingface.co/nvidia/GPT-2B-001,2,███,Dense,"1,100",550:1,███,0.2,,,,,web-scale,May/2023,🟢,A,https://huggingface.co/nvidia/GPT-2B-001,,No paper yet,127,███,███,███,███,███,"2,048",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Titan,Amazon,https://aws.amazon.com/bedrock/titan/,200,,Dense,"4,000",20:1,███,3.0,70.4,,,███,web-scale,Apr/2023,🟢,A,https://www.techrepublic.com/article/amazon-bedrock-titan-cloud-artificial-intelligence/,,"No official information at all. 2nd hand via Jack Clark: https://importai.substack.com/p/import-ai-365-wmd-benchmark-amazon '$65m training run. Specifically, they trained a 200B dense model on 4T tokens of data across 13,760 NVIDIA A100 chips (using 1,720 P4d nodes). It took 48 days to train.'",126,███,███,███,███,███,"2,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ WizardLM 7B,Microsoft,https://6f8173a3550ed441ab.gradio.live/,7,,Dense,"1,000",143:1,███,0.3,,,,,web-scale,Apr/2023,🟢,███,https://arxiv.org/abs/2304.12244,,LLaMA 7B self-instructed fine-tune.,125,███,███,███,███,███,"4,096",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MPT,MosaicML,https://huggingface.co/mosaicml/mpt-1b-redpajama-200b-dolly,1.3,,Dense,200,154:1,███,0.05,,,,,web-scale,Apr/2023,███,A,https://twitter.com/jefrankle/status/1649060478910357504,,More 1B models coming with different datasets. 
Many more.,124,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ StableLM,Stability AI,https://github.com/stability-AI/stableLM/,7,,Dense,"1,500",215:1,███,0.3,,,███,,web-scale,Apr/2023,🟢,A,https://github.com/stability-AI/stableLM/,,Params: 65 -> 7 (largest actually released Apr/2023 was 7B; 65B was never released at announcement) | Tokens: 1500B -> 800B (actual training for 3B and 7B models at launch; 1.5T was a future aspiration),123,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Dolly 2.0,Databricks,https://huggingface.co/databricks/dolly-v2-12b,███,,Dense,300,25:1,███,0.2,,,,,web-scale,Apr/2023,🟢,A,https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm,,Fine-tuned Pythia 12B,122,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Pythia,EleutherAI,https://huggingface.co/EleutherAI/pythia-12b,12,,Dense,300,███,███,0.2,,,,,web-scale,Apr/2023,🟢,A,https://arxiv.org/abs/2304.01373,,"Suite of 16 decoder-only LLMs from 70M to 12B parameters, all trained identically on The Pile in the same token order with 154 public intermediate checkpoints each; designed for interpretability and training-dynamics research; used to study memorization, term-frequency effects on few-shot performance, and gender bias reduction.",121,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Koala-13B,Berkeley,https://chat.lmsys.org/?model=koala-13b,13,,Dense,"1,000",77:1,███,0.4,,,,,web-scale,Apr/2023,🟢,C,https://bair.berkeley.edu/blog/2023/04/03/koala/,,LLaMA base. Academic licence only. 
Dataset: LLaMA-13B base (1000B tokens) + small fine-tune on ~60k ShareGPT/dialogue examples (negligible); total ~1000B.,███,███,███,███,███,Non-commercial research,"2,048",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ C1.2,Character.ai,https://blog.character.ai/character-ai/,20,,Dense,"1,000",50:1,███,0.5,,,,,web-scale,Mar/2023,🟢,D,https://blog.character.ai/character-ai/,███,No details released.,119,███,███,███,███,███,"2,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ BloombergGPT,Bloomberg,,50,,Dense,569,12:1,███,0.6,39.2,,,,web-scale,Mar/2023,🔴,B,https://arxiv.org/abs/2303.17564,███,"Video: https://youtu.be/m2Scj2SO85Y Underperforms GPT-3, based on BLOOM. Tokens: 'We select a model size motivated by Hoffmann et al. (2022) and train a 50 billion parameter model on 569 billion tokens from our corpus of over 700 billion tokens to produce a model that is competitive with larger models.'",118,███,███,███,███,Proprietary,"2,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ OpenFlamingo-9B,LAION,https://huggingface.co/openflamingo/OpenFlamingo-9B,8.3,,Dense,"1,000",121:1,███,0.3,,,,,web-scale,Mar/2023,🟢,A,https://laion.ai/blog/open-flamingo/,,███,117,███,███,███,███,Non-commercial research,███,Germany,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT4All-LoRa,Nomic,https://github.com/nomic-ai/gpt4all,7,,Dense,"1,000",143:1,███,0.3,,,,,web-scale,Mar/2023,🟢,A,https://s3.amazonaws.com/static.nomic.ai/gpt4all/2023_GPT4All_Technical_Report.pdf,,███,116,███,███,███,███,Other,"2,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Cerebras-GPT,Cerebras,███,13,,Dense,260,20:1,███,0.2,,,,,web-scale,Mar/2023,🟢,A,https://www.cerebras.net/blog/cerebras-gpt-a-family-of-open-compute-efficient-large-language-models/,,20:1 tokens to parameters as per https://lifearchitect.ai/chinchilla/,115,███,███,███,███,Apache 
2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ PanGu-Sigma,Huawei,,1085,108.5,MoE,329,1:1,███,███,,,,,web-scale,Mar/2023,🔴,D,https://arxiv.org/abs/2303.10845,,"Sparse. 1.085T parameters named PanGu-Σ. Dataset: Paper: ""329B tokens"" pretraining across 40 domains (Chinese 75B, English 76B, bilingual 78B, code 75B, etc.); inherited from PanGu-alpha 13B.",114,███,███,███,███,Proprietary,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ CoLT5,Google,,5.3,,Dense,"1,050",199:1,███,0.2,,,,,"q&a, web",Mar/2023,🔴,███,https://arxiv.org/abs/2303.09752,,up to 64k context window [48k words or about 96 pages -Alan] Dataset: Paper: pre-trained 1M steps x batch 256 x seq 4096 = ~1050B tokens on C4 with UL2 objective (T5.1.1 / LongT5 architecture).,113,███,███,███,███,███,"16,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Med-PaLM 2,Google DeepMind,,340,,Dense,"3,600",11:1,███,3.7,,,,,web-scale,Mar/2023,🔴,███,https://arxiv.org/abs/2305.09617,,"Recently, our next iteration, Med-PaLM 2, consistently performed at an “expert” doctor level on medical exam questions, scoring 85%. This is an 18% improvement from Med-PaLM’s previous performance and far surpasses similar AI models.",112,███,███,███,███,███,"8,192",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-4 Classic,OpenAI,https://chat.openai.com/,1760,88,MoE,"13,000",8:1,███,15.9,86.4,,35.7,,web-scale,Mar/2023,🟢,███,https://cdn.openai.com/papers/gpt-4.pdf,SOTA,"Includes: gpt-4-0314 & gpt-4-0613, non-Turbo. Original MMLU=86.4. MMLU=90.1 with prompting. Proto-AGI. 
1.76T parameters MoE.",111,███,███,███,███,Proprietary,"8,192",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Alpaca,Stanford,https://crfm.stanford.edu/alpaca/,7,,Dense,"1,000",143:1,███,0.3,███,,,,web-scale,Mar/2023,🟢,A,https://github.com/tatsu-lab/stanford_alpaca,,'Stanford Alpaca: An Instruction-following LLaMA model',110,███,███,███,███,CC-BY-NC 4.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Jurassic-2,AI21,Studio,178,,Dense,"3,560",20:1,███,2.7,,,,,web-scale,Mar/2023,🟢,C,███,,"Dataset: Estimate: AI21 Jurassic-2 178B proprietary, no token disclosure; Chinchilla-optimal 20x params (178 x 20 = 3560B).",109,███,███,███,███,███,"8,192",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-NeoX-Chat-Base-20B,Together,https://huggingface.co/spaces/togethercomputer/OpenChatKit,20,,Dense,472,24:1,███,0.3,33.6,███,,,web-scale,Mar/2023,🟢,C,https://github.com/togethercomputer/OpenChatKit,,"'instruction-tuned 20 billion parameter language model, a 6 billion parameter moderation model, and an extensible retrieval system for including up-to-date responses from custom repositories. It was trained on the OIG-43M training dataset, which was a collaboration between Together, LAION, and Ontocord.ai.' Dataset: GPT-NeoX-20B base (472B Pile per paper) + OIG instruction fine-tune on top (small vs base); total ~472B.",108,███,███,███,███,Apache 2.0,"2,048",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Kosmos-1,Microsoft,,1.6,,Dense,███,225:1,███,0.08,,,,,web-scale,Feb/2023,🔴,B,https://arxiv.org/abs/2302.14045,,"Proto-AGI. Multimodal large language model (MLLM).
Raven’s Progressive Matrices as real images, not digits as in testing of text-davinci-003 at https://lifearchitect.ai/ravens/",107,███,███,███,███,███,"2,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ LLaMA-65B,Meta AI,Weights leaked: https://github.com/facebookresearch/llama/pull/73/files ,65,,Dense,"1,400",22:1,███,1.0,68.9,,███,,web-scale,Feb/2023,🟢,A,https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/,SOTA,"Researchers only, noncommercial only. 'LLaMA-65B is competitive with the best models, Chinchilla70B and PaLM-540B.'",106,███,███,███,███,███,"2,048",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MOSS,Fudan University,https://moss.fastnlp.top/,16,,Dense,430,███,███,0.3,,,,,web-scale,Feb/2023,🟢,A,https://txsun1997.github.io/blogs/moss.html,,Major bandwidth issues: https://www.reuters.com/technology/china-fudan-university-team-apologises-after-chatgpt-style-platform-crashes-2023-02-21/,105,███,███,███,███,███,"2,000",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Palmyra,Writer,https://huggingface.co/models?search=palmyra,20,,Dense,300,15:1,███,0.3,███,,,,web-scale,Feb/2023,🟢,A,https://writer.com/blog/palmyra/,,"Only up to 5B available open-source 'trained on over 300 billion tokens of text data, and the size of the resulting model is over 20 billion parameters. 
' https://writer.com/product/cowrite/",104,███,███,███,███,███,"2,048",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Luminous Supreme Control,Aleph Alpha,https://app.aleph-alpha.com/playground/completion,70,,Dense,588,9:1,███,0.7,,,,,web-scale,Feb/2023,🟢,███,https://www.aleph-alpha.com/pdf/2023_02_AA_Benchmarks_doc.pdf,,‘Control’ means instruction tuned,103,███,███,███,███,Proprietary,"2,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Toolformer+Atlas 11B+NLLB 54B,Meta AI,Replicated: https://github.com/conceptofmind/toolformer,6.7,,Dense,402,60:1,███,0.2,,,,,web-scale,Feb/2023,███,A,https://arxiv.org/abs/2302.04761,,Based on GPT-J 6.7B + access to other models via API,102,███,███,███,███,Other,"2,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Multimodal-CoT,Amazon,https://github.com/amazon-science/mm-cot,0.738,,Dense,"1,000","1,356:1",███,0.09,,,,,special,Feb/2023,🟢,C,https://arxiv.org/abs/2302.00923,,███,101,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ FLAME,Microsoft,,0.06,,Dense,9,150:1,███,0.002,,███,,,special,Jan/2023,🔴,B,https://arxiv.org/abs/2301.13779,,"T5 for Excel formulas, very small 60M params, ""We start from a dataset of 927M formulas"" estimate 10x multiplier for 9B tokens",100,███,███,███,███,███,"2,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Med-PaLM 1,Google DeepMind,,540,,Dense,780,2:1,███,2.2,,,,,web-scale,Dec/2022,🔴,B,https://arxiv.org/abs/2212.13138,███,Collab between Google & DeepMind. 
Makes 1% fewer errors than humans,99,███,███,███,███,███,"2,048",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ OPT-IML,Meta AI,https://github.com/facebookresearch/metaseq/tree/main/projects/OPT-IML,175,,Dense,300,2:1,███,0.8,,,,,web-scale,Dec/2022,🟢,A,███,,Instruct,98,███,███,███,███,Non-commercial research,"2,048",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ RL-CAI,Anthropic,,52,,Dense,400,8:1,███,0.5,,,,███,web-scale,Dec/2022,🔴,B,https://arxiv.org/abs/2212.08073,,RLAIF=reinforcement learning with AI feedback,97,███,███,███,███,███,"2,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ERNIE-Code,Baidu,,0.56,,Dense,60,108:1,███,0.02,,,,,code,Dec/2022,🟢,D,https://arxiv.org/abs/2212.06742#baidu,,Dataset: T5.1.1 architecture trained from scratch on 6.5M code samples + 1.9M NL-PL pairs + 1.5B doc pages CC-100 + 7.8B OPUS pairs; 100k steps; estimate ~60B total tokens.,███,███,███,███,███,Proprietary,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ RT-1,Google,,0.035,,Dense,0,3:1,███,0.000,,,,,special,Dec/2022,🔴,D,https://robotics-transformer.github.io/assets/rt1.pdf,,Dataset: Behavior cloning on ~130k robot episodes (17 months data collection); not a token-trained LM.
Estimate ~0.1B token-equivalents for the 35M model.,███,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ChatGPT (gpt-3.5-turbo),OpenAI,https://chat.openai.com/,20,,Dense,300,15:1,███,0.3,███,,28.1,,web-scale,Nov/2022,🟢,C,https://openai.com/blog/chatgpt,,"Instruct with strict policies (""extremely limited"") Dataset: Estimate: ChatGPT (gpt-3.5-turbo) base is GPT-3.5 (300B GPT-3-era pretraining) + RLHF / SFT; no official disclosure.",94,███,███,███,███,███,"4,096",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ text-davinci-003,OpenAI,https://chat.openai.com/,175,,Dense,300,2:1,███,███,,,,,web-scale,Nov/2022,🟢,D,https://openai.com/blog/chatgpt,,Dataset: Estimate: text-davinci-003 is GPT-3.5 series with RLHF; no public token disclosure; base GPT-3 was 300B tokens.,93,███,███,███,███,Proprietary,"4,096",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-JT,Together,https://huggingface.co/spaces/togethercomputer/GPT-JT,6,,Dense,406,68:1,███,0.2,,,,,web-scale,Nov/2022,🟢,C,███,,"Dataset: GPT-J base (402B Pile) + GPT-JT fine-tune (""3.53 billion tokens"": 2.62B UL2 phase + 0.92B CoT/P3/NaturalInstructions/Pile) = 406B.",92,███,███,███,███,███,"2,048",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ RWKV-4,RWKV,https://huggingface.co/BlinkDL,███,,Dense,332,24:1,███,0.2,,,,,web-scale,Nov/2022,🟢,A,https://arxiv.org/abs/2305.13048,,RWKV (pronounced RwaKuv) is an RNN: https://www.reddit.com/r/MachineLearning/comments/yxt8sa/r_rwkv4_7b_release_an_attentionfree_rnn_language/,91,███,███,███,███,Apache 2.0,"8,192",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Galactica,Meta AI,███,120,,Dense,450,4:1,███,0.8,52.6,,,,journals,Nov/2022,🟢,B,https://galactica.org/static/paper.pdf,,scientific only,90,███,███,███,███,CC-BY-NC 4.0,"2,048",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ 
SED,DeepMind,,0.42,,Dense,131,312:1,███,███,,,,,web-scale,Nov/2022,🔴,D,https://arxiv.org/abs/2211.04236,Diffusion,"SED 420M (diffusion text model) Dataset: Paper: SED-L 420M trained for 2M steps x 65,536 tokens/batch = ~131B tokens on C4.",89,███,███,███,███,Proprietary,███,UK,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ mT0,BigScience,https://github.com/bigscience-workshop/xmtf,13,,Dense,"1,000",77:1,███,0.4,,,,,"q&a, web",Nov/2022,███,A,https://arxiv.org/abs/2211.01786,,fine-tuned,88,███,███,███,███,Apache 2.0,"1,024",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ BLOOMZ,BigScience,https://github.com/bigscience-workshop/xmtf,███,,Dense,366,3:1,███,0.8,,,,,web-scale,Nov/2022,🟢,A,https://arxiv.org/abs/2211.01786,,fine-tuned,87,███,███,███,███,OpenRAIL,███,International,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ PACT,Microsoft,https://github.com/microsoft/PACT,1,,Dense,0,1:1,███,0.001,,,,,special,Oct/2022,🟢,B,https://arxiv.org/abs/2209.11133,,███,86,███,███,███,███,MIT,"1,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Flan-T5,Google,TS,11,,Dense,"1,100",100:1,███,0.4,,,,,web-scale,Oct/2022,🟢,A,https://arxiv.org/abs/2210.11416,███,T5=1T tokens + LM-adapted T5 as 100B tokens,85,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Flan-PaLM,Google,,540,,Dense,780,2:1,███,2.2,75.2,,,,███,Oct/2022,🔴,B,https://arxiv.org/abs/2210.11416,,"PaLM 540B instruction-finetuned on 1,836 tasks across multiple task clusters including chain-of-thought data; achieves +9.4% average over base PaLM 540B; 75.2% five-shot MMLU; SOTA on BBH, TyDiQA, and MGSM; demonstrates that scaling instruction-finetuning tasks improves performance across model sizes and families.",84,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ 
U-PaLM,Google,,540,,Dense,780,2:1,███,2.2,74.1,,,,███,Oct/2022,🔴,B,https://arxiv.org/abs/2210.11399,,"PaLM 540B continued on a mixture-of-denoiser UL2 objective for a few extra steps (~0.1% additional compute); reaches equivalent PaLM 540B performance at roughly half the training budget; gains on chain-of-thought, multilingual (MGSM, TydiQA), and BIG-Bench hard tasks. No new data required.",83,███,███,███,███,Proprietary,"8,192",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ VIMA,NVIDIA,Open: https://vimalabs.github.io/,0.2,,Dense,1,3:1,███,0.001,,,███,,special,Oct/2022,🟢,C,https://arxiv.org/abs/2210.03094,,Dataset: Behavior cloning from 650K trajectories (50K x 17 tasks); not a token-trained LM. Frozen T5 encoder for prompts; estimate ~0.5B token-equivalents.,82,███,███,███,███,Other,512,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ OpenChat,Tsinghua,https://huggingface.co/openchat/openchat_3.5,13,,Dense,"2,000",154:1,███,0.5,,,,,web-scale,Sep/2022,🟢,A,https://arxiv.org/abs/2309.11235,,Llama 2 13B -> OpenChat 13B,███,███,███,███,███,Apache 2.0,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ WeLM,Wechat,https://welm.weixin.qq.com/docs/playground/,10,,Dense,300,30:1,███,0.2,███,,,,web-scale,Sep/2022,🟢,A,https://arxiv.org/abs/2209.10372,,13% English tokens and 87% Chinese,80,███,███,███,███,███,"2,000",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ CodeGeeX,Tsinghua,,13,,Dense,850,66:1,███,0.4,,,,,███,Sep/2022,🟢,B,https://github.com/THUDM/CodeGeeX,,"13B autoregressive decoder (40 layers, hidden 5,120, FFN 20,480) trained on 158.7B tokens across 23 programming languages; 2,048-token context; trained on 1,536 Ascend 910 processors for ~2 months; achieves 54.76% average pass@1 on HumanEval-X (Python, C++, Java, JavaScript, Go), competitive with 
CodeGen-Multi-16B.",79,███,███,███,███,Other,"2,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Sparrow,DeepMind,,70,,Dense,"1,400",20:1,███,1.0,,,,███,web-scale,Sep/2022,🔴,B,https://storage.googleapis.com/deepmind-media/DeepMind.com/Authors-Notes/sparrow/sparrow-final.pdf,,Chatbot as a fine-tuned version of Chinchilla 70B,78,███,███,███,███,Proprietary,███,UK,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ PaLI,Google,,17,,Dense,"1,000",59:1,███,0.4,,,,,special,Sep/2022,🔴,D,https://arxiv.org/abs/2209.06794,,███,77,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ NeMo Megatron-GPT 20B,NVIDIA,https://huggingface.co/nvidia/nemo-megatron-gpt-20B,20,,Dense,"1,100",55:1,███,0.5,,,,,web-scale,Sep/2022,🟢,C,███,,"Dataset: NVIDIA NeMo Megatron-GPT 20B: Pile + curated web data; HF card omits token count, but NeMo Megatron family trained for ~1.1T tokens per NVIDIA blog.",76,███,███,███,███,███,"2,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Z-Code++,Microsoft,,0.71,,Dense,500,705:1,███,0.06,,,,███,web-scale,Aug/2022,🔴,B,https://arxiv.org/abs/2208.09770v1,,"abstractive text summarization, 710M, outperforms PaLM 540B. 
""Due to the limited computational resource, Z-Code++LARGE is trained with only 500B tokens instead of 1T tokens as that for mT5 training.""",75,███,███,███,███,Proprietary,"2,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Atlas,Meta AI,,11,,Dense,40,4:1,███,0.07,47.9,,,,web-scale,Aug/2022,🟢,B,███,,"Retrieval-augmented 11B model that learns knowledge-intensive tasks with very few training examples; achieves >42% on Natural Questions with only 64 examples using 50x fewer parameters than a 540B dense baseline; evaluated on MMLU, KILT, and NaturalQuestions; document index can be updated post-training without retraining the model.",74,███,███,███,███,CC-BY-NC 4.0,"1,024",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ BlenderBot 3,Meta AI,blenderbot.ai (US only),175,,Dense,300,2:1,███,0.8,,,,,web-scale,Aug/2022,🟢,A,https://github.com/facebookresearch/ParlAI/blob/main/projects/bb3/BB3_main_tech_report.pdf,,"""A 175B-parameter, publicly available chatbot"" built on the Director architecture with internet search (SeeKeR), long-term memory, and multi-task training on 12 module-level tasks; deployed as a live public demo that learns from user feedback over time. 3B and 30B weights open; 175B available on request.",███,███,███,███,███,CC-BY-NC 4.0,"2,048",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GLM-130B,Tsinghua,https://huggingface.co/spaces/THUDM/GLM-130B,130,,Dense,400,4:1,███,0.8,44.8,,,,web-scale,Aug/2022,🟢,A,https://arxiv.org/abs/2210.02414,,███,72,███,███,███,███,Non-commercial research,"2,048",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ AlexaTM 20B,Amazon,https://github.com/amazon-science/alexa-teacher-models,20,,Dense,"1,300",65:1,███,0.5,,,,,███,Aug/2022,🟢,A,https://assets.amazon.science/ee/20/3abcf2304d9b8d68da2006ff7107/alexatm-20b-few-shot-learning-using-a-large-scale-multilingual-seq2seq-model.pdf,,Wikipedia and mC4 only. 
seq2seq,71,███,███,███,███,███,"2,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ 6.9B FIM,OpenAI,,6.9,,Dense,100,15:1,███,0.09,,,,,web-scale,Jul/2022,🔴,███,https://arxiv.org/pdf/2207.14255.pdf,,"Several models: 8 sizes, NLP, Code, FIM/non-FIM. 100B tokens for 6.9B params... beyond chinchilla",70,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ‘monorepo-Transformer’,Google,███,0.5,,Dense,10,20:1,███,0.007,,,,,code,Jul/2022,🔴,D,https://ai.googleblog.com/2022/07/ml-enhanced-code-completion-improves.html,,Unnamed. Writes >3% of internal google code. Dataset: Estimate: unnamed 500M Google internal code-completion model (blog only); Chinchilla-optimal 20x = 10B.,69,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ PanGu-Coder,Huawei,███,2.6,,Dense,250,97:1,███,0.08,,,,,special,Jul/2022,🔴,D,https://arxiv.org/abs/2207.11280,,Python via GH Dataset: PanGu-alpha 2.6B base (~250B Chinese tokens) + 2-stage code fine-tune (raw code + 120k NL/code pairs); fine-tune tokens not disclosed.,68,███,███,███,███,███,"2,000",China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ NLLB,Meta AI,Github (train/deploy),54.5,3.3,MoE,900,17:1,███,0.7,,,███,,special,Jul/2022,🟢,C,https://research.facebook.com/publications/no-language-left-behind/,,"54.5B MOE, 3.3B dense. 
200+ languages Dataset: Estimate: 200+ language translation, ~18B sentence pairs at ~50 tokens/pair = ~900B; paper does not disclose pretraining token total.",67,███,███,███,███,CC-BY-NC 4.0,"1,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ J-1 RBG,AI21,ask-rbg.ai,178,,Dense,300,2:1,███,0.8,,,,,special,Jul/2022,🟢,A,https://www.ai21.com/blog/announcing-ai21-studio-and-jurassic-1,,J-1 fine-tuned with RBG law corpus,███,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ BLOOM (tr11-176B-ml),BigScience,https://huggingface.co/spaces/huggingface/bloom_demo,176,,Dense,366,3:1,███,0.8,39.1,,,███,web-scale,Jul/2022,🟢,A,https://github.com/bigscience-workshop/bigscience/tree/master/train/tr11-176B-ml,,"176B decoder-only transformer (70 layers, hidden 14,336, 112 heads, ALiBi positional encoding, 2,048-token context, 250,680-token vocabulary) trained on the 1.5TB ROOTS corpus of 46 languages (350B tokens) on 384 A100 80GB GPUs; first open-access multilingual 100B+ model, trained March-July 2022 under BigScience workshop.",65,███,███,███,███,███,"2,048",International,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Minerva,Google,,540,,Dense,819,███,███,2.2,,,,,web-scale,Jun/2022,🔴,B,https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html,,PaLM finetuned on LaTeX/arXiv maths,64,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GODEL-XL,Microsoft,,2.7,,Dense,408,152:1,███,0.1,,,,,"web-scale, dialogue",Jun/2022,🟢,███,https://arxiv.org/abs/2206.11309#microsoft,,"XL: GPT-3 175B in paper, GPT-J 2.7B released Dataset: GPT-J 402B base (the released GODEL-XL) + 6B Reddit dialog fine-tune (147M sessions, DialoGPT data) = 408B.",63,███,███,███,███,MIT,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ YaLM 
100B,Yandex,https://github.com/facebookresearch/fairseq/tree/nllb/,100,,Dense,300,3:1,███,0.6,,,,,web-scale,Jun/2022,███,A,https://github.com/yandex/YaLM-100B,,"Megatron-LM clone, Russian/English: https://medium.com/yandex/yandex-publishes-yalm-100b-its-the-largest-gpt-like-neural-network-in-open-source-d1df53d0e9a6",62,███,███,███,███,███,"2,048",Russia,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Unified-IO,Allen AI,https://unified-io.allenai.org/,2.8,,Dense,"1,000",358:1,███,0.2,,,,███,web-scale,Jun/2022,🔴,C,https://github.com/jiasenlu/unified-io/blob/main/UnifiedIOv1.pdf,,Based on T5. Demo only Dataset: Based on T5-Large 2.8B: T5 pretrain ~1000B C4 + multimodal fine-tune (image-text tasks) on top; fine-tune size not disclosed but small vs base.,61,███,███,███,███,███,"1,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ LIMoE,Google,,5.6,0.28,MoE,400,███,███,0.2,,,,,web-scale,Jun/2022,🔴,D,https://ai.googleblog.com/2022/06/limoe-learning-multiple-modalities-with.html,,"Dataset: Estimate: LIMoE multimodal CLIP-style, no precise disclosure in blog post; era contemporaries (CLIP, ALIGN) used 0.4-1.8B image-text pairs ~= 400B tokens equivalent.",60,███,███,███,███,Proprietary,"1,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-4chan,Independent,https://huggingface.co/ykilcher/gpt-4chan/discussions/4,6,,Dense,402,67:1,███,0.2,,,,,web-scale,Jun/2022,🟢,C,https://arxiv.org/abs/2001.07487,,███,59,███,███,███,███,Apache 2.0,███,,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Diffusion-LM,Stanford,https://github.com/XiangLi1999/Diffusion-LM,0.08,,Dense,3,42:1,███,0.002,,,,,"dialogue, special",May/2022,🟢,███,https://arxiv.org/abs/2205.14217,Diffusion,GPT-J with synthetic data Dataset: Tiny diffusion model trained from scratch on 800K iters × 64 batch × 64 seq len = 3.3B tokens trained (ROCStories; multi-epoch over 98K 
examples).,58,███,███,███,███,███,512,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ UL2 20B,Google,,20,,Dense,"1,000",50:1,███,0.5,39.2,,,,web-scale,May/2022,███,B,https://arxiv.org/abs/2205.05131,,Unifying Language model. C4 only.,57,███,███,███,███,Apache 2.0,"4,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gato (Cat),DeepMind,,1.2,,Dense,"1,500","1,250:1",███,0.1,,,,,robotics,May/2022,🔴,███,https://storage.googleapis.com/deepmind-media/A%20Generalist%20Agent/Generalist%20Agent.pdf,SOTA,"Proto-AGI. Generalist agent (LLM, VLM, robot) Dataset: Paper: ~1.5T training tokens across 596 control tasks, 63M episodes + vision/language (14.7% sample weight); 1M steps x 512 batch x 1024 seq.",56,███,███,███,███,███,"1,024",UK,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ LaMDA 2,Google,https://youtu.be/l9FJm--ClvY,137,,Dense,"2,810",21:1,███,2.1,███,,,,"dialogue, special",May/2022,🟡,C,https://arxiv.org/abs/2201.08239,,"Chatbot with tiny walled garden demo TBA Dataset: LaMDA 2 is same base model as LaMDA: ""2.81T BPE tokens"" pretraining per paper.",55,███,███,███,███,███,"2,048",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ OPT-175B,Meta AI,https://ai.meta.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/,███,,Dense,300,2:1,███,0.8,,,,,web-scale,May/2022,🟢,A,https://arxiv.org/abs/2205.01068,,Only 30B available (Jun/2022),54,███,███,███,███,Non-commercial research,"2,048",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Tk-Instruct,Allen AI,https://instructions.apps.allenai.org/demo,11,,Dense,"1,001",91:1,███,0.3,,,███,,"q&a, web",Apr/2022,🟢,C,https://arxiv.org/abs/2204.07705,,Based on T5. 
Dataset: T5-11B base (~1000B C4 pretrain) + 1000 fine-tune steps x ~1M tokens/batch (~1B Natural Instructions) = 1001B.,53,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ InCoder,Meta AI,https://huggingface.co/spaces/facebook/incoder-demo,6.7,,Dense,52,8:1,███,0.06,,,,,code,Apr/2022,🟢,C,https://arxiv.org/abs/2204.05999,,███,52,███,███,███,███,███,"2,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ NOOR,TII,,10,,Dense,200,20:1,███,0.1,,,███,,web-scale,Apr/2022,🔴,D,https://www.tii.ae/news/technology-innovation-institute-announces-launch-noor-worlds-largest-arabic-nlp-model,,"Arabic. ""World’s largest high-quality cross-domain Arabic dataset, combining web data with books, poetry, news articles, and technical information"" Dataset: Estimate: TII proprietary Arabic 10B, no public training details in press release; Chinchilla-optimal 20x (10 x 20 = 200B).",51,███,███,███,███,███,"2,000",UAE,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ mGPT,Sber,https://huggingface.co/ai-forever/mGPT,13,,Dense,400,███,███,0.2,,,,,web-scale,Apr/2022,🟡,C,https://arxiv.org/abs/2204.07580,,"60 languages. Only 1.3B model available Dataset: Paper: ""the models have seen 400B tokens during pretraining"" (600k steps, batch 2048, seq 512) on mC4 + Wikipedia, 61 languages.",50,███,███,███,███,███,"2,000",Russia,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ PaLM-Coder,Google,,540,,Dense,780,2:1,███,2.2,,,,███,"web-scale, code",Apr/2022,🔴,B,https://storage.googleapis.com/pathways-language-model/PaLM-paper.pdf,,"PaLM 540B fine-tuned on code generation tasks; base PaLM trained via Pathways on 6,144 TPU v4 chips on 780B tokens; strong multilingual and code generation capabilities, surpassing average human performance on BIG-bench. 
PaLM-Coder adds specialization for code tasks including HumanEval and MBPP.",49,███,███,███,███,███,"8,192",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ PaLM,Google,,540,,Dense,780,2:1,███,2.2,,,,,web-scale,Apr/2022,███,B,https://storage.googleapis.com/pathways-language-model/PaLM-paper.pdf,SOTA,"""Pathways Language Model"" -- a 540B densely-activated Transformer trained using Pathways across 6,144 TPU v4 chips on 780B tokens; achieved SOTA few-shot on hundreds of NLP benchmarks, surpassed average human performance on BIG-bench, and showed discontinuous improvements from model scale on multi-step reasoning tasks.",48,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ SeeKeR,Meta AI,,2.7,,Dense,100,38:1,███,0.05,,,,,web-scale,Mar/2022,🟢,D,https://arxiv.org/abs/2203.13224,,BART and compared to GPT-2 Dataset: Paper: R2C2 2.7B base pretrained on Pushshift Reddit + RoBERTa+CC100en ~100B tokens (BART denoise objective); SeeKeR fine-tunes on top.,███,███,███,███,███,Other,"2,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ CodeGen,Salesforce,,16,,Dense,577,███,███,0.3,,,,,"code, BigQuery, BigPython",Mar/2022,🟢,D,https://arxiv.org/abs/2203.13474,,Code Dataset: Paper Table 5: ThePile 386.3B + BigQuery 119.1B + BigPython 71.7B = 577B (sequential training: CodeGen-NL -> CodeGen-Multi -> CodeGen-Mono).,46,███,███,███,███,███,"2,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ VLM-4,LightOn,https://lighton.ai/fr/home,10,,Dense,200,20:1,███,0.1,,███,,,web-scale,Mar/2022,🟢,C,https://lighton.ai/lighton-blogs/lighton-publicly-launches-muse,,"Params corrected 25/Apr/2022 Dataset: Estimate: LightOn proprietary 10B, no public disclosure (source is CNBC piece); Chinchilla-optimal 20x params (10 x 20 = 200B).",45,███,███,███,███,Proprietary,███,France,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ 
Chinchilla,DeepMind,,70,,Dense,"1,400",20:1,███,1.0,67.5,,,,web-scale,Mar/2022,🔴,███,https://arxiv.org/abs/2203.15556,SOTA,First to double tokens per size increase,44,███,███,███,███,Proprietary,"2,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-NeoX-20B,EleutherAI,https://huggingface.co/EleutherAI/gpt-neox-20b,20,,Dense,472,24:1,███,0.3,,,,,web-scale,Feb/2022,🟢,C,https://github.com/EleutherAI/gpt-neox,,███,43,███,███,███,███,███,"2,048",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Perceiver AR,DeepMind,,1,,Dense,420,420:1,███,0.07,,,███,,web-scale,Feb/2022,🔴,D,https://arxiv.org/abs/2202.07765,,"Context window=100,000. Params=364m wiki, 975M pg-19, 826M books, music=?, imagenet=770M, Dataset: Paper: PG-19 variant trained ""about 200k steps at batch 2048, or about 420B total tokens"" until convergence.",42,███,███,███,███,Proprietary,███,UK,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ CM3,Meta AI,,13,,Dense,223,18:1,███,0.2,,,,,"wiki, web",Jan/2022,🟢,D,https://arxiv.org/abs/2201.07520,,"LLM with multimodal capabilities Dataset: Paper: training corpus ""comprises 223 billion tokens"" (CC-NEWS 460GB + EN Wikipedia 383GB). Trained from scratch.",███,███,███,███,███,███,"2,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ERNIE 3.0 Titan,Baidu,,260,,Dense,300,2:1,███,0.9,,,,,web-scale,Dec/2021,███,D,https://arxiv.org/abs/2112.12731,,Dataset: Paper: 4TB Chinese ERNIE 3.0 Corpus; tokens not disclosed; estimated ~300B matching sibling ERNIE 3.0 10B (~375B).,40,███,███,███,███,Proprietary,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ XGLM,Meta AI,,7.5,,Dense,500,67:1,███,0.2,,,,,web-scale,Dec/2021,🟢,D,https://arxiv.org/abs/2112.10668,,"Multilingual: 30 languages, 16 families. 
Dataset: Paper: ""All models are trained for up to 500B tokens"" of CC100-XL across 30 languages.",███,███,███,███,███,███,"2,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Fairseq,Meta AI,,1100,10,MoE,300,1:1,███,1.9,,,███,,web-scale,Dec/2021,🟢,D,https://arxiv.org/abs/2112.10684,,"13B & 1100B param models. Dataset: Paper Sec 3.1: ""we train our models for 300B tokens"" uniformly across dense and MoE variants.",38,███,███,███,███,███,"2,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Gopher,DeepMind,,280,,Dense,300,2:1,███,1.0,60,,,,web-scale,Dec/2021,🔴,B,███,SOTA,Dataset: https://lifearchitect.ai/whats-in-my-ai/,37,███,███,███,███,███,"2,000",UK,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GLaM,Google,,1200,134,MoE,"1,600",2:1,███,4.6,,,,,web-scale,Dec/2021,🔴,D,███,,"Dataset: Paper: ""1.6 trillion tokens"" of filtered web + books + conversations.",36,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Anthropic-LM 52B,Anthropic,,52,,Dense,400,8:1,███,0.5,,,,,███,Dec/2021,🔴,B,https://arxiv.org/abs/2112.00861,,Internal research only,35,███,███,███,███,Proprietary,"2,048",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ RETRO,DeepMind,███,7.5,,Dense,200,27:1,███,0.1,,,,,web-scale,Dec/2021,🔴,D,https://arxiv.org/abs/2112.04426,,with retrieval Dataset: Paper: RETRO trained on 200B tokens of MassiveText (retrieval database separately holds 600B-1.75T).,34,███,███,███,███,Proprietary,"2,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Luminous,Aleph Alpha,https://app.aleph-alpha.com/playground/completion,70,,Dense,███,6:1,███,0.6,,,,,web-scale,Nov/2021,🟢,C,https://www.aleph-alpha.de/pricing,,"Devs from EleutherAI Dataset: Estimate: 2021-era 200B dense proprietary, trained from scratch; no disclosure; contemporaries (GPT-3, MT-NLG) used 
~300B.",33,███,███,███,███,███,"2,000",Germany,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ DeBERTaV3,Microsoft,,1.5,,Dense,162,108:1,███,0.05,,,,,web-scale,Nov/2021,🟢,D,https://arxiv.org/abs/2111.09543,███,RoBERTa=162B token dataset.,32,███,███,███,███,███,"1,024",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ BERT-480,Google,,███,,Dense,130,1:1,███,0.8,,,,,"wiki, books",Nov/2021,🔴,D,https://cloud.google.com/blog/topics/tpus/google-showcases-cloud-tpu-v4-pods-for-large-model-training,,"Submission to benchmarks. Original dataset was BookCorpus + Wikipedia: https://arxiv.org/pdf/1810.04805.pdf Dataset: Same TPU v4 demo run as BERT-200, standard MLPerf BERT training budget; no disclosure.",31,███,███,███,███,███,512,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ BERT-200,Google,,200,,Dense,130,1:1,███,0.5,,,,,"wiki, books",Nov/2021,🔴,D,https://cloud.google.com/blog/topics/tpus/google-showcases-cloud-tpu-v4-pods-for-large-model-training (same as above),,███,30,███,███,███,███,███,512,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Cedille FR-Boris,Coteries,https://app.cedille.ai/,6,,Dense,670,112:1,███,0.2,,,,,web-scale,Nov/2021,🟢,C,https://github.com/coteries/cedille-ai,,███,29,███,███,███,███,MIT,███,Switzerland,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ MT-NLG,Microsoft/NVIDIA,,530,,Dense,270,1:1,███,1.3,,,,,web-scale,Oct/2021,🔴,B,███,,"""Megatron-Turing NLG"" 530B monolithic transformer jointly trained by Microsoft and NVIDIA using 3D parallelism (tensor, pipeline, data) via combined DeepSpeed and Megatron frameworks; established new SOTA zero-, one-, and few-shot results on NLP benchmarks at release; training corpus design and curation highlighted as key to performance.",28,███,███,███,███,Proprietary,"2,048",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ FLAN,Google,,137,,Dense,"2,490",19:1,███,1.9,,███,,,"dialogue, 
special",Sep/2021,🔴,D,https://arxiv.org/abs/2109.01652,,"Fine-tuned LaMDA Dataset: LaMDA-PT 137B base pretrained on ""2.49T BPE tokens"" per FLAN paper + small instruction tune; total ~2490B.",27,███,███,███,███,███,512,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Command xlarge,Cohere,,52.4,,Dense,"1,050",21:1,███,0.8,,,,,███,Sep/2021,🟢,D,https://arxiv.org/abs/2108.07790,,Stealth 'ebooks and webpages'. 52B: https://crfm.stanford.edu/helm/v1.0/?models=1 Dataset: Estimate: Chinchilla-optimal 20x params (52.4B x 20 = 1050B); no official Cohere disclosure.,26,███,███,███,███,Proprietary,███,Canada,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ PLATO-XL,Baidu,https://nlp.baidu.com/special/plato/englishDemo,11,███,Dense,150,14:1,███,0.1,,,,,"reddit outbound, dialogue",Sep/2021,🟢,C,https://arxiv.org/abs/2109.09519,,"Chatbot. Reddit comments + CN social Dataset: Paper: ""trained for a total of 150B tokens"" with 2M-token batch on Reddit + Chinese social-media dialogue. From scratch.",25,███,███,███,███,Proprietary,███,China,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Macaw,Allen AI,https://macaw.apps.allenai.org/,11,,Dense,"1,000",91:1,███,0.3,,,,███,"reddit outbound, dialogue",Sep/2021,🟡,C,https://arxiv.org/abs/2109.02593,,Chatbot Dataset: Fine-tune of T5-11B: T5 pretrain ~1000B C4 tokens + negligible QA fine-tune (126k steps at batch 8).,24,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ CodeT5,Salesforce,,0.7,,Dense,"1,021","1,459:1",███,0.09,███,,,,"code, BigQuery, BigPython",Sep/2021,🟢,D,https://arxiv.org/abs/2109.00859,,"""Text-to-Text Transfer Transformer"". Code. 
Large introduced in https://arxiv.org/pdf/2207.01780.pdf Dataset: T5-base 0.7B fine-tune: T5 pretrain ~1000B C4 tokens + ~21B code (8.35M instances at ~2.5KB ea) = 1021B.",23,███,███,███,███,Other,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Codex,OpenAI,Deprecated,12,,Dense,400,34:1,███,0.2,███,,,,code,Aug/2021,🟢,C,https://arxiv.org/abs/2107.03374,,"Code Dataset: GPT-3 base (300B) + Codex fine-tune (""100 billion tokens"" per paper, on 159GB filtered Python) = 400B.",22,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Jurassic-1,AI21,Deprecated,178,,Dense,300,2:1,███,0.8,,,,,web-scale,Aug/2021,🟢,A,https://www.ai21.com/blog/announcing-ai21-studio-and-jurassic-1,███,Emulated GPT-3 dataset,21,███,███,███,███,Proprietary,"2,048",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ BlenderBot 2.0,Meta AI,,9.4,,Dense,180,20:1,███,0.1,,,,,"dialogue, special",Jul/2021,🟢,███,https://parl.ai/projects/blenderbot2/,,Chatbot Dataset: Fine-tune of BlenderBot 1 (~180B Pushshift Reddit tokens) on small Multi-Session Chat / WoI / BAD datasets. Fine-tune token count tiny vs base; total ~180B.,20,███,███,███,███,CC-BY-NC 4.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-J,EleutherAI,https://huggingface.co/EleutherAI/gpt-j-6b,6,███,Dense,402,67:1,███,0.2,,,,,web-scale,Jun/2021,🟢,A,https://github.com/kingoflolz/mesh-transformer-jax,,Popular,19,███,███,███,███,Apache 2.0,"2,048",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ LaMDA,Google,https://www.youtube.com/watch?v=aUSSfo5nCdM,137,,Dense,"2,810",21:1,███,███,,,,,"reddit outbound, web, dialogue",May/2021,🔴,C,https://arxiv.org/abs/2201.08239,SOTA,"Chatbot. ~2.8T tokens (1.56T words, SentencePiece/BPE); trained ~57.7 days (~1385 hrs) on 1024 TPU v3. 
Dataset: Paper Sec 3: ""tokenize the dataset into 2.81T BPE tokens"" (from 1.56T words).",18,███,███,███,███,███,"2,048",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ruGPT-3,Huawei/Sberbank,https://russiannlp.github.io/rugpt-demo/,███,,Dense,240,185:1,███,0.06,,,,,web-scale,Feb/2021,🟢,C,https://github.com/sberbank-ai/ru-gpts,,"Russian GPT-3 with input from Huawei Dataset: Repo README: ""trained on 80B tokens for 3 epochs"" (seq length 1024) = 240B tokens-seen. Trained from scratch.",17,███,███,███,███,Apache 2.0,"2,000",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Switch Transformer,Google,,1571,80,MoE,576,1:1,███,3.2,,,,,web-scale,Jan/2021,███,B,https://arxiv.org/abs/2101.03961,,"""Switch Transformers"" scale to trillion-parameter MoE models using a simplified single-expert routing algorithm; achieves 7x pre-training speedup over T5-Base/Large at equal FLOP budget; multilingual gains across all 101 tested languages; trained on C4. Demonstrates MoE selects different parameters per token at constant compute cost.",16,███,███,███,███,Proprietary,"1,024",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ BIGBIRD-ETC large,Google,https://github.com/google-research/bigbird,0.345,,Dense,"2,200","6,377:1",███,0.09,,,,,web-scale,Jul/2020,███,C,https://arxiv.org/abs/2007.14062,,"Sparse attention transformer with linear complexity (vs quadratic), handles 8x longer sequences. SoTA at launch on Natural Questions LA, TriviaQA, WikiHop, and long-doc summarization (Arxiv, PubMed, BigPatent). For tokens, the paper warm-starts from RoBERTa checkpoint then continues MLM training; RoBERTa's pretraining was ~2T tokens but BigBird itself adds incremental training.",15,███,███,███,███,███,"4,000",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-3,OpenAI,Deprecated,███,,Dense,300,2:1,███,0.8,43.9,,,,web-scale,May/2020,🟢,A,https://arxiv.org/abs/2005.14165,SOTA,No RLHF (base only). 
Popular: 3.1M wpm. Dataset: https://lifearchitect.ai/whats-in-my-ai/,14,███,███,███,███,███,"2,048",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Megatron-11B,Meta AI,https://inferkit.com/,11,,Dense,"2,200",200:1,███,0.5,,,,███,web-scale,Apr/2020,🟢,A,https://github.com/pytorch/fairseq/tree/main/examples/megatron_11b,,My favourite model until GPT-3 and GPT-4 came along: https://github.com/facebookresearch/fairseq/blob/main/examples/megatron_11b/README.md,13,███,███,███,███,Other,"2,048",███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Transformer++,American Express,,0.212,,Dense,15,71:1,███,0.006,,,,,translations,Mar/2020,🔴,D,https://arxiv.org/abs/2003.04974,,"Not to be confused with the more common usage of Transformer++, the ~2023 Transformer++ based on Llama. See Mamba paper. Encoder-decoder MT model from American Express ML & AI Team (Thapak & Hore 2020). Architecture follows Vaswani 2017 Transformer with two novel additions: (1) hybrid multi-head attention using standard self-attention in H/2 heads and depthwise-separable dilated causal convolution attention (modeling word-context dependencies) in the other H/2 heads; (2) multi-task auxiliary heads for POS tagging and NER on the base encoder using Spacy labels. Trained on WMT 2014 EN-DE (4.5M pairs) and WMT 2014 EN-FR (36M pairs). Paper claims new SOTA: BLEU 32.1 on EN-DE (+1.4 over prior best) and 44.6 on EN-FR (+1.1 over prior best). Training step/batch count not stated; ~15B tokens estimated from Transformer big recipe (300K steps × ~50K tokens/batch) given the matching 212M param scale and shared training datasets. 
No public weights or code released.",███,███,███,███,███,Proprietary,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Meena,Google,,2.6,,Dense,"10,000","3,847:1",███,0.5,,,,,dialogue,Jan/2020,🔴,B,https://arxiv.org/abs/2001.09977,SOTA,███,11,███,███,███,███,███,"2,048",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ T5,Google,https://huggingface.co/google-t5/t5-base,11,,Dense,"1,000",91:1,███,0.3,,,,,███,Oct/2019,🟢,A,https://arxiv.org/abs/1910.10683,,"""Text-to-Text Transfer Transformer"". C4 + NLP language problems. ""compared the following three configurations: First, the standard baseline model, which was pre-trained on 2^35 ≈ 34B tokens; second, the baseline trained instead for about 1 trillion tokens (i.e. the same amount of pre-training used for T5), which we refer to as “baseline-1T”; and third, T5-Base.""",10,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Megatron-LM 8.3B,NVIDIA,https://github.com/NVIDIA/Megatron-LM,8.3,,Dense,157,19:1,███,0.1,,,███,,web-scale,Sep/2019,🟢,A,https://arxiv.org/abs/1909.08053,SOTA,"GPT-2 style decoder-only transformer with 72 layers, hidden size 3072, 24 attention heads. Trained 300K iterations on 512 NVIDIA V100 GPUs with 8-way intra-layer model parallelism (batch 512 × seq 1024 ≈ 157B tokens). Training data: 174GB aggregate of Wikipedia, CC-Stories, RealNews, and OpenWebText (deduplicated via LSH). New SOTA: WikiText103 perplexity 10.81 (prior 15.79) and LAMBADA accuracy 66.51% (prior 63.24%). 
Paired 3.9B BERT-style model achieved SOTA on RACE at 90.9%.",9,███,███,███,███,███,"2,048",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ RoBERTa,Meta AI,https://huggingface.co/FacebookAI/roberta-large,0.355,,Dense,"2,200","6,198:1",███,0.09,27.9,,,,web-scale,Jul/2019,🟢,███,https://arxiv.org/abs/1907.11692,,"calcs: ""In total, this batch size and number of steps corresponds to pre-training on 2^35 ≈ 34B tokens. This is considerably less than BERT (Devlin et al., 2018), which used roughly 137B tokens, or RoBERTa (Liu et al., 2019c), which used roughly 2.2T tokens. Using only 2^35 tokens results in a reasonable computational budget while still providing a sufficient amount of pre-training for acceptable performance. We consider the effect of pre-training for more steps in Sections 3.6 and 3.7. Note that 2^35 tokens only covers a fraction of the entire C4 data set, so we never repeat any data during pre-training."" https://arxiv.org/pdf/1910.10683.pdf MMLU shows RoBERTa-base 125M only=27.9 (not 355M)",8,███,███,███,███,███,512,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-2,OpenAI,https://huggingface.co/openai-community/gpt2-large,1.5,,Dense,40,27:1,███,0.03,32.4,███,,,reddit outbound,Feb/2019,🟢,A,https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf,SOTA,WebText 10B token corpus × 4 epochs → 40B tokens processed. 
Reddit outbound only,7,███,███,███,███,███,"1,024",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ BERT,Google,https://huggingface.co/google-bert/bert-base-uncased,0.34,,Dense,137,403:1,███,0.02,,,,███,"wiki, books",Oct/2018,🟢,A,https://arxiv.org/abs/1810.04805,SOTA,"""BERT — 128 000 tokens per step × 1 000 000 steps → 128 B tokens processed""",6,███,███,███,███,Apache 2.0,███,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ GPT-1,OpenAI,https://huggingface.co/openai-community/openai-gpt,0.117,,Dense,98,842:1,███,0.01,,,,,books,Jun/2018,🟢,A,███,SOTA,"""GPT-1 — 984M tokens corpus × 100 epochs × 1 token per word → 98.4B tokens processed"" Books only. ""We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens."" =3,276,800",5,███,███,███,███,███,512,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ELMo,Allen AI,https://huggingface.co/allenai/elmo,0.094,███,Dense,8,86:1,███,0.003,,,,,news,Feb/2018,🟢,A,https://arxiv.org/abs/1802.05365,SOTA,"Pioneer of pretrain-then-fine-tune workflow. biLSTM with 2 layers, 4096 hidden units, 512-dim projections, both directions, plus character CNN with 2048 n-gram filters and 2 highway layers. Trained 10 epochs on the 1B Word Benchmark (~0.8B training words ≈ 8B word-tokens). Word-level LM objective. New SOTA on six NLP tasks at launch (SQuAD F1 85.8, SNLI 88.7, SRL F1 84.6, Coref F1 70.4, NER F1 92.22, SST-5 54.7).",4,███,███,███,███,███,512,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ ULMFiT,Fast.ai,https://docs.fast.ai/tutorial.text.html,0.034,,Dense,1,30:1,███,0.001,,,,,wiki,Jan/2018,███,C,https://arxiv.org/abs/1801.06146,SOTA,"Three-stage transfer learning method (general-domain LM pretrain, target-task LM fine-tune, classifier fine-tune) with AWD-LSTM (Merity 2017a): embedding size 400, 3 layers, 1150 hidden units. 
Pretrained on Wikitext-103 (28,595 Wikipedia articles, ~103M words; epoch count not stated, ~10 epochs typical ≈ ~1B word-tokens). Introduced discriminative fine-tuning, slanted triangular learning rates, and gradual unfreezing. Set new SOTA on six text classification tasks at launch with 18-24% error reduction (IMDb 4.6% err, TREC-6 3.6%, AG 5.01%, DBpedia 0.80%, Yelp-bi 2.16%, Yelp-full 29.98%).",3,███,███,███,███,███,"1,024",USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Transformer (big),Google,https://github.com/tensorflow/tensor2tensor?tab=readme-ov-file#walkthrough,0.213,,Dense,15,71:1,███,███,,,,,translations,Jun/2017,🟢,A,https://arxiv.org/abs/1706.03762,SOTA,"Original Transformer big. 6-layer encoder-decoder with dmodel=1024, dff=4096, 16 attention heads. 213M params (paper Table 3). Trained 300K steps (~3.5 days on 8 NVIDIA P100 GPUs) on WMT 2014 EN-DE and separately on WMT 2014 EN-FR; ~25K source + 25K target tokens per batch, so ~15B tokens total. New SOTA on WMT 2014: BLEU 28.4 (EN-DE) and 41.8 (EN-FR), surpassing the previous best including ensembles by over 2 BLEU on EN-DE.",2,███,███,███,███,Apache 2.0,512,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███ Transformer (base),Google,https://github.com/tensorflow/tensor2tensor?tab=readme-ov-file#walkthrough,0.065,,Dense,5,███,███,0.002,,,,,translations,Jun/2017,🟢,A,https://arxiv.org/abs/1706.03762,SOTA,"Original Transformer base. 6-layer encoder-decoder with dmodel=512, dff=2048, 8 attention heads, dk=dv=64. 65M params (paper Table 3). Trained 100K steps on WMT 2014 EN-DE (4.5M sentence pairs, ~37K BPE shared vocab) and separately on WMT 2014 EN-FR (~36M sentences, 32K word-piece vocab); each batch ~25K source + 25K target tokens, so ~5B tokens total. BLEU 27.3 (EN-DE) and 38.1 (EN-FR), beating all previously published single models.",1,███,███,███,███,███,512,USA,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███,███