Commit e77795a (verified) by YongganFu · 1 parent: 317a6f8

Update README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -13,7 +13,7 @@ pipeline_tag: text-generation
 
 ## Model Overview
 
-Nemotron-Flash is a new hybrid small language model family designed around real-world latency rather than parameter count. It features latency-optimal depth–width ratios, hybrid operators discovered through evolutionary search, and training-time weight normalization.
+Nemotron-Flash is a new hybrid small language model family designed around real-world latency rather than parameter count. It features latency-optimal depth–width ratios, hybrid operators discovered through evolutionary search, and training-time weight normalization. See our <a href="https://arxiv.org/pdf/2511.18890">NeurIPS 2025 paper</a> for more technical details.
 
 The models achieve SOTA accuracy in math, coding, and commonsense reasoning at the 1B and 3B scales, while delivering decent small-batch latency and large-batch throughput. For example, Nemotron-Flash-1B achieves +5.5% accuracy, 1.9× lower latency, and 45.6× higher throughput compared with Qwen3-0.6B; and Nemotron-Flash-3B achieves +2% / +5.5% accuracy over Qwen2.5-3B / Qwen3-1.7B with 1.3× / 1.7× lower latency and 6.4× / 18.7× higher throughput, respectively.
 