FlashHead matches the baseline **Llama-3.2-3B** within rounding on standard evals.
| **Precision** | **Tokens/sec** | **Speedup vs BF16** |
|---------------|----------------|---------------------|
| BF16 baseline | 54 | 1.0× |
| **FlashHead (Embedl)** | **58** | **1.07×** |
| W4A16 baseline | 141 | 2.61× |
| **FlashHead W4A16 (Embedl)** | **177** | **3.28×** |

FlashHead improves end-to-end speed by **1.26×** over the state-of-the-art W4A16 baseline, while maintaining full accuracy parity.

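The speedup figures follow directly from the tokens/sec column. A minimal sketch of the arithmetic, using only the numbers from the table above (labels are shorthand, not identifiers from the repo):

```python
# Throughput from the table above (tokens/sec on the same hardware).
throughput = {
    "BF16 baseline": 54,
    "FlashHead": 58,
    "W4A16 baseline": 141,
    "FlashHead W4A16": 177,
}

def speedup(a: str, b: str) -> float:
    """Speedup of configuration `a` relative to configuration `b`."""
    return throughput[a] / throughput[b]

# Speedups vs the BF16 baseline, as reported in the table.
print(round(speedup("FlashHead", "BF16 baseline"), 2))        # 1.07
print(round(speedup("FlashHead W4A16", "BF16 baseline"), 2))  # 3.28
# The headline 1.26x is relative to the W4A16 baseline, not BF16.
print(round(speedup("FlashHead W4A16", "W4A16 baseline"), 2))  # 1.26
```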
---

## Accuracy (Parity with Baseline)

| **Method** | **MMLU-Pro** | **IFEval** | **BBH** | **TruthfulQA** | **GSM8K** |
|------------|--------------|------------|---------|----------------|-----------|
| **Baseline** | 0.31 | 0.57 | 0.57 | 0.57 | 0.77 |
| **FlashHead** | 0.31 | 0.56 | 0.57 | 0.58 | 0.77 |

FlashHead matches baseline performance within rounding across all evaluation benchmarks.