embedl/Llama-3.2-3B-Instruct-FlashHead-W4A16

text-generation-inference

compressed-tensors

WilhelmT committed 3 days ago

Commit a59054b · verified · 1 parent: 337df94

Update README.md

Files changed (1)

README.md +1 -1

@@ -41,7 +41,7 @@ FlashHead matches the Llama-3.2-3B-Instruct baseline within rounding error on co
 ## Optimizations
 - **FlashHead LM Head** - lightweight replacement for the dense LM head, significantly improving throughput.
-- **Quantization (W4A16)** - large reduction in memory footprint and accuracy.
+- **Quantization (W4A16)** - large reduction in memory footprint and latency.
 - **Custom Runtime Integration** - compatible with **vLLM (0.10.2)** via the `embedl-models` package.
 ---