Update README.md

README.md
@@ -18,7 +18,7 @@ Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging:
 - Quantization (W4A16)
 - Custom vLLM generation via `embedl-models`

-FlashHead matches the
+FlashHead matches the Llama-3.2-3B-Instruct baseline within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers SOTA on-device latency

 ---
