WilhelmT committed · Commit aadcecd · verified · 1 Parent(s): f060db0

Update README.md

Files changed (1):
  1. README.md +1 -1
README.md CHANGED
@@ -17,7 +17,7 @@ Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging:
 - FlashHead
 - Custom vLLM generation via `embedl-models`

-FlashHead matches the baseline Llama-3.2-1B-Instruct within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers SOTA on-device latency.
+FlashHead matches the baseline Llama-3.2-1B-Instruct within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, achieves H200-class throughput on RTX Ada GPUs.

 ---