Update README.md

README.md
@@ -18,7 +18,7 @@ Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging:
 - Quantization (W4A16)
 - Custom vLLM generation via `embedl-models`

-FlashHead matches the
+FlashHead matches the Llama-3.2-3B-Instruct baseline within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers SOTA on-device latency

 ---
