WilhelmT committed · Commit aadcecd · verified · 1 Parent(s): f060db0

Update README.md

Files changed (1):
  1. README.md +1 -1
README.md CHANGED
@@ -17,7 +17,7 @@ Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging:
 - FlashHead
 - Custom vLLM generation via `embedl-models`

-FlashHead matches the baseline Llama-3.2-1B-Instruct within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, delivers SOTA on-device latency.
+FlashHead matches the baseline Llama-3.2-1B-Instruct within rounding error on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, achieves H200-class throughput on RTX Ada GPUs.

 ---