embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16

text-generation-inference

compressed-tensors

swaze committed 3 days ago

Commit a451f70 · verified · 1 parent: 35aff0b

Update README.md

Files changed (1)

README.md +9 -0

@@ -22,6 +22,15 @@ Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging:
 FlashHead matches the baseline Llama-3.2-1B-Instruct within rounding on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, achieves H200-class throughput on RTX Ada GPUs.
+### Quickstart
+Launch a chat window that supports the `/reset` and `/exit` commands:
+```shell
+pip install embedl-models
+python3 -m embedl.models.vllm.demo --model embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16
+```
 ---
 ## Model Details