swaze committed
Commit a451f70 · verified · 1 Parent(s): 35aff0b

Update README.md

Files changed (1)
  1. README.md +9 -0
README.md CHANGED
@@ -22,6 +22,15 @@ Designed for **low-latency inference** on **NVIDIA RTX GPUs**, leveraging:
 

FlashHead matches the baseline Llama-3.2-1B-Instruct within rounding on common benchmarks (MMLU-Pro, HellaSwag, GSM8K, etc.) and, combined with quantization, achieves H200-class throughput on RTX Ada GPUs.

+ ### Quickstart
+
+ Launch an interactive chat window (supporting /reset and /exit commands) with:
+
+ ```shell
+ pip install embedl-models
+ python3 -m embedl.models.vllm.demo --model embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16
+ ```
+
---

## Model Details
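
Beyond the interactive demo added above, the checkpoint should also be servable through vLLM's standard Python API once `embedl-models` is installed. A minimal sketch, assuming the package registers FlashHead's custom components with vLLM on installation (an assumption; this commit only documents the demo entry point):

```python
# Minimal sketch: batch generation with the FlashHead checkpoint via vLLM.
# Assumption: `pip install embedl-models` makes the checkpoint loadable;
# the calls below are standard vLLM offline-inference API.
from vllm import LLM, SamplingParams

# Load the quantized FlashHead model referenced in the Quickstart.
llm = LLM(model="embedl/Llama-3.2-1B-Instruct-FlashHead-W4A16")
params = SamplingParams(temperature=0.7, max_tokens=128)

# Generate a completion for a single prompt and print the text.
outputs = llm.generate(["Explain FlashHead in one sentence."], params)
print(outputs[0].outputs[0].text)
```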