Update README.md
README.md CHANGED
@@ -109,7 +109,7 @@ lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks
 
 ## int4wo-hqq
 ```
-lm_eval --model hf --model_args pretrained=
+lm_eval --model hf --model_args pretrained=pytorch/Phi-4-mini-instruct-int4wo-hqq --tasks hellaswag --device cuda:0 --batch_size 8
 ```
 
 `TODO: more complete eval results`
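Note: reproducing the eval above assumes the EleutherAI `lm-eval` harness and `torchao` are installed; `torchao` is what deserializes the int4 weight-only (hqq) checkpoint. A minimal setup sketch (package names as published on PyPI; version pins are left out and are the reader's choice):

```
# Install the eval harness plus torchao for loading the int4wo-hqq
# checkpoint; versions are unpinned here, pin as needed for your stack.
pip install lm-eval torchao
```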
@@ -162,7 +162,7 @@ python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model
 
 ### int4wo-hqq
 ```
-python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model
+python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 --model pytorch/Phi-4-mini-instruct-int4wo-hqq --batch-size 1
 ```
 
 ## benchmark_serving
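The latency script benchmarks one batch size per invocation; to get a quick scaling picture, the same command can be looped over a few values. A sketch reusing only the flags already shown in this README (the batch sizes chosen are arbitrary):

```
# Sweep a few batch sizes at the same input/output lengths;
# each invocation prints its own latency summary.
for bs in 1 4 16; do
  python benchmarks/benchmark_latency.py --input-len 256 --output-len 256 \
    --model pytorch/Phi-4-mini-instruct-int4wo-hqq --batch-size "$bs"
done
```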
@@ -186,16 +186,16 @@ python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --
 ### int4wo-hqq
 Server:
 ```
-vllm serve
+vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3
 ```
 
 Client:
 ```
-python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model
+python benchmarks/benchmark_serving.py --backend vllm --dataset-name sharegpt --tokenizer microsoft/Phi-4-mini-instruct --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json --model pytorch/Phi-4-mini-instruct-int4wo-hqq --num-prompts 1
 ```
 
 # Serving with vllm
 We can use the same command we used in serving benchmarks to serve the model with vllm
 ```
-vllm serve
+vllm serve pytorch/Phi-4-mini-instruct-int4wo-hqq --tokenizer microsoft/Phi-4-mini-instruct -O3
 ```
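The client command above expects `./ShareGPT_V3_unfiltered_cleaned_split.json` on disk. A plausible way to fetch it, following the convention in vLLM's benchmarking docs (the exact URL is an assumption, not part of this commit):

```
# Download the ShareGPT split that benchmark_serving.py reads;
# URL follows the mirror referenced by vLLM's benchmark docs.
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
```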
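Once `vllm serve` is up it exposes an OpenAI-compatible API, on port 8000 by default. A quick smoke test against the chat completions endpoint might look like the following; the prompt and `max_tokens` here are placeholders:

```
# Query the OpenAI-compatible endpoint started by `vllm serve`;
# adjust host/port if the server was launched with different settings.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "pytorch/Phi-4-mini-instruct-int4wo-hqq",
    "messages": [{"role": "user", "content": "Explain int4 weight-only quantization in one paragraph."}],
    "max_tokens": 128
  }'
```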