add vllm serving instructions to readme
README.md CHANGED

@@ -119,6 +119,22 @@ resp = tokenizer.batch_decode(output)[0]
 print(resp.replace(model_inputs, ""))
 ```
 
+### Serving with vLLM
+
+For production deployments, you can serve Foundation-Sec-8B-Reasoning using [vLLM](https://github.com/vllm-project/vllm). The model uses the `minimax_m2` reasoning parser to properly handle reasoning traces.
+
+```bash
+vllm serve "fdtn-ai/Foundation-Sec-8B-Reasoning" \
+    --host 0.0.0.0 \
+    --port ${PORT} \
+    --tensor-parallel-size 1 \
+    --max-model-len 32768 \
+    --trust-remote-code \
+    --reasoning-parser minimax_m2
+```
+
+Adjust `--tensor-parallel-size` based on your GPU configuration and `--max-model-len` based on your memory constraints.
+
 ## Training and Evaluation
 
 ### Training Data
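Once `vllm serve` is running, the model is reachable through vLLM's OpenAI-compatible HTTP API. The sketch below builds a chat-completions request for the served model; the `localhost:8000` base URL, the example prompt, and the generation parameters are assumptions (match the URL to whatever `--port` you passed), and the commented-out section shows how the request would be sent and how the reasoning trace, which vLLM returns in a separate `reasoning_content` field when a reasoning parser is enabled, can be read out.

```python
import json

# Assumed endpoint: vLLM's OpenAI-compatible server, using the port
# you passed to `vllm serve` (8000 shown here as an example).
BASE_URL = "http://localhost:8000/v1/chat/completions"

# Example request payload; the prompt and sampling settings are
# illustrative, not recommendations from the model card.
payload = {
    "model": "fdtn-ai/Foundation-Sec-8B-Reasoning",
    "messages": [
        {"role": "user", "content": "Summarize CVE-2021-44228 in one paragraph."}
    ],
    "max_tokens": 1024,
    "temperature": 0.6,
}

# To actually send the request (requires the server above to be running):
# import urllib.request
# req = urllib.request.Request(
#     BASE_URL,
#     data=json.dumps(payload).encode(),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as r:
#     msg = json.loads(r.read())["choices"][0]["message"]
#     print(msg.get("reasoning_content"))  # reasoning trace (parser-separated)
#     print(msg["content"])                # final answer

print(json.dumps(payload, indent=2))
```

Any OpenAI-compatible client (for example the `openai` Python package pointed at the same base URL) works the same way.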