cmpatino committed · Commit aa9d5a1 · 1 Parent: d4ed61a
Add vllm and sglang commands

README.md CHANGED
@@ -83,8 +83,10 @@ output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```

+>[!TIP]
+> We recommend setting `temperature=0.6` and `top_p=0.95` in the sampling parameters.

-### Enabling and Disabling Extended Thinking
+### Enabling and Disabling Extended Thinking Mode

We enable extended thinking by default, so the example above generates the output with a reasoning trace. To choose between the two modes, you can provide the `/think` and `/no_think` flags through the system prompt, as shown in the snippet below for the case where extended thinking is disabled. The code for generating a response with extended thinking would be the same, except that the system prompt should contain `/think` instead of `/no_think`.
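The README snippet referenced in that paragraph is outside this diff; as a rough illustration only, a minimal `transformers` sketch of the disabled-thinking path could look like the following (the checkpoint name comes from the serving commands below; the actual README snippet may differ):

```python
# Minimal sketch (not part of this commit): disable extended thinking by
# putting the /no_think flag in the system prompt; use /think to enable it.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": "Give me a brief explanation of gravity in simple terms."},
]
model_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
# Sampling parameters recommended in the tip above.
generated_ids = model.generate(
    model_inputs, do_sample=True, temperature=0.6, top_p=0.95, max_new_tokens=512
)
# Keep only the newly generated tokens before decoding.
output_ids = generated_ids[0][model_inputs.shape[-1] :]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```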
@@ -177,6 +179,39 @@ text = tokenizer.apply_chat_template(

For local inference, you can use `llama.cpp`, `ONNX`, `MLX` and `MLC`. You can find quantized checkpoints in this collection [TODO].

+### vLLM and SGLang
+
+You can use vLLM and SGLang to deploy the model behind an OpenAI-compatible API.
+
+#### SGLang
+
+```bash
+python -m sglang.launch_server --model-path HuggingFaceTB/SmolLM3-3B
+```
+
+#### vLLM
+
+```bash
+vllm serve HuggingFaceTB/SmolLM3-3B
+```
+
+#### Setting `chat_template_kwargs`
+
+You can pass chat template arguments such as `enable_thinking` and `xml_tools` to a deployed model through the `chat_template_kwargs` field of the API request.
+
+```bash
+curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+  "model": "HuggingFaceTB/SmolLM3-3B",
+  "messages": [
+    {"role": "user", "content": "Give me a brief explanation of gravity in simple terms."}
+  ],
+  "temperature": 0.6,
+  "top_p": 0.95,
+  "max_tokens": 16384,
+  "chat_template_kwargs": {"enable_thinking": false}
+}'
+```
+
## Evaluation

In this section, we report the evaluation results of the SmolLM3 base model. All evaluations are zero-shot unless stated otherwise, and we use [lighteval](https://github.com/huggingface/lighteval) to run them.
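The `chat_template_kwargs` field added above is not part of the standard OpenAI request schema, so from the official OpenAI Python client it has to go through the `extra_body` parameter. A minimal sketch, assuming a server launched with one of the commands above and listening on `localhost:8000`:

```python
# Sketch of the same request as the curl example, via the OpenAI Python client.
from openai import OpenAI

# vLLM and SGLang do not check the API key by default; any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM3-3B",
    messages=[
        {"role": "user", "content": "Give me a brief explanation of gravity in simple terms."}
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=16384,
    # Non-standard fields such as chat_template_kwargs are forwarded via extra_body.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
```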