cmpatino committed on
Commit aa9d5a1 · 1 Parent(s): d4ed61a

Add vllm and sglang commands

Files changed (1)
README.md +36 -1
README.md CHANGED
@@ -83,8 +83,10 @@ output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :]
 print(tokenizer.decode(output_ids, skip_special_tokens=True))
 ```
 
+>[!TIP]
+> We recommend setting `temperature=0.6` and `top_p=0.95` in the sampling parameters.
 
-### Enabling and Disabling Extended Thinking Mode
+### Enabling and Disabling Extended Thinking Mode
 
 We enable extended thinking by default, so the example above generates the output with a reasoning trace. To choose between the two modes, pass the `/think` or `/no_think` flag in the system prompt, as shown in the snippet below for the case where extended thinking is disabled. The code for generating the response with extended thinking is the same, except that the system prompt contains `/think` instead of `/no_think`.
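For illustration, a minimal sketch of the disabled-thinking case, assuming the same `AutoModelForCausalLM`/`AutoTokenizer` setup as the generation example earlier in the README (the checkpoint name and generation settings are illustrative):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "HuggingFaceTB/SmolLM3-3B"  # checkpoint also used in the serving examples below
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# The /no_think flag in the system prompt disables the reasoning trace;
# replace it with /think to keep extended thinking enabled.
messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": "Give me a brief explanation of gravity in simple terms."},
]
model_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
)
generated_ids = model.generate(
    **model_inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.95
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```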
 
@@ -177,6 +179,39 @@ text = tokenizer.apply_chat_template(
 
  For local inference, you can use `llama.cpp`, `ONNX`, `MLX` and `MLC`. You can find quantized checkpoints in this collection [TODO].
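For illustration, one way to run such a quantized checkpoint locally on top of `llama.cpp` is through the `llama-cpp-python` bindings; the GGUF file name below is a placeholder for a file from that collection:

```python
from llama_cpp import Llama

# Placeholder GGUF file name; download a quantized checkpoint from the collection above.
llm = Llama(model_path="SmolLM3-3B-Q4_K_M.gguf", n_ctx=4096)

out = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Give me a brief explanation of gravity in simple terms."}
    ],
    temperature=0.6,
    top_p=0.95,
)
print(out["choices"][0]["message"]["content"])
```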
 
+### vLLM and SGLang
+
+You can use vLLM and SGLang to deploy the model behind an OpenAI-compatible API.
+
+#### SGLang
+
+```bash
+python -m sglang.launch_server --model-path HuggingFaceTB/SmolLM3-3B
+```
+
+#### vLLM
+
+```bash
+vllm serve HuggingFaceTB/SmolLM3-3B
+```
+
+#### Setting `chat_template_kwargs`
+
+You can pass `chat_template_kwargs` such as `enable_thinking` and `xml_tools` to a deployed model by including the `chat_template_kwargs` field in the API request.
+
+```bash
+curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+  "model": "HuggingFaceTB/SmolLM3-3B",
+  "messages": [
+    {"role": "user", "content": "Give me a brief explanation of gravity in simple terms."}
+  ],
+  "temperature": 0.6,
+  "top_p": 0.95,
+  "max_tokens": 16384,
+  "chat_template_kwargs": {"enable_thinking": false}
+}'
+```
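The same request can also be sent through the `openai` Python client; this sketch assumes the vLLM endpoint launched above (adjust the base URL and port if you use the SGLang server instead):

```python
from openai import OpenAI

# Points at the locally deployed OpenAI-compatible server; no real API key is needed.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM3-3B",
    messages=[
        {"role": "user", "content": "Give me a brief explanation of gravity in simple terms."}
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=16384,
    # Non-standard parameters such as chat_template_kwargs are forwarded via extra_body.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
```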
+
 ## Evaluation
 
 In this section, we report the evaluation results of the SmolLM3 base model. All evaluations are zero-shot unless stated otherwise, and we use [lighteval](https://github.com/huggingface/lighteval) to run them.