cmpatino committed · Commit aa9d5a1 · 1 Parent: d4ed61a
Add vllm and sglang commands

README.md CHANGED
@@ -83,8 +83,10 @@ output_ids = generated_ids[0][len(model_inputs.input_ids[0]) :]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```

+>[!TIP]
+> We recommend setting `temperature=0.6` and `top_p=0.95` in the sampling parameters.

-### Enabling and Disabling Extended Thinking
+### Enabling and Disabling Extended Thinking Mode

We enable extended thinking by default, so the example above generates the output with a reasoning trace. To choose between the two modes, you can provide the `/think` and `/no_think` flags through the system prompt, as shown in the snippet below for the case where extended thinking is disabled. The code for generating a response with extended thinking would be the same, except that the system prompt should contain `/think` instead of `/no_think`.
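The README snippet referenced in that paragraph is outside this diff; as a rough illustration only, a minimal `transformers` sketch of the disabled-thinking path could look like the following (the checkpoint name comes from the serving commands below; the actual README snippet may differ):

```python
# Minimal sketch (not part of this commit): disable extended thinking by
# putting the /no_think flag in the system prompt; use /think to enable it.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceTB/SmolLM3-3B"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

messages = [
    {"role": "system", "content": "/no_think"},
    {"role": "user", "content": "Give me a brief explanation of gravity in simple terms."},
]
model_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
# Sampling parameters recommended in the tip above.
generated_ids = model.generate(
    model_inputs, do_sample=True, temperature=0.6, top_p=0.95, max_new_tokens=512
)
# Keep only the newly generated tokens before decoding.
output_ids = generated_ids[0][model_inputs.shape[-1] :]
print(tokenizer.decode(output_ids, skip_special_tokens=True))
```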
@@ -177,6 +179,39 @@ text = tokenizer.apply_chat_template(

For local inference, you can use `llama.cpp`, `ONNX`, `MLX` and `MLC`. You can find quantized checkpoints in this collection [TODO].

+### vLLM and SGLang
+
+You can use vLLM and SGLang to deploy the model behind an OpenAI-compatible API.
+
+#### SGLang
+
+```bash
+python -m sglang.launch_server --model-path HuggingFaceTB/SmolLM3-3B
+```
+
+#### vLLM
+
+```bash
+vllm serve HuggingFaceTB/SmolLM3-3B
+```
+
+#### Setting `chat_template_kwargs`
+
+You can pass chat template arguments such as `enable_thinking` and `xml_tools` to a deployed model through the `chat_template_kwargs` field of the API request.
+
+```bash
+curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
+  "model": "HuggingFaceTB/SmolLM3-3B",
+  "messages": [
+    {"role": "user", "content": "Give me a brief explanation of gravity in simple terms."}
+  ],
+  "temperature": 0.6,
+  "top_p": 0.95,
+  "max_tokens": 16384,
+  "chat_template_kwargs": {"enable_thinking": false}
+}'
+```
+
## Evaluation

In this section, we report the evaluation results of the SmolLM3 base model. All evaluations are zero-shot unless stated otherwise, and we use [lighteval](https://github.com/huggingface/lighteval) to run them.
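The `chat_template_kwargs` field added above is not part of the standard OpenAI request schema, so from the official OpenAI Python client it has to go through the `extra_body` parameter. A minimal sketch, assuming a server launched with one of the commands above and listening on `localhost:8000`:

```python
# Sketch of the same request as the curl example, via the OpenAI Python client.
from openai import OpenAI

# vLLM and SGLang do not check the API key by default; any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="HuggingFaceTB/SmolLM3-3B",
    messages=[
        {"role": "user", "content": "Give me a brief explanation of gravity in simple terms."}
    ],
    temperature=0.6,
    top_p=0.95,
    max_tokens=16384,
    # Non-standard fields such as chat_template_kwargs are forwarded via extra_body.
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(response.choices[0].message.content)
```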