Instructions to use HuggingFaceTB/SmolLM2-135M-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use HuggingFaceTB/SmolLM2-135M-Instruct with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-135M-Instruct") messages = [ {"role": "user", "content": "Who are you?"}, ] pipe(messages)# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct") model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct") messages = [ {"role": "user", "content": "Who are you?"}, ] inputs = tokenizer.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt", ).to(model.device) outputs = model.generate(**inputs, max_new_tokens=40) print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:])) - Transformers.js
How to use HuggingFaceTB/SmolLM2-135M-Instruct with Transformers.js:
// npm i @huggingface/transformers import { pipeline } from '@huggingface/transformers'; // Allocate pipeline const pipe = await pipeline('text-generation', 'HuggingFaceTB/SmolLM2-135M-Instruct'); - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- vLLM
How to use HuggingFaceTB/SmolLM2-135M-Instruct with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "HuggingFaceTB/SmolLM2-135M-Instruct" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "HuggingFaceTB/SmolLM2-135M-Instruct", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/HuggingFaceTB/SmolLM2-135M-Instruct
- SGLang
How to use HuggingFaceTB/SmolLM2-135M-Instruct with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "HuggingFaceTB/SmolLM2-135M-Instruct" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "HuggingFaceTB/SmolLM2-135M-Instruct", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "HuggingFaceTB/SmolLM2-135M-Instruct" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "HuggingFaceTB/SmolLM2-135M-Instruct", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }' - Docker Model Runner
How to use HuggingFaceTB/SmolLM2-135M-Instruct with Docker Model Runner:
docker model run hf.co/HuggingFaceTB/SmolLM2-135M-Instruct
Why are model_q4.onnx and model_q4f16.onnx not 4 times smaller than model.onnx?
I see on https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct/tree/main/onnx:
| File Name | Size |
|---|---|
| model.onnx | 654 MB |
| model_fp16.onnx | 327 MB |
| model_q4.onnx | 200 MB |
| model_q4f16.onnx | 134 MB |
I understand that:
model.onnxis the fp32 model,model_fp16.onnxis the model whose weights are quantized tofp16
I don't understand the size of model_q4.onnx and model_q4f16.onnx
Why is
model_q4.onnx200 MB instead of 654 MB / 4 = 163.5 MB? I thoughtmodel_q4.onnxmeant that the weights are quantized to 4 bits.Why is
model_q4f16.onnx134 MB instead of 654 MB / 4 = 163.5 MB? I thoughtmodel_q4f16.onnxmeant that the weights are quantized to 4 bits and activations are fp16, since https://llm.mlc.ai/docs/compilation/configure_quantization.html states:qAfB(_id), whereArepresents the number of bits for storing weights andBrepresents the number of bits for storing activations.
and Why do activations need more bits (16bit) than weights (8bit) in tensor flow's neural network quantization framework? indicates that activations don't count toward the model size (understandably).