Pure C++ version of the SmolLM2 model code for EDGE implementations

by MartialTerran - opened Nov 4, 2024

I first heard about this model today in this article: https://venturebeat.com/ai/ai-on-your-smartphone-hugging-faces-smollm2-brings-powerful-models-to-the-palm-of-your-hand/?_bhlid=071034f893836a3364663dcc52fbea6fd14a2f15
I am disappointed that the full edge-optimized "model" is not published in a compilable code format that can be ported to an edge device capable of running compiled code, such as a Raspberry Pi, an Android phone, or an Arduino.

I'm hoping to deploy this model on resource-constrained devices like Raspberry Pis, Android phones, and even Arduinos. Currently, I don't see a readily available, portable implementation. While I understand the model architecture is based on LLaMA, I'm looking for something more directly deployable than using AutoModelForCausalLM.

Ideally, I'd like to see a simplified, compilable representation of the SmolLM2 model, perhaps in C/C++, that doesn't rely on the transformers library. This would allow for greater flexibility in porting and optimizing for these edge devices. Something analogous to this (though I understand this is a highly simplified illustration):

// Hypothetical example

// Load the model from a checkpoint file
LLAMA_for_SmolLM2 model(SmolLM_135M_checkpoint_file);

// Move the model to the target device (e.g., CPU, GPU, or specialized hardware)
model.to(device);

// Tokenizer instance for preprocessing text
LLAMA_for_SmolLM2_tokenizer tokenizer;

// Encode the input string
auto inputs = tokenizer.encode("Gravity is");

// Move the encoded input to the same device as the model
inputs.to(device);

This contrasts with the current Python-based approach using transformers:

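from transformers import AutoModelForCausalLM, AutoTokenizer
checkpoint = "HuggingFaceTB/SmolLM2-1.7B"
device = "cuda"  # or "cpu"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
# Tokenize the prompt and move it to the same device as the model
inputs = tokenizer.encode("Gravity is", return_tensors="pt").to(device)
# Generate a continuation and decode it back to text
outputs = model.generate(inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))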
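
For a sense of what the dependency-free side could look like, here is a minimal sketch of a single Llama-style building block, RMSNorm, in plain C++17 using only the standard library. It is an illustration and not SmolLM2's actual code: the function name `rms_norm`, the `std::vector<float>` interface, and the default `eps` are assumptions made for this example. A real port would also need attention, rotary embeddings, the MLP, weight loading, and a tokenizer.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// RMSNorm as used in Llama-style models (hypothetical helper, not an official API):
//   y[i] = x[i] / sqrt(mean(x^2) + eps) * weight[i]
// `eps` is an assumed default; a real port should read it from the model config.
std::vector<float> rms_norm(const std::vector<float>& x,
                            const std::vector<float>& weight,
                            float eps = 1e-5f) {
    assert(x.size() == weight.size() && !x.empty());

    // Accumulate the sum of squares over the hidden dimension.
    float sum_sq = 0.0f;
    for (float v : x) {
        sum_sq += v * v;
    }

    // Reciprocal of the root mean square, stabilized by eps.
    const float inv_rms =
        1.0f / std::sqrt(sum_sq / static_cast<float>(x.size()) + eps);

    // Scale each element and apply the learned per-channel weight.
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = x[i] * inv_rms * weight[i];
    }
    return y;
}
```

Calling `rms_norm({3.0f, 4.0f}, {1.0f, 1.0f})` returns roughly `{0.849, 1.131}`, since the root mean square of the input is about 3.536. Nothing here needs Python or a tensor library, which is the kind of portability I have in mind.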

Is providing a more portable, compilable version of SmolLM2 something being considered?
