Are there two identical embedding tensors, even though embeddings are shared?

#15

by graefics - opened Sep 19, 2024

graefics, Sep 19, 2024 (edited Sep 19, 2024)

The SmolLM models have tied embeddings (config.tie_word_embeddings = True), i.e. input and output embeddings are shared.

However, the model contains two identical embedding tensors, lm_head.weight and model.embed_tokens.weight. This seems to defeat the purpose of having shared (aka tied) embeddings. Any ideas?

Clicking on the arrow on the right-hand side (see the first screenshot below) shows a summary of the parameters without the lm_head tensor. In the second screenshot below, model.norm.weight is the last tensor shown:

Here is code showing that the model contains two identical tensors for input and output embeddings:

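from transformers import AutoModelForCausalLM
import torch

# Load the checkpoint (downloads it from the Hub on first use)
model = AutoModelForCausalLM.from_pretrained('HuggingFaceTB/SmolLM-135M')
# torch.equal checks that both tensors have the same shape and the same values
print(torch.equal(model.lm_head.weight, model.model.embed_tokens.weight))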

The code above prints True, because the two tensors are identical.
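One caveat about this check: torch.equal only compares shape and values, so it prints True both when two names refer to one shared tensor and when there are two separate copies. The sketch below uses a toy embedding and linear head (the sizes and the same_storage helper are made up for illustration, and this is not the SmolLM model itself) to show how comparing data_ptr() can tell the two cases apart.

```python
import torch
from torch import nn

vocab_size, hidden_size = 16, 4  # toy sizes for illustration, not SmolLM's

embed_tokens = nn.Embedding(vocab_size, hidden_size)

# Case 1: tied. The head's weight attribute points at the embedding's parameter.
tied_head = nn.Linear(hidden_size, vocab_size, bias=False)
tied_head.weight = embed_tokens.weight

# Case 2: copied. The head gets its own parameter holding the same values.
copied_head = nn.Linear(hidden_size, vocab_size, bias=False)
copied_head.weight = nn.Parameter(embed_tokens.weight.detach().clone())


def same_storage(a: torch.Tensor, b: torch.Tensor) -> bool:
    """Hypothetical helper: True if both tensors start at the same memory address."""
    return a.data_ptr() == b.data_ptr()


# Value comparison cannot distinguish the two cases.
tied_equal = torch.equal(tied_head.weight, embed_tokens.weight)
copied_equal = torch.equal(copied_head.weight, embed_tokens.weight)

# Address comparison can.
tied_shared = same_storage(tied_head.weight, embed_tokens.weight)
copied_shared = same_storage(copied_head.weight, embed_tokens.weight)

# Changing the embedding in place propagates to the tied head only.
with torch.no_grad():
    embed_tokens.weight[0, 0] += 1.0
tied_still_equal = torch.equal(tied_head.weight, embed_tokens.weight)
copied_still_equal = torch.equal(copied_head.weight, embed_tokens.weight)

print("torch.equal:", tied_equal, copied_equal)
print("same storage:", tied_shared, copied_shared)
print("equal after in-place edit:", tied_still_equal, copied_still_equal)
```

The same data_ptr() comparison could be run on model.lm_head.weight and model.model.embed_tokens.weight of the loaded checkpoint to see whether the model holds one tensor in memory or two.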

The issue might be in modeling_llama.py, which doesn't seem to fully support shared embeddings. Specifically, line 1210:

            logits = self.lm_head(hidden_states[:, -num_logits_to_keep:, :]).float()

Contrast this with, for example, OpenELM's code, which seems to fully support both tied and untied embeddings (line 878):

        if self.lm_head is None:
            # shared
            logits = F.linear(hidden_states, weight=self.transformer.token_embeddings.weight)
        else:
            logits = self.lm_head(hidden_states)
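To make the pattern in that snippet concrete, here is a minimal, self-contained sketch. TinyLM, share_embeddings and count_parameters are hypothetical names for illustration, and the transformer layers are left out entirely. When embeddings are shared, no lm_head module is created and the logits are computed with F.linear against the embedding matrix.

```python
from typing import Optional

import torch
import torch.nn.functional as F
from torch import nn


class TinyLM(nn.Module):
    """Hypothetical toy model: only the embedding and the output head are modelled."""

    def __init__(self, vocab_size: int, hidden_size: int, share_embeddings: bool):
        super().__init__()
        self.token_embeddings = nn.Embedding(vocab_size, hidden_size)
        # With shared embeddings there is no separate output projection at all.
        self.lm_head: Optional[nn.Linear] = (
            None if share_embeddings else nn.Linear(hidden_size, vocab_size, bias=False)
        )

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Stand-in for the transformer layers: use the embeddings as hidden states.
        hidden_states = self.token_embeddings(input_ids)
        if self.lm_head is None:
            # shared: project with the embedding matrix itself
            return F.linear(hidden_states, weight=self.token_embeddings.weight)
        return self.lm_head(hidden_states)


def count_parameters(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())


shared = TinyLM(vocab_size=16, hidden_size=4, share_embeddings=True)
untied = TinyLM(vocab_size=16, hidden_size=4, share_embeddings=False)

input_ids = torch.tensor([[1, 2, 3]])
shared_logits = shared(input_ids)
untied_logits = untied(input_ids)

print(shared_logits.shape, untied_logits.shape)
print(count_parameters(shared), count_parameters(untied))
```

In this toy setup the shared variant has a single 16 x 4 matrix, while the untied variant has two, so its parameter count doubles.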

graefics changed discussion title from Are there two identical embedding tensors in the model, even though input and output embeddings are shared? to Are there two identical embedding tensors in the model, even though embeddings are shared? Sep 19, 2024

graefics changed discussion title from Are there two identical embedding tensors in the model, even though embeddings are shared? to Are there two identical embedding tensors, even though embeddings are shared? Sep 19, 2024
