import Sidenote from '../../components/Sidenote.astro'
## Example 1: Inference on nanochat in Transformers
<Sidenote>
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https://huggingface.co/datasets/nanochat-students/notebooks/blob/main/inference.ipynb)
</Sidenote>
This first bonus tutorial walks you through basic inference in `transformers`:
```py
import torch
from transformers import AutoTokenizer, NanoChatForCausalLM

# Load the converted nanochat checkpoint from the Hub
tokenizer = AutoTokenizer.from_pretrained("nanochat-students/nanochat-d20")
model = NanoChatForCausalLM.from_pretrained("nanochat-students/nanochat-d20")

# Move the model to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Tokenize the prompt and generate
prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
inputs.pop("token_type_ids", None)  # generate() does not accept token_type_ids
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
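Since nanochat is a chat model, a plain-text prompt bypasses its conversation formatting. A minimal sketch of generating through the tokenizer's chat template instead, assuming the converted checkpoint ships one:
```py
# Sketch: wrap the prompt in the chat template (assumes the converted
# checkpoint includes one)
messages = [{"role": "user", "content": "Hello, how are you?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```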
### Inference in transformers with vLLM
Next, let's use `transformers` as a backend for `vLLM` to serve the model for optimized inference.
Since nanochat support in `transformers` is recent, we'll need to install it from `main`, alongside `vLLM` itself:
```sh
pip install vllm
pip install git+https://github.com/huggingface/transformers.git@main
```
Then we can start a `vLLM` server like so (`--enforce-eager` disables CUDA graph capture, and `--revision refs/pr/1` pins a specific pull-request revision of the repo on the Hub):
```sh
vllm serve nanochat-students/nanochat-d20 --enforce-eager --revision refs/pr/1
```
Finally, we can call the server like so:
```sh
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "nanochat-students/nanochat-d20",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
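The server exposes an OpenAI-compatible API, so you can also call it from Python. A minimal sketch using the `openai` client (assumes `pip install openai`):
```py
from openai import OpenAI

# vLLM's OpenAI-compatible endpoint; no real API key is required
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="nanochat-students/nanochat-d20",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)
```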
### Inference on your trained nanochat weights
Let's say you've followed the nanochat repo and used it to train a model. You can then add `transformers` compatibility to your checkpoint and use it in other libraries.
1. Download any `nanochat` checkpoint from the Hub. Here we use Karpathy's, but this could be yours:
```sh
hf download karpathy/nanochat-d34 --local-dir nanochat-d34
```
2. Convert the checkpoint to `transformers` format using the conversion script:
```sh
uv run \
--with "transformers @ git+https://github.com/huggingface/transformers.git@main" \
--with "tiktoken>=0.12.0" \
https://raw.githubusercontent.com/huggingface/transformers/main/src/transformers/models/nanochat/convert_nanochat_checkpoints.py \
--input_dir ./nanochat-d34 \
--output_dir ./nanochat-d34-hf
```
3. (Optional) Upload the converted checkpoint to the Hugging Face Hub:
```sh
hf upload <username>/nanochat-d34 ./nanochat-d34-hf
```
4. As above, you can now generate with your model in `transformers`:
```py
import torch
from transformers import AutoTokenizer, NanoChatForCausalLM

# Load the locally converted checkpoint
tokenizer = AutoTokenizer.from_pretrained("./nanochat-d34-hf")
model = NanoChatForCausalLM.from_pretrained("./nanochat-d34-hf")

# Move the model to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Tokenize the prompt and generate
prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
inputs.pop("token_type_ids", None)  # generate() does not accept token_type_ids
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
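If you uploaded the converted checkpoint in step 3, the same code works with the Hub repo id in place of the local path:
```py
# Load from the Hub instead of the local directory
tokenizer = AutoTokenizer.from_pretrained("<username>/nanochat-d34")
model = NanoChatForCausalLM.from_pretrained("<username>/nanochat-d34")
```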