import Sidenote from '../../components/Sidenote.astro'

## Example 1: Inference on nanochat in Transformers

<Sidenote>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/#fileId=https://huggingface.co/datasets/nanochat-students/notebooks/blob/main/inference.ipynb)

</Sidenote>

This first bonus tutorial shows you how to run basic inference in `transformers`:

```py
import torch
from transformers import AutoTokenizer, NanoChatForCausalLM

tokenizer = AutoTokenizer.from_pretrained("nanochat-students/nanochat-d20")
model = NanoChatForCausalLM.from_pretrained("nanochat-students/nanochat-d20")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
inputs.pop("token_type_ids", None)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
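
Since nanochat is a chat-tuned model, raw text prompts work but chat-formatted prompts usually behave better. Here is a minimal sketch that reuses the `tokenizer`, `model`, and `device` from the snippet above and assumes the converted tokenizer ships a chat template (you can check `tokenizer.chat_template`):

```py
# Assumes the converted tokenizer includes a chat template.
messages = [
    {"role": "user", "content": "Hello, how are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model starts its reply
    return_tensors="pt",
    return_dict=True,
).to(device)

outputs = model.generate(**inputs, max_new_tokens=100)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```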

### Inference in transformers with vLLM

Next, let's use `transformers` as a backend for `vLLM` to serve the model for optimized inference.

We'll need `vllm` installed, plus `transformers` from `main` for the nanochat support:

```sh
pip install vllm
pip install git+https://github.com/huggingface/transformers.git@main
```

Then we can start a `vLLM` server like so:

```sh
vllm serve nanochat-students/nanochat-d20 --enforce-eager --revision refs/pr/1
```

Finally, we can call the server like so:

```sh
curl -X POST "http://localhost:8000/v1/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "nanochat-students/nanochat-d20",
    "prompt": "Once upon a time,",
    "max_tokens": 512,
    "temperature": 0.5
  }'
```
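
Because vLLM exposes an OpenAI-compatible API, you can also hit the same endpoint from Python. This is a minimal sketch that assumes the `openai` client is installed (`pip install openai`) and the server above is running:

```py
from openai import OpenAI

# vLLM's server does not check API keys, so any placeholder string works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="nanochat-students/nanochat-d20",
    prompt="Once upon a time,",
    max_tokens=512,
    temperature=0.5,
)
print(completion.choices[0].text)
```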

### Inference on your trained nanochat weights

Let's say you've followed the nanochat repo and used it to train a model. You can then add `transformers` compatibility to your checkpoint and use it with other libraries.

1. Download any `nanochat` checkpoint from the Hub. Here we use Karpathy's, but this could be yours:

```sh
hf download karpathy/nanochat-d34 --local-dir nanochat-d34
```

2. Convert the checkpoint to the `transformers` format using the conversion script:

```sh
uv run \
  --with "transformers @ git+https://github.com/huggingface/transformers.git@main" \
  --with "tiktoken>=0.12.0" \
  https://raw.githubusercontent.com/huggingface/transformers/main/src/transformers/models/nanochat/convert_nanochat_checkpoints.py \
  --input_dir ./nanochat-d34 \
  --output_dir ./nanochat-d34-hf
```

3. (Optional) Upload the converted checkpoint to the Hugging Face Hub so it can be loaded by repo id (a loading sketch follows step 4):

```sh
hf upload <username>/nanochat-d34 nanochat-d34-hf
```

4. As above, you can generate with your model in `transformers`, this time loading from the local converted directory:

```py
import torch
from transformers import AutoTokenizer, NanoChatForCausalLM

tokenizer = AutoTokenizer.from_pretrained("./nanochat-d34-hf")
model = NanoChatForCausalLM.from_pretrained("./nanochat-d34-hf")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

prompt = "Hello, how are you?"
inputs = tokenizer(prompt, return_tensors="pt").to(device)
inputs.pop("token_type_ids", None)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
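
Once the converted checkpoint is on the Hub (step 3), you or anyone else can load it by repo id instead of a local path. A minimal sketch, with `<username>` standing in for the namespace you uploaded to; generation then works exactly as in step 4:

```py
from transformers import AutoTokenizer, NanoChatForCausalLM

# Hypothetical repo id: replace <username> with your Hub namespace from step 3.
repo_id = "<username>/nanochat-d34"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = NanoChatForCausalLM.from_pretrained(repo_id)
```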