<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# Chat basics

Chat models are conversational models you can send a message to and receive a response. Most language models from mid-2023 onwards are chat models and may be referred to as "instruct" or "instruction-tuned" models. Models that do not support chat are often referred to as "base" or "pretrained" models.

Larger and newer models are generally more capable, but models specialized in certain domains (medical, legal text, non-English languages, etc.) can often outperform these larger models. Try leaderboards like [OpenLLM](https://hf.co/spaces/HuggingFaceH4/open_llm_leaderboard) and [LMSys Chatbot Arena](https://chat.lmsys.org/?leaderboard) to help you identify the best model for your use case.

This guide shows you how to quickly load chat models in Transformers from the command line, how to build and format a conversation, and how to chat using the [`TextGenerationPipeline`].
## chat CLI

After you've [installed Transformers](./installation), you can chat with a model directly from the command line. The command below launches an interactive session with a model, with a few base commands listed at the start of the session.

> For the following commands, please make sure [`transformers serve` is running](https://huggingface.co/docs/transformers/main/en/serving).

```bash
transformers chat Qwen/Qwen2.5-0.5B-Instruct
```

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/transformers-chat-cli.png"/>
</div>
You can launch the CLI with arbitrary `generate` flags, using the format `arg_1=value_1 arg_2=value_2 ...`.
```bash
transformers chat Qwen/Qwen2.5-0.5B-Instruct do_sample=False max_new_tokens=10
```

For a full list of options, run the command below.

```bash
transformers chat -h
```
The chat is implemented on top of the [AutoClass](./model_doc/auto), using tooling from [text generation](./llm_tutorial) and [chat templating](./chat_templating). It uses the `transformers serve` CLI under the hood ([docs](./serving#serve-cli)).
## TextGenerationPipeline

[`TextGenerationPipeline`] is a high-level text generation class with a "chat mode". Chat mode is enabled when a conversational model is detected and the chat prompt is [properly formatted](./llm_tutorial#wrong-prompt-format).

Chat models accept a list of messages (the chat history) as the input. Each message is a dictionary with `role` and `content` keys.

To start the chat, add a single `user` message. You can also optionally include a `system` message to give the model directions on how to behave.

```py
chat = [
    {"role": "system", "content": "You are a helpful science assistant."},
    {"role": "user", "content": "Hey, can you explain gravity to me?"}
]
```
Create the [`TextGenerationPipeline`] and pass `chat` to it. For large models, setting [device_map="auto"](./models#big-model-inference) helps load the model more quickly and automatically places it on the fastest available device.
```py
import torch
from transformers import pipeline

pipeline = pipeline(task="text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct", dtype="auto", device_map="auto")
response = pipeline(chat, max_new_tokens=512)
print(response[0]["generated_text"][-1]["content"])
```
If this works, you should see a response from the model! To continue the conversation, update the chat history with the model's response. You can do this either by appending the text to `chat` (with the `assistant` role), or by reading `response[0]["generated_text"]`, which contains the full chat history, including the most recent response.
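For example, a minimal sketch of the first approach, manually appending the reply with the `assistant` role (reusing `chat` and `response` from the example above):

```py
# Add the model's latest reply to the existing chat history under the assistant role.
chat.append({"role": "assistant", "content": response[0]["generated_text"][-1]["content"]})
```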
Once you have the model's response, you can continue the conversation by appending a new `user` message to the chat history.
```py
chat = response[0]["generated_text"]
chat.append(
    {"role": "user", "content": "Woah! But can it be reconciled with quantum mechanics?"}
)
response = pipeline(chat, max_new_tokens=512)
print(response[0]["generated_text"][-1]["content"])
```
By repeating this process, you can continue the conversation as long as you like, at least until the model runs out of context window or you run out of memory.
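For instance, a minimal interactive loop could look like the sketch below (it reuses the `pipeline` and `chat` objects from the previous examples; the prompt strings are arbitrary):

```py
# Keep chatting until the user submits an empty line.
while True:
    user_input = input("You: ")
    if not user_input:
        break
    chat.append({"role": "user", "content": user_input})
    response = pipeline(chat, max_new_tokens=512)
    chat = response[0]["generated_text"]  # full history, including the new assistant reply
    print("Assistant:", chat[-1]["content"])
```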
## Performance and memory usage

Transformers loads models in full `float32` precision by default, and for an 8B model, this requires ~32GB of memory! Use the `dtype="auto"` argument, which generally uses `bfloat16` for models that were trained with it, to reduce your memory usage.
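As a rough sketch, you can see the effect by loading a model and printing its memory footprint. This assumes a recent Transformers version where `dtype` is the accepted argument name (older releases use `torch_dtype`) and uses the built-in `get_memory_footprint` utility:

```py
from transformers import AutoModelForCausalLM

# "auto" loads the checkpoint in the dtype it was saved in (usually bfloat16),
# roughly halving memory use compared to the float32 default.
model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM2-1.7B-Instruct", dtype="auto", device_map="auto"
)
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")
```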
> [!TIP]
> Refer to the [Quantization](./quantization/overview) docs for more information about the different quantization backends available.
To lower memory usage even further, you can quantize the model to 8-bit or 4-bit with [bitsandbytes](https://hf.co/docs/bitsandbytes/index). Create a [`BitsAndBytesConfig`] with your desired quantization settings and pass it to the pipeline's `model_kwargs` parameter. The example below quantizes a model to 8-bit.
```py
from transformers import pipeline, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
pipeline = pipeline(task="text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto", model_kwargs={"quantization_config": quantization_config})
```
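A 4-bit variant only changes the config. The NF4 settings below are a common choice rather than a requirement, so treat this as a sketch:

```py
import torch
from transformers import pipeline, BitsAndBytesConfig

# 4-bit NF4 quantization roughly halves memory again compared to 8-bit.
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16)
pipeline = pipeline(task="text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct", device_map="auto", model_kwargs={"quantization_config": quantization_config})
```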
In general, model size and performance are directly correlated. Larger models are slower in addition to requiring more memory because each active parameter must be read from memory for every generated token.

This is a bottleneck for LLM text generation, and the main options for improving generation speed are to either quantize the model or use hardware with higher memory bandwidth. Adding more compute power doesn't meaningfully help.
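A back-of-the-envelope calculation shows why bandwidth is the limit: every parameter is read once per generated token, so dividing the hardware's memory bandwidth by the model's size in bytes gives an upper bound on tokens per second. The numbers below are illustrative assumptions, not measurements:

```py
model_size_gb = 16        # e.g. an 8B-parameter model in bfloat16 (2 bytes per parameter)
bandwidth_gb_per_s = 900  # e.g. a GPU with ~900 GB/s of memory bandwidth
max_tokens_per_s = bandwidth_gb_per_s / model_size_gb
print(f"~{max_tokens_per_s:.0f} tokens/sec upper bound")  # ~56 tokens/sec
```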
You can also try techniques like [speculative decoding](./generation_strategies#speculative-decoding), where a smaller model generates candidate tokens that are verified by the larger model. If the candidate tokens are correct, the larger model can generate more than one token at a time. This significantly alleviates the bandwidth bottleneck and improves generation speed.
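A minimal sketch of assisted generation with `generate()` is shown below. It assumes a large model and a small draft model that share a tokenizer (the Qwen2.5 pairing is just an example) and reuses the `chat` history from earlier:

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", dtype="auto", device_map="auto")
# The smaller "draft" model proposes candidate tokens that the larger model verifies in parallel.
assistant = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", dtype="auto", device_map="auto")

inputs = tokenizer.apply_chat_template(chat, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, assistant_model=assistant, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```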
> [!TIP]
> Mixture-of-Experts (MoE) models such as [Mixtral](./model_doc/mixtral), [Qwen2MoE](./model_doc/qwen2_moe), and [GPT-OSS](./model_doc/gpt-oss) have lots of parameters, but only "activate" a small fraction of them to generate each token. As a result, MoE models generally have much lower memory bandwidth requirements and can be faster than a regular LLM of the same size. However, techniques like speculative decoding are ineffective with MoE models because more parameters become activated with each new speculated token.