# Chat template utilities
## clone_chat_template[[trl.clone_chat_template]]
#### trl.clone_chat_template[[trl.clone_chat_template]]
[Source](https://github.com/huggingface/trl/blob/vr_5321/trl/chat_template_utils.py#L18)
Clones a chat template from a source tokenizer to the target tokenizer and updates the model accordingly.
This function:
- Copies the chat template from a source tokenizer to the target tokenizer.
- Adds any new tokens from the source tokenizer to the target tokenizer.
- Sets and synchronizes the EOS token across the tokenizer and model.
- Resizes the model's token embeddings to match the new vocabulary size, optionally rounding it up to a multiple of
a specified value. In such cases, dummy tokens are added to the tokenizer to ensure the vocabulary size matches
the embedding dimensions.
Example:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import clone_chat_template
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
model, tokenizer, added_tokens = clone_chat_template(model, tokenizer, "Qwen/Qwen3-0.6B")
```
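The rounding behind `resize_to_multiple_of` is plain arithmetic. The following is a minimal sketch of that step only, not TRL's implementation; the helper name `round_up_to_multiple` and the vocabulary size are hypothetical.

```python
from typing import Optional


def round_up_to_multiple(vocab_size: int, multiple: Optional[int]) -> int:
    """Round vocab_size up to the nearest multiple; None leaves it unchanged."""
    if multiple is None:
        return vocab_size
    # Ceiling division, then scale back up to a multiple.
    return -(-vocab_size // multiple) * multiple


new_vocab_size = 128_260  # hypothetical size after adding the source tokens
padded_size = round_up_to_multiple(new_vocab_size, 64)
# The gap between the two sizes is what dummy tokens would have to fill.
num_dummy_tokens = padded_size - new_vocab_size
print(padded_size, num_dummy_tokens)
```

Passing `None` skips the rounding, so the embedding size would simply equal the new vocabulary size.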
**Parameters:**
model ([PreTrainedModel](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel)) : Model to update.
tokenizer (`PreTrainedTokenizer`) : Tokenizer to update.
source_tokenizer_path (`str`) : Path or identifier of the pretrained tokenizer to clone from.
resize_to_multiple_of (`int` or `None`, *optional*, defaults to `64`) : The embedding layer will be resized to the new vocabulary size. If this is not `None`, it will round up the new vocabulary size to the nearest multiple of this value.
**Returns:**
model ([PreTrainedModel](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.PreTrainedModel)):
Updated model with resized token embeddings and EOS token configured.
tokenizer (`PreTrainedTokenizer`):
Updated tokenizer with the chat template and special tokens applied.
added_tokens (`list[int]`):
List of IDs of the tokens that were added to the tokenizer from the source tokenizer.
## is_chat_template_prefix_preserving[[trl.chat_template_utils.is_chat_template_prefix_preserving]]
#### trl.chat_template_utils.is_chat_template_prefix_preserving[[trl.chat_template_utils.is_chat_template_prefix_preserving]]
[Source](https://github.com/huggingface/trl/blob/vr_5321/trl/chat_template_utils.py#L472)
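Check whether the chat template preserves prefixes when applied, that is, whether the rendering of a conversation stays unchanged at the start of the rendering once further messages are appended.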
**Parameters:**
tokenizer (`PreTrainedTokenizer`) : Tokenizer instance to check.
**Returns:**
`bool`
`True` if the chat template preserves prefixes, `False` otherwise.
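The property can be checked mechanically: render each truncated conversation and confirm the full rendering starts with it. The sketch below does this on toy renderers; it illustrates the idea only and is not TRL's implementation. The names `is_prefix_preserving`, `render_stable`, and `render_lossy` are hypothetical.

```python
from typing import Callable

Message = dict[str, str]
Renderer = Callable[[list[Message]], str]


def is_prefix_preserving(render: Renderer, messages: list[Message]) -> bool:
    """Return True if every truncated conversation renders to a prefix of the full one."""
    full = render(messages)
    return all(full.startswith(render(messages[:k])) for k in range(1, len(messages)))


def render_stable(messages: list[Message]) -> str:
    # Each message is rendered the same way regardless of what follows it.
    return "".join(f"<{m['role']}>{m['content']}\n" for m in messages)


def render_lossy(messages: list[Message]) -> str:
    # Mimics a template that only marks the final assistant turn, so an
    # earlier turn changes once another message is appended.
    parts = []
    for i, m in enumerate(messages):
        is_last = i == len(messages) - 1
        marker = "[final]" if m["role"] == "assistant" and is_last else ""
        parts.append(f"<{m['role']}>{marker}{m['content']}\n")
    return "".join(parts)


conversation = [
    {"role": "user", "content": "What color is the sky?"},
    {"role": "assistant", "content": "It is blue."},
    {"role": "user", "content": "And at night?"},
]
print(is_prefix_preserving(render_stable, conversation))
print(is_prefix_preserving(render_lossy, conversation))
```

The lossy renderer fails the check because its output for the first two messages contains a marker that disappears once the third message is added.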
## get_training_chat_template[[trl.get_training_chat_template]]
#### trl.get_training_chat_template[[trl.get_training_chat_template]]
[Source](https://github.com/huggingface/trl/blob/vr_5321/trl/chat_template_utils.py#L610)
Get a prefix-preserving chat template for training, if needed.
If the tokenizer's template isn't prefix-preserving, returns a training-compatible template (currently supported for Qwen3 and
Qwen3.5). Otherwise, returns `None`.
Example:
```python
>>> from trl.chat_template_utils import get_training_chat_template
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
>>> messages1 = [
... {"role": "user", "content": "What color is the sky?"},
... {"role": "assistant", "content": "It is blue."},
... ]
>>> messages2 = [
... {"role": "user", "content": "What color is the sky?"},
... {"role": "assistant", "content": "It is blue."},
... {"role": "user", "content": "And at night?"},
... ]
>>> tokenizer.apply_chat_template(messages1, tokenize=False)
'<|im_start|>user\nWhat color is the sky?<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\nIt is blue.<|im_end|>\n'
>>> tokenizer.apply_chat_template(messages2, tokenize=False)
'<|im_start|>user\nWhat color is the sky?<|im_end|>\n<|im_start|>assistant\nIt is blue.<|im_end|>\n<|im_start|>user\nAnd at night?<|im_end|>\n'
>>> # The think tags are missing from the assistant turn above.
>>> chat_template = get_training_chat_template(tokenizer)
>>> tokenizer.apply_chat_template(messages1, tokenize=False, chat_template=chat_template)
'<|im_start|>user\nWhat color is the sky?<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\nIt is blue.<|im_end|>\n'
>>> tokenizer.apply_chat_template(messages2, tokenize=False, chat_template=chat_template)
'<|im_start|>user\nWhat color is the sky?<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n\nIt is blue.<|im_end|>\n<|im_start|>user\nAnd at night?<|im_end|>\n'
```
**Parameters:**
tokenizer (`PreTrainedTokenizer`) : Tokenizer instance to check.
**Returns:**
`str` or `None`
Training-compatible chat template, or `None` if no patching is needed.
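Because the return value is `None` when no patching is needed, it can be passed straight to `apply_chat_template`, whose `chat_template` argument falls back to the tokenizer's own template when it is `None`. The helper below is a minimal sketch of that pattern. `render_for_training` is a hypothetical name, and the template getter is passed in as an argument so the snippet does not depend on a particular TRL version.

```python
def render_for_training(tokenizer, messages, get_template):
    """Render messages, using a training template when one is provided.

    get_template is assumed to behave like get_training_chat_template:
    it returns a template string, or None if the tokenizer's template is fine.
    """
    chat_template = get_template(tokenizer)
    # With chat_template=None, apply_chat_template uses tokenizer.chat_template.
    return tokenizer.apply_chat_template(messages, tokenize=False, chat_template=chat_template)
```

In real use this would be called as `render_for_training(tokenizer, messages, get_training_chat_template)`.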
