Formatting Datasets for Chat Template Compatibility
When working with datasets for fine-tuning conversational models, it's essential to ensure that the data is formatted correctly to work seamlessly with any chat template. In this article, we'll explore a Python function that transforms the
nroggendorff/mayo dataset from Hugging Face into a compatible format.
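Judging by the tags the function strips, each row's `text` field presumably holds alternating user/bot turns, each terminated by an `<|end|>` marker. This is an assumption inferred from the parsing code rather than documented structure of the mayo dataset, but a minimal illustration of that shape would be:

```python
# Hypothetical raw row, inferred from the tags the parser strips;
# the actual formatting in nroggendorff/mayo may differ.
raw = "<|user|> What is mayo?<|end|><|bot|> An emulsified sauce.<|end|>"

# Splitting on '<|end|>' yields alternating user/bot parts
# plus a trailing empty string from the final delimiter.
parts = raw.split("<|end|>")
print(parts)
# → ['<|user|> What is mayo?', '<|bot|> An emulsified sauce.', '']
```

That trailing empty string is why the loop below stops at `len(parts) - 1`.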
The format_prompts Function
Here's a breakdown of the format_prompts function:
def format_prompts(examples):
    texts = []
    for text in examples['text']:
        conversation = []
        # Split the raw string into alternating user/bot turns.
        parts = text.split('<|end|>')
        for i in range(0, len(parts) - 1, 2):
            prompt = parts[i].replace("<|user|>", "")
            response = parts[i + 1].replace("<|bot|>", "")
            conversation.append({"role": "user", "content": prompt})
            conversation.append({"role": "assistant", "content": response})
        # `tokenizer` must already be defined in the surrounding scope,
        # e.g. loaded with AutoTokenizer.from_pretrained(...).
        formatted_conversation = tokenizer.apply_chat_template(conversation, tokenize=False)
        texts.append(formatted_conversation)
    return {"text": texts}
The function takes an examples parameter, which is expected to be a dictionary containing a 'text' key with a list of conversation strings.
- We initialize an empty list called `texts` to store the formatted conversations.
- We iterate over each `text` in `examples['text']`:
  - We split the `text` using the delimiter `'<|end|>'` to separate the conversation into parts.
  - We iterate over the `parts` in steps of 2, assuming that even indices represent user prompts and odd indices represent bot responses.
  - We extract the `prompt` and `response` by removing the `"<|user|>"` and `"<|bot|>"` tags, respectively.
  - We append the `prompt` and `response` to the `conversation` list as dictionaries with `"role"` and `"content"` keys.
  - After processing all the parts, we apply the chat template to the `conversation` using `tokenizer.apply_chat_template()`, with `tokenize` set to `False` to avoid tokenization at this stage.
  - We append the `formatted_conversation` to the `texts` list.
- Finally, we create an output dictionary with a `'text'` key containing the list of formatted conversations and return it.
Usage
To use the `format_prompts` function, pass your dataset examples to it (note that the function relies on a `tokenizer` already defined in scope):
from datasets import load_dataset
dataset = load_dataset("nroggendorff/mayo", split="train")
dataset = dataset.map(format_prompts, batched=True)
dataset['text'][2] # Check to see if the fields were formatted correctly
By applying this formatting step, you can ensure that your dataset is compatible with various chat templates, making it easier to fine-tune conversational models for different use cases.