---
library_name: transformers
license: mit
---
# Model Card for Model ID
Based on: https://huggingface.co/microsoft/phi-2

Summary of changes made:
1. Added new special tokens for padding (`[PAD]`) and the ChatML delimiters (`<|im_start|>`, `<|im_end|>`) to support further finetuning on instruction/chat datasets
2. Resized the embedding layer and the final output layer
   - https://huggingface.co/microsoft/phi-2/discussions/22#659d8ba950c1bbee5be6f179
   - The original embedding size is 51200, but only 50295 tokens were actually used by the tokenizer
   - Resized the final embedding matrix to avoid confusion, so it now aligns with the tokenizer vocabulary
   - https://huggingface.co/microsoft/phi-2/discussions/43#659d8d3418dc7360290a4734
# Code for Reproducibility
```python
import torch
import transformers

transformers.set_seed(42)
torch.set_default_device("cuda")

model_checkpoint = "microsoft/phi-2"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_checkpoint)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_checkpoint,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

# Register the ChatML delimiters and a dedicated padding token
num_added_tokens = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"], "pad_token": "[PAD]"}
)

# Resize the input embeddings and tied output head to match the tokenizer vocabulary
model.resize_token_embeddings(len(tokenizer))
```
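Once the tokens are registered, instruction data can be rendered in the ChatML layout these delimiters exist for. A minimal sketch of that layout follows; the `to_chatml` helper and the example messages are illustrative assumptions, not part of this repository:

```python
def to_chatml(messages):
    """Render a list of {role, content} dicts as a ChatML prompt string."""
    parts = []
    for m in messages:
        # Each turn is wrapped in the <|im_start|>/<|im_end|> delimiters
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    # Leave the assistant turn open so the model generates the reply
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Phi-2?"},
])
print(prompt)
```

Because `<|im_start|>` and `<|im_end|>` were added as special tokens above, the tokenizer maps each delimiter to a single token id rather than splitting it into sub-word pieces.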