---
library_name: transformers
license: mit
---
# Model Card for Model ID
Based on: https://huggingface.co/microsoft/phi-2

Summary of changes made:
1. Added new special tokens for padding (`[PAD]`) and the ChatML delimiters (`<|im_start|>`, `<|im_end|>`) to support further finetuning on instruction/chat datasets
2. Resized the embedding layer and the final output layer
   - https://huggingface.co/microsoft/phi-2/discussions/22#659d8ba950c1bbee5be6f179
   - The original embedding size is 51200, but only 50295 tokens were actually used by the tokenizer
   - Resized the final embedding matrix to avoid confusion, so it now aligns with the tokenizer vocabulary
   - https://huggingface.co/microsoft/phi-2/discussions/43#659d8d3418dc7360290a4734
# Code for Reproducibility
```python
import torch
import transformers

transformers.set_seed(42)
torch.set_default_device("cuda")

model_checkpoint = "microsoft/phi-2"
tokenizer = transformers.AutoTokenizer.from_pretrained(model_checkpoint)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_checkpoint,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

# Register the ChatML delimiters and a dedicated padding token
num_added_tokens = tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<|im_start|>", "<|im_end|>"], "pad_token": "[PAD]"}
)

# Resize the input embeddings and tied output head to match the tokenizer vocabulary
model.resize_token_embeddings(len(tokenizer))
```
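Once the tokens are registered, instruction data can be rendered in the ChatML layout these delimiters exist for. A minimal sketch of that layout follows; the `to_chatml` helper and the example messages are illustrative assumptions, not part of this repository:

```python
def to_chatml(messages):
    """Render a list of {role, content} dicts as a ChatML prompt string."""
    parts = []
    for m in messages:
        # Each turn is wrapped in the <|im_start|>/<|im_end|> delimiters
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>")
    # Leave the assistant turn open so the model generates the reply
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Phi-2?"},
])
print(prompt)
```

Because `<|im_start|>` and `<|im_end|>` were added as special tokens above, the tokenizer maps each delimiter to a single token id rather than splitting it into sub-word pieces.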