---
license: mit
language:
- fa
tags:
- persian
- llama
---
I trained Llama2-7B after extending its tokenizer by 21,455 tokens, on about 15B tokens of Persian (Farsi) text (Common Crawl, social media, papers).
| ``` | |
| from transformers import LlamaForCausalLM, AutoTokenizer | |
| import torch | |
| model = LlamaForCausalLM.from_pretrained("mostafaamiri/base_7B") | |
| tokenizer = AutoTokenizer.from_pretrained("mostafaamiri/llama2_7B_15Btoken") | |
| model.resize_token_embeddings(len(tokenizer)) | |
| model.load_adapter("mostafaamiri/llama2_7B_15Btoken") | |
| ``` |