Indic Vocab Files Only - Not a Model
This is not a model: this repo contains only BPE tokenizer vocab files (with fallback enabled), trained on 12 Indian languages (a few billion tokens). It adds 16k extra tokens and reduces token counts by 50-60% or more for some languages.
Languages Trained on
- Bengali
- Gujarati
- Hindi
- Kannada
- Malayalam
- Marathi
- Odia
- Punjabi
- Sanskrit
- Tamil
- Telugu
- Urdu
Code to merge the vocab with an existing model
```python
from transformers import AutoModel, AutoTokenizer
import sentencepiece as spm

# Load the base model and its tokenizer
model = AutoModel.from_pretrained("model_path")
tokenizer = AutoTokenizer.from_pretrained("model_path")

# Load the Indic SentencePiece vocab shipped in this repo
sp = spm.SentencePieceProcessor(model_file="tok16000.model")
new_vocab = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

# Add only the pieces the base tokenizer doesn't already have
new_tokens = set(new_vocab) - set(tokenizer.get_vocab().keys())
tokenizer.add_tokens(list(new_tokens))

# Grow the embedding matrix to match the enlarged vocab, then save both
model.resize_token_embeddings(len(tokenizer))
model.save_pretrained("model_extended_vocab")
tokenizer.save_pretrained("model_extended_vocab")
```
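A quick way to sanity-check the claimed token reduction is to tokenize the same sentence with the base tokenizer and the extended one and compare counts. A minimal sketch of the arithmetic (the helper function and the example counts below are illustrative, not measured from this repo):

```python
def reduction_pct(base_tokens: int, extended_tokens: int) -> float:
    """Percentage of tokens saved by the extended tokenizer
    relative to the base tokenizer for the same input text."""
    return 100.0 * (base_tokens - extended_tokens) / base_tokens

# Hypothetical example: a sentence that takes 48 tokens with the base
# tokenizer but only 19 after merging the Indic vocab.
print(round(reduction_pct(48, 19), 1))  # ~60% fewer tokens
```

In practice you would obtain the two counts with `len(tokenizer("..."

).input_ids)` before and after the merge on text in one of the listed languages.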