
Indic Vocab Files Only - Not a model

This repository is not a model: it contains only the vocabulary files of a BPE tokenizer (with fallback enabled) trained on 12 Indian languages (a few billion tokens), adding 16k extra tokens. For some languages it reduces token counts by 50-60% or more.
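To make the headline number concrete, here is the arithmetic behind a "50-60% reduction" claim; the token counts below are purely illustrative, not measured with these vocab files:

```python
def token_reduction(base_count: int, extended_count: int) -> float:
    """Percent reduction in token count after extending the vocabulary."""
    return 100.0 * (base_count - extended_count) / base_count

# Illustrative only: if a Hindi sentence tokenizes to 40 pieces with the
# base tokenizer but 16 with the extended one, the reduction is 60%.
print(token_reduction(40, 16))  # → 60.0
```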

Languages Trained on

  • Bengali
  • Gujarati
  • Hindi
  • Kannada
  • Malayalam
  • Marathi
  • Odia
  • Punjabi
  • Sanskrit
  • Tamil
  • Telugu
  • Urdu

Code to merge with an existing model

from transformers import AutoModel, AutoTokenizer
import sentencepiece as spm

# Load the base model and its tokenizer
model = AutoModel.from_pretrained("model_path")
tokenizer = AutoTokenizer.from_pretrained("model_path")

# Existing vocabulary of the base tokenizer
vocabulary = set(tokenizer.get_vocab().keys())

# Load the Indic SentencePiece model (16k pieces)
sp = spm.SentencePieceProcessor(model_file='tok16000.model')
new_vocab = [sp.id_to_piece(i) for i in range(sp.get_piece_size())]

# Keep only the pieces the base tokenizer does not already have
new_tokens = set(new_vocab) - vocabulary

# Add them and grow the embedding matrix to match
tokenizer.add_tokens(list(new_tokens))
model.resize_token_embeddings(len(tokenizer))

# Save the extended model and tokenizer together
model.save_pretrained("model_extended_vocab")
tokenizer.save_pretrained("model_extended_vocab")
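The key step in the snippet above is the set difference, which ensures pieces already present in the base vocabulary are not added twice. A minimal standalone illustration with toy vocabularies (no model files needed; the names below are placeholders, not real tokenizer contents):

```python
# Toy base vocabulary (piece -> id), standing in for tokenizer.get_vocab()
base_vocab = {"hello": 0, "world": 1, "▁the": 2}

# Toy pieces from a SentencePiece model, standing in for sp.id_to_piece(...)
sp_pieces = ["hello", "▁नमस्ते", "▁दुनिया", "world"]

# Only pieces absent from the base vocabulary get added
new_tokens = sorted(set(sp_pieces) - set(base_vocab))
print(new_tokens)  # → ['▁दुनिया', '▁नमस्ते']
```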