---
library_name: transformers
tags:
  - llama
  - gpt
  - malayalam
  - text-generation-inference
license: mit
datasets:
  - uonlp/CulturaX
language:
  - ml
pipeline_tag: text-classification
---

## About

- This tokenizer was trained on the CulturaX dataset, using 1.2 million data points sampled at random.
- Training was done with Google's SentencePiece library.
- The trained tokens were then added to the LlamaTokenizer, growing the vocabulary from the original 32,000 tokens to 49,120.
- The merging follows the procedure used in the Chinese-LLaMA-Alpaca project.
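The merge step above boils down to appending newly trained pieces to the base vocabulary only when they are not already present. Below is a minimal sketch of that logic; the function name and the tiny stand-in vocabularies are illustrative, not taken from the released artifacts or the Chinese-LLaMA-Alpaca scripts.

```python
# Hypothetical sketch of the vocabulary-merge step: pieces from a newly
# trained SentencePiece model are appended to the base Llama vocabulary
# only if they are not already present, preserving original ordering.

def merge_vocabs(base_pieces, new_pieces):
    """Return base_pieces plus any pieces from new_pieces not already present."""
    seen = set(base_pieces)
    merged = list(base_pieces)
    for piece in new_pieces:
        if piece not in seen:
            merged.append(piece)
            seen.add(piece)
    return merged

base = ["<s>", "</s>", "▁the", "▁and"]        # stand-in for the 32,000 Llama pieces
malayalam = ["▁the", "▁മലയാളം", "▁ഭാഷ"]        # stand-in for the new Malayalam pieces
merged = merge_vocabs(base, malayalam)
print(len(base), len(malayalam), len(merged))  # overlapping pieces are kept once
```

In the real merge, the base and new vocabularies come from the SentencePiece model protos, and the merged piece list is written back into a new tokenizer model before being loaded as a `LlamaTokenizer`.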

## Usage

```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("learninbit/malayalam-llama-2-tokenizer-v0.1")
text = "ഹനഫസ ഹഫഞ്ചഥ ചകഡു ടെണല ഡൃൊമത്തീഴ ടഞ്ഞഭഞ റദ്ധഷ ഌിപത്മഫഥ ടജ്ജഡ്ഡപ്പെവ പഴുണൊ."
tokens = tokenizer.tokenize(text)
```