
Llama-3.2-1B-Kannada-Tokenizer

This repository provides a custom tokenizer that extends the original meta-llama/Llama-3.2-1B tokenizer with enhanced support for the Kannada language.

Model Details

Model Description

This tokenizer is an extended version of the meta-llama/Llama-3.2-1B tokenizer. It has been augmented with a custom Kannada subword vocabulary learned from a large Kannada text corpus (e.g., Wikipedia and similar sources) using a SentencePiece implementation of the Byte-Pair Encoding (BPE) algorithm.

The goal of this extension is to improve tokenization quality for Kannada text by:

Reducing fragmented or broken Unicode token segments.

Shortening token sequences by learning language-aware subword units.

Enabling better semantic understanding for Kannada during fine-tuning or continual pretraining of LLMs.
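The fragmentation problem is easy to see at the byte level: every Kannada character occupies three bytes in UTF-8, so a tokenizer with no Kannada-specific entries that falls back to raw bytes can emit up to three tokens per character. A small self-contained illustration (plain Python, no tokenizer required):

```python
# Each Kannada code point encodes to 3 UTF-8 bytes, so byte-level fallback
# can triple the token count compared to one learned subword per character.
text = "ನಮಸ್ತೆ"  # "Namaste" in Kannada

print(len(text))                  # 6 Unicode code points
print(len(text.encode("utf-8")))  # 18 UTF-8 bytes
```

A Kannada-aware subword vocabulary collapses such spans into far fewer tokens, which is exactly the sequence-length saving described above.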

The tokenizer:

  • Retains the full original vocabulary of Llama 3.2-1B, ensuring backward compatibility.

  • Adds only new Kannada-specific tokens that were missing from the base vocabulary.

  • Can be directly plugged into any model originally using the Llama 3.2–1B tokenizer.

  • Developed by: Manjunath S N

  • Model type: SentencePiece BPE Tokenizer (extended)

  • Language(s) (NLP): English (base), Kannada (enhanced), and other languages supported by the base Llama 3.2-1B tokenizer.

  • License: MIT License

  • Extended from tokenizer: meta-llama/Llama-3.2-1B

Uses

Direct Use

This tokenizer is designed for:

Preprocessing Kannada text for the Llama 3.2-1B model.

Analyzing tokenization patterns in Kannada text.

Providing efficient subword segmentation for Kannada in NLP pipelines.

Downstream Use

This tokenizer is a critical component for:

Training or fine-tuning any Llama 3.2-compatible LLM to achieve high performance in Kannada.

Developing applications that require accurate and efficient tokenization of Kannada text (e.g., search, sentiment analysis, text classification).

Research on multilingual tokenization and subword units for Indic languages.

Out-of-Scope Use

Do not use this tokenizer with models whose embedding layers have not been resized to accommodate the extended vocabulary; doing so will produce incorrect or meaningless outputs.
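With transformers, the resize is a single call, model.resize_token_embeddings(len(tokenizer)). The toy sketch below shows what that resize amounts to, using made-up dimensions (the real Llama 3.2-1B embedding table is far larger, roughly 128256 × 2048) and a mean-embedding initialization as one common heuristic for the new rows:

```python
import torch
import torch.nn as nn

# Toy dimensions standing in for the real model's embedding table.
old_vocab, hidden, n_new = 1000, 64, 5

old_emb = nn.Embedding(old_vocab, hidden)
new_emb = nn.Embedding(old_vocab + n_new, hidden)

with torch.no_grad():
    # Keep the original rows so existing token IDs behave identically ...
    new_emb.weight[:old_vocab] = old_emb.weight
    # ... and initialize the added rows, here with the mean embedding
    # (one common heuristic; transformers applies its own default init).
    new_emb.weight[old_vocab:] = old_emb.weight.mean(dim=0)

print(new_emb.weight.shape)  # torch.Size([1005, 64])
```

Without this step, the new Kannada token IDs index past the end of the model's embedding matrix, which is why unresized models fail.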

How to Get Started with the Model

Use the code below to get started with the tokenizer.

from transformers import AutoTokenizer

tokenizer_id = "imanjunathn/Llama-3.2-1B-Kannada-Tokenizer"

# Load the custom tokenizer
tokenizer = AutoTokenizer.from_pretrained(tokenizer_id)

# Example tokenization in Kannada
test_text_kannada = "ನಮಸ್ತೆ, ಇದು ಕನ್ನಡದಲ್ಲಿ ಹೊಸ ಟೋಕನೈಸರ್ ಆಗಿದೆ. ಕನ್ನಡವು ಒಂದು ಸುಂದರ ಭಾಷೆ. ಬೆಂಗಳೂರು ಕರ್ನಾಟಕದ ರಾಜಧಾನಿ."
test_tokens = tokenizer.tokenize(test_text_kannada)
print(f"Original Kannada text: '{test_text_kannada}'")
print(f"Tokenized output (adapted tokenizer): {test_tokens}")
print(f"Decoded output: '{tokenizer.decode(tokenizer.encode(test_text_kannada), skip_special_tokens=True)}'")

# Check vocabulary size
print(f"Tokenizer Vocabulary Size: {len(tokenizer)}")

Training Details

Training Procedure

Tokenizer Algorithm: a SentencePiece BPE model trained on the Kannada corpus.

Vocabulary Extension Process:

1. The base meta-llama/Llama-3.2-1B tokenizer was loaded.

2. A new SentencePiece BPE model was trained on the preprocessed Kannada dataset with a target vocabulary size sufficient to capture Kannada subwords effectively.

3. Tokens generated by the new Kannada SentencePiece model were compared against the existing Llama 3.2-1B vocabulary.

4. Only unique Kannada tokens (i.e., those not already present in the Llama 3.2-1B vocabulary) were extracted.

5. These unique Kannada tokens were then added to the original Llama 3.2-1B tokenizer's vocabulary using tokenizer.add_tokens().
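The comparison-and-add steps amount to a set difference against the base vocabulary. A toy sketch of that filtering step (the vocabularies here are placeholders; in the real procedure, base_vocab would come from tokenizer.get_vocab() of meta-llama/Llama-3.2-1B and candidate_tokens from the freshly trained Kannada SentencePiece model):

```python
# Placeholder stand-ins for the real vocabularies.
base_vocab = {"hello": 0, "world": 1, "ನ": 2}
candidate_tokens = ["ನಮಸ್ತೆ", "ಕನ್ನಡ", "hello", "ನ"]

# Keep only tokens the base vocabulary does not already contain.
new_tokens = [t for t in candidate_tokens if t not in base_vocab]
print(new_tokens)  # ['ನಮಸ್ತೆ', 'ಕನ್ನಡ']

# On the real tokenizer (requires the gated base model):
# n_added = tokenizer.add_tokens(new_tokens)
# model.resize_token_embeddings(len(tokenizer))
```

Filtering first is what guarantees the extension only appends new IDs and never disturbs existing ones, preserving backward compatibility.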

Training Hyperparameters (for new SPM):

Model Type: BPE

Vocabulary Size: 32000

Character Coverage: 0.995 (recommended for languages with rich character sets like Kannada)

Summary

This custom tokenizer extends the Llama 3.2-1B tokenizer with a specialized Kannada vocabulary. By applying the BPE algorithm to a dedicated Kannada corpus, it provides more granular, linguistically informed subword segmentation for Kannada text.
