Model Card for azerbaijani-sentencepiece-tokenizer

This repository provides a SentencePiece tokenizer for the Azerbaijani language. It is a Byte-Pair Encoding (BPE) model trained with SentencePiece and wrapped as a Hugging Face T5TokenizerFast object for easy integration with the 🤗 Transformers ecosystem.
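To illustrate what the BPE model underlying this tokenizer does, here is a minimal, self-contained sketch of the BPE merge-learning loop in pure Python. It is a teaching toy on a hypothetical word-frequency dict, not the actual training code (the real tokenizer was trained with the sentencepiece library).

```python
# Toy BPE training loop: repeatedly merge the most frequent adjacent
# symbol pair. For illustration only; the real tokenizer was trained
# with the sentencepiece library, not this code.
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    """Learn `num_merges` BPE merge rules from a word -> frequency dict."""
    # Start with each word split into individual characters.
    vocab = {tuple(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for syms, freq in vocab.items():
            for a, b in zip(syms, syms[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Rewrite every word, fusing occurrences of the best pair.
        new_vocab = {}
        for syms, freq in vocab.items():
            out, i = [], 0
            while i < len(syms):
                if i + 1 < len(syms) and (syms[i], syms[i + 1]) == best:
                    out.append(syms[i] + syms[i + 1])
                    i += 2
                else:
                    out.append(syms[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

# Hypothetical Azerbaijani-flavoured corpus: word -> frequency.
corpus = {"dil": 5, "dilin": 3, "dili": 2}
print(bpe_merges(corpus, 2))  # early merges capture the frequent "dil" stem
```

SentencePiece applies the same idea at scale, operating on raw text (including whitespace, marked with ▁) rather than pre-split words.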

Model Details

Model Description

This is the model card of a 🤗 Transformers tokenizer that has been pushed to the Hub.

  • Developed by: Nazrin Burziyeva
  • Model type: SentencePiece (BPE) tokenizer
  • Language(s) (NLP): Azerbaijani
  • License: [More Information Needed]
  • Finetuned from model: N/A (trained from scratch on a raw Azerbaijani text corpus)

Uses

Preprocessing Azerbaijani text (tokenization, encoding, and decoding) for downstream NLP tasks.

How to Get Started with the Model

from transformers import AutoTokenizer

# Load tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained("nazrinburz/azerbaijani-sentencepiece-tokenizer")

# Example text in Azerbaijani
text = "Azərbaycan dilinin inkişafı mədəniyyətimizin qorunması və gələcək nəsillərə ötürülməsi üçün vacib şərtdir."

# Tokenize
tokens = tokenizer.tokenize(text)
print("Tokens:", tokens)

# Encode into IDs
ids = tokenizer.encode(text)
print("Token IDs:", ids)

# Decode back to text
decoded = tokenizer.decode(ids)
print("Decoded:", decoded)
