Model Card for Friday Tokenizer

Friday Tokenizer is a custom multilingual tokenizer built from scratch for Bengali, English, and Banglish conversational AI systems.


Model Details

Model Description

Friday Tokenizer is a SentencePiece-based subword tokenizer designed for lightweight GPT-style language models and conversational AI applications. It was developed as part of the Friday GPT project to support Bengali and multilingual NLP without relying on existing pre-trained tokenizers.

The tokenizer is optimized for conversational datasets, mixed Bengali-English text, and Banglish (Romanized Bengali) inputs.

  • Developed by: Debashish Roy
  • Funded by: Self-funded
  • Shared by: Debashish Roy
  • Model type: SentencePiece Tokenizer
  • Language(s) (NLP): Bengali, English, Banglish
  • License: Apache 2.0
  • Finetuned from model: None (built from scratch)


Uses

Direct Use

This tokenizer is intended for:

  • GPT-style decoder-only language models
  • Conversational AI systems
  • Bengali NLP experiments
  • Banglish text generation
  • Lightweight multilingual language models

Downstream Use

The tokenizer can be integrated into:

  • Chatbots
  • Language generation systems
  • Translation systems
  • Bengali AI assistants
  • Custom transformer training pipelines

Out-of-Scope Use

This tokenizer is not optimized for:

  • Formal literary Bengali
  • Legal or medical NLP applications
  • High-precision linguistic analysis
  • Production-scale multilingual systems without further evaluation

Bias, Risks, and Limitations

The tokenizer was trained primarily on conversational and subtitle-style datasets. As a result:

  • Informal language patterns may be overrepresented
  • Rare words may split aggressively
  • Banglish spelling inconsistencies may affect tokenization quality
  • Dataset biases from subtitle and internet conversations may exist
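
One way to gauge how aggressively words are split is to count tokens per whitespace-separated word, often called fertility. The sketch below is a minimal illustration and not part of the released tokenizer: `tokens_per_word` is a hypothetical helper, the Banglish sample is illustrative, and `toy_tokenize` is a stand-in that chops words into 3-character pieces so the example runs without downloading anything.

```python
from typing import Callable, Iterable, List


def tokens_per_word(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    # Avoid division by zero on an empty sample.
    return total_tokens / total_words if total_words else 0.0


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): split each word into 3-character chunks."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return pieces


samples = ["ami aj baire jacchi", "I am going out today"]
print(tokens_per_word(toy_tokenize, samples))
```

With the real tokenizer loaded, passing `tokenizer.tokenize` in place of `toy_tokenize` gives the same measurement. Comparing the value across Bengali, English, and Banglish samples shows where splitting is heaviest.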

Recommendations

Users should evaluate the tokenizer on their own data before deploying it in sensitive or production environments. For specialized domains, retraining on in-domain text or expanding the vocabulary may improve tokenization quality.


How to Get Started with the Model

Use the code below to get started with the tokenizer.

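from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub.
# use_fast=False selects the Python ("slow") tokenizer implementation.
tokenizer = AutoTokenizer.from_pretrained(
    "thedeba/friday-tokenizer",
    use_fast=False
)

text = "আমি আজ বাইরে যাচ্ছি"

# Split the text into subword pieces and map it to vocabulary IDs.
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(tokens)
print(ids)

# Convert the IDs back to text to check the round trip.
decoded = tokenizer.decode(ids)
print(decoded)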

Training Details

Training Data

The tokenizer was trained on a mix of multilingual conversational data, including:

  • OpenSubtitles
  • Bengali conversational text
  • Bengali-English mixed text
  • Banglish datasets

Training Procedure

The tokenizer was trained from scratch using SentencePiece subword tokenization.

Preprocessing

  • Unicode normalization
  • Text cleaning
  • Duplicate filtering
  • Mixed-language corpus preparation
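
The preprocessing steps listed above could look roughly like the following. This is a sketch under assumptions, not the project's actual pipeline: the card does not say which normalization form or cleaning rules were used, so NFC, control-character removal, and whitespace collapsing are illustrative choices, and `clean_line` and `prepare_corpus` are hypothetical helper names.

```python
import re
import unicodedata
from typing import Iterable, List


def clean_line(line: str) -> str:
    """Normalize Unicode and collapse whitespace (NFC is an assumption)."""
    line = unicodedata.normalize("NFC", line)
    # Drop control characters (category Cc) except ordinary whitespace.
    # Format characters such as the zero-width joiner (category Cf) are kept,
    # since Bengali text can rely on them.
    line = "".join(
        ch for ch in line if ch in "\t\n " or unicodedata.category(ch) != "Cc"
    )
    return re.sub(r"\s+", " ", line).strip()


def prepare_corpus(lines: Iterable[str]) -> List[str]:
    """Clean each line, skip empty ones, and keep only the first copy of duplicates."""
    seen = set()
    result: List[str] = []
    for raw in lines:
        line = clean_line(raw)
        if line and line not in seen:
            seen.add(line)
            result.append(line)
    return result


raw_lines = ["আমি  আজ বাইরে যাচ্ছি ", "আমি আজ বাইরে যাচ্ছি", "", "ami aj baire jacchi\x00"]
print(prepare_corpus(raw_lines))
```

Deduplication here is exact-match after cleaning. Near-duplicate filtering, which subtitle data often needs, would require a separate step.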

Training Hyperparameters

  • Vocabulary Size: 32000
  • Training regime: SentencePiece subword training
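
For readers who want to train a comparable tokenizer, the following sketch shows how a SentencePiece run with the stated vocabulary size might be configured. Only `vocab_size=32000` comes from this card. The model type and character coverage are assumptions (they are SentencePiece's defaults), and the file names and helper functions are hypothetical.

```python
from typing import Any, Dict


def build_trainer_args(corpus_path: str, model_prefix: str) -> Dict[str, Any]:
    """Assemble keyword arguments for sentencepiece.SentencePieceTrainer.train."""
    return {
        "input": corpus_path,          # plain-text corpus, one sentence per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": 32000,           # stated in this card
        "model_type": "unigram",       # assumption: the card does not name the algorithm
        "character_coverage": 0.9995,  # assumption: SentencePiece's default value
    }


def train_tokenizer(args: Dict[str, Any]) -> None:
    """Run training; needs the sentencepiece package and an existing corpus file."""
    import sentencepiece as spm  # imported lazily so the config can be inspected without it

    spm.SentencePieceTrainer.train(**args)


args = build_trainer_args("corpus.txt", "friday_sketch")
print(args)
# train_tokenizer(args)  # uncomment once corpus.txt exists
```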

Speeds, Sizes, Times

  • Lightweight tokenizer suitable for low-resource devices
  • Compact 32,000-token vocabulary, which keeps the embedding table small for efficient inference
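
To make the size claim concrete, the arithmetic below estimates how large a token embedding table would be for a 32,000-token vocabulary. The hidden sizes are hypothetical model widths chosen for illustration. The card does not specify the architecture of any model that uses this tokenizer.

```python
VOCAB_SIZE = 32000  # from this card


def embedding_parameters(vocab_size: int, hidden_size: int) -> int:
    """Number of weights in a vocab_size x hidden_size embedding matrix."""
    return vocab_size * hidden_size


def embedding_megabytes(vocab_size: int, hidden_size: int, bytes_per_weight: int = 4) -> float:
    """Approximate storage in MB (10**6 bytes), defaulting to 32-bit floats."""
    return embedding_parameters(vocab_size, hidden_size) * bytes_per_weight / 1e6


for hidden_size in (256, 512, 768):  # hypothetical model widths
    params = embedding_parameters(VOCAB_SIZE, hidden_size)
    print(hidden_size, params, embedding_megabytes(VOCAB_SIZE, hidden_size))
```

At a hidden size of 512, for instance, the table holds 16,384,000 weights, or about 65.5 MB in 32-bit floats.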

Evaluation

Testing Data, Factors & Metrics

Testing Data

Internal conversational Bengali-English text samples were used for qualitative evaluation.

Factors

Evaluation focused on:

  • Bengali Unicode support
  • Mixed-language tokenization
  • Banglish handling
  • Conversational token quality
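
To inspect these factors separately, test samples can be grouped by script before tokenizing. The sketch below is one simple way to do that, using the Bengali Unicode block (U+0980 to U+09FF). `script_of` and `bucket_by_script` are hypothetical helpers, and the heuristic cannot tell English from Banglish, since both use Latin letters.

```python
from typing import Dict, Iterable, List


def script_of(text: str) -> str:
    """Label text as 'bengali', 'latin', 'mixed', or 'other' based on its letters."""
    has_bengali = any("\u0980" <= ch <= "\u09ff" for ch in text)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_bengali and has_latin:
        return "mixed"
    if has_bengali:
        return "bengali"
    if has_latin:
        return "latin"
    return "other"


def bucket_by_script(texts: Iterable[str]) -> Dict[str, List[str]]:
    """Group texts under the label returned by script_of."""
    buckets: Dict[str, List[str]] = {}
    for text in texts:
        buckets.setdefault(script_of(text), []).append(text)
    return buckets


samples = ["আমি আজ বাইরে যাচ্ছি", "ami aj baire jacchi", "আমি office যাচ্ছি"]
for label, group in bucket_by_script(samples).items():
    print(label, len(group))
```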

Metrics

Evaluation relied primarily on qualitative inspection of token splits and on reconstruction accuracy, that is, whether decoding the token IDs recovers the original text.
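
Reconstruction accuracy can be computed as the share of samples that survive an encode-decode round trip unchanged. The following is a minimal sketch of that check, not the evaluation script used for this card. `reconstruction_accuracy` is a hypothetical helper, and the toy codec (Unicode code points as IDs) is a stand-in so the example runs offline.

```python
from typing import Callable, Iterable, List


def reconstruction_accuracy(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    texts: Iterable[str],
) -> float:
    """Fraction of texts for which decode(encode(text)) equals the input exactly."""
    items = list(texts)
    if not items:
        return 0.0
    matches = sum(1 for text in items if decode(encode(text)) == text)
    return matches / len(items)


def toy_encode(text: str) -> List[int]:
    """Stand-in encoder (assumption): one ID per Unicode code point."""
    return [ord(ch) for ch in text]


def toy_decode(ids: List[int]) -> str:
    """Inverse of toy_encode."""
    return "".join(chr(i) for i in ids)


samples = ["আমি আজ বাইরে যাচ্ছি", "ami aj baire jacchi"]
print(reconstruction_accuracy(toy_encode, toy_decode, samples))
```

With the real tokenizer, the callables could be `lambda t: tokenizer.encode(t, add_special_tokens=False)` and `lambda ids: tokenizer.decode(ids, skip_special_tokens=True)`. Exact string equality is a strict criterion: SentencePiece may normalize whitespace, so comparing normalized text may be more appropriate.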

Results

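On the internal samples, qualitative inspection indicated that the tokenizer segments Bengali, English, and Banglish conversational text into subwords and that decoding reconstructs the input. No numeric results are reported here.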

Summary

Friday Tokenizer provides lightweight multilingual tokenization suitable for GPT-style language models and Bengali conversational AI applications.


Model Examination

Basic qualitative inspection was performed to verify token splitting and text reconstruction quality.


Environmental Impact

Carbon emissions were not formally tracked during tokenizer training.

  • Hardware Type: Consumer GPU / CPU
  • Hours used: Not recorded
  • Cloud Provider: Google Colab
  • Compute Region: Not specified
  • Carbon Emitted: Unknown

Technical Specifications

Model Architecture and Objective

  • Architecture: SentencePiece tokenizer
  • Objective: Multilingual subword tokenization for conversational AI

Compute Infrastructure

Training was performed using local and cloud-based environments.

Hardware

  • Consumer-grade hardware
  • Google Colab environment

Software

  • Python
  • SentencePiece
  • Hugging Face Transformers

Citation

BibTeX

@misc{fridaytokenizer2026,
  title={Friday Tokenizer},
  author={Debashish Roy},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/thedeba/friday-tokenizer}}
}

APA

Roy, D. (2026). Friday Tokenizer. Hugging Face. https://huggingface.co/thedeba/friday-tokenizer


Glossary

  • Banglish: Bengali written using the Latin alphabet
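  • Subword Tokenization: Splitting text into units smaller than words, so that rare or unseen words can be represented as sequences of known pieces
  • SentencePiece: A language-independent subword tokenizer and detokenizer library that trains directly on raw text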

More Information

Friday Tokenizer is part of the broader Friday GPT ecosystem, which focuses on building lightweight multilingual AI systems from scratch.


Model Card Authors

Debashish Roy


Model Card Contact

For questions or collaboration:

