Model Card for Friday Tokenizer

Friday Tokenizer is a custom multilingual tokenizer built completely from scratch for Bengali, English, and Banglish conversational AI systems.

Model Details

Model Description

Friday Tokenizer is a SentencePiece-based subword tokenizer designed for lightweight GPT-style language models and conversational AI applications. It was developed as part of the Friday GPT project to support Bengali and multilingual NLP without relying on existing pre-trained tokenizers.

The tokenizer is optimized for conversational datasets, mixed Bengali-English text, and Banglish (Romanized Bengali) inputs.

Developed by: Debashish Roy
Funded by: Self-funded
Shared by: Debashish Roy
Model type: SentencePiece Tokenizer
Language(s) (NLP): Bengali, English, Banglish
License: Apache 2.0
Finetuned from model: None (built from scratch)

Model Sources

Repository: https://huggingface.co/thedeba/friday-tokenizer
Paper: Not available
Demo: Not available

Uses

Direct Use

This tokenizer is intended for:

GPT-style decoder-only language models
Conversational AI systems
Bengali NLP experiments
Banglish text generation
Lightweight multilingual language models

Downstream Use

The tokenizer can be integrated into:

Chatbots
Language generation systems
Translation systems
Bengali AI assistants
Custom transformer training pipelines

Out-of-Scope Use

This tokenizer is not optimized for:

Formal literary Bengali
Legal or medical NLP applications
High-precision linguistic analysis
Production-scale multilingual systems without further evaluation

Bias, Risks, and Limitations

The tokenizer was trained primarily on conversational and subtitle-style datasets. As a result:

Informal language patterns may be overrepresented
Rare words may be split into many short subword pieces
Banglish spelling inconsistencies may affect tokenization quality
Dataset biases from subtitle and internet conversations may exist
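
One way to put a number on how heavily rare words are split is fertility, the average number of tokens produced per whitespace-separated word. The sketch below is only an illustration and is not part of the Friday Tokenizer release: `fertility` is a hypothetical helper, and `char_tokenize` is a stand-in tokenizer so the example runs without downloading anything. In practice, `tokenizer.tokenize` from the loaded tokenizer could be passed instead.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word.

    Higher values mean words are being broken into more pieces.
    """
    total_words = 0
    total_tokens = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        return 0.0
    return total_tokens / total_words


def char_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


if __name__ == "__main__":
    samples = ["ami aj baire jacchi", "I am going out today"]
    print(f"fertility: {fertility(samples, char_tokenize):.2f}")
```

Comparing fertility on conversational text against fertility on domain-specific text gives a rough signal of where the vocabulary fits poorly.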

Recommendations

Users should evaluate tokenizer performance on their own data before deploying it in sensitive or production environments. Retraining the tokenizer or expanding its vocabulary may improve performance for specialized domains.

How to Get Started with the Model

Use the code below to get started with the tokenizer.

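from transformers import AutoTokenizer

# Load the slow (Python) tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained(
    "thedeba/friday-tokenizer",
    use_fast=False
)

text = "আমি আজ বাইরে যাচ্ছি"  # "I am going out today"

# Split the text into subword pieces and map it to vocabulary IDs.
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(tokens)
print(ids)

# Decode the IDs back to text to check the round trip.
decoded = tokenizer.decode(ids)
print(decoded)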

Training Details

Training Data

The tokenizer was trained using mixed multilingual conversational datasets including:

OpenSubtitles
Bengali conversational text
Bengali-English mixed text
Banglish datasets

Training Procedure

The tokenizer was trained from scratch using SentencePiece subword tokenization.

Preprocessing

Unicode normalization
Text cleaning
Duplicate filtering
Mixed-language corpus preparation
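
The preprocessing steps listed above could look roughly like the following. This is a minimal sketch under stated assumptions, not the project's actual pipeline: the card does not say which Unicode normalization form or cleaning rules were used, so NFC, whitespace collapsing, and exact-match deduplication are assumptions, and `clean_line` and `prepare_corpus` are hypothetical names.

```python
import re
import unicodedata
from typing import Iterable, Iterator

_WHITESPACE = re.compile(r"\s+")


def clean_line(line: str) -> str:
    """Normalize Unicode and tidy whitespace in a single line of text."""
    # Assumption: NFC, so visually identical strings share one encoding.
    line = unicodedata.normalize("NFC", line)
    # Turn control characters (category Cc) into spaces. Format characters
    # such as ZWJ/ZWNJ are category Cf and are kept, since Bengali uses them.
    line = "".join(" " if unicodedata.category(ch) == "Cc" else ch for ch in line)
    # Collapse runs of whitespace and trim both ends.
    return _WHITESPACE.sub(" ", line).strip()


def prepare_corpus(lines: Iterable[str]) -> Iterator[str]:
    """Yield cleaned, non-empty lines, skipping exact duplicates."""
    seen = set()
    for raw in lines:
        line = clean_line(raw)
        if line and line not in seen:
            seen.add(line)
            yield line


if __name__ == "__main__":
    raw_lines = ["ami  valo\t achi", "ami valo achi", "", "আমি ভালো আছি"]
    for line in prepare_corpus(raw_lines):
        print(line)
```

Deduplicating after cleaning, rather than before, means lines that differ only in spacing or Unicode encoding are treated as the same line.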

Training Hyperparameters

Vocabulary Size: 32000
Training regime: SentencePiece subword training
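
For readers who want to train a comparable tokenizer, the sketch below shows how a SentencePiece training call with this vocabulary size might be set up. Only `vocab_size=32000` comes from this card. The model type, character coverage, and file names (`corpus.txt`, `friday`) are assumptions chosen for illustration, and `build_training_args` and `train` are hypothetical helpers.

```python
from typing import Any, Dict


def build_training_args(corpus_path: str, model_prefix: str) -> Dict[str, Any]:
    """Keyword arguments for sentencepiece.SentencePieceTrainer.train."""
    return {
        "input": corpus_path,          # plain-text corpus, one sentence per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": 32000,           # value stated in this model card
        "model_type": "unigram",       # assumption: the card does not name the algorithm
        "character_coverage": 0.9995,  # assumption: not stated in the card
    }


def train(corpus_path: str, model_prefix: str) -> None:
    """Train a SentencePiece model (requires `pip install sentencepiece`)."""
    import sentencepiece as spm  # imported lazily so the helper above has no dependency

    spm.SentencePieceTrainer.train(**build_training_args(corpus_path, model_prefix))


if __name__ == "__main__":
    # Hypothetical paths; print the arguments instead of training.
    print(build_training_args("corpus.txt", "friday"))
```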

Speeds, Sizes, Times

Lightweight tokenizer suitable for low-resource devices
Compact vocabulary size for efficient inference

Evaluation

Testing Data, Factors & Metrics

Testing Data

Internal conversational Bengali-English text samples were used for qualitative evaluation.

Factors

Evaluation focused on:

Bengali Unicode support
Mixed-language tokenization
Banglish handling
Conversational token quality
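
To inspect these factors separately, evaluation samples can be grouped by script before tokenizing them. The sketch below is an illustrative assumption about how such grouping could be done, not a description of the evaluation actually performed. `script_of` and `bucket_by_script` are hypothetical helpers. Note that script alone cannot separate Banglish from English, since both use Latin letters.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def script_of(text: str) -> str:
    """Classify text as 'bengali', 'latin', 'mixed', or 'other'."""
    # The Bengali Unicode block spans U+0980 to U+09FF.
    has_bengali = any("\u0980" <= ch <= "\u09ff" for ch in text)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_bengali and has_latin:
        return "mixed"
    if has_bengali:
        return "bengali"
    if has_latin:
        return "latin"
    return "other"


def bucket_by_script(texts: Iterable[str]) -> Dict[str, List[str]]:
    """Group samples by script so each group can be inspected on its own."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    for text in texts:
        buckets[script_of(text)].append(text)
    return dict(buckets)


if __name__ == "__main__":
    samples = ["আমি আজ বাইরে যাচ্ছি", "ami aj baire jacchi", "আমি office যাচ্ছি"]
    for script, group in bucket_by_script(samples).items():
        print(script, len(group))
```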

Metrics

Evaluation relied primarily on qualitative inspection of the tokenization output and on reconstruction accuracy, that is, whether decoding the encoded IDs reproduces the input text.
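
A reconstruction check of this kind can be sketched as an exact-match round trip: encode each sample, decode it, and compare with the original. The code below is a minimal illustration, not the evaluation script used for this card. `round_trip_accuracy` is a hypothetical helper, and the code-point codec is a stand-in so the example runs without the tokenizer.

```python
from typing import Callable, Iterable, List


def round_trip_accuracy(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> float:
    """Fraction of texts that survive encode followed by decode unchanged."""
    samples = list(texts)
    if not samples:
        return 0.0
    matches = sum(1 for text in samples if decode(encode(text)) == text)
    return matches / len(samples)


def codepoint_encode(text: str) -> List[int]:
    """Stand-in encoder (assumption): one ID per Unicode code point."""
    return [ord(ch) for ch in text]


def codepoint_decode(ids: List[int]) -> str:
    """Stand-in decoder matching codepoint_encode."""
    return "".join(chr(i) for i in ids)


def lowercasing_encode(text: str) -> List[int]:
    """Lossy stand-in encoder, used to show a failing round trip."""
    return codepoint_encode(text.lower())


if __name__ == "__main__":
    samples = ["আমি আজ বাইরে যাচ্ছি", "Ami aj baire jacchi"]
    print(round_trip_accuracy(samples, codepoint_encode, codepoint_decode))
    print(round_trip_accuracy(samples, lowercasing_encode, codepoint_decode))
```

With the real tokenizer, `tokenizer.encode` and a decode call with `skip_special_tokens=True` could be passed in. Exact matching is strict, so whitespace normalization by the tokenizer may count as a mismatch even when the text is otherwise preserved.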

Results

In qualitative checks, the tokenizer handled Bengali, English, and Banglish conversational text and produced efficient subword segmentation. No quantitative results are reported in this card.

Summary

Friday Tokenizer provides lightweight multilingual tokenization suitable for GPT-style language models and Bengali conversational AI applications.

Model Examination

Basic qualitative inspection was performed to verify token splitting and text reconstruction quality.

Environmental Impact

Carbon emissions were not formally tracked during tokenizer training.

Hardware Type: Consumer GPU / CPU
Hours used: Not recorded
Cloud Provider: Google Colab
Compute Region: Not specified
Carbon Emitted: Unknown

Technical Specifications

Model Architecture and Objective

Architecture: SentencePiece tokenizer
Objective: Multilingual subword tokenization for conversational AI

Compute Infrastructure

Training was performed using local and cloud-based environments.

Hardware

Consumer-grade hardware
Google Colab environment

Software

Python
SentencePiece
Hugging Face Transformers

Citation

BibTeX

@misc{fridaytokenizer2026,
  title={Friday Tokenizer},
  author={Debashish Roy},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/thedeba/friday-tokenizer}}
}

APA

Roy, D. (2026). Friday Tokenizer. Hugging Face. https://huggingface.co/thedeba/friday-tokenizer

Glossary

Banglish: Bengali written using the Latin alphabet
Subword Tokenization: Splitting text into units smaller than words, so that rare or unseen words can be represented as sequences of known pieces
SentencePiece: A language-independent tokenizer and text segmentation library

More Information

Friday Tokenizer is part of the broader Friday GPT ecosystem focused on building multilingual lightweight AI systems from scratch.

Model Card Authors

Debashish Roy

Model Card Contact

For questions or collaboration:

Hugging Face: https://huggingface.co/thedeba
