---
language:
- bn
- en
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
tags:
- tokenizer
- sentencepiece
- bengali
- banglish
- english
- multilingual
- transformers
- nlp
- gpt
---

# Model Card for Friday Tokenizer

Friday Tokenizer is a custom multilingual tokenizer, built from scratch, for conversational AI systems that handle Bengali, English, and Banglish text.

---

## Model Details

### Model Description

Friday Tokenizer is a SentencePiece-based subword tokenizer designed for lightweight GPT-style language models and conversational AI applications. It was developed as part of the Friday GPT project to support Bengali and multilingual NLP without relying on existing pre-trained tokenizers.

The tokenizer is optimized for conversational datasets, mixed Bengali-English text, and Banglish (Romanized Bengali) inputs.

- **Developed by:** Debashish Roy
- **Funded by:** Self-funded
- **Shared by:** Debashish Roy
- **Model type:** SentencePiece Tokenizer
- **Language(s) (NLP):** Bengali, English, Banglish
- **License:** Apache 2.0
- **Finetuned from model:** None (built from scratch)

### Model Sources

- **Repository:** https://huggingface.co/thedeba/friday-tokenizer
- **Paper:** Not available
- **Demo:** Not available

---

## Uses

### Direct Use

This tokenizer is intended for:

- GPT-style decoder-only language models
- Conversational AI systems
- Bengali NLP experiments
- Banglish text generation
- Lightweight multilingual language models

### Downstream Use

The tokenizer can be integrated into:

- Chatbots
- Language generation systems
- Translation systems
- Bengali AI assistants
- Custom transformer training pipelines

### Out-of-Scope Use

This tokenizer is not optimized for:

- Formal literary Bengali
- Legal or medical NLP applications
- High-precision linguistic analysis
- Production-scale multilingual systems without further evaluation

---

## Bias, Risks, and Limitations

The tokenizer was trained primarily on conversational and subtitle-style datasets. As a result:

- Informal language patterns may be overrepresented
- Rare words may split aggressively
- Banglish spelling inconsistencies may affect tokenization quality
- Dataset biases from subtitle and internet conversations may exist

### Recommendations

Users should evaluate tokenizer performance on their own data before deploying it in sensitive or production environments. Retraining on domain text or expanding the vocabulary may improve performance for specialized domains.
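
One simple way to run such an evaluation is to measure how many tokens the tokenizer produces per word on text from the target domain; a high ratio suggests aggressive splitting. The sketch below is illustrative only: `tokens_per_word` and `toy_tokenize` are hypothetical helpers, and the toy tokenizer stands in for the real one so the example runs without downloads.

```python
from typing import Callable, Iterable, List


def tokens_per_word(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    return total_tokens / total_words if total_words else 0.0


# Toy tokenizer for demonstration only: cuts each word into 3-character chunks.
def toy_tokenize(text: str) -> List[str]:
    return [word[i:i + 3] for word in text.split() for i in range(0, len(word), 3)]


domain_samples = ["ami aj baire jacchi", "I am going out today"]
fertility = tokens_per_word(domain_samples, toy_tokenize)
print(round(fertility, 2))
```

To measure the real tokenizer, pass `tokenizer.tokenize` in place of `toy_tokenize` and use a sample of text from the intended domain.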

---

## How to Get Started with the Model

Use the code below to get started with the tokenizer.

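```python
from transformers import AutoTokenizer

# use_fast=False selects the Python (slow) tokenizer implementation.
tokenizer = AutoTokenizer.from_pretrained(
    "thedeba/friday-tokenizer",
    use_fast=False
)

text = "আমি আজ বাইরে যাচ্ছি"  # "I am going out today"

tokens = tokenizer.tokenize(text)  # subword pieces as strings
ids = tokenizer.encode(text)       # token IDs, including any special tokens the tokenizer adds

print(tokens)
print(ids)

# Decode the IDs back to text, leaving special tokens out of the output.
decoded = tokenizer.decode(ids, skip_special_tokens=True)
print(decoded)
```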

---

## Training Details

### Training Data

The tokenizer was trained using mixed multilingual conversational datasets including:

- OpenSubtitles
- Bengali conversational text
- Bengali-English mixed text
- Banglish datasets

### Training Procedure

The tokenizer was trained from scratch using SentencePiece subword tokenization.

#### Preprocessing

- Unicode normalization
- Text cleaning
- Duplicate filtering
- Mixed-language corpus preparation
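
The preprocessing script itself is not published in this card. As a rough sketch of what the steps above could look like, the following uses only the Python standard library. The NFC normalization form, the whitespace-collapsing rule, and the helper names `clean_line` and `prepare_corpus` are assumptions for illustration, not the project's actual pipeline.

```python
import re
import unicodedata

_WHITESPACE = re.compile(r"\s+")


def clean_line(line: str) -> str:
    """Normalize one line: NFC form (assumed), no control characters, single spaces."""
    line = unicodedata.normalize("NFC", line)
    # Replace control characters (category Cc) with spaces. Format characters
    # (category Cf) such as ZWJ/ZWNJ are kept, since Bengali text can rely on them.
    line = "".join(" " if unicodedata.category(ch) == "Cc" else ch for ch in line)
    return _WHITESPACE.sub(" ", line).strip()


def prepare_corpus(lines):
    """Clean lines, skip empty ones, and drop exact duplicates (first one wins)."""
    seen = set()
    cleaned = []
    for raw in lines:
        line = clean_line(raw)
        if line and line not in seen:
            seen.add(line)
            cleaned.append(line)
    return cleaned


# Mixed Bengali, Banglish, and English sample with a near-duplicate and a blank line.
sample = [
    "আমি  আজ বাইরে যাচ্ছি\n",
    "আমি আজ বাইরে যাচ্ছি",
    "ami aj baire jacchi",
    "I am going out today",
    "   ",
]
corpus = prepare_corpus(sample)
print(corpus)
```

The first two sample lines differ only in spacing, so they collapse into one entry after cleaning, and the blank line is dropped.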

#### Training Hyperparameters

- **Vocabulary Size:** 32000
- **Training regime:** SentencePiece subword training
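
Only the vocabulary size is documented here. As a hedged sketch of how a SentencePiece training run with that size might be configured: `vocab_size=32000` comes from this card, while `model_type="unigram"` and `character_coverage=0.9995` (both SentencePiece defaults), the file names, and the helper names `build_trainer_args` and `train_tokenizer` are assumptions. Training requires the `sentencepiece` package.

```python
def build_trainer_args(corpus_path: str, model_prefix: str) -> dict:
    """Collect keyword arguments for SentencePieceTrainer.train."""
    return {
        "input": corpus_path,          # plain-text corpus, one sentence per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": 32000,           # from the model card
        "model_type": "unigram",       # assumption: SentencePiece default
        "character_coverage": 0.9995,  # assumption: SentencePiece default
    }


def train_tokenizer(corpus_path: str, model_prefix: str) -> None:
    """Train a SentencePiece model; needs `pip install sentencepiece`."""
    import sentencepiece as spm  # imported lazily so the config helper works alone

    spm.SentencePieceTrainer.train(**build_trainer_args(corpus_path, model_prefix))


args = build_trainer_args("corpus.txt", "friday_tokenizer")
print(args)
# To actually train, point corpus_path at a real file and call:
# train_tokenizer("corpus.txt", "friday_tokenizer")
```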

#### Speeds, Sizes, Times

- Lightweight tokenizer suitable for low-resource devices
- Compact vocabulary size for efficient inference

---

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Internal conversational Bengali-English text samples were used for qualitative evaluation.

#### Factors

Evaluation focused on:

- Bengali Unicode support
- Mixed-language tokenization
- Banglish handling
- Conversational token quality

#### Metrics

Qualitative tokenization inspection and reconstruction accuracy were primarily used.
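
Reconstruction accuracy can be read as the fraction of inputs that survive an encode-then-decode round trip unchanged. The exact procedure used for this tokenizer is not documented, so the sketch below is one plausible formulation; `reconstruction_accuracy` and the toy codec are hypothetical and stand in for the real tokenizer so the example runs offline.

```python
from typing import Callable, Iterable, List


def reconstruction_accuracy(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> float:
    """Fraction of texts for which decode(encode(text)) == text."""
    texts = list(texts)
    if not texts:
        return 0.0
    exact = sum(1 for text in texts if decode(encode(text)) == text)
    return exact / len(texts)


# Toy stand-in codec (Unicode code points), used only so the sketch runs as-is.
def toy_encode(text: str) -> List[int]:
    return [ord(ch) for ch in text]


def toy_decode(ids: List[int]) -> str:
    return "".join(chr(i) for i in ids)


samples = ["আমি আজ বাইরে যাচ্ছি", "ami aj baire jacchi", "I am going out today"]
score = reconstruction_accuracy(samples, toy_encode, toy_decode)
print(score)
```

With the real tokenizer, `encode` could be `lambda t: tokenizer.encode(t, add_special_tokens=False)` and `decode` could be `tokenizer.decode`. Exact-match comparison is strict: any whitespace or Unicode normalization applied by the tokenizer would count as a mismatch, so comparing normalized strings may be more appropriate.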

### Results

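In qualitative inspection, the tokenizer segmented Bengali, English, and Banglish conversational text into subwords and reconstructed the inputs on decoding. No quantitative benchmark results are reported in this card.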

#### Summary

Friday Tokenizer provides lightweight multilingual tokenization suitable for GPT-style language models and Bengali conversational AI applications.

---

## Model Examination

Basic qualitative inspection was performed to verify token splitting and text reconstruction quality.

---

## Environmental Impact

Carbon emissions were not formally tracked during tokenizer training.

- **Hardware Type:** Consumer GPU / CPU
- **Hours used:** Not recorded
- **Cloud Provider:** Google Colab
- **Compute Region:** Not specified
- **Carbon Emitted:** Unknown

---

## Technical Specifications

### Model Architecture and Objective

- Architecture: SentencePiece tokenizer
- Objective: Multilingual subword tokenization for conversational AI

### Compute Infrastructure

Training was performed using local and cloud-based environments.

#### Hardware

- Consumer-grade hardware
- Google Colab environment

#### Software

- Python
- SentencePiece
- Hugging Face Transformers

---

## Citation

### BibTeX

```bibtex
@misc{fridaytokenizer2026,
  title={Friday Tokenizer},
  author={Debashish Roy},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/thedeba/friday-tokenizer}}
}
```

### APA

Roy, D. (2026). *Friday Tokenizer*. Hugging Face. https://huggingface.co/thedeba/friday-tokenizer

---

## Glossary

- **Banglish:** Bengali written using the Latin alphabet
- **Subword Tokenization:** Splitting words into smaller units (subwords) drawn from a fixed vocabulary, so that rare or unseen words can still be represented
- **SentencePiece:** A language-independent tokenizer and text segmentation library

---

## More Information

Friday Tokenizer is part of the broader Friday GPT ecosystem focused on building multilingual lightweight AI systems from scratch.

---

## Model Card Authors

Debashish Roy

---

## Model Card Contact

For questions or collaboration:

- Hugging Face: https://huggingface.co/thedeba