---
language:
- bn
- en
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
tags:
- tokenizer
- sentencepiece
- bengali
- banglish
- english
- multilingual
- transformers
- nlp
- gpt
---
# Model Card for Friday Tokenizer
Friday Tokenizer is a custom multilingual tokenizer built completely from scratch for Bengali, English, and Banglish conversational AI systems.
---
## Model Details
### Model Description
Friday Tokenizer is a SentencePiece-based subword tokenizer designed for lightweight GPT-style language models and conversational AI applications. It was developed as part of the Friday GPT project to support Bengali and multilingual NLP without relying on existing pre-trained tokenizers.
The tokenizer is optimized for conversational datasets, mixed Bengali-English text, and Banglish (Romanized Bengali) inputs.
- **Developed by:** Debashish Roy
- **Funded by:** Self-funded
- **Shared by:** Debashish Roy
- **Model type:** SentencePiece Tokenizer
- **Language(s) (NLP):** Bengali, English, Banglish
- **License:** Apache 2.0
- **Finetuned from model:** None (built from scratch)
### Model Sources
- **Repository:** https://huggingface.co/thedeba/friday-tokenizer
- **Paper:** Not available
- **Demo:** Not available
---
## Uses
### Direct Use
This tokenizer is intended for:
- GPT-style decoder-only language models
- Conversational AI systems
- Bengali NLP experiments
- Banglish text generation
- Lightweight multilingual language models
### Downstream Use
The tokenizer can be integrated into:
- Chatbots
- Language generation systems
- Translation systems
- Bengali AI assistants
- Custom transformer training pipelines
### Out-of-Scope Use
This tokenizer is not optimized for:
- Formal literary Bengali
- Legal or medical NLP applications
- High-precision linguistic analysis
- Production-scale multilingual systems without further evaluation
---
## Bias, Risks, and Limitations
The tokenizer was trained primarily on conversational and subtitle-style datasets. As a result:
- Informal language patterns may be overrepresented
- Rare words may be split into many short subword pieces
- Banglish spelling inconsistencies may affect tokenization quality
- Dataset biases from subtitle and internet conversations may exist
### Recommendations
Users should evaluate tokenizer performance before deploying it in sensitive or production environments. Additional fine-tuning or vocabulary expansion may improve performance for specialized domains.
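
One simple way to start such an evaluation is to measure how many tokens the tokenizer produces per whitespace-separated word on a sample from the target domain. The sketch below is a minimal illustration, not part of the Friday Tokenizer release: the helper name `tokens_per_word` is hypothetical, and it accepts any callable that maps a string to a list of tokens.

```python
from typing import Callable, Iterable, List


def tokens_per_word(
    tokenize: Callable[[str], List[str]],
    texts: Iterable[str],
) -> float:
    """Average number of tokens per whitespace-separated word.

    Hypothetical helper for illustration. Higher values mean words are
    being split into more pieces on this sample.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("the sample contains no words")
    return total_tokens / total_words
```

With a loaded tokenizer, this could be called as `tokens_per_word(tokenizer.tokenize, samples)`, where `samples` is a list of domain-specific sentences. Comparing the value across conversational and specialized text gives a rough signal of where vocabulary expansion might help.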
---
## How to Get Started with the Model
Use the code below to get started with the tokenizer.
```python
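from transformers import AutoTokenizer

# use_fast=False loads the slow (Python) tokenizer implementation
# instead of the Rust-backed fast one.
tokenizer = AutoTokenizer.from_pretrained(
    "thedeba/friday-tokenizer",
    use_fast=False,
)

text = "আমি আজ বাইরে যাচ্ছি"  # "I am going out today"

tokens = tokenizer.tokenize(text)  # subword pieces as strings
ids = tokenizer.encode(text)       # integer token IDs
print(tokens)
print(ids)

# Map the IDs back to text to check reconstruction.
decoded = tokenizer.decode(ids)
print(decoded)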
```
---
## Training Details
### Training Data
The tokenizer was trained using mixed multilingual conversational datasets including:
- OpenSubtitles
- Bengali conversational text
- Bengali-English mixed text
- Banglish datasets
### Training Procedure
The tokenizer was trained from scratch using SentencePiece subword tokenization.
#### Preprocessing
- Unicode normalization
- Text cleaning
- Duplicate filtering
- Mixed-language corpus preparation
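
The steps listed above can be sketched with the Python standard library. This is an illustrative sketch under stated assumptions, not the actual Friday Tokenizer pipeline: the card does not say which Unicode normalization form or cleaning rules were used, so NFC and the whitespace and control-character handling here are assumptions, and the function names are hypothetical.

```python
import re
import unicodedata
from typing import Iterable, Iterator

_WHITESPACE = re.compile(r"\s+")


def clean_line(line: str) -> str:
    """Normalize and clean a single line of text."""
    # Assumption: NFC. The card does not state the normalization form.
    line = unicodedata.normalize("NFC", line)
    # Drop control characters (category Cc) that are not whitespace.
    # Format characters such as ZWJ/ZWNJ (category Cf) are kept.
    line = "".join(
        ch for ch in line
        if ch.isspace() or unicodedata.category(ch) != "Cc"
    )
    # Collapse whitespace runs and trim the ends.
    return _WHITESPACE.sub(" ", line).strip()


def prepare_corpus(lines: Iterable[str]) -> Iterator[str]:
    """Yield cleaned, non-empty lines, skipping exact duplicates."""
    seen = set()
    for raw in lines:
        line = clean_line(raw)
        if line and line not in seen:
            seen.add(line)
            yield line
```

Bengali, English, and Banglish lines can be passed through the same function, so a mixed-language corpus can be prepared by chaining the source files into one iterable.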
#### Training Hyperparameters
- **Vocabulary Size:** 32000
- **Training regime:** SentencePiece subword training
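
For readers who want to train a comparable tokenizer, the sketch below shows how a SentencePiece training call with this vocabulary size might look. Only `vocab_size=32000` comes from this card. The model type, character coverage, file names, and helper names (`build_trainer_args`, `train_tokenizer`) are assumptions made for illustration.

```python
from pathlib import Path


def build_trainer_args(
    corpus_path: str,
    model_prefix: str,
    vocab_size: int = 32000,
) -> dict:
    """Keyword arguments for SentencePieceTrainer.train (hypothetical helper)."""
    return {
        "input": corpus_path,          # one sentence per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,      # 32000 is the value stated in this card
        "model_type": "unigram",       # assumption: the card does not state the type
        "character_coverage": 0.9995,  # assumption: not stated in the card
    }


def train_tokenizer(corpus_path: str, model_prefix: str) -> Path:
    """Train a SentencePiece model and return the path to the .model file."""
    # Imported here so build_trainer_args works without sentencepiece installed.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**build_trainer_args(corpus_path, model_prefix))
    return Path(model_prefix + ".model")
```

A call such as `train_tokenizer("corpus.txt", "friday")` would read a prepared corpus file and write `friday.model` and `friday.vocab` to the working directory. Both file names are placeholders.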
#### Speeds, Sizes, Times
- Lightweight tokenizer suitable for low-resource devices
- Compact vocabulary (32,000 entries) for efficient inference
---
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
Internal conversational Bengali-English text samples were used for qualitative evaluation.
#### Factors
Evaluation focused on:
- Bengali Unicode support
- Mixed-language tokenization
- Banglish handling
- Conversational token quality
#### Metrics
Evaluation relied primarily on qualitative inspection of token splits and on reconstruction accuracy, meaning whether decoding the encoded IDs reproduces the input text.
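
A reconstruction check of this kind can be sketched as the share of sample texts that survive an encode and decode round trip unchanged. The helper below is a hypothetical illustration, not the evaluation script used for this card. It takes the encode and decode steps as callables so it works with any tokenizer.

```python
from typing import Callable, Iterable, List


def round_trip_accuracy(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    texts: Iterable[str],
) -> float:
    """Fraction of texts for which decode(encode(text)) == text."""
    samples = list(texts)
    if not samples:
        raise ValueError("no sample texts provided")
    exact = sum(1 for text in samples if decode(encode(text)) == text)
    return exact / len(samples)
```

With a Transformers tokenizer, `encode` could be `lambda t: tokenizer.encode(t, add_special_tokens=False)` and `decode` could be `tokenizer.decode`. An exact-match score below 1.0 does not always indicate lost information, since the tokenizer's own normalization (for example, whitespace handling) can change the surface form of the text.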
### Results
On the internal samples, qualitative inspection indicated that the tokenizer handles Bengali, English, and Banglish conversational text and segments it into subwords efficiently.
#### Summary
Friday Tokenizer provides lightweight multilingual tokenization suitable for GPT-style language models and Bengali conversational AI applications.
---
## Model Examination
Basic qualitative inspection was performed to verify token splitting and text reconstruction quality.
---
## Environmental Impact
Carbon emissions were not formally tracked during tokenizer training.
- **Hardware Type:** Consumer GPU / CPU
- **Hours used:** Not recorded
- **Cloud Provider:** Google Colab
- **Compute Region:** Not specified
- **Carbon Emitted:** Unknown
---
## Technical Specifications
### Model Architecture and Objective
- Architecture: SentencePiece tokenizer
- Objective: Multilingual subword tokenization for conversational AI
### Compute Infrastructure
Training was performed using local and cloud-based environments.
#### Hardware
- Consumer-grade hardware
- Google Colab environment
#### Software
- Python
- SentencePiece
- Hugging Face Transformers
---
## Citation
### BibTeX
```bibtex
@misc{fridaytokenizer2026,
title={Friday Tokenizer},
author={Debashish Roy},
year={2026},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/thedeba/friday-tokenizer}}
}
```
### APA
Roy, D. (2026). *Friday Tokenizer*. Hugging Face. https://huggingface.co/thedeba/friday-tokenizer
---
## Glossary
- **Banglish:** Bengali written using the Latin alphabet
- **Subword Tokenization:** Splitting text into units smaller than words, such as frequently occurring word fragments
- **SentencePiece:** A language-independent tokenizer and text segmentation library
---
## More Information
Friday Tokenizer is part of the broader Friday GPT ecosystem focused on building multilingual lightweight AI systems from scratch.
---
## Model Card Authors
Debashish Roy
---
## Model Card Contact
For questions or collaboration:
- Hugging Face: https://huggingface.co/thedeba