	---
	language:
	- bn
	- en
	library_name: transformers
	pipeline_tag: text-generation
	license: apache-2.0
	tags:
	- tokenizer
	- sentencepiece
	- bengali
	- banglish
	- english
	- multilingual
	- transformers
	- nlp
	- gpt
	---

	# Model Card for Friday Tokenizer

	Friday Tokenizer is a custom multilingual tokenizer built completely from scratch for Bengali, English, and Banglish conversational AI systems.

	---

	## Model Details

	### Model Description

	Friday Tokenizer is a SentencePiece-based subword tokenizer designed for lightweight GPT-style language models and conversational AI applications. It was developed as part of the Friday GPT project to support Bengali and multilingual NLP without relying on existing pre-trained tokenizers.

	The tokenizer is optimized for conversational datasets, mixed Bengali-English text, and Banglish (Romanized Bengali) inputs.

	- Developed by: Debashish Roy
	- Funded by: Self-funded
	- Shared by: Debashish Roy
	- Model type: SentencePiece Tokenizer
	- Language(s) (NLP): Bengali, English, Banglish
	- License: Apache 2.0
	- Finetuned from model: None (built from scratch)

	### Model Sources

	- Repository: https://huggingface.co/thedeba/friday-tokenizer
	- Paper: Not available
	- Demo: Not available

	---

	## Uses

	### Direct Use

	This tokenizer is intended for:

	- GPT-style decoder-only language models
	- Conversational AI systems
	- Bengali NLP experiments
	- Banglish text generation
	- Lightweight multilingual language models

	### Downstream Use

	The tokenizer can be integrated into:

	- Chatbots
	- Language generation systems
	- Translation systems
	- Bengali AI assistants
	- Custom transformer training pipelines

	### Out-of-Scope Use

	This tokenizer is not optimized for:

	- Formal literary Bengali
	- Legal or medical NLP applications
	- High-precision linguistic analysis
	- Production-scale multilingual systems without further evaluation

	---

	## Bias, Risks, and Limitations

	The tokenizer was trained primarily on conversational and subtitle-style datasets. As a result:

	- Informal language patterns may be overrepresented
	- Rare words may be split into many short subword pieces
	- Banglish spelling inconsistencies may affect tokenization quality
	- Dataset biases from subtitle and internet conversations may exist

	### Recommendations

	Users should evaluate the tokenizer on representative text before deploying it in sensitive or production environments. For specialized domains, retraining on in-domain text or expanding the vocabulary may improve results.
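
One simple way to run such an evaluation is to measure how many tokens the tokenizer produces per word on a sample of in-domain text. The sketch below is a minimal illustration, not part of the Friday GPT project; the function name `tokens_per_word` and the use of whitespace splitting to count words are assumptions.

```python
from typing import Callable, Iterable, List


def tokens_per_word(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        # Assumption: words are approximated by splitting on whitespace.
        total_words += len(text.split())
    return total_tokens / total_words if total_words else 0.0
```

With the real tokenizer loaded, `tokenize` could be `tokenizer.tokenize`. A noticeably higher value on domain text than on conversational text would suggest that rare words are being split aggressively.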

	---

	## How to Get Started with the Model

	Use the code below to get started with the tokenizer.

	```python
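	from transformers import AutoTokenizer

	# Load the tokenizer from the Hugging Face Hub.
	# use_fast=False selects the Python (slow) implementation.
	tokenizer = AutoTokenizer.from_pretrained(
	    "thedeba/friday-tokenizer",
	    use_fast=False,
	)

	text = "আমি আজ বাইরে যাচ্ছি"

	# Split the text into subword pieces and map it to vocabulary IDs.
	tokens = tokenizer.tokenize(text)
	ids = tokenizer.encode(text)

	print(tokens)
	print(ids)

	# Convert the IDs back to text to check the round trip.
	decoded = tokenizer.decode(ids)
	print(decoded)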
	```

	---

	## Training Details

	### Training Data

	The tokenizer was trained using mixed multilingual conversational datasets including:

	- OpenSubtitles
	- Bengali conversational text
	- Bengali-English mixed text
	- Banglish datasets

	### Training Procedure

	The tokenizer was trained from scratch using SentencePiece subword tokenization.

	#### Preprocessing

	- Unicode normalization
	- Text cleaning
	- Duplicate filtering
	- Mixed-language corpus preparation
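
The steps listed above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual preprocessing script; the choice of NFC normalization, the cleaning rules, and the function names `clean_line` and `prepare_corpus` are assumptions.

```python
import re
import unicodedata


def clean_line(line: str) -> str:
    """Normalize one line of text (assumed rules, not the project's actual ones)."""
    # NFC composes canonically equivalent sequences so that
    # visually identical strings compare equal.
    line = unicodedata.normalize("NFC", line)
    # Drop non-whitespace control characters (category "Cc"), e.g. a stray \x00.
    line = "".join(
        ch for ch in line if ch.isspace() or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of whitespace into single spaces and trim the ends.
    return re.sub(r"\s+", " ", line).strip()


def prepare_corpus(lines):
    """Clean lines, drop empty ones, and remove exact duplicates (first one wins)."""
    seen = set()
    result = []
    for raw in lines:
        cleaned = clean_line(raw)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result
```

Deduplicating after normalization means that lines differing only in spacing or in Unicode composition are treated as the same line.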

	#### Training Hyperparameters

	- Vocabulary Size: 32000
	- Training regime: SentencePiece subword training
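
For orientation, a SentencePiece training run with the vocabulary size stated above might look like the sketch below. Only `vocab_size=32000` comes from this card; the model type (`unigram`), the `character_coverage` value, the file names passed in, and the helper names `build_trainer_kwargs` and `train_tokenizer` are assumptions for illustration, not the settings actually used.

```python
def build_trainer_kwargs(corpus_path: str, model_prefix: str) -> dict:
    """Collect SentencePiece trainer options (only vocab_size is from the card)."""
    return {
        "input": corpus_path,          # plain-text corpus, one sentence per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": 32000,           # stated in this model card
        "model_type": "unigram",       # assumption: the card does not name the algorithm
        "character_coverage": 0.9995,  # assumption: not specified in the card
    }


def train_tokenizer(corpus_path: str, model_prefix: str) -> None:
    """Run training; requires the `sentencepiece` package to be installed."""
    # Imported lazily so build_trainer_kwargs can be used without the package.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**build_trainer_kwargs(corpus_path, model_prefix))
```

Calling `train_tokenizer("corpus.txt", "friday")` (hypothetical file names) would write `friday.model` and `friday.vocab` to the working directory.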

	#### Speeds, Sizes, Times

	- Lightweight tokenizer suitable for low-resource devices
	- Compact vocabulary size for efficient inference
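
To make the effect of vocabulary size concrete, the sketch below estimates the size of a token embedding table for a 32,000-entry vocabulary. The hidden size of 512 and the 16-bit weights are hypothetical values chosen for illustration; this card does not specify any model dimensions.

```python
def embedding_table_size(vocab_size: int, hidden_size: int, bytes_per_param: int = 2):
    """Return (parameter count, size in MiB) of a vocab_size x hidden_size embedding."""
    params = vocab_size * hidden_size
    mib = params * bytes_per_param / (1024 ** 2)
    return params, mib


# vocab_size is from this card; hidden_size=512 and 2-byte weights are assumptions.
params, mib = embedding_table_size(vocab_size=32000, hidden_size=512)
print(f"{params:,} parameters, about {mib:.2f} MiB")
# prints: 16,384,000 parameters, about 31.25 MiB
```

The embedding table grows linearly with vocabulary size, so doubling the vocabulary would double this figure for the same hidden size.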

	---

	## Evaluation

	### Testing Data, Factors & Metrics

	#### Testing Data

	Internal conversational Bengali-English text samples were used for qualitative evaluation.

	#### Factors

	Evaluation focused on:

	- Bengali Unicode support
	- Mixed-language tokenization
	- Banglish handling
	- Conversational token quality

	#### Metrics

	Qualitative tokenization inspection and reconstruction accuracy were primarily used.
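
Reconstruction accuracy can be read as the share of texts that survive an encode and decode round trip unchanged. The sketch below shows one way to compute it; the exact-match definition and the function name `reconstruction_accuracy` are assumptions, since the card does not state how the metric was calculated.

```python
from typing import Callable, Iterable, List


def reconstruction_accuracy(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> float:
    """Fraction of texts for which decode(encode(text)) equals the original text."""
    texts = list(texts)
    if not texts:
        return 0.0
    # Assumption: a text counts as reconstructed only on an exact string match.
    exact = sum(1 for text in texts if decode(encode(text)) == text)
    return exact / len(texts)
```

With the real tokenizer, `encode` could be `lambda t: tokenizer.encode(t, add_special_tokens=False)` and `decode` could be `lambda ids: tokenizer.decode(ids, skip_special_tokens=True)`. Tokenizer-side normalization may change whitespace or character forms, so an exact-match score below 1.0 does not necessarily indicate lost content.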

	### Results

	In qualitative inspection of the internal samples, the tokenizer handled Bengali, English, and Banglish conversational text and produced compact subword segmentations. No quantitative scores are reported.

	#### Summary

	Friday Tokenizer provides lightweight multilingual tokenization suitable for GPT-style language models and Bengali conversational AI applications.

	---

	## Model Examination

	Basic qualitative inspection was performed to verify token splitting and text reconstruction quality.

	---

	## Environmental Impact

	Carbon emissions were not formally tracked during tokenizer training.

	- Hardware Type: Consumer GPU / CPU
	- Hours used: Not recorded
	- Cloud Provider: Google Colab
	- Compute Region: Not specified
	- Carbon Emitted: Unknown

	---

	## Technical Specifications

	### Model Architecture and Objective

	- Architecture: SentencePiece tokenizer
	- Objective: Multilingual subword tokenization for conversational AI

	### Compute Infrastructure

	Training was performed using local and cloud-based environments.

	#### Hardware

	- Consumer-grade hardware
	- Google Colab environment

	#### Software

	- Python
	- SentencePiece
	- Hugging Face Transformers

	---

	## Citation

	### BibTeX

	```bibtex
	@misc{fridaytokenizer2026,
	title={Friday Tokenizer},
	author={Debashish Roy},
	year={2026},
	publisher={Hugging Face},
	howpublished={\url{https://huggingface.co/thedeba/friday-tokenizer}}
	}
	```

	### APA

	Roy, D. (2026). Friday Tokenizer. Hugging Face. https://huggingface.co/thedeba/friday-tokenizer

	---

	## Glossary

	- Banglish: Bengali written using the Latin alphabet
	- Subword Tokenization: Splitting words into smaller units drawn from a fixed vocabulary, so that rare words can be represented as sequences of more common pieces
	- SentencePiece: A language-independent tokenizer and text segmentation library

	---

	## More Information

	Friday Tokenizer is part of the broader Friday GPT ecosystem focused on building multilingual lightweight AI systems from scratch.

	---

	## Model Card Authors

	Debashish Roy

	---

	## Model Card Contact

	For questions or collaboration:

	- Hugging Face: https://huggingface.co/thedeba