---
language:
- bn
- en
library_name: transformers
pipeline_tag: text-generation
license: apache-2.0
tags:
- tokenizer
- sentencepiece
- bengali
- banglish
- english
- multilingual
- transformers
- nlp
- gpt
---

# Model Card for Friday Tokenizer

Friday Tokenizer is a custom multilingual tokenizer built completely from scratch for Bengali, English, and Banglish conversational AI systems.

---

## Model Details

### Model Description

Friday Tokenizer is a SentencePiece-based subword tokenizer designed for lightweight GPT-style language models and conversational AI applications.

It was developed as part of the Friday GPT project to support Bengali and multilingual NLP without relying on existing pre-trained tokenizers.

The tokenizer is optimized for conversational datasets, mixed Bengali-English text, and Banglish (Romanized Bengali) inputs.

- **Developed by:** Debashish Roy
- **Funded by:** Self-funded
- **Shared by:** Debashish Roy
- **Model type:** SentencePiece Tokenizer
- **Language(s) (NLP):** Bengali, English, Banglish
- **License:** Apache 2.0
- **Finetuned from model:** None (built from scratch)

### Model Sources

- **Repository:** https://huggingface.co/thedeba/friday-tokenizer
- **Paper:** Not available
- **Demo:** Not available

---

## Uses

### Direct Use

This tokenizer is intended for:

- GPT-style decoder-only language models
- Conversational AI systems
- Bengali NLP experiments
- Banglish text generation
- Lightweight multilingual language models

### Downstream Use

The tokenizer can be integrated into:

- Chatbots
- Language generation systems
- Translation systems
- Bengali AI assistants
- Custom transformer training pipelines

### Out-of-Scope Use

This tokenizer is not optimized for:

- Formal literary Bengali
- Legal or medical NLP applications
- High-precision linguistic analysis
- Production-scale multilingual systems without further evaluation

---

## Bias, Risks, and Limitations

The tokenizer was trained primarily on conversational and
subtitle-style datasets. As a result:

- Informal language patterns may be overrepresented
- Rare words may be split aggressively into many subword pieces
- Banglish spelling inconsistencies may affect tokenization quality
- Dataset biases from subtitle and internet conversations may exist

### Recommendations

Users should evaluate tokenizer performance before deploying it in sensitive or production environments.

Retraining on domain-specific text or expanding the vocabulary may improve performance for specialized domains.

---

## How to Get Started with the Model

Use the code below to get started with the tokenizer.

```python
from transformers import AutoTokenizer

# Load the slow (SentencePiece-backed) tokenizer from the Hub.
tokenizer = AutoTokenizer.from_pretrained(
    "thedeba/friday-tokenizer",
    use_fast=False
)

text = "আমি আজ বাইরে যাচ্ছি"

# Subword pieces and their vocabulary ids.
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)

print(tokens)
print(ids)

# Map the ids back to text to check the round trip.
decoded = tokenizer.decode(ids)
print(decoded)
```

---

## Training Details

### Training Data

The tokenizer was trained using mixed multilingual conversational datasets including:

- OpenSubtitles
- Bengali conversational text
- Bengali-English mixed text
- Banglish datasets

### Training Procedure

The tokenizer was trained from scratch using SentencePiece subword tokenization.

#### Preprocessing

- Unicode normalization
- Text cleaning
- Duplicate filtering
- Mixed-language corpus preparation

#### Training Hyperparameters

- **Vocabulary Size:** 32000
- **Training regime:** SentencePiece subword training

#### Speeds, Sizes, Times

- Lightweight tokenizer suitable for low-resource devices
- Compact vocabulary size for efficient inference

---

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Internal conversational Bengali-English text samples were used for qualitative evaluation.

#### Factors

Evaluation focused on:

- Bengali Unicode support
- Mixed-language tokenization
- Banglish handling
- Conversational token quality

#### Metrics

Evaluation relied primarily on qualitative inspection of the tokenization and on reconstruction accuracy, that is, whether decoding the encoded ids reproduces the input text.
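The reconstruction check described above could be made quantitative along the following lines. This is a minimal sketch, not the procedure actually used for this tokenizer: the helper names `reconstruction_accuracy` and `tokens_per_word` are hypothetical, and comparing texts after NFKC and whitespace normalization is an assumption made so that harmless normalization by the tokenizer is not counted as a failure.

```python
import unicodedata
from typing import Callable, Iterable, List


def _normalize(text: str) -> str:
    # Assumption: NFKC plus collapsed whitespace is an acceptable notion of "same text".
    return " ".join(unicodedata.normalize("NFKC", text).split())


def reconstruction_accuracy(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> float:
    """Fraction of texts that survive encode -> decode unchanged (after normalization)."""
    total = 0
    exact = 0
    for text in texts:
        total += 1
        restored = decode(encode(text))
        if _normalize(restored) == _normalize(text):
            exact += 1
    return exact / total if total else 0.0


def tokens_per_word(
    texts: Iterable[str],
    encode: Callable[[str], List[int]],
) -> float:
    """Average number of tokens per whitespace-separated word (lower means less splitting)."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(encode(text))
        n_words += len(text.split())
    return n_tokens / n_words if n_words else 0.0
```

To apply it to this tokenizer, one option is to pass `lambda s: tokenizer.encode(s, add_special_tokens=False)` as `encode` and `lambda ids: tokenizer.decode(ids, skip_special_tokens=True)` as `decode`, then run both functions over a held-out list of Bengali, English, and Banglish sentences.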
### Results

The tokenizer successfully supports multilingual conversational tokenization with efficient subword segmentation.

#### Summary

Friday Tokenizer provides lightweight multilingual tokenization suitable for GPT-style language models and Bengali conversational AI applications.

---

## Model Examination

Basic qualitative inspection was performed to verify token splitting and text reconstruction quality.

---

## Environmental Impact

Carbon emissions were not formally tracked during tokenizer training.

- **Hardware Type:** Consumer GPU / CPU
- **Hours used:** Not recorded
- **Cloud Provider:** Google Colab
- **Compute Region:** Not specified
- **Carbon Emitted:** Unknown

---

## Technical Specifications

### Model Architecture and Objective

- Architecture: SentencePiece tokenizer
- Objective: Multilingual subword tokenization for conversational AI

### Compute Infrastructure

Training was performed using local and cloud-based environments.

#### Hardware

- Consumer-grade hardware
- Google Colab environment

#### Software

- Python
- SentencePiece
- Hugging Face Transformers

---

## Citation

### BibTeX

```bibtex
@misc{fridaytokenizer2026,
  title={Friday Tokenizer},
  author={Debashish Roy},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/thedeba/friday-tokenizer}}
}
```

### APA

Roy, D. (2026). *Friday Tokenizer*. Hugging Face. https://huggingface.co/thedeba/friday-tokenizer

---

## Glossary

- **Banglish:** Bengali written using the Latin alphabet
- **Subword Tokenization:** Splitting words into smaller meaningful units
- **SentencePiece:** A language-independent tokenizer and text segmentation library

---

## More Information

Friday Tokenizer is part of the broader Friday GPT ecosystem focused on building multilingual lightweight AI systems from scratch.

---

## Model Card Authors

Debashish Roy

---

## Model Card Contact

For questions or collaboration:

- Hugging Face: https://huggingface.co/thedeba