| language: | |
| - bn | |
| - en | |
| library_name: transformers | |
| pipeline_tag: text-generation | |
| license: apache-2.0 | |
| tags: | |
| - tokenizer | |
| - sentencepiece | |
| - bengali | |
| - banglish | |
| - english | |
| - multilingual | |
| - transformers | |
| - nlp | |
| - gpt | |
| # Model Card for Friday Tokenizer | |
| Friday Tokenizer is a custom multilingual tokenizer built completely from scratch for Bengali, English, and Banglish conversational AI systems. | |
| --- | |
| ## Model Details | |
| ### Model Description | |
| Friday Tokenizer is a SentencePiece-based subword tokenizer designed for lightweight GPT-style language models and conversational AI applications. It was developed as part of the Friday GPT project to support Bengali and multilingual NLP without relying on existing pre-trained tokenizers. | |
| The tokenizer is optimized for conversational datasets, mixed Bengali-English text, and Banglish (Romanized Bengali) inputs. | |
| - **Developed by:** Debashish Roy | |
| - **Funded by:** Self-funded | |
| - **Shared by:** Debashish Roy | |
| - **Model type:** SentencePiece Tokenizer | |
| - **Language(s) (NLP):** Bengali, English, Banglish | |
| - **License:** Apache 2.0 | |
| - **Finetuned from model:** None (built from scratch) | |
| ### Model Sources | |
| - **Repository:** https://huggingface.co/thedeba/friday-tokenizer | |
| - **Paper:** Not available | |
| - **Demo:** Not available | |
| --- | |
| ## Uses | |
| ### Direct Use | |
| This tokenizer is intended for: | |
| - GPT-style decoder-only language models | |
| - Conversational AI systems | |
| - Bengali NLP experiments | |
| - Banglish text generation | |
| - Lightweight multilingual language models | |
| ### Downstream Use | |
| The tokenizer can be integrated into: | |
| - Chatbots | |
| - Language generation systems | |
| - Translation systems | |
| - Bengali AI assistants | |
| - Custom transformer training pipelines | |
| ### Out-of-Scope Use | |
| This tokenizer is not optimized for: | |
| - Formal literary Bengali | |
| - Legal or medical NLP applications | |
| - High-precision linguistic analysis | |
| - Production-scale multilingual systems without further evaluation | |
| --- | |
| ## Bias, Risks, and Limitations | |
| The tokenizer was trained primarily on conversational and subtitle-style datasets. As a result: | |
| - Informal language patterns may be overrepresented | |
| - Rare words may be split into many short subword pieces | |
| - Banglish spelling inconsistencies may affect tokenization quality | |
| - Dataset biases from subtitle and internet conversations may exist | |
| ### Recommendations | |
| Users should evaluate tokenizer performance on their own data before deploying it in sensitive or production environments. Retraining the tokenizer on in-domain text or expanding its vocabulary may improve performance for specialized domains. | |
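One simple way to run such an evaluation is to measure how many subword tokens the tokenizer produces per word on a sample of your own text. The sketch below is illustrative only: `tokens_per_word` and `toy_tokenize` are hypothetical names, and the toy tokenizer is a stand-in so the example runs without downloading anything.

```python
from typing import Callable, Iterable, List


def tokens_per_word(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        words = text.split()
        if not words:
            continue  # skip empty lines so they do not distort the ratio
        total_words += len(words)
        total_tokens += len(tokenize(text))
    return total_tokens / total_words if total_words else 0.0


# Stand-in tokenizer for illustration only: cuts each word into 3-character chunks.
def toy_tokenize(text: str) -> List[str]:
    return [word[i:i + 3] for word in text.split() for i in range(0, len(word), 3)]


samples = ["ami aj baire jacchi", "hello world"]
fertility = tokens_per_word(toy_tokenize, samples)
print(f"tokens per word: {fertility:.2f}")
```

To measure the real tokenizer, pass `tokenizer.tokenize` in place of `toy_tokenize`. A noticeably higher ratio on domain-specific text than on conversational text would suggest that rare words are being fragmented.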
| --- | |
| ## How to Get Started with the Model | |
| Use the code below to get started with the tokenizer. | |
| ```python | |
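| from transformers import AutoTokenizer | |
| # Load the tokenizer from the Hugging Face Hub; use_fast=False selects the Python (slow) implementation | |
| tokenizer = AutoTokenizer.from_pretrained( | |
|     "thedeba/friday-tokenizer", | |
|     use_fast=False, | |
| ) | |
| text = "আমি আজ বাইরে যাচ্ছি" | |
| # Split the text into subword pieces and map it to vocabulary IDs | |
| tokens = tokenizer.tokenize(text) | |
| ids = tokenizer.encode(text) | |
| print(tokens) | |
| print(ids) | |
| # Decode the IDs back to text, dropping any special tokens added by encode() | |
| decoded = tokenizer.decode(ids, skip_special_tokens=True) | |
| print(decoded) | |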
| ``` | |
| --- | |
| ## Training Details | |
| ### Training Data | |
| The tokenizer was trained using mixed multilingual conversational datasets including: | |
| - OpenSubtitles | |
| - Bengali conversational text | |
| - Bengali-English mixed text | |
| - Banglish datasets | |
| ### Training Procedure | |
| The tokenizer was trained from scratch using SentencePiece subword tokenization. | |
| #### Preprocessing | |
| - Unicode normalization | |
| - Text cleaning | |
| - Duplicate filtering | |
| - Mixed-language corpus preparation | |
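The preprocessing steps listed above could look roughly like the following. This is a minimal sketch, not the pipeline actually used for Friday Tokenizer: the choice of NFC normalization, the specific cleaning rules, and the names `clean_line` and `prepare_corpus` are all assumptions made for illustration.

```python
import re
import unicodedata
from typing import Iterable, Iterator

_WHITESPACE = re.compile(r"\s+")
# Control characters other than tab, newline, and similar whitespace.
# Zero-width joiners (U+200C, U+200D) are left alone because they carry
# meaning in Bengali script.
_CONTROL = re.compile(r"[\u0000-\u0008\u000e-\u001f\u007f]")


def clean_line(line: str) -> str:
    """Normalize Unicode, drop control characters, collapse whitespace."""
    line = unicodedata.normalize("NFC", line)  # assumption: NFC form
    line = _CONTROL.sub("", line)
    return _WHITESPACE.sub(" ", line).strip()


def prepare_corpus(lines: Iterable[str]) -> Iterator[str]:
    """Yield cleaned, non-empty lines, skipping exact duplicates."""
    seen = set()
    for raw in lines:
        line = clean_line(raw)
        if not line or line in seen:
            continue
        seen.add(line)
        yield line


# A tiny mixed Bengali / Banglish / English sample.
raw_lines = [
    "আমি  আজ বাইরে যাচ্ছি\n",
    "আমি আজ বাইরে যাচ্ছি",
    "ami aj baire jacchi",
    "   ",
    "I am going out today",
]
corpus = list(prepare_corpus(raw_lines))
print(len(corpus), "unique lines")
```

Deduplicating after cleaning, rather than before, lets lines that differ only in spacing or Unicode form collapse into one entry.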
| #### Training Hyperparameters | |
| - **Vocabulary Size:** 32000 | |
| - **Training regime:** SentencePiece subword training | |
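For readers who want to train a comparable tokenizer, a SentencePiece training call with the stated vocabulary size might be set up as sketched below. Only `vocab_size=32000` comes from this card. The model type, the character coverage, the file names `corpus.txt` and `friday_sp`, and the helper names are assumptions; the values shown for `model_type` and `character_coverage` are SentencePiece's defaults, not confirmed settings for Friday Tokenizer.

```python
from typing import Any, Dict


def build_trainer_args(corpus_path: str, model_prefix: str) -> Dict[str, Any]:
    """Collect keyword arguments for SentencePieceTrainer.train()."""
    return {
        "input": corpus_path,          # one sentence per line, plain text
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": 32000,           # stated in this model card
        "model_type": "unigram",       # assumption: SentencePiece default
        "character_coverage": 0.9995,  # assumption: SentencePiece default
    }


def train_tokenizer(corpus_path: str, model_prefix: str) -> None:
    """Train a SentencePiece model; requires `pip install sentencepiece`."""
    import sentencepiece as spm  # imported lazily so the sketch loads without it

    spm.SentencePieceTrainer.train(**build_trainer_args(corpus_path, model_prefix))


# Hypothetical file names, shown only to illustrate the call.
args = build_trainer_args("corpus.txt", "friday_sp")
print(args)
```

Calling `train_tokenizer("corpus.txt", "friday_sp")` would run the training itself. The corpus must be large enough to support a 32000-piece vocabulary, otherwise SentencePiece reports an error.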
| #### Speeds, Sizes, Times | |
| - Lightweight tokenizer suitable for low-resource devices | |
| - Compact vocabulary size for efficient inference | |
| --- | |
| ## Evaluation | |
| ### Testing Data, Factors & Metrics | |
| #### Testing Data | |
| Internal conversational Bengali-English text samples were used for qualitative evaluation. | |
| #### Factors | |
| Evaluation focused on: | |
| - Bengali Unicode support | |
| - Mixed-language tokenization | |
| - Banglish handling | |
| - Conversational token quality | |
| #### Metrics | |
| Evaluation relied primarily on qualitative inspection of the tokenization output and on reconstruction accuracy, that is, whether decoding the encoded IDs reproduces the input text. | |
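Reconstruction accuracy can be computed as the fraction of texts that survive an encode and decode round trip unchanged. The sketch below assumes exact string equality as the criterion, which the card does not specify. `reconstruction_accuracy` and the toy codec are hypothetical, and the codec is deliberately lossy (it lowercases input) so that the metric has a failure to detect.

```python
from typing import Callable, Iterable, List


def reconstruction_accuracy(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    texts: Iterable[str],
) -> float:
    """Fraction of texts for which decode(encode(text)) == text."""
    items = list(texts)
    if not items:
        return 0.0
    exact = sum(1 for text in items if decode(encode(text)) == text)
    return exact / len(items)


# Toy codec for illustration only: lowercases, then maps characters to code points.
def toy_encode(text: str) -> List[int]:
    return [ord(ch) for ch in text.lower()]


def toy_decode(ids: List[int]) -> str:
    return "".join(chr(i) for i in ids)


samples = ["আমি আজ বাইরে যাচ্ছি", "Ami aj baire jacchi", "hello"]
accuracy = reconstruction_accuracy(toy_encode, toy_decode, samples)
print(f"reconstruction accuracy: {accuracy:.2f}")
```

With the real tokenizer, `encode` could be `lambda t: tokenizer.encode(t, add_special_tokens=False)` and `decode` could be `tokenizer.decode`. Exact matching is strict: any whitespace or Unicode normalization applied by the tokenizer counts as a mismatch, so a looser comparison may be more appropriate in practice.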
| ### Results | |
| In these qualitative checks, the tokenizer handled Bengali, English, and Banglish conversational text and reconstructed the inspected samples. No quantitative benchmark results are reported. | |
| #### Summary | |
| Friday Tokenizer provides lightweight multilingual tokenization suitable for GPT-style language models and Bengali conversational AI applications. | |
| --- | |
| ## Model Examination | |
| Basic qualitative inspection was performed to verify token splitting and text reconstruction quality. | |
| --- | |
| ## Environmental Impact | |
| Carbon emissions were not formally tracked during tokenizer training. | |
| - **Hardware Type:** Consumer GPU / CPU | |
| - **Hours used:** Not recorded | |
| - **Cloud Provider:** Google Colab | |
| - **Compute Region:** Not specified | |
| - **Carbon Emitted:** Unknown | |
| --- | |
| ## Technical Specifications | |
| ### Model Architecture and Objective | |
| - Architecture: SentencePiece tokenizer | |
| - Objective: Multilingual subword tokenization for conversational AI | |
| ### Compute Infrastructure | |
| Training was performed using local and cloud-based environments. | |
| #### Hardware | |
| - Consumer-grade hardware | |
| - Google Colab environment | |
| #### Software | |
| - Python | |
| - SentencePiece | |
| - Hugging Face Transformers | |
| --- | |
| ## Citation | |
| ### BibTeX | |
| ```bibtex | |
| @misc{fridaytokenizer2026, | |
| title={Friday Tokenizer}, | |
| author={Debashish Roy}, | |
| year={2026}, | |
| publisher={Hugging Face}, | |
| howpublished={\url{https://huggingface.co/thedeba/friday-tokenizer}} | |
| } | |
| ``` | |
| ### APA | |
| Roy, D. (2026). *Friday Tokenizer*. Hugging Face. https://huggingface.co/thedeba/friday-tokenizer | |
| --- | |
| ## Glossary | |
| - **Banglish:** Bengali written using the Latin alphabet | |
| - **Subword Tokenization:** Splitting words into smaller units (subwords) drawn from a fixed vocabulary | |
| - **SentencePiece:** A language-independent tokenizer and text segmentation library | |
| --- | |
| ## More Information | |
| Friday Tokenizer is part of the broader Friday GPT ecosystem focused on building multilingual lightweight AI systems from scratch. | |
| --- | |
| ## Model Card Authors | |
| Debashish Roy | |
| --- | |
| ## Model Card Contact | |
| For questions or collaboration: | |
| - Hugging Face: https://huggingface.co/thedeba | |