Model Card for Friday Tokenizer
Friday Tokenizer is a custom multilingual tokenizer built completely from scratch for Bengali, English, and Banglish conversational AI systems.
Model Details
Model Description
Friday Tokenizer is a SentencePiece-based subword tokenizer designed for lightweight GPT-style language models and conversational AI applications. It was developed as part of the Friday GPT project to support Bengali and multilingual NLP without relying on existing pre-trained tokenizers.
The tokenizer is optimized for conversational datasets, mixed Bengali-English text, and Banglish (Romanized Bengali) inputs.
- Developed by: Debashish Roy
- Funded by: Self-funded
- Shared by: Debashish Roy
- Model type: SentencePiece Tokenizer
- Language(s) (NLP): Bengali, English, Banglish
- License: Apache 2.0
- Finetuned from model: None (built from scratch)
Model Sources
- Repository: https://huggingface.co/thedeba/friday-tokenizer
- Paper: Not available
- Demo: Not available
Uses
Direct Use
This tokenizer is intended for:
- GPT-style decoder-only language models
- Conversational AI systems
- Bengali NLP experiments
- Banglish text generation
- Lightweight multilingual language models
Downstream Use
The tokenizer can be integrated into:
- Chatbots
- Language generation systems
- Translation systems
- Bengali AI assistants
- Custom transformer training pipelines
Out-of-Scope Use
This tokenizer is not optimized for:
- Formal literary Bengali
- Legal or medical NLP applications
- High-precision linguistic analysis
- Production-scale multilingual systems without further evaluation
Bias, Risks, and Limitations
The tokenizer was trained primarily on conversational and subtitle-style datasets. As a result:
- Informal language patterns may be overrepresented
- Rare words may split aggressively
- Banglish spelling inconsistencies may affect tokenization quality
- Dataset biases from subtitle and internet conversations may exist
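One rough way to see how aggressively words are split is to count subword tokens per whitespace-separated word. The sketch below is illustrative only: `tokens_per_word` and `toy_tokenize` are hypothetical names, and `toy_tokenize` is a stand-in so the example runs offline. With the real tokenizer, one would pass `tokenizer.tokenize` instead.

```python
from typing import Callable, Iterable, List


def tokens_per_word(tokenize: Callable[[str], List[str]], texts: Iterable[str]) -> float:
    """Average number of subword tokens per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_words += len(text.split())
        total_tokens += len(tokenize(text))
    if total_words == 0:
        raise ValueError("no words found in texts")
    return total_tokens / total_words


# Hypothetical stand-in tokenizer: cuts each word into chunks of at most
# three characters. Replace with tokenizer.tokenize for a real measurement.
def toy_tokenize(text: str) -> List[str]:
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return pieces


samples = ["ami aj baire jacchi", "I am going out today"]
print(round(tokens_per_word(toy_tokenize, samples), 2))
```

A higher ratio on domain text than on conversational text would suggest that the vocabulary covers that domain poorly.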
Recommendations
Users should evaluate tokenizer performance on their own data before deploying it in sensitive or production environments. Retraining on domain text or expanding the vocabulary may improve performance for specialized domains.
How to Get Started with the Model
Use the code below to get started with the tokenizer.
from transformers import AutoTokenizer

# Load the slow (Python) tokenizer from the Hub
tokenizer = AutoTokenizer.from_pretrained(
    "thedeba/friday-tokenizer",
    use_fast=False
)

text = "আমি আজ বাইরে যাচ্ছি"

# Subword pieces and their vocabulary ids
tokens = tokenizer.tokenize(text)
ids = tokenizer.encode(text)
print(tokens)
print(ids)

# Map the ids back to text
decoded = tokenizer.decode(ids)
print(decoded)
Training Details
Training Data
The tokenizer was trained using mixed multilingual conversational datasets including:
- OpenSubtitles
- Bengali conversational text
- Bengali-English mixed text
- Banglish datasets
Training Procedure
The tokenizer was trained from scratch using SentencePiece subword tokenization.
Preprocessing
- Unicode normalization
- Text cleaning
- Duplicate filtering
- Mixed-language corpus preparation
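The preprocessing steps listed above could look roughly like the following. This is a minimal sketch, not the pipeline actually used: the card does not name the normalization form (NFC is assumed here), and `clean_line` and `prepare_corpus` are hypothetical helper names.

```python
import re
import unicodedata
from typing import Iterable, List


def clean_line(line: str) -> str:
    # Assumption: NFC normalization; the card does not state which form was used.
    line = unicodedata.normalize("NFC", line)
    # Drop control characters (category Cc) except tab and newline. Format
    # characters (category Cf) such as ZWJ/ZWNJ are kept.
    line = "".join(
        ch for ch in line
        if unicodedata.category(ch) != "Cc" or ch in "\t\n"
    )
    # Collapse runs of whitespace into a single space.
    return re.sub(r"\s+", " ", line).strip()


def prepare_corpus(lines: Iterable[str]) -> List[str]:
    """Clean each line, then drop empty lines and exact duplicates."""
    seen = set()
    cleaned: List[str] = []
    for raw in lines:
        line = clean_line(raw)
        if line and line not in seen:
            seen.add(line)
            cleaned.append(line)
    return cleaned


corpus = prepare_corpus([
    "আমি  আজ\tবাইরে যাচ্ছি ",
    "আমি আজ বাইরে যাচ্ছি",
    "",
    "ami aj   baire jacchi",
])
print(len(corpus))
```

Deduplication here is exact-match after cleaning; near-duplicate filtering would need a separate step.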
Training Hyperparameters
- Vocabulary Size: 32000
- Training regime: SentencePiece subword training
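For readers who want to train a comparable tokenizer, a hedged sketch of a SentencePiece training call follows. Only the vocabulary size of 32000 comes from this card. The model type, character coverage, file names, and the helper names `build_trainer_kwargs` and `train_tokenizer` are assumptions chosen for illustration.

```python
from typing import Any, Dict


def build_trainer_kwargs(corpus_path: str, model_prefix: str) -> Dict[str, Any]:
    """Keyword arguments for sentencepiece.SentencePieceTrainer.train."""
    return {
        "input": corpus_path,           # one sentence per line
        "model_prefix": model_prefix,   # writes <prefix>.model and <prefix>.vocab
        "vocab_size": 32000,            # value stated in this card
        "model_type": "unigram",        # assumption: the SentencePiece default
        "character_coverage": 0.9995,   # assumption: the SentencePiece default
    }


def train_tokenizer(corpus_path: str, model_prefix: str) -> None:
    # Imported lazily so the config can be inspected without the package.
    import sentencepiece as spm

    spm.SentencePieceTrainer.train(**build_trainer_kwargs(corpus_path, model_prefix))


# Hypothetical file names, used only to show the resulting configuration.
kwargs = build_trainer_kwargs("corpus.txt", "friday")
print(kwargs)
```

Calling `train_tokenizer("corpus.txt", "friday")` would require the `sentencepiece` package and a corpus large enough to support the requested vocabulary size.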
Speeds, Sizes, Times
- Lightweight tokenizer intended for low-resource devices
- Compact 32,000-entry vocabulary, which keeps model embedding tables small for efficient inference
Evaluation
Testing Data, Factors & Metrics
Testing Data
Internal conversational Bengali-English text samples were used for qualitative evaluation.
Factors
Evaluation focused on:
- Bengali Unicode support
- Mixed-language tokenization
- Banglish handling
- Conversational token quality
Metrics
Qualitative tokenization inspection and reconstruction accuracy were primarily used.
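Reconstruction accuracy can be estimated as the share of texts that survive an encode and decode round trip unchanged. The sketch below assumes exact string match as the criterion, which the card does not specify. `reconstruction_accuracy`, `toy_encode`, and `toy_decode` are hypothetical names, and the byte-level stand-in exists only so the example runs offline.

```python
from typing import Callable, Iterable, List


def reconstruction_accuracy(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    texts: Iterable[str],
) -> float:
    """Fraction of texts for which decode(encode(text)) == text."""
    items = list(texts)
    if not items:
        raise ValueError("texts is empty")
    exact = sum(1 for text in items if decode(encode(text)) == text)
    return exact / len(items)


# Hypothetical stand-in codec (UTF-8 bytes), lossless by construction.
def toy_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def toy_decode(ids: List[int]) -> str:
    return bytes(ids).decode("utf-8")


samples = ["আমি আজ বাইরে যাচ্ছি", "ami aj baire jacchi"]
print(reconstruction_accuracy(toy_encode, toy_decode, samples))
```

With the real tokenizer, one might pass `tokenizer.encode` and `lambda ids: tokenizer.decode(ids, skip_special_tokens=True)`. Exact match is a strict criterion, since tokenizer-side normalization can change whitespace even when the text is otherwise preserved.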
Results
In qualitative inspection, the tokenizer segmented Bengali, English, and Banglish conversational text into subwords and reconstructed the inputs on decoding. No quantitative benchmark results are reported.
Summary
Friday Tokenizer provides lightweight multilingual tokenization suitable for GPT-style language models and Bengali conversational AI applications.
Model Examination
Basic qualitative inspection was performed to verify token splitting and text reconstruction quality.
Environmental Impact
Carbon emissions were not formally tracked during tokenizer training.
- Hardware Type: Consumer GPU / CPU
- Hours used: Not recorded
- Cloud Provider: Google Colab
- Compute Region: Not specified
- Carbon Emitted: Unknown
Technical Specifications
Model Architecture and Objective
- Architecture: SentencePiece tokenizer
- Objective: Multilingual subword tokenization for conversational AI
Compute Infrastructure
Training was performed using local and cloud-based environments.
Hardware
- Consumer-grade hardware
- Google Colab environment
Software
- Python
- SentencePiece
- Hugging Face Transformers
Citation
BibTeX
@misc{fridaytokenizer2026,
title={Friday Tokenizer},
author={Debashish Roy},
year={2026},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/thedeba/friday-tokenizer}}
}
APA
Roy, D. (2026). Friday Tokenizer. Hugging Face. https://huggingface.co/thedeba/friday-tokenizer
Glossary
- Banglish: Bengali written using the Latin alphabet
- Subword Tokenization: Splitting words into smaller meaningful units
- SentencePiece: A language-independent tokenizer and text segmentation library
More Information
Friday Tokenizer is part of the broader Friday GPT ecosystem focused on building multilingual lightweight AI systems from scratch.
Model Card Authors
Debashish Roy
Model Card Contact
For questions or collaboration:
- Hugging Face: https://huggingface.co/thedeba