Update README.md
README.md
CHANGED
@@ -32,18 +32,18 @@ Friday Tokenizer is a SentencePiece-based subword tokenizer designed for lightwe
 The tokenizer is optimized for conversational datasets, mixed Bengali-English text, and Banglish (Romanized Bengali) inputs.
 
 - **Developed by:** Debashish Roy
-- **Funded by
-- **Shared by
+- **Funded by:** Self-funded
+- **Shared by:** Debashish Roy
 - **Model type:** SentencePiece Tokenizer
 - **Language(s) (NLP):** Bengali, English, Banglish
 - **License:** Apache 2.0
-- **Finetuned from model
+- **Finetuned from model:** None (built from scratch)
 
 ### Model Sources [optional]
 
 - **Repository:** https://huggingface.co/thedeba/friday-tokenizer
-- **Paper
-- **Demo
+- **Paper:** Not available
+- **Demo:** Not available
 
 ---
 
@@ -59,7 +59,7 @@ This tokenizer is intended for:
 - Banglish text generation
 - Lightweight multilingual language models
 
-### Downstream Use
+### Downstream Use
 
 The tokenizer can be integrated into:
 
@@ -136,7 +136,7 @@ The tokenizer was trained using mixed multilingual conversational datasets inclu
 
 The tokenizer was trained from scratch using SentencePiece subword tokenization.
 
-#### Preprocessing
+#### Preprocessing
 
 - Unicode normalization
 - Text cleaning
@@ -148,7 +148,7 @@ The tokenizer was trained from scratch using SentencePiece subword tokenization.
 - **Vocabulary Size:** 32000
 - **Training regime:** SentencePiece subword training
 
-#### Speeds, Sizes, Times
+#### Speeds, Sizes, Times
 
 - Lightweight tokenizer suitable for low-resource devices
 - Compact vocabulary size for efficient inference
@@ -186,7 +186,7 @@ Friday Tokenizer provides lightweight multilingual tokenization suitable for GPT
 
 ---
 
-## Model Examination
+## Model Examination
 
 Basic qualitative inspection was performed to verify token splitting and text reconstruction quality.
 
@@ -204,7 +204,7 @@ Carbon emissions were not formally tracked during tokenizer training.
 
 ---
 
-## Technical Specifications
+## Technical Specifications
 
 ### Model Architecture and Objective
 
@@ -228,7 +228,7 @@ Training was performed using local and cloud-based environments.
 
 ---
 
-## Citation
+## Citation
 
 ### BibTeX
 
@@ -248,7 +248,7 @@ Roy, D. (2026). *Friday Tokenizer*. Hugging Face. https://huggingface.co/thedeba
 
 ---
 
-## Glossary
+## Glossary
 
 - **Banglish:** Bengali written using the Latin alphabet
 - **Subword Tokenization:** Splitting words into smaller meaningful units
@@ -256,13 +256,13 @@ Roy, D. (2026). *Friday Tokenizer*. Hugging Face. https://huggingface.co/thedeba
 
 ---
 
-## More Information
+## More Information
 
 Friday Tokenizer is part of the broader Friday GPT ecosystem focused on building multilingual lightweight AI systems from scratch.
 
 ---
 
-## Model Card Authors
+## Model Card Authors
 
 Debashish Roy
 
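The Preprocessing section of the README lists "Unicode normalization" and "Text cleaning" without saying how either was done. The following is a minimal sketch of what such a step might look like, not the author's actual pipeline. The function name `clean_text`, the choice of NFKC, and the specific cleaning rules (dropping control characters, collapsing whitespace) are all assumptions made for illustration.

```python
import re
import unicodedata


def clean_text(text: str, form: str = "NFKC") -> str:
    """Hypothetical preprocessing step: normalize Unicode, then tidy whitespace.

    Assumption: NFKC is used here only as an example normalization form;
    the README does not state which form the tokenizer was trained with.
    """
    # Unicode normalization (e.g. full-width Latin letters become ASCII).
    text = unicodedata.normalize(form, text)

    # Drop control characters (category "Cc") other than tab/newline/CR.
    # Format characters (category "Cf") such as ZWJ/ZWNJ are kept on purpose,
    # because they can change how Bengali conjuncts are rendered.
    text = "".join(
        ch for ch in text
        if unicodedata.category(ch) != "Cc" or ch in "\t\n\r"
    )

    # Collapse any run of whitespace into a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    # Mixed Bengali, Banglish, and English input with irregular spacing.
    print(clean_text("  ami   bhalo  achi \n আমি  ভালো  আছি   I am fine "))
```

Keeping the zero-width joiner and non-joiner is a deliberate choice in this sketch; a pipeline that strips all invisible characters would alter some Bengali text before it reaches the tokenizer.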