thedeba/friday-tokenizer

@@ -32,18 +32,18 @@ Friday Tokenizer is a SentencePiece-based subword tokenizer designed for lightwe
 The tokenizer is optimized for conversational datasets, mixed Bengali-English text, and Banglish (Romanized Bengali) inputs.
 - **Developed by:** Debashish Roy
-- **Funded by [optional]:** Self-funded
-- **Shared by [optional]:** Debashish Roy
 - **Model type:** SentencePiece Tokenizer
 - **Language(s) (NLP):** Bengali, English, Banglish
 - **License:** Apache 2.0
-- **Finetuned from model [optional]:** None (built from scratch)
 ### Model Sources [optional]
 - **Repository:** https://huggingface.co/thedeba/friday-tokenizer
-- **Paper [optional]:** Not available
-- **Demo [optional]:** Not available
 ---
@@ -59,7 +59,7 @@ This tokenizer is intended for:
 - Banglish text generation
 - Lightweight multilingual language models
-### Downstream Use [optional]
 The tokenizer can be integrated into:
@@ -136,7 +136,7 @@ The tokenizer was trained using mixed multilingual conversational datasets inclu
 The tokenizer was trained from scratch using SentencePiece subword tokenization.
-#### Preprocessing [optional]
 - Unicode normalization
 - Text cleaning
@@ -148,7 +148,7 @@ The tokenizer was trained from scratch using SentencePiece subword tokenization.
 - **Vocabulary Size:** 32000
 - **Training regime:** SentencePiece subword training
-#### Speeds, Sizes, Times [optional]
 - Lightweight tokenizer suitable for low-resource devices
 - Compact vocabulary size for efficient inference
@@ -186,7 +186,7 @@ Friday Tokenizer provides lightweight multilingual tokenization suitable for GPT
 ---
-## Model Examination [optional]
 Basic qualitative inspection was performed to verify token splitting and text reconstruction quality.
@@ -204,7 +204,7 @@ Carbon emissions were not formally tracked during tokenizer training.
 ---
-## Technical Specifications [optional]
 ### Model Architecture and Objective
@@ -228,7 +228,7 @@ Training was performed using local and cloud-based environments.
 ---
-## Citation [optional]
 ### BibTeX
@@ -248,7 +248,7 @@ Roy, D. (2026). *Friday Tokenizer*. Hugging Face. https://huggingface.co/thedeba
 ---
-## Glossary [optional]
 - **Banglish:** Bengali written using the Latin alphabet
 - **Subword Tokenization:** Splitting words into smaller meaningful units
@@ -256,13 +256,13 @@ Roy, D. (2026). *Friday Tokenizer*. Hugging Face. https://huggingface.co/thedeba
 ---
-## More Information [optional]
 Friday Tokenizer is part of the broader Friday GPT ecosystem focused on building multilingual lightweight AI systems from scratch.
 ---
-## Model Card Authors [optional]
 Debashish Roy

 The tokenizer is optimized for conversational datasets, mixed Bengali-English text, and Banglish (Romanized Bengali) inputs.
 - **Developed by:** Debashish Roy
+- **Funded by:** Self-funded
+- **Shared by:** Debashish Roy
 - **Model type:** SentencePiece Tokenizer
 - **Language(s) (NLP):** Bengali, English, Banglish
 - **License:** Apache 2.0
+- **Finetuned from model:** None (built from scratch)
 ### Model Sources [optional]
 - **Repository:** https://huggingface.co/thedeba/friday-tokenizer
+- **Paper:** Not available
+- **Demo:** Not available
 ---
 - Banglish text generation
 - Lightweight multilingual language models
+### Downstream Use
 The tokenizer can be integrated into:
 The tokenizer was trained from scratch using SentencePiece subword tokenization.
+#### Preprocessing
 - Unicode normalization
 - Text cleaning
 - **Vocabulary Size:** 32000
 - **Training regime:** SentencePiece subword training
+#### Speeds, Sizes, Times
 - Lightweight tokenizer suitable for low-resource devices
 - Compact vocabulary size for efficient inference
 ---
+## Model Examination
 Basic qualitative inspection was performed to verify token splitting and text reconstruction quality.
 ---
+## Technical Specifications
 ### Model Architecture and Objective
 ---
+## Citation
 ### BibTeX
 ---
+## Glossary
 - **Banglish:** Bengali written using the Latin alphabet
 - **Subword Tokenization:** Splitting words into smaller meaningful units
 ---
+## More Information
 Friday Tokenizer is part of the broader Friday GPT ecosystem focused on building multilingual lightweight AI systems from scratch.
 ---
+## Model Card Authors
 Debashish Roy
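
The Preprocessing section touched by this diff lists only "Unicode normalization" and "Text cleaning" without further detail. The following is a minimal sketch of what such a step might look like, not the author's actual pipeline. The function name `preprocess`, the choice of NFKC as the normalization form, the removal of control characters, and the whitespace collapsing are all assumptions made for illustration.

```python
import re
import unicodedata


def preprocess(text: str, form: str = "NFKC") -> str:
    """Hypothetical cleaning step for mixed Bengali/English/Banglish text."""
    # Unicode normalization (assumed form: NFKC), e.g. fullwidth Latin -> ASCII.
    text = unicodedata.normalize(form, text)
    # Text cleaning (assumed): drop non-whitespace control characters (category Cc).
    # Format characters (category Cf) such as ZWJ/ZWNJ are kept on purpose,
    # because they can be meaningful in Bengali script.
    text = "".join(
        ch for ch in text
        if ch.isspace() or unicodedata.category(ch) != "Cc"
    )
    # Collapse runs of whitespace into a single space and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(preprocess("  ami\tbhalo   achi\x00 "))
```

A step like this would typically run on each line of the training corpus before it is handed to the SentencePiece trainer, and the same function would be applied at inference time so that training and inference inputs stay consistent.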