thedeba committed on
Commit 587b151 · verified · 1 Parent(s): 1b523da

Update README.md

Files changed (1)
  1. README.md +14 -14
README.md CHANGED
@@ -32,18 +32,18 @@ Friday Tokenizer is a SentencePiece-based subword tokenizer designed for lightwe
32
  The tokenizer is optimized for conversational datasets, mixed Bengali-English text, and Banglish (Romanized Bengali) inputs.
33
 
34
  - **Developed by:** Debashish Roy
35
- - **Funded by [optional]:** Self-funded
36
- - **Shared by [optional]:** Debashish Roy
37
  - **Model type:** SentencePiece Tokenizer
38
  - **Language(s) (NLP):** Bengali, English, Banglish
39
  - **License:** Apache 2.0
40
- - **Finetuned from model [optional]:** None (built from scratch)
41
 
42
  ### Model Sources [optional]
43
 
44
  - **Repository:** https://huggingface.co/thedeba/friday-tokenizer
45
- - **Paper [optional]:** Not available
46
- - **Demo [optional]:** Not available
47
 
48
  ---
49
 
@@ -59,7 +59,7 @@ This tokenizer is intended for:
59
  - Banglish text generation
60
  - Lightweight multilingual language models
61
 
62
- ### Downstream Use [optional]
63
 
64
  The tokenizer can be integrated into:
65
 
@@ -136,7 +136,7 @@ The tokenizer was trained using mixed multilingual conversational datasets inclu
136
 
137
  The tokenizer was trained from scratch using SentencePiece subword tokenization.
138
 
139
- #### Preprocessing [optional]
140
 
141
  - Unicode normalization
142
  - Text cleaning
@@ -148,7 +148,7 @@ The tokenizer was trained from scratch using SentencePiece subword tokenization.
148
  - **Vocabulary Size:** 32000
149
  - **Training regime:** SentencePiece subword training
150
 
151
- #### Speeds, Sizes, Times [optional]
152
 
153
  - Lightweight tokenizer suitable for low-resource devices
154
  - Compact vocabulary size for efficient inference
@@ -186,7 +186,7 @@ Friday Tokenizer provides lightweight multilingual tokenization suitable for GPT
186
 
187
  ---
188
 
189
- ## Model Examination [optional]
190
 
191
  Basic qualitative inspection was performed to verify token splitting and text reconstruction quality.
192
 
@@ -204,7 +204,7 @@ Carbon emissions were not formally tracked during tokenizer training.
204
 
205
  ---
206
 
207
- ## Technical Specifications [optional]
208
 
209
  ### Model Architecture and Objective
210
 
@@ -228,7 +228,7 @@ Training was performed using local and cloud-based environments.
228
 
229
  ---
230
 
231
- ## Citation [optional]
232
 
233
  ### BibTeX
234
 
@@ -248,7 +248,7 @@ Roy, D. (2026). *Friday Tokenizer*. Hugging Face. https://huggingface.co/thedeba
248
 
249
  ---
250
 
251
- ## Glossary [optional]
252
 
253
  - **Banglish:** Bengali written using the Latin alphabet
254
  - **Subword Tokenization:** Splitting words into smaller meaningful units
@@ -256,13 +256,13 @@ Roy, D. (2026). *Friday Tokenizer*. Hugging Face. https://huggingface.co/thedeba
256
 
257
  ---
258
 
259
- ## More Information [optional]
260
 
261
  Friday Tokenizer is part of the broader Friday GPT ecosystem focused on building multilingual lightweight AI systems from scratch.
262
 
263
  ---
264
 
265
- ## Model Card Authors [optional]
266
 
267
  Debashish Roy
268
 
 
32
  The tokenizer is optimized for conversational datasets, mixed Bengali-English text, and Banglish (Romanized Bengali) inputs.
33
 
34
  - **Developed by:** Debashish Roy
35
+ - **Funded by:** Self-funded
36
+ - **Shared by:** Debashish Roy
37
  - **Model type:** SentencePiece Tokenizer
38
  - **Language(s) (NLP):** Bengali, English, Banglish
39
  - **License:** Apache 2.0
40
+ - **Finetuned from model:** None (built from scratch)
41
 
42
  ### Model Sources [optional]
43
 
44
  - **Repository:** https://huggingface.co/thedeba/friday-tokenizer
45
+ - **Paper:** Not available
46
+ - **Demo:** Not available
47
 
48
  ---
49
 
 
59
  - Banglish text generation
60
  - Lightweight multilingual language models
61
 
62
+ ### Downstream Use
63
 
64
  The tokenizer can be integrated into:
65
 
 
136
 
137
  The tokenizer was trained from scratch using SentencePiece subword tokenization.
138
 
139
+ #### Preprocessing
140
 
141
  - Unicode normalization
142
  - Text cleaning
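
The two preprocessing steps visible in this hunk could look roughly like the sketch below. The README does not say which normalization form or cleaning rules were used, so the `preprocess` helper, the NFC default, and the specific cleaning rules (dropping zero-width spaces and control characters, collapsing whitespace) are all assumptions made for illustration, not the author's pipeline.

```python
import re
import unicodedata

# Assumed cleaning rules, not taken from the README.
# Zero-width space and BOM are removed; ZWNJ/ZWJ (U+200C/U+200D) are kept
# because Bengali orthography uses them to control conjunct formation.
_INVISIBLE = re.compile("[\u200b\ufeff]")
_WHITESPACE = re.compile(r"\s+")


def preprocess(text: str, form: str = "NFC") -> str:
    """Normalize Unicode, then apply light text cleaning (hypothetical rules)."""
    # Step 1: Unicode normalization, so equivalent sequences share one encoding.
    text = unicodedata.normalize(form, text)
    # Step 2: text cleaning. Drop invisible characters and control codes,
    # keeping newline and tab so they can be collapsed as whitespace below.
    text = _INVISIBLE.sub("", text)
    text = "".join(
        ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc"
    )
    # Collapse every whitespace run (including no-break spaces) to one space.
    return _WHITESPACE.sub(" ", text).strip()
```

Normalizing before training matters for mixed Bengali-English text because the same visible character can arrive as different code point sequences, which would otherwise split its frequency across several vocabulary entries.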
 
148
  - **Vocabulary Size:** 32000
149
  - **Training regime:** SentencePiece subword training
150
 
151
+ #### Speeds, Sizes, Times
152
 
153
  - Lightweight tokenizer suitable for low-resource devices
154
  - Compact vocabulary size for efficient inference
 
186
 
187
  ---
188
 
189
+ ## Model Examination
190
 
191
  Basic qualitative inspection was performed to verify token splitting and text reconstruction quality.
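
A reconstruction check of this kind can be sketched as a round-trip test: encode each sample, decode it again, and report any text that does not come back unchanged. The `failed_round_trips` helper and the toy encode/decode pair below are hypothetical stand-ins so the example runs without a trained model; they are not the procedure the author used.

```python
from typing import Callable, Iterable, List


def failed_round_trips(
    encode: Callable[[str], List[str]],
    decode: Callable[[List[str]], str],
    samples: Iterable[str],
) -> List[str]:
    """Return the samples whose decode(encode(text)) differs from text."""
    return [text for text in samples if decode(encode(text)) != text]


# Stand-in tokenizer (an assumption for this sketch): it marks word
# boundaries with U+2581, mimicking SentencePiece's convention.
def toy_encode(text: str) -> List[str]:
    return ["\u2581" + word for word in text.split(" ")]


def toy_decode(pieces: List[str]) -> str:
    # Join the pieces, restore spaces, and drop the leading boundary marker.
    return "".join(pieces).replace("\u2581", " ")[1:]


samples = ["ami valo achi", "আমি ভালো আছি", "double  space"]
print(failed_round_trips(toy_encode, toy_decode, samples))
```

With the trained tokenizer, the same helper could presumably be given the `encode` and `decode` methods of a loaded `sentencepiece.SentencePieceProcessor`. Note that SentencePiece's default normalization may legitimately alter some inputs (for example, collapsing repeated spaces), so a reported mismatch is a prompt for inspection rather than proof of a defect.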
192
 
 
204
 
205
  ---
206
 
207
+ ## Technical Specifications
208
 
209
  ### Model Architecture and Objective
210
 
 
228
 
229
  ---
230
 
231
+ ## Citation
232
 
233
  ### BibTeX
234
 
 
248
 
249
  ---
250
 
251
+ ## Glossary
252
 
253
  - **Banglish:** Bengali written using the Latin alphabet
254
  - **Subword Tokenization:** Splitting words into smaller meaningful units
 
256
 
257
  ---
258
 
259
+ ## More Information
260
 
261
  Friday Tokenizer is part of the broader Friday GPT ecosystem focused on building multilingual lightweight AI systems from scratch.
262
 
263
  ---
264
 
265
+ ## Model Card Authors
266
 
267
  Debashish Roy
268