pandurangpatil committed on
Commit 3b8147b · verified · 1 Parent(s): 4c21ff7

Upload README.md with huggingface_hub

  1. README.md +113 -3
README.md CHANGED
@@ -1,3 +1,113 @@
- ---
- license: mit
- ---
+ ---
+ language:
+ - mr
+ license: mit
+ tags:
+ - tokenizer
+ - bpe
+ - marathi
+ - devanagari
+ library_name: tokenizers
+ ---
+
+ # Marathi BPE Tokenizer
+
+ A Byte Pair Encoding (BPE) tokenizer trained on Marathi text written in the Devanagari script.
+
+ ## Model Details
+
+ - **Model Type:** BPE Tokenizer
+ - **Language:** Marathi (mr)
+ - **Script:** Devanagari
+ - **Vocabulary Size:** 4845 tokens
+ - **Base Vocabulary:** 845 graphemes
+ - **Merge Operations:** 4000
+ - **License:** MIT
+
+ ## Training Details
+
+ The tokenizer was trained using a custom Byte Pair Encoding implementation optimized for the Devanagari script:
+
+ - **Starting Unit:** Unicode extended grapheme clusters (not bytes)
+ - **Training Corpus Size:** 92,627 characters
+ - **Compression Ratio (Grapheme):** 2.84x
+ - **Compression Ratio (Byte):** 12.30x
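
The training procedure summarized above can be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the names `train_bpe` and `merge_pair` and the toy corpus are hypothetical, the input is assumed to be already split into grapheme clusters, and the compression ratio is assumed to mean input units (graphemes or UTF-8 bytes) divided by output tokens.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(graphemes, num_merges):
    """Learn up to `num_merges` merges over a sequence of grapheme clusters."""
    # Base vocabulary: one ID per distinct grapheme, in sorted order.
    token_to_id = {g: i for i, g in enumerate(sorted(set(graphemes)))}
    ids = [token_to_id[g] for g in graphemes]
    merges = {}  # (left_id, right_id) -> merged_id
    next_id = len(token_to_id)
    for _ in range(num_merges):
        pair_counts = Counter(zip(ids, ids[1:]))
        if not pair_counts:
            break
        pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so another merge would not compress anything
        merges[pair] = next_id
        ids = merge_pair(ids, pair, next_id)
        next_id += 1
    return token_to_id, merges, ids


# Toy input, already split into graphemes (hypothetical data, not the real corpus).
graphemes = ["म", "रा", "ठी", " ", "म", "रा", "ठी"]
token_to_id, merges, ids = train_bpe(graphemes, num_merges=10)

vocab_size = len(token_to_id) + len(merges)  # base graphemes + learned merges
grapheme_ratio = len(graphemes) / len(ids)
byte_ratio = len("".join(graphemes).encode("utf-8")) / len(ids)
print(vocab_size, round(grapheme_ratio, 2), round(byte_ratio, 2))
```

The same arithmetic applied to the figures reported above gives 845 base graphemes + 4000 merges = 4845 tokens.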
+
+ ## Usage
+
+ ```python
+ from tokenizers import Tokenizer
+
+ # Load the tokenizer
+ tokenizer = Tokenizer.from_pretrained("pandurangpatil/sample-marathi-bpe-tokenizer")
+
+ # Encode text
+ text = "नमस्कार! हे एक मराठी टोकनायझर आहे."
+ encoded = tokenizer.encode(text)
+ print(f"Token IDs: {encoded.ids}")
+ print(f"Tokens: {encoded.tokens}")
+
+ # Decode back to text
+ decoded = tokenizer.decode(encoded.ids)
+ print(f"Decoded: {decoded}")
+ ```
+
+ ### Using with Custom Scripts
+
+ To work with the raw artifacts directly (this requires the `tokenizer_utils` module imported below):
+
+ ```python
+ import json
+ from tokenizer_utils import encode, decode
+
+ # Load artifacts
+ with open('vocab.json', 'r', encoding='utf-8') as f:
+     token_to_id = json.load(f)
+
+ with open('merges.json', 'r', encoding='utf-8') as f:
+     merges_str = json.load(f)
+ merges = {tuple(map(int, k.split(','))): v for k, v in merges_str.items()}
+
+ with open('id_to_token.json', 'r', encoding='utf-8') as f:
+     id_to_token = {int(k): v for k, v in json.load(f).items()}
+
+ # Encode and decode
+ text = "मराठी मजकूर"
+ token_ids = encode(merges, token_to_id, text)
+ reconstructed = decode(id_to_token, token_ids)
+ ```
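
The source of `tokenizer_utils` is not shown here, so the following is only a guess at what `encode` and `decode` might do with these artifacts. It assumes that `merges` maps a pair of token IDs to the merged token ID (as the loading code above suggests), that lower merged IDs were learned earlier and should be applied first, and that grapheme segmentation can be approximated by attaching combining marks to the preceding character. The names `encode_sketch`, `decode_sketch`, and `split_graphemes`, and the tiny in-memory vocabulary, are hypothetical.

```python
import unicodedata


def split_graphemes(text):
    """Approximate segmentation: attach combining marks to the preceding character."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def encode_sketch(merges, token_to_id, text):
    # Raises KeyError for graphemes missing from the vocabulary (there is no UNK token).
    ids = [token_to_id[g] for g in split_graphemes(text)]
    while len(ids) >= 2:
        # Adjacent pairs that have a merge rule; apply the earliest-learned one first.
        candidates = [p for p in zip(ids, ids[1:]) if p in merges]
        if not candidates:
            break
        pair = min(candidates, key=merges.get)
        new_id = merges[pair]
        merged, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                merged.append(new_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
    return ids


def decode_sketch(id_to_token, token_ids):
    # Decoding is concatenation of the token strings.
    return "".join(id_to_token[i] for i in token_ids)


# Tiny hypothetical artifacts, standing in for vocab.json / merges.json / id_to_token.json.
token_to_id = {"म": 0, "रा": 1, "ठी": 2, "मरा": 3, "मराठी": 4}
merges = {(0, 1): 3, (3, 2): 4}
id_to_token = {i: t for t, i in token_to_id.items()}

text = "मराठी"
token_ids = encode_sketch(merges, token_to_id, text)
reconstructed = decode_sketch(id_to_token, token_ids)
print(token_ids, reconstructed == text)
```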
+
+ ## Grapheme-based Approach
+
+ Unlike traditional byte-level BPE, this tokenizer:
+ - Starts with **Unicode grapheme clusters** as base units
+ - Properly handles Devanagari combining characters (matras, virama)
+ - Maintains linguistic meaning at the subword level
+ - Achieves better compression for Devanagari text
+
+ Example of grapheme segmentation:
+ - नमस्कार → [न, म, स्, का, र] (graphemes)
+ - Each grapheme preserves visual/phonetic integrity
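
The segmentation in the example above can be reproduced with a small standard-library sketch. It is an approximation, not a full implementation of the Unicode extended grapheme cluster rules: it simply attaches every combining mark (general categories `Mn`, `Mc`, `Me`) to the character before it, which is enough for this word. The helper name `approx_graphemes` is hypothetical, and the tokenizer's actual segmentation code may differ.

```python
import unicodedata


def approx_graphemes(text):
    """Group each base character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch  # matra, virama, etc. stay with their base character
        else:
            clusters.append(ch)
    return clusters


word = "नमस्कार"
clusters = approx_graphemes(word)
print(clusters)  # ['न', 'म', 'स्', 'का', 'र']

# The same word measured in three units: UTF-8 bytes, code points, graphemes.
print(len(word.encode("utf-8")), len(word), len(clusters))  # 21 7 5
```

Starting from 5 graphemes instead of 21 bytes means no merge operations are spent on reassembling individual characters.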
+
+ ## Limitations
+
+ - Trained on a small corpus (92,627 characters)
+ - May not generalize well to domains outside the training data
+ - Does not include special tokens for ML models (PAD, UNK, BOS, EOS)
+ - Designed for tokenization research and experimentation
+
+ ## Citation
+
+ ```bibtex
+ @misc{marathi-bpe-tokenizer,
+   author = {Your Name},
+   title = {Marathi BPE Tokenizer},
+   year = {2025},
+   publisher = {HuggingFace},
+   url = {https://huggingface.co/pandurangpatil/sample-marathi-bpe-tokenizer}
+ }
+ ```
+
+ ## License
+
+ MIT License - see LICENSE file for details.