Uzbek
Jamshid Ahmadov committed on
Commit 0e25312 · verified · 1 Parent(s): b050f80

Update README.md

Files changed (1)
  1. README.md +36 -3
README.md CHANGED
@@ -1,3 +1,36 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ ---
+
+ # Tokenizer for Common Voice Dataset
+
+ ## Introduction
+ This tokenizer is based on the Mozilla Common Voice dataset, using 130,000 sentences from the train and validated splits.
+
+ ## Features
+ - Splits text into tokens.
+ - Supports less common pronunciations and accents.
+
+ ## Installation
+ Python and the required libraries:
+ ```
+ pip install transformers datasets
+ ```
+
+ ## Usage
+ ```python
+ from transformers import AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("jamshidahmadov/uz_tokenizer")  # load the tokenizer from the Hugging Face Hub
+
+ text = "O'zbekistonda turli xil NLP loyihalari qurilmoqda"  # "Various NLP projects are being built in Uzbekistan"
+ tokens = tokenizer.tokenize(text)  # split the sentence into a list of token strings
+ print(tokens)
+ ```
+
+ ## Dataset Description
+ The Common Voice 17.0 dataset is multilingual and also supports Uzbek.
+
+ ## Contact
+ [Jamshid Ahmadov](https://www.linkedin.com/in/jamshid-ds)
+
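
As a follow-up to the Usage snippet in this README, here is a minimal sketch of counting tokens per sentence. `WhitespaceTokenizer` and `token_counts` are hypothetical names introduced for illustration and are not part of the repository. The stand-in only splits on whitespace so the sketch runs offline. The second sentence is an added example, not taken from the README.

```python
from typing import Iterable, List, Tuple


class WhitespaceTokenizer:
    """Hypothetical stand-in so the sketch runs offline; not the real tokenizer."""

    def tokenize(self, text: str) -> List[str]:
        # Split on runs of whitespace; a trained tokenizer would produce subword pieces.
        return text.split()


def token_counts(tokenizer, texts: Iterable[str]) -> List[Tuple[str, int]]:
    """Tokenize each text and return (text, number_of_tokens) pairs.

    `tokenizer` is assumed to be any object with a `tokenize(str) -> list` method.
    """
    return [(text, len(tokenizer.tokenize(text))) for text in texts]


sentences = [
    "O'zbekistonda turli xil NLP loyihalari qurilmoqda",  # sentence from the README
    "Salom dunyo",  # added example sentence ("Hello world")
]

# With the stand-in, the counts are plain whitespace word counts.
for text, n in token_counts(WhitespaceTokenizer(), sentences):
    print(f"{n:>3}  {text}")
```

The tokenizer loaded with `AutoTokenizer.from_pretrained` in the Usage section should be usable in place of the stand-in, since it exposes a `tokenize` method as well. Its counts will generally differ from the whitespace counts shown here.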