---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_17_0
language:
- uz
base_model:
- FacebookAI/xlm-roberta-base
---

# Tokenizer for Uzbek Language

## Introduction
This tokenizer is based on data from the Mozilla Common Voice dataset: 130,000 sentences from the train and validated splits.

## Features
- Splits text into tokens.
- Supports less common pronunciations and accents.

## Installation
Install Python and the required libraries:
```
pip install transformers datasets
```

## Usage
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("jamshidahmadov/uz_tokenizer")

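# Sample Uzbek sentence: "Various NLP projects are being built in Uzbekistan"
text = "O'zbekistonda turli xil NLP loyihalari qurilmoqda"
# Split the sentence into subword tokens (strings, not IDs)
tokens = tokenizer.tokenize(text)
print(tokens)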
```
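
Models consume token IDs rather than token strings, so it can be useful to check that text survives an encode/decode round trip. The following is a minimal sketch, not part of the original card: `round_trip` and `load_uz_tokenizer` are hypothetical helper names, and the only library calls assumed are the standard `encode` and `decode` methods of a `transformers` tokenizer.

```python
def round_trip(tokenizer, text):
    """Encode text to token IDs, then decode the IDs back to a string."""
    # Skip special tokens in both directions so only the text itself is compared.
    ids = tokenizer.encode(text, add_special_tokens=False)
    decoded = tokenizer.decode(ids, skip_special_tokens=True)
    return ids, decoded


def load_uz_tokenizer():
    """Download the tokenizer from the Hugging Face Hub (needs network access)."""
    # Imported lazily so round_trip works with any object exposing encode/decode.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained("jamshidahmadov/uz_tokenizer")
```

Calling `round_trip(load_uz_tokenizer(), text)` returns the ID list and the decoded string. The decoded string may differ from the input in whitespace or normalization, depending on how the tokenizer was configured.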

## Dataset Description
The Common Voice 17.0 dataset is multilingual and includes Uzbek.
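
This card does not describe how the tokenizer was trained. As an illustration only, one plausible approach is to retrain the vocabulary of the base model's tokenizer on the Common Voice sentences. In the sketch below, `batch_iterator`, `train_uz_tokenizer`, and the `vocab_size` default are assumptions made for the example, and `sentences` is assumed to be a plain list of Uzbek strings (for instance, the transcript text of each Common Voice clip).

```python
def batch_iterator(sentences, batch_size=1000):
    """Yield successive lists of at most batch_size sentences."""
    for start in range(0, len(sentences), batch_size):
        yield sentences[start:start + batch_size]


def train_uz_tokenizer(sentences, vocab_size=32000):
    """Retrain the XLM-RoBERTa tokenizer's vocabulary on the given sentences.

    Assumption: vocab_size=32000 is a placeholder, not the value used
    for the published tokenizer.
    """
    # Imported lazily so batch_iterator can be used without transformers installed.
    from transformers import AutoTokenizer

    base = AutoTokenizer.from_pretrained("FacebookAI/xlm-roberta-base")
    # train_new_from_iterator keeps the tokenization pipeline of the base
    # tokenizer and learns a new vocabulary from the supplied text batches.
    return base.train_new_from_iterator(
        batch_iterator(sentences), vocab_size=vocab_size
    )
```

Feeding the text in batches keeps the training call from receiving one sentence at a time, which is the pattern the `transformers` documentation recommends for `train_new_from_iterator`.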

## Contact 
[Jamshid Ahmadov](https://www.linkedin.com/in/jamshid-ds)