---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_17_0
language:
- uz
base_model:
- FacebookAI/xlm-roberta-base
---
# Tokenizer for Uzbek Language
## Introduction
This tokenizer is built from the Mozilla Common Voice dataset, using 130,000 sentences from the train and validated splits.
## Features
- Splits text into tokens.
- Supports less common pronunciations and accents.
## Installation
Python and the required libraries:
```
pip install transformers datasets
```
## Usage
```python
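from transformers import AutoTokenizer

# Download the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("jamshidahmadov/uz_tokenizer")

# Uzbek sample: "Various NLP projects are being built in Uzbekistan"
text = "O'zbekistonda turli xil NLP loyihalari qurilmoqda"

# Split the sentence into subword tokens
tokens = tokenizer.tokenize(text)
print(tokens)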
```
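Beyond listing tokens, it is often useful to see the vocabulary IDs a model would receive and to check that they decode back to readable text. The following is a minimal sketch, not part of the original card: the helper name `token_id_roundtrip` is hypothetical, and it assumes the object passed in behaves like a standard `transformers` tokenizer exposing `tokenize`, `convert_tokens_to_ids`, and `decode`.
```python
def token_id_roundtrip(tokenizer, text):
    """Return the tokens, their vocabulary IDs, and the text decoded from those IDs.

    `tokenizer` is assumed to be a Hugging Face tokenizer, e.g. the one
    loaded in the Usage example above. This helper is illustrative only.
    """
    tokens = tokenizer.tokenize(text)              # subword strings
    ids = tokenizer.convert_tokens_to_ids(tokens)  # integer IDs, no special tokens added
    decoded = tokenizer.decode(ids)                # map the IDs back to a string
    return {"tokens": tokens, "ids": ids, "decoded": decoded}


# Example call (needs network access to download the tokenizer):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("jamshidahmadov/uz_tokenizer")
# print(token_id_roundtrip(tokenizer, "O'zbekistonda turli xil NLP loyihalari qurilmoqda"))
```
The decoded string may differ slightly from the input in whitespace or normalization, depending on how the tokenizer was configured.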
## Dataset Description
The Common Voice 17.0 dataset is multilingual and includes Uzbek.
## Contact
[Jamshid Ahmadov](https://www.linkedin.com/in/jamshid-ds)