---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_17_0
language:
- uz
base_model:
- FacebookAI/xlm-roberta-base
---

# Tokenizer for Uzbek Language

## Introduction
This tokenizer is based on data from the Mozilla Common Voice dataset: 130,000 sentences drawn from the train and validated splits.

## Features
- Splits text into tokens.
- Supports less common pronunciations and accents.

## Installation
Python and the required libraries:
```
pip install transformers datasets
```

## Usage
```python
from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("jamshidahmadov/uz_tokenizer")

# Example Uzbek sentence: "Various NLP projects are being built in Uzbekistan"
text = "O'zbekistonda turli xil NLP loyihalari qurilmoqda"
tokens = tokenizer.tokenize(text)  # list of subword token strings
print(tokens)
```
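
Beyond `tokenize`, it can be useful to look at the integer IDs a model would consume and to check what decoding gives back. The following is a minimal sketch, not part of the original card: `encode_and_decode` and `demo` are hypothetical helper names, and the code assumes the tokenizer follows the standard `transformers` call and `decode` interface.

```python
def encode_and_decode(tokenizer, text):
    """Return (input_ids, decoded_text) for a single string."""
    # Calling a transformers tokenizer returns a dict-like object
    # that holds the token IDs under the "input_ids" key.
    input_ids = tokenizer(text)["input_ids"]
    # skip_special_tokens drops any special markers the tokenizer added.
    decoded = tokenizer.decode(input_ids, skip_special_tokens=True)
    return input_ids, decoded


def demo():
    # Imported here so the helper above stays usable without transformers.
    from transformers import AutoTokenizer

    # Downloads the tokenizer from the Hub on first use (needs network access).
    tokenizer = AutoTokenizer.from_pretrained("jamshidahmadov/uz_tokenizer")
    ids, decoded = encode_and_decode(
        tokenizer, "O'zbekistonda turli xil NLP loyihalari qurilmoqda"
    )
    print(ids)
    print(decoded)
```

Calling `demo()` runs the sketch against the published tokenizer. The exact IDs depend on the vocabulary, and whether the decoded text matches the input character for character depends on the tokenizer's normalization, so no expected output is shown.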

## Dataset Description
The Common Voice 17.0 dataset is multilingual and includes Uzbek.
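
As a rough illustration of how sentences from the train and validated splits could be merged into a single text corpus, here is a minimal sketch. It is an assumption rather than the procedure actually used for this tokenizer: `collect_sentences` is a hypothetical helper, the deduplication step is a guess, and the code assumes each row exposes its transcript under a `sentence` key. The in-memory lists stand in for the real splits.

```python
def collect_sentences(*splits, text_field="sentence"):
    """Merge several splits into one list of unique, non-empty sentences."""
    seen = set()
    sentences = []
    for split in splits:
        for row in split:
            # Treat a missing or None transcript as an empty string.
            sentence = (row.get(text_field) or "").strip()
            if sentence and sentence not in seen:
                seen.add(sentence)
                sentences.append(sentence)
    return sentences


# Tiny in-memory stand-ins for the real splits (illustrative data only).
train = [{"sentence": "Salom dunyo"}, {"sentence": "Bugun havo yaxshi"}]
validated = [
    {"sentence": "Salom dunyo"},       # duplicate of a train sentence
    {"sentence": "  "},                # blank, skipped
    {"sentence": "Kitob o'qiyapman"},
]

corpus = collect_sentences(train, validated)
print(len(corpus))  # 3 unique sentences
```

Any iterable of dict-like rows works here, so splits loaded with the `datasets` library could be passed in place of the lists.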

## Contact
[Jamshid Ahmadov](https://www.linkedin.com/in/jamshid-ds)