language: uz
license: apache-2.0
tags:
  - uzbek
  - pos-tagging
  - universal-dependencies
  - nlp
datasets:
  - universal_dependencies
metrics:
  - accuracy
  - f1

Uzbek POS Tagger

This model predicts Universal Dependencies part-of-speech (POS) tags for Uzbek text.

Model details

The model was fine-tuned on a Universal Dependencies treebank containing approximately 600 annotated sentences. It is based on the XLM-RoBERTa base model and adapted for token classification.

Usage

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Arofat/uzbek-pos-tagger")
model = AutoModelForTokenClassification.from_pretrained("Arofat/uzbek-pos-tagger")

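# Prepare text. The tokenizer call below expects a list of words; note that a
# plain whitespace split leaves punctuation attached to a word ("yashayman.").
text = "Men O'zbekistonda yashayman."
tokens = text.split()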

# Get predictions
inputs = tokenizer(tokens, is_split_into_words=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

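# Pick the highest-scoring label id at every subword position
predictions = torch.argmax(outputs.logits, dim=2)
id2label = model.config.id2label

# Keep one tag per word: use the prediction at each word's first subword,
# skipping special tokens (word_id is None) and continuation subwords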
pos_tags = []
word_ids = inputs.word_ids(batch_index=0)
prev_word_id = None
for idx, word_id in enumerate(word_ids):
    if word_id is None or word_id == prev_word_id:
        continue
    pos_tags.append(id2label[predictions[0, idx].item()])
    prev_word_id = word_id

# Print results
for token, tag in zip(tokens, pos_tags):
    print(f"{token}: {tag}")
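
Universal Dependencies treebanks generally treat punctuation marks as separate tokens, while `text.split()` keeps them glued to the neighbouring word. The sketch below is one possible pre-tokenizer that splits punctuation off but keeps word-internal apostrophes, as in "O'zbekistonda". The function name `pretokenize` and the regular expression are illustrative assumptions, not part of this model or of the treebank's official tokenization, and cases such as decimal numbers or hyphenated words are not handled specially.

```python
import re

# Hypothetical helper (not shipped with the model): a word is a run of word
# characters, optionally continued across apostrophe-like characters that are
# followed by more word characters. Any other non-space character becomes
# its own token.
_TOKEN_RE = re.compile(r"\w+(?:['’ʻʼ]\w+)*|[^\w\s]")


def pretokenize(text):
    """Split text into words and standalone punctuation marks."""
    return _TOKEN_RE.findall(text)


tokens = pretokenize("Men O'zbekistonda yashayman.")
print(tokens)
```

The resulting list can be used in place of `text.split()` in the example above, so that the final period receives its own tag.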

Limitations

This model was fine-tuned on a small dataset (approximately 600 sentences), so it may not generalize well to domains or genres of Uzbek text that the treebank does not represent.
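
One way to gauge how well the tagger transfers to a new domain is to hand-annotate a small sample and compare it with the model's output. The following is a minimal sketch of token-level accuracy plus a count of the most frequent confusions. The function names and the example tags are hypothetical and only for illustration; they are not results from this model.

```python
from collections import Counter


def token_accuracy(gold_tags, predicted_tags):
    """Return the fraction of positions where the two tag lists agree."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have equal length")
    if not gold_tags:
        return 0.0
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


def confusions(gold_tags, predicted_tags):
    """Count (gold, predicted) pairs for the positions that disagree."""
    return Counter(
        (g, p) for g, p in zip(gold_tags, predicted_tags) if g != p
    )


# Made-up tags for a four-token sentence, for illustration only
gold = ["PRON", "PROPN", "VERB", "PUNCT"]
pred = ["PRON", "NOUN", "VERB", "PUNCT"]

print(token_accuracy(gold, pred))  # 0.75
print(confusions(gold, pred))
```

The gold and predicted sequences must use the same tokenization for the comparison to be meaningful.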