---
language:
- ug
tags:
- token-classification
- punctuation-restoration
- asr
- nlp
license: apache-2.0
metrics:
- accuracy
model-index:
- name: Uyghur_ASR_Restore_Punctuation
  results: []
---

# Uyghur ASR Punctuation Restoration Model

This model is designed to restore punctuation to raw text, specifically targeting **Uyghur** language inputs. It is particularly useful for post-processing the output of Automatic Speech Recognition (ASR) systems, which typically generate text without punctuation.

## Model Description

- **Model ID:** `piyazon/Uyghur_ASR_Restore_Punctuation`
- **Task:** Token Classification (Punctuation Restoration)
- **Language:** Uyghur (ug)

The model predicts a punctuation mark for each token in the sequence, then appends the predicted mark to the end of the corresponding word using the label mapping below.

### Label Map

The model's output IDs correspond to the following punctuation marks:
| ID | Label | Description |
|:--:|:-----:|:------------|
| 0 | `0` | No punctuation |
| 1 | `.` | Period |
| 2 | `،` | Comma (Uyghur comma) |
| 3 | `؟` | Question mark (Uyghur question mark) |
| 4 | `-` | Hyphen |
| 5 | `:` | Colon |
| 6 | `؛` | Semicolon (Uyghur semicolon) |
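To illustrate how these labels attach to words, here is a minimal word-level sketch. The word list and predicted IDs are invented for the example (the real model predicts per subword token, as the inference script shows), and label `0` is rendered as an empty string for readability:

```python
# Hypothetical example: attach predicted punctuation IDs to words.
# Words and predictions below are made up for illustration only.
label_map = {0: "", 1: ".", 2: "،", 3: "؟", 4: "-", 5: ":", 6: "؛"}

words = ["بۈگۈن", "ھاۋا", "ناھايىتى", "ياخشى"]  # "today the weather is very good"
preds = [0, 0, 0, 1]                             # period predicted after the last word

restored = " ".join(w + label_map[p] for w, p in zip(words, preds))
print(restored)  # بۈگۈن ھاۋا ناھايىتى ياخشى.
```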

## How to Use

This model requires specific inference logic to handle subword tokenization (merging pieces marked with the SentencePiece `▁` / `\u2581` prefix) and to correctly attach the predicted punctuation to the end of each full word.
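To make the word-reassembly step concrete, here is a small self-contained sketch of how SentencePiece-style pieces merge back into a word. The subword split shown is hypothetical, not necessarily what this model's tokenizer produces:

```python
# Hypothetical SentencePiece output: "▁" (U+2581) marks the start of a word.
pieces = ["\u2581ياخشى", "مۇ", "سىز"]

words = []
for piece in pieces:
    if piece.startswith("\u2581"):
        words.append(piece[1:])  # start a new word, dropping the "▁" marker
    else:
        words[-1] += piece       # continuation piece: merge into the last word

print(words)  # ['ياخشىمۇسىز']
```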
You can use the following Python script to run inference:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_id = "piyazon/Uyghur_ASR_Restore_Punctuation"
tokenizer = AutoTokenizer.from_pretrained(model_id, fix_mistral_regex=True)
model = AutoModelForTokenClassification.from_pretrained(model_id)
model.eval()

# Label mapping: output IDs to punctuation marks ("0" means no punctuation)
label_map = {
    0: "0",
    1: ".",
    2: "،",
    3: "؟",
    4: "-",
    5: ":",
    6: "؛",
}

def restore_punctuation(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    predictions = torch.argmax(logits, dim=2)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    result = ""
    current_word = ""
    current_label = "0"

    for i, token in enumerate(tokens):
        if token in tokenizer.all_special_tokens:
            continue

        # SentencePiece marks the start of a word with "▁" (U+2581)
        is_start_of_word = token.startswith("\u2581")

        if is_start_of_word:
            # 1. Flush the PREVIOUS word
            if current_word:
                result += current_word
                # Append punctuation if predicted
                if current_label != "0":
                    result += current_label
                result += " "

            # 2. Start a NEW word (drop the "▁" marker)
            current_word = token.replace("\u2581", "")

            # Reset the label to this token's prediction
            current_label = label_map.get(predictions[i], "0")
        else:
            # Continuation sub-token: merge it into the current word
            current_word += token

            # The label of the LAST sub-token is usually the valid one
            pred = label_map.get(predictions[i], "0")
            if pred != "0":
                current_label = pred

    # Flush the very last word
    if current_word:
        result += current_word
        if current_label != "0":
            result += current_label

    return result.strip()

# Example usage
# (Unpunctuated input: "How is the relationship between truth and fiction,
#  reality and imagination, space and time?")
text_input = "چىنلىق بىلەن توقۇلمىنىڭ رېئاللىق بىلەن تەسەۋۋۇرنىڭ ماكان بىلەن زاماننىڭ مۇناسىۋىتىنى قانداق بولىدۇ"

restored_text = restore_punctuation(text_input)
print(restored_text)
```
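Transformer token classifiers have a maximum input length, and long ASR transcripts can exceed it. One simple mitigation — a sketch, not part of this model's documented API (the chunk size and the per-chunk call are assumptions) — is to split the transcript into fixed-size word chunks and restore punctuation chunk by chunk:

```python
def chunk_text(text, max_words=150):
    """Split whitespace-separated text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# Hypothetical usage with the restore_punctuation function defined above:
# restored = " ".join(restore_punctuation(c) for c in chunk_text(long_transcript))
```

Chunking at arbitrary word boundaries can degrade predictions near the cut points; overlapping chunks are a common refinement.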