# Uyghur ASR Punctuation Restoration Model
This model is designed to restore punctuation to raw text, specifically targeting Uyghur language inputs. It is particularly useful for post-processing the output of Automatic Speech Recognition (ASR) systems, which typically generate text without punctuation.
## Model Description

- **Model ID:** `piyazon/Uyghur_ASR_Restore_Punctuation`
- **Task:** Token Classification (Punctuation Restoration)
- **Language:** Uyghur (`ug`)
The model predicts a punctuation label for each token in the sequence; each label maps to a punctuation mark that is appended to the end of the corresponding word.
### Label Map
The model output corresponds to the following punctuation marks:
| ID | Label | Description |
|---|---|---|
| 0 | 0 | No punctuation |
| 1 | . | Period |
| 2 | ، | Comma (Arabic-script comma used in Uyghur) |
| 3 | ؟ | Question mark (Arabic-script question mark used in Uyghur) |
| 4 | - | Hyphen |
| 5 | : | Colon |
| 6 | ؛ | Semicolon |
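To make the mapping concrete, here is a minimal sketch of how label IDs from the table attach punctuation to words. The `attach_labels` helper and the example words are hypothetical illustrations (not part of the model's API); note that for convenience this sketch maps ID 0 to an empty string rather than the literal label `"0"`.

```python
# Map label IDs (from the table above) to the punctuation they append.
# ID 0 is mapped to "" here so that "no punctuation" adds nothing.
label_map = {0: "", 1: ".", 2: "،", 3: "؟", 4: "-", 5: ":", 6: "؛"}

def attach_labels(words, label_ids):
    """Append each word's predicted punctuation mark; ID 0 adds nothing."""
    return " ".join(w + label_map[i] for w, i in zip(words, label_ids))

# Hypothetical per-word predictions: comma after the first word,
# question mark after the second.
print(attach_labels(["salam", "yaxshimusiz"], [2, 3]))  # salam، yaxshimusiz؟
```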
## How to Use
This model requires specific inference logic to handle subword tokenization (merging sub-tokens split at the SentencePiece `▁` marker, U+2581) and to correctly attach the predicted punctuation to the end of each full word.
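The merging step can be illustrated without loading the model. This sketch uses hand-written tokens (a hypothetical subword split, not real tokenizer output): SentencePiece marks the start of each word with `▁`, so any token without that prefix is a continuation of the previous word.

```python
# Hypothetical subword split: "▁" (U+2581) marks the start of a word,
# so unprefixed tokens are merged into the word before them.
tokens = ["▁sa", "lam", "▁yaxshi", "mu", "siz"]

words, current = [], ""
for tok in tokens:
    if tok.startswith("\u2581"):  # a new word begins
        if current:
            words.append(current)
        current = tok.lstrip("\u2581")
    else:                         # continuation of the current word
        current += tok
if current:
    words.append(current)

print(words)  # ['salam', 'yaxshimusiz']
```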
You can use the following Python script to run inference:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_id = "piyazon/Uyghur_ASR_Restore_Punctuation"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Label mapping
label_map = {
    0: "0",
    1: ".",
    2: "،",
    3: "؟",
    4: "-",
    5: ":",
    6: "؛"
}

def restore_punctuation(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    predictions = torch.argmax(logits, dim=2)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    result = ""
    current_word = ""
    current_label = "0"

    for i, token in enumerate(tokens):
        if token in tokenizer.all_special_tokens:
            continue

        # Check for the SentencePiece word-start marker
        is_start_of_word = token.startswith("\u2581")

        if is_start_of_word:
            # 1. Finish the PREVIOUS word
            if current_word:
                result += current_word
                # Add punctuation if predicted
                if current_label != "0":
                    result += current_label
                # Add a space
                result += " "
            # 2. Start the NEW word (remove the marker)
            current_word = token.replace("\u2581", "")
            # Reset the label to this new token's prediction
            current_label = label_map.get(predictions[i], "0")
        else:
            # It is a sub-part of the current word: merge it
            current_word += token
            # Update the label: the LAST sub-token's label is usually the valid one
            pred_id = predictions[i]
            if pred_id in label_map and label_map[pred_id] != "0":
                current_label = label_map[pred_id]

    # Process the very last word
    if current_word:
        result += current_word
        if current_label != "0":
            result += current_label

    return result.strip()

# Example usage. The unpunctuated input asks, roughly: "What is the
# relationship between truth and fiction, reality and imagination,
# space and time?"
text_input = """
چىنلىق بىلەن توقۇلمىنىڭ رېئاللىق بىلەن تەسەۋۋۇرنىڭ ماكان بىلەن زاماننىڭ مۇناسىۋىتىنى قانداق بولىدۇ
"""

restored_text = restore_punctuation(text_input)
print(restored_text)
```
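The script above runs the whole input through the model in one pass. Long ASR transcripts can exceed the model's maximum sequence length (check `tokenizer.model_max_length` for the actual limit), so a simple word-count chunker is a reasonable pre-processing step. The `chunk_words` helper and its `max_words` parameter are illustrative assumptions, not part of the model's API:

```python
# Split a long transcript into word-count-bounded chunks so each chunk
# stays safely under the model's maximum sequence length.
def chunk_words(text, max_words=100):
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

chunks = chunk_words("a b c d e", max_words=2)
print(chunks)  # ['a b', 'c d', 'e']
```

Each chunk can then be passed to `restore_punctuation` separately and the results joined with spaces.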