---
language:
  - ug
tags:
  - token-classification
  - punctuation-restoration
  - asr
  - nlp
license: apache-2.0
metrics:
  - accuracy
model-index:
  - name: Uyghur_ASR_Restore_Punctuation
    results: []
---

Uyghur ASR Punctuation Restoration Model

This model is designed to restore punctuation to raw text, specifically targeting Uyghur language inputs. It is particularly useful for post-processing the output of Automatic Speech Recognition (ASR) systems, which typically generate text without punctuation.

Model Description

  • Model ID: piyazon/Uyghur_ASR_Restore_Punctuation
  • Task: Token Classification (Punctuation Restoration)
  • Language: Uyghur (ug)

The model predicts a punctuation label for each token in the sequence; the predicted mark is then appended to the end of the corresponding word.

Label Map

The model output corresponds to the following punctuation marks:

ID  Label  Description
0   0      No punctuation
1   .      Period
2   ،      Comma (Uyghur)
3   ؟      Question mark (Uyghur)
4   -      Hyphen
5   :      Colon
6   ؛      Semicolon (Uyghur)
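The mapping above can be applied mechanically once you have one label ID per word. A minimal sketch (the word/label pairs here are illustrative, not real model output):

```python
# Minimal sketch of applying the label map: append each word's
# predicted punctuation mark. Label 0 means "no punctuation".
LABEL_MAP = {0: "", 1: ".", 2: "،", 3: "؟", 4: "-", 5: ":", 6: "؛"}

def attach_punctuation(words, label_ids):
    """Join words, appending the predicted mark to each one."""
    return " ".join(w + LABEL_MAP.get(l, "") for w, l in zip(words, label_ids))

print(attach_punctuation(["salam", "dunya"], [2, 1]))  # salam، dunya.
```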

How to Use

This model requires specific inference logic to handle subword tokenization (merging tokens split at the SentencePiece word-boundary marker ▁, U+2581) and to correctly attach the predicted punctuation to the end of full words.
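The word-merging step can be sketched independently of the tokenizer: a token beginning with ▁ starts a new word, and any other token continues the current one. (The token strings below are illustrative, not actual tokenizer output.)

```python
# Sketch of SentencePiece-style word reassembly. Tokens beginning
# with the U+2581 marker start a new word; others are appended to
# the word currently being built.
WORD_BOUNDARY = "\u2581"  # ▁

def merge_subwords(tokens):
    words = []
    for tok in tokens:
        if tok.startswith(WORD_BOUNDARY):
            words.append(tok[len(WORD_BOUNDARY):])
        elif words:
            words[-1] += tok
        else:
            words.append(tok)  # sequence started mid-word
    return words

print(merge_subwords(["\u2581sal", "am", "\u2581dunya"]))  # ['salam', 'dunya']
```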

You can use the following Python script to run inference:

from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_id = "piyazon/Uyghur_ASR_Restore_Punctuation"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Label mapping
label_map = {
    0: "0",
    1: ".",   
    2: "،",   
    3: "؟",   
    4: "-",   
    5: ":",   
    6: "؛"    
}

def restore_punctuation(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    
    predictions = torch.argmax(logits, dim=2)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    
    result = ""
    current_word = ""
    current_label = "0"
    
    for i, token in enumerate(tokens):
        if token in tokenizer.all_special_tokens:
            continue
            
        # Check for SentencePiece/Unigram underscore
        is_start_of_word = token.startswith("\u2581")
        
        if is_start_of_word:
            # 1. Finish the PREVIOUS word
            if current_word:
                result += current_word
                # Add punctuation if predicted
                if current_label != "0":
                    result += current_label
                # Add a space
                result += " " 
            
            # 2. Start NEW word (remove the underscore)
            current_word = token.replace("\u2581", "")
            
            # Reset label to the prediction of this new token
            pred_id = predictions[i]
            current_label = label_map.get(pred_id, "0")
                
        else:
            # It is a sub-part of the word (merge it)
            current_word += token
            
            # Update label: The label of the LAST sub-token is usually the valid one
            pred_id = predictions[i]
            if pred_id in label_map and label_map[pred_id] != "0":
                current_label = label_map[pred_id]
    
    # Process the very last word
    if current_word:
        result += current_word
        if current_label != "0":
            result += current_label
            
    return result.strip()

# Example Usage
text_input = """
چىنلىق بىلەن توقۇلمىنىڭ رېئاللىق بىلەن تەسەۋۋۇرنىڭ ماكان بىلەن زاماننىڭ مۇناسىۋىتىنى قانداق بولىدۇ
""" 
# (Input, roughly: "What is the relationship between truth and fiction,
#  reality and imagination, space and time" — typed without punctuation)

restored_text = restore_punctuation(text_input)
print(restored_text)
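Long ASR transcripts can exceed the model's maximum sequence length, so you may want to split the input into word-level chunks before calling restore_punctuation. A simple sketch (the chunk size of 100 words is an arbitrary assumption; tune it to stay under the model's limit):

```python
# Sketch: split a long transcript into fixed-size word chunks before
# punctuation restoration. max_words=100 is an assumed default, not a
# value from the model card.
def chunk_words(text, max_words=100):
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

print(chunk_words("a b c d e", max_words=2))  # ['a b', 'c d', 'e']
```

Each chunk can then be passed through restore_punctuation and the results concatenated.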