# Uyghur ASR Punctuation Restoration Model
This model is designed to restore punctuation to raw text, specifically targeting Uyghur language inputs. It is particularly useful for post-processing the output of Automatic Speech Recognition (ASR) systems, which typically generate text without punctuation.
## Model Description

- **Model ID:** `piyazon/Uyghur_ASR_Restore_Punctuation`
- **Task:** Token Classification (Punctuation Restoration)
- **Language:** Uyghur (`ug`)
The model predicts a punctuation label for each token in the sequence; each label maps to a punctuation mark that is appended to the end of the corresponding word.
### Label Map
The model output corresponds to the following punctuation marks:
| ID | Label | Description |
|---|---|---|
| 0 | 0 | No punctuation |
| 1 | . | Period |
| 2 | ، | Comma (Arabic-script comma used in Uyghur) |
| 3 | ؟ | Question mark (Arabic-script question mark used in Uyghur) |
| 4 | - | Hyphen |
| 5 | : | Colon |
| 6 | ؛ | Semicolon |
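To make the mapping concrete, here is a minimal sketch of how label IDs from the table attach punctuation to words. The `attach_labels` helper and the example words are hypothetical illustrations (not part of the model's API); note that for convenience this sketch maps ID 0 to an empty string rather than the literal label `"0"`.

```python
# Map label IDs (from the table above) to the punctuation they append.
# ID 0 is mapped to "" here so that "no punctuation" adds nothing.
label_map = {0: "", 1: ".", 2: "،", 3: "؟", 4: "-", 5: ":", 6: "؛"}

def attach_labels(words, label_ids):
    """Append each word's predicted punctuation mark; ID 0 adds nothing."""
    return " ".join(w + label_map[i] for w, i in zip(words, label_ids))

# Hypothetical per-word predictions: comma after the first word,
# question mark after the second.
print(attach_labels(["salam", "yaxshimusiz"], [2, 3]))  # salam، yaxshimusiz؟
```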
## How to Use
This model requires specific inference logic to handle subword tokenization (merging sub-tokens split at the SentencePiece `▁` marker, U+2581) and to correctly attach the predicted punctuation to the end of each full word.
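The merging step can be illustrated without loading the model. This sketch uses hand-written tokens (a hypothetical subword split, not real tokenizer output): SentencePiece marks the start of each word with `▁`, so any token without that prefix is a continuation of the previous word.

```python
# Hypothetical subword split: "▁" (U+2581) marks the start of a word,
# so unprefixed tokens are merged into the word before them.
tokens = ["▁sa", "lam", "▁yaxshi", "mu", "siz"]

words, current = [], ""
for tok in tokens:
    if tok.startswith("\u2581"):  # a new word begins
        if current:
            words.append(current)
        current = tok.lstrip("\u2581")
    else:                         # continuation of the current word
        current += tok
if current:
    words.append(current)

print(words)  # ['salam', 'yaxshimusiz']
```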
You can use the following Python script to run inference:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_id = "piyazon/Uyghur_ASR_Restore_Punctuation"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Label mapping
label_map = {
    0: "0",
    1: ".",
    2: "،",
    3: "؟",
    4: "-",
    5: ":",
    6: "؛"
}

def restore_punctuation(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    predictions = torch.argmax(logits, dim=2)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    result = ""
    current_word = ""
    current_label = "0"

    for i, token in enumerate(tokens):
        if token in tokenizer.all_special_tokens:
            continue

        # Check for the SentencePiece word-start marker
        is_start_of_word = token.startswith("\u2581")

        if is_start_of_word:
            # 1. Finish the PREVIOUS word
            if current_word:
                result += current_word
                # Add punctuation if predicted
                if current_label != "0":
                    result += current_label
                # Add a space
                result += " "
            # 2. Start the NEW word (remove the marker)
            current_word = token.replace("\u2581", "")
            # Reset the label to this new token's prediction
            current_label = label_map.get(predictions[i], "0")
        else:
            # It is a sub-part of the current word: merge it
            current_word += token
            # Update the label: the LAST sub-token's label is usually the valid one
            pred_id = predictions[i]
            if pred_id in label_map and label_map[pred_id] != "0":
                current_label = label_map[pred_id]

    # Process the very last word
    if current_word:
        result += current_word
        if current_label != "0":
            result += current_label

    return result.strip()

# Example usage. The unpunctuated input asks, roughly: "What is the
# relationship between truth and fiction, reality and imagination,
# space and time?"
text_input = """
چىنلىق بىلەن توقۇلمىنىڭ رېئاللىق بىلەن تەسەۋۋۇرنىڭ ماكان بىلەن زاماننىڭ مۇناسىۋىتىنى قانداق بولىدۇ
"""

restored_text = restore_punctuation(text_input)
print(restored_text)
```
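The script above runs the whole input through the model in one pass. Long ASR transcripts can exceed the model's maximum sequence length (check `tokenizer.model_max_length` for the actual limit), so a simple word-count chunker is a reasonable pre-processing step. The `chunk_words` helper and its `max_words` parameter are illustrative assumptions, not part of the model's API:

```python
# Split a long transcript into word-count-bounded chunks so each chunk
# stays safely under the model's maximum sequence length.
def chunk_words(text, max_words=100):
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

chunks = chunk_words("a b c d e", max_words=2)
print(chunks)  # ['a b', 'c d', 'e']
```

Each chunk can then be passed to `restore_punctuation` separately and the results joined with spaces.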