---
language:
- ug
tags:
- token-classification
- punctuation-restoration
- asr
- nlp
license: apache-2.0
metrics:
- accuracy
model-index:
- name: Uyghur_ASR_Restore_Punctuation
  results: []
---

# Uyghur ASR Punctuation Restoration Model

This model is designed to restore punctuation to raw text, specifically targeting **Uyghur** language inputs. It is particularly useful for post-processing the output of Automatic Speech Recognition (ASR) systems, which typically generate text without punctuation.

## Model Description

- **Model ID:** `piyazon/Uyghur_ASR_Restore_Punctuation`
- **Task:** Token Classification (Punctuation Restoration)
- **Language:** Uyghur (ug)

The model predicts a punctuation mark for each token in the sequence, then appends the predicted mark to the end of the corresponding word using the label mapping below.

### Label Map

The model's output IDs correspond to the following punctuation marks:
| ID | Label | Description |
|:--:|:-----:|:------------|
| 0 | `0` | No punctuation |
| 1 | `.` | Period |
| 2 | `،` | Comma (Uyghur comma) |
| 3 | `؟` | Question mark (Uyghur question mark) |
| 4 | `-` | Hyphen |
| 5 | `:` | Colon |
| 6 | `؛` | Semicolon (Uyghur semicolon) |
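To illustrate how these labels attach to words, here is a minimal word-level sketch. The word list and predicted IDs are invented for the example (the real model predicts per subword token, as the inference script shows), and label `0` is rendered as an empty string for readability:

```python
# Hypothetical example: attach predicted punctuation IDs to words.
# Words and predictions below are made up for illustration only.
label_map = {0: "", 1: ".", 2: "،", 3: "؟", 4: "-", 5: ":", 6: "؛"}

words = ["بۈگۈن", "ھاۋا", "ناھايىتى", "ياخشى"]  # "today the weather is very good"
preds = [0, 0, 0, 1]                             # period predicted after the last word

restored = " ".join(w + label_map[p] for w, p in zip(words, preds))
print(restored)  # بۈگۈن ھاۋا ناھايىتى ياخشى.
```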

## How to Use

This model requires specific inference logic to handle subword tokenization (merging pieces marked with the SentencePiece `▁` / `\u2581` prefix) and to correctly attach the predicted punctuation to the end of each full word.
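To make the word-reassembly step concrete, here is a small self-contained sketch of how SentencePiece-style pieces merge back into a word. The subword split shown is hypothetical, not necessarily what this model's tokenizer produces:

```python
# Hypothetical SentencePiece output: "▁" (U+2581) marks the start of a word.
pieces = ["\u2581ياخشى", "مۇ", "سىز"]

words = []
for piece in pieces:
    if piece.startswith("\u2581"):
        words.append(piece[1:])  # start a new word, dropping the "▁" marker
    else:
        words[-1] += piece       # continuation piece: merge into the last word

print(words)  # ['ياخشىمۇسىز']
```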
You can use the following Python script to run inference:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model_id = "piyazon/Uyghur_ASR_Restore_Punctuation"
tokenizer = AutoTokenizer.from_pretrained(model_id, fix_mistral_regex=True)
model = AutoModelForTokenClassification.from_pretrained(model_id)
model.eval()

# Label mapping: output IDs to punctuation marks ("0" means no punctuation)
label_map = {
    0: "0",
    1: ".",
    2: "،",
    3: "؟",
    4: "-",
    5: ":",
    6: "؛",
}

def restore_punctuation(text):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits

    predictions = torch.argmax(logits, dim=2)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

    result = ""
    current_word = ""
    current_label = "0"

    for i, token in enumerate(tokens):
        if token in tokenizer.all_special_tokens:
            continue

        # SentencePiece marks the start of a word with "▁" (U+2581)
        is_start_of_word = token.startswith("\u2581")

        if is_start_of_word:
            # 1. Flush the PREVIOUS word
            if current_word:
                result += current_word
                # Append punctuation if predicted
                if current_label != "0":
                    result += current_label
                result += " "

            # 2. Start a NEW word (drop the "▁" marker)
            current_word = token.replace("\u2581", "")

            # Reset the label to this token's prediction
            current_label = label_map.get(predictions[i], "0")
        else:
            # Continuation sub-token: merge it into the current word
            current_word += token

            # The label of the LAST sub-token is usually the valid one
            pred = label_map.get(predictions[i], "0")
            if pred != "0":
                current_label = pred

    # Flush the very last word
    if current_word:
        result += current_word
        if current_label != "0":
            result += current_label

    return result.strip()

# Example usage
# (Unpunctuated input: "How is the relationship between truth and fiction,
#  reality and imagination, space and time?")
text_input = "چىنلىق بىلەن توقۇلمىنىڭ رېئاللىق بىلەن تەسەۋۋۇرنىڭ ماكان بىلەن زاماننىڭ مۇناسىۋىتىنى قانداق بولىدۇ"

restored_text = restore_punctuation(text_input)
print(restored_text)
```
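Transformer token classifiers have a maximum input length, and long ASR transcripts can exceed it. One simple mitigation — a sketch, not part of this model's documented API (the chunk size and the per-chunk call are assumptions) — is to split the transcript into fixed-size word chunks and restore punctuation chunk by chunk:

```python
def chunk_text(text, max_words=150):
    """Split whitespace-separated text into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# Hypothetical usage with the restore_punctuation function defined above:
# restored = " ".join(restore_punctuation(c) for c in chunk_text(long_transcript))
```

Chunking at arbitrary word boundaries can degrade predictions near the cut points; overlapping chunks are a common refinement.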