license: mit
language:
  - vi
base_model:
  - google/byt5-small
pipeline_tag: text2text-generation
library_name: transformers

ByT5-Vi-Normalization

Model Description

Model Type: ByT5 (byte-level variant of T5, the Text-to-Text Transfer Transformer)
Base Model: google/byt5-small
Task: Text Normalization
Language: Vietnamese
License: MIT

ByT5-Vi-Normalization is a fine-tuned version of Google's ByT5-small model for Vietnamese text normalization. It rewrites written forms such as time expressions, dates, locations, numbers, abbreviations, and youth slang into their fully spelled-out equivalents, taking the surrounding context into account. It is particularly useful as a preprocessing step in Vietnamese NLP pipelines and speech applications.

Training Data

Number of sentence pairs: 73,000
Data source: Automatically generated with Gemini (Google). As a result, some sentence pairs may be inconsistent or noisy, and the model's outputs may reflect these inconsistencies in edge cases.
Content: Each pair consists of an unnormalized and a normalized Vietnamese sentence. The dataset covers various real-life scenarios, including informal language, numbers, dates, locations, abbreviations, slang, and contextual expressions.
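
Because the pairs were generated automatically, a simple sanity filter can catch some of the noise mentioned above. The following is a minimal sketch, not part of the released training pipeline: the `SentencePair` container and the `looks_clean` heuristic are hypothetical, and the rule "a fully normalized sentence contains no digits" is an assumption that may not hold for every domain. The second pair is deliberately constructed to be noisy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """Hypothetical container for one training example."""
    raw: str         # unnormalized sentence
    normalized: str  # fully spelled-out sentence


def looks_clean(pair: SentencePair) -> bool:
    """Heuristic (assumption): a normalized sentence is non-empty and has no digits."""
    text = pair.normalized.strip()
    return bool(text) and not any(ch.isdigit() for ch in text)


pairs = [
    SentencePair(
        raw="Giá táo chỉ còn 20.000 đồng/kg.",
        normalized="Giá táo chỉ còn hai mươi nghìn đồng một ki lô gam.",
    ),
    # Deliberately noisy pair: the number was left untouched.
    SentencePair(
        raw="Giá cam là 15.000 đồng/kg.",
        normalized="Giá cam là 15.000 đồng một ki lô gam.",
    ),
]

clean = [p for p in pairs if looks_clean(p)]
print(len(clean), "of", len(pairs), "pairs kept")  # 1 of 2 pairs kept
```

The same check can be applied to model outputs at inference time to flag sentences that may need a fallback.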

Applications

Text-to-speech (TTS) and speech synthesis systems
Preprocessing for NLP pipelines (chatbots, virtual assistants, etc.)
Data normalization for downstream tasks such as ASR, NLU, and more

Performance

Test hardware: NVIDIA RTX 4090
Library: Hugging Face Transformers
Average inference time: ~1 second per request (sentence) on GPU
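
The latency figure above depends on hardware, sentence length, and decoding settings, so it is worth re-measuring on your own setup. The helper below is a rough sketch for doing that; `average_latency` and `fake_normalize` are hypothetical names, and the stand-in function only sleeps so that the example runs without a model.

```python
import time
from typing import Callable, Sequence


def average_latency(fn: Callable[[str], str], inputs: Sequence[str], warmup: int = 1) -> float:
    """Return the mean wall-clock time in seconds per call of fn over inputs."""
    if not inputs:
        raise ValueError("inputs must not be empty")
    # Warm-up calls are excluded so one-time setup cost does not skew the mean.
    for text in inputs[:warmup]:
        fn(text)
    start = time.perf_counter()
    for text in inputs:
        fn(text)
    return (time.perf_counter() - start) / len(inputs)


# Stand-in for a real normalization call; replace it with a function that
# runs generation and returns the decoded string.
def fake_normalize(text: str) -> str:
    time.sleep(0.01)
    return text


seconds = average_latency(fake_normalize, ["Normalize: ngày 1/1/2024"] * 5)
print(f"{seconds:.3f} s per sentence")
```

When timing on a GPU, have the measured function return the decoded text rather than raw tensors, so that the measurement includes all GPU work.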

Usage

Hugging Face

You can use the model with Hugging Face Transformers as follows:

import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

# Load model and tokenizer
model_dir = "nmcuong/ByT5-Vi-Normalization"
model = T5ForConditionalGeneration.from_pretrained(model_dir)
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Move the model to GPU if available and cast it to bfloat16
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device=device, dtype=torch.bfloat16)
model.eval()

# Example usage (note the "Normalize: " task prefix)
input_text = "Normalize: Theo thông tư số 01/2023/TT-BTC, từ ngày 1/1/2024, Việt Nam sẽ áp dụng thuế giá trị gia tăng (VAT) mới cho các mặt hàng tiêu dùng."
inputs = tokenizer(input_text, return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_length=768, num_beams=2)
decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Result:", decoded)
# Result: Theo thông tư số không một, năm hai nghìn không trăm hai mươi ba, Thông tư của Bộ Tài chính, từ ngày một tháng một, năm hai nghìn không trăm hai mươi tư, Việt Nam sẽ áp dụng thuế giá trị gia tăng mới cho các mặt hàng tiêu dùng.
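
The example above prepends the "Normalize: " prefix and caps generation at 768 tokens. ByT5 operates directly on UTF-8 bytes (one token per byte, plus an end-of-sequence token), so Vietnamese characters with diacritics cost two or three tokens each. The sketch below estimates the token count of an input before it is sent to the model. The helper names are hypothetical, and reusing 768 as an input budget is an assumption, not a documented limit.

```python
PREFIX = "Normalize: "   # task prefix used in the usage examples
MAX_TOKENS = 768         # mirrors max_length in the example above (assumed budget)


def build_input(text: str) -> str:
    """Prepend the task prefix unless it is already present."""
    text = text.strip()
    return text if text.startswith(PREFIX) else PREFIX + text


def byt5_token_count(text: str) -> int:
    """ByT5 uses one token per UTF-8 byte, plus one end-of-sequence token."""
    return len(text.encode("utf-8")) + 1


def fits(text: str, limit: int = MAX_TOKENS) -> bool:
    """Check whether the prefixed input stays within the token budget."""
    return byt5_token_count(build_input(text)) <= limit


sample = "Giá xăng tăng lên 25.000 đồng/lít"
prefixed = build_input(sample)
print(prefixed)
print(byt5_token_count(prefixed), "tokens for", len(prefixed), "characters")
```

Normalized output is usually longer than the input because digits and abbreviations are spelled out, so an input that fits the budget can still produce output that is truncated. Splitting long passages into sentences before normalization is one way to stay within the limit.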

CTranslate2

You can also run the model with CTranslate2 for faster inference. First, convert the checkpoint:

ct2-transformers-converter --model nmcuong/ByT5-Vi-Normalization --output_dir ByT5-Vi-Normalization-CT2

Then, you can load the model in Python:

import ctranslate2
import transformers

# API reference: https://opennmt.net/CTranslate2/python/ctranslate2.Translator.html
translator = ctranslate2.Translator(
    "ByT5-Vi-Normalization-CT2",
    device="cuda",
    device_index=0,
    compute_type="bfloat16",
)
tokenizer = transformers.AutoTokenizer.from_pretrained("nmcuong/ByT5-Vi-Normalization")

input_text = "Normalize: Hôm nay là ngày 15/07/2025. Giá xăng tăng lên 25.000 đồng/lít"

# CTranslate2 expects token strings rather than token IDs
input_tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(input_text))

results = translator.translate_batch([input_tokens], max_decoding_length=768)

# Take the best hypothesis and convert its tokens back to text
output_tokens = results[0].hypotheses[0]
output_text = tokenizer.decode(tokenizer.convert_tokens_to_ids(output_tokens))

print(output_text)
# Result: Hôm nay là ngày mười lăm tháng bảy năm hai nghìn không trăm hai mươi lăm. Giá xăng tăng lên hai mươi lăm nghìn đồng một lít.

Note: The content of the example sentences above is for testing purposes only and does not reflect real news.
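
The CTranslate2 snippet above normalizes one sentence, but `translate_batch` accepts a list of token sequences, so several sentences can be handled in a single call. The wrapper below is a sketch under that assumption. `normalize_batch` is a hypothetical helper, and it expects a `ctranslate2.Translator` and the matching Hugging Face tokenizer created as shown above.

```python
from typing import Any, List


def normalize_batch(
    translator: Any,
    tokenizer: Any,
    texts: List[str],
    max_decoding_length: int = 768,
) -> List[str]:
    """Normalize several sentences with one translate_batch call."""
    prefix = "Normalize: "
    # CTranslate2 expects token strings, so convert IDs back to tokens.
    sources = [
        tokenizer.convert_ids_to_tokens(tokenizer.encode(prefix + text))
        for text in texts
    ]
    results = translator.translate_batch(
        sources, max_decoding_length=max_decoding_length
    )
    # Keep the best hypothesis of each result and decode it to text.
    return [
        tokenizer.decode(tokenizer.convert_tokens_to_ids(result.hypotheses[0]))
        for result in results
    ]


# Usage, with `translator` and `tokenizer` from the snippet above:
# outputs = normalize_batch(translator, tokenizer, ["Giá xăng tăng lên 25.000 đồng/lít"])
```

Batching mostly helps throughput; per-sentence latency still depends on the longest sequence in the batch.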

Example Inputs & Outputs

Example 1:
- Input: Normalize: ngày 19/05/1890 là ngày sinh của Chủ tịch Hồ Chí Minh, người đã lãnh đạo Việt Nam trong cuộc kháng chiến chống Pháp và Mỹ.
- Output: ngày mười chín tháng năm năm một nghìn tám trăm chín mươi là ngày sinh của Chủ tịch Hồ Chí Minh, người đã lãnh đạo Việt Nam trong cuộc kháng chiến chống Pháp và Mỹ.
Example 2:
- Input: D/c: thôn An Phú, xã An Thượng, huyện Hoài Đức, Hà Nội. Số nhà 123/45. Đây là nơi sản xuất chính của nông trại.
- Output: Địa chỉ: thôn An Phú, xã An Thượng, huyện Hoài Đức, Hà Nội. Số nhà một trăm hai mươi ba, trên bốn mươi lăm. Đây là nơi sản xuất chính của nông trại.
Example 3:
- Input: Thế kỷ XXI chứng kiến sự bùng nổ của CNTT và trí tuệ nhân tạo, làm thay đổi sâu sắc mọi mặt của đời sống xã hội. Những thách thức toàn cầu như biến đổi khí hậu cũng trở nên cấp bách hơn bao giờ hết trong giai đoạn này
- Output: Thế kỷ hai mươi mốt chứng kiến sự bùng nổ của công nghệ thông tin và trí tuệ nhân tạo, làm thay đổi sâu sắc mọi mặt của đời sống xã hội. Những thách thức toàn cầu như biến đổi khí hậu cũng trở nên cấp bách hơn bao giờ hết trong giai đoạn này.
Example 4:
- Input: Hoa quả năm nay rẻ hơn năm ngoái, đặc biệt là táo và cam. Giá táo chỉ còn 20.000 đồng/kg, trong khi cam là 15.000 đồng/kg.
- Output: Hoa quả năm nay rẻ hơn năm ngoái, đặc biệt là táo và cam. Giá táo chỉ còn hai mươi nghìn đồng một ki lô gam, trong khi cam là mười lăm nghìn đồng một ki lô gam.

License

MIT License

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.