mt5_correct_puntuation

This model is a Mandarin punctuation corrector obtained by fine-tuning the pretrained google/mt5-base model on a Mandarin Wikipedia corpus. Its current accuracy is 0.794.

Datasets

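The model is fine-tuned on publicly available Chinese Wikipedia data. The retrieved text is split at 「。」 or 「,」 into sentences of no more than 100 characters. Because commas and periods are overwhelmingly more frequent than other marks, only sentences containing a colon, semicolon, exclamation mark, or question mark are kept, to balance the dataset as far as possible; these serve as the correct sentences. Incorrect sentences are produced by randomly replacing the marks 「,。:;、!?」 in each correct sentence with other marks from 「,。:;、!?」. The training set contains 291,112 sentences in total.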
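The preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the author's actual script: the function names are hypothetical, the marks are assumed to be the full-width forms, clauses are assumed to be packed greedily up to the 100-character limit, and the number of marks replaced per sentence is an arbitrary choice, since the description does not state a replacement rate.

```python
import random
import re

PUNCT = ",。:;、!?"   # marks that may be swapped (full-width forms assumed)
RARE = set(":;!?")    # a sentence must contain at least one of these to be kept


def split_sentences(text, max_len=100):
    """Cut after every 「。」 or 「,」, then pack clauses greedily up to max_len."""
    clauses = re.findall(r"[^。,]*[。,]|[^。,]+", text)
    sentences, current = [], ""
    for clause in clauses:
        if current and len(current) + len(clause) > max_len:
            sentences.append(current)
            current = ""
        current += clause
    if current:
        sentences.append(current)
    # A single clause longer than max_len cannot be split further, so drop it.
    return [s for s in sentences if len(s) <= max_len]


def has_rare_mark(sentence):
    """True if the sentence contains a colon, semicolon, '!' or '?'."""
    return any(ch in RARE for ch in sentence)


def corrupt(sentence, rng):
    """Replace a random non-empty subset of marks with a different mark (assumed policy)."""
    positions = [i for i, ch in enumerate(sentence) if ch in PUNCT]
    if not positions:
        return sentence
    chars = list(sentence)
    for i in rng.sample(positions, rng.randint(1, len(positions))):
        chars[i] = rng.choice([m for m in PUNCT if m != chars[i]])
    return "".join(chars)


def build_pairs(text, seed=42, max_len=100):
    """Return (incorrect, correct) sentence pairs for seq2seq fine-tuning."""
    rng = random.Random(seed)
    correct = [s for s in split_sentences(text, max_len) if has_rare_mark(s)]
    return [(corrupt(s, rng), s) for s in correct]
```

Calling `build_pairs(article_text)` on one article's text yields pairs whose first element would be the model input and whose second would be the target.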

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-05
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • num_epochs: 1
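
To give a feel for what these settings imply, the sketch below works out the number of optimizer steps and the learning rate under a linear schedule. It assumes that all 291,112 sentences are used for training on a single device, with no gradient accumulation and no warmup; none of this is stated in the list above, and the helper names are hypothetical.

```python
import math

NUM_SENTENCES = 291_112  # training sentences reported in the Datasets section
BATCH_SIZE = 4           # train_batch_size
EPOCHS = 1               # num_epochs
BASE_LR = 5e-5           # learning_rate


def total_steps(n_examples, batch_size, epochs):
    """Optimizer steps, assuming one device and no gradient accumulation."""
    return math.ceil(n_examples / batch_size) * epochs


def linear_lr(step, total, base_lr=BASE_LR):
    """Linear decay from base_lr at step 0 to zero at the last step (no warmup assumed)."""
    return base_lr * max(0.0, (total - step) / total)


steps = total_steps(NUM_SENTENCES, BATCH_SIZE, EPOCHS)
for step in (0, steps // 2, steps):
    print(step, linear_lr(step, steps))
```

Under these assumptions the single epoch amounts to 72,778 steps, with the learning rate halved at the midpoint and reaching zero at the end.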

Framework versions

  • Transformers 4.20.1
  • PyTorch 1.12.0+cu113
  • Datasets 2.3.2
  • Tokenizers 0.12.1