chinese-ambiguous-chars-model
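A model for resolving ambiguous characters in Traditional-to-Simplified Chinese conversion, fine-tuned from hfl/chinese-roberta-wwm-ext. It handles one-to-many cases (one Traditional character mapping to several Simplified characters) that OpenCC cannot resolve correctly.
Background
In Traditional-to-Simplified conversion, a single Traditional character can correspond to several Simplified characters (see the list of one-to-many Traditional-Simplified mappings). Dictionary-based tools such as OpenCC cannot use context to choose the right character in these ambiguous cases.
Typical examples: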
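| Traditional | Context | Correct Simplified |
|---|---|---|
| 著 | 著名 (famous), 著作 (written work) | 著 |
| 著 | 走著 (walking), 看著 (watching) | 着 |
| 乾 | 乾隆 (Qianlong), 乾坤 (heaven and earth) | 乾 |
| 乾 | 乾燥 (dry), 乾爹 (godfather) | 干 |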
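The model uses masked language modeling (MLM): the ambiguous character is masked, and the model predicts from the surrounding context which Simplified character it should be converted to.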
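A minimal sketch of this masked-prediction step. It assumes the checkpoint loads with the Transformers `fill-mask` pipeline and that restricting predictions to the candidate characters through the pipeline's `targets` argument matches the intended usage; this card confirms neither. The helper names `mask_at`, `pick_best`, and `predict_simplified` are illustrative and not part of the model's repository.

```python
from typing import Dict, Sequence

MODEL_ID = "tardigrade-doc/chinese-ambiguous-chars-model"


def mask_at(text: str, index: int, mask_token: str = "[MASK]") -> str:
    """Replace the single character at `index` with the mask token."""
    if not 0 <= index < len(text):
        raise IndexError("index out of range")
    return text[:index] + mask_token + text[index + 1:]


def pick_best(scores: Dict[str, float], candidates: Sequence[str]) -> str:
    """Return the highest-scoring candidate; candidates without a score count as 0."""
    return max(candidates, key=lambda c: scores.get(c, 0.0))


def predict_simplified(text: str, index: int, candidates: Sequence[str]) -> str:
    """Score only `candidates` at the masked position (downloads the model)."""
    # Imported lazily so the pure helpers above work without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=MODEL_ID)
    masked = mask_at(text, index, fill.tokenizer.mask_token)
    # `targets` limits the returned predictions to the given tokens.
    results = fill(masked, targets=list(candidates))
    scores = {r["token_str"]: r["score"] for r in results}
    return pick_best(scores, candidates)
```

For example, `predict_simplified("随著时间的推移", 1, ["著", "着"])` masks the second character and compares only the two candidates.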
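Use case
Recommended usage: initial conversion with OpenCC → second-pass correction with this model
Traditional source: 隨著時間的推移
↓ OpenCC t2s
随著时间的推移 ← 著 is not converted correctly
↓ this model
随着时间的推移 ✅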
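The two-pass flow can be sketched as follows. This is an illustration under stated assumptions, not the author's implementation: the OpenCC step is assumed to have run already and to have produced a character-for-character conversion (same length as the source), and the `choose` callback is a hypothetical stand-in for the model call. The candidate map is copied from the training table in this card.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Traditional character -> Simplified candidates (the 7 groups in this card).
CANDIDATES: Dict[str, Tuple[str, ...]] = {
    "著": ("著", "着"),
    "畫": ("画", "划"),
    "覆": ("覆", "复"),
    "鍊": ("炼", "链"),
    "乾": ("乾", "干"),
    "帳": ("帐", "账"),
    "藉": ("藉", "借"),
}

# Hypothetical signature: (converted_text, index, candidates) -> chosen character.
Chooser = Callable[[str, int, Sequence[str]], str]


def find_ambiguous(traditional: str) -> List[Tuple[int, Tuple[str, ...]]]:
    """List (index, candidates) for every ambiguous character in the source."""
    return [(i, CANDIDATES[ch]) for i, ch in enumerate(traditional) if ch in CANDIDATES]


def second_pass(traditional: str, converted: str, choose: Chooser) -> str:
    """Overwrite ambiguous positions in the OpenCC output with the chooser's pick."""
    if len(traditional) != len(converted):
        raise ValueError("expected a character-for-character conversion")
    chars = list(converted)
    for index, candidates in find_ambiguous(traditional):
        # The chooser always sees the unmodified OpenCC output as context.
        chars[index] = choose(converted, index, candidates)
    return "".join(chars)
```

Positions are located in the Traditional source rather than in the OpenCC output, because the output alone does not show which characters came from an ambiguous Traditional character.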
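Training data
Filtered from the following data sources:
Training character pairs and data volume
Starting from the list of one-to-many Traditional-Simplified mappings, the pairs were filtered according to the data distribution, leaving the following 7 groups of ambiguous characters: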
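| Traditional | Simplified | Samples | Notes |
|---|---|---|---|
| 著 | 著 | 249,116 | 著名 (famous), 著作 (written work), etc. |
| 著 | 着 | 108,632 | 走着 (walking), 看着 (watching), etc.; aspect particle |
| 畫 | 画 | 155,935 | 绘画 (painting), 画面 (picture), etc. |
| 畫 | 划 | 180,328 | 划分 (divide), 规划 (plan), etc. |
| 覆 | 覆 | 23,468 | 覆盖 (cover), 倾覆 (overturn), etc. |
| 覆 | 复 | 183,362 | 答复 (answer), 回复 (reply), etc. |
| 鍊 | 炼 | 11,019 | 炼铁 (iron smelting), 锻炼 (exercise), etc. |
| 鍊 | 链 | 16,014 | 链条 (chain), 链接 (link), etc. |
| 乾 | 乾 | 20,686 | 乾隆 (Qianlong), 乾坤 (heaven and earth), etc.; proper nouns |
| 乾 | 干 | 105,431 | 干燥 (dry), 干净 (clean), etc. |
| 帳 | 帐 | 9,361 | 帐篷 (tent), 蚊帐 (mosquito net), etc. |
| 帳 | 账 | 3,998 | 账目 (accounts), 账单 (bill), etc. |
| 藉 | 藉 | 4,618 | 慰藉 (solace), 狼藉 (in disarray), etc. |
| 藉 | 借 | 5,000 | 借口 (excuse), 凭借 (rely on), etc. (truncated to 5,000) |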
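As a rough reference point for the class balance in this table, the sketch below computes the accuracy of a trivial baseline that always picks the more frequent Simplified character in each group. It assumes the counts describe the labeled samples used for training and evaluation, which this card does not state explicitly; the function names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (traditional, simplified) -> sample count, copied from the table above.
COUNTS: Dict[Tuple[str, str], int] = {
    ("著", "著"): 249_116, ("著", "着"): 108_632,
    ("畫", "画"): 155_935, ("畫", "划"): 180_328,
    ("覆", "覆"): 23_468, ("覆", "复"): 183_362,
    ("鍊", "炼"): 11_019, ("鍊", "链"): 16_014,
    ("乾", "乾"): 20_686, ("乾", "干"): 105_431,
    ("帳", "帐"): 9_361, ("帳", "账"): 3_998,
    ("藉", "藉"): 4_618, ("藉", "借"): 5_000,
}


def group_counts(counts: Dict[Tuple[str, str], int]) -> Dict[str, List[int]]:
    """Collect the per-candidate counts under each Traditional character."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for (traditional, _simplified), n in counts.items():
        groups[traditional].append(n)
    return dict(groups)


def majority_baseline(counts: Dict[Tuple[str, str], int]) -> float:
    """Accuracy of always choosing the most frequent candidate per group."""
    groups = group_counts(counts)
    majority = sum(max(ns) for ns in groups.values())
    total = sum(sum(ns) for ns in groups.values())
    return majority / total


print(f"{majority_baseline(COUNTS):.3f}")  # prints 0.695
```

On these counts the baseline is roughly 69.5%, which gives some context for the reported eval/accuracy, provided the evaluation split follows a similar distribution.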
Training configuration
| Parameter | Value |
|---|---|
| Base model | hfl/chinese-roberta-wwm-ext |
| epochs | 3 (best at epoch 1) |
| batch size | 8 (gradient accumulation over 4 steps, effective batch size 32) |
| learning rate | 2e-5 |
| max length | 64 tokens |
| fp16 | ✅ |
| gradient checkpointing | ✅ |
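The table can be written down as keyword arguments in the style of `transformers.TrainingArguments`. This is a sketch rather than the original training script: the `metric_for_best_model` and `greater_is_better` entries are assumptions chosen to be consistent with selecting the checkpoint by eval/loss, and the single-device default in `effective_batch_size` is also an assumption.

```python
from typing import Any, Dict

# Names follow transformers.TrainingArguments; values come from the table above.
training_kwargs: Dict[str, Any] = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-5,
    "fp16": True,
    "gradient_checkpointing": True,
    "load_best_model_at_end": True,
    # Assumed: best checkpoint chosen by the lowest evaluation loss.
    "metric_for_best_model": "eval_loss",
    "greater_is_better": False,
}

# Not a TrainingArguments field: applied when tokenizing (truncation length).
MAX_LENGTH = 64


def effective_batch_size(kwargs: Dict[str, Any], num_devices: int = 1) -> int:
    """Samples contributing to each optimizer step."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


print(effective_batch_size(training_kwargs))  # prints 32
```

`load_best_model_at_end` also requires evaluation and checkpoint saving to run on the same schedule (for example once per epoch); the name of the evaluation-schedule argument differs between Transformers versions, so it is left out here.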
Training results
| Epoch | eval/loss | eval/accuracy |
|---|---|---|
| 1 | 0.19 ✅ | 93.2% |
| 2 | 0.20 | 93.6% |
| 3 | 0.27 | 93.9% |
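Epoch 1 has the lowest eval/loss, so its weights are the ones actually used (load_best_model_at_end=True), even though eval/accuracy keeps rising slightly through epoch 3.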
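The two metrics in the table point to different checkpoints, which the following small sketch makes explicit. The numbers are copied from the table; the variable names are illustrative.

```python
# Per-epoch evaluation metrics, copied from the results table.
EVAL = {
    1: {"loss": 0.19, "accuracy": 0.932},
    2: {"loss": 0.20, "accuracy": 0.936},
    3: {"loss": 0.27, "accuracy": 0.939},
}

# Selecting by lowest loss (the criterion this card reports using).
best_by_loss = min(EVAL, key=lambda epoch: EVAL[epoch]["loss"])

# Selecting by highest accuracy would pick a different checkpoint.
best_by_accuracy = max(EVAL, key=lambda epoch: EVAL[epoch]["accuracy"])

print(best_by_loss, best_by_accuracy)  # prints 1 3
```

Rising loss alongside rising accuracy usually indicates that the model is becoming more confident on the examples it gets wrong, so which checkpoint suits a given application may depend on whether calibrated scores or top-1 choices matter more.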
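Model files
| File | Description |
|---|---|
| model.safetensors | PyTorch model weights (474 MB) |
| config.json | Model architecture configuration |
| tokenizer.json | Tokenizer (Fast Tokenizer format) |
| tokenizer_config.json | Tokenizer configuration |
For the ONNX version, see the corresponding repository (converted with convert-to-onnx).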
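Known issues and possible improvements
- Noisy training data: some Simplified corpora use 「著」 and 「着」 interchangeably, which caused large gradient fluctuations during training. The data could be cleaned by using OpenCC to check whether each label is plausible.
- Limited pair coverage: only 7 high-frequency groups of ambiguous characters are handled. Other pairs with insufficient data (such as 「瞭/了」 and 「閒/间」) were not included in training.
- Extremely imbalanced pairs were excluded: for example 「臺/台」 (676,362 vs 0) and 「甦/苏」 (215,855 vs 0). Because one of the Simplified candidates almost never appears in the corpus, these pairs are not handled by the model; mapping them directly with the OpenCC dictionary is recommended.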
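The exclusion rule in the last bullet can be sketched as a simple routing decision. The threshold `min_minority` is hypothetical: this card does not state the cut-off that was actually used, only that pairs whose minority candidate is nearly absent were dropped.

```python
from typing import Sequence


def route(counts: Sequence[int], min_minority: int = 1000) -> str:
    """Return 'model' if every candidate has enough samples, else 'dictionary'.

    `min_minority` is a hypothetical threshold, not a value from this card.
    """
    return "model" if min(counts) >= min_minority else "dictionary"


# Counts taken from this card.
print(route((676_362, 0)))    # 臺/台 -> prints dictionary
print(route((9_361, 3_998)))  # 帳 (帐/账) -> prints model
```

Pairs routed to the dictionary keep the plain OpenCC mapping, while pairs routed to the model go through the second-pass correction.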
Related links
- Base model: hfl/chinese-roberta-wwm-ext
- Training dataset: tardigrade-doc/t2c-plus
- ONNX conversion tool: onnx-community/convert-to-onnx
- Reference: list of one-to-many Traditional-Simplified mappings (Wikipedia)
chinese-ambiguous-chars-model (ONNX)
This is an ONNX version of tardigrade-doc/chinese-ambiguous-chars-model. It was automatically converted and uploaded using this Hugging Face Space.
Usage with Transformers.js
See the pipeline documentation for fill-mask: https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.FillMaskPipeline