chinese-ambiguous-chars-model

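A disambiguation model for Traditional-to-Simplified Chinese conversion, fine-tuned from hfl/chinese-roberta-wwm-ext. It targets cases where one Traditional character maps to several Simplified characters and OpenCC cannot pick the right one.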

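Background

In Traditional-to-Simplified conversion, a single Traditional character can correspond to more than one Simplified character (see the one-to-many Traditional-Simplified conversion list). Dictionary-based tools such as OpenCC cannot use context to resolve this kind of ambiguity.

Typical examples: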

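Traditional	Context	Correct Simplified
著	著名 (famous), 著作 (writings)	著
著	走著 (walking), 看著 (watching)	着
乾	乾隆 (Qianlong), 乾坤 (heaven and earth)	乾
乾	乾燥 (dry), 乾爹 (godfather)	干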

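The model uses a masked language modeling (MLM) objective: the ambiguous character is masked, and the model predicts from the surrounding context which Simplified character it should become.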
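A minimal sketch of the masking step described above. The names `CANDIDATES`, `mask_at`, `pick_best` and `hf_scores` are hypothetical helpers, not part of the released code. `hf_scores` assumes the checkpoint loads with the standard Transformers `fill-mask` pipeline, that `[MASK]` is the tokenizer's mask token, and that every candidate is a single token in the vocabulary.

```python
from typing import Callable, Dict, List

# Traditional character -> Simplified candidates, taken from the table in this card.
CANDIDATES: Dict[str, List[str]] = {
    "著": ["著", "着"],
    "畫": ["画", "划"],
    "覆": ["覆", "复"],
    "鍊": ["炼", "链"],
    "乾": ["乾", "干"],
    "帳": ["帐", "账"],
    "藉": ["藉", "借"],
}


def mask_at(text: str, index: int, mask_token: str = "[MASK]") -> str:
    """Replace the character at `index` with the mask token."""
    return text[:index] + mask_token + text[index + 1:]


def pick_best(scores: Dict[str, float]) -> str:
    """Return the candidate with the highest score."""
    return max(scores, key=scores.get)


def hf_scores(model_id: str) -> Callable[[str, List[str]], Dict[str, float]]:
    """Build a scorer backed by a fill-mask pipeline (transformers is imported lazily)."""
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)

    def score(masked_text: str, candidates: List[str]) -> Dict[str, float]:
        # `targets` restricts the prediction to the given candidate tokens.
        results = fill(masked_text, targets=candidates)
        return {r["token_str"]: r["score"] for r in results}

    return score
```

For example, `mask_at("随著时间的推移", 1)` yields `随[MASK]时间的推移`, which can then be scored against `CANDIDATES["著"]` and reduced to one character with `pick_best`.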

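Usage

Recommended workflow: convert with OpenCC first, then correct the result with this model.

Traditional source: 隨著時間的推移
    ↓ OpenCC t2s
随著时间的推移        ← 著 is not converted correctly
    ↓ this model
随着时间的推移        ✅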
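The two-stage flow above could be wired together roughly as follows. This is a sketch under stated assumptions: `convert_with_fix` is a hypothetical helper, the dictionary converter and the model are injected as plain callables, and the dictionary conversion is assumed to be character-for-character so that positions in the source and the converted text line up.

```python
from typing import Callable, Dict, List

Converter = Callable[[str], str]            # e.g. a dictionary-based t2s converter
Chooser = Callable[[str, List[str]], str]   # (masked text, candidates) -> chosen character


def convert_with_fix(
    traditional: str,
    t2s: Converter,
    choose: Chooser,
    candidates: Dict[str, List[str]],
    mask_token: str = "[MASK]",
) -> str:
    """Run dictionary conversion first, then re-decide each ambiguous character."""
    simplified = t2s(traditional)
    # Assumption: one output character per input character. If the lengths
    # differ, positions cannot be aligned, so keep the dictionary result.
    if len(simplified) != len(traditional):
        return simplified
    chars = list(simplified)
    for i, ch in enumerate(traditional):
        if ch in candidates:
            masked = "".join(chars[:i]) + mask_token + "".join(chars[i + 1:])
            chars[i] = choose(masked, candidates[ch])
    return "".join(chars)
```

In practice `t2s` would wrap an OpenCC converter configured for `t2s`, and `choose` would wrap the model's fill-mask prediction restricted to the given candidates. Ambiguous positions are located in the Traditional source rather than in the converted text, because the dictionary pass may already have rewritten the character.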

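Training data

Dataset: tardigrade-doc/t2c-plus

Filtered from the following data sources:

Training character pairs and sample counts

Based on the one-to-many Traditional-Simplified conversion list and the data distribution, the following 7 groups of ambiguous characters were kept: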

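Traditional	Simplified	Samples	Notes
著	著	249,116	著名, 著作, etc.
著	着	108,632	走着, 看着, etc. (aspect particle)
畫	画	155,935	绘画, 画面, etc.
畫	划	180,328	划分, 规划, etc.
覆	覆	23,468	覆盖, 倾覆, etc.
覆	复	183,362	答复, 回复, etc.
鍊	炼	11,019	炼铁, 锻炼, etc.
鍊	链	16,014	链条, 链接, etc.
乾	乾	20,686	Proper nouns such as 乾隆, 乾坤
乾	干	105,431	干燥, 干净, etc.
帳	帐	9,361	帐篷, 蚊帐, etc.
帳	账	3,998	账目, 账单, etc.
藉	藉	4,618	慰藉, 狼藉, etc.
藉	借	5,000	借口, 凭借, etc. (truncated to 5,000)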
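The counts above are uneven within several groups. The following sketch computes each candidate's share within its group, plus the accuracy of a trivial baseline that always predicts the more frequent candidate. `COUNTS`, `class_shares` and `majority_baseline` are illustrative names, and the figures are copied from the table, so they describe the training counts rather than the evaluation split.

```python
from typing import Dict, Tuple

Pair = Tuple[str, str]  # (Traditional, Simplified)

# Sample counts copied from the table in this card.
COUNTS: Dict[Pair, int] = {
    ("著", "著"): 249_116, ("著", "着"): 108_632,
    ("畫", "画"): 155_935, ("畫", "划"): 180_328,
    ("覆", "覆"): 23_468,  ("覆", "复"): 183_362,
    ("鍊", "炼"): 11_019,  ("鍊", "链"): 16_014,
    ("乾", "乾"): 20_686,  ("乾", "干"): 105_431,
    ("帳", "帐"): 9_361,   ("帳", "账"): 3_998,
    ("藉", "藉"): 4_618,   ("藉", "借"): 5_000,
}


def class_shares(counts: Dict[Pair, int]) -> Dict[Pair, float]:
    """Share of each Simplified candidate within its Traditional character group."""
    totals: Dict[str, int] = {}
    for (trad, _), n in counts.items():
        totals[trad] = totals.get(trad, 0) + n
    return {pair: n / totals[pair[0]] for pair, n in counts.items()}


def majority_baseline(counts: Dict[Pair, int]) -> float:
    """Accuracy of always predicting the more frequent candidate per character."""
    best: Dict[str, int] = {}
    for (trad, _), n in counts.items():
        best[trad] = max(best.get(trad, 0), n)
    return sum(best.values()) / sum(counts.values())
```

On these counts the majority baseline comes out at roughly 0.70, which gives a reference point for reading the accuracy figures reported for the model; the evaluation split may be distributed differently.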

Training configuration

Parameter	Value
Base model	hfl/chinese-roberta-wwm-ext
epochs	3 (lowest eval/loss at epoch 1)
batch size	8 (4 gradient accumulation steps, effective batch size 32)
learning rate	2e-5
max length	64 tokens
fp16	✅
gradient checkpointing	✅
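Expressed as Hugging Face `TrainingArguments`, the table might look like the sketch below. The original training script is not shown in this card, so the per-epoch evaluation and save strategy, the choice of `eval_loss` as the selection metric, and single-device training are assumptions. `TRAINING_KWARGS`, `effective_batch_size` and `build_training_args` are hypothetical names.

```python
from typing import Any, Dict

MAX_LENGTH = 64  # tokenizer truncation length, applied when encoding samples

TRAINING_KWARGS: Dict[str, Any] = {
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-5,
    "fp16": True,
    "gradient_checkpointing": True,
    # Assumed: evaluate and save once per epoch so that the best checkpoint
    # can be restored when training finishes.
    "eval_strategy": "epoch",  # named `evaluation_strategy` in older transformers releases
    "save_strategy": "epoch",
    "load_best_model_at_end": True,
    "metric_for_best_model": "eval_loss",
    "greater_is_better": False,
}


def effective_batch_size(kwargs: Dict[str, Any], num_devices: int = 1) -> int:
    """Samples contributing to each optimizer step."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


def build_training_args(output_dir: str):
    """Create TrainingArguments from the settings above (transformers imported lazily)."""
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```

With one device, `effective_batch_size(TRAINING_KWARGS)` gives 8 × 4 = 32, matching the table.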

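Training results

Epoch	eval/loss	eval/accuracy
1	0.19 ✅	93.2%
2	0.20	93.6%
3	0.27	93.9%

Epoch 1 has the lowest eval/loss, so its weights are the ones actually used (load_best_model_at_end=True). eval/accuracy is slightly higher at epochs 2 and 3, but eval/loss rises over the same epochs.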
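Which checkpoint counts as "best" depends on the selection metric. A small sketch using the numbers from the table makes this concrete; `EVAL_LOG` and `best_epoch` are illustrative names, not part of the training code.

```python
from typing import Dict, List

# Evaluation metrics per epoch, copied from the table above.
EVAL_LOG: List[Dict[str, float]] = [
    {"epoch": 1, "eval_loss": 0.19, "eval_accuracy": 0.932},
    {"epoch": 2, "eval_loss": 0.20, "eval_accuracy": 0.936},
    {"epoch": 3, "eval_loss": 0.27, "eval_accuracy": 0.939},
]


def best_epoch(log: List[Dict[str, float]], metric: str, greater_is_better: bool) -> int:
    """Return the epoch whose `metric` is best under the given direction."""
    pick = max if greater_is_better else min
    return int(pick(log, key=lambda row: row[metric])["epoch"])
```

Selecting on `eval_loss` (lower is better) returns epoch 1, while selecting on `eval_accuracy` (higher is better) would return epoch 3.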

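Model files

File	Description
`model.safetensors`	PyTorch model weights (474 MB)
`config.json`	Model architecture configuration
`tokenizer.json`	Tokenizer (Fast Tokenizer format)
`tokenizer_config.json`	Tokenizer configuration

The ONNX version lives in a separate repository (converted with convert-to-onnx).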

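Known issues and possible improvements

Noisy training data: some Simplified corpora mix 「著」 and 「着」, which causes large gradient fluctuations during training. The labels could be cleaned by checking their plausibility with OpenCC.
Limited coverage: only 7 high-frequency groups are handled. Pairs with too little data (such as 「瞭/了」 and 「閒/间」) were not included in training.
Extremely imbalanced pairs are excluded: for example 「臺/台」 (676,362 vs 0) and 「甦/苏」 (215,855 vs 0). Because one of the Simplified candidates almost never appears in the corpus, these pairs are not handled by the model; use the OpenCC dictionary mapping directly.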
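For the excluded pairs, an unconditional character mapping is enough. The sketch below only illustrates what such a mapping amounts to, using the two examples named above; `FIXED_MAP` and `apply_fixed_map` are hypothetical names, and a real pipeline would rely on OpenCC's own dictionaries rather than a hand-written table.

```python
# Characters whose Simplified form does not depend on context in this corpus.
FIXED_MAP = str.maketrans({"臺": "台", "甦": "苏"})


def apply_fixed_map(text: str) -> str:
    """Replace each mapped character unconditionally; all other characters are kept."""
    return text.translate(FIXED_MAP)
```

Only the mapped characters change, so this step can run before or after the model without affecting the seven context-dependent groups.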

chinese-ambiguous-chars-model (ONNX)

This is an ONNX version of tardigrade-doc/chinese-ambiguous-chars-model. It was automatically converted and uploaded using this Hugging Face Space.

Usage with Transformers.js

See the pipeline documentation for fill-mask: https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.FillMaskPipeline
