chinese-ambiguous-chars-model

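A model for disambiguating characters in Traditional-to-Simplified Chinese conversion. It is fine-tuned from hfl/chinese-roberta-wwm-ext and targets the one-to-many cases (one Traditional character, several possible Simplified characters) that OpenCC cannot resolve correctly.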


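Background

In Traditional-to-Simplified conversion, a single Traditional character can correspond to several Simplified characters (see the list of one-to-many Traditional-to-Simplified mappings). Dictionary-based tools such as OpenCC cannot use the surrounding context to pick the right one in these ambiguous cases.

Typical examples:

Traditional  Context  Correct Simplified
著  著名, 著作  著
著  走著, 看著  着
乾  乾隆, 乾坤  乾
乾  乾燥, 乾爹  干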

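The model works as an MLM (Masked Language Model): the ambiguous character is masked, and the model predicts from the surrounding context which Simplified character it should become.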
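As a rough illustration of this masked-prediction step, the sketch below masks one character and asks a fill-mask scorer to choose among the allowed Simplified forms. It assumes the checkpoint works with the standard `fill-mask` pipeline from `transformers` and that the mask token is `[MASK]`, as for BERT-style tokenizers. `CANDIDATES`, `mask_at`, `choose_simplified`, and `load_fill_mask` are illustrative names, not part of the model's API.

```python
from typing import Callable, Dict, List

# Illustrative subset of candidate sets (an assumption, not shipped with the model).
CANDIDATES: Dict[str, List[str]] = {
    "著": ["著", "着"],
    "乾": ["乾", "干"],
}

MASK = "[MASK]"  # assumed mask token of the BERT-style tokenizer


def mask_at(text: str, index: int) -> str:
    """Replace the character at `index` with the mask token."""
    return text[:index] + MASK + text[index + 1:]


def choose_simplified(fill_mask: Callable, text: str, index: int) -> str:
    """Return the highest-scoring candidate for the character at `index`."""
    candidates = CANDIDATES[text[index]]
    # `targets` restricts scoring to the candidate characters only.
    results = fill_mask(mask_at(text, index), targets=candidates)
    best = max(results, key=lambda r: r["score"])
    return best["token_str"]


def load_fill_mask():
    """Load the real pipeline (needs `transformers` and network access)."""
    from transformers import pipeline
    return pipeline("fill-mask", model="tardigrade-doc/chinese-ambiguous-chars-model")
```

With the real pipeline, a call could look like `choose_simplified(load_fill_mask(), "随著时间的推移", 1)`. Passing the scorer in as an argument keeps the selection logic testable without downloading weights.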


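Use cases

Recommended workflow: OpenCC for the initial conversion → this model for a second-pass correction

Traditional source: 隨著時間的推移
    ↓ OpenCC t2s
随著时间的推移        ← 著 is not converted correctly
    ↓ this model
随着时间的推移        ✅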
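The two-stage flow above can be sketched as follows. This is a minimal outline, not the author's implementation: `AMBIGUOUS` is an illustrative subset, `opencc_convert` and `choose` are injected callables standing in for OpenCC and the model, and the code assumes the dictionary pass keeps character positions aligned (it returns the OpenCC output unchanged when that assumption fails).

```python
from typing import Callable, Dict, List

# Traditional character -> allowed Simplified forms (illustrative subset).
AMBIGUOUS: Dict[str, List[str]] = {
    "著": ["著", "着"],
    "乾": ["乾", "干"],
}


def convert_with_correction(
    traditional: str,
    opencc_convert: Callable[[str], str],
    choose: Callable[[str, int, List[str]], str],
) -> str:
    """Run the dictionary pass, then re-decide each ambiguous position.

    `choose(text, index, candidates)` stands in for the model call and
    returns one of `candidates` for the character at `index`.
    """
    converted = opencc_convert(traditional)
    # Assumption: the dictionary pass keeps character positions aligned.
    if len(converted) != len(traditional):
        return converted
    chars = list(converted)
    for i, ch in enumerate(traditional):
        if ch in AMBIGUOUS:
            chars[i] = choose("".join(chars), i, AMBIGUOUS[ch])
    return "".join(chars)
```

With the `opencc` Python bindings installed, `OpenCC("t2s").convert` could serve as `opencc_convert`, and `choose` could wrap a fill-mask call restricted to the candidate characters.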

Training data

Dataset: tardigrade-doc/t2c-plus

It was obtained by filtering other data sources.

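Training character pairs and data volume

Based on the list of one-to-many Traditional-to-Simplified mappings and on the data distribution, the following 7 groups of ambiguous characters were kept:

Traditional  Simplified  Samples  Notes
著  著  249,116  著名, 著作, etc.
著  着  108,632  走着, 看着, etc. (aspect particle)
畫  画  155,935  绘画, 画面, etc.
畫  划  180,328  划分, 规划, etc.
覆  覆  23,468  覆盖, 倾覆, etc.
覆  复  183,362  答复, 回复, etc.
鍊  炼  11,019  炼铁, 锻炼, etc.
鍊  链  16,014  链条, 链接, etc.
乾  乾  20,686  乾隆, 乾坤, etc. (proper nouns)
乾  干  105,431  干燥, 干净, etc.
帳  帐  9,361  帐篷, 蚊帐, etc.
帳  账  3,998  账目, 账单, etc.
藉  藉  4,618  慰藉, 狼藉, etc.
藉  借  5,000  借口, 凭借, etc. (truncated to 5,000)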
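The card does not describe how the samples were built. One plausible way to turn Simplified sentences into masked examples with a per-label cap (like the 5,000 limit in the table) is sketched below. `build_examples`, its parameters, and the `[MASK]` token are assumptions for illustration only.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

MASK = "[MASK]"  # assumed mask token


def build_examples(
    sentences: Iterable[str],
    labels: Set[str],
    cap_per_label: int,
) -> List[Tuple[str, str]]:
    """Turn Simplified sentences into (masked_text, label) pairs.

    Each occurrence of a character in `labels` yields one example;
    a label stops producing examples once it reaches `cap_per_label`.
    """
    counts: Counter = Counter()
    examples: List[Tuple[str, str]] = []
    for sentence in sentences:
        for i, ch in enumerate(sentence):
            if ch in labels and counts[ch] < cap_per_label:
                examples.append((sentence[:i] + MASK + sentence[i + 1:], ch))
                counts[ch] += 1
    return examples
```

Capping only the frequent label is one simple way to limit imbalance inside a group; it does not by itself address noisy labels.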

Training configuration

Parameter  Value
Base model  hfl/chinese-roberta-wwm-ext
Epochs  3 (best at epoch 1)
Batch size  8 (gradient accumulation over 4 steps, effective 32)
Learning rate  2e-5
Max length  64 tokens
fp16  enabled
Gradient checkpointing  enabled
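For readers who want to reproduce a similar setup, the hyperparameters above map roughly onto `transformers.TrainingArguments` as sketched below. The output path, the per-epoch evaluation and save strategy, and the single-device assumption behind the effective batch size are not stated in the card and are assumptions here.

```python
# Hyperparameters from the table, as keyword arguments for TrainingArguments.
TRAINING_KWARGS = {
    "output_dir": "outputs",  # hypothetical path
    "num_train_epochs": 3,
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 4,
    "learning_rate": 2e-5,
    "fp16": True,
    "gradient_checkpointing": True,
    "eval_strategy": "epoch",  # assumed; named `evaluation_strategy` in older releases
    "save_strategy": "epoch",  # assumed; must match the eval strategy
    "load_best_model_at_end": True,
}
MAX_LENGTH = 64  # applied at tokenization time, not a TrainingArguments field


def effective_batch_size(kwargs: dict, num_devices: int = 1) -> int:
    """Per-device batch size times accumulation steps times device count."""
    return (
        kwargs["per_device_train_batch_size"]
        * kwargs["gradient_accumulation_steps"]
        * num_devices
    )


def build_training_arguments():
    """Instantiate the real object (needs `transformers` and a GPU for fp16)."""
    from transformers import TrainingArguments
    return TrainingArguments(**TRAINING_KWARGS)
```

On a single device, `effective_batch_size(TRAINING_KWARGS)` gives the 32 quoted in the table.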

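Training results

Epoch  eval/loss  eval/accuracy
1  0.19  93.2%
2  0.20  93.6%
3  0.27  93.9%

Epoch 1 has the lowest eval/loss, so its weights are the ones actually used (load_best_model_at_end=True), even though eval/accuracy kept rising slightly through epoch 3.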
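The checkpoint choice can be reproduced from the table with a few lines. When `load_best_model_at_end=True` and no `metric_for_best_model` is given, the `transformers` Trainer compares checkpoints by evaluation loss, which is what the hypothetical `best_epoch` helper below mimics.

```python
# Evaluation log copied from the results table.
EVAL_LOG = [
    {"epoch": 1, "eval_loss": 0.19, "eval_accuracy": 0.932},
    {"epoch": 2, "eval_loss": 0.20, "eval_accuracy": 0.936},
    {"epoch": 3, "eval_loss": 0.27, "eval_accuracy": 0.939},
]


def best_epoch(log, metric="eval_loss", greater_is_better=False):
    """Return the epoch whose `metric` is best under the given direction."""
    pick = max if greater_is_better else min
    return pick(log, key=lambda row: row[metric])["epoch"]
```

Selecting by `eval_accuracy` with `greater_is_better=True` would pick a different epoch, so the choice of metric matters for which weights are kept.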


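Model files

File  Description
model.safetensors  PyTorch model weights (474 MB)
config.json  Model architecture configuration
tokenizer.json  Tokenizer (Fast Tokenizer format)
tokenizer_config.json  Tokenizer configuration

An ONNX version is available in the corresponding repository (converted with convert-to-onnx).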


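Known issues and possible improvements

  • Noisy training data: some Simplified corpora use 「著」 and 「着」 interchangeably, which caused large gradient fluctuations during training. The data could be cleaned by using OpenCC to check whether each label is plausible.
  • Limited coverage: only 7 groups of high-frequency ambiguous characters are handled. Other pairs with too little data (such as 「瞭/了」 and 「閒/间」) were not included in training.
  • Extremely imbalanced pairs are excluded: for pairs such as 「臺/台」 (676362 vs 0) and 「甦/苏」 (215855 vs 0), one of the Simplified candidates almost never appears in the corpus, so they are not handled by the model. A direct OpenCC dictionary mapping is recommended for them.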
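For the excluded pairs in the last item, a fixed character mapping is enough. The sketch below uses Python's built-in `str.translate` as a stand-in for an OpenCC dictionary lookup; `FIXED_MAP` and `apply_fixed_map` are illustrative names, and the two entries are the examples given in the list.

```python
# Fixed one-way mapping for pairs the model does not handle (illustrative).
FIXED_MAP = str.maketrans({"臺": "台", "甦": "苏"})


def apply_fixed_map(text: str) -> str:
    """Replace each excluded Traditional character with its Simplified form."""
    return text.translate(FIXED_MAP)
```

In practice OpenCC's own t2s dictionaries would usually cover these characters, so a table like this matters only when the dictionary pass is customized or skipped.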

Related links

chinese-ambiguous-chars-model (ONNX)

This is an ONNX version of tardigrade-doc/chinese-ambiguous-chars-model. It was automatically converted and uploaded using this Hugging Face Space.

Usage with Transformers.js

See the pipeline documentation for fill-mask: https://huggingface.co/docs/transformers.js/api/pipelines#module_pipelines.FillMaskPipeline


