| --- |
| license: mit |
| language: |
| - zh |
| tags: |
| - chinese |
| - word-segmentation |
| - word-sense-disambiguation |
| - coreml |
| - ios |
| - epub-reader |
| datasets: |
| - wyy209/MiCLS |
| base_model: |
| - hfl/chinese-electra-180g-base-discriminator |
| - hfl/chinese-electra-180g-small-discriminator |
| - thenlper/gte-base-zh |
| --- |
| |
| # Wen Reader — On-Device Chinese NLP Models |
|
|
| Model weights for [Wen Reader](https://github.com/oliverzh2000/wen-reader), an open-source Chinese EPUB reader for iOS with a popup dictionary. |
|
|
| These models run on-device via CoreML to provide context-aware Chinese word segmentation and word sense disambiguation. |
|
|
| ## Models |
|
|
| ### cws-span-scorer-electra-base |
|
|
| Chinese Word Segmentation model using span scoring. |
|
|
| - **Architecture:** ELECTRA-base (12 layers, 768 hidden, 12 heads) + span scoring MLP head |
| - **Base model:** [hfl/chinese-electra-180g-base-discriminator](https://huggingface.co/hfl/chinese-electra-180g-base-discriminator) |
| - **Parameters:** ~102M (encoder) + span head |
| - **Task:** Given a sentence, score candidate word spans to find optimal segmentation |
| - **Eval:** 90/97 perfect segmentations on a hand-curated test set (93/97 including acceptable over-splits) |
|
|
| ### cws-span-scorer-electra-small |
|
|
| A smaller, faster variant of the base span scorer for lower-end devices. |
|
|
| - **Architecture:** ELECTRA-small (12 layers, 256 hidden, 4 heads) + span scoring MLP head |
| - **Base model:** [hfl/chinese-electra-180g-small-discriminator](https://huggingface.co/hfl/chinese-electra-180g-small-discriminator) |
| - **Parameters:** ~12M (encoder) + span head |
| - **Eval:** 85/97 perfect (91/97 including acceptable over-splits) |
|
|
| ### wsd-biencoder-gte-base |
|
|
| Word Sense Disambiguation bi-encoder for selecting the correct dictionary definition in context. |
|
|
| - **Architecture:** GTE-base-zh fine-tuned as a bi-encoder (SentenceTransformers) |
| - **Base model:** [thenlper/gte-base-zh](https://huggingface.co/thenlper/gte-base-zh) |
| - **Parameters:** ~102M |
| - **Task:** Encode context sentence and candidate sense labels, rank by cosine similarity |
| - **Eval:** Top-1 accuracy 88.7%, Top-3 99.6%, MRR 0.936 on 239 hand-curated disambiguation examples |
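
The ranking step and the reported metrics can be illustrated with a minimal sketch. It assumes the embeddings are plain lists of floats and that each example has exactly one correct sense; the function names `cosine`, `gold_rank`, and `ranking_metrics` are hypothetical and not taken from the Wen Reader code, and the vectors are toy values rather than real model output.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def gold_rank(context_vec, sense_vecs, gold_index):
    """1-based rank of the correct sense when senses are sorted by similarity."""
    sims = [cosine(context_vec, s) for s in sense_vecs]
    order = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return order.index(gold_index) + 1


def ranking_metrics(gold_ranks, ks=(1, 3)):
    """Top-k accuracy for each k in ks, plus mean reciprocal rank (MRR)."""
    n = len(gold_ranks)
    if n == 0:
        raise ValueError("need at least one example")
    top_k = {k: sum(r <= k for r in gold_ranks) / n for k in ks}
    mrr = sum(1.0 / r for r in gold_ranks) / n
    return top_k, mrr


# Toy embeddings (assumption: 2-dimensional stand-ins for real model output).
senses = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0]]
context = [1.0, 0.0]
ranks = [gold_rank(context, senses, gold) for gold in (1, 2, 0)]
top_k, mrr = ranking_metrics(ranks)
```

MRR rewards near misses: an example whose correct sense is ranked second still contributes 0.5, which is why it sits between the Top-1 and Top-3 figures.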
|
|
| ## Usage |
|
|
| The Wen Reader iOS app runs these models on-device after they are converted to CoreML. |
|
|
| To build the app from a fresh clone: |
|
|
| ```bash |
| cd ml |
| |
| # 1. Download model weights from HF Hub |
| uv run python scripts/download_models.py |
| |
| # 2. Export to CoreML |
| uv run python scripts/export_coreml.py span --model-dir models/cws_span_scorer_electra_base/final |
| uv run python scripts/export_coreml.py wsd --model-dir models/wsd_biencoder_gte_base/final |
| |
| # 3. Bundle into app resources (vocab, CoreML packages, CEDICT database) |
| ./scripts/run_pipeline.sh bundle |
| ``` |
|
|
| Or, from inside `ml`, run each stage through the pipeline script: |
|
|
| ```bash |
| ./scripts/run_pipeline.sh download |
| ./scripts/run_pipeline.sh export |
| ./scripts/run_pipeline.sh bundle |
| ``` |
|
|
| ## Training |
|
|
| ### CWS Span Scorer |
|
|
| Fine-tuned on a custom dataset built from: |
| - [ICWB2](https://github.com/yuikns/icwb2-data) segmentation corpus (MSR subset: sentences auto-matched where a greedy segmenter agrees with the gold segmentation, plus LLM-annotated cases where it disagrees) |
| - Chinese [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) and [OpenSubtitles](https://huggingface.co/datasets/FradSer/OpenSubtitles-en-zh-cn-20m) sentences, LLM-annotated (Claude Opus) |
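
The "auto-matched" filter on the ICWB2 data can be pictured with a short sketch. It assumes the greedy segmenter is a forward longest-match over a word list, which the description above does not confirm; `greedy_segment`, `split_by_agreement`, the `max_len` limit, and the tiny lexicon are all hypothetical.

```python
def greedy_segment(sentence, lexicon, max_len=4):
    """Forward longest-match segmentation (assumed form of the greedy segmenter)."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first; a single character always matches.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            if length == 1 or piece in lexicon:
                words.append(piece)
                i += length
                break
    return words


def split_by_agreement(examples, lexicon):
    """Separate sentences where greedy output equals gold from those where it differs."""
    agreed, disputed = [], []
    for sentence, gold in examples:
        if greedy_segment(sentence, lexicon) == gold:
            agreed.append(sentence)
        else:
            disputed.append(sentence)
    return agreed, disputed


# Toy lexicon and gold segmentations (illustrative only, not from the corpus).
lexicon = {"研究", "研究生", "生命", "起源", "喜欢", "读书"}
examples = [
    ("我喜欢读书", ["我", "喜欢", "读书"]),
    ("研究生命起源", ["研究", "生命", "起源"]),
]
agreed, disputed = split_by_agreement(examples, lexicon)
```

In this toy case the second sentence lands in `disputed` because longest-match picks 研究生 first; sentences like this would be the ones handed to an LLM for annotation.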
|
|
| Sentences with ambiguous segmentation boundaries (where multiple valid CEDICT words overlap) are |
| identified, then an LLM annotates the correct segmentation. The model scores all candidate word |
| spans at each ambiguous position using a span scoring MLP head (taking span boundary token |
| representations + learned width embeddings as input). A dynamic programming decoder finds the |
| optimal segmentation at inference time. Training uses cross-entropy loss over candidate spans at |
| each ambiguous position. The encoder and the head are trained with different (discriminative) learning rates. |
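
The decoding step can be sketched as a standard dynamic program over span scores. This is a minimal illustration, not the app's decoder: it assumes the score of a segmentation is the sum of its span scores, that every single character is always a valid candidate, and that `span_score` stands in for the MLP head. The names `decode` and `toy_span_score` and the score values are hypothetical.

```python
def decode(sentence, span_score, max_len=4):
    """Return the segmentation with the highest total span score.

    span_score(text) returns a float for a candidate span, or None if the
    span is not a candidate. Assumption: single characters always score.
    """
    n = len(sentence)
    best = [float("-inf")] * (n + 1)  # best[i]: top score for sentence[:i]
    back = [0] * (n + 1)              # back[i]: start of the last word in that best path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = span_score(sentence[start:end])
            if score is None:
                continue
            total = best[start] + score
            if total > best[end]:
                best[end] = total
                back[end] = start
    if n > 0 and best[n] == float("-inf"):
        raise ValueError("no segmentation covers the whole sentence")
    # Walk the back-pointers from the end of the sentence to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(sentence[start:end])
        end = start
    return words[::-1]


# Toy scores standing in for the span head's output (illustrative values).
TOY_SCORES = {"研究": 2.0, "研究生": 1.5, "生命": 2.0, "起源": 2.0}


def toy_span_score(text):
    if len(text) == 1:
        return 0.0  # single characters are always allowed
    return TOY_SCORES.get(text)


words = decode("研究生命起源", toy_span_score)
```

Because the decoder compares whole paths rather than committing to the longest match at each position, it can prefer 研究 followed by 生命 even though 研究生 is also a dictionary word.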
|
|
| ### WSD Bi-encoder |
|
|
| Fine-tuned on CC-CEDICT sense clusters with data from: |
| - [MiCLS](https://huggingface.co/datasets/wyy209/MiCLS) WSD corpus (mapped to CEDICT senses via BGE embedding similarity) |
| - LLM-generated context examples for each sense cluster (Claude Opus and Sonnet) |
| - LLM-annotated ebook sentences segmented by the CWS span scorer (Claude Sonnet) |
|
|
| Sense clusters are derived from CC-CEDICT entries with related senses merged using LLM-assisted |
| clustering. Chinese sense labels are LLM-translated from the English CEDICT definitions. |
|
|
| Training uses grouped cross-entropy loss: contexts and sense labels are encoded by the same model, |
| cosine similarity is computed between context embeddings and sense embeddings, and the loss is |
| cross-entropy over the similarity scores (with temperature scaling at τ=0.1). Contexts are grouped |
| by word in each batch so sense embeddings are encoded once and reused across all contexts for that word. |
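
The loss for one word group can be written out in a few lines. This is a pure-Python sketch under stated assumptions: embeddings are plain float lists, one group holds all contexts for a single word, and the loss is averaged over those contexts. The function name `grouped_cross_entropy` and the toy vectors are hypothetical; the actual training code presumably uses batched tensors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def grouped_cross_entropy(context_vecs, gold_indices, sense_vecs, tau=0.1):
    """Mean cross-entropy over the contexts of one word.

    sense_vecs is computed once per word and shared by every context in
    the group; tau is the softmax temperature.
    """
    total = 0.0
    for ctx, gold in zip(context_vecs, gold_indices):
        logits = [cosine(ctx, s) / tau for s in sense_vecs]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[gold]
    return total / len(context_vecs)


# Toy 2-dimensional embeddings (illustrative, not model output).
senses = [[1.0, 0.0], [0.0, 1.0]]
contexts = [[0.9, 0.1], [0.2, 0.8]]
loss_correct = grouped_cross_entropy(contexts, [0, 1], senses)
loss_swapped = grouped_cross_entropy(contexts, [1, 0], senses)
```

Dividing by τ=0.1 stretches cosine similarities, which lie in [-1, 1], to logits in [-10, 10], so the softmax can become sharply peaked; without it the loss would stay close to that of a uniform guess.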