Wen Reader — On-Device Chinese NLP Models
Model weights for Wen Reader, an open-source Chinese EPUB reader for iOS with a popup dictionary.
These models run on-device via CoreML to provide context-aware Chinese word segmentation and word sense disambiguation.
Models
cws-span-scorer-electra-base
Chinese Word Segmentation model using span scoring.
- Architecture: ELECTRA-base (12 layers, 768 hidden, 12 heads) + span scoring MLP head
- Base model: hfl/chinese-electra-180g-base-discriminator
- Parameters: ~102M (encoder) + span head
- Task: Given a sentence, score candidate word spans to find optimal segmentation
- Eval: 90/97 perfect segmentation on a hand-curated test set (93/97 including acceptable over-splits)
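As a rough illustration of what "candidate word spans" means here, the sketch below enumerates every dictionary word found in a sentence and checks whether any of them overlap. It is a minimal sketch under stated assumptions, not the project's code: the function names, the `max_len` cutoff, and the four-word toy lexicon (standing in for CC-CEDICT) are all hypothetical.

```python
def candidate_spans(sentence, lexicon, max_len=4):
    """List (start, end) offsets of single characters and lexicon words.

    `end` is exclusive. `max_len` is a hypothetical cap on word length.
    """
    spans = []
    for start in range(len(sentence)):
        last_end = min(len(sentence), start + max_len)
        for end in range(start + 1, last_end + 1):
            if end - start == 1 or sentence[start:end] in lexicon:
                spans.append((start, end))
    return spans


def is_ambiguous(spans):
    """True if any two multi-character spans share at least one character."""
    multi = [s for s in spans if s[1] - s[0] > 1]
    for i, a in enumerate(multi):
        for b in multi[i + 1:]:
            if a[0] < b[1] and b[0] < a[1]:
                return True
    return False


# Toy lexicon standing in for CC-CEDICT (assumption for illustration only).
lexicon = {"研究", "研究生", "生命", "起源"}
spans = candidate_spans("研究生命起源", lexicon)
words = [s for s in spans if s[1] - s[0] > 1]
# words == [(0, 2), (0, 3), (2, 4), (4, 6)]
```

Here 研究生 overlaps both 研究 and 生命, so the sentence counts as ambiguous and its overlapping spans are the ones a scorer would need to choose between.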
cws-span-scorer-electra-small
Smaller/faster variant of the above for lower-end devices.
- Architecture: ELECTRA-small (12 layers, 256 hidden, 4 heads) + span scoring MLP head
- Base model: hfl/chinese-electra-180g-small-discriminator
- Parameters: ~12M (encoder) + span head
- Eval: 85/97 perfect (91/97 including acceptable over-splits)
wsd-biencoder-gte-base
Word Sense Disambiguation bi-encoder for selecting the correct dictionary definition in context.
- Architecture: GTE-base-zh fine-tuned as a bi-encoder (SentenceTransformers)
- Base model: thenlper/gte-base-zh
- Parameters: ~102M
- Task: Encode context sentence and candidate sense labels, rank by cosine similarity
- Eval: Top-1 accuracy 88.7%, Top-3 99.6%, MRR 0.936 on 239 hand-curated disambiguation examples
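The ranking step and the MRR metric above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the toy 2-D vectors stand in for real encoder outputs, and the function names are hypothetical rather than taken from the repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_senses(context_vec, sense_vecs):
    """Return sense indices sorted from most to least similar to the context."""
    scores = [cosine(context_vec, s) for s in sense_vecs]
    return sorted(range(len(sense_vecs)), key=lambda i: scores[i], reverse=True)


def mean_reciprocal_rank(rankings, gold):
    """rankings[k] is a ranked list of sense indices; gold[k] is the correct one."""
    total = sum(1.0 / (r.index(g) + 1) for r, g in zip(rankings, gold))
    return total / len(gold)


# Toy embeddings (assumption): one context vector and three candidate senses.
context = [1.0, 0.2]
senses = [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]]
ranking = rank_senses(context, senses)  # [1, 2, 0]
```

Top-1 accuracy is the fraction of examples where the gold sense is ranked first; MRR also gives partial credit when the gold sense is ranked second or third.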
Usage
The Wen Reader iOS app runs these models on-device after they are converted to CoreML.
To build the app from a fresh clone:
cd ml
# 1. Download model weights from HF Hub
uv run python scripts/download_models.py
# 2. Export to CoreML
uv run python scripts/export_coreml.py span --model-dir models/cws_span_scorer_electra_base/final
uv run python scripts/export_coreml.py wsd --model-dir models/wsd_biencoder_gte_base/final
# 3. Bundle into app resources (vocab, CoreML packages, CEDICT database)
./scripts/run_pipeline.sh bundle
Alternatively, run all three stages through the pipeline script:
./scripts/run_pipeline.sh download
./scripts/run_pipeline.sh export
./scripts/run_pipeline.sh bundle
Training
CWS Span Scorer
Fine-tuned on a custom dataset built from:
- ICWB2 segmentation corpus (MSR, auto-matched where greedy segmenter agrees with gold, plus LLM-annotated disagreement cases)
- Chinese Wikipedia and OpenSubtitles sentences, LLM-annotated (Claude Opus)
Sentences with ambiguous segmentation boundaries (where multiple valid CEDICT words overlap) are identified first, and an LLM then annotates the correct segmentation. At each ambiguous position, the model scores every candidate word span with an MLP head that takes the span's boundary token representations plus a learned width embedding as input. Training minimizes cross-entropy loss over the candidate spans at each ambiguous position, with discriminative (separate) learning rates for the encoder and the head. At inference time, a dynamic programming decoder finds the optimal segmentation.
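To make the decoding step concrete, here is a minimal sketch of a dynamic programming decoder over span scores. It assumes the segmentation score is the sum of its span scores and that unscored single characters are always allowed at a default score. Both are assumptions for illustration; the exact objective used by the project's decoder is not described here, and the scores below are made up.

```python
def decode(sentence, span_scores, default_score=0.0):
    """Return the segmentation that maximizes the sum of span scores.

    span_scores maps (start, end) character offsets (end exclusive) to a score.
    Single characters without a score fall back to default_score.
    """
    n = len(sentence)
    best = [float("-inf")] * (n + 1)  # best[i]: best total score for sentence[:i]
    back = [0] * (n + 1)              # back[i]: start offset of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            if (start, end) in span_scores:
                score = span_scores[(start, end)]
            elif end - start == 1:
                score = default_score
            else:
                continue  # not a candidate span
            if best[start] + score > best[end]:
                best[end] = best[start] + score
                back[end] = start
    # Follow back-pointers from the end of the sentence to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(sentence[start:end])
        end = start
    return words[::-1]


# Made-up scores: 研究 / 生命 / 起源 outscores 研究生 / 命 / 起源.
scores = {(0, 2): 2.0, (0, 3): 1.5, (2, 4): 2.0, (4, 6): 2.0}
result = decode("研究生命起源", scores)  # ['研究', '生命', '起源']
```

The decoder visits each end position once and considers every start before it, so it avoids enumerating all segmentations while still finding the highest-scoring one under the additive assumption.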
WSD Bi-encoder
Fine-tuned on CC-CEDICT sense clusters with data from:
- MiCLS WSD corpus (mapped to CEDICT senses via BGE embedding similarity)
- LLM-generated context examples for each sense cluster (Claude Opus and Sonnet)
- LLM-annotated ebook sentences segmented by the CWS span scorer (Claude Sonnet)
Sense clusters are derived from CC-CEDICT entries with related senses merged using LLM-assisted clustering. Chinese sense labels are LLM-translated from the English CEDICT definitions.
Training uses grouped cross-entropy loss: contexts and sense labels are encoded by the same model, cosine similarity is computed between context embeddings and sense embeddings, and the loss is cross-entropy over the similarity scores (with temperature scaling at τ=0.1). Contexts are grouped by word in each batch so sense embeddings are encoded once and reused across all contexts for that word.
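The loss for a single word group can be sketched as follows. This is an illustrative plain-Python version under the description above, not the training code: the function names and toy vectors are hypothetical, and a real implementation would operate on batched tensors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def grouped_cross_entropy(context_vecs, gold_indices, sense_vecs, tau=0.1):
    """Mean cross-entropy over all contexts of one word.

    sense_vecs is computed once per word and shared by every context in the
    group; gold_indices[k] is the correct sense index for context_vecs[k].
    """
    total = 0.0
    for ctx, gold in zip(context_vecs, gold_indices):
        # Temperature-scaled cosine similarities act as logits.
        logits = [cosine(ctx, s) / tau for s in sense_vecs]
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[gold]
    return total / len(context_vecs)


# Toy group (assumption): two contexts of one word with two candidate senses.
senses = [[1.0, 0.0], [0.0, 1.0]]
contexts = [[1.0, 0.1], [0.2, 1.0]]
loss_correct = grouped_cross_entropy(contexts, [0, 1], senses)
loss_wrong = grouped_cross_entropy(contexts, [1, 0], senses)
```

Dividing by a small τ sharpens the softmax, so a modest gap in cosine similarity between the correct and incorrect senses translates into a large gap in loss.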