Wen Reader — On-Device Chinese NLP Models

Model weights for Wen Reader, an open-source Chinese EPUB reader for iOS with a popup dictionary.

These models run on-device via CoreML to provide context-aware Chinese word segmentation and word sense disambiguation.

Models

cws-span-scorer-electra-base

Chinese Word Segmentation model using span scoring.

  • Architecture: ELECTRA-base (12 layers, 768 hidden, 12 heads) + span scoring MLP head
  • Base model: hfl/chinese-electra-180g-base-discriminator
  • Parameters: ~102M (encoder) + span head
  • Task: Given a sentence, score candidate word spans to find optimal segmentation
  • Eval: 90/97 sentences segmented perfectly on a hand-curated test set (93/97 when acceptable over-splits are counted as correct)

cws-span-scorer-electra-small

Smaller/faster variant of the above for lower-end devices.

  • Architecture: ELECTRA-small (12 layers, 256 hidden, 4 heads) + span scoring MLP head
  • Base model: hfl/chinese-electra-180g-small-discriminator
  • Parameters: ~12M (encoder) + span head
  • Eval: 85/97 sentences segmented perfectly on the same test set (91/97 when acceptable over-splits are counted as correct)

wsd-biencoder-gte-base

Word Sense Disambiguation bi-encoder for selecting the correct dictionary definition in context.

  • Architecture: GTE-base-zh fine-tuned as a bi-encoder (SentenceTransformers)
  • Base model: thenlper/gte-base-zh
  • Parameters: ~102M
  • Task: Encode context sentence and candidate sense labels, rank by cosine similarity
  • Eval: Top-1 accuracy 88.7%, Top-3 99.6%, MRR 0.936 on 239 hand-curated disambiguation examples
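
The ranking step described in the Task line can be sketched as follows. This is a minimal illustration in plain Python, not the app's CoreML code: the function names, the toy 3-dimensional vectors, and the placeholder sense labels are all assumptions made for the example, and real embeddings would come from the fine-tuned encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_senses(context_vec, sense_vecs):
    """Return (label, score) pairs sorted by descending cosine similarity.

    context_vec: embedding of the sentence containing the target word.
    sense_vecs:  dict mapping a sense label to its embedding (hypothetical layout).
    """
    scored = [(label, cosine(context_vec, vec)) for label, vec in sense_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings, invented for illustration only.
context = [0.9, 0.1, 0.0]
senses = {
    "sense_a": [1.0, 0.0, 0.0],
    "sense_b": [0.0, 1.0, 0.0],
    "sense_c": [0.0, 0.0, 1.0],
}
ranking = rank_senses(context, senses)
best_label = ranking[0][0]
```

The first entry of `ranking` is the definition a popup would show first; Top-1 and Top-3 accuracy measure how often the correct sense appears within the first one or three entries of this list.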

Usage

The Wen Reader iOS app runs these models on-device after they are converted to CoreML.

To build the app from a fresh clone:

cd ml

# 1. Download model weights from HF Hub
uv run python scripts/download_models.py

# 2. Export to CoreML
uv run python scripts/export_coreml.py span --model-dir models/cws_span_scorer_electra_base/final
uv run python scripts/export_coreml.py wsd --model-dir models/wsd_biencoder_gte_base/final

# 3. Bundle into app resources (vocab, CoreML packages, CEDICT database)
./scripts/run_pipeline.sh bundle

Or run the stages through the pipeline script:

./scripts/run_pipeline.sh download
./scripts/run_pipeline.sh export
./scripts/run_pipeline.sh bundle

Training

CWS Span Scorer

Fine-tuned on a custom dataset built from:

  • ICWB2 segmentation corpus (MSR subset: sentences auto-matched where the greedy segmenter agrees with the gold segmentation, plus LLM-annotated disagreement cases)
  • Chinese Wikipedia and OpenSubtitles sentences, LLM-annotated (Claude Opus)

Sentences with ambiguous segmentation boundaries (where multiple valid CEDICT words overlap) are identified, then an LLM annotates the correct segmentation. The model scores all candidate word spans at each ambiguous position using a span scoring MLP head (taking span boundary token representations + learned width embeddings as input). A dynamic programming decoder finds the optimal segmentation at inference time. Training uses cross-entropy loss over candidate spans at each ambiguous position. Encoder and head use discriminative learning rates.
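
The decoding step can be sketched as a standard dynamic program over character offsets. This is a hedged illustration rather than the project's decoder: it assumes the objective is the sum of span scores, that spans are keyed by `(start, end)` character offsets with an exclusive end, and that any single character without a score falls back to 0.0 so a complete path always exists. The function name and the scores in the example are invented.

```python
def decode_segmentation(sentence, span_scores):
    """Return the word list whose spans maximise the total score.

    span_scores maps (start, end) offsets (end exclusive) to a float score.
    Unscored single characters get a fallback score of 0.0 (an assumption).
    """
    n = len(sentence)
    best = [float("-inf")] * (n + 1)  # best[i]: top score for sentence[:i]
    back = [0] * (n + 1)              # back[i]: start of the last word in that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            score = span_scores.get((start, end))
            if score is None:
                if end - start != 1:
                    continue  # multi-character span that is not a candidate
                score = 0.0   # single-character fallback
            total = best[start] + score
            if total > best[end]:
                best[end] = total
                back[end] = start
    # Walk the back-pointers from the end of the sentence to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(sentence[start:end])
        end = start
    return words[::-1]


# Classic overlap: 研究生 ("graduate student") competes with 研究 + 生命.
toy_scores = {(0, 2): 2.0, (0, 3): 1.5, (2, 4): 2.0, (4, 6): 2.0}
segmentation = decode_segmentation("研究生命起源", toy_scores)
```

With these toy scores the path 研究 / 生命 / 起源 totals 6.0, beating 研究生 / 命 / 起源 at 3.5, so the decoder returns the three two-character words.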

WSD Bi-encoder

Fine-tuned on CC-CEDICT sense clusters with data from:

  • MiCLS WSD corpus (mapped to CEDICT senses via BGE embedding similarity)
  • LLM-generated context examples for each sense cluster (Claude Opus and Sonnet)
  • LLM-annotated ebook sentences segmented by the CWS span scorer (Claude Sonnet)

Sense clusters are derived from CC-CEDICT entries with related senses merged using LLM-assisted clustering. Chinese sense labels are LLM-translated from the English CEDICT definitions.

Training uses grouped cross-entropy loss: contexts and sense labels are encoded by the same model, cosine similarity is computed between context embeddings and sense embeddings, and the loss is cross-entropy over the similarity scores (with temperature scaling at τ=0.1). Contexts are grouped by word in each batch so sense embeddings are encoded once and reused across all contexts for that word.
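
For one word's group, the loss described above can be written out as follows. This is a minimal sketch in plain Python over precomputed vectors, not the training code: the function name, argument layout, and toy 2-dimensional embeddings are assumptions, and a real implementation would operate on batched tensors with gradients.

```python
import math


def grouped_cross_entropy(context_vecs, gold_indices, sense_vecs, temperature=0.1):
    """Mean cross-entropy over temperature-scaled cosine similarities.

    context_vecs: one embedding per context sentence for a single word.
    gold_indices: index of the correct sense for each context.
    sense_vecs:   embeddings of that word's candidate senses.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    # Sense embeddings are normalised once and reused for every context.
    senses = [normalize(v) for v in sense_vecs]
    total = 0.0
    for vec, gold in zip(context_vecs, gold_indices):
        ctx = normalize(vec)
        # Cosine similarity divided by the temperature gives the logits.
        logits = [sum(a * b for a, b in zip(ctx, s)) / temperature for s in senses]
        # Numerically stable log-sum-exp for the softmax normaliser.
        max_logit = max(logits)
        log_z = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_z - logits[gold]
    return total / len(context_vecs)


# Toy group: two senses, two contexts that each align with one sense.
toy_senses = [[1.0, 0.0], [0.0, 1.0]]
toy_contexts = [[2.0, 0.0], [0.0, 3.0]]
loss_correct = grouped_cross_entropy(toy_contexts, [0, 1], toy_senses)
loss_swapped = grouped_cross_entropy(toy_contexts, [1, 0], toy_senses)
```

Dividing by τ=0.1 stretches the cosine range from [-1, 1] to [-10, 10], so a context that aligns with its gold sense yields a loss near zero while a mismatched one is penalised heavily.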
