	---
	license: mit
	language:
	- zh
	tags:
	- chinese
	- word-segmentation
	- word-sense-disambiguation
	- coreml
	- ios
	- epub-reader
	datasets:
	- wyy209/MiCLS
	base_model:
	- hfl/chinese-electra-180g-base-discriminator
	- hfl/chinese-electra-180g-small-discriminator
	- thenlper/gte-base-zh
	---

	# Wen Reader — On-Device Chinese NLP Models

	Model weights for [Wen Reader](https://github.com/oliverzh2000/wen-reader), an open-source Chinese EPUB reader for iOS with a popup dictionary.

	These models run on-device via CoreML to provide context-aware Chinese word segmentation and word sense disambiguation.

	## Models

	### cws-span-scorer-electra-base

	Chinese Word Segmentation model using span scoring.

	- Architecture: ELECTRA-base (12 layers, 768 hidden, 12 heads) + span scoring MLP head
	- Base model: [hfl/chinese-electra-180g-base-discriminator](https://huggingface.co/hfl/chinese-electra-180g-base-discriminator)
	- Parameters: ~102M (encoder) + span head
	- Task: Given a sentence, score candidate word spans to find the optimal segmentation
	- Eval: 90/97 perfect segmentations on a hand-curated test set (93/97 when acceptable over-splits are counted)
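To make the span head concrete, here is a minimal pure-Python sketch of a span scoring MLP that takes the two boundary token representations plus a learned width embedding. The class name `SpanScorerHead`, the layer sizes, the single hidden layer, and the random weights are all assumptions for illustration; the released head may be structured differently, and the real encoder width is 768, not 8.

```python
import random


def relu(values):
    return [max(0.0, v) for v in values]


def linear(weights, bias, values):
    # weights is a list of rows; each row has len(values) entries.
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, bias)]


class SpanScorerHead:
    """Hypothetical head: [h_start ; h_end ; width_emb] -> MLP -> scalar."""

    def __init__(self, hidden=8, width_dim=4, max_width=6, mlp_dim=16, seed=0):
        rng = random.Random(seed)

        def rand(n):
            return [rng.uniform(-0.1, 0.1) for _ in range(n)]

        in_dim = 2 * hidden + width_dim
        self.max_width = max_width
        # One embedding per span width; index 0 is unused.
        self.width_emb = [rand(width_dim) for _ in range(max_width + 1)]
        self.w1 = [rand(in_dim) for _ in range(mlp_dim)]
        self.b1 = [0.0] * mlp_dim
        self.w2 = [rand(mlp_dim)]
        self.b2 = [0.0]

    def score(self, token_states, start, end):
        """Score the span covering tokens start..end-1 (end is exclusive)."""
        width = min(end - start, self.max_width)
        features = (token_states[start]
                    + token_states[end - 1]
                    + self.width_emb[width])
        hidden = relu(linear(self.w1, self.b1, features))
        return linear(self.w2, self.b2, hidden)[0]


# Toy encoder output: 5 tokens with 8-dimensional states.
rng = random.Random(1)
states = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(5)]
head = SpanScorerHead()
print(head.score(states, 0, 2))  # score for the span over tokens 0 and 1
```

Using only the boundary tokens keeps the feature size fixed regardless of span length, which is why a separate width embedding is needed to tell spans of different lengths apart.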

	### cws-span-scorer-electra-small

	Smaller/faster variant of the above for lower-end devices.

	- Architecture: ELECTRA-small (12 layers, 256 hidden, 4 heads) + span scoring MLP head
	- Base model: [hfl/chinese-electra-180g-small-discriminator](https://huggingface.co/hfl/chinese-electra-180g-small-discriminator)
	- Parameters: ~12M (encoder) + span head
	- Eval: 85/97 perfect (91/97 including acceptable over-splits)

	### wsd-biencoder-gte-base

	Word Sense Disambiguation bi-encoder for selecting the correct dictionary definition in context.

	- Architecture: GTE-base-zh fine-tuned as a bi-encoder (SentenceTransformers)
	- Base model: [thenlper/gte-base-zh](https://huggingface.co/thenlper/gte-base-zh)
	- Parameters: ~102M
	- Task: Encode the context sentence and the candidate sense labels, then rank the senses by cosine similarity
	- Eval: Top-1 accuracy 88.7%, Top-3 99.6%, MRR 0.936 on 239 hand-curated disambiguation examples
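The ranking step and the reported metrics can be sketched in a few lines of Python. The vectors below are toy stand-ins for the embeddings the bi-encoder would produce, and the helper names `rank_senses` and `evaluate` are hypothetical, not part of the Wen Reader codebase.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_senses(context_vec, sense_vecs):
    """Return sense indices sorted from most to least similar to the context."""
    scores = [cosine(context_vec, s) for s in sense_vecs]
    return sorted(range(len(sense_vecs)), key=lambda i: scores[i], reverse=True)


def evaluate(rankings, gold):
    """rankings[i] is a ranked list of sense indices; gold[i] is the correct one."""
    ranks = [ranking.index(g) + 1 for ranking, g in zip(rankings, gold)]
    n = len(ranks)
    return {
        "top1": sum(r <= 1 for r in ranks) / n,
        "top3": sum(r <= 3 for r in ranks) / n,
        "mrr": sum(1.0 / r for r in ranks) / n,
    }


# Toy example: one context embedding and three candidate sense embeddings.
context = [1.0, 0.0]
senses = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
print(rank_senses(context, senses))

# Two examples: the gold sense is ranked 1st in one and 2nd in the other.
print(evaluate([[1, 2, 0], [0, 1, 2]], gold=[1, 1]))
```

MRR averages the reciprocal of the gold sense's rank, so it rewards near misses that top-1 accuracy counts as plain errors.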

	## Usage

	The Wen Reader iOS app uses these models after they are converted to CoreML.

	To build the app from a fresh clone:

	```bash
	cd ml

	# 1. Download model weights from HF Hub
	uv run python scripts/download_models.py

	# 2. Export to CoreML
	uv run python scripts/export_coreml.py span --model-dir models/cws_span_scorer_electra_base/final
	uv run python scripts/export_coreml.py wsd --model-dir models/wsd_biencoder_gte_base/final

	# 3. Bundle into app resources (vocab, CoreML packages, CEDICT database)
	./scripts/run_pipeline.sh bundle
	```

	Or use the pipeline script:

	```bash
	./scripts/run_pipeline.sh download
	./scripts/run_pipeline.sh export
	./scripts/run_pipeline.sh bundle
	```

	## Training

	### CWS Span Scorer

	Fine-tuned on a custom dataset built from:
	- [ICWB2](https://github.com/yuikns/icwb2-data) segmentation corpus (MSR): sentences where a greedy segmenter agrees with the gold segmentation are matched automatically, and the disagreement cases are LLM-annotated
	- Chinese [Wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia) and [OpenSubtitles](https://huggingface.co/datasets/FradSer/OpenSubtitles-en-zh-cn-20m) sentences, LLM-annotated (Claude Opus)
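The auto-matching idea for ICWB2 can be illustrated as follows. This sketch assumes the greedy segmenter is forward maximum matching against a lexicon, which the description above does not confirm; the function names, the `max_len` limit, and the toy lexicon are hypothetical.

```python
def greedy_segment(sentence, lexicon, max_len=4):
    """Forward maximum matching: take the longest lexicon word at each position."""
    words, i = [], 0
    while i < len(sentence):
        for width in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + width]
            # Single characters are always accepted as a fallback.
            if width == 1 or piece in lexicon:
                words.append(piece)
                i += width
                break
    return words


def split_by_agreement(examples, lexicon):
    """Separate sentences where greedy output equals gold from the rest."""
    agreed, disagreed = [], []
    for sentence, gold in examples:
        if greedy_segment(sentence, lexicon) == gold:
            agreed.append(sentence)
        else:
            disagreed.append(sentence)
    return agreed, disagreed


lexicon = {"研究", "研究生", "生命", "起源"}
examples = [
    ("生命起源", ["生命", "起源"]),
    ("研究生命起源", ["研究", "生命", "起源"]),  # greedy wrongly picks 研究生
]
agreed, disagreed = split_by_agreement(examples, lexicon)
print(agreed)     # sentences usable without further annotation
print(disagreed)  # sentences that would be sent for LLM annotation
```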

	Sentences with ambiguous segmentation boundaries (positions where multiple valid CEDICT words
	overlap) are identified first, and an LLM then annotates the correct segmentation. The model
	scores every candidate word span at each ambiguous position with a span scoring MLP head, which
	takes the span's boundary token representations and a learned width embedding as input. At
	inference time, a dynamic programming decoder finds the optimal segmentation. Training uses
	cross-entropy loss over the candidate spans at each ambiguous position, and the encoder and the
	head are trained with discriminative (separate) learning rates.
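A dynamic programming decoder of this kind can be sketched as below. It assumes the objective is the sum of span scores along the segmentation and that single characters always remain valid candidates; both are assumptions for illustration, and `decode`, `span_score`, and the toy score table are hypothetical names and values rather than the app's actual decoder.

```python
def decode(sentence, span_score, max_width=4):
    """Return the segmentation with the highest total span score.

    span_score(start, end) returns a float, or None if sentence[start:end]
    is not a candidate word.
    """
    n = len(sentence)
    neg_inf = float("-inf")
    best = [neg_inf] * (n + 1)  # best[i]: best total score for sentence[:i]
    back = [0] * (n + 1)        # back[i]: start index of the last word in that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_width), end):
            score = span_score(start, end)
            if score is None or best[start] == neg_inf:
                continue
            total = best[start] + score
            if total > best[end]:
                best[end] = total
                back[end] = start
    # Follow the back-pointers from the end of the sentence.
    words, end = [], n
    while end > 0:
        start = back[end]
        words.append(sentence[start:end])
        end = start
    return words[::-1]


sentence = "研究生命起源"
# Toy scores standing in for the span head's output.
scores = {(0, 2): 2.0, (0, 3): 1.5, (2, 4): 2.0, (4, 6): 2.0}


def span_score(start, end):
    # Single characters get a neutral score so a full path always exists.
    return scores.get((start, end), 0.0 if end - start == 1 else None)


print(decode(sentence, span_score))
```

Because each prefix's best score depends only on shorter prefixes, the search is linear in sentence length for a fixed `max_width`, instead of enumerating every possible segmentation.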

	### WSD Bi-encoder

	Fine-tuned on CC-CEDICT sense clusters with data from:
	- [MiCLS](https://huggingface.co/datasets/wyy209/MiCLS) WSD corpus (mapped to CEDICT senses via BGE embedding similarity)
	- LLM-generated context examples for each sense cluster (Claude Opus and Sonnet)
	- LLM-annotated ebook sentences segmented by the CWS span scorer (Claude Sonnet)

	Sense clusters are derived from CC-CEDICT entries with related senses merged using LLM-assisted
	clustering. Chinese sense labels are LLM-translated from the English CEDICT definitions.

	Training uses grouped cross-entropy loss: contexts and sense labels are encoded by the same model,
	cosine similarity is computed between context embeddings and sense embeddings, and the loss is
	cross-entropy over the similarity scores (with temperature scaling at τ=0.1). Contexts are grouped
	by word in each batch so sense embeddings are encoded once and reused across all contexts for that word.
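The loss for a single word group can be written out in plain Python as a rough sketch. The embeddings here are toy vectors rather than model outputs, and `grouped_cross_entropy` is a hypothetical name; the actual training code presumably operates on batched tensors.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def grouped_cross_entropy(context_vecs, gold_senses, sense_vecs, tau=0.1):
    """Mean cross-entropy over temperature-scaled cosine similarities.

    All contexts belong to the same word, so sense_vecs is computed once
    and shared by every context in the group.
    """
    losses = []
    for ctx, gold in zip(context_vecs, gold_senses):
        logits = [cosine(ctx, sense) / tau for sense in sense_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_z - logits[gold])
    return sum(losses) / len(losses)


sense_vecs = [[1.0, 0.0], [0.0, 1.0]]  # two senses of one word
contexts = [[1.0, 0.0], [0.2, 0.9]]    # two contexts for that word
print(grouped_cross_entropy(contexts, [0, 1], sense_vecs))
```

Cosine similarity is bounded in [-1, 1], so dividing by τ=0.1 stretches the logits to [-10, 10] and lets the softmax become sharp enough to produce a useful gradient.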