	# WtpsplitKit

	On-device text segmentation for iOS / macOS using Core ML SaT models
	(["Segment any Text"](https://github.com/segment-any-text/wtpsplit)). It splits
	text into TTS-friendly chunks that respect length bounds and break at natural
	pauses — a Swift port of `scripts/run_segmentation.py` from this repo.

	The whole pipeline runs locally: XLM-RoBERTa tokenization, Core ML inference, and
	the length-constrained Viterbi segmentation are all in pure Swift (plus Core ML).

	## What it does

	```
	tokenize (XLM-R Unigram)
	→ per-token boundary logits (windowed Core ML inference)
	→ scatter onto characters → sigmoid
	→ pause-aware mask (bias breaks to commas / clauses / connectors, never mid-word)
	→ length-constrained DP (Viterbi or greedy) with a length prior
	```

	The output chunks always rejoin to the exact input (`chunks.joined() == text`).
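
	The final DP step can be illustrated with a small, self-contained sketch. It is not the kit's implementation: `viterbiSplit` and its parameters are hypothetical names, the score is simplified to the summed log-probability of the chosen cuts plus an optional length prior, and the pause-aware mask is left out. It does show the two properties described above: every chunk stays within the length bounds (the last chunk may be shorter), and the chunks rejoin to the input.

	```swift
	import Foundation

	/// Illustrative length-constrained Viterbi split (not the kit's implementation).
	/// `breakProb[i]` is an assumed probability that a chunk ends after scalar `i`.
	func viterbiSplit(_ text: String, breakProb: [Double],
	                  minLength: Int, maxLength: Int,
	                  logPrior: (Int) -> Double = { _ in 0 }) -> [String] {
	    let scalars = Array(text.unicodeScalars)   // index by Unicode scalar, like Python str
	    let n = scalars.count
	    precondition(breakProb.count == n && minLength >= 1 && minLength <= maxLength)
	    guard n > 0 else { return [] }

	    // best[j]: best score for the first j scalars; back[j]: where the last chunk starts.
	    var best = [Double](repeating: -Double.infinity, count: n + 1)
	    var back = [Int](repeating: 0, count: n + 1)
	    best[0] = 0
	    for j in 1...n {
	        // The end of the text is a free boundary; other cuts pay log p(break).
	        let cutScore = j == n ? 0 : log(max(breakProb[j - 1], 1e-12))
	        for i in max(0, j - maxLength)..<j where best[i].isFinite {
	            let length = j - i
	            if length < minLength && j != n { continue }   // only the last chunk may be short
	            let score = best[i] + cutScore + logPrior(length)
	            if score > best[j] { best[j] = score; back[j] = i }
	        }
	    }

	    // Walk the back-pointers from the end, then slice the scalars in order.
	    var cuts: [Int] = []
	    var j = n
	    while j > 0 { cuts.append(j); j = back[j] }
	    var chunks: [String] = []
	    var start = 0
	    for end in cuts.reversed() {
	        var view = String.UnicodeScalarView()
	        view.append(contentsOf: scalars[start..<end])
	        chunks.append(String(view))
	        start = end
	    }
	    return chunks
	}

	let text = "One. Two. Three."
	var probs = [Double](repeating: 0.01, count: text.unicodeScalars.count)
	probs[4] = 0.9   // after "One. "
	probs[9] = 0.9   // after "Two. "
	let chunks = viterbiSplit(text, breakProb: probs, minLength: 3, maxLength: 8)
	print(chunks)                    // ["One. ", "Two. ", "Three."]
	print(chunks.joined() == text)   // true
	```

	Because the text is 16 scalars long and `maxLength` is 8, a single chunk is not allowed; the DP picks the two high-probability cuts rather than one low-probability cut in the middle.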

	## Fidelity

	This port is verified against the Python/Hugging Face references:

	- Tokenizer — ids and character offsets match Hugging Face `tokenizers`
	(xlm-roberta-base) exactly across Latin, CJK, full-width, emoji, Devanagari and
	mixed/whitespace edge cases.
	- Segmentation — mask + prior + DP produce byte-identical chunks to
	`wtpsplit/utils/{constraints,priors}.py` and `pause_aware_mask` when fed the
	same probabilities, across every prior / algorithm / overflow / allow-midword
	combination.

	The only intentional difference from the ONNX path is logit precision: the iOS
	models are quantized (fp16 / int8 / palettized), so boundary logits differ
	slightly from the fp32 ONNX models. Tokenization and segmentation logic are exact.

	## Installation

	Swift Package Manager:

	```swift
	.package(url: "https://github.com/krmanik/wtpsplit-kit", from: "0.1.0")
	```

	```swift
	.target(name: "App", dependencies: [.product(name: "WtpsplitKit", package: "wtpsplit-kit")])
	```

	Requires iOS 16+ / macOS 13+.
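
	For context, the two fragments above could sit in a manifest like the following. This is a minimal sketch: the package and target name `App` come from the snippet above, while the tools version (5.7, the first that offers `.v16` / `.v13`) is an assumption about your setup.

	```swift
	// swift-tools-version:5.7
	import PackageDescription

	let package = Package(
	    name: "App",
	    // Matches the stated minimum deployment targets.
	    platforms: [.iOS(.v16), .macOS(.v13)],
	    dependencies: [
	        .package(url: "https://github.com/krmanik/wtpsplit-kit", from: "0.1.0"),
	    ],
	    targets: [
	        .target(
	            name: "App",
	            dependencies: [.product(name: "WtpsplitKit", package: "wtpsplit-kit")]
	        ),
	    ]
	)
	```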

	## Models

	The Core ML models are not bundled (they are 40–430 MB). Build them with
	`scripts/build_ios_coreml.py` (produces `ios_models/<variant>/<variant>-<quant>.mlpackage`)
	and ship the one you want with your app, or download it on first launch.

	| Vocabulary | Variants | Notes |
	|------------|----------|-------|
	| `-full-` | full XLM-R (250 k) | every language SaT supports |
	| `-en_zh-` | pruned (≈101 k) | English + Chinese only, ~2× smaller |

	`en_zh` models need token-id remapping; the kit handles it automatically (the
	remap table is bundled). Pass `.auto` and it infers the vocabulary from the path
	(`en_zh` ⇒ pruned), or set it explicitly.
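
	The path-based inference can be pictured as a simple substring check. The sketch below is an assumption about the behaviour, not the kit's code: the `Vocabulary` enum, the `inferVocabulary(from:)` function and the `en_zh` file name are hypothetical and only illustrate the rule stated above.

	```swift
	import Foundation

	/// Hypothetical stand-in for the kit's vocabulary setting; the real type may differ.
	enum Vocabulary { case full, pruned }

	/// Guess the vocabulary from the model path: an `en_zh` marker means the pruned vocabulary.
	func inferVocabulary(from modelURL: URL) -> Vocabulary {
	    modelURL.path.contains("en_zh") ? .pruned : .full
	}

	let fullModel = URL(fileURLWithPath: "ios_models/sat-3l-sm-full-fp16.mlpackage")
	let prunedModel = URL(fileURLWithPath: "ios_models/sat-3l-sm-en_zh-fp16.mlpackage") // hypothetical name
	print(inferVocabulary(from: fullModel))    // full
	print(inferVocabulary(from: prunedModel))  // pruned
	```

	If a model file is renamed so that the marker disappears from its path, a check like this would fall back to the full vocabulary, which is a reason to set the vocabulary explicitly in that case.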

	The 4 MB tokenizer vocabulary and 1 MB remap table are bundled as package
	resources — no network or extra files needed for tokenization.

	## Usage

	```swift
	import WtpsplitKit

	let modelURL = Bundle.main.url(forResource: "sat-3l-sm-full-fp16",
	                               withExtension: "mlpackage")!
	let segmenter = try SaTSegmenter(modelURL: modelURL) // load/compile once, reuse

	var options = SegmentationOptions()
	options.maxLength = 80 // hard cap (unless overflow > 0)
	options.minLength = 40
	options.prior = .gaussian // .uniform | .gaussian | .clippedPolynomial
	options.targetLength = 70
	options.spread = 12

	let chunks = try segmenter.segment(
	    "Breaking News: Scientists announced a discovery. 这是一个测试。It works well!",
	    options: options)
	// → ["Breaking News: Scientists announced a discovery. 这是一个测试。",
	//    "It works well!"]
	```

	`SaTSegmenter` is reusable; create it once (loading/compiling the model is the
	expensive step) and call `segment` as needed. Run it off the main thread for long
	inputs.
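
	One way to keep long inputs off the main thread is a detached task. The helper below is a sketch with a hypothetical name (`segmentInBackground`); it takes the segmentation call as a closure so the example stays self-contained. In an app the closure would wrap `segmenter.segment(_:options:)`, which assumes the segmenter is safe to use from a background task.

	```swift
	import Foundation

	/// Runs a blocking segmentation call on a background task and returns its chunks.
	/// `segment` is a stand-in for a call such as `segmenter.segment(_:options:)`.
	func segmentInBackground(
	    _ text: String,
	    using segment: @escaping @Sendable (String) throws -> [String]
	) async throws -> [String] {
	    try await Task.detached(priority: .userInitiated) {
	        try segment(text)
	    }.value
	}

	// Demo with a trivial stand-in that splits on spaces instead of running a model.
	let demoChunks = try await segmentInBackground("a b c") { text in
	    text.split(separator: " ").map(String.init)
	}
	print(demoChunks)
	```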

	### Options (mirror `run_segmentation.py`)

	| Option | Default | Meaning |
	|--------|---------|---------|
	| `maxLength` | 80 | max characters per chunk (hard cap unless `overflow` > 0) |
	| `minLength` | 40 | min characters per chunk (best effort) |
	| `overflow` | 0 | chars a chunk may exceed `maxLength` to reach a pause (0 = hard cap) |
	| `prior` | `.gaussian` | length-prior shape |
	| `targetLength` | 70 | gaussian / polynomial target |
	| `spread` | 12 | gaussian / polynomial spread |
	| `algorithm` | `.viterbi` | `.viterbi` (optimal) or `.greedy` |
	| `allowMidword` | `false` | permit breaks inside words (skips the pause-aware mask) |
	| `windowOverlap` | 16 | token overlap between inference windows for long text |
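
	To give a feel for how `targetLength` and `spread` interact, here is a sketch of a Gaussian length prior in log space. The exact formula used by the kit and by `priors.py` is not reproduced here; the standard unnormalized Gaussian below, and the function name `gaussianLogPrior`, are assumptions for illustration.

	```swift
	import Foundation

	/// Assumed Gaussian length prior: log-weight of a chunk that is `length` characters long.
	/// The weight peaks at `targetLength` and falls off faster when `spread` is small.
	func gaussianLogPrior(length: Int, targetLength: Double = 70, spread: Double = 12) -> Double {
	    let z = (Double(length) - targetLength) / spread
	    return -0.5 * z * z
	}

	print(gaussianLogPrior(length: 70))   // 0 at the target length
	print(gaussianLogPrior(length: 82))   // -0.5, one spread above the target
	print(gaussianLogPrior(length: 46))   // -2.0, two spreads below the target
	```

	Under this form, a chunk one `spread` away from the target costs 0.5 in log space, so a candidate break needs a noticeably higher boundary probability to win against a break that lands near `targetLength`.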

	### Compute units

	```swift
	let segmenter = try SaTSegmenter(modelURL: modelURL, computeUnits: .cpuAndNeuralEngine)
	```

	### Tokenizer-only

	```swift
	let tok = try XLMRobertaTokenizer()
	let (ids, charEnds) = tok.encode("Hello world! 你好。")
	```
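
	The `charEnds` offsets are what allow per-token logits to be placed on characters. The sketch below shows one plausible version of the "scatter onto characters → sigmoid" step; it assumes each entry of `charEnds` is the exclusive end offset (in Unicode scalars) of its token, and the function name `scatterToCharacters` is hypothetical.

	```swift
	import Foundation

	func sigmoid(_ x: Double) -> Double { 1 / (1 + exp(-x)) }

	/// Hypothetical scatter step: each token's boundary logit lands on the last
	/// character the token covers; characters with no token end keep probability 0.
	func scatterToCharacters(logits: [Double], charEnds: [Int], characterCount: Int) -> [Double] {
	    precondition(logits.count == charEnds.count, "one logit per token")
	    var probs = [Double](repeating: 0, count: characterCount)
	    for (logit, end) in zip(logits, charEnds) where end >= 1 && end <= characterCount {
	        probs[end - 1] = sigmoid(logit)
	    }
	    return probs
	}

	// Two tokens covering characters 0..<3 and 3..<5 of a 5-character text.
	let charProbs = scatterToCharacters(logits: [0, 2], charEnds: [3, 5], characterCount: 5)
	print(charProbs)
	```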

	## CLI (for validation / experimentation)

	```bash
	swift run wtpseg --model path/to/sat-3l-sm-full-fp16.mlpackage \
	    --text "Your text. 你的文字。" --max-length 80 --min-length 40

	# flags: --max-length --min-length --overflow --prior {uniform,gaussian,clipped_polynomial}
	# --target --spread --algorithm {viterbi,greedy} --allow-midword
	# --vocab {auto,full,en_zh} --file <path>
	# --tokens / --dump-probs / --dump-chunks (debug dumps)
	```

	## Notes

	- Long inputs are processed in `seqLen`-sized windows (the models have a fixed
	sequence length, 256 by default) with overlapping logits averaged.
	- `.mlpackage` is compiled on first load. To skip recompilation, pass a
	precompiled `.mlmodelc` URL instead.
	- All text indexing is by Unicode scalar (code point) to match Python `str`
	semantics, so split positions and lengths line up with the reference.
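
	The windowing described in the first note can be sketched as follows. This is an illustration under assumptions, not the kit's code: the function names are hypothetical, windows advance by `seqLen - overlap` tokens, and the last window is placed flush with the end of the input. Logits from overlapping positions are averaged.

	```swift
	/// Start offsets of `seqLen`-sized windows that share `overlap` tokens with their neighbours.
	func windowStarts(tokenCount: Int, seqLen: Int = 256, overlap: Int = 16) -> [Int] {
	    precondition(overlap >= 0 && seqLen > overlap)
	    guard tokenCount > seqLen else { return [0] }
	    let step = seqLen - overlap
	    var starts = Array(stride(from: 0, to: tokenCount - seqLen, by: step))
	    starts.append(tokenCount - seqLen)   // final window ends exactly at the last token
	    return starts
	}

	/// Averages per-window logits back into one value per token.
	func averageWindows(_ windowLogits: [[Double]], starts: [Int], tokenCount: Int) -> [Double] {
	    var sum = [Double](repeating: 0, count: tokenCount)
	    var hits = [Double](repeating: 0, count: tokenCount)
	    for (logits, start) in zip(windowLogits, starts) {
	        for (k, value) in logits.enumerated() where start + k < tokenCount {
	            sum[start + k] += value
	            hits[start + k] += 1
	        }
	    }
	    return zip(sum, hits).map { $1 > 0 ? $0 / $1 : 0 }
	}

	print(windowStarts(tokenCount: 600))   // [0, 240, 344]

	// Tiny example: 6 tokens, windows of 4 with an overlap of 2.
	let starts = windowStarts(tokenCount: 6, seqLen: 4, overlap: 2)        // [0, 2]
	let merged = averageWindows([[1, 1, 1, 1], [3, 3, 3, 3]], starts: starts, tokenCount: 6)
	print(merged)                          // [1.0, 1.0, 2.0, 2.0, 3.0, 3.0]
	```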