# WtpsplitKit
On-device text segmentation for iOS / macOS using **Core ML** SaT models
(["Segment any Text"](https://github.com/segment-any-text/wtpsplit)). It splits
text into TTS-friendly chunks that respect length bounds and break at natural
pauses — a Swift port of `scripts/run_segmentation.py` from this repo.
The whole pipeline runs locally: XLM-RoBERTa tokenization, Core ML inference, and
the length-constrained Viterbi segmentation are all in pure Swift (plus Core ML).
## What it does
```
tokenize (XLM-R Unigram)
→ per-token boundary logits (windowed Core ML inference)
→ scatter onto characters → sigmoid
→ pause-aware mask (bias breaks to commas / clauses / connectors, never mid-word)
→ length-constrained DP (Viterbi or greedy) with a length prior
```
The output chunks always rejoin to the exact input (`chunks.joined() == text`).
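The rejoin guarantee can be illustrated with a minimal sketch that cuts a string at Unicode-scalar offsets and checks the invariant. The `split(_:atScalarOffsets:)` helper and the cut position are hypothetical, chosen for illustration; they are not part of the kit's API.

```swift
/// Cut `text` at the given Unicode-scalar offsets (hypothetical helper, not the kit's API).
/// `cuts` are exclusive end positions of every chunk except the last, in ascending order.
func split(_ text: String, atScalarOffsets cuts: [Int]) -> [String] {
    let scalars = Array(text.unicodeScalars)
    var chunks: [String] = []
    var start = 0
    // Appending scalars.count closes the final chunk; out-of-order cuts are skipped.
    for cut in cuts + [scalars.count] where cut > start && cut <= scalars.count {
        var view = String.UnicodeScalarView()
        view.append(contentsOf: scalars[start..<cut])
        chunks.append(String(view))
        start = cut
    }
    return chunks
}

let text = "Hi 👍🏽! 你好。"
let chunks = split(text, atScalarOffsets: [7])
// The emoji with its skin-tone modifier is one Character but two scalars,
// so scalar offsets and Character offsets differ for this string.
print(chunks, chunks.joined() == text)
```

Because every scalar lands in exactly one chunk, concatenating the chunks reproduces the input regardless of where the cuts fall.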
## Fidelity
This port is verified against the Python/Hugging Face references:
- **Tokenizer** — ids *and* character offsets match Hugging Face `tokenizers`
(xlm-roberta-base) exactly across Latin, CJK, full-width, emoji, Devanagari and
mixed/whitespace edge cases.
- **Segmentation** — mask + prior + DP produce byte-identical chunks to
`wtpsplit/utils/{constraints,priors}.py` and `pause_aware_mask` when fed the
same probabilities, across every prior / algorithm / overflow / allow-midword
combination.
The only intentional difference from the ONNX path is logit precision: the iOS
models are quantized (fp16 / int8 / palettized), so boundary logits differ
slightly from the fp32 ONNX models. Tokenization and segmentation logic are exact.
## Installation
Swift Package Manager:
```swift
.package(url: "https://github.com/krmanik/wtpsplit-kit", from: "0.1.0")
```
```swift
.target(name: "App", dependencies: [.product(name: "WtpsplitKit", package: "wtpsplit-kit")])
```
Requires iOS 16+ / macOS 13+.
## Models
The Core ML models are **not** bundled (they are 40–430 MB). Build them with
`scripts/build_ios_coreml.py` (produces `ios_models/<variant>/<variant>-<quant>.mlpackage`)
and ship the one you want with your app, or download it on first launch.
| Variant | Vocabulary | Notes |
|---------|------------|-------|
| `*-full-*` | full XLM-R (250 k) | every language SaT supports |
| `*-en_zh-*` | pruned (≈101 k) | English + Chinese only, ~2× smaller |
`en_zh` models need token-id remapping; the kit handles it automatically (the
remap table is bundled). Pass `.auto` and it infers the vocabulary from the path
(`en_zh` ⇒ pruned), or set it explicitly.
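As a rough illustration of what path-based inference and id remapping could look like, here is a minimal sketch. The `VocabularyKind` enum, both function names, the toy remap table, the fallback id, and the `en_zh` file name are all assumptions made for this example; the kit's actual types and table format may differ.

```swift
import Foundation

/// Hypothetical stand-in for the kit's vocabulary setting.
enum VocabularyKind { case full, enZh }

/// Treat any model path containing "en_zh" as the pruned vocabulary.
func inferVocabulary(from modelURL: URL) -> VocabularyKind {
    modelURL.path.contains("en_zh") ? .enZh : .full
}

/// Map full-vocabulary token ids to pruned ids.
/// Ids missing from the table fall back to `unknownID` (an assumption of this sketch).
func remap(_ ids: [Int], table: [Int: Int], unknownID: Int) -> [Int] {
    ids.map { table[$0] ?? unknownID }
}

// Illustrative file name and toy table, not real data.
let kind = inferVocabulary(from: URL(fileURLWithPath: "/models/sat-3l-sm-en_zh-int8.mlpackage"))
let remapped = remap([0, 35378, 999_999], table: [0: 0, 35378: 1204], unknownID: 3)
print(kind, remapped)
```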
The 4 MB tokenizer vocabulary and 1 MB remap table **are** bundled as package
resources — no network or extra files needed for tokenization.
## Usage
```swift
import WtpsplitKit

let modelURL = Bundle.main.url(forResource: "sat-3l-sm-full-fp16",
                               withExtension: "mlpackage")!
let segmenter = try SaTSegmenter(modelURL: modelURL)   // load/compile once, reuse

var options = SegmentationOptions()
options.maxLength = 80            // hard cap (unless overflow > 0)
options.minLength = 40
options.prior = .gaussian         // .uniform | .gaussian | .clippedPolynomial
options.targetLength = 70
options.spread = 12

let chunks = try segmenter.segment(
    "Breaking News: Scientists announced a discovery. 这是一个测试。It works well!",
    options: options)
// → ["Breaking News: Scientists announced a discovery. 这是一个测试。",
//    "It works well!"]
```
`SaTSegmenter` is reusable; create it once (loading/compiling the model is the
expensive step) and call `segment` as needed. Run it off the main thread for long
inputs.
### Options (mirror `run_segmentation.py`)
| Option | Default | Meaning |
|--------|---------|---------|
| `maxLength` | 80 | max characters per chunk (hard cap when `overflow` is 0) |
| `minLength` | 40 | min characters per chunk (best effort) |
| `overflow` | 0 | chars a chunk may exceed `maxLength` to reach a pause (0 = hard cap) |
| `prior` | `.gaussian` | length-prior shape |
| `targetLength` | 70 | gaussian / polynomial target |
| `spread` | 12 | gaussian / polynomial spread |
| `algorithm` | `.viterbi` | `.viterbi` (optimal) or `.greedy` |
| `allowMidword` | `false` | permit breaks inside words (skips the pause-aware mask) |
| `windowOverlap` | 16 | token overlap between inference windows for long text |
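To show how `maxLength`, `targetLength` and `spread` interact, here is a simplified sketch of a length-constrained Viterbi pass with a Gaussian length prior. It is not the kit's implementation: it ignores `minLength`, `overflow` and the pause-aware mask, and the function names and scoring details are assumptions made for illustration.

```swift
import Foundation

/// Log of an unnormalized Gaussian length prior centred on `target`.
func logGaussianPrior(length: Int, target: Double, spread: Double) -> Double {
    let z = (Double(length) - target) / spread
    return -0.5 * z * z
}

/// `probs[k]` is the probability that a chunk ends right after character k.
/// Returns the exclusive end offset of every chunk; the last one is `probs.count`.
func viterbiCuts(probs: [Double], maxLength: Int, target: Double, spread: Double) -> [Int] {
    let n = probs.count
    guard n > 0, maxLength > 0 else { return [] }
    // best[j]: best score of any segmentation of the first j characters
    // in which every chunk is at most maxLength long.
    var best = [Double](repeating: -Double.infinity, count: n + 1)
    var back = [Int](repeating: 0, count: n + 1)
    best[0] = 0
    for j in 1...n {
        // The final chunk must end at n, so that boundary is not scored.
        let boundary = j == n ? 0 : log(max(probs[j - 1], 1e-9))
        for i in max(0, j - maxLength)..<j where best[i] > -Double.infinity {
            let score = best[i] + boundary
                + logGaussianPrior(length: j - i, target: target, spread: spread)
            if score > best[j] { best[j] = score; back[j] = i }
        }
    }
    // Walk the back-pointers from the end to recover the cut positions.
    var cuts: [Int] = []
    var j = n
    while j > 0 { cuts.append(j); j = back[j] }
    return cuts.reversed()
}

// Toy input: ten characters with one strong boundary after the fifth.
var probs = [Double](repeating: 0.01, count: 10)
probs[4] = 0.9
let cuts = viterbiCuts(probs: probs, maxLength: 6, target: 5, spread: 2)
print(cuts)
```

A smaller `spread` pulls chunk lengths harder toward `targetLength`, while a larger one lets the boundary probabilities dominate.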
### Compute units
```swift
let segmenter = try SaTSegmenter(modelURL: modelURL, computeUnits: .cpuAndNeuralEngine)
```
### Tokenizer-only
```swift
let tok = try XLMRobertaTokenizer()
let (ids, charEnds) = tok.encode("Hello world! 你好。")
```
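The character offsets are what allow per-token logits to be scattered back onto characters. The sketch below shows one plausible way to do that, assuming `charEnds[i]` is the exclusive end offset of token `i` counted in Unicode scalars and that each probability is placed on the last character its token covers. The `scatter` function, the toy logits and the offsets are illustrative assumptions, not the kit's API or real model output.

```swift
import Foundation

/// Logistic sigmoid: maps a logit to a probability in (0, 1).
func sigmoid(_ x: Float) -> Float { 1 / (1 + exp(-x)) }

/// Put each token's boundary probability on the last character the token covers.
/// Characters that do not end a token keep probability 0 (an assumption of this sketch).
func scatter(logits: [Float], charEnds: [Int], charCount: Int) -> [Float] {
    var probs = [Float](repeating: 0, count: charCount)
    for (logit, end) in zip(logits, charEnds) where end > 0 && end <= charCount {
        probs[end - 1] = sigmoid(logit)
    }
    return probs
}

// Toy values for "Hello world!" (12 characters) split into three tokens.
let probs = scatter(logits: [-4, 0, 3], charEnds: [5, 11, 12], charCount: 12)
print(probs)
```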
## CLI (for validation / experimentation)
```bash
swift run wtpseg --model path/to/sat-3l-sm-full-fp16.mlpackage \
  --text "Your text. 你的文字。" --max-length 80 --min-length 40
# flags: --max-length --min-length --overflow --prior {uniform,gaussian,clipped_polynomial}
# --target --spread --algorithm {viterbi,greedy} --allow-midword
# --vocab {auto,full,en_zh} --file <path>
# --tokens / --dump-probs / --dump-chunks (debug dumps)
```
## Notes
- Long inputs are processed in `seqLen`-sized windows (the models have a fixed
sequence length, 256 by default) with overlapping logits averaged.
- `.mlpackage` is compiled on first load. To skip recompilation, pass a
precompiled `.mlmodelc` URL instead.
- All text indexing is by Unicode scalar (code point) to match Python `str`
semantics, so split positions and lengths line up with the reference.
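
The windowing described in the first note can be sketched as follows. This is a simplified illustration under stated assumptions: it ignores special tokens and padding, `infer` is a stand-in closure for the Core ML call that must return one logit per token in the window, and both function names are hypothetical.

```swift
/// Token ranges for fixed-size windows; consecutive windows share `overlap` tokens.
func windows(tokenCount: Int, seqLen: Int, overlap: Int) -> [Range<Int>] {
    precondition(overlap >= 0 && seqLen > overlap, "each window must advance")
    guard tokenCount > 0 else { return [] }
    var ranges: [Range<Int>] = []
    var start = 0
    while true {
        let end = min(start + seqLen, tokenCount)
        ranges.append(start..<end)
        if end == tokenCount { break }
        start += seqLen - overlap
    }
    return ranges
}

/// Run `infer` on every window and average the logits where windows overlap.
func windowedLogits(tokenCount: Int, seqLen: Int, overlap: Int,
                    infer: (Range<Int>) -> [Float]) -> [Float] {
    var sum = [Float](repeating: 0, count: tokenCount)
    var hits = [Float](repeating: 0, count: tokenCount)
    for range in windows(tokenCount: tokenCount, seqLen: seqLen, overlap: overlap) {
        for (index, logit) in zip(range, infer(range)) {
            sum[index] += logit
            hits[index] += 1
        }
    }
    return zip(sum, hits).map { $0 / $1 }
}

// With the defaults named above (256-token windows, 16-token overlap):
let ranges = windows(tokenCount: 600, seqLen: 256, overlap: 16)
// Fake model: every logit in a window equals the window's start index.
let averaged = windowedLogits(tokenCount: 10, seqLen: 6, overlap: 2) { range in
    [Float](repeating: Float(range.lowerBound), count: range.count)
}
print(ranges, averaged)
```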