| # WtpsplitKit |
|
|
| On-device text segmentation for iOS / macOS using **Core ML** SaT models |
| (["Segment any Text"](https://github.com/segment-any-text/wtpsplit)). It splits |
| text into TTS-friendly chunks that respect length bounds and break at natural |
| pauses — a Swift port of `scripts/run_segmentation.py` from this repo. |
|
|
| The whole pipeline runs locally: XLM-RoBERTa tokenization, Core ML inference, and |
| the length-constrained Viterbi segmentation are all in pure Swift (plus Core ML). |
|
|
| ## What it does |
|
|
| ``` |
| tokenize (XLM-R Unigram) |
| → per-token boundary logits (windowed Core ML inference) |
| → scatter onto characters → sigmoid |
| → pause-aware mask (bias breaks to commas / clauses / connectors, never mid-word) |
| → length-constrained DP (Viterbi or greedy) with a length prior |
| ``` |
|
|
| The output chunks always rejoin to the exact input (`chunks.joined() == text`). |
|
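To make the last pipeline stages concrete, here is a minimal sketch (not the kit's implementation). It applies a sigmoid to per-character logits and cuts greedily at the most probable boundary inside the length bounds. `greedySplit` and its parameters are hypothetical names, the sketch assumes one logit per Unicode scalar, and it leaves out the pause-aware mask and the length prior.

```swift
import Foundation

/// Logistic sigmoid: maps a boundary logit to a probability in (0, 1).
func sigmoid(_ x: Double) -> Double { 1.0 / (1.0 + exp(-x)) }

/// Hypothetical greedy splitter: each chunk ends after the scalar with the
/// highest boundary probability among chunk lengths minLength...maxLength.
/// The final chunk takes whatever is left, so it may be shorter than minLength.
func greedySplit(_ text: String, logits: [Double],
                 minLength: Int, maxLength: Int) -> [String] {
    let scalars = Array(text.unicodeScalars)
    precondition(logits.count == scalars.count, "one logit per Unicode scalar")
    precondition(minLength >= 1 && minLength <= maxLength)
    let probs = logits.map(sigmoid)

    var chunks: [String] = []
    var start = 0
    while start < scalars.count {
        var end = scalars.count
        if scalars.count - start > maxLength {
            let lo = start + minLength - 1   // last scalar of the shortest allowed chunk
            let hi = start + maxLength - 1   // last scalar of the longest allowed chunk
            var best = lo
            for i in lo...hi where probs[i] > probs[best] { best = i }
            end = best + 1
        }
        var view = String.UnicodeScalarView()
        view.append(contentsOf: scalars[start..<end])
        chunks.append(String(view))
        start = end
    }
    return chunks
}

let text = "One, two. Three!"
var logits = [Double](repeating: -4, count: text.unicodeScalars.count)
logits[9] = 4  // strong boundary at the space that follows "two."
let chunks = greedySplit(text, logits: logits, minLength: 4, maxLength: 12)
// chunks == ["One, two. ", "Three!"], and chunks.joined() == text
```

Because every scalar lands in exactly one chunk, the rejoin property holds by construction, whatever the logits are.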
|
| ## Fidelity |
|
|
| This port is verified against the Python/Hugging Face references: |
|
|
| - **Tokenizer** — ids *and* character offsets match Hugging Face `tokenizers` |
| (xlm-roberta-base) exactly across Latin, CJK, full-width, emoji, Devanagari and |
| mixed/whitespace edge cases. |
| - **Segmentation** — mask + prior + DP produce byte-identical chunks to |
| `wtpsplit/utils/{constraints,priors}.py` and `pause_aware_mask` when fed the |
| same probabilities, across every prior / algorithm / overflow / allow-midword |
| combination. |
|
|
| The only intentional difference from the ONNX path is logit precision: the iOS |
| models are quantized (fp16 / int8 / palettized), so boundary logits differ |
| slightly from the fp32 ONNX models. Tokenization and segmentation logic are exact. |
|
|
| ## Installation |
|
|
| Swift Package Manager: |
|
|
| ```swift |
| .package(url: "https://github.com/krmanik/wtpsplit-kit", from: "0.1.0") |
| ``` |
|
|
| ```swift |
| .target(name: "App", dependencies: [.product(name: "WtpsplitKit", package: "wtpsplit-kit")]) |
| ``` |
|
|
| Requires iOS 16+ / macOS 13+. |
|
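Put together, the two snippets above might sit in a manifest like the following. This is a sketch: the package and target name `App` and the tools version are assumptions, chosen so that the `.v16` / `.v13` platform constants are available.

```swift
// swift-tools-version:5.7
import PackageDescription

let package = Package(
    name: "App",  // assumed name, matching the target shown above
    platforms: [.iOS(.v16), .macOS(.v13)],
    dependencies: [
        .package(url: "https://github.com/krmanik/wtpsplit-kit", from: "0.1.0")
    ],
    targets: [
        .target(
            name: "App",
            dependencies: [.product(name: "WtpsplitKit", package: "wtpsplit-kit")]
        )
    ]
)
```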
|
| ## Models |
|
|
| The Core ML models are **not** bundled (they are 40–430 MB). Build them with |
| `scripts/build_ios_coreml.py` (produces `ios_models/<variant>/<variant>-<quant>.mlpackage`) |
| and ship the one you want with your app, or download it on first launch. |
|
|
| Model name | Vocabulary | Notes |
|------------|------------|-------|
| | `*-full-*` | full XLM-R (250 k) | every language SaT supports | |
| | `*-en_zh-*` | pruned (≈101 k) | English + Chinese only, ~2× smaller | |
|
|
| `en_zh` models need token-id remapping; the kit handles it automatically (the |
| remap table is bundled). Pass `.auto` and it infers the vocabulary from the path |
| (`en_zh` ⇒ pruned), or set it explicitly. |
|
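The path-based inference described above can be pictured roughly as follows. This is an illustrative sketch only: `VocabularyKind` and `inferVocabulary(from:)` are hypothetical names rather than the kit's API, and the file names in the usage lines simply follow the `*-full-*` / `*-en_zh-*` patterns from the table.

```swift
import Foundation

/// Hypothetical stand-in for the kit's vocabulary setting.
enum VocabularyKind { case full, pruned }

/// Guess the vocabulary from the model path: "en_zh" anywhere in it means pruned.
func inferVocabulary(from modelURL: URL) -> VocabularyKind {
    modelURL.path.contains("en_zh") ? .pruned : .full
}

let prunedKind = inferVocabulary(
    from: URL(fileURLWithPath: "/models/sat-3l-sm-en_zh-fp16.mlpackage"))  // .pruned
let fullKind = inferVocabulary(
    from: URL(fileURLWithPath: "/models/sat-3l-sm-full-fp16.mlpackage"))   // .full
```

If you rename model files or store them under unrelated directory names, setting the vocabulary explicitly avoids relying on this kind of guess.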
|
| The 4 MB tokenizer vocabulary and 1 MB remap table **are** bundled as package |
| resources — no network or extra files needed for tokenization. |
|
|
| ## Usage |
|
|
| ```swift |
| import WtpsplitKit |
| |
| let modelURL = Bundle.main.url(forResource: "sat-3l-sm-full-fp16", |
| withExtension: "mlpackage")! |
| let segmenter = try SaTSegmenter(modelURL: modelURL) // load/compile once, reuse |
| |
| var options = SegmentationOptions() |
| options.maxLength = 80 // hard cap (unless overflow > 0) |
| options.minLength = 40 |
| options.prior = .gaussian // .uniform | .gaussian | .clippedPolynomial |
| options.targetLength = 70 |
| options.spread = 12 |
| |
| let chunks = try segmenter.segment( |
| "Breaking News: Scientists announced a discovery. 这是一个测试。It works well!", |
| options: options) |
| // → ["Breaking News: Scientists announced a discovery. 这是一个测试。", |
| // "It works well!"] |
| ``` |
|
|
| `SaTSegmenter` is reusable; create it once (loading/compiling the model is the |
| expensive step) and call `segment` as needed. Run it off the main thread for long |
| inputs. |
|
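One way to keep long inputs off the main thread is a small async wrapper around the blocking call. The helper below is a generic sketch; `runInBackground` is a hypothetical name, and whether a `SaTSegmenter` instance may be shared across tasks is an assumption this README does not confirm.

```swift
import Foundation

/// Runs blocking work in a detached task so the caller's actor
/// (for example the main actor) stays responsive while it executes.
func runInBackground<T: Sendable>(
    _ work: @escaping @Sendable () throws -> T
) async throws -> T {
    try await Task.detached(priority: .userInitiated) {
        try work()
    }.value
}

// Usage sketch, assuming `segmenter`, `longText` and `options` already exist
// and that the segmenter is safe to use from a background task:
//
//     let chunks = try await runInBackground {
//         try segmenter.segment(longText, options: options)
//     }
```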
|
| ### Options (mirror `run_segmentation.py`) |
| |
| | Option | Default | Meaning | |
| |--------|---------|---------| |
| `maxLength` | 80 | max characters per chunk (hard cap unless `overflow` > 0) |
| | `minLength` | 40 | min characters per chunk (best effort) | |
| | `overflow` | 0 | chars a chunk may exceed `maxLength` to reach a pause (0 = hard cap) | |
| | `prior` | `.gaussian` | length-prior shape | |
| | `targetLength` | 70 | gaussian / polynomial target | |
| | `spread` | 12 | gaussian / polynomial spread | |
| | `algorithm` | `.viterbi` | `.viterbi` (optimal) or `.greedy` | |
| | `allowMidword` | `false` | permit breaks inside words (skips the pause-aware mask) | |
| | `windowOverlap` | 16 | token overlap between inference windows for long text | |
| |
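To give a feel for how `targetLength` and `spread` interact, here is a sketch of a Gaussian-shaped length prior in log space. It assumes the textbook form; the exact scaling used by the reference `priors.py` may differ, and `gaussianLogPrior` is a hypothetical name.

```swift
import Foundation

/// Log-score of a chunk of `length` characters under a Gaussian-shaped prior
/// centred on `targetLength`. Assumed form: -0.5 * ((length - target) / spread)^2.
func gaussianLogPrior(length: Int,
                      targetLength: Double = 70,
                      spread: Double = 12) -> Double {
    let z = (Double(length) - targetLength) / spread
    return -0.5 * z * z
}

let atTarget = gaussianLogPrior(length: 70)     // 0.0, the maximum
let oneSpreadOff = gaussianLogPrior(length: 82) // -0.5, one spread above target
```

Under this form, a smaller `spread` penalises deviations from the target more sharply, which pulls chunk lengths closer to `targetLength`.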
| ### Compute units |
| |
| ```swift |
| let segmenter = try SaTSegmenter(modelURL: modelURL, computeUnits: .cpuAndNeuralEngine) |
| ``` |
| |
| ### Tokenizer-only |
| |
| ```swift |
| let tok = try XLMRobertaTokenizer() |
| let (ids, charEnds) = tok.encode("Hello world! 你好。") |
| ``` |
| |
| ## CLI (for validation / experimentation) |
| |
| ```bash |
| swift run wtpseg --model path/to/sat-3l-sm-full-fp16.mlpackage \ |
| --text "Your text. 你的文字。" --max-length 80 --min-length 40 |
| |
| # flags: --max-length --min-length --overflow --prior {uniform,gaussian,clipped_polynomial} |
| # --target --spread --algorithm {viterbi,greedy} --allow-midword |
| # --vocab {auto,full,en_zh} --file <path> |
| # --tokens / --dump-probs / --dump-chunks (debug dumps) |
| ``` |
| |
| ## Notes |
| |
| - Long inputs are processed in `seqLen`-sized windows (the models have a fixed |
| sequence length, 256 by default) with overlapping logits averaged. |
| - `.mlpackage` is compiled on first load. To skip recompilation, pass a |
| precompiled `.mlmodelc` URL instead. |
| - All text indexing is by Unicode scalar (code point) to match Python `str` |
| semantics, so split positions and lengths line up with the reference. |
| |
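The windowing note above can be sketched as follows. This is an illustration under assumptions, not the kit's code: `windowRanges` and `averagedLogits` are hypothetical names, `infer` stands in for one fixed-length model call, and padding of the last window is left out.

```swift
/// Splits `count` tokens into windows of at most `seqLen` tokens;
/// consecutive windows share `overlap` tokens.
func windowRanges(count: Int, seqLen: Int, overlap: Int) -> [Range<Int>] {
    precondition(overlap >= 0 && seqLen > overlap, "stride must be positive")
    guard count > 0 else { return [] }
    let stride = seqLen - overlap
    var ranges: [Range<Int>] = []
    var start = 0
    while true {
        let end = min(start + seqLen, count)
        ranges.append(start..<end)
        if end == count { break }
        start += stride
    }
    return ranges
}

/// Averages per-token logits wherever windows overlap.
func averagedLogits(tokenCount: Int, seqLen: Int, overlap: Int,
                    infer: (Range<Int>) -> [Float]) -> [Float] {
    var sum = [Float](repeating: 0, count: tokenCount)
    var hits = [Float](repeating: 0, count: tokenCount)
    for range in windowRanges(count: tokenCount, seqLen: seqLen, overlap: overlap) {
        let logits = infer(range)
        precondition(logits.count == range.count, "one logit per token in the window")
        for (offset, index) in range.enumerated() {
            sum[index] += logits[offset]
            hits[index] += 1
        }
    }
    // Every token is covered by at least one window, so hits is never zero.
    return zip(sum, hits).map { $0 / $1 }
}

// Toy "model": each window reports its own start index for every token.
let averaged = averagedLogits(tokenCount: 6, seqLen: 4, overlap: 2) { range in
    [Float](repeating: Float(range.lowerBound), count: range.count)
}
// averaged == [0, 0, 1, 1, 2, 2]
```

With the defaults mentioned in this README (`seqLen` 256, `windowOverlap` 16), each window would advance by 240 tokens.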