WtpsplitKit

On-device text segmentation for iOS / macOS using Core ML SaT models ("Segment any Text"). It splits text into TTS-friendly chunks that respect length bounds and break at natural pauses — a Swift port of scripts/run_segmentation.py from this repo.

The whole pipeline runs locally: XLM-RoBERTa tokenization and the length-constrained Viterbi segmentation are pure Swift, and inference runs through Core ML.

What it does

tokenize (XLM-R Unigram)
  → per-token boundary logits (windowed Core ML inference)
  → scatter onto characters → sigmoid
  → pause-aware mask (bias breaks to commas / clauses / connectors, never mid-word)
  → length-constrained DP (Viterbi or greedy) with a length prior
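
The "scatter onto characters → sigmoid" step can be illustrated with a minimal sketch. It assumes that each token's logit lands on the last character the token covers, that `charEnds` holds exclusive end offsets (as the tokenizer's `encode` appears to return), and that characters not ending a token get probability 0. The function names are hypothetical and are not the kit's API.

```swift
import Foundation

/// Logistic function mapping a logit to a probability in (0, 1).
func sigmoid(_ x: Float) -> Float {
    1 / (1 + exp(-x))
}

/// Scatter per-token boundary logits onto characters.
/// Assumption: `charEnds[i]` is the exclusive end offset of token i, so the
/// token's boundary probability is written to character `charEnds[i] - 1`.
/// Characters that do not end a token keep probability 0.
func characterProbabilities(tokenLogits: [Float],
                            charEnds: [Int],
                            charCount: Int) -> [Float] {
    precondition(tokenLogits.count == charEnds.count)
    var probs = [Float](repeating: 0, count: charCount)
    for (logit, end) in zip(tokenLogits, charEnds) where end > 0 && end <= charCount {
        probs[end - 1] = sigmoid(logit)
    }
    return probs
}
```

With two tokens ending at offsets 2 and 5 of a five-character text, only characters 1 and 4 receive a non-zero probability; every other position cannot be chosen as a boundary by the later steps unless a mask or prior changes that.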

The output chunks always rejoin to the exact input (chunks.joined() == text).
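
The length-constrained DP can be sketched as follows. This is a simplified illustration under stated assumptions, not the kit's implementation: `viterbiChunks` is a hypothetical name, `probs[i]` is taken to be the probability that a chunk ends after scalar `i`, the prior is a plain Gaussian in log space, and `minLength`, `overflow` and the pause-aware mask are left out.

```swift
import Foundation

/// Pick chunk boundaries that maximise the sum of log boundary probabilities
/// and log length priors, with every chunk at most `maxLength` scalars long.
func viterbiChunks(_ text: String, probs: [Double], maxLength: Int,
                   targetLength: Double, spread: Double) -> [String] {
    let scalars = Array(text.unicodeScalars)
    let n = scalars.count
    precondition(probs.count == n && maxLength > 0 && spread > 0)
    guard n > 0 else { return [] }

    // Assumed Gaussian length prior, in log space.
    func logPrior(_ length: Int) -> Double {
        let z = (Double(length) - targetLength) / spread
        return -0.5 * z * z
    }

    // best[end] = best score of any segmentation of scalars[0..<end].
    var best = [Double](repeating: -Double.infinity, count: n + 1)
    var back = [Int](repeating: 0, count: n + 1)
    best[0] = 0
    for end in 1...n {
        // The end of the text is always a boundary, so it costs nothing.
        let logBoundary = end == n ? 0.0 : log(max(probs[end - 1], 1e-9))
        for start in max(0, end - maxLength)..<end {
            let score = best[start] + logBoundary + logPrior(end - start)
            if score > best[end] {
                best[end] = score
                back[end] = start
            }
        }
    }

    // Walk the back-pointers from the end of the text to recover the cuts.
    var cuts: [Int] = []
    var position = n
    while position > 0 {
        cuts.append(position)
        position = back[position]
    }

    // Slice contiguously, so the chunks rejoin to the exact input.
    var chunks: [String] = []
    var start = 0
    for cut in cuts.reversed() {
        chunks.append(String(String.UnicodeScalarView(scalars[start..<cut])))
        start = cut
    }
    return chunks
}
```

Because the chunks are contiguous slices that cover the whole input, `chunks.joined() == text` holds by construction, whatever the probabilities are.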

Fidelity

This port is verified against the Python/Hugging Face references:

Tokenizer — ids and character offsets match Hugging Face tokenizers (xlm-roberta-base) exactly across Latin, CJK, full-width, emoji, Devanagari and mixed/whitespace edge cases.
Segmentation — mask + prior + DP produce byte-identical chunks to wtpsplit/utils/{constraints,priors}.py and pause_aware_mask when fed the same probabilities, across every prior / algorithm / overflow / allow-midword combination.

The only intentional difference from the ONNX path is logit precision: the iOS models are quantized (fp16 / int8 / palettized), so boundary logits differ slightly from the fp32 ONNX models. Tokenization and segmentation logic are exact.

Installation

Swift Package Manager:

.package(url: "https://github.com/krmanik/wtpsplit-kit", from: "0.1.0")

.target(name: "App", dependencies: [.product(name: "WtpsplitKit", package: "wtpsplit-kit")])

Requires iOS 16+ / macOS 13+.

Models

The Core ML models are not bundled (they are 40–430 MB). Build them with scripts/build_ios_coreml.py (produces ios_models/<variant>/<variant>-<quant>.mlpackage) and ship the one you want with your app, or download it on first launch.

Variant	Vocabulary	Notes
`-full-`	full XLM-R (250 k)	every language SaT supports
`-en_zh-`	pruned (≈101 k)	English + Chinese only, ~2× smaller

en_zh models need token-id remapping; the kit handles it automatically (the remap table is bundled). Pass .auto and it infers the vocabulary from the path (en_zh ⇒ pruned), or set it explicitly.
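
To make the `.auto` behaviour concrete, here is a rough sketch of path-based inference and id remapping. Everything in it is an assumption for illustration: the `Vocabulary` enum, the function names, the dictionary-shaped remap table, the unknown-id fallback, and the `en_zh` file name used in the usage note. The kit's real types may differ.

```swift
import Foundation

/// Hypothetical stand-in for the kit's vocabulary setting.
enum Vocabulary {
    case full
    case pruned
}

/// Infer the vocabulary from the model path: "en_zh" anywhere in the path
/// means the pruned vocabulary, anything else means the full one.
func inferVocabulary(fromModelAt url: URL) -> Vocabulary {
    url.path.contains("en_zh") ? .pruned : .full
}

/// Map full-vocabulary token ids to pruned-vocabulary ids.
/// Assumption: ids missing from the table fall back to `unknownID`.
func remapTokenIDs(_ ids: [Int], using table: [Int: Int], unknownID: Int) -> [Int] {
    ids.map { table[$0] ?? unknownID }
}
```

Under these assumptions a path such as `ios_models/sat-3l-sm-en_zh/sat-3l-sm-en_zh-fp16.mlpackage` resolves to the pruned vocabulary, and the ids produced by the full XLM-R tokenizer are remapped before inference.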

The 4 MB tokenizer vocabulary and 1 MB remap table are bundled as package resources — no network or extra files needed for tokenization.

Usage

import WtpsplitKit

let modelURL = Bundle.main.url(forResource: "sat-3l-sm-full-fp16",
                               withExtension: "mlpackage")!
let segmenter = try SaTSegmenter(modelURL: modelURL)   // load/compile once, reuse

var options = SegmentationOptions()
options.maxLength = 80      // hard cap (unless overflow > 0)
options.minLength = 40
options.prior = .gaussian   // .uniform | .gaussian | .clippedPolynomial
options.targetLength = 70
options.spread = 12

let chunks = try segmenter.segment(
    "Breaking News: Scientists announced a discovery. 这是一个测试。It works well!",
    options: options)
// → ["Breaking News: Scientists announced a discovery. 这是一个测试。",
//    "It works well!"]

SaTSegmenter is reusable; create it once (loading/compiling the model is the expensive step) and call segment as needed. Run it off the main thread for long inputs.

Options (mirror `run_segmentation.py`)

Option	Default	Meaning
`maxLength`	80	max characters per chunk (hard cap unless `overflow` > 0)
`minLength`	40	min characters per chunk (best effort)
`overflow`	0	chars a chunk may exceed `maxLength` to reach a pause (0 = hard cap)
`prior`	`.gaussian`	length-prior shape
`targetLength`	70	target chunk length for the `.gaussian` / `.clippedPolynomial` priors
`spread`	12	spread around the target for the `.gaussian` / `.clippedPolynomial` priors
`algorithm`	`.viterbi`	`.viterbi` (optimal) or `.greedy`
`allowMidword`	`false`	permit breaks inside words (skips the pause-aware mask)
`windowOverlap`	16	token overlap between inference windows for long text
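
To show what `windowOverlap` controls, the sketch below runs inference over overlapping fixed-size windows and averages the logits wherever windows overlap. It is a hedged illustration, not the kit's code: the function names are hypothetical, the `infer` closure stands in for the Core ML call, and placing the last window flush with the end of the input is an assumption.

```swift
/// Start offsets of `seqLen`-sized windows covering `count` tokens, with
/// consecutive windows sharing `overlap` tokens.
/// Assumption: the final window is placed flush with the end of the input.
func windowStarts(count: Int, seqLen: Int, overlap: Int) -> [Int] {
    precondition(overlap >= 0 && seqLen > overlap)
    guard count > seqLen else { return [0] }
    let stride = seqLen - overlap
    var starts: [Int] = []
    var start = 0
    while start + seqLen < count {
        starts.append(start)
        start += stride
    }
    starts.append(count - seqLen)
    return starts
}

/// Run `infer` on each window and average the logits of tokens that are
/// covered by more than one window.
func averagedLogits(count: Int, seqLen: Int, overlap: Int,
                    infer: (Range<Int>) -> [Float]) -> [Float] {
    var sum = [Float](repeating: 0, count: count)
    var hits = [Float](repeating: 0, count: count)
    for start in windowStarts(count: count, seqLen: seqLen, overlap: overlap) {
        let range = start..<min(start + seqLen, count)
        let logits = infer(range)
        precondition(logits.count == range.count)
        for (offset, value) in logits.enumerated() {
            sum[start + offset] += value
            hits[start + offset] += 1
        }
    }
    // Every token is covered by at least one window, so hits is never 0.
    return zip(sum, hits).map { $0 / $1 }
}
```

A larger overlap gives tokens near a window edge more context at the cost of more inference calls; with `overlap` set to 0 the windows simply tile the input.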

Compute units

let segmenter = try SaTSegmenter(modelURL: modelURL, computeUnits: .cpuAndNeuralEngine)

Tokenizer-only

let tok = try XLMRobertaTokenizer()
let (ids, charEnds) = tok.encode("Hello world! 你好。")

CLI (for validation / experimentation)

swift run wtpseg --model path/to/sat-3l-sm-full-fp16.mlpackage \
    --text "Your text. 你的文字。" --max-length 80 --min-length 40

# flags: --max-length --min-length --overflow --prior {uniform,gaussian,clipped_polynomial}
#        --target --spread --algorithm {viterbi,greedy} --allow-midword
#        --vocab {auto,full,en_zh} --file <path>
#        --tokens / --dump-probs / --dump-chunks  (debug dumps)

Notes

Long inputs are processed in seqLen-sized windows (the models have a fixed sequence length, 256 by default) with overlapping logits averaged.
.mlpackage is compiled on first load. To skip recompilation, pass a precompiled .mlmodelc URL instead.
All text indexing is by Unicode scalar (code point) to match Python str semantics, so split positions and lengths line up with the reference.
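
The difference between Swift's default `Character` indexing and Unicode-scalar indexing is easy to see in a small sketch. `scalarSlice` is a hypothetical helper, not part of the kit; it only shows how offsets computed per scalar can be turned back into a `String`.

```swift
/// Slice `text` by Unicode-scalar offsets, the same unit Python uses when
/// indexing a `str`.
func scalarSlice(_ text: String, _ range: Range<Int>) -> String {
    let scalars = Array(text.unicodeScalars)
    return String(String.UnicodeScalarView(scalars[range]))
}
```

For example, `"e\u{301}a"` (an "e" followed by a combining acute accent, then "a") has a `count` of 2 in Swift but 3 Unicode scalars, which matches Python's `len`. Lengths such as `maxLength` are therefore measured in scalars, not in grapheme clusters.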