File size: 5,471 Bytes
357ae2c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
# WtpsplitKit

On-device text segmentation for iOS / macOS using **Core ML** SaT models
(["Segment any Text"](https://github.com/segment-any-text/wtpsplit)). It splits
text into TTS-friendly chunks that respect length bounds and break at natural
pauses — a Swift port of `scripts/run_segmentation.py` from this repo.

The whole pipeline runs locally: XLM-RoBERTa tokenization, Core ML inference, and
the length-constrained Viterbi segmentation are all in pure Swift (plus Core ML).

## What it does

```
tokenize (XLM-R Unigram)
  → per-token boundary logits (windowed Core ML inference)
  → scatter onto characters → sigmoid
  → pause-aware mask (bias breaks to commas / clauses / connectors, never mid-word)
  → length-constrained DP (Viterbi or greedy) with a length prior
```

The output chunks always rejoin to the exact input (`chunks.joined() == text`).

## Fidelity

This port is verified against the Python/Hugging Face references:

- **Tokenizer** — ids *and* character offsets match Hugging Face `tokenizers`
  (xlm-roberta-base) exactly across Latin, CJK, full-width, emoji, Devanagari and
  mixed/whitespace edge cases.
- **Segmentation** — mask + prior + DP produce byte-identical chunks to
  `wtpsplit/utils/{constraints,priors}.py` and `pause_aware_mask` when fed the
  same probabilities, across every prior / algorithm / overflow / allow-midword
  combination.

The only intentional difference from the ONNX path is logit precision: the iOS
models are quantized (fp16 / int8 / palettized), so boundary logits differ
slightly from the fp32 ONNX models. Tokenization and segmentation logic are exact.

## Installation

Swift Package Manager:

```swift
.package(url: "https://github.com/krmanik/wtpsplit-kit", from: "0.1.0")
```

```swift
.target(name: "App", dependencies: [.product(name: "WtpsplitKit", package: "wtpsplit-kit")])
```

Requires iOS 16+ / macOS 13+.

## Models

The Core ML models are **not** bundled (they are 40–430 MB). Build them with
`scripts/build_ios_coreml.py` (produces `ios_models/<variant>/<variant>-<quant>.mlpackage`)
and ship the one you want with your app, or download it on first launch.

| Vocabulary | Variants | Notes |
|------------|----------|-------|
| `*-full-*` | full XLM-R (250 k) | every language SaT supports |
| `*-en_zh-*` | pruned (≈101 k) | English + Chinese only, ~2× smaller |

`en_zh` models need token-id remapping; the kit handles it automatically (the
remap table is bundled). Pass `.auto` and it infers the vocabulary from the path
(`en_zh` ⇒ pruned), or set it explicitly.

The 4 MB tokenizer vocabulary and 1 MB remap table **are** bundled as package
resources — no network or extra files needed for tokenization.

## Usage

```swift
import WtpsplitKit

let modelURL = Bundle.main.url(forResource: "sat-3l-sm-full-fp16",
                               withExtension: "mlpackage")!
let segmenter = try SaTSegmenter(modelURL: modelURL)   // load/compile once, reuse

var options = SegmentationOptions()
options.maxLength = 80      // hard cap (unless overflow > 0)
options.minLength = 40
options.prior = .gaussian   // .uniform | .gaussian | .clippedPolynomial
options.targetLength = 70
options.spread = 12

let chunks = try segmenter.segment(
    "Breaking News: Scientists announced a discovery. 这是一个测试。It works well!",
    options: options)
// → ["Breaking News: Scientists announced a discovery. 这是一个测试。",
//    "It works well!"]
```

`SaTSegmenter` is reusable; create it once (loading/compiling the model is the
expensive step) and call `segment` as needed. Run it off the main thread for long
inputs.

### Options (mirror `run_segmentation.py`)

| Option | Default | Meaning |
|--------|---------|---------|
| `maxLength` | 80 | target max characters per chunk |
| `minLength` | 40 | min characters per chunk (best effort) |
| `overflow` | 0 | chars a chunk may exceed `maxLength` to reach a pause (0 = hard cap) |
| `prior` | `.gaussian` | length-prior shape |
| `targetLength` | 70 | gaussian / polynomial target |
| `spread` | 12 | gaussian / polynomial spread |
| `algorithm` | `.viterbi` | `.viterbi` (optimal) or `.greedy` |
| `allowMidword` | `false` | permit breaks inside words (skips the pause-aware mask) |
| `windowOverlap` | 16 | token overlap between inference windows for long text |

### Compute units

```swift
let segmenter = try SaTSegmenter(modelURL: modelURL, computeUnits: .cpuAndNeuralEngine)
```

### Tokenizer-only

```swift
let tok = try XLMRobertaTokenizer()
let (ids, charEnds) = tok.encode("Hello world! 你好。")
```

## CLI (for validation / experimentation)

```bash
swift run wtpseg --model path/to/sat-3l-sm-full-fp16.mlpackage \
    --text "Your text. 你的文字。" --max-length 80 --min-length 40

# flags: --max-length --min-length --overflow --prior {uniform,gaussian,clipped_polynomial}
#        --target --spread --algorithm {viterbi,greedy} --allow-midword
#        --vocab {auto,full,en_zh} --file <path>
#        --tokens / --dump-probs / --dump-chunks  (debug dumps)
```

## Notes

- Long inputs are processed in `seqLen`-sized windows (the models have a fixed
  sequence length, 256 by default) with overlapping logits averaged.
- `.mlpackage` is compiled on first load. To skip recompilation, pass a
  precompiled `.mlmodelc` URL instead.
- All text indexing is by Unicode scalar (code point) to match Python `str`
  semantics, so split positions and lengths line up with the reference.