# @cp500/infon-coref

Multilingual coreference resolution in the browser or Node, via ONNX.

The trained model is a pointer-network coref resolver fine-tuned on
top of a multilingual MiniLM-L12 distilled from XLM-R. It handles
**English, Japanese, Korean, Thai, and Chinese**, and replaces
English-only [fastcoref](https://github.com/shon-otmazgin/fastcoref)
for use cases that need multilingual coverage.

The model artefacts live at
[**cp500/infon-coref-pointer**](https://huggingface.co/cp500/infon-coref-pointer)
on the Hugging Face Hub. This package is the JavaScript client that
loads them.

## Install

```bash
npm install @cp500/infon-coref onnxruntime-web
# or for Node:
npm install @cp500/infon-coref onnxruntime-node
```

The ONNX runtime is a **peer dependency** so you only install the one
your environment needs. ``@huggingface/tokenizers`` is **optional**;
if installed, we use its WASM SentencePiece tokenizer (faster and
fully spec-compliant). Otherwise the package falls back to a minimal
pure-JS tokenizer that handles the XLM-R vocabulary.

## Quick start (browser)

```ts
import { InfonCorefModel } from '@cp500/infon-coref';

const model = await InfonCorefModel.fromHub('cp500/infon-coref-pointer', {
  precision: 'fp16',   // 'fp16' (default, ~235 MB) or 'fp32' (~470 MB)
  device: 'auto',      // tries WebGPU, falls back to WASM
});

const result = await model.resolve(
  'Toyota announced a partnership with Panasonic on battery technology. ' +
  'The Japanese automaker said the deal is worth $250 million.'
);

for (const cluster of result.clusters) {
  const surfaces = cluster.map(i => result.mentions[i].text);
  console.log(surfaces.join('  ↔  '));
  // Toyota  ↔  The Japanese automaker
}
```

## Quick start (Node)

```ts
import { InfonCorefModel } from '@cp500/infon-coref';

// Same API as fromHub, but reads from local files (e.g. after a
// huggingface-cli download).
const model = await InfonCorefModel.fromLocal('./models/infon-coref/');
const result = await model.resolve('Toyota e Panasonic anunciaram...');
```

## What you get back

```ts
interface CorefResult {
  text: string;                 // original input, unchanged
  tokens: Token[];              // wordpieces with char offsets
  mentions: Mention[];          // detected mentions in document order
  clusters: number[][];         // clusters[c] = list of mention indices
  timing: {
    tokenize: number;
    backbone: number;
    bioDecode: number;
    scorer: number;
    total: number;              // ms
  };
}

interface Mention {
  start: number;                // wordpiece index, inclusive
  end: number;                  // wordpiece index, inclusive
  charStart: number;            // char offset in source text
  charEnd: number;
  text: string;                 // literal substring of source text
  cluster: number;              // -1 for singleton
  antecedent: number;           // 0-based mention index, -1 = no antecedent
}
```
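
As a minimal sketch of consuming this shape, the helper below lists the mention surfaces of each cluster together with their character offsets, and defensively skips any single-mention cluster. The interfaces are trimmed copies of the ones above; `describeClusters` and the sample object are hypothetical, hand-written illustrations, not real model output.

```ts
// Trimmed copies of the result types documented above.
interface MentionLike {
  charStart: number; // char offset in source text
  text: string;      // literal substring of source text
}

interface CorefResultLike {
  mentions: MentionLike[];
  clusters: number[][]; // clusters[c] = list of mention indices
}

// Hypothetical helper: one string per cluster with two or more mentions.
function describeClusters(result: CorefResultLike): string[] {
  return result.clusters
    .filter((cluster) => cluster.length > 1)
    .map((cluster) =>
      cluster
        .map((i) => `${result.mentions[i].text}@${result.mentions[i].charStart}`)
        .join(' = '),
    );
}

// Hand-written sample for the Toyota sentence from the quick start.
const sampleResult: CorefResultLike = {
  mentions: [
    { charStart: 0, text: 'Toyota' },
    { charStart: 36, text: 'Panasonic' },
    { charStart: 69, text: 'The Japanese automaker' },
  ],
  clusters: [[0, 2], [1]],
};

console.log(describeClusters(sampleResult));
```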

## Languages

Trained on synthetic Bedrock/Claude-generated data balanced across:

| Code | Language       |
|------|----------------|
| `en` | English        |
| `ja` | Japanese       |
| `ko` | Korean         |
| `th` | Thai           |
| `zh` | Chinese (Simplified) |

The XLM-R backbone covers ~100 languages, but the mention-detection
and pointer-network heads were trained only on these five. Other
languages may work via zero-shot transfer; verify on your domain
before shipping.

## API

### `InfonCorefModel.fromHub(repo, options?)`

Load model artefacts from a Hugging Face repo. Downloads (and caches
in the browser Cache API) ``meta.json``, the chosen ONNX backbone,
the mention scorer, and ``tokenizer.json``.

| Option         | Type                                    | Default   | Notes |
|----------------|-----------------------------------------|-----------|-------|
| `precision`    | `'fp32' \| 'fp16'`                      | `'fp16'`  | FP16 halves the download. Falls back to FP32 if FP16 is missing in the repo. |
| `device`       | `'auto' \| 'webgpu' \| 'wasm' \| 'cpu' \| 'cuda'` | `'auto'` | Browser auto-prefers WebGPU. |
| `maxLength`    | `number`                                | `256`     | Truncates inputs longer than N wordpieces. |
| `bioThreshold` | `number`                                | none      | If set, suppresses low-confidence span detections. `0.7` is a common stricter setting. |
| `revision`     | `string`                                | `'main'`  | HF branch/tag/commit-SHA pin. |
| `debug`        | `boolean`                               | `false`   | Logs per-stage timings to `console.debug`. |
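
To make the `revision` option concrete: files on the Hugging Face Hub are served from `https://huggingface.co/<repo>/resolve/<revision>/<path>`. The helper below is a sketch of how such a URL could be built. `hubFileUrl` is a hypothetical name, and the package's own `fetchHubFile` may construct its requests differently.

```ts
// Hypothetical helper, not part of the package API.
function hubFileUrl(repo: string, file: string, revision = 'main'): string {
  // Encode the revision as a single path segment (branch names may
  // contain '/'), but keep the '/' separators inside the file path.
  const path = file.split('/').map(encodeURIComponent).join('/');
  return `https://huggingface.co/${repo}/resolve/${encodeURIComponent(revision)}/${path}`;
}

console.log(hubFileUrl('cp500/infon-coref-pointer', 'meta.json'));
// 'v0.1.0' is a made-up tag, used only for illustration.
console.log(hubFileUrl('cp500/infon-coref-pointer', 'onnx/mention_scorer.onnx', 'v0.1.0'));
```

Pinning `revision` to a tag or commit SHA keeps the downloaded files fixed even if `main` later changes.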

### `InfonCorefModel.fromLocal(baseUrl, options?)`

Same as `fromHub` but loads files relative to a base URL or
filesystem path. Browser: `baseUrl` is a URL prefix
(`/models/coref/`). Node: a directory path (`./models/coref/`).

The directory must contain:

```
meta.json
tokenizer.json
onnx/backbone_bio.onnx               (and .onnx.data sidecar if present)
onnx/backbone_bio_fp16.onnx
onnx/mention_scorer.onnx
onnx/mention_scorer_fp16.onnx
```
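
Given that layout, the files needed for one precision could be listed as in the sketch below. `modelFiles` is a hypothetical helper, and the `_fp16` suffix rule is inferred only from the file names in the listing above; the optional `.onnx.data` sidecar is left out.

```ts
type Precision = 'fp32' | 'fp16';

// Hypothetical helper: full paths (or URLs) of the files for one precision.
function modelFiles(baseUrl: string, precision: Precision): string[] {
  // Accept the base with or without a trailing slash.
  const base = baseUrl.endsWith('/') ? baseUrl : baseUrl + '/';
  const suffix = precision === 'fp16' ? '_fp16' : '';
  return [
    'meta.json',
    'tokenizer.json',
    `onnx/backbone_bio${suffix}.onnx`,
    `onnx/mention_scorer${suffix}.onnx`,
  ].map((file) => base + file);
}

console.log(modelFiles('./models/coref', 'fp16'));
```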

### `model.resolve(text, options?)`

Run end-to-end coref on a single document. Returns
[`CorefResult`](#what-you-get-back).

`options` accepts per-call overrides for `maxLength`, `bioThreshold`,
and `debug`, with the same meaning as in `fromHub`.

## Power-user exports

If you want to swap one stage of the pipeline (e.g. a custom
tokenizer or a different ORT runtime), the helpers are exported
individually:

```ts
import {
  buildPairs,            // mention M → flat (pair_i, pair_j) tensors
  decodeBio,             // BIO logits → wordpiece spans
  groupClusters,         // antecedent decisions → union-find clusters
  loadTokenizer,         // SentencePiece JSON → Tokenizer
  fetchHubFile,          // HF Hub fetch + browser-cache
} from '@cp500/infon-coref';
```

These match the Python reference implementation in
[`scripts/coref_onnx_experiment.py`](https://github.com/cp500/overlord/blob/main/infon/scripts/coref_onnx_experiment.py)
exactly, which is useful when comparing the Python and TS pipelines at
the intermediate-tensor level.
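
To give a feel for what two of these stages compute, here is a rough sketch of triangular pair enumeration and union-find grouping. These are not the package's implementations: the `...Sketch` names, the signatures, and the pair ordering are assumptions made for illustration, and the exported `buildPairs` and `groupClusters` may differ.

```ts
// Pair every mention i with each earlier mention j < i, giving
// M * (M - 1) / 2 pairs in two flat arrays. Ordering is assumed.
function buildPairsSketch(numMentions: number): { pairI: number[]; pairJ: number[] } {
  const pairI: number[] = [];
  const pairJ: number[] = [];
  for (let i = 1; i < numMentions; i++) {
    for (let j = 0; j < i; j++) {
      pairI.push(i);
      pairJ.push(j);
    }
  }
  return { pairI, pairJ };
}

// antecedents[i] is the mention index that mention i points back to,
// or -1 for no antecedent. Linked mentions are merged with union-find.
function groupClustersSketch(antecedents: number[]): number[][] {
  const parent = antecedents.map((_, i) => i);
  const find = (x: number): number => {
    while (parent[x] !== x) {
      parent[x] = parent[parent[x]]; // path halving
      x = parent[x];
    }
    return x;
  };
  antecedents.forEach((a, i) => {
    if (a >= 0) parent[find(i)] = find(a);
  });
  const groups = new Map<number, number[]>();
  antecedents.forEach((_, i) => {
    const root = find(i);
    if (!groups.has(root)) groups.set(root, []);
    groups.get(root)!.push(i);
  });
  // Drop singletons: keep only groups with at least two mentions.
  return Array.from(groups.values()).filter((g) => g.length > 1);
}

console.log(buildPairsSketch(3));
console.log(groupClustersSketch([-1, -1, 0, 1, 2]));
```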

## Architecture

```
┌─────────────────────────┐
│  text                   │
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  SentencePiece tokenize │   tokenizer.json (XLM-R vocab)
└────────────┬────────────┘
             ▼   input_ids, attention_mask
┌─────────────────────────┐
│  backbone_bio.onnx      │   MiniLM-L12 (12 layers, H=384)
│   • XLM-R encoder       │   + 3-class BIO head
│   • bio_logits (T,3)    │
└────────┬────────┬───────┘
         │        │
         │        ▼  bio_logits → run-length decode → spans
         │  ┌──────────────────────┐
         │  │  decodeBio (TS)      │
         │  └──────────┬───────────┘
         │             ▼  span_starts, span_ends
         │  ┌──────────────────────┐
         │  │  buildPairs (TS)     │
         │  └──────────┬───────────┘
         │             ▼  pair_i, pair_j (triangular)
         ▼             ▼
┌─────────────────────────┐
│  mention_scorer.onnx    │   gather + segment-mean pool +
│   • pair_scores (P,)    │   3-vector pair MLP
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  pickAntecedents (TS)   │
│  + groupClusters (TS)   │
└────────────┬────────────┘
             ▼
        CorefResult
```

The split between the two ONNX graphs exists so the BIO head can
share computation with the backbone (one forward pass), while the
mention scorer can be re-run with different `(pair_i, pair_j)`
batches without recomputing hidden states. It also keeps each ONNX
file's input signature simple enough to trace cleanly.
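
The run-length decode step between the two graphs can be sketched as follows. This is an illustration, not the package's `decodeBio`: the tag order (0 = O, 1 = B, 2 = I), the plain argmax, and the handling of a stray `I` tag are all assumptions.

```ts
// Assumed label order: 0 = O (outside), 1 = B (begin), 2 = I (inside).
const B = 1;
const I = 2;

// logits[t] holds the three class scores for wordpiece t.
// Returns [start, end] wordpiece spans, both ends inclusive.
function decodeBioSketch(logits: number[][]): Array<[number, number]> {
  const spans: Array<[number, number]> = [];
  let start = -1; // start of the currently open span, -1 if none
  logits.forEach((row, t) => {
    const tag = row.indexOf(Math.max(...row)); // argmax over 3 classes
    if (tag === B) {
      if (start >= 0) spans.push([start, t - 1]); // close previous span
      start = t;
    } else if (tag === I) {
      if (start < 0) start = t; // stray I: treat it as a span start
    } else {
      if (start >= 0) spans.push([start, t - 1]); // O closes the span
      start = -1;
    }
  });
  if (start >= 0) spans.push([start, logits.length - 1]);
  return spans;
}

// Tags O B I O B B I, written as one-hot score rows.
const demoLogits = [
  [1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0],
  [0, 1, 0], [0, 1, 0], [0, 0, 1],
];
console.log(decodeBioSketch(demoLogits));
```

The resulting span starts and ends are what the pair-building stage would feed to the mention scorer, which is why hidden states do not need to be recomputed when the pairs change.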

## Performance ballpark

Numbers from a 2024 M1 Pro MacBook on a 110-token English document:

| Stage     | WASM (FP16) | WebGPU (FP16) | Node CPU (FP16) |
|-----------|-------------|---------------|-----------------|
| Tokenize  | 4 ms        | 4 ms          | 2 ms            |
| Backbone  | 220 ms      | 70 ms         | 90 ms           |
| BIO       | <1 ms       | <1 ms         | <1 ms           |
| Scorer    | 5 ms        | 4 ms          | 2 ms            |
| **Total** | **~230 ms** | **~80 ms**    | **~95 ms**      |

First call adds ~2-4 s for ONNX session warmup. The Cache API in
browsers persists the downloaded model so warmup-after-reload is
limited to session creation.
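
To see which stage dominates on your own hardware, the `timing` object returned by `resolve` can be turned into percentage shares. A small sketch, assuming all fields are in milliseconds; `stageShares` is a hypothetical helper, and the sample values are the WASM column from the table above with "<1 ms" entered as 1.

```ts
// Same fields as CorefResult.timing (assumed to be in ms).
interface StageTiming {
  tokenize: number;
  backbone: number;
  bioDecode: number;
  scorer: number;
  total: number;
}

// Hypothetical helper: share of total time per stage, rounded percent.
function stageShares(t: StageTiming): Record<string, number> {
  const stages = {
    tokenize: t.tokenize,
    backbone: t.backbone,
    bioDecode: t.bioDecode,
    scorer: t.scorer,
  };
  const shares: Record<string, number> = {};
  for (const [name, ms] of Object.entries(stages)) {
    shares[name] = t.total > 0 ? Math.round((100 * ms) / t.total) : 0;
  }
  return shares;
}

const wasmTiming: StageTiming = {
  tokenize: 4, backbone: 220, bioDecode: 1, scorer: 5, total: 230,
};
console.log(stageShares(wasmTiming));
```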

## License

Apache 2.0. The trained weights at `cp500/infon-coref-pointer` carry
the same license; the underlying MiniLM-L12 backbone is also Apache
2.0.

## Status

Alpha. The API is stable enough to integrate behind your own
abstraction; expect minor breaking changes on the public class
shape until 1.0.

Issue tracker: https://github.com/cp500/infon-coref-js/issues