# @cp500/infon-coref
Multilingual coreference resolution in the browser or Node, via ONNX.
The trained model is a pointer-network coref resolver fine-tuned on
top of a multilingual MiniLM-L12 distilled from XLM-R. It handles
**English, Japanese, Korean, Thai, and Chinese**, and replaces
English-only [fastcoref](https://github.com/shon-otmazgin/fastcoref)
for use cases that need multilingual coverage.
The model artefacts live at
[**cp500/infon-coref-pointer**](https://huggingface.co/cp500/infon-coref-pointer)
on the Hugging Face Hub. This package is the JavaScript client that
loads them.
## Install
```bash
npm install @cp500/infon-coref onnxruntime-web
# or for Node:
npm install @cp500/infon-coref onnxruntime-node
```
The ONNX runtime is a **peer dependency** so you only install the one
your environment needs. ``@huggingface/tokenizers`` is **optional**;
if installed, we use its WASM SentencePiece tokenizer (faster and
fully spec-compliant). Otherwise the package falls back to a minimal
pure-JS tokenizer that handles the XLM-R vocabulary.
## Quick start (browser)
```ts
import { InfonCorefModel } from '@cp500/infon-coref';
const model = await InfonCorefModel.fromHub('cp500/infon-coref-pointer', {
  precision: 'fp16', // 'fp16' (default, ~235 MB) or 'fp32' (~470 MB)
  device: 'auto',    // tries WebGPU, falls back to WASM
});
const result = await model.resolve(
  'Toyota announced a partnership with Panasonic on battery technology. ' +
  'The Japanese automaker said the deal is worth $250 million.'
);
for (const cluster of result.clusters) {
  const surfaces = cluster.map(i => result.mentions[i].text);
  console.log(surfaces.join(' ↔ '));
  // Toyota ↔ The Japanese automaker
}
```
## Quick start (Node)
```ts
import { InfonCorefModel } from '@cp500/infon-coref';
// Same API as fromHub, but reads from local files (e.g. after a
// huggingface-cli download).
const model = await InfonCorefModel.fromLocal('./models/infon-coref/');
const result = await model.resolve('Toyota e Panasonic anunciaram...');
```
## What you get back
```ts
interface CorefResult {
  text: string;         // original input, unchanged
  tokens: Token[];      // wordpieces with char offsets
  mentions: Mention[];  // detected mentions in document order
  clusters: number[][]; // clusters[c] = list of mention indices
  timing: {
    tokenize: number;
    backbone: number;
    bioDecode: number;
    scorer: number;
    total: number;      // ms
  };
}
interface Mention {
  start: number;        // wordpiece index, inclusive
  end: number;          // wordpiece index, inclusive
  charStart: number;    // char offset in source text
  charEnd: number;
  text: string;         // literal substring of source text
  cluster: number;      // -1 for singleton
  antecedent: number;   // 0-based mention index, -1 = no antecedent
}
```
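As a small illustration of how the `cluster` field ties mentions together, the sketch below groups mention surfaces by cluster id and skips singletons. `MentionLike` and `surfacesByCluster` are names made up for this example, not package exports, and the sample data is hand-written rather than real model output.

```ts
// Minimal subset of the Mention shape above; only the fields used here.
interface MentionLike {
  text: string;
  cluster: number; // -1 for singleton
}

// Group mention surfaces by cluster id, preserving document order.
function surfacesByCluster(mentions: MentionLike[]): Map<number, string[]> {
  const groups = new Map<number, string[]>();
  for (const m of mentions) {
    if (m.cluster < 0) continue; // singleton: not part of any cluster
    const group = groups.get(m.cluster) ?? [];
    group.push(m.text);
    groups.set(m.cluster, group);
  }
  return groups;
}

// Hand-written sample data (hypothetical, not real model output).
const sample: MentionLike[] = [
  { text: 'Toyota', cluster: 0 },
  { text: 'Panasonic', cluster: -1 },
  { text: 'The Japanese automaker', cluster: 0 },
];
const grouped = surfacesByCluster(sample);
```

This carries the same information as `result.clusters`, only keyed from the mention side, which can be handy when you are already iterating over `mentions`.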
## Languages
Trained on synthetic Bedrock/Claude-generated data balanced across:
| Code | Language |
|------|----------------|
| `en` | English |
| `ja` | Japanese |
| `ko` | Korean |
| `th` | Thai |
| `zh` | Chinese (Simplified) |
The XLM-R backbone covers ~100 languages, but the mention-detection
and pointer-network heads were trained only on these five. Other
languages may work via zero-shot transfer; verify on your domain
before shipping.
## API
### `InfonCorefModel.fromHub(repo, options?)`
Load model artefacts from a Hugging Face repo. Downloads (and caches
in the browser Cache API) ``meta.json``, the chosen ONNX backbone,
the mention scorer, and ``tokenizer.json``.
| Option | Type | Default | Notes |
|----------------|-----------------------------------------|-----------|-------|
| `precision` | `'fp32' \| 'fp16'` | `'fp16'` | FP16 halves the download. Falls back to FP32 if FP16 is missing in the repo. |
| `device` | `'auto' \| 'webgpu' \| 'wasm' \| 'cpu' \| 'cuda'` | `'auto'` | Browser auto-prefers WebGPU. |
| `maxLength` | `number` | `256` | Truncates inputs longer than N wordpieces. |
| `bioThreshold` | `number` | none | If set, suppresses low-confidence span detections. `0.7` is a common stricter setting. |
| `revision` | `string` | `'main'` | HF branch/tag/commit-SHA pin. |
| `debug` | `boolean` | `false` | Logs per-stage timings to `console.debug`. |
### `InfonCorefModel.fromLocal(baseUrl, options?)`
Same as `fromHub` but loads files relative to a base URL or
filesystem path. Browser: `baseUrl` is a URL prefix
(`/models/coref/`). Node: a directory path (`./models/coref/`).
The directory must contain:
```
meta.json
tokenizer.json
onnx/backbone_bio.onnx (and .onnx.data sidecar if present)
onnx/backbone_bio_fp16.onnx
onnx/mention_scorer.onnx
onnx/mention_scorer_fp16.onnx
```
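To show how `precision` and `revision` relate to these file names, here is a minimal sketch that builds the paths for one precision and the matching Hugging Face "resolve" URLs. `artefactPaths` and `hubFileUrl` are hypothetical helpers written for this example, not package exports, and the sketch assumes the Hub repo uses the same layout as the local directory above.

```ts
type Precision = 'fp32' | 'fp16';

// File names follow the directory listing above: FP16 variants carry an
// `_fp16` suffix. That the Hub repo mirrors this layout is an assumption.
function artefactPaths(precision: Precision): string[] {
  const suffix = precision === 'fp16' ? '_fp16' : '';
  return [
    'meta.json',
    'tokenizer.json',
    `onnx/backbone_bio${suffix}.onnx`,
    `onnx/mention_scorer${suffix}.onnx`,
  ];
}

// Hugging Face "resolve" URL for one file at a given branch, tag, or commit.
function hubFileUrl(repo: string, path: string, revision = 'main'): string {
  const rev = encodeURIComponent(revision); // e.g. 'refs/pr/1' contains slashes
  return `https://huggingface.co/${repo}/resolve/${rev}/${path}`;
}

const urls = artefactPaths('fp16').map(p =>
  hubFileUrl('cp500/infon-coref-pointer', p),
);
```

Pinning `revision` to a commit SHA in production keeps these URLs pointing at the same bytes even if `main` moves.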
### `model.resolve(text, options?)`
Run end-to-end coref on a single document. Returns
[`CorefResult`](#what-you-get-back).
`options` accepts per-call overrides for `maxLength`, `bioThreshold`,
and `debug`, with the same meaning as in `fromHub`.
## Power-user exports
If you want to swap one stage of the pipeline (e.g. a custom
tokenizer or a different ORT runtime), the helpers are exported
individually:
```ts
import {
  buildPairs,    // mention M → flat (pair_i, pair_j) tensors
  decodeBio,     // BIO logits → wordpiece spans
  groupClusters, // antecedent decisions → union-find clusters
  loadTokenizer, // SentencePiece JSON → Tokenizer
  fetchHubFile,  // HF Hub fetch + browser-cache
} from '@cp500/infon-coref';
```
These match the Python reference implementation in
[`scripts/coref_onnx_experiment.py`](https://github.com/cp500/overlord/blob/main/infon/scripts/coref_onnx_experiment.py)
exactly, which is useful when comparing the Python and TS pipelines
at the intermediate-tensor level.
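To make the comments on `buildPairs` and `groupClusters` concrete, here is a minimal sketch of the two underlying ideas: pairing each mention with every earlier mention (a triangular pair list), and merging antecedent links with union-find. The function names, the pair ordering, and the return shapes are assumptions for illustration; the exported helpers may differ in all three.

```ts
// Enumerate (i, j) with j < i: every mention paired with each earlier one.
// The ordering and the Int32Array output type are assumptions of this sketch.
function trianglePairs(numMentions: number): { pairI: Int32Array; pairJ: Int32Array } {
  const count = Math.max(0, (numMentions * (numMentions - 1)) / 2);
  const pairI = new Int32Array(count);
  const pairJ = new Int32Array(count);
  let k = 0;
  for (let i = 1; i < numMentions; i++) {
    for (let j = 0; j < i; j++) {
      pairI[k] = i;
      pairJ[k] = j;
      k++;
    }
  }
  return { pairI, pairJ };
}

// antecedent[m] is the index of m's chosen antecedent, or -1 for none.
// Returns clusters of size >= 2 as ascending lists of mention indices.
function clustersFromAntecedents(antecedent: number[]): number[][] {
  const parent = antecedent.map((_, i) => i);
  const find = (x: number): number => {
    while (parent[x] !== x) {
      parent[x] = parent[parent[x]]; // path halving
      x = parent[x];
    }
    return x;
  };
  antecedent.forEach((a, m) => {
    if (a >= 0) parent[find(m)] = find(a); // union m's set into a's set
  });
  const byRoot = new Map<number, number[]>();
  antecedent.forEach((_, m) => {
    const root = find(m);
    const members = byRoot.get(root) ?? [];
    members.push(m);
    byRoot.set(root, members);
  });
  return Array.from(byRoot.values()).filter(c => c.length > 1);
}
```

For `M` mentions the pair list has `M * (M - 1) / 2` entries, so the scorer's input grows quadratically with the number of detected mentions.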
## Architecture
```
┌─────────────────────────┐
│  text                   │
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│ SentencePiece tokenize  │   tokenizer.json (XLM-R vocab)
└────────────┬────────────┘
             ▼   input_ids, attention_mask
┌─────────────────────────┐
│  backbone_bio.onnx      │   MiniLM-L12 (12 layers, H=384)
│  • XLM-R encoder        │   + 3-class BIO head
│  • bio_logits (T,3)     │
└────────┬────────┬───────┘
         │        │
         │        ▼   bio_logits → run-length decode → spans
         │   ┌──────────────────────┐
         │   │  decodeBio (TS)      │
         │   └──────────┬───────────┘
         │              ▼   span_starts, span_ends
         │   ┌──────────────────────┐
         │   │  buildPairs (TS)     │
         │   └──────────┬───────────┘
         │              ▼   pair_i, pair_j (triangular)
         ▼              ▼
┌─────────────────────────┐
│  mention_scorer.onnx    │   gather + segment-mean pool +
│  • pair_scores (P,)     │   3-vector pair MLP
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  pickAntecedents (TS)   │
│  + groupClusters (TS)   │
└────────────┬────────────┘
             ▼
        CorefResult
```
The split between the two ONNX graphs exists so the BIO head can
share computation with the backbone (one forward pass), while the
mention scorer can be re-run with different `(pair_i, pair_j)`
batches without recomputing hidden states. It also keeps each ONNX
file's input signature simple enough to trace cleanly.
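The "run-length decode" step in the diagram can be sketched as follows. This is an illustrative version, not the package's `decodeBio`: the label indices (`O = 0`, `B = 1`, `I = 2`), the handling of a stray `I` label, and the name `decodeBioSketch` are all assumptions, and it ignores `bioThreshold` entirely.

```ts
// Label indices are an assumption of this sketch; check the Python
// reference for the real ordering before relying on it.
const O = 0, B = 1, I = 2;

interface Span { start: number; end: number } // wordpiece indices, inclusive

// Index (0..2) of the largest of the three logits starting at `offset`.
function argmax3(row: ArrayLike<number>, offset: number): number {
  let best = 0;
  for (let c = 1; c < 3; c++) {
    if (row[offset + c] > row[offset + best]) best = c;
  }
  return best;
}

// `logits` is a flat row-major (T, 3) array, like bio_logits in the diagram.
function decodeBioSketch(logits: ArrayLike<number>, numTokens: number): Span[] {
  const spans: Span[] = [];
  let open: Span | null = null;
  for (let t = 0; t < numTokens; t++) {
    const label = argmax3(logits, t * 3);
    if (label === B || (label === I && open === null)) {
      // B always starts a span; a stray I (no open span) is treated like B.
      if (open) spans.push(open);
      open = { start: t, end: t };
    } else if (label === I && open) {
      open.end = t; // extend the current run
    } else if (open) {
      spans.push(open); // O closes the current span
      open = null;
    }
  }
  if (open) spans.push(open); // flush a span that runs to the last token
  return spans;
}
```

Treating a stray `I` as a span start is a lenient choice; a stricter decoder would drop it instead.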
## Performance ballpark
Numbers from a 2024 M1 Pro MacBook on a 110-token English document:
| Stage | WASM (FP16) | WebGPU (FP16) | Node CPU (FP16) |
|-----------|-------------|---------------|-----------------|
| Tokenize | 4 ms | 4 ms | 2 ms |
| Backbone | 220 ms | 70 ms | 90 ms |
| BIO | <1 ms | <1 ms | <1 ms |
| Scorer | 5 ms | 4 ms | 2 ms |
| **Total** | **~230 ms** | **~80 ms** | **~95 ms** |
The first call adds ~2-4 s for ONNX session warmup. In browsers, the
Cache API persists the downloaded model, so after a page reload the
only startup cost is session creation.
## License
Apache 2.0. The trained weights at `cp500/infon-coref-pointer` carry
the same license; the underlying MiniLM-L12 backbone is also Apache
2.0.
## Status
Alpha. The API is stable enough to integrate behind your own
abstraction; expect minor breaking changes on the public class
shape until 1.0.
Issue tracker: https://github.com/cp500/infon-coref-js/issues