# @cp500/infon-coref

Multilingual coreference resolution in the browser or Node, via ONNX.

The trained model is a pointer-network coref resolver fine-tuned on top of a multilingual MiniLM-L12 distilled from XLM-R. It handles **English, Japanese, Korean, Thai, and Chinese**, and replaces the English-only [fastcoref](https://github.com/shon-otmazgin/fastcoref) for use cases that need multilingual coverage.

The model artefacts live at [**cp500/infon-coref-pointer**](https://huggingface.co/cp500/infon-coref-pointer) on the Hugging Face Hub. This package is the JavaScript client that loads them.

## Install

```bash
npm install @cp500/infon-coref onnxruntime-web
# or for Node:
npm install @cp500/infon-coref onnxruntime-node
```

The ONNX runtime is a **peer dependency** so you only install the one your environment needs. `@huggingface/tokenizers` is **optional**; if installed, we use its WASM SentencePiece tokenizer (faster and fully spec-compliant). Otherwise the package falls back to a minimal pure-JS tokenizer that handles the XLM-R vocabulary.

## Quick start (browser)

```ts
import { InfonCorefModel } from '@cp500/infon-coref';

const model = await InfonCorefModel.fromHub('cp500/infon-coref-pointer', {
  precision: 'fp16', // 'fp16' (default, ~235 MB) or 'fp32' (~470 MB)
  device: 'auto',    // tries WebGPU, falls back to WASM
});

const result = await model.resolve(
  'Toyota announced a partnership with Panasonic on battery technology. ' +
  'The Japanese automaker said the deal is worth $250 million.'
);

for (const cluster of result.clusters) {
  const surfaces = cluster.map(i => result.mentions[i].text);
  console.log(surfaces.join(' ↔ '));
  // Toyota ↔ The Japanese automaker
}
```

## Quick start (Node)

```ts
import { InfonCorefModel } from '@cp500/infon-coref';

// Same API as fromHub, but reads from local files (e.g. after a
// huggingface-cli download).
const model = await InfonCorefModel.fromLocal('./models/infon-coref/');
const result = await model.resolve('Toyota e Panasonic anunciaram...');
```

## What you get back

```ts
interface CorefResult {
  text: string;          // original input, unchanged
  tokens: Token[];       // wordpieces with char offsets
  mentions: Mention[];   // detected mentions in document order
  clusters: number[][];  // clusters[c] = list of mention indices
  timing: {
    tokenize: number;
    backbone: number;
    bioDecode: number;
    scorer: number;
    total: number;       // ms
  };
}

interface Mention {
  start: number;       // wordpiece index, inclusive
  end: number;         // wordpiece index, inclusive
  charStart: number;   // char offset in source text
  charEnd: number;
  text: string;        // literal substring of source text
  cluster: number;     // -1 for singleton
  antecedent: number;  // 0-based mention index, -1 = no antecedent
}
```

## Languages

Trained on synthetic Bedrock/Claude-generated data balanced across:

| Code | Language             |
|------|----------------------|
| `en` | English              |
| `ja` | Japanese             |
| `ko` | Korean               |
| `th` | Thai                 |
| `zh` | Chinese (Simplified) |

The XLM-R backbone covers ~100 languages, but the mention-detection and pointer-net heads were trained only on these five. Other languages may work via zero-shot transfer; verify on your domain before shipping.

## API

### `InfonCorefModel.fromHub(repo, options?)`

Load model artefacts from a Hugging Face repo. Downloads (and caches in the browser Cache API) `meta.json`, the chosen ONNX backbone, the mention scorer, and `tokenizer.json`.

| Option         | Type                                              | Default  | Notes |
|----------------|---------------------------------------------------|----------|-------|
| `precision`    | `'fp32' \| 'fp16'`                                | `'fp16'` | FP16 halves the download. Falls back to FP32 if FP16 is missing in the repo. |
| `device`       | `'auto' \| 'webgpu' \| 'wasm' \| 'cpu' \| 'cuda'` | `'auto'` | Browser auto-prefers WebGPU. |
| `maxLength`    | `number`                                          | `256`    | Truncates inputs longer than N wordpieces. |
| `bioThreshold` | `number`                                          | none     | If set, suppresses low-confidence span detections. `0.7` is a common stricter setting. |
| `revision`     | `string`                                          | `'main'` | HF branch/tag/commit-SHA pin. |
| `debug`        | `boolean`                                         | `false`  | Logs per-stage timings to `console.debug`. |

### `InfonCorefModel.fromLocal(baseUrl, options?)`

Same as `fromHub` but loads files relative to a base URL or filesystem path. In the browser, `baseUrl` is a URL prefix (`/models/coref/`). In Node, it is a directory path (`./models/coref/`). The directory must contain:

```
meta.json
tokenizer.json
onnx/backbone_bio.onnx          (and .onnx.data sidecar if present)
onnx/backbone_bio_fp16.onnx
onnx/mention_scorer.onnx
onnx/mention_scorer_fp16.onnx
```

### `model.resolve(text, options?)`

Run end-to-end coref on a single document. Returns [`CorefResult`](#what-you-get-back).

`options` accepts per-call overrides for `maxLength`, `bioThreshold`, and `debug`, with the same meaning as in `fromHub`.

## Power-user exports

If you want to swap one stage of the pipeline (e.g. a custom tokenizer or a different ORT runtime), the helpers are exported individually:

```ts
import {
  buildPairs,     // mention M → flat (pair_i, pair_j) tensors
  decodeBio,      // BIO logits → wordpiece spans
  groupClusters,  // antecedent decisions → union-find clusters
  loadTokenizer,  // SentencePiece JSON → Tokenizer
  fetchHubFile,   // HF Hub fetch + browser-cache
} from '@cp500/infon-coref';
```

These match the Python reference implementation in [`scripts/coref_onnx_experiment.py`](https://github.com/cp500/overlord/blob/main/infon/scripts/coref_onnx_experiment.py) exactly, which is useful when comparing a Python/TS pipeline at the intermediate-tensor level.
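To make the "antecedent decisions → union-find clusters" step concrete, here is a minimal sketch of how per-mention antecedent indices could be merged into clusters. The function name `groupByAntecedent` is hypothetical and this is not the package's `groupClusters`; it assumes only the convention documented for `Mention.antecedent` (`-1` means no antecedent) and, as a further assumption, drops singletons from the returned list.

```ts
// Illustrative only: a hypothetical stand-in, NOT the package's groupClusters.
// Input: antecedents[i] is the mention index that mention i points to, or -1.
// Output: clusters as lists of mention indices, singletons dropped (assumption).
function groupByAntecedent(antecedents: number[]): number[][] {
  // Every mention starts as the root of its own set.
  const parent = antecedents.map((_, i) => i);

  // Find the root of i, halving the path as we walk up.
  const find = (i: number): number => {
    while (parent[i] !== i) {
      parent[i] = parent[parent[i]];
      i = parent[i];
    }
    return i;
  };

  // Union each mention with its chosen antecedent.
  antecedents.forEach((a, i) => {
    if (a >= 0) parent[find(i)] = find(a);
  });

  // Bucket members under their root index; iteration keeps document order.
  const byRoot: number[][] = antecedents.map(() => []);
  antecedents.forEach((_, i) => {
    byRoot[find(i)].push(i);
  });

  // Keep only sets with at least two mentions.
  return byRoot.filter(members => members.length > 1);
}

// Mentions 2 and 4 chain back to 0; mention 3 points to 1; mention 5 is alone.
const clusters = groupByAntecedent([-1, -1, 0, 1, 2, -1]);
console.log(JSON.stringify(clusters)); // prints [[0,2,4],[1,3]]
```

Union-find fits here because antecedent links form chains (4 → 2 → 0), and merging sets resolves each chain to a single cluster without walking it repeatedly.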
## Architecture

```
┌─────────────────────────┐
│          text           │
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│ SentencePiece tokenize  │   tokenizer.json (XLM-R vocab)
└────────────┬────────────┘
             ▼  input_ids, attention_mask
┌─────────────────────────┐
│ backbone_bio.onnx       │   MiniLM-L12 (12 layers, H=384)
│  • XLM-R encoder        │   + 3-class BIO head
│  • bio_logits (T,3)     │
└────────┬────────┬───────┘
         │        │
         │        ▼  bio_logits → run-length decode → spans
         │   ┌──────────────────────┐
         │   │ decodeBio (TS)       │
         │   └──────────┬───────────┘
         │              ▼  span_starts, span_ends
         │   ┌──────────────────────┐
         │   │ buildPairs (TS)      │
         │   └──────────┬───────────┘
         │              ▼  pair_i, pair_j (triangular)
         ▼              ▼
┌─────────────────────────┐
│ mention_scorer.onnx     │   gather + segment-mean pool +
│  • pair_scores (P,)     │   3-vector pair MLP
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│ pickAntecedents (TS)    │
│ + groupClusters (TS)    │
└────────────┬────────────┘
             ▼  CorefResult
```

The split between the two ONNX graphs exists so the BIO head can share computation with the backbone (one forward pass), while the mention scorer can be re-run with different `(pair_i, pair_j)` batches without recomputing hidden states. It also keeps each ONNX file's input signature simple enough to trace cleanly.

## Performance ballpark

Numbers from a 2024 M1 Pro MacBook on a 110-token English document:

| Stage     | WASM (FP16) | WebGPU (FP16) | Node CPU (FP16) |
|-----------|-------------|---------------|-----------------|
| Tokenize  | 4 ms        | 4 ms          | 2 ms            |
| Backbone  | 220 ms      | 70 ms         | 90 ms           |
| BIO       | <1 ms       | <1 ms         | <1 ms           |
| Scorer    | 5 ms        | 4 ms          | 2 ms            |
| **Total** | **~230 ms** | **~80 ms**    | **~95 ms**      |

The first call adds ~2-4 s for ONNX session warmup. In browsers, the Cache API persists the downloaded model, so after a reload the only startup cost is session creation.

## License

Apache 2.0. The trained weights at `cp500/infon-coref-pointer` carry the same license; the underlying MiniLM-L12 backbone is also Apache 2.0.

## Status

Alpha. The API is stable enough to integrate behind your own abstraction; expect minor breaking changes on the public class shape until 1.0.

Issue tracker: https://github.com/cp500/infon-coref-js/issues