# @cp500/infon-coref

Multilingual coreference resolution in the browser or Node, via ONNX.

The trained model is a pointer-network coref resolver fine-tuned on
top of a multilingual MiniLM-L12 distilled from XLM-R. It handles
**English, Japanese, Korean, Thai, and Chinese**, and is intended as a
replacement for English-only [fastcoref](https://github.com/shon-otmazgin/fastcoref)
in use cases that need multilingual coverage.

The model artefacts live at
[**cp500/infon-coref-pointer**](https://huggingface.co/cp500/infon-coref-pointer)
on the Hugging Face Hub. This package is the JavaScript client that
loads them.

## Install

```bash
npm install @cp500/infon-coref onnxruntime-web
# or for Node:
npm install @cp500/infon-coref onnxruntime-node
```

The ONNX runtime is a **peer dependency** so you only install the one
your environment needs. `@huggingface/tokenizers` is **optional**:
if it is installed, the package uses its WASM SentencePiece tokenizer
(faster and fully spec-compliant). Otherwise it falls back to a minimal
pure-JS tokenizer that handles the XLM-R vocabulary.

## Quick start (browser)

```ts
import { InfonCorefModel } from '@cp500/infon-coref';

const model = await InfonCorefModel.fromHub('cp500/infon-coref-pointer', {
  precision: 'fp16',   // 'fp16' (default, ~235 MB) or 'fp32' (~470 MB)
  device: 'auto',      // tries WebGPU, falls back to WASM
});

const result = await model.resolve(
  'Toyota announced a partnership with Panasonic on battery technology. ' +
  'The Japanese automaker said the deal is worth $250 million.'
);

for (const cluster of result.clusters) {
  const surfaces = cluster.map(i => result.mentions[i].text);
  console.log(surfaces.join('  ↔  '));
  // Toyota  ↔  The Japanese automaker
}
```

## Quick start (Node)

```ts
import { InfonCorefModel } from '@cp500/infon-coref';

// Same API as fromHub, but reads from local files (e.g. after a
// huggingface-cli download).
const model = await InfonCorefModel.fromLocal('./models/infon-coref/');
const result = await model.resolve('Toyota e Panasonic anunciaram...');
```

## What you get back

```ts
interface CorefResult {
  text: string;                 // original input, unchanged
  tokens: Token[];              // wordpieces with char offsets
  mentions: Mention[];          // detected mentions in document order
  clusters: number[][];         // clusters[c] = list of mention indices
  timing: {
    tokenize: number;
    backbone: number;
    bioDecode: number;
    scorer: number;
    total: number;              // ms
  };
}

interface Mention {
  start: number;                // wordpiece index, inclusive
  end: number;                  // wordpiece index, inclusive
  charStart: number;            // char offset in source text
  charEnd: number;
  text: string;                 // literal substring of source text
  cluster: number;              // -1 for singleton
  antecedent: number;           // 0-based mention index, -1 = no antecedent
}
```
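
As a rough illustration of how these fields fit together, the sketch below walks `clusters` and `mentions` on a hand-built result. The `MentionLike` and `CorefResultLike` types are trimmed local copies of the interfaces above so the snippet runs without the package, and `clusterSurfaces` and `singletonMentions` are hypothetical helpers, not package exports.

```typescript
// Trimmed local copies of the interfaces above (only the fields used here).
interface MentionLike {
  text: string;
  cluster: number; // -1 for singleton
}

interface CorefResultLike {
  mentions: MentionLike[];
  clusters: number[][]; // clusters[c] = list of mention indices
}

// Hypothetical helper: one array of surface strings per cluster.
function clusterSurfaces(result: CorefResultLike): string[][] {
  return result.clusters.map(indices =>
    indices.map(i => result.mentions[i].text)
  );
}

// Hypothetical helper: indices of mentions that belong to no cluster.
function singletonMentions(result: CorefResultLike): number[] {
  return result.mentions
    .map((_, i) => i)
    .filter(i => result.mentions[i].cluster === -1);
}

// Hand-built data for illustration only; this is not real model output.
const exampleResult: CorefResultLike = {
  mentions: [
    { text: 'Toyota', cluster: 0 },
    { text: 'a deal', cluster: -1 },
    { text: 'The automaker', cluster: 0 },
  ],
  clusters: [[0, 2]],
};

const exampleSurfaces = clusterSurfaces(exampleResult);
// [['Toyota', 'The automaker']]
const exampleSingletons = singletonMentions(exampleResult);
// [1]
```

Because `clusters` stores indices rather than copies, the same lookup gives access to `charStart` and `charEnd` when you need to highlight spans in the source text.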

## Languages

Trained on synthetic Bedrock/Claude-generated data balanced across:

| Code | Language             |
|------|----------------------|
| `en` | English              |
| `ja` | Japanese             |
| `ko` | Korean               |
| `th` | Thai                 |
| `zh` | Chinese (Simplified) |

The XLM-R backbone covers ~100 languages, but the mention-detection and
pointer-network heads were trained only on these five. Other languages
may work via zero-shot transfer; verify on your domain before shipping.

## API

### `InfonCorefModel.fromHub(repo, options?)`

Load model artefacts from a Hugging Face repo. Downloads `meta.json`,
the chosen ONNX backbone, the mention scorer, and `tokenizer.json`,
and caches them via the Cache API when running in a browser.

| Option         | Type                                    | Default   | Notes |
|----------------|-----------------------------------------|-----------|-------|
| `precision`    | `'fp32' \| 'fp16'`                      | `'fp16'`  | FP16 halves the download. Falls back to FP32 if FP16 is missing in the repo. |
| `device`       | `'auto' \| 'webgpu' \| 'wasm' \| 'cpu' \| 'cuda'` | `'auto'` | Browser auto-prefers WebGPU. |
| `maxLength`    | `number`                                | `256`     | Truncates inputs longer than N wordpieces. |
| `bioThreshold` | `number`                                | none      | If set, suppresses low-confidence span detections. `0.7` is a common stricter setting. |
| `revision`     | `string`                                | `'main'`  | HF branch/tag/commit-SHA pin. |
| `debug`        | `boolean`                               | `false`   | Logs per-stage timings to `console.debug`. |
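
To make the defaults in the table concrete, here is a minimal sketch of how an options object could be merged with them. `LoadOptionsSketch`, `TABLE_DEFAULTS`, and `withTableDefaults` are hypothetical names used for illustration; the package's internal option handling may differ.

```typescript
type PrecisionOption = 'fp32' | 'fp16';
type DeviceOption = 'auto' | 'webgpu' | 'wasm' | 'cpu' | 'cuda';

// Mirrors the option table above; every field is optional for the caller.
interface LoadOptionsSketch {
  precision?: PrecisionOption;
  device?: DeviceOption;
  maxLength?: number;
  bioThreshold?: number;
  revision?: string;
  debug?: boolean;
}

// Defaults as listed in the table. bioThreshold has no default, so it is
// left out and stays undefined unless the caller sets it.
const TABLE_DEFAULTS = {
  precision: 'fp16' as PrecisionOption,
  device: 'auto' as DeviceOption,
  maxLength: 256,
  revision: 'main',
  debug: false,
};

// Hypothetical helper: caller-supplied values win over the table defaults.
// Note that an explicitly passed `undefined` also overrides a default.
function withTableDefaults(options: LoadOptionsSketch = {}) {
  return { ...TABLE_DEFAULTS, ...options };
}

// A stricter configuration: shorter inputs, fewer low-confidence spans.
const strictOptions = withTableDefaults({ bioThreshold: 0.7, maxLength: 128 });
// strictOptions.precision is 'fp16', strictOptions.maxLength is 128
```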

### `InfonCorefModel.fromLocal(baseUrl, options?)`

Same as `fromHub` but loads files relative to a base URL or
filesystem path. Browser: `baseUrl` is a URL prefix
(`/models/coref/`). Node: a directory path (`./models/coref/`).

The directory must contain:

```
meta.json
tokenizer.json
onnx/backbone_bio.onnx               (and .onnx.data sidecar if present)
onnx/backbone_bio_fp16.onnx
onnx/mention_scorer.onnx
onnx/mention_scorer_fp16.onnx
```
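
The naming pattern in this listing can be sketched as a small lookup. `expectedModelFiles` and `joinBase` are hypothetical helpers, and the sketch assumes that only the files for the selected precision are actually read, which this README does not state.

```typescript
type PrecisionChoice = 'fp32' | 'fp16';

// Hypothetical helper: relative paths for one precision, following the
// listing above (FP16 files carry a `_fp16` suffix, FP32 files carry none).
function expectedModelFiles(precision: PrecisionChoice): string[] {
  const suffix = precision === 'fp16' ? '_fp16' : '';
  return [
    'meta.json',
    'tokenizer.json',
    `onnx/backbone_bio${suffix}.onnx`,
    `onnx/mention_scorer${suffix}.onnx`,
  ];
}

// Hypothetical helper: join a base URL prefix or directory path with a
// relative file path, tolerating a missing trailing slash on the base.
function joinBase(base: string, relative: string): string {
  return base.endsWith('/') ? base + relative : `${base}/${relative}`;
}

const fp16Paths = expectedModelFiles('fp16').map(f => joinBase('/models/coref/', f));
// ['/models/coref/meta.json', '/models/coref/tokenizer.json',
//  '/models/coref/onnx/backbone_bio_fp16.onnx',
//  '/models/coref/onnx/mention_scorer_fp16.onnx']
```

A list like this is handy for a pre-flight check that a self-hosted directory is complete before calling `fromLocal`.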

### `model.resolve(text, options?)`

Run end-to-end coref on a single document. Returns
[`CorefResult`](#what-you-get-back).

`options` accepts per-call overrides for `maxLength`, `bioThreshold`,
and `debug`, with the same meaning as in `fromHub`.

## Power-user exports

If you want to swap one stage of the pipeline (e.g. a custom
tokenizer or a different ORT runtime), the helpers are exported
individually:

```ts
import {
  buildPairs,            // M mentions → flat (pair_i, pair_j) tensors
  decodeBio,             // BIO logits → wordpiece spans
  groupClusters,         // antecedent decisions → union-find clusters
  loadTokenizer,         // SentencePiece JSON → Tokenizer
  fetchHubFile,          // HF Hub fetch + browser-cache
} from '@cp500/infon-coref';
```

These match the Python reference implementation in
[`scripts/coref_onnx_experiment.py`](https://github.com/cp500/overlord/blob/main/infon/scripts/coref_onnx_experiment.py)
exactly, which is useful when comparing the Python and TS pipelines at
the intermediate-tensor level.
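
For readers who want a feel for what two of these helpers do, here is a minimal sketch of triangular pair construction and union-find clustering. `buildPairsSketch` and `groupClustersSketch` are hypothetical stand-ins, not the exported functions: they use plain `number[]` instead of tensors, and the pair ordering and the dropping of singletons are assumptions (the latter is consistent with `cluster: -1` for singletons, but not confirmed here).

```typescript
// Every mention i is paired with every earlier mention j < i.
// For M mentions this yields M * (M - 1) / 2 pairs.
function buildPairsSketch(mentionCount: number): { pairI: number[]; pairJ: number[] } {
  const pairI: number[] = [];
  const pairJ: number[] = [];
  for (let i = 1; i < mentionCount; i++) {
    for (let j = 0; j < i; j++) {
      pairI.push(i);
      pairJ.push(j);
    }
  }
  return { pairI, pairJ };
}

// antecedents[i] is the mention index that i links to, or -1 for none.
// Returns clusters with at least two members, ordered by first mention.
function groupClustersSketch(antecedents: number[]): number[][] {
  const parent = antecedents.map((_, i) => i);
  const find = (x: number): number => {
    while (parent[x] !== x) {
      parent[x] = parent[parent[x]]; // path halving
      x = parent[x];
    }
    return x;
  };
  // Union each mention with its chosen antecedent.
  antecedents.forEach((a, i) => {
    if (a >= 0) parent[find(i)] = find(a);
  });
  // Collect members per root; Map keeps insertion order.
  const byRoot = new Map<number, number[]>();
  antecedents.forEach((_, i) => {
    const root = find(i);
    let members = byRoot.get(root);
    if (!members) {
      members = [];
      byRoot.set(root, members);
    }
    members.push(i);
  });
  return Array.from(byRoot.values()).filter(m => m.length > 1);
}

const pairsForFour = buildPairsSketch(4);
// pairI: [1, 2, 2, 3, 3, 3], pairJ: [0, 0, 1, 0, 1, 2]
const groupedExample = groupClustersSketch([-1, -1, 0, 1, 2]);
// [[0, 2, 4], [1, 3]]
```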

## Architecture

```
┌─────────────────────────┐
│  text                   │
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  SentencePiece tokenize │   tokenizer.json (XLM-R vocab)
└────────────┬────────────┘
             ▼   input_ids, attention_mask
┌─────────────────────────┐
│  backbone_bio.onnx      │   MiniLM-L12 (12 layers, H=384)
│   • XLM-R encoder       │   + 3-class BIO head
│   • bio_logits (T,3)    │
└────────┬────────┬───────┘
         │        │
         │        ▼  bio_logits → run-length decode → spans
         │  ┌──────────────────────┐
         │  │  decodeBio (TS)      │
         │  └──────────┬───────────┘
         │             ▼  span_starts, span_ends
         │  ┌──────────────────────┐
         │  │  buildPairs (TS)     │
         │  └──────────┬───────────┘
         │             ▼  pair_i, pair_j (triangular)
         ▼             ▼
┌─────────────────────────┐
│  mention_scorer.onnx    │   gather + segment-mean pool +
│   • pair_scores (P,)    │   3-vector pair MLP
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  pickAntecedents (TS)   │
│  + groupClusters (TS)   │
└────────────┬────────────┘
             ▼
        CorefResult
```

The split between the two ONNX graphs exists so the BIO head can
share computation with the backbone (one forward pass), while the
mention scorer can be re-run with different `(pair_i, pair_j)`
batches without recomputing hidden states. It also keeps each ONNX
file's input signature simple enough to trace cleanly.
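
The TypeScript glue between and after the two graphs can be sketched roughly as follows. This is an illustration under stated assumptions, not the package's code: the BIO class order (`O = 0`, `B = 1`, `I = 2`), the handling of a stray `I` tag (ignored), and the rule that a pair score must exceed `0` to create a link are all assumptions, and `decodeBioSketch` and `pickAntecedentsSketch` are hypothetical names.

```typescript
// Assumed class order of the 3-class BIO head; the real order may differ.
const BIO_CLASSES = { O: 0, B: 1, I: 2 } as const;

function argmax(row: number[]): number {
  let best = 0;
  for (let k = 1; k < row.length; k++) {
    if (row[k] > row[best]) best = k;
  }
  return best;
}

// bio_logits (T,3) -> wordpiece spans with inclusive start and end.
function decodeBioSketch(logits: number[][]): { starts: number[]; ends: number[] } {
  const starts: number[] = [];
  const ends: number[] = [];
  let open = -1; // start of the span being extended, or -1 if none
  const close = (lastToken: number) => {
    if (open >= 0) {
      starts.push(open);
      ends.push(lastToken);
    }
    open = -1;
  };
  logits.forEach((row, t) => {
    const tag = argmax(row);
    if (tag === BIO_CLASSES.B) {
      close(t - 1); // a new B ends any span that was open
      open = t;
    } else if (tag !== BIO_CLASSES.I || open < 0) {
      close(t - 1); // O, or a stray I with no open span (assumed: ignored)
    }
    // otherwise: an I tag that continues the open span
  });
  close(logits.length - 1);
  return { starts, ends };
}

// pair_scores (P,) -> best earlier mention per mention, or -1.
// Assumes a score must be greater than 0 to link (an assumed threshold).
function pickAntecedentsSketch(
  pairI: number[],
  pairJ: number[],
  scores: number[],
  mentionCount: number
): number[] {
  const best = new Array<number>(mentionCount).fill(-1);
  const bestScore = new Array<number>(mentionCount).fill(0);
  for (let p = 0; p < scores.length; p++) {
    const i = pairI[p];
    if (scores[p] > bestScore[i]) {
      bestScore[i] = scores[p];
      best[i] = pairJ[p];
    }
  }
  return best;
}
```

With three mentions and pairs `(1,0), (2,0), (2,1)` scored `[2.0, -1.0, 0.5]`, this sketch links mention 1 to 0 and mention 2 to 1, leaving mention 0 without an antecedent.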

## Performance ballpark

Numbers from a 2024 M1 Pro MacBook on a 110-token English document:

| Stage     | WASM (FP16) | WebGPU (FP16) | Node CPU (FP16) |
|-----------|-------------|---------------|-----------------|
| Tokenize  | 4 ms        | 4 ms          | 2 ms            |
| Backbone  | 220 ms      | 70 ms         | 90 ms           |
| BIO       | <1 ms       | <1 ms         | <1 ms           |
| Scorer    | 5 ms        | 4 ms          | 2 ms            |
| **Total** | **~230 ms** | **~80 ms**    | **~95 ms**      |

The first call adds ~2-4 s for ONNX session warmup. In browsers, the
Cache API persists the downloaded model, so after a reload the startup
cost is limited to session creation.

## License

Apache 2.0. The trained weights at `cp500/infon-coref-pointer` carry
the same license; the underlying MiniLM-L12 backbone is also Apache
2.0.

## Status

Alpha. The API is stable enough to integrate behind your own
abstraction; expect minor breaking changes on the public class
shape until 1.0.

Issue tracker: https://github.com/cp500/infon-coref-js/issues