@cp500/infon-coref

Multilingual coreference resolution in the browser or Node, via ONNX.

The trained model is a pointer-network coref resolver fine-tuned on top of a multilingual MiniLM-L12 distilled from XLM-R. It handles English, Japanese, Korean, Thai, and Chinese, and is intended to replace the English-only fastcoref in use cases that need multilingual coverage.

The model artefacts live at cp500/infon-coref-pointer on the Hugging Face Hub. This package is the JavaScript client that loads them.

Install

npm install @cp500/infon-coref onnxruntime-web
# or for Node:
npm install @cp500/infon-coref onnxruntime-node

The ONNX runtime is a peer dependency, so you install only the one your environment needs. @huggingface/tokenizers is optional: if it is installed, the package uses its WASM SentencePiece tokenizer (faster and fully spec-compliant). Otherwise the package falls back to a minimal pure-JS tokenizer that handles the XLM-R vocabulary.

Quick start (browser)

import { InfonCorefModel } from '@cp500/infon-coref';

const model = await InfonCorefModel.fromHub('cp500/infon-coref-pointer', {
  precision: 'fp16',   // 'fp16' (default, ~235 MB) or 'fp32' (~470 MB)
  device: 'auto',      // tries WebGPU, falls back to WASM
});

const result = await model.resolve(
  'Toyota announced a partnership with Panasonic on battery technology. ' +
  'The Japanese automaker said the deal is worth $250 million.'
);

for (const cluster of result.clusters) {
  const surfaces = cluster.map(i => result.mentions[i].text);
  console.log(surfaces.join('  ↔  '));
  // Toyota  ↔  The Japanese automaker
}

Quick start (Node)

import { InfonCorefModel } from '@cp500/infon-coref';

// Same API as fromHub, but reads from local files (e.g. after a
// huggingface-cli download).
const model = await InfonCorefModel.fromLocal('./models/infon-coref/');
const result = await model.resolve('Toyota e Panasonic anunciaram...');

What you get back

interface CorefResult {
  text: string;                 // original input, unchanged
  tokens: Token[];              // wordpieces with char offsets
  mentions: Mention[];          // detected mentions in document order
  clusters: number[][];         // clusters[c] = list of mention indices
  timing: {                     // all values in ms
    tokenize: number;
    backbone: number;
    bioDecode: number;
    scorer: number;
    total: number;
  };
}

interface Mention {
  start: number;                // wordpiece index, inclusive
  end: number;                  // wordpiece index, inclusive
  charStart: number;            // char offset in source text
  charEnd: number;
  text: string;                 // literal substring of source text
  cluster: number;              // -1 for singleton
  antecedent: number;           // 0-based mention index, -1 = no antecedent
}
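
As a small illustration of how these fields fit together, the sketch below groups mention surfaces by cluster and lists the singletons. The `CorefResultLike` and `MentionLike` types and the `sample` object are hypothetical stand-ins written by hand in the documented shape, not real model output, and the code assumes that singleton mentions (cluster -1) do not appear in `clusters`.

```typescript
// Local copies of only the fields used here; the real types come from the package.
interface MentionLike {
  text: string;
  cluster: number; // -1 for singleton
}
interface CorefResultLike {
  mentions: MentionLike[];
  clusters: number[][];
}

/** Surface strings per cluster, in the order the clusters array lists them. */
function clusterSurfaces(result: CorefResultLike): string[][] {
  return result.clusters.map(indices => indices.map(i => result.mentions[i].text));
}

/** Surface strings of mentions that belong to no cluster. */
function singletonSurfaces(result: CorefResultLike): string[] {
  return result.mentions.filter(m => m.cluster === -1).map(m => m.text);
}

// Hand-written sample in the documented shape (hypothetical, not model output).
const sample: CorefResultLike = {
  mentions: [
    { text: 'Toyota', cluster: 0 },
    { text: 'Panasonic', cluster: -1 },
    { text: 'The Japanese automaker', cluster: 0 },
  ],
  clusters: [[0, 2]],
};

console.log(clusterSurfaces(sample));
console.log(singletonSurfaces(sample));
```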

Languages

Trained on synthetic Bedrock/Claude-generated data balanced across:

| Code | Language |
| --- | --- |
| `en` | English |
| `ja` | Japanese |
| `ko` | Korean |
| `th` | Thai |
| `zh` | Chinese (Simplified) |

The XLM-R backbone covers ~100 languages, but the mention-detection and pointer-network heads were trained only on these five. Other languages may work via zero-shot transfer; verify on your domain before shipping.

API

`InfonCorefModel.fromHub(repo, options?)`

Load model artefacts from a Hugging Face repo. Downloads (and caches in the browser Cache API) meta.json, the chosen ONNX backbone, the mention scorer, and tokenizer.json.

| Option | Type | Default | Notes |
| --- | --- | --- | --- |
| `precision` | `'fp32' \| 'fp16'` | `'fp16'` | FP16 halves the download. Falls back to FP32 if FP16 is missing in the repo. |
| `device` | `'auto' \| 'webgpu' \| 'wasm' \| 'cpu' \| 'cuda'` | `'auto'` | Browser auto-prefers WebGPU. |
| `maxLength` | `number` | `256` | Truncates inputs longer than N wordpieces. |
| `bioThreshold` | `number` | none | If set, suppresses low-confidence span detections. `0.7` is a common stricter setting. |
| `revision` | `string` | `'main'` | HF branch/tag/commit-SHA pin. |
| `debug` | `boolean` | `false` | Logs per-stage timings to `console.debug`. |

`InfonCorefModel.fromLocal(baseUrl, options?)`

Same as fromHub but loads files relative to a base URL or filesystem path. Browser: baseUrl is a URL prefix (/models/coref/). Node: a directory path (./models/coref/).

The directory must contain:

meta.json
tokenizer.json
onnx/backbone_bio.onnx               (and .onnx.data sidecar if present)
onnx/backbone_bio_fp16.onnx
onnx/mention_scorer.onnx
onnx/mention_scorer_fp16.onnx
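
A quick pre-flight check can catch an incomplete download before the model is loaded. The sketch below is not part of the package: `missingModelFiles` is a hypothetical helper that takes an existence predicate, so the same code works against a filesystem, a URL probe, or an in-memory set as in the example. It checks every file in the list above and skips the optional `.onnx.data` sidecar.

```typescript
// File list copied from the layout above; the optional .onnx.data sidecar is not checked.
const REQUIRED_FILES: readonly string[] = [
  'meta.json',
  'tokenizer.json',
  'onnx/backbone_bio.onnx',
  'onnx/backbone_bio_fp16.onnx',
  'onnx/mention_scorer.onnx',
  'onnx/mention_scorer_fp16.onnx',
];

/** Returns the required files for which `exists` reports false (hypothetical helper). */
function missingModelFiles(exists: (relativePath: string) => boolean): string[] {
  return REQUIRED_FILES.filter(p => !exists(p));
}

// Example with an in-memory listing where only the FP32 graphs were downloaded.
const present = new Set<string>([
  'meta.json',
  'tokenizer.json',
  'onnx/backbone_bio.onnx',
  'onnx/mention_scorer.onnx',
]);
const missing = missingModelFiles(p => present.has(p));
console.log(missing);
```

In Node, the predicate could be `p => fs.existsSync(path.join(dir, p))`, using the built-in `node:fs` and `node:path` modules.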

`model.resolve(text, options?)`

Run end-to-end coref on a single document. Returns CorefResult.

options accepts per-call overrides for maxLength, bioThreshold, and debug, with the same meaning as in fromHub.

Power-user exports

If you want to swap one stage of the pipeline (e.g. a custom tokenizer or a different ORT runtime), the helpers are exported individually:

import {
  buildPairs,            // mention M → flat (pair_i, pair_j) tensors
  decodeBio,             // BIO logits → wordpiece spans
  groupClusters,         // antecedent decisions → union-find clusters
  loadTokenizer,         // SentencePiece JSON → Tokenizer
  fetchHubFile,          // HF Hub fetch + browser-cache
} from '@cp500/infon-coref';

These match the Python reference implementation in scripts/coref_onnx_experiment.py exactly, which is useful when comparing the Python and TS pipelines at the intermediate-tensor level.
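
To give a feel for what the BIO stage does, here is a minimal standalone sketch of run-length BIO decoding. It is not the package's `decodeBio` and makes no claim about its signature: `decodeBioSketch` is a hypothetical name, the class order O=0, B=1, I=2 is an assumption, and ignoring a stray I tag (one with no open span) is one possible policy among several.

```typescript
// Inclusive wordpiece span, matching the start/end convention of Mention.
type Span = { start: number; end: number };

// ASSUMPTION: class order O=0, B=1, I=2. The real order is defined by the
// trained head, so check the Python reference before relying on this.
const BIO = { O: 0, B: 1, I: 2 } as const;

function argmax(row: readonly number[]): number {
  let best = 0;
  for (let k = 1; k < row.length; k++) {
    if (row[k] > row[best]) best = k;
  }
  return best;
}

/** Greedy per-token argmax followed by run-length grouping into spans. */
function decodeBioSketch(logits: readonly (readonly number[])[]): Span[] {
  const spans: Span[] = [];
  let start = -1; // -1 means no span is currently open
  for (let t = 0; t < logits.length; t++) {
    const tag = argmax(logits[t]);
    const continuesSpan = tag === BIO.I && start !== -1;
    if (!continuesSpan && start !== -1) {
      // Any tag other than a continuing I closes the open span at t - 1.
      spans.push({ start, end: t - 1 });
      start = -1;
    }
    if (tag === BIO.B) start = t; // B always opens a new span
    // A stray I (no open span) is ignored under this sketch's policy.
  }
  if (start !== -1) spans.push({ start, end: logits.length - 1 });
  return spans;
}

// Toy logits whose argmax tags are B, I, O, I (stray), B, B.
const toyLogits = [
  [0, 2, 1],
  [0, 1, 2],
  [2, 0, 1],
  [0, 1, 2],
  [0, 2, 1],
  [0, 2, 1],
];
const toySpans = decodeBioSketch(toyLogits);
console.log(toySpans);
```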

Architecture

┌─────────────────────────┐
│  text                   │
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  SentencePiece tokenize │   tokenizer.json (XLM-R vocab)
└────────────┬────────────┘
             ▼   input_ids, attention_mask
┌─────────────────────────┐
│  backbone_bio.onnx      │   MiniLM-L12 (12 layers, H=384)
│   • XLM-R encoder       │   + 3-class BIO head
│   • bio_logits (T,3)    │
└────────┬────────┬───────┘
         │        │
         │        ▼  bio_logits → run-length decode → spans
         │  ┌──────────────────────┐
         │  │  decodeBio (TS)      │
         │  └──────────┬───────────┘
         │             ▼  span_starts, span_ends
         │  ┌──────────────────────┐
         │  │  buildPairs (TS)     │
         │  └──────────┬───────────┘
         │             ▼  pair_i, pair_j (triangular)
         ▼             ▼
┌─────────────────────────┐
│  mention_scorer.onnx    │   gather + segment-mean pool +
│   • pair_scores (P,)    │   3-vector pair MLP
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  pickAntecedents (TS)   │
│  + groupClusters (TS)   │
└────────────┬────────────┘
             ▼
        CorefResult
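
The TypeScript stages around the scorer in the diagram above can be sketched in a few lines. The functions below are hypothetical re-implementations for illustration, not the exported helpers. They assume that pairs are ordered as (i, j) with j < i, that a mention links to its best-scoring earlier mention only when that score exceeds a threshold of 0, and that singletons are left out of the cluster list.

```typescript
/** All (i, j) pairs with j < i, flattened into two parallel arrays. */
function buildPairsSketch(mentionCount: number): { pairI: number[]; pairJ: number[] } {
  const pairI: number[] = [];
  const pairJ: number[] = [];
  for (let i = 1; i < mentionCount; i++) {
    for (let j = 0; j < i; j++) {
      pairI.push(i);
      pairJ.push(j);
    }
  }
  return { pairI, pairJ };
}

/** Best-scoring earlier mention per mention, or -1 if no score beats the threshold. */
function pickAntecedentsSketch(
  mentionCount: number,
  pairI: readonly number[],
  pairJ: readonly number[],
  scores: readonly number[],
  threshold = 0, // ASSUMPTION: a score above 0 means "link"
): number[] {
  const antecedent: number[] = [];
  const best: number[] = [];
  for (let i = 0; i < mentionCount; i++) {
    antecedent.push(-1);
    best.push(threshold);
  }
  for (let p = 0; p < scores.length; p++) {
    const i = pairI[p];
    if (scores[p] > best[i]) {
      best[i] = scores[p];
      antecedent[i] = pairJ[p];
    }
  }
  return antecedent;
}

/** Union-find over antecedent links; returns clusters with at least two mentions. */
function groupClustersSketch(antecedent: readonly number[]): number[][] {
  const parent = antecedent.map((_, i) => i);
  const find = (x: number): number => {
    while (parent[x] !== x) {
      parent[x] = parent[parent[x]]; // path halving
      x = parent[x];
    }
    return x;
  };
  antecedent.forEach((a, i) => {
    if (a >= 0) parent[find(i)] = find(a);
  });
  const members: number[][] = antecedent.map(() => []);
  antecedent.forEach((_, i) => members[find(i)].push(i));
  return members.filter(c => c.length > 1);
}

// Four mentions with made-up scores; only pair (2, 0) scores above the threshold.
const pairs = buildPairsSketch(4);
const madeUpScores = [-2.0, 3.1, -0.5, -1.2, -0.8, -3.0];
const antecedents = pickAntecedentsSketch(4, pairs.pairI, pairs.pairJ, madeUpScores);
const clusters = groupClustersSketch(antecedents);
console.log(antecedents, clusters);
```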

The split between the two ONNX graphs exists so the BIO head can share computation with the backbone (one forward pass), while the mention scorer can be re-run with different (pair_i, pair_j) batches without recomputing hidden states. It also keeps each ONNX file's input signature simple enough to trace cleanly.
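
Because the pair list is separate from the backbone pass, it can be split into batches before scoring. The sketch below shows one way to do that under stated assumptions: `chunkPairs` and `scoreInBatches` are hypothetical helpers, and `scoreBatch` is a stand-in callback for whatever runs mention_scorer.onnx. It is synchronous here only to keep the example self-contained.

```typescript
interface PairBatch {
  pairI: number[];
  pairJ: number[];
}

/** Splits two parallel pair arrays into consecutive batches of at most batchSize pairs. */
function chunkPairs(
  pairI: readonly number[],
  pairJ: readonly number[],
  batchSize: number,
): PairBatch[] {
  if (pairI.length !== pairJ.length) {
    throw new Error('pairI and pairJ must have the same length');
  }
  if (batchSize < 1 || Math.floor(batchSize) !== batchSize) {
    throw new Error('batchSize must be a positive integer');
  }
  const batches: PairBatch[] = [];
  for (let from = 0; from < pairI.length; from += batchSize) {
    batches.push({
      pairI: pairI.slice(from, from + batchSize),
      pairJ: pairJ.slice(from, from + batchSize),
    });
  }
  return batches;
}

/** Scores every batch with the supplied callback and concatenates the results in order. */
function scoreInBatches(
  pairI: readonly number[],
  pairJ: readonly number[],
  batchSize: number,
  scoreBatch: (batch: PairBatch) => number[], // hypothetical scorer callback
): number[] {
  const scores: number[] = [];
  for (const batch of chunkPairs(pairI, pairJ, batchSize)) {
    for (const s of scoreBatch(batch)) scores.push(s);
  }
  return scores;
}

// Stand-in scorer: mention distance instead of a real model score.
const distanceScorer = (b: PairBatch): number[] => b.pairI.map((i, k) => i - b.pairJ[k]);
const demoBatches = chunkPairs([1, 2, 2, 3, 3, 3], [0, 0, 1, 0, 1, 2], 4);
const demoScores = scoreInBatches([1, 2, 2, 3, 3, 3], [0, 0, 1, 0, 1, 2], 4, distanceScorer);
console.log(demoBatches.length, demoScores);
```

With a real ONNX session the callback would be asynchronous, so `scoreInBatches` would need to await each batch.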

Performance ballpark

Numbers from a 2024 M1 Pro MacBook on a 110-token English document:

| Stage | WASM (FP16) | WebGPU (FP16) | Node CPU (FP16) |
| --- | --- | --- | --- |
| Tokenize | 4 ms | 4 ms | 2 ms |
| Backbone | 220 ms | 70 ms | 90 ms |
| BIO | <1 ms | <1 ms | <1 ms |
| Scorer | 5 ms | 4 ms | 2 ms |
| Total | ~230 ms | ~80 ms | ~95 ms |

First call adds ~2-4 s for ONNX session warmup. In browsers, the Cache API persists the downloaded model, so after a page reload the startup cost is limited to session creation.
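
When collecting your own numbers, it helps to keep the warmup call out of the average. The sketch below averages the `timing` objects from several `resolve` calls after dropping the first one. `meanTimingAfterWarmup` is a hypothetical helper, the `Timing` type is a local copy of the documented shape, and the sample values are made up for the example.

```typescript
// Local copy of the documented timing shape (all values in ms).
interface Timing {
  tokenize: number;
  backbone: number;
  bioDecode: number;
  scorer: number;
  total: number;
}

/** Mean per-stage timing over all runs except the first; null if nothing is left. */
function meanTimingAfterWarmup(runs: readonly Timing[]): Timing | null {
  const steady = runs.slice(1); // ASSUMPTION: only the first call pays the warmup cost
  if (steady.length === 0) return null;
  const sum: Timing = { tokenize: 0, backbone: 0, bioDecode: 0, scorer: 0, total: 0 };
  for (const t of steady) {
    sum.tokenize += t.tokenize;
    sum.backbone += t.backbone;
    sum.bioDecode += t.bioDecode;
    sum.scorer += t.scorer;
    sum.total += t.total;
  }
  const n = steady.length;
  return {
    tokenize: sum.tokenize / n,
    backbone: sum.backbone / n,
    bioDecode: sum.bioDecode / n,
    scorer: sum.scorer / n,
    total: sum.total / n,
  };
}

// Made-up timings: one slow warmup run followed by two steady-state runs.
const madeUpRuns: Timing[] = [
  { tokenize: 4, backbone: 2500, bioDecode: 1, scorer: 5, total: 2510 },
  { tokenize: 4, backbone: 220, bioDecode: 1, scorer: 5, total: 230 },
  { tokenize: 4, backbone: 200, bioDecode: 1, scorer: 7, total: 212 },
];
const steadyMean = meanTimingAfterWarmup(madeUpRuns);
console.log(steadyMean);
```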

License

Apache 2.0. The trained weights at cp500/infon-coref-pointer carry the same license; the underlying MiniLM-L12 backbone is also Apache 2.0.

Status

Alpha. The API is stable enough to integrate behind your own abstraction; expect minor breaking changes on the public class shape until 1.0.

Issue tracker: https://github.com/cp500/infon-coref-js/issues