
@cp500/infon-coref

Multilingual coreference resolution in the browser or Node, via ONNX.

The trained model is a pointer-network coref resolver fine-tuned on top of a multilingual MiniLM-L12 distilled from XLM-R. It handles English, Japanese, Korean, Thai, and Chinese, and replaces English-only fastcoref for use cases that need multilingual coverage.

The model artefacts live at cp500/infon-coref-pointer on the Hugging Face Hub. This package is the JavaScript client that loads them.

Install

npm install @cp500/infon-coref onnxruntime-web
# or for Node:
npm install @cp500/infon-coref onnxruntime-node

The ONNX runtime is a peer dependency so you only install the one your environment needs. @huggingface/tokenizers is optional; if installed, we use its WASM SentencePiece tokenizer (faster and fully spec-compliant). Otherwise the package falls back to a minimal pure-JS tokenizer that handles the XLM-R vocabulary.

Quick start (browser)

import { InfonCorefModel } from '@cp500/infon-coref';

const model = await InfonCorefModel.fromHub('cp500/infon-coref-pointer', {
  precision: 'fp16',   // 'fp16' (default, ~235 MB) or 'fp32' (~470 MB)
  device: 'auto',      // tries WebGPU, falls back to WASM
});

const result = await model.resolve(
  'Toyota announced a partnership with Panasonic on battery technology. ' +
  'The Japanese automaker said the deal is worth $250 million.'
);

for (const cluster of result.clusters) {
  const surfaces = cluster.map(i => result.mentions[i].text);
  console.log(surfaces.join('  ↔  '));
  // Toyota  ↔  The Japanese automaker
}

Quick start (Node)

import { InfonCorefModel } from '@cp500/infon-coref';

// Same API as fromHub, but reads from local files (e.g. after a
// huggingface-cli download).
const model = await InfonCorefModel.fromLocal('./models/infon-coref/');
const result = await model.resolve('Toyota e Panasonic anunciaram...');

What you get back

interface CorefResult {
  text: string;                 // original input, unchanged
  tokens: Token[];              // wordpieces with char offsets
  mentions: Mention[];          // detected mentions in document order
  clusters: number[][];         // clusters[c] = list of mention indices
  timing: {
    tokenize: number;
    backbone: number;
    bioDecode: number;
    scorer: number;
    total: number;              // ms
  };
}

interface Mention {
  start: number;                // wordpiece index, inclusive
  end: number;                  // wordpiece index, inclusive
  charStart: number;            // char offset in source text
  charEnd: number;
  text: string;                 // literal substring of source text
  cluster: number;              // -1 for singleton
  antecedent: number;           // 0-based mention index, -1 = no antecedent
}
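
As a rough illustration of consuming this shape, the sketch below maps every clustered mention to the first mention of its cluster. The MiniMention and MiniResult types and the hand-written sample are assumptions that mirror only the fields used here; the sample is not real model output.

```typescript
// Minimal subset of the documented result shape (assumed for illustration).
interface MiniMention { text: string; }
interface MiniResult { mentions: MiniMention[]; clusters: number[][]; }

// Map each clustered mention index to the text of the earliest mention
// in the same cluster. Mentions outside any cluster get no entry.
function representativeByMention(result: MiniResult): Map<number, string> {
  const out = new Map<number, string>();
  for (const cluster of result.clusters) {
    if (cluster.length === 0) continue;
    const first = Math.min(...cluster); // mentions are in document order
    for (const i of cluster) out.set(i, result.mentions[first].text);
  }
  return out;
}

// Hand-written sample, not real model output.
const sample: MiniResult = {
  mentions: [
    { text: 'Toyota' },
    { text: 'Panasonic' },
    { text: 'The Japanese automaker' },
  ],
  clusters: [[0, 2]],
};

const reps = representativeByMention(sample);
console.log(reps.get(2)); // Toyota
```

A map like this is one way to rewrite later mentions with their first surface form before passing text to a downstream step.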

Languages

Trained on synthetic Bedrock/Claude-generated data balanced across:

| Code | Language |
| --- | --- |
| en | English |
| ja | Japanese |
| ko | Korean |
| th | Thai |
| zh | Chinese (Simplified) |

The XLM-R backbone covers ~100 languages, but the mention-detection and pointer-network heads were trained only on these five. Other languages may work via zero-shot transfer; verify on your domain before shipping.

API

InfonCorefModel.fromHub(repo, options?)

Load model artefacts from a Hugging Face repo. Downloads (and caches in the browser Cache API) meta.json, the chosen ONNX backbone, the mention scorer, and tokenizer.json.

| Option | Type | Default | Notes |
| --- | --- | --- | --- |
| precision | 'fp32' \| 'fp16' | 'fp16' | FP16 halves the download. Falls back to FP32 if FP16 is missing in the repo. |
| device | 'auto' \| 'webgpu' \| 'wasm' \| 'cpu' \| 'cuda' | 'auto' | Browser auto-prefers WebGPU. |
| maxLength | number | 256 | Truncates inputs longer than N wordpieces. |
| bioThreshold | number | none | If set, suppresses low-confidence span detections. 0.7 is a common stricter setting. |
| revision | string | 'main' | HF branch/tag/commit-SHA pin. |
| debug | boolean | false | Logs per-stage timings to console.debug. |
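
This README does not spell out how bioThreshold is applied. One plausible reading, sketched here purely as an assumption, is to softmax each token's BIO logits and demote the token to O when its winning non-O probability falls below the threshold. The label order (O=0, B=1, I=2) and both helper names are hypothetical.

```typescript
// Numerically stable softmax over one token's logits.
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map(v => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(v => v / sum);
}

// Assumed label order: 0 = O, 1 = B, 2 = I.
function thresholdLabels(bioLogits: number[][], threshold?: number): number[] {
  return bioLogits.map(row => {
    const probs = softmax(row);
    const label = probs.indexOf(Math.max(...probs)); // argmax
    if (threshold !== undefined && label !== 0 && probs[label] < threshold) {
      return 0; // not confident enough: treat the token as outside any mention
    }
    return label;
  });
}

// Made-up logits for three tokens.
const logits = [
  [0.1, 4.0, 0.2], // confident B
  [1.0, 1.2, 0.9], // weak B
  [5.0, 0.0, 0.0], // confident O
];
const loose = thresholdLabels(logits);       // no threshold: plain argmax
const strict = thresholdLabels(logits, 0.7); // weak B is suppressed
console.log(loose, strict);
```

Under this reading, a higher threshold trades recall for precision at the mention-detection stage, before any pair scoring happens.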

InfonCorefModel.fromLocal(baseUrl, options?)

Same as fromHub but loads files relative to a base URL or filesystem path. Browser: baseUrl is a URL prefix (/models/coref/). Node: a directory path (./models/coref/).

The directory must contain:

meta.json
tokenizer.json
onnx/backbone_bio.onnx               (and .onnx.data sidecar if present)
onnx/backbone_bio_fp16.onnx
onnx/mention_scorer.onnx
onnx/mention_scorer_fp16.onnx
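
When mirroring these files by hand, it can help to know where they sit on the Hub. The sketch below builds download URLs using the Hub's resolve endpoint. The file names come from the list above; the rule that FP16 files carry a _fp16 suffix is inferred from that list, and hubFileUrl and requiredFiles are hypothetical helpers, not package exports.

```typescript
// The Hub serves raw repo files at /{repo}/resolve/{revision}/{path}.
function hubFileUrl(repo: string, path: string, revision = 'main'): string {
  return `https://huggingface.co/${repo}/resolve/${encodeURIComponent(revision)}/${path}`;
}

// File names taken from the directory listing; the suffix rule is an assumption.
function requiredFiles(precision: 'fp32' | 'fp16'): string[] {
  const suffix = precision === 'fp16' ? '_fp16' : '';
  return [
    'meta.json',
    'tokenizer.json',
    `onnx/backbone_bio${suffix}.onnx`,
    `onnx/mention_scorer${suffix}.onnx`,
  ];
}

const urls = requiredFiles('fp16').map(p =>
  hubFileUrl('cp500/infon-coref-pointer', p),
);
console.log(urls.join('\n'));
```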

model.resolve(text, options?)

Run end-to-end coref on a single document. Returns CorefResult.

options accepts per-call overrides for three of the fromHub options: maxLength, bioThreshold, and debug.

Power-user exports

If you want to swap one stage of the pipeline (e.g. a custom tokenizer or a different ORT runtime), the helpers are exported individually:

import {
  buildPairs,            // mention M → flat (pair_i, pair_j) tensors
  decodeBio,             // BIO logits → wordpiece spans
  groupClusters,         // antecedent decisions → union-find clusters
  loadTokenizer,         // SentencePiece JSON → Tokenizer
  fetchHubFile,          // HF Hub fetch + browser-cache
} from '@cp500/infon-coref';

These match the Python reference implementation in scripts/coref_onnx_experiment.py exactly, which is useful when comparing the Python and TS pipelines at the intermediate-tensor level.
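
To make the BIO stage concrete, here is a minimal sketch of run-length decoding from per-token logits to inclusive wordpiece spans. It is not the package's decodeBio. The label order (O=0, B=1, I=2) and the choice to drop an I that has no preceding B are both assumptions.

```typescript
interface Span { start: number; end: number; } // wordpiece indices, both inclusive

// Assumed label order: 0 = O, 1 = B, 2 = I.
function decodeBioSketch(bioLogits: number[][]): Span[] {
  const spans: Span[] = [];
  let current: Span | null = null;
  for (let t = 0; t < bioLogits.length; t++) {
    const row = bioLogits[t];
    const label = row.indexOf(Math.max(...row)); // argmax over the 3 classes
    if (label === 1) {
      // B: close any open span and start a new one at this token.
      if (current) spans.push(current);
      current = { start: t, end: t };
    } else if (label === 2 && current) {
      // I: extend the open span.
      current.end = t;
    } else {
      // O, or an I with no open span: close whatever is open.
      if (current) spans.push(current);
      current = null;
    }
  }
  if (current) spans.push(current);
  return spans;
}

// Made-up one-hot-like logits for the sequence B O O B I I O.
const O = [2, 0, 0], B = [0, 2, 0], I = [0, 0, 2];
const spans = decodeBioSketch([B, O, O, B, I, I, O]);
console.log(spans); // [ { start: 0, end: 0 }, { start: 3, end: 5 } ]
```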

Architecture

┌─────────────────────────┐
│  text                   │
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  SentencePiece tokenize │   tokenizer.json (XLM-R vocab)
└────────────┬────────────┘
             ▼   input_ids, attention_mask
┌─────────────────────────┐
│  backbone_bio.onnx      │   MiniLM-L12 (12 layers, H=384)
│   • XLM-R encoder       │   + 3-class BIO head
│   • bio_logits (T,3)    │
└────────┬────────┬───────┘
         │        │
         │        ▼  bio_logits → run-length decode → spans
         │  ┌──────────────────────┐
         │  │  decodeBio (TS)      │
         │  └──────────┬───────────┘
         │             ▼  span_starts, span_ends
         │  ┌──────────────────────┐
         │  │  buildPairs (TS)     │
         │  └──────────┬───────────┘
         │             ▼  pair_i, pair_j (triangular)
         ▼             ▼
┌─────────────────────────┐
│  mention_scorer.onnx    │   gather + segment-mean pool +
│   • pair_scores (P,)    │   3-vector pair MLP
└────────────┬────────────┘
             ▼
┌─────────────────────────┐
│  pickAntecedents (TS)   │
│  + groupClusters (TS)   │
└────────────┬────────────┘
             ▼
        CorefResult

The split between the two ONNX graphs exists so the BIO head can share computation with the backbone (one forward pass), while the mention scorer can be re-run with different (pair_i, pair_j) batches without recomputing hidden states. It also keeps each ONNX file's input signature simple enough to trace cleanly.
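
The TypeScript stages around the scorer can be sketched as follows. This is an illustration under stated assumptions, not the package's code: pair_i is taken to be the later mention and pair_j an earlier candidate, a score cutoff of 0 decides whether a mention gets an antecedent, and singletons are left out of the cluster list. All three function names are hypothetical.

```typescript
// All (i, j) with j < i: mention i paired with each earlier candidate j.
function buildPairsSketch(m: number): { pairI: number[]; pairJ: number[] } {
  const pairI: number[] = [];
  const pairJ: number[] = [];
  for (let i = 1; i < m; i++) {
    for (let j = 0; j < i; j++) { pairI.push(i); pairJ.push(j); }
  }
  return { pairI, pairJ };
}

// For each mention keep the best-scoring earlier mention, or -1 when no
// score clears the cutoff (the cutoff of 0 is an assumption).
function pickAntecedentsSketch(
  m: number, pairI: number[], pairJ: number[], scores: number[], cutoff = 0,
): number[] {
  const antecedent = new Array<number>(m).fill(-1);
  const best = new Array<number>(m).fill(cutoff);
  for (let p = 0; p < scores.length; p++) {
    const i = pairI[p];
    if (scores[p] > best[i]) { best[i] = scores[p]; antecedent[i] = pairJ[p]; }
  }
  return antecedent;
}

// Union-find over antecedent links; singletons are dropped.
function groupClustersSketch(antecedent: number[]): number[][] {
  const parent = antecedent.map((_, i) => i);
  const find = (x: number): number => {
    while (parent[x] !== x) { parent[x] = parent[parent[x]]; x = parent[x]; }
    return x;
  };
  antecedent.forEach((a, i) => { if (a >= 0) parent[find(i)] = find(a); });
  const groups = new Map<number, number[]>();
  antecedent.forEach((_, i) => {
    const root = find(i);
    const g = groups.get(root);
    if (g) g.push(i); else groups.set(root, [i]);
  });
  return Array.from(groups.values()).filter(g => g.length > 1);
}

const { pairI, pairJ } = buildPairsSketch(4);
// Pair order: (1,0) (2,0) (2,1) (3,0) (3,1) (3,2)
const scores = [-1.0, 2.5, 0.3, -0.4, 1.1, -2.0]; // made-up scorer output
const antecedent = pickAntecedentsSketch(4, pairI, pairJ, scores);
const clusters = groupClustersSketch(antecedent);
console.log(antecedent, clusters); // [ -1, -1, 0, 1 ] [ [ 0, 2 ], [ 1, 3 ] ]
```

For M mentions the pair list has M(M-1)/2 entries, which is why the scorer can be batched separately from the backbone.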

Performance ballpark

Numbers from a 2024 M1 Pro MacBook on a 110-token English document:

| Stage | WASM (FP16) | WebGPU (FP16) | Node CPU (FP16) |
| --- | --- | --- | --- |
| Tokenize | 4 ms | 4 ms | 2 ms |
| Backbone | 220 ms | 70 ms | 90 ms |
| BIO | <1 ms | <1 ms | <1 ms |
| Scorer | 5 ms | 4 ms | 2 ms |
| Total | ~230 ms | ~80 ms | ~95 ms |

The first call adds ~2-4 s for ONNX session warmup. In browsers, the Cache API persists the downloaded model, so after a page reload the only startup cost is session creation.

License

Apache 2.0. The trained weights at cp500/infon-coref-pointer carry the same license; the underlying MiniLM-L12 backbone is also Apache 2.0.

Status

Alpha. The API is stable enough to integrate behind your own abstraction; expect minor breaking changes on the public class shape until 1.0.

Issue tracker: https://github.com/cp500/infon-coref-js/issues