	# @cp500/infon-coref

	Multilingual coreference resolution in the browser or Node, via ONNX.

	The trained model is a pointer-network coref resolver fine-tuned on
	top of a multilingual MiniLM-L12 distilled from XLM-R. It handles
	English, Japanese, Korean, Thai, and Chinese, and is meant to replace
	the English-only [fastcoref](https://github.com/shon-otmazgin/fastcoref)
	in use cases that need multilingual coverage.

	The model artefacts live at
	[cp500/infon-coref-pointer](https://huggingface.co/cp500/infon-coref-pointer)
	on the Hugging Face Hub. This package is the JavaScript client that
	loads them.

	## Install

	```bash
	npm install @cp500/infon-coref onnxruntime-web
	# or for Node:
	npm install @cp500/infon-coref onnxruntime-node
	```

	The ONNX runtime is a peer dependency so you only install the one
	your environment needs. ``@huggingface/tokenizers`` is optional;
	if installed, we use its WASM SentencePiece tokenizer (faster and
	fully spec-compliant). Otherwise the package falls back to a minimal
	pure-JS tokenizer that handles the XLM-R vocabulary.

	## Quick start (browser)

	```ts
	import { InfonCorefModel } from '@cp500/infon-coref';

	const model = await InfonCorefModel.fromHub('cp500/infon-coref-pointer', {
	  precision: 'fp16', // 'fp16' (default, ~235 MB) or 'fp32' (~470 MB)
	  device: 'auto',    // tries WebGPU, falls back to WASM
	});

	const result = await model.resolve(
	  'Toyota announced a partnership with Panasonic on battery technology. ' +
	  'The Japanese automaker said the deal is worth $250 million.'
	);

	for (const cluster of result.clusters) {
	  const surfaces = cluster.map(i => result.mentions[i].text);
	  console.log(surfaces.join(' ↔ '));
	  // Toyota ↔ The Japanese automaker
	}
	```

	## Quick start (Node)

	```ts
	import { InfonCorefModel } from '@cp500/infon-coref';

	// Same API as fromHub, but reads from local files (e.g. after a
	// huggingface-cli download).
	const model = await InfonCorefModel.fromLocal('./models/infon-coref/');
	const result = await model.resolve('Toyota e Panasonic anunciaram...');
	```

	## What you get back

	```ts
	interface CorefResult {
	  text: string;         // original input, unchanged
	  tokens: Token[];      // wordpieces with char offsets
	  mentions: Mention[];  // detected mentions in document order
	  clusters: number[][]; // clusters[c] = list of mention indices
	  timing: {
	    tokenize: number;
	    backbone: number;
	    bioDecode: number;
	    scorer: number;
	    total: number;      // ms
	  };
	}

	interface Mention {
	  start: number;      // wordpiece index, inclusive
	  end: number;        // wordpiece index, inclusive
	  charStart: number;  // char offset in source text
	  charEnd: number;
	  text: string;       // literal substring of source text
	  cluster: number;    // -1 for singleton
	  antecedent: number; // 0-based mention index, -1 = no antecedent
	}
	```
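As a rough illustration of how these fields relate, the sketch below works on a hand-built object shaped like `CorefResult`. The helper names `clusterSurfaces` and `chainHead` are hypothetical and not part of the package, and the wordpiece indices and character offsets in the sample data are made up for illustration (the offsets assume an exclusive `charEnd`, which this README does not state).

```typescript
// Local copies of the documented shapes, so the sketch is self-contained.
interface Mention {
  start: number;
  end: number;
  charStart: number;
  charEnd: number;
  text: string;
  cluster: number;
  antecedent: number;
}

interface CorefLike {
  mentions: Mention[];
  clusters: number[][];
}

// Hypothetical helper: the surface strings of every cluster.
function clusterSurfaces(result: CorefLike): string[][] {
  return result.clusters.map(cluster =>
    cluster.map(i => result.mentions[i].text)
  );
}

// Hypothetical helper: follow antecedent links back to the start of a chain.
// The `seen` set guards against accidental cycles in hand-built data.
function chainHead(result: CorefLike, index: number): number {
  let current = index;
  const seen = new Set<number>();
  while (result.mentions[current].antecedent !== -1 && !seen.has(current)) {
    seen.add(current);
    current = result.mentions[current].antecedent;
  }
  return current;
}

// Hand-built sample data; indices and offsets are illustrative only.
const example: CorefLike = {
  mentions: [
    { start: 1, end: 1, charStart: 0, charEnd: 6, text: 'Toyota', cluster: 0, antecedent: -1 },
    { start: 6, end: 6, charStart: 36, charEnd: 45, text: 'Panasonic', cluster: -1, antecedent: -1 },
    { start: 12, end: 14, charStart: 69, charEnd: 91, text: 'The Japanese automaker', cluster: 0, antecedent: 0 },
  ],
  clusters: [[0, 2]],
};

console.log(clusterSurfaces(example));
console.log(chainHead(example, 2));
```

Singletons such as `Panasonic` carry `cluster: -1` and do not appear in `clusters`, so code that iterates `clusters` only ever sees mentions with at least one coreferent partner.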

	## Languages

	Trained on synthetic Bedrock/Claude-generated data balanced across:

	| Code | Language             |
	|------|----------------------|
	| `en` | English              |
	| `ja` | Japanese             |
	| `ko` | Korean               |
	| `th` | Thai                 |
	| `zh` | Chinese (Simplified) |

	The XLM-R backbone covers ~100 languages, but the mention-detection
	and pointer-network heads were trained only on these five. Other
	languages may work via zero-shot transfer; verify on your domain
	before shipping.

	## API

	### `InfonCorefModel.fromHub(repo, options?)`

	Load model artefacts from a Hugging Face repo. Downloads (and caches
	in the browser Cache API) ``meta.json``, the chosen ONNX backbone,
	the mention scorer, and ``tokenizer.json``.

	| Option         | Type                                               | Default  | Notes |
	|----------------|----------------------------------------------------|----------|-------|
	| `precision`    | `'fp32' \| 'fp16'`                                 | `'fp16'` | FP16 halves the download. Falls back to FP32 if FP16 is missing in the repo. |
	| `device`       | `'auto' \| 'webgpu' \| 'wasm' \| 'cpu' \| 'cuda'` | `'auto'` | Browser auto-prefers WebGPU. |
	| `maxLength`    | `number`                                           | `256`    | Truncates inputs longer than N wordpieces. |
	| `bioThreshold` | `number`                                           | none     | If set, suppresses low-confidence span detections. `0.7` is a common stricter setting. |
	| `revision`     | `string`                                           | `'main'` | HF branch/tag/commit-SHA pin. |
	| `debug`        | `boolean`                                          | `false`  | Logs per-stage timings to `console.debug`. |
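To make `revision` and the FP16 fallback concrete, here is a minimal sketch of how a loader could turn these options into download URLs. It is not the package's actual code: `hubFileUrl` and `backboneCandidates` are hypothetical names, the URL follows the Hub's public `resolve` pattern, and the file names are the ones listed in the `fromLocal` directory layout.

```typescript
type Precision = 'fp32' | 'fp16';

// Hub "resolve" URL for one file at a pinned revision (branch, tag, or commit).
function hubFileUrl(repo: string, path: string, revision: string = 'main'): string {
  return `https://huggingface.co/${repo}/resolve/${encodeURIComponent(revision)}/${path}`;
}

// Ordered list of backbone files to try. With 'fp16', the FP32 file is the
// fallback when the FP16 file is missing (order of attempts is an assumption).
function backboneCandidates(precision: Precision = 'fp16'): string[] {
  const fp32 = 'onnx/backbone_bio.onnx';
  return precision === 'fp16' ? ['onnx/backbone_bio_fp16.onnx', fp32] : [fp32];
}

const urls = backboneCandidates('fp16').map(path =>
  hubFileUrl('cp500/infon-coref-pointer', path)
);
console.log(urls);
```

Pinning `revision` to a tag or commit rather than `'main'` keeps a deployed page from picking up new weights without a code change.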

	### `InfonCorefModel.fromLocal(baseUrl, options?)`

	Same as `fromHub` but loads files relative to a base URL or
	filesystem path. Browser: `baseUrl` is a URL prefix
	(`/models/coref/`). Node: a directory path (`./models/coref/`).

	The directory must contain:

	```
	meta.json
	tokenizer.json
	onnx/backbone_bio.onnx (and .onnx.data sidecar if present)
	onnx/backbone_bio_fp16.onnx
	onnx/mention_scorer.onnx
	onnx/mention_scorer_fp16.onnx
	```
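Before calling `fromLocal`, it can help to check that the directory is complete. The sketch below is one way to do that; `missingModelFiles` is a hypothetical helper, not a package export, and it takes an `exists` predicate so the same code works against a filesystem, a URL probe, or a test fixture. The optional `.onnx.data` sidecar is deliberately not checked.

```typescript
// Files listed in the directory layout above.
const REQUIRED_FILES = [
  'meta.json',
  'tokenizer.json',
  'onnx/backbone_bio.onnx',
  'onnx/backbone_bio_fp16.onnx',
  'onnx/mention_scorer.onnx',
  'onnx/mention_scorer_fp16.onnx',
] as const;

// Returns the required files for which `exists` reports false.
function missingModelFiles(exists: (relativePath: string) => boolean): string[] {
  return REQUIRED_FILES.filter(file => !exists(file));
}

// Example with an in-memory listing standing in for a real directory.
const present = new Set<string>([
  'meta.json',
  'tokenizer.json',
  'onnx/backbone_bio_fp16.onnx',
  'onnx/mention_scorer_fp16.onnx',
]);
console.log(missingModelFiles(path => present.has(path)));
```

In Node, the predicate could be `p => existsSync(join(dir, p))` using `existsSync` from `node:fs` and `join` from `node:path`.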

	### `model.resolve(text, options?)`

	Run end-to-end coref on a single document. Returns
	[`CorefResult`](#what-you-get-back).

	`options` accepts per-call overrides for `maxLength`,
	`bioThreshold`, and `debug`, with the same meaning as in `fromHub`.

	## Power-user exports

	If you want to swap one stage of the pipeline (e.g. a custom
	tokenizer or a different ORT runtime), the helpers are exported
	individually:

	```ts
	import {
	  buildPairs,    // mention M → flat (pair_i, pair_j) tensors
	  decodeBio,     // BIO logits → wordpiece spans
	  groupClusters, // antecedent decisions → union-find clusters
	  loadTokenizer, // SentencePiece JSON → Tokenizer
	  fetchHubFile,  // HF Hub fetch + browser-cache
	} from '@cp500/infon-coref';
	```

	These match the Python reference implementation in
	[`scripts/coref_onnx_experiment.py`](https://github.com/cp500/overlord/blob/main/infon/scripts/coref_onnx_experiment.py)
	exactly, which is useful when comparing the Python and TS pipelines
	at the intermediate-tensor level.
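For readers unfamiliar with BIO decoding, the following is a minimal sketch of the general technique, not the exported `decodeBio`. Several details are assumptions: the class order (`0 = O`, `1 = B`, `2 = I`), the flat row-major `(T, 3)` logits layout, the treatment of a stray `I` as outside, and the way the threshold demotes low-probability `B`/`I` tags to `O`.

```typescript
interface Span { start: number; end: number } // wordpiece indices, inclusive

// Numerically stable softmax over one row of three logits.
function softmax3(a: number, b: number, c: number): number[] {
  const m = Math.max(a, b, c);
  const ea = Math.exp(a - m), eb = Math.exp(b - m), ec = Math.exp(c - m);
  const sum = ea + eb + ec;
  return [ea / sum, eb / sum, ec / sum];
}

// Hypothetical decoder: per-token argmax, then run-length grouping into spans.
function decodeBioSketch(logits: Float32Array, numTokens: number, threshold?: number): Span[] {
  const spans: Span[] = [];
  let open: Span | null = null;
  for (let t = 0; t < numTokens; t++) {
    const probs = softmax3(logits[3 * t], logits[3 * t + 1], logits[3 * t + 2]);
    let tag = 0;
    for (let k = 1; k < 3; k++) if (probs[k] > probs[tag]) tag = k;
    // Assumed threshold semantics: a weak B or I is treated as O.
    if (tag !== 0 && threshold !== undefined && probs[tag] < threshold) tag = 0;

    if (tag === 1) {            // B closes any open span and starts a new one
      if (open) spans.push(open);
      open = { start: t, end: t };
    } else if (tag === 2 && open) { // I extends the open span
      open.end = t;
    } else {                    // O (or a stray I) closes the open span
      if (open) spans.push(open);
      open = null;
    }
  }
  if (open) spans.push(open);
  return spans;
}

// Five tokens tagged O, B, I, O, B.
const demoLogits = Float32Array.from([5, 0, 0, 0, 5, 0, 0, 0, 5, 5, 0, 0, 0, 5, 0]);
console.log(decodeBioSketch(demoLogits, 5));
```

When diffing against the Python reference, comparing the decoded spans first narrows a mismatch down to either the backbone output or the decode step.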

	## Architecture

	```
	┌─────────────────────────┐
	│          text           │
	└────────────┬────────────┘
	             ▼
	┌─────────────────────────┐
	│ SentencePiece tokenize  │   tokenizer.json (XLM-R vocab)
	└────────────┬────────────┘
	             ▼  input_ids, attention_mask
	┌─────────────────────────┐
	│ backbone_bio.onnx       │   MiniLM-L12 (12 layers, H=384)
	│   • XLM-R encoder       │   + 3-class BIO head
	│   • bio_logits (T,3)    │
	└────────┬────────┬───────┘
	         │        │
	         │        ▼  bio_logits → run-length decode → spans
	         │   ┌──────────────────────┐
	         │   │ decodeBio (TS)       │
	         │   └──────────┬───────────┘
	         │              ▼  span_starts, span_ends
	         │   ┌──────────────────────┐
	         │   │ buildPairs (TS)      │
	         │   └──────────┬───────────┘
	         │              ▼  pair_i, pair_j (triangular)
	         ▼              ▼
	┌─────────────────────────┐
	│ mention_scorer.onnx     │   gather + segment-mean pool +
	│   • pair_scores (P,)    │   3-vector pair MLP
	└────────────┬────────────┘
	             ▼
	┌─────────────────────────┐
	│ pickAntecedents (TS)    │
	│ + groupClusters (TS)    │
	└────────────┬────────────┘
	             ▼
	        CorefResult
	```

	The split between the two ONNX graphs exists so the BIO head can
	share computation with the backbone (one forward pass), while the
	mention scorer can be re-run with different `(pair_i, pair_j)`
	batches without recomputing hidden states. It also keeps each ONNX
	file's input signature simple enough to trace cleanly.
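The TypeScript stages around the scorer can be sketched as follows. This is an illustration of the general approach (triangular pair enumeration, best-antecedent selection, union-find grouping), not the package's implementation. All three function names are hypothetical, and several details are assumptions: that `pair_i` is the earlier candidate antecedent and `pair_j` the later mention, that a link needs a score above `0`, and that singletons are left out of the cluster list.

```typescript
// All (i, j) with i < j: every earlier mention is a candidate antecedent of j.
function buildPairsSketch(numMentions: number): { pairI: number[]; pairJ: number[] } {
  const pairI: number[] = [];
  const pairJ: number[] = [];
  for (let j = 1; j < numMentions; j++) {
    for (let i = 0; i < j; i++) {
      pairI.push(i);
      pairJ.push(j);
    }
  }
  return { pairI, pairJ };
}

// For each mention, keep the best-scoring antecedent above minScore, else -1.
function pickAntecedentsSketch(
  pairI: number[], pairJ: number[], scores: ArrayLike<number>,
  numMentions: number, minScore: number = 0
): number[] {
  const best = new Array<number>(numMentions).fill(-1);
  const bestScore = new Array<number>(numMentions).fill(minScore);
  for (let p = 0; p < pairI.length; p++) {
    const j = pairJ[p];
    if (scores[p] > bestScore[j]) {
      bestScore[j] = scores[p];
      best[j] = pairI[p];
    }
  }
  return best;
}

// Union-find over antecedent links; returns clusters with two or more mentions.
function groupClustersSketch(antecedents: number[]): number[][] {
  const parent = antecedents.map((_, i) => i);
  const find = (x: number): number => {
    while (parent[x] !== x) {
      parent[x] = parent[parent[x]]; // path halving
      x = parent[x];
    }
    return x;
  };
  antecedents.forEach((a, m) => {
    if (a >= 0) {
      const ra = find(a);
      const rm = find(m);
      if (ra !== rm) parent[Math.max(ra, rm)] = Math.min(ra, rm); // earliest mention stays root
    }
  });
  const byRoot = new Map<number, number[]>();
  antecedents.forEach((_, m) => {
    const root = find(m);
    const members = byRoot.get(root) ?? [];
    members.push(m);
    byRoot.set(root, members);
  });
  return Array.from(byRoot.values()).filter(cluster => cluster.length > 1);
}

// Four mentions with made-up pair scores, in the order produced by buildPairsSketch.
const { pairI, pairJ } = buildPairsSketch(4);
const demoScores = [-1, 2.0, 0.5, -3, 1.5, -0.2];
const antecedents = pickAntecedentsSketch(pairI, pairJ, demoScores, 4);
console.log(antecedents, groupClustersSketch(antecedents));
```

Because the pair list depends only on the number of mentions, it can be rebuilt or re-batched freely while the backbone's hidden states are reused, which is the property the two-graph split relies on.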

	## Performance ballpark

	Numbers from a 2024 M1 Pro MacBook on a 110-token English document:

	| Stage    | WASM (FP16) | WebGPU (FP16) | Node CPU (FP16) |
	|----------|-------------|---------------|-----------------|
	| Tokenize | 4 ms        | 4 ms          | 2 ms            |
	| Backbone | 220 ms      | 70 ms         | 90 ms           |
	| BIO      | <1 ms       | <1 ms         | <1 ms           |
	| Scorer   | 5 ms        | 4 ms          | 2 ms            |
	| Total    | ~230 ms     | ~80 ms        | ~95 ms          |

	First call adds ~2-4 s for ONNX session warmup. The Cache API in
	browsers persists the downloaded model so warmup-after-reload is
	limited to session creation.

	## License

	Apache 2.0. The trained weights at `cp500/infon-coref-pointer` carry
	the same license; the underlying MiniLM-L12 backbone is also Apache
	2.0.

	## Status

	Alpha. The API is stable enough to integrate behind your own
	abstraction; expect minor breaking changes on the public class
	shape until 1.0.

	Issue tracker: https://github.com/cp500/infon-coref-js/issues