Upload README.md with huggingface_hub

073ec8d verified 27 days ago

4.68 kB

	---
	license: apache-2.0
	library_name: transformers.js
	language:
	- en
	- ja
	- zh
	- ko
	- th
	tags:
	- coreference
	- multilingual
	- onnx
	- onnxruntime-web
	- transformers.js
	pipeline_tag: token-classification
	---

	# Infon multilingual coreference pointer

	Multilingual coreference resolution: detects mentions and links them
	into clusters across English, Japanese, Korean, Thai, and Chinese.
	Designed for browser inference via ONNX, replacing the English-only
	fastcoref baseline for multilingual workloads.

	## Quick start (JavaScript)

	```bash
	npm install @cp500/infon-coref onnxruntime-web
	```

	```ts
	import { InfonCorefModel } from '@cp500/infon-coref';

	const model = await InfonCorefModel.fromHub('cp500/infon-coref-pointer', {
	precision: 'fp16', // 235 MB (default) — vs 470 MB for fp32
	device: 'auto', // tries WebGPU, falls back to WASM
	});

	const result = await model.resolve(
	'Toyota announced a partnership with Panasonic. ' +
	'The Japanese automaker said the deal is worth $250M.'
	);

	for (const cluster of result.clusters) {
	console.log(cluster.map(i => result.mentions[i].text).join(' = '));
	// Toyota = The Japanese automaker
	}
	```

	The JS client source is mirrored under [`js/`](./tree/main/js) in this
	repo for self-contained installs:

	```bash
	npm install ./js
	```

	## Quick start (Python / PyTorch)

	```python
	import torch
	from transformers import AutoModel, AutoTokenizer
	# Architecture lives in scripts/train_coref_pointer.py / coref_onnx_experiment.py
	# (the training repo). Loading the heads is a 4-line check:
	heads = torch.load("heads.pt", map_location="cpu", weights_only=True)
	backbone = AutoModel.from_pretrained("./backbone/")
	tokenizer = AutoTokenizer.from_pretrained("./backbone/")
	```

	## Architecture

	```
	text ─▶ tokenize ─▶ MiniLM-L12 backbone ─▶ ┬─▶ last_hidden_state ─┐
	└─▶ bio_logits (T,3) │
	│ │
	▼ │
	decode BIO spans │
	│ │
	▼ │
	mention_scorer ◀───────────┘
	│
	▼
	pair_scores (P,)
	│
	▼
	per-mention argmax
	│
	▼
	coreference clusters
	```

	Two ONNX graphs:

	- `onnx/coref_backbone_bio.onnx` — XLM-R-distilled MiniLM-L12 (H=384,
	12 layers, 117M params) plus a 3-class BIO mention-detection head.
	- `onnx/coref_mention_scorer.onnx` — vectorised mention pooling
	(boundary tokens + segment-mean) and a pairwise antecedent scorer.
	DUMMY antecedent is concatenated at index 0 so `pair_j == 0` means
	"no antecedent."

	## Evaluation

	Best checkpoint (selected on combined `(ptr_acc + bio_f1) / 2`):

	\| Language \| Pointer acc \| BIO F1 \| Val mentions \|
	\|----------\|-------------\|--------\|--------------\|
	\| en \| 0.805 \| 0.809 \| 1827 \|
	\| ja \| 0.823 \| 0.794 \| 1601 \|
	\| ko \| 0.824 \| 0.814 \| 1702 \|
	\| th \| 0.820 \| 0.906 \| 1495 \|
	\| zh \| 0.829 \| 0.872 \| 1589 \|

	Aggregate: pointer accuracy 0.820, BIO F1 0.815,
	combined score 0.817.

	Trained on
	[cp500/infon-coref-multilingual](https://huggingface.co/datasets/cp500/infon-coref-multilingual).

	### Known limits

	- BIO precision degrades after epoch 0 if training continues with the
	default joint-loss schedule (pointer head saturates and the
	optimizer pushes BIO toward recall). The deployed checkpoint is
	from epoch 0 to keep BIO precision and pointer accuracy balanced.
	A fix using separate optimizers per head is on the roadmap.
	- Trained only on the 5 listed languages. Other XLM-R-supported
	languages may work via zero-shot transfer; verify on your domain.
	- Synthetic training data follows news-article register; out-of-domain
	text (chat, code comments, formal contracts) may underperform.

	## Backbone

	`sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` — public Apache-2.0 distillation of XLM-R-base.
	Tokenizer copied here for offline-installable parity.

	## License

	Apache 2.0 for both weights and code.