--- license: apache-2.0 library_name: transformers.js language: - en - ja - zh - ko - th tags: - coreference - multilingual - onnx - onnxruntime-web - transformers.js pipeline_tag: token-classification --- # Infon multilingual coreference pointer Multilingual coreference resolution: detects mentions and links them into clusters across **English, Japanese, Korean, Thai, and Chinese**. Designed for browser inference via ONNX, replacing the English-only fastcoref baseline for multilingual workloads. ## Quick start (JavaScript) ```bash npm install @cp500/infon-coref onnxruntime-web ``` ```ts import { InfonCorefModel } from '@cp500/infon-coref'; const model = await InfonCorefModel.fromHub('cp500/infon-coref-pointer', { precision: 'fp16', // 235 MB (default) — vs 470 MB for fp32 device: 'auto', // tries WebGPU, falls back to WASM }); const result = await model.resolve( 'Toyota announced a partnership with Panasonic. ' + 'The Japanese automaker said the deal is worth $250M.' ); for (const cluster of result.clusters) { console.log(cluster.map(i => result.mentions[i].text).join(' = ')); // Toyota = The Japanese automaker } ``` The JS client source is mirrored under [`js/`](./tree/main/js) in this repo for self-contained installs: ```bash npm install ./js ``` ## Quick start (Python / PyTorch) ```python import torch from transformers import AutoModel, AutoTokenizer # Architecture lives in scripts/train_coref_pointer.py / coref_onnx_experiment.py # (the training repo). Loading the heads is a 4-line check: heads = torch.load("heads.pt", map_location="cpu", weights_only=True) backbone = AutoModel.from_pretrained("./backbone/") tokenizer = AutoTokenizer.from_pretrained("./backbone/") ``` ## Architecture ``` text ─▶ tokenize ─▶ MiniLM-L12 backbone ─▶ ┬─▶ last_hidden_state ─┐ └─▶ bio_logits (T,3) │ │ │ ▼ │ decode BIO spans │ │ │ ▼ │ mention_scorer ◀───────────┘ │ ▼ pair_scores (P,) │ ▼ per-mention argmax │ ▼ coreference clusters ``` Two ONNX graphs: - `onnx/coref_backbone_bio.onnx` — XLM-R-distilled MiniLM-L12 (H=384, 12 layers, 117M params) plus a 3-class BIO mention-detection head. - `onnx/coref_mention_scorer.onnx` — vectorised mention pooling (boundary tokens + segment-mean) and a pairwise antecedent scorer. DUMMY antecedent is concatenated at index 0 so `pair_j == 0` means "no antecedent." ## Evaluation Best checkpoint (selected on combined `(ptr_acc + bio_f1) / 2`): | Language | Pointer acc | BIO F1 | Val mentions | |----------|-------------|--------|--------------| | en | 0.805 | 0.809 | 1827 | | ja | 0.823 | 0.794 | 1601 | | ko | 0.824 | 0.814 | 1702 | | th | 0.820 | 0.906 | 1495 | | zh | 0.829 | 0.872 | 1589 | **Aggregate**: pointer accuracy 0.820, BIO F1 0.815, combined score 0.817. Trained on [cp500/infon-coref-multilingual](https://huggingface.co/datasets/cp500/infon-coref-multilingual). ### Known limits - BIO precision degrades after epoch 0 if training continues with the default joint-loss schedule (pointer head saturates and the optimizer pushes BIO toward recall). The deployed checkpoint is from epoch 0 to keep BIO precision and pointer accuracy balanced. A fix using separate optimizers per head is on the roadmap. - Trained only on the 5 listed languages. Other XLM-R-supported languages may work via zero-shot transfer; verify on your domain. - Synthetic training data follows news-article register; out-of-domain text (chat, code comments, formal contracts) may underperform. ## Backbone `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` — public Apache-2.0 distillation of XLM-R-base. Tokenizer copied here for offline-installable parity. ## License Apache 2.0 for both weights and code.