---
license: apache-2.0
library_name: transformers.js
language:
- en
- ja
- zh
- ko
- th
tags:
- coreference
- multilingual
- onnx
- onnxruntime-web
- transformers.js
pipeline_tag: token-classification
---
# Infon multilingual coreference pointer

Multilingual coreference resolution: detects mentions and links them
into clusters across **English, Japanese, Korean, Thai, and Chinese**.
Designed for browser inference via ONNX, replacing the English-only
fastcoref baseline for multilingual workloads.

## Quick start (JavaScript)

```bash
npm install @cp500/infon-coref onnxruntime-web
```

```ts
import { InfonCorefModel } from '@cp500/infon-coref';

const model = await InfonCorefModel.fromHub('cp500/infon-coref-pointer', {
  precision: 'fp16', // 235 MB (default), vs 470 MB for fp32
  device: 'auto',    // tries WebGPU, falls back to WASM
});

const result = await model.resolve(
  'Toyota announced a partnership with Panasonic. ' +
  'The Japanese automaker said the deal is worth $250M.'
);

for (const cluster of result.clusters) {
  console.log(cluster.map(i => result.mentions[i].text).join(' = '));
  // Toyota = The Japanese automaker
}
```

The JS client source is mirrored under [`js/`](./tree/main/js) in this
repo for self-contained installs:

```bash
npm install ./js
```
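
As a rough sanity check on the sizes quoted in the `precision` comment above, the weight size can be estimated from the 117M parameter count listed under Architecture. This sketch assumes 2 bytes per weight for fp16 and 4 bytes for fp32, and ignores ONNX graph overhead. The helper name `weight_size_mb` is hypothetical.

```python
PARAMS = 117_000_000  # parameter count quoted in the Architecture section


def weight_size_mb(n_params: int, bytes_per_param: int) -> float:
    """Approximate weight size in decimal megabytes (assumption: weights only)."""
    return n_params * bytes_per_param / 1_000_000


print(weight_size_mb(PARAMS, 2))  # fp16 estimate: 234.0
print(weight_size_mb(PARAMS, 4))  # fp32 estimate: 468.0
```

Both estimates land close to the quoted 235 MB and 470 MB.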

## Quick start (Python / PyTorch)

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Architecture lives in scripts/train_coref_pointer.py / coref_onnx_experiment.py
# (the training repo). Loading the heads is a 4-line check:
heads = torch.load("heads.pt", map_location="cpu", weights_only=True)
backbone = AutoModel.from_pretrained("./backbone/")
tokenizer = AutoTokenizer.from_pretrained("./backbone/")
```

## Architecture

```
text ─▶ tokenize ─▶ MiniLM-L12 backbone ─┬─▶ last_hidden_state ──┐
                                         └─▶ bio_logits (T,3)    │
                                                 │               │
                                                 ▼               │
                                             decode BIO spans    │
                                                 │               │
                                                 ▼               │
                                             mention_scorer ◀────┘
                                                 │
                                                 ▼
                                             pair_scores (P,)
                                                 │
                                                 ▼
                                             per-mention argmax
                                                 │
                                                 ▼
                                             coreference clusters
```
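
The "decode BIO spans" step in the diagram can be sketched as follows. This is a minimal illustration, not the repo's decoder: the label order (`B=0, I=1, O=2`), the lenient handling of a stray `I` tag, and the function name `decode_bio_spans` are all assumptions.

```python
from typing import List, Optional, Sequence, Tuple

# Assumed label order for the 3-class head; the model card does not state it.
B, I, O = 0, 1, 2


def decode_bio_spans(bio_logits: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """Turn per-token logits of shape (T, 3) into inclusive (start, end) token spans."""
    spans: List[Tuple[int, int]] = []
    start: Optional[int] = None  # start of the mention being built, if any
    for t, row in enumerate(bio_logits):
        tag = max(range(3), key=lambda k: row[k])  # per-token argmax
        if tag == B:
            if start is not None:
                spans.append((start, t - 1))  # close the previous mention
            start = t
        elif tag == I:
            if start is None:
                start = t  # lenient: a stray I opens a new mention
        else:  # O closes any open mention
            if start is not None:
                spans.append((start, t - 1))
                start = None
    if start is not None:
        spans.append((start, len(bio_logits) - 1))
    return spans


logits = [
    [2.0, 0.1, 0.3],  # B
    [0.2, 1.5, 0.1],  # I
    [0.1, 0.2, 3.0],  # O
    [1.9, 0.0, 0.4],  # B
    [2.2, 0.3, 0.1],  # B (adjacent mention)
]
print(decode_bio_spans(logits))  # [(0, 1), (3, 3), (4, 4)]
```

The resulting spans are token indices; mapping them back to character offsets would rely on the tokenizer's offset mapping.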

Two ONNX graphs:

- `onnx/coref_backbone_bio.onnx`: XLM-R-distilled MiniLM-L12 (H=384,
  12 layers, 117M params) plus a 3-class BIO mention-detection head.
- `onnx/coref_mention_scorer.onnx`: vectorised mention pooling
  (boundary tokens + segment-mean) and a pairwise antecedent scorer.
  A DUMMY antecedent is concatenated at index 0 so `pair_j == 0` means
  "no antecedent."
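
The last two stages of the pipeline (per-mention argmax, then clustering) can be sketched as below. This is an illustration under assumptions, not the shipped implementation: the per-mention score layout (one row per mention, DUMMY at index 0, earlier mentions at indices 1 and up), the function names, and the choice to drop singleton clusters are all hypothetical.

```python
from typing import Dict, List


def link_antecedents(scores: List[List[float]]) -> List[int]:
    """Pick one antecedent per mention.

    Assumed layout: scores[i][0] is the DUMMY score and scores[i][k] (k >= 1)
    scores earlier mention k - 1. Returns -1 where DUMMY wins ("no antecedent").
    """
    antecedents = []
    for row in scores:
        best = max(range(len(row)), key=lambda k: row[k])  # per-mention argmax
        antecedents.append(best - 1)  # index 0 (DUMMY) maps to -1
    return antecedents


def build_clusters(antecedents: List[int]) -> List[List[int]]:
    """Group mentions that reach the same root by following antecedent links."""
    root: List[int] = []
    for i, a in enumerate(antecedents):
        root.append(i if a < 0 else root[a])  # antecedents always precede i
    groups: Dict[int, List[int]] = {}
    for i, r in enumerate(root):
        groups.setdefault(r, []).append(i)
    return [g for g in groups.values() if len(g) > 1]  # drop singletons


# Mentions: 0 "Toyota", 1 "Panasonic", 2 "The Japanese automaker", 3 "the deal"
scores = [
    [0.0],
    [1.2, -0.5],
    [-0.3, 2.1, 0.4],
    [0.9, -1.0, -0.8, -0.2],
]
antecedents = link_antecedents(scores)
print(antecedents)                  # [-1, -1, 0, -1]
print(build_clusters(antecedents))  # [[0, 2]]
```

Because each mention points to at most one earlier mention, a single left-to-right pass is enough to resolve every root; no general union-find structure is needed.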

## Evaluation

Best checkpoint (selected on combined `(ptr_acc + bio_f1) / 2`):

| Language | Pointer acc | BIO F1 | Val mentions |
|----------|-------------|--------|--------------|
| en       | 0.805       | 0.809  | 1827         |
| ja       | 0.823       | 0.794  | 1601         |
| ko       | 0.824       | 0.814  | 1702         |
| th       | 0.820       | 0.906  | 1495         |
| zh       | 0.829       | 0.872  | 1589         |

**Aggregate**: pointer accuracy 0.820, BIO F1 0.815,
combined score 0.817.
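
To make the selection metric concrete, the sketch below applies `(ptr_acc + bio_f1) / 2` to each row of the table and takes the unweighted mean of pointer accuracy, which matches the reported 0.820. The reported aggregate BIO F1 (0.815) is not the unweighted mean of the rows (about 0.839), so it is presumably computed over the pooled validation set; the sketch does not try to reproduce it. Variable and function names are illustrative.

```python
# Per-language validation metrics copied from the table: (pointer acc, BIO F1).
metrics = {
    "en": (0.805, 0.809),
    "ja": (0.823, 0.794),
    "ko": (0.824, 0.814),
    "th": (0.820, 0.906),
    "zh": (0.829, 0.872),
}


def combined(ptr_acc: float, bio_f1: float) -> float:
    """Checkpoint-selection score as stated in the model card."""
    return (ptr_acc + bio_f1) / 2


per_language = {lang: combined(*m) for lang, m in metrics.items()}
macro_ptr_acc = sum(p for p, _ in metrics.values()) / len(metrics)

for lang, score in per_language.items():
    print(f"{lang}: {score:.4f}")
print(f"macro pointer accuracy: {macro_ptr_acc:.3f}")  # 0.820
```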

Trained on
[cp500/infon-coref-multilingual](https://huggingface.co/datasets/cp500/infon-coref-multilingual).

### Known limits

- BIO precision degrades after epoch 0 if training continues with the
  default joint-loss schedule (pointer head saturates and the
  optimizer pushes BIO toward recall). The deployed checkpoint is
  from epoch 0 to keep BIO precision and pointer accuracy balanced.
  A fix using separate optimizers per head is on the roadmap.
- Trained only on the 5 listed languages. Other XLM-R-supported
  languages may work via zero-shot transfer; verify on your domain.
- Synthetic training data follows news-article register; out-of-domain
  text (chat, code comments, formal contracts) may underperform.

## Backbone

`sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`, a public Apache-2.0 distillation of XLM-R-base.
The tokenizer is copied into this repo so the model can be installed and run offline.

## License

Apache 2.0 for both weights and code.