---
license: apache-2.0
library_name: transformers.js
language:
- en
- ja
- zh
- ko
- th
tags:
- coreference
- multilingual
- onnx
- onnxruntime-web
- transformers.js
pipeline_tag: token-classification
---
# Infon multilingual coreference pointer
Multilingual coreference resolution: detects mentions and links them
into clusters across **English, Japanese, Korean, Thai, and Chinese**.
Designed for browser inference via ONNX, replacing the English-only
fastcoref baseline for multilingual workloads.
## Quick start (JavaScript)
```bash
npm install @cp500/infon-coref onnxruntime-web
```
```ts
import { InfonCorefModel } from '@cp500/infon-coref';

const model = await InfonCorefModel.fromHub('cp500/infon-coref-pointer', {
  precision: 'fp16', // 235 MB (default), vs 470 MB for fp32
  device: 'auto',    // tries WebGPU, falls back to WASM
});

const result = await model.resolve(
  'Toyota announced a partnership with Panasonic. ' +
  'The Japanese automaker said the deal is worth $250M.'
);

for (const cluster of result.clusters) {
  console.log(cluster.map(i => result.mentions[i].text).join(' = '));
  // Toyota = The Japanese automaker
}
```
The JS client source is mirrored under [`js/`](./tree/main/js) in this
repo for self-contained installs:
```bash
npm install ./js
```
## Quick start (Python / PyTorch)
```python
import torch
from transformers import AutoModel, AutoTokenizer
# Architecture lives in scripts/train_coref_pointer.py / coref_onnx_experiment.py
# (the training repo). Loading the heads is a 4-line check:
heads = torch.load("heads.pt", map_location="cpu", weights_only=True)
backbone = AutoModel.from_pretrained("./backbone/")
tokenizer = AutoTokenizer.from_pretrained("./backbone/")
```
## Architecture
```
text ──▶ tokenize ──▶ MiniLM-L12 backbone ──▶ ┬──▶ last_hidden_state ──┐
                                              └──▶ bio_logits (T,3)    │
                                                        │              │
                                                        ▼              │
                                                decode BIO spans       │
                                                        │              │
                                                        ▼              │
                                                 mention_scorer ◀──────┘
                                                        │
                                                        ▼
                                                pair_scores (P,)
                                                        │
                                                        ▼
                                               per-mention argmax
                                                        │
                                                        ▼
                                              coreference clusters
```
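The "decode BIO spans" step in the diagram can be sketched in plain Python. This is a minimal illustration, not the shipped decoder: the label order (0 = O, 1 = B, 2 = I), the greedy per-token argmax, and the lenient handling of a stray `I` tag are all assumptions, and `decode_bio_spans` is a hypothetical helper name.

```python
from typing import List, Optional, Tuple

# Assumed label order (not confirmed by this model card): 0 = O, 1 = B, 2 = I.
O, B, I = 0, 1, 2


def decode_bio_spans(bio_logits: List[List[float]]) -> List[Tuple[int, int]]:
    """Turn (T, 3) logits into inclusive (start, end) token spans."""
    # Greedy per-token label: index of the largest logit in each row.
    labels = [max(range(3), key=row.__getitem__) for row in bio_logits]

    spans: List[Tuple[int, int]] = []
    start: Optional[int] = None  # start index of the span being built
    for t, label in enumerate(labels):
        if label == B:
            if start is not None:
                spans.append((start, t - 1))  # close the previous mention
            start = t
        elif label == I:
            if start is None:
                start = t  # lenient choice: a stray I opens a mention
        else:  # O closes any open mention
            if start is not None:
                spans.append((start, t - 1))
                start = None
    if start is not None:
        spans.append((start, len(labels) - 1))
    return spans


# Illustrative logits only; real values come from bio_logits (T, 3).
logits = [
    [0.1, 2.0, 0.0],  # B
    [0.0, 0.2, 1.5],  # I
    [3.0, 0.1, 0.1],  # O
    [0.2, 1.1, 0.3],  # B
    [0.1, 0.9, 0.2],  # B (a second, adjacent mention)
    [2.5, 0.0, 0.4],  # O
]
print(decode_bio_spans(logits))  # [(0, 1), (3, 3), (4, 4)]
```

Spans here are token indices; mapping them back to character offsets would rely on the tokenizer's offset mapping, which this sketch leaves out.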
Two ONNX graphs:
- `onnx/coref_backbone_bio.onnx`: XLM-R-distilled MiniLM-L12 (H=384,
  12 layers, 117M params) plus a 3-class BIO mention-detection head.
- `onnx/coref_mention_scorer.onnx`: vectorised mention pooling
  (boundary tokens + segment-mean) and a pairwise antecedent scorer.
  A DUMMY antecedent is concatenated at index 0, so `pair_j == 0` means
  "no antecedent."
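The last two stages (per-mention argmax, then clustering) can be sketched as follows. This is an illustration under stated assumptions, not the client's actual code: `pair_i` is taken to be a 0-based mention index, `pair_j` an index into the DUMMY-prefixed antecedent list (so `pair_j - 1` is the antecedent mention), and singleton mentions are dropped from the output. The function name and the scores are hypothetical.

```python
from typing import Dict, List


def cluster_from_pairs(
    num_mentions: int,
    pair_i: List[int],
    pair_j: List[int],
    pair_scores: List[float],
) -> List[List[int]]:
    """Link each mention to its best-scoring antecedent and group the links."""
    # Per-mention argmax over candidate antecedents; slot 0 is the DUMMY.
    best_j = [0] * num_mentions
    best_score = [float("-inf")] * num_mentions
    for i, j, s in zip(pair_i, pair_j, pair_scores):
        if s > best_score[i]:
            best_score[i], best_j[i] = s, j

    # Union-find over mentions: merge each mention with its chosen antecedent.
    parent = list(range(num_mentions))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in enumerate(best_j):
        if j != 0:  # pair_j == 0 means "no antecedent"
            parent[find(i)] = find(j - 1)  # shift by one to skip the DUMMY

    groups: Dict[int, List[int]] = {}
    for m in range(num_mentions):
        groups.setdefault(find(m), []).append(m)
    # Assumed output convention: keep only clusters with two or more mentions.
    return [g for g in groups.values() if len(g) > 1]


# Mentions: 0 = "Toyota", 1 = "Panasonic", 2 = "The Japanese automaker".
# Scores are made up for illustration.
pair_i = [0, 1, 1, 2, 2, 2]
pair_j = [0, 0, 1, 0, 1, 2]
pair_scores = [0.0, 1.2, -0.5, 0.1, 2.3, -1.0]
print(cluster_from_pairs(3, pair_i, pair_j, pair_scores))  # [[0, 2]]
```

Union-find keeps transitive links in one cluster: if mention C points to B and B points to A, all three end up together even though C never scored against A directly.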
## Evaluation
Best checkpoint (selected on combined `(ptr_acc + bio_f1) / 2`):
| Language | Pointer acc | BIO F1 | Val mentions |
|----------|-------------|--------|--------------|
| en | 0.805 | 0.809 | 1827 |
| ja | 0.823 | 0.794 | 1601 |
| ko | 0.824 | 0.814 | 1702 |
| th | 0.820 | 0.906 | 1495 |
| zh | 0.829 | 0.872 | 1589 |
**Aggregate**: pointer accuracy 0.820, BIO F1 0.815,
combined score 0.817.
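As a quick check on the selection metric, the combined score can be recomputed from the rounded figures reported here. This is only arithmetic on the published numbers; how the aggregate values themselves were pooled across languages is not stated in this card, so the sketch does not try to reproduce them.

```python
# Per-language validation metrics copied from the evaluation table:
# language -> (pointer accuracy, BIO F1).
metrics = {
    "en": (0.805, 0.809),
    "ja": (0.823, 0.794),
    "ko": (0.824, 0.814),
    "th": (0.820, 0.906),
    "zh": (0.829, 0.872),
}


def combined_score(ptr_acc: float, bio_f1: float) -> float:
    """Checkpoint-selection metric: (ptr_acc + bio_f1) / 2."""
    return (ptr_acc + bio_f1) / 2


for lang, (ptr_acc, bio_f1) in metrics.items():
    print(f"{lang}: combined = {combined_score(ptr_acc, bio_f1):.4f}")

# Aggregate figures as reported (0.820 and 0.815).
aggregate = combined_score(0.820, 0.815)
print(f"aggregate: combined = {aggregate:.4f}")
```

The rounded aggregate inputs give 0.8175, which agrees with the reported 0.817 to within the rounding of the inputs.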
Trained on
[cp500/infon-coref-multilingual](https://huggingface.co/datasets/cp500/infon-coref-multilingual).
### Known limits
- BIO precision degrades after epoch 0 if training continues with the
default joint-loss schedule (pointer head saturates and the
optimizer pushes BIO toward recall). The deployed checkpoint is
from epoch 0 to keep BIO precision and pointer accuracy balanced.
A fix using separate optimizers per head is on the roadmap.
- Trained only on the 5 listed languages. Other XLM-R-supported
languages may work via zero-shot transfer; verify on your domain.
- Synthetic training data follows news-article register; out-of-domain
text (chat, code comments, formal contracts) may underperform.
## Backbone
`sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`: a public Apache-2.0 distillation of XLM-R-base.
The tokenizer is copied into this repo so the model can be installed and run offline.
## License
Apache 2.0 for both weights and code.