---
license: apache-2.0
library_name: transformers.js
language:
  - en
  - ja
  - zh
  - ko
  - th
tags:
  - coreference
  - multilingual
  - onnx
  - onnxruntime-web
  - transformers.js
pipeline_tag: token-classification
---

# Infon multilingual coreference pointer

Multilingual coreference resolution: detects mentions and links them
into clusters across **English, Japanese, Korean, Thai, and Chinese**.
Designed for browser inference via ONNX, replacing the English-only
fastcoref baseline for multilingual workloads.

## Quick start (JavaScript)

```bash
npm install @cp500/infon-coref onnxruntime-web
```

```ts
import { InfonCorefModel } from '@cp500/infon-coref';

const model = await InfonCorefModel.fromHub('cp500/infon-coref-pointer', {
  precision: 'fp16',     // 235 MB (default), vs. 470 MB for fp32
  device: 'auto',        // tries WebGPU, falls back to WASM
});

const result = await model.resolve(
  'Toyota announced a partnership with Panasonic. ' +
  'The Japanese automaker said the deal is worth $250M.'
);

for (const cluster of result.clusters) {
  console.log(cluster.map(i => result.mentions[i].text).join(' = '));
  // Toyota = The Japanese automaker
}
```

The JS client source is mirrored under [`js/`](./tree/main/js) in this
repo for self-contained installs:

```bash
npm install ./js
```

## Quick start (Python / PyTorch)

```python
import torch
from transformers import AutoModel, AutoTokenizer
# Architecture lives in scripts/train_coref_pointer.py / coref_onnx_experiment.py
# (the training repo). Loading the heads and backbone is a quick sanity check:
heads = torch.load("heads.pt", map_location="cpu", weights_only=True)
backbone = AutoModel.from_pretrained("./backbone/")
tokenizer = AutoTokenizer.from_pretrained("./backbone/")
```

## Architecture

```
text ─▶ tokenize ─▶ MiniLM-L12 backbone ─▶ ┬─▶ last_hidden_state ─┐
                                           └─▶ bio_logits (T,3)   │
                                                      │           │
                                                      ▼           │
                                              decode BIO spans    │
                                                      │           │
                                                      ▼           │
                                          mention_scorer ◀────────┘
                                                  │
                                                  ▼
                                            pair_scores (P,)
                                                  │
                                                  ▼
                                          per-mention argmax
                                                  │
                                                  ▼
                                          coreference clusters
```
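
The `decode BIO spans` step in the diagram can be sketched roughly as below. This is a minimal illustration, not the shipped decoder: the label order (`O=0, B=1, I=2`), the helper names `decode_bio_spans` and `one_hot`, and the lenient handling of a stray `I` tag are all assumptions. The model card only states that `bio_logits` has shape `(T, 3)`.

```python
from typing import List, Sequence, Tuple

# ASSUMPTION: class order of the 3-way head; the model card does not state it.
O, B, I = 0, 1, 2


def decode_bio_spans(bio_logits: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """Turn per-token BIO logits of shape (T, 3) into inclusive (start, end) spans."""
    # Per-token argmax over the three classes.
    labels = [max(range(3), key=lambda c: row[c]) for row in bio_logits]

    spans: List[Tuple[int, int]] = []
    start = None  # index of the currently open span, if any
    for t, label in enumerate(labels):
        if label == B:
            if start is not None:        # a new B closes the previous span
                spans.append((start, t - 1))
            start = t
        elif label == I:
            if start is None:            # stray I: leniently open a span here
                start = t
        else:                            # O closes any open span
            if start is not None:
                spans.append((start, t - 1))
                start = None
    if start is not None:                # span running to the last token
        spans.append((start, len(labels) - 1))
    return spans


def one_hot(label: int) -> List[float]:
    """Hypothetical helper: fake logits that argmax to `label`."""
    return [1.0 if c == label else 0.0 for c in range(3)]


example = [one_hot(x) for x in (O, B, I, O, B, B, I)]
print(decode_bio_spans(example))  # [(1, 2), (4, 4), (5, 6)]
```

The spans are token indices; mapping them back to character offsets would rely on the tokenizer's offset mapping, which is outside this sketch.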

Two ONNX graphs:

- `onnx/coref_backbone_bio.onnx`: XLM-R-distilled MiniLM-L12 (H=384,
  12 layers, 117M params) plus a 3-class BIO mention-detection head.
- `onnx/coref_mention_scorer.onnx`: vectorised mention pooling
  (boundary tokens + segment-mean) and a pairwise antecedent scorer.
  A DUMMY antecedent is concatenated at index 0, so `pair_j == 0` means
  "no antecedent."
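
To illustrate how the per-mention argmax and the DUMMY slot could turn pair scores into clusters, here is a hedged sketch. The `(mention_i, pair_j, score)` triple layout, the convention that `pair_j == k` for `k > 0` points at mention `k - 1`, the function name `link_clusters`, and the choice to drop singletons are assumptions made for illustration; only the meaning of `pair_j == 0` comes from the description above.

```python
from typing import Dict, List, Tuple


def link_clusters(num_mentions: int,
                  pairs: List[Tuple[int, int, float]]) -> List[List[int]]:
    """Group mentions into clusters from scored (mention_i, pair_j, score) triples.

    ASSUMED layout: pair_j == 0 is the DUMMY ("no antecedent");
    pair_j == k > 0 refers to mention index k - 1.
    """
    # Per-mention argmax: keep the highest-scoring candidate for each mention.
    best: Dict[int, Tuple[int, float]] = {}
    for i, j, score in pairs:
        if i not in best or score > best[i][1]:
            best[i] = (j, score)

    # Union-find over mention indices.
    parent = list(range(num_mentions))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, (j, _) in best.items():
        if j == 0:
            continue                       # DUMMY won: mention starts its own chain
        parent[find(i)] = find(j - 1)      # link mention to its chosen antecedent

    groups: Dict[int, List[int]] = {}
    for m in range(num_mentions):
        groups.setdefault(find(m), []).append(m)
    # Design choice in this sketch: singletons are not reported as clusters.
    return [g for g in groups.values() if len(g) > 1]


# Mentions: 0 = "Toyota", 1 = "Panasonic", 2 = "The Japanese automaker"
pairs = [
    (0, 0, 0.0),                                  # first mention: only DUMMY
    (1, 0, 2.0), (1, 1, -1.0),                    # Panasonic prefers DUMMY
    (2, 0, 0.1), (2, 1, 3.2), (2, 2, -0.5),       # automaker prefers Toyota
]
print(link_clusters(3, pairs))  # [[0, 2]]
```

With these made-up scores the result lines up with the quick-start output, where `Toyota` and `The Japanese automaker` share a cluster.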

## Evaluation

Best checkpoint (selected on combined `(ptr_acc + bio_f1) / 2`):

| Language | Pointer acc | BIO F1 | Val mentions |
|----------|-------------|--------|--------------|
| en | 0.805 | 0.809 | 1827 |
| ja | 0.823 | 0.794 | 1601 |
| ko | 0.824 | 0.814 | 1702 |
| th | 0.820 | 0.906 | 1495 |
| zh | 0.829 | 0.872 | 1589 |

**Aggregate**: pointer accuracy 0.820, BIO F1 0.815,
combined score 0.817.
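
As a quick check of the selection metric, the combined score can be recomputed from the rounded aggregates. The function name `combined_score` is hypothetical; the formula is the `(ptr_acc + bio_f1) / 2` stated above.

```python
def combined_score(ptr_acc: float, bio_f1: float) -> float:
    """Checkpoint-selection metric: unweighted mean of the two head metrics."""
    return (ptr_acc + bio_f1) / 2


# Rounded aggregate values from this section.
aggregate = combined_score(0.820, 0.815)
print(f"{aggregate:.4f}")
```

From the rounded inputs this gives about 0.8175, in line with the reported 0.817, which was presumably computed from unrounded values.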

Trained on
[cp500/infon-coref-multilingual](https://huggingface.co/datasets/cp500/infon-coref-multilingual).

### Known limits

- BIO precision degrades after epoch 0 if training continues with the
  default joint-loss schedule (pointer head saturates and the
  optimizer pushes BIO toward recall). The deployed checkpoint is
  from epoch 0 to keep BIO precision and pointer accuracy balanced.
  A fix using separate optimizers per head is on the roadmap.
- Trained only on the 5 listed languages. Other XLM-R-supported
  languages may work via zero-shot transfer; verify on your domain.
- Synthetic training data follows news-article register; out-of-domain
  text (chat, code comments, formal contracts) may underperform.

## Backbone

`sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`, a public Apache-2.0 distillation of XLM-R-base.
The tokenizer is copied into this repo so that offline installs use the identical tokenizer.

## License

Apache 2.0 for both weights and code.