File size: 4,681 Bytes

073ec8d

---
license: apache-2.0
library_name: transformers.js
language:
  - en
  - ja
  - zh
  - ko
  - th
tags:
  - coreference
  - multilingual
  - onnx
  - onnxruntime-web
  - transformers.js
pipeline_tag: token-classification
---

# Infon multilingual coreference pointer

Multilingual coreference resolution: detects mentions and links them
into clusters across **English, Japanese, Korean, Thai, and Chinese**.
Designed for browser inference via ONNX, replacing the English-only
fastcoref baseline for multilingual workloads.

## Quick start (JavaScript)

```bash
npm install @cp500/infon-coref onnxruntime-web
```

```ts
import { InfonCorefModel } from '@cp500/infon-coref';

const model = await InfonCorefModel.fromHub('cp500/infon-coref-pointer', {
  precision: 'fp16',     // 235 MB (default) — vs 470 MB for fp32
  device: 'auto',        // tries WebGPU, falls back to WASM
});

const result = await model.resolve(
  'Toyota announced a partnership with Panasonic. ' +
  'The Japanese automaker said the deal is worth $250M.'
);

for (const cluster of result.clusters) {
  console.log(cluster.map(i => result.mentions[i].text).join(' = '));
  // Toyota = The Japanese automaker
}
```

The JS client source is mirrored under [`js/`](./tree/main/js) in this
repo for self-contained installs:

```bash
npm install ./js
```

## Quick start (Python / PyTorch)

```python
import torch
from transformers import AutoModel, AutoTokenizer
# Architecture lives in scripts/train_coref_pointer.py / coref_onnx_experiment.py
# (the training repo). Loading the heads is a 4-line check:
heads = torch.load("heads.pt", map_location="cpu", weights_only=True)
backbone = AutoModel.from_pretrained("./backbone/")
tokenizer = AutoTokenizer.from_pretrained("./backbone/")
```

## Architecture

```
text ─▶ tokenize ─▶ MiniLM-L12 backbone ─▶ ┬─▶ last_hidden_state ─┐
                                            └─▶ bio_logits (T,3)    │
                                                       │             │
                                                       ▼             │
                                              decode BIO spans       │
                                                       │             │
                                                       ▼             │
                                          mention_scorer ◀───────────┘
                                                  │
                                                  ▼
                                            pair_scores (P,)
                                                  │
                                                  ▼
                                          per-mention argmax
                                                  │
                                                  ▼
                                          coreference clusters
```

Two ONNX graphs:

- `onnx/coref_backbone_bio.onnx` — XLM-R-distilled MiniLM-L12 (H=384,
  12 layers, 117M params) plus a 3-class BIO mention-detection head.
- `onnx/coref_mention_scorer.onnx` — vectorised mention pooling
  (boundary tokens + segment-mean) and a pairwise antecedent scorer.
  DUMMY antecedent is concatenated at index 0 so `pair_j == 0` means
  "no antecedent."

## Evaluation

Best checkpoint (selected on combined `(ptr_acc + bio_f1) / 2`):

| Language | Pointer acc | BIO F1 | Val mentions |
|----------|-------------|--------|--------------|
| en | 0.805 | 0.809 | 1827 |
| ja | 0.823 | 0.794 | 1601 |
| ko | 0.824 | 0.814 | 1702 |
| th | 0.820 | 0.906 | 1495 |
| zh | 0.829 | 0.872 | 1589 |

**Aggregate**: pointer accuracy 0.820, BIO F1 0.815,
combined score 0.817.

Trained on
[cp500/infon-coref-multilingual](https://huggingface.co/datasets/cp500/infon-coref-multilingual).

### Known limits

- BIO precision degrades after epoch 0 if training continues with the
  default joint-loss schedule (pointer head saturates and the
  optimizer pushes BIO toward recall). The deployed checkpoint is
  from epoch 0 to keep BIO precision and pointer accuracy balanced.
  A fix using separate optimizers per head is on the roadmap.
- Trained only on the 5 listed languages. Other XLM-R-supported
  languages may work via zero-shot transfer; verify on your domain.
- Synthetic training data follows news-article register; out-of-domain
  text (chat, code comments, formal contracts) may underperform.

## Backbone

`sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` — public Apache-2.0 distillation of XLM-R-base.
Tokenizer copied here for offline-installable parity.

## License

Apache 2.0 for both weights and code.