+---
+language: en
+license: mit
+library_name: transformers
+tags:
+  - ner
+  - token-classification
+  - accounting
+  - finance
+  - bert
+  - onnx
+  - netting
+  - settlement
+datasets:
+  - expertai/BUSTER
+  - nikitpatel/invoice-ner-dataset
+pipeline_tag: token-classification
+base_model: google-bert/bert-base-uncased
+---
+# Accounting NER: PAYER / PAYEE / AMOUNT
+A fine-tuned BERT model for extracting **payer**, **payee**, and **amount** entities from transaction text. Designed for accounting reconciliation and netting tasks where an agent must parse transaction histories and compute final settlements between parties.
+## Entity Types
+| Label | Description | Example |
+|-------|-------------|---------|
+| `PAYER` | The party sending/owing money | "**Alice** paid $500 to Bob" |
+| `PAYEE` | The party receiving money | "Alice paid $500 to **Bob**" |
+| `AMOUNT` | Monetary amounts | "Alice paid **$500** to Bob" |
+## Performance
+Evaluated on a held-out validation set (2,385 examples):
+| Entity | Precision | Recall | F1 |
+|--------|-----------|--------|----|
+| AMOUNT | 0.96 | 0.98 | 0.97 |
+| PAYEE | 0.89 | 0.91 | 0.90 |
+| PAYER | 0.88 | 0.91 | 0.89 |
+| **Overall** | **0.89** | **0.92** | **0.90** |
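As a quick sanity check, each F1 value in the table is the harmonic mean of the precision and recall in the same row. A minimal sketch (the helper name `f1_score` is illustrative and not part of this repository):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
reported = {
    "AMOUNT": (0.96, 0.98),
    "PAYEE": (0.89, 0.91),
    "PAYER": (0.88, 0.91),
    "Overall": (0.89, 0.92),
}

for entity, (p, r) in reported.items():
    print(f"{entity}: F1 = {f1_score(p, r):.2f}")
```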
+## Usage
+### Python (Transformers)
+```python
+from transformers import pipeline
+ner = pipeline("ner", model="Minns-ai/accounting-ner", aggregation_strategy="simple")
+results = ner("Alice paid $500 to Bob for dinner.")
+```
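With `aggregation_strategy="simple"`, the pipeline returns a list of dicts with `entity_group`, `score`, `word`, `start`, and `end` keys. The sketch below groups those results by label. The helper `group_by_label`, the `min_score` threshold, and the hand-written `sample_results` are illustrative assumptions, not output from the actual model. It slices the original text with `start`/`end` rather than using `word`, because an uncased tokenizer may return a lowercased `word`.

```python
from collections import defaultdict


def group_by_label(text, results, min_score=0.5):
    """Collect entity surface forms per label, sliced from the original text."""
    grouped = defaultdict(list)
    for ent in results:
        if ent["score"] >= min_score:
            grouped[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)


text = "Alice paid $500 to Bob for dinner."
# Hand-written stand-in for `ner(text)`; real scores and spans may differ.
sample_results = [
    {"entity_group": "PAYER", "score": 0.99, "word": "alice", "start": 0, "end": 5},
    {"entity_group": "AMOUNT", "score": 0.98, "word": "$ 500", "start": 11, "end": 15},
    {"entity_group": "PAYEE", "score": 0.99, "word": "bob", "start": 19, "end": 22},
]
print(group_by_label(text, sample_results))
```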
+### ONNX Runtime
+The `onnx/` directory contains `model.onnx` and `tokenizer.json` for deployment with ONNX Runtime (e.g. in a Rust or C++ service).
+```python
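+import numpy as np
+import onnxruntime as ort
+from tokenizers import Tokenizer
+tokenizer = Tokenizer.from_file("onnx/tokenizer.json")
+session = ort.InferenceSession("onnx/model.onnx")
+encoding = tokenizer.encode("Sam supplied $1,200 for Grace.")
+# ONNX Runtime expects NumPy arrays rather than Python lists; BERT exports
+# typically take int64 inputs with a leading batch dimension.
+def as_batch(values):
+    return np.array([values], dtype=np.int64)
+outputs = session.run(None, {
+    "input_ids": as_batch(encoding.ids),
+    "attention_mask": as_batch(encoding.attention_mask),
+    "token_type_ids": as_batch(encoding.type_ids),
+})
+# Assumed to be the token-classification logits, shape (1, sequence_length, 7).
+logits = outputs[0]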
+```
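The raw ONNX output still has to be turned into entity spans. The following is a minimal sketch of one way to do that, not the decoding used by any particular service. It assumes the first output holds per-token logits and that label ids follow the order listed in the Training section; check `id2label` in the model's `config.json` before relying on that order. The function name `decode_entities` is hypothetical.

```python
import numpy as np

# Assumed label order; verify against `id2label` in config.json.
ID2LABEL = ["O", "B-PAYER", "I-PAYER", "B-PAYEE", "I-PAYEE", "B-AMOUNT", "I-AMOUNT"]


def decode_entities(logits, offsets, text):
    """Merge per-token BIO predictions into character-level entity spans.

    logits:  array of shape (seq_len, num_labels) for one sequence
    offsets: list of (start, end) character offsets, one per token
    """
    label_ids = np.asarray(logits).argmax(axis=-1)
    entities, current = [], None
    for label_id, (start, end) in zip(label_ids, offsets):
        label = ID2LABEL[label_id]
        # Special tokens such as [CLS] and [SEP] have empty offsets.
        if start == end or label == "O":
            current = None
            continue
        prefix, kind = label.split("-", 1)
        continues = current is not None and current["label"] == kind
        # Extend on an I- tag, or when a word piece directly abuts the span.
        if continues and (prefix == "I" or start == current["end"]):
            current["end"] = end
        else:
            current = {"label": kind, "start": start, "end": end}
            entities.append(current)
    for ent in entities:
        ent["text"] = text[ent["start"]:ent["end"]]
    return entities
```

With the variables from the snippet above, a call would look like `decode_entities(outputs[0][0], encoding.offsets, "Sam supplied $1,200 for Grace.")`.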
+### Example Output
+```json
+{
+  "model": "bert-base-NER-onnx",
+  "entities": [
+    {"label": "PAYER", "start_offset": 0, "end_offset": 4, "confidence": 0.9996, "text": "anna"},
+    {"label": "PAYEE", "start_offset": 11, "end_offset": 15, "confidence": 0.9996, "text": "john"},
+    {"label": "PAYER", "start_offset": 35, "end_offset": 39, "confidence": 0.9991, "text": "tine"},
+    {"label": "PAYEE", "start_offset": 45, "end_offset": 49, "confidence": 0.9996, "text": "john"},
+    {"label": "PAYEE", "start_offset": 54, "end_offset": 58, "confidence": 0.9996, "text": "anna"}
+  ]
+}
+```
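+The output above was produced for the input `"anna payed john for the cinema but tine owes john and anna for covering her 20"`. The misspelling "payed" is deliberate, and the trailing bare `20` is not tagged as `AMOUNT` (see Limitations).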
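A response in this shape can be checked against the input by slicing the text with `start_offset` and `end_offset`. The sketch below does that and keeps only confident entities. The helper `entities_from_response` and the `min_confidence` threshold are illustrative assumptions; only the field names come from the example.

```python
import json


def entities_from_response(text, response_json, min_confidence=0.9):
    """Return (label, surface text) pairs, checking offsets against the input."""
    pairs = []
    for ent in json.loads(response_json)["entities"]:
        span = text[ent["start_offset"]:ent["end_offset"]]
        if span != ent["text"]:
            raise ValueError(f"offset mismatch for {ent!r}")
        if ent["confidence"] >= min_confidence:
            pairs.append((ent["label"], span))
    return pairs


text = "anna payed john for the cinema but tine owes john and anna for covering her 20"
# First two entities from the example output above.
response_json = """
{"model": "bert-base-NER-onnx", "entities": [
  {"label": "PAYER", "start_offset": 0, "end_offset": 4, "confidence": 0.9996, "text": "anna"},
  {"label": "PAYEE", "start_offset": 11, "end_offset": 15, "confidence": 0.9996, "text": "john"}
]}
"""
print(entities_from_response(text, response_json))
```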
+## Training
+### Base Model
+`bert-base-uncased` fine-tuned for token classification with 7 labels (BIO format):
+`O`, `B-PAYER`, `I-PAYER`, `B-PAYEE`, `I-PAYEE`, `B-AMOUNT`, `I-AMOUNT`
+### Training Data (~10K examples from three sources)
+**1. [expertai/BUSTER](https://huggingface.co/datasets/expertai/BUSTER) (9,861 examples)**
+Business transaction documents from SEC EDGAR filings. Entity types remapped:
+- `Parties.BUYING_COMPANY` -> `PAYER`
+- `Parties.SELLING_COMPANY` -> `PAYEE`
+- `Generic_Info.ANNUAL_REVENUES` -> `AMOUNT`
+Licensed under Apache 2.0.
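The remapping above can be sketched as a lookup on the entity type inside each BIO tag. This is a hedged illustration, not the repository's preprocessing code. It assumes source tags look like `B-Parties.BUYING_COMPANY`, and it assumes that entity types outside the three listed mappings are dropped to `O`, which this card does not state.

```python
# Mappings taken from the list above.
TYPE_MAP = {
    "Parties.BUYING_COMPANY": "PAYER",
    "Parties.SELLING_COMPANY": "PAYEE",
    "Generic_Info.ANNUAL_REVENUES": "AMOUNT",
}


def remap_tag(tag: str) -> str:
    """Map a source BIO tag to the PAYER/PAYEE/AMOUNT label set."""
    if tag == "O":
        return "O"
    prefix, _, entity_type = tag.partition("-")
    target = TYPE_MAP.get(entity_type)
    # Assumption: unmapped entity types become "O".
    return f"{prefix}-{target}" if target else "O"


source_tags = ["B-Parties.BUYING_COMPANY", "I-Parties.BUYING_COMPANY", "O",
               "B-Generic_Info.ANNUAL_REVENUES"]
print([remap_tag(t) for t in source_tags])
```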
+**2. [Kaggle Invoice NER](https://www.kaggle.com/datasets/nikitpatel/invoice-ner-dataset) (64 examples)**
+Invoice documents with extracted fields (`TOTAL_AMOUNT`, `DUE_AMOUNT`, `ACCOUNT_NAME`) converted to token-level BIO annotations.
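Converting extracted fields to token-level BIO annotations amounts to projecting character spans onto tokens. The sketch below shows the idea on whitespace tokens, under stated assumptions: the function `spans_to_bio` is hypothetical, the example sentence is made up, and the card does not say how invoice fields map to `PAYER`/`PAYEE`/`AMOUNT`, so the spans use the model's labels directly. A real pipeline would additionally align these word-level tags to BERT word pieces.

```python
import re


def spans_to_bio(text, spans):
    """Tag whitespace tokens with BIO labels from (start, end, label) spans."""
    tokens, tags = [], []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span when it lies fully inside it.
            if match.start() >= start and match.end() <= end:
                tag = ("B-" if match.start() == start else "I-") + label
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


text = "Acme Corp paid $500 to Bob"
spans = [(0, 9, "PAYER"), (15, 19, "AMOUNT"), (23, 26, "PAYEE")]
for token, tag in zip(*spans_to_bio(text, spans)):
    print(f"{token}\t{tag}")
```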
+**3. Synthetic Data (2,400 examples)**
+Programmatically generated transaction sentences to cover patterns underrepresented in the real datasets:
+- Formal ledger entries: `"Sam supplied $1,200 for Grace."`
+- Informal/casual language: `"Leo payed Lucy 500 for cleaning."`
+- Misspellings: `"payed"` instead of `"paid"`
+- Compound payers/payees: `"Tom and Lucy paid Mike $200."`
+- Missing amounts: `"Alice covered Bob for dinner."`
+- Multi-transaction sentences with conjunctions: `"Anna paid John $50 but Tine owes John and Anna for covering her 20."`
+- Transaction histories (3-8 concatenated transactions)
+The synthetic data generator (`training/data/create_dataset.py`) uses 30+ templates, 60+ party names, and 40+ transaction reasons to produce diverse examples.
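Template-based generation of this kind can be sketched in a few lines. The code below is not the contents of `training/data/create_dataset.py`. The templates, names, and reasons are a small hypothetical sample modeled on the examples listed above, and `fill` and `generate_example` are illustrative names. Spans are recorded while the template is filled, so the offsets stay correct even when a name also appears elsewhere in the sentence.

```python
import random
import re

# Hypothetical sample; the real generator uses 30+ templates and 60+ names.
TEMPLATES = [
    "{payer} supplied {amount} for {payee}.",
    "{payer} paid {payee} {amount} for {reason}.",
    "{payer} payed {payee} {amount} for {reason}.",  # deliberate misspelling
]
NAMES = ["Sam", "Grace", "Leo", "Lucy", "Tom", "Mike"]
REASONS = ["dinner", "cleaning", "groceries"]
LABELS = {"payer": "PAYER", "payee": "PAYEE", "amount": "AMOUNT"}
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def fill(template, fields):
    """Fill a template and return (text, [(start, end, label), ...])."""
    text, spans, pos = "", [], 0
    for match in PLACEHOLDER.finditer(template):
        text += template[pos:match.start()]
        value = fields[match.group(1)]
        label = LABELS.get(match.group(1))  # "reason" has no entity label
        if label:
            spans.append((len(text), len(text) + len(value), label))
        text += value
        pos = match.end()
    return text + template[pos:], spans


def generate_example(rng):
    payer, payee = rng.sample(NAMES, 2)
    fields = {
        "payer": payer,
        "payee": payee,
        "amount": f"${rng.randint(1, 2000):,}",
        "reason": rng.choice(REASONS),
    }
    return fill(rng.choice(TEMPLATES), fields)


print(generate_example(random.Random(0)))
```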
+### Hyperparameters
+| Parameter | Value |
+|-----------|-------|
+| Learning rate | 3e-5 |
+| Batch size | 16 |
+| Epochs | 5 |
+| Warmup ratio | 0.1 |
+| Weight decay | 0.01 |
+| Max sequence length | 128 |
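To make the warmup ratio concrete, the sketch below works out how many optimizer steps these settings imply. It rests on assumptions the card does not confirm: a round figure of 10,000 training examples, a single device, no gradient accumulation, and warmup steps rounded up. The names `HYPERPARAMS` and `schedule_summary` are illustrative.

```python
import math

# Values copied from the table above.
HYPERPARAMS = {
    "learning_rate": 3e-5,
    "batch_size": 16,
    "epochs": 5,
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
    "max_seq_length": 128,
}


def schedule_summary(num_train_examples, hp=HYPERPARAMS):
    """Derive step counts from dataset size, batch size, epochs, warmup ratio."""
    steps_per_epoch = math.ceil(num_train_examples / hp["batch_size"])
    total_steps = steps_per_epoch * hp["epochs"]
    warmup_steps = math.ceil(total_steps * hp["warmup_ratio"])
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# 10_000 is an assumed round number, not an exact count from this card.
print(schedule_summary(10_000))
```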
+## Intended Use
+Extracting structured (payer, payee, amount) triples from:
+- Transaction histories for **netting and settlement computation** (canceling circular debts)
+- Accounting statements and ledger entries
+- Informal payment descriptions
+- Multi-party transactions
+This supports tasks where an agent observes a history of transactions (e.g. "A supplied $X for B") between multiple parties and must compute the final settlement after netting.
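Once triples are extracted, netting itself is ordinary arithmetic. The sketch below is one possible approach, not a method described by this model's authors. It treats each `(payer, payee, amount)` triple as an obligation from payer to payee, computes each party's net position, and then pairs debtors with creditors greedily. Amounts are integers (for example cents) to avoid floating-point drift, and the function name `settle` is hypothetical.

```python
from collections import defaultdict


def settle(transactions):
    """Net (payer, payee, amount) triples into a list of final transfers."""
    net = defaultdict(int)
    for payer, payee, amount in transactions:
        net[payer] -= amount   # payer owes this much
        net[payee] += amount   # payee is owed this much

    debtors = sorted((p, -v) for p, v in net.items() if v < 0)
    creditors = sorted((p, v) for p, v in net.items() if v > 0)

    transfers, i, j = [], 0, 0
    while i < len(debtors) and j < len(creditors):
        debtor, owed = debtors[i]
        creditor, due = creditors[j]
        pay = min(owed, due)
        transfers.append((debtor, creditor, pay))
        debtors[i] = (debtor, owed - pay)
        creditors[j] = (creditor, due - pay)
        if debtors[i][1] == 0:
            i += 1
        if creditors[j][1] == 0:
            j += 1
    return transfers


# A circular debt cancels completely.
print(settle([("A", "B", 100), ("B", "C", 100), ("C", "A", 100)]))
# A partial chain collapses into direct transfers.
print(settle([("A", "B", 100), ("B", "C", 60)]))
```

Greedy pairing yields a valid settlement but not necessarily the one with the fewest transfers.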
+## Limitations
+- Trained primarily on English text
+- Works best on short transaction sentences; inputs longer than the 128-token maximum sequence length need to be split into chunks
+- Bare numbers without currency context (e.g. "20" at end of sentence) may not always be tagged as AMOUNT
+- Does not distinguish between different currencies in the same text
+- PAYER/PAYEE distinction relies on contextual cues (verbs like "paid", "owes", "received"), so ambiguous sentences may be misclassified
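For inputs that exceed the 128-token limit, one common workaround is to split the token sequence into overlapping windows and run the model on each. The sketch below is an assumption-laden illustration rather than a recommended recipe. The function name `chunk_token_ids` and the `stride` of 32 are arbitrary choices, and two positions per window are reserved for `[CLS]` and `[SEP]`.

```python
def chunk_token_ids(token_ids, max_len=128, stride=32):
    """Split token ids into overlapping windows of at most max_len - 2 tokens.

    Returns a list of (start_index, window) pairs so that predictions can be
    mapped back to positions in the original sequence.
    """
    body = max_len - 2  # leave room for [CLS] and [SEP]
    if not 0 <= stride < body:
        raise ValueError("stride must be non-negative and smaller than the window")
    step = body - stride
    chunks, start = [], 0
    while True:
        chunks.append((start, token_ids[start:start + body]))
        if start + body >= len(token_ids):
            break
        start += step
    return chunks


windows = chunk_token_ids(list(range(300)))
print([(start, len(window)) for start, window in windows])
```

Fast tokenizers in Transformers can produce similar windows directly via `truncation=True`, `max_length`, `stride`, and `return_overflowing_tokens=True`.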
+## Citation
+If you use this model, please cite the BUSTER dataset, which contributed the majority of the training data:
+```bibtex
+@inproceedings{zugarini-etal-2023-buster,
+    title = "{BUSTER}: a {``}{BUS}iness Transaction Entity Recognition{''} dataset",
+    author = "Zugarini, Andrea and Zamai, Andrew and Ernandes, Marco and Rigutini, Leonardo",
+    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track",
+    year = "2023",
+    pages = "605--611",
+}
+```