# Accounting NER: PAYER / PAYEE / AMOUNT
A fine-tuned BERT model for extracting payer, payee, and amount entities from transaction text. Designed for accounting reconciliation and netting tasks where an agent must parse transaction histories and compute final settlements between parties.
## Entity Types

| Label | Description | Example span |
|---|---|---|
| PAYER | The party sending/owing money | "Alice" in "Alice paid $500 to Bob" |
| PAYEE | The party receiving money | "Bob" in "Alice paid $500 to Bob" |
| AMOUNT | Monetary amounts | "$500" in "Alice paid $500 to Bob" |
## Performance
Evaluated on a held-out validation set (2,385 examples):
| Entity | Precision | Recall | F1 |
|---|---|---|---|
| AMOUNT | 0.96 | 0.98 | 0.97 |
| PAYEE | 0.89 | 0.91 | 0.90 |
| PAYER | 0.88 | 0.91 | 0.89 |
| Overall | 0.89 | 0.92 | 0.90 |
## Usage

### Python (Transformers)

```python
from transformers import pipeline

ner = pipeline("ner", model="Minns-ai/accounting-ner", aggregation_strategy="simple")
results = ner("Alice paid $500 to Bob for dinner.")
```
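With `aggregation_strategy="simple"`, the pipeline returns one dict per merged entity (keys include `entity_group`, `score`, `word`, `start`, `end`). A minimal sketch of grouping that output into payer/payee/amount lists, using a hand-written sample result rather than an actual model run:

```python
# Assumed sample pipeline output (illustrative; not produced by running the model)
results = [
    {"entity_group": "PAYER", "score": 0.99, "word": "alice", "start": 0, "end": 5},
    {"entity_group": "AMOUNT", "score": 0.98, "word": "$500", "start": 11, "end": 15},
    {"entity_group": "PAYEE", "score": 0.99, "word": "bob", "start": 19, "end": 22},
]

def by_label(entities, label):
    """Collect entity texts for one label, in reading order."""
    return [e["word"] for e in entities if e["entity_group"] == label]

payers = by_label(results, "PAYER")
payees = by_label(results, "PAYEE")
amounts = by_label(results, "AMOUNT")
```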
### ONNX Runtime

The `onnx/` directory contains `model.onnx` and `tokenizer.json` for deployment with ONNX Runtime (e.g. in a Rust or C++ service).

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("onnx/tokenizer.json")
session = ort.InferenceSession("onnx/model.onnx")

encoding = tokenizer.encode("Sam supplied $1,200 for Grace.")
# ONNX Runtime expects numpy arrays, not Python lists; BERT inputs are int64.
outputs = session.run(None, {
    "input_ids": np.array([encoding.ids], dtype=np.int64),
    "attention_mask": np.array([encoding.attention_mask], dtype=np.int64),
    "token_type_ids": np.array([encoding.type_ids], dtype=np.int64),
})
logits = outputs[0]  # shape: (batch, seq_len, num_labels)
```
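The session returns raw per-token logits; mapping them back to BIO tags requires the model's `id2label` table. A hedged sketch with an assumed label order (verify against the model's `config.json`) and tiny hand-made logits in place of a real forward pass:

```python
import numpy as np

# Assumed label order; check the exported model's config.json for the real mapping.
id2label = {0: "O", 1: "B-PAYER", 2: "I-PAYER", 3: "B-PAYEE",
            4: "I-PAYEE", 5: "B-AMOUNT", 6: "I-AMOUNT"}

def decode(logits):
    """Map per-token logits (seq_len, num_labels) to BIO label strings."""
    ids = logits.argmax(axis=-1)
    return [id2label[int(i)] for i in ids]

# Tiny illustrative logits for a 3-token sequence (not real model output)
logits = np.array([[0.1, 2.0, 0, 0, 0, 0, 0],   # argmax -> B-PAYER
                   [3.0, 0.0, 0, 0, 0, 0, 0],   # argmax -> O
                   [0.0, 0.0, 0, 0, 0, 1.5, 0]])  # argmax -> B-AMOUNT
labels = decode(logits)
```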
Example Output
{
"model": "bert-base-NER-onnx",
"entities": [
{"label": "PAYER", "start_offset": 0, "end_offset": 4, "confidence": 0.9996, "text": "anna"},
{"label": "PAYEE", "start_offset": 11, "end_offset": 15, "confidence": 0.9996, "text": "john"},
{"label": "PAYER", "start_offset": 35, "end_offset": 39, "confidence": 0.9991, "text": "tine"},
{"label": "PAYEE", "start_offset": 45, "end_offset": 49, "confidence": 0.9996, "text": "john"},
{"label": "PAYEE", "start_offset": 54, "end_offset": 58, "confidence": 0.9996, "text": "anna"}
]
}
Input: "anna payed john for the cinema but tine owes john and anna for covering her 20"
## Training

### Base Model

`bert-base-uncased` fine-tuned for token classification with 7 labels (BIO format):

`O`, `B-PAYER`, `I-PAYER`, `B-PAYEE`, `I-PAYEE`, `B-AMOUNT`, `I-AMOUNT`
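For illustration, a hand-constructed BIO annotation (not drawn from the training set) of a simple sentence under this scheme:

```python
# Hand-made example: one BIO tag per token; multi-token entities
# start with a B- tag and continue with I- tags.
tokens = ["alice", "paid", "$", "500", "to", "bob", "."]
labels = ["B-PAYER", "O", "B-AMOUNT", "I-AMOUNT", "O", "B-PAYEE", "O"]
assert len(tokens) == len(labels)  # token classification needs aligned tags
```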
### Training Data (~10K examples from three sources)

1. **expertai/BUSTER** (9,861 examples): Business transaction documents from SEC EDGAR filings. Entity types remapped:
   - `Parties.BUYING_COMPANY` → `PAYER`
   - `Parties.SELLING_COMPANY` → `PAYEE`
   - `Generic_Info.ANNUAL_REVENUES` → `AMOUNT`

   Licensed under Apache 2.0.

2. **Kaggle Invoice NER** (64 examples): Invoice documents with extracted fields (`TOTAL_AMOUNT`, `DUE_AMOUNT`, `ACCOUNT_NAME`) converted to token-level BIO annotations.
3. **Synthetic Data** (2,400 examples): Programmatically generated transaction sentences covering patterns underrepresented in the real datasets:
   - Formal ledger entries: "Sam supplied $1,200 for Grace."
   - Informal/casual language: "Leo payed Lucy 500 for cleaning."
   - Misspellings: "payed" instead of "paid"
   - Compound payers/payees: "Tom and Lucy paid Mike $200."
   - Missing amounts: "Alice covered Bob for dinner."
   - Multi-transaction sentences with conjunctions: "Anna paid John $50 but Tine owes John and Anna for covering her 20."
   - Transaction histories (3-8 concatenated transactions)

The synthetic data generator (`training/data/create_dataset.py`) uses 30+ templates, 60+ party names, and 40+ transaction reasons to produce diverse examples.
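The generator script itself is not reproduced here; a hypothetical miniature with made-up templates, names, and reasons conveys the idea:

```python
import random

# Hypothetical miniature of the generator; the real script uses
# 30+ templates, 60+ names, and 40+ reasons.
TEMPLATES = [
    "{payer} paid {payee} ${amount} for {reason}.",
    "{payer} supplied ${amount} for {payee}.",
    "{payer} payed {payee} {amount} for {reason}.",  # intentional misspelling
]
NAMES = ["Alice", "Bob", "Grace", "Sam", "Leo", "Lucy"]
REASONS = ["dinner", "cleaning", "the cinema", "rent"]

def make_example(rng):
    """Fill a random template with random parties, amount, and reason."""
    payer, payee = rng.sample(NAMES, 2)
    return rng.choice(TEMPLATES).format(
        payer=payer, payee=payee,
        amount=rng.randrange(10, 2000), reason=rng.choice(REASONS))

rng = random.Random(0)  # seeded for reproducibility
sentence = make_example(rng)
```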
### Hyperparameters
| Parameter | Value |
|---|---|
| Learning rate | 3e-5 |
| Batch size | 16 |
| Epochs | 5 |
| Warmup ratio | 0.1 |
| Weight decay | 0.01 |
| Max sequence length | 128 |
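The table above maps onto Hugging Face `TrainingArguments` roughly as follows (a sketch; `output_dir` is a placeholder, and max sequence length is applied at tokenization time, not here):

```python
from transformers import TrainingArguments

# Sketch of the hyperparameters table; output_dir is a placeholder path.
args = TrainingArguments(
    output_dir="accounting-ner",
    learning_rate=3e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    warmup_ratio=0.1,
    weight_decay=0.01,
)
# Max sequence length (128) is enforced by the tokenizer, e.g.
# tokenizer(text, truncation=True, max_length=128)
```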
## Intended Use
Extracting structured (payer, payee, amount) triples from:
- Transaction histories for netting and settlement computation (canceling circular debts)
- Accounting statements and ledger entries
- Informal payment descriptions
- Multi-party transactions
This supports tasks where an agent observes a history of transactions (e.g. "A supplied $X for B") between multiple parties and must compute the final settlement after netting.
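As a sketch of that downstream step (not part of the model itself), extracted (payer, payee, amount) triples can be netted into per-party balances, under the convention that a positive balance means the party received more than it sent:

```python
from collections import defaultdict

# Illustrative triples, as if extracted by the NER model
transactions = [
    ("alice", "bob", 50.0),
    ("bob", "carol", 30.0),
    ("carol", "alice", 20.0),
]

def net_flow(triples):
    """Net balance per party: money received minus money sent.
    Positive = net receiver; negative = net sender. Sums to zero."""
    balance = defaultdict(float)
    for payer, payee, amount in triples:
        balance[payer] -= amount
        balance[payee] += amount
    return dict(balance)

balances = net_flow(transactions)
```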
## Limitations
- Trained primarily on English text
- Best on short transaction sentences; long documents may need chunking (max 128 tokens)
- Bare numbers without currency context (e.g. "20" at end of sentence) may not always be tagged as AMOUNT
- Does not distinguish between different currencies in the same text
- PAYER/PAYEE distinction relies on contextual cues (verbs like "paid", "owes", "received") — ambiguous sentences may be misclassified
## Citation

If you use this model, please cite the BUSTER dataset, which contributed the majority of the training data:

```bibtex
@inproceedings{zugarini-etal-2023-buster,
    title = "{BUSTER}: a {``}{BUS}iness Transaction Entity Recognition{''} dataset",
    author = "Zugarini, Andrea and Zamai, Andrew and Ernandes, Marco and Rigutini, Leonardo",
    booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track",
    year = "2023",
    pages = "605--611",
}
```