JonnyJF committed on
Commit c7a33ab · verified · 1 Parent(s): 52a1e00

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +164 -3
README.md CHANGED
@@ -1,3 +1,164 @@
- ---
- license: apache-2.0
- ---
+ ---
+ language: en
+ license: mit
+ library_name: transformers
+ tags:
+ - ner
+ - token-classification
+ - accounting
+ - finance
+ - bert
+ - onnx
+ - netting
+ - settlement
+ datasets:
+ - expertai/BUSTER
+ - nikitpatel/invoice-ner-dataset
+ pipeline_tag: token-classification
+ base_model: google-bert/bert-base-uncased
+ ---
+
+ # Accounting NER: PAYER / PAYEE / AMOUNT
+
+ A fine-tuned BERT model that extracts **payer**, **payee**, and **amount** entities from transaction text. It is designed for accounting reconciliation and netting tasks in which an agent must parse transaction histories and compute final settlements between parties.
+
+ ## Entity Types
+
+ | Label | Description | Example |
+ |-------|-------------|---------|
+ | `PAYER` | The party sending or owing money | "**Alice** paid $500 to Bob" |
+ | `PAYEE` | The party receiving money | "Alice paid $500 to **Bob**" |
+ | `AMOUNT` | A monetary amount | "Alice paid **$500** to Bob" |
+
+ ## Performance
+
+ Evaluated on a held-out validation set (2,385 examples):
+
+ | Entity | Precision | Recall | F1 |
+ |--------|-----------|--------|----|
+ | AMOUNT | 0.96 | 0.98 | 0.97 |
+ | PAYEE | 0.89 | 0.91 | 0.90 |
+ | PAYER | 0.88 | 0.91 | 0.89 |
+ | **Overall** | **0.89** | **0.92** | **0.90** |
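As a quick consistency check, the F1 column can be recomputed from the rounded precision and recall values using F1 = 2PR / (P + R). The sketch below uses only the numbers in the table; because those inputs are already rounded, a mismatch in the last digit would be possible in general.

```python
# Precision and recall copied from the table above (already rounded to 2 decimals).
scores = {
    "AMOUNT": (0.96, 0.98),
    "PAYEE": (0.89, 0.91),
    "PAYER": (0.88, 0.91),
    "Overall": (0.89, 0.92),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {name: round(f1(p, r), 2) for name, (p, r) in scores.items()}
for name, value in recomputed.items():
    print(f"{name}: F1 = {value:.2f}")
```

For these four rows the recomputed values agree with the reported F1 column.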
+
+ ## Usage
+
+ ### Python (Transformers)
+
+ ```python
+ from transformers import pipeline
+
+ # aggregation_strategy="simple" groups consecutive tokens of the same entity into one span
+ ner = pipeline("ner", model="Minns-ai/accounting-ner", aggregation_strategy="simple")
+ results = ner("Alice paid $500 to Bob for dinner.")
+ ```
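With `aggregation_strategy="simple"`, the pipeline returns a list of dicts with the keys `entity_group`, `score`, `word`, `start`, and `end`. A minimal sketch of grouping such results by label follows; the `sample_results` list is hand-written for illustration rather than real model output, and `group_by_label` and `min_score` are hypothetical names.

```python
from collections import defaultdict


def group_by_label(text, results, min_score=0.5):
    """Collect entity texts per label, skipping low-confidence spans."""
    grouped = defaultdict(list)
    for item in results:
        if item["score"] >= min_score:
            # Slice the original text so casing and spacing are preserved.
            grouped[item["entity_group"]].append(text[item["start"]:item["end"]])
    return dict(grouped)


text = "Alice paid $500 to Bob for dinner."
# Hand-written stand-in for the pipeline's `results`; not actual model output.
sample_results = [
    {"entity_group": "PAYER", "score": 0.99, "word": "alice", "start": 0, "end": 5},
    {"entity_group": "AMOUNT", "score": 0.98, "word": "$ 500", "start": 11, "end": 15},
    {"entity_group": "PAYEE", "score": 0.99, "word": "bob", "start": 19, "end": 22},
]
grouped = group_by_label(text, sample_results)
print(grouped)
```

Slicing by `start` and `end` is used instead of `word` because the uncased tokenizer lowercases its input.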
+
+ ### ONNX Runtime
+
+ The `onnx/` directory contains `model.onnx` and `tokenizer.json` for deployment with ONNX Runtime (e.g. in a Rust or C++ service).
+
+ ```python
+ import numpy as np
+ import onnxruntime as ort
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_file("onnx/tokenizer.json")
+ session = ort.InferenceSession("onnx/model.onnx")
+
+ encoding = tokenizer.encode("Sam supplied $1,200 for Grace.")
+ # Pass explicit int64 arrays with a leading batch dimension of 1,
+ # the input type BERT ONNX exports typically expect.
+ outputs = session.run(None, {
+     "input_ids": np.array([encoding.ids], dtype=np.int64),
+     "attention_mask": np.array([encoding.attention_mask], dtype=np.int64),
+     "token_type_ids": np.array([encoding.type_ids], dtype=np.int64),
+ })
+ ```
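The raw ONNX output still has to be turned into labels. The sketch below assumes that `outputs[0]` holds logits of shape `(1, seq_len, 7)` and that label ids follow the order listed under Base Model; both are assumptions, and the authoritative mapping is the `id2label` entry in the model's config. The logits here are made up so the example runs without the model.

```python
import math

# Assumed label order; the real mapping lives in the model's config (id2label).
ID2LABEL = ["O", "B-PAYER", "I-PAYER", "B-PAYEE", "I-PAYEE", "B-AMOUNT", "I-AMOUNT"]


def softmax(row):
    """Numerically stable softmax over one token's logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def decode_tokens(logits):
    """Return a (label, confidence) pair for each token position."""
    decoded = []
    for row in logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        decoded.append((ID2LABEL[best], probs[best]))
    return decoded


# Made-up logits for three tokens, standing in for outputs[0][0].tolist().
fake_logits = [
    [0.1, 6.0, 0.2, 0.1, 0.0, 0.3, 0.1],
    [5.0, 0.2, 0.1, 0.3, 0.1, 0.2, 0.0],
    [0.0, 0.1, 0.2, 0.1, 0.3, 7.0, 0.4],
]
decoded = decode_tokens(fake_logits)
for label, confidence in decoded:
    print(label, round(confidence, 3))
```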
+
+ ### Example Output
+
+ ```json
+ {
+   "model": "bert-base-NER-onnx",
+   "entities": [
+     {"label": "PAYER", "start_offset": 0, "end_offset": 4, "confidence": 0.9996, "text": "anna"},
+     {"label": "PAYEE", "start_offset": 11, "end_offset": 15, "confidence": 0.9996, "text": "john"},
+     {"label": "PAYER", "start_offset": 35, "end_offset": 39, "confidence": 0.9991, "text": "tine"},
+     {"label": "PAYEE", "start_offset": 45, "end_offset": 49, "confidence": 0.9996, "text": "john"},
+     {"label": "PAYEE", "start_offset": 54, "end_offset": 58, "confidence": 0.9996, "text": "anna"}
+   ]
+ }
+ ```
+
+ Input: `"anna payed john for the cinema but tine owes john and anna for covering her 20"`
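The offsets in this example behave like 0-based, end-exclusive character positions into the input string. This reading is inferred from the values shown and is not stated by the model card. A small check using only the input and the offsets above:

```python
text = "anna payed john for the cinema but tine owes john and anna for covering her 20"

# (label, start_offset, end_offset, text) copied from the example output.
entities = [
    ("PAYER", 0, 4, "anna"),
    ("PAYEE", 11, 15, "john"),
    ("PAYER", 35, 39, "tine"),
    ("PAYEE", 45, 49, "john"),
    ("PAYEE", 54, 58, "anna"),
]

# Slice the input with each offset pair and compare against the reported text.
recovered = [text[start:end] for _, start, end, _ in entities]
for (label, start, end, expected), got in zip(entities, recovered):
    print(label, repr(got), got == expected)
```

The trailing `20` has no entry in the output, so no `AMOUNT` entity appears for this input.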
+
+ ## Training
+
+ ### Base Model
+
+ `bert-base-uncased` fine-tuned for token classification with 7 labels (BIO format):
+ `O`, `B-PAYER`, `I-PAYER`, `B-PAYEE`, `I-PAYEE`, `B-AMOUNT`, `I-AMOUNT`
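In the BIO scheme, a `B-` tag opens an entity, `I-` tags continue it, and `O` marks tokens outside any entity. A minimal sketch of merging word-level BIO tags into entity spans follows; the tags in the example are hand-assigned for illustration rather than model predictions, and `bio_to_spans` is a hypothetical helper.

```python
def bio_to_spans(tokens, tags):
    """Merge BIO-tagged tokens into (label, text) spans."""
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label == current_label:
            # Continuation of the currently open span.
            current_tokens.append(token)
            continue
        # Any other tag closes the open span, if there is one.
        if current_label is not None:
            spans.append((current_label, " ".join(current_tokens)))
        if prefix in ("B", "I"):
            # A stray I- tag without a matching B- is treated as a new span.
            current_label, current_tokens = label, [token]
        else:
            current_label, current_tokens = None, []
    if current_label is not None:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


tokens = ["sam", "supplied", "$", "1", ",", "200", "for", "grace"]
tags = ["B-PAYER", "O", "B-AMOUNT", "I-AMOUNT", "I-AMOUNT", "I-AMOUNT", "O", "B-PAYEE"]
spans = bio_to_spans(tokens, tags)
print(spans)
```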
+
+ ### Training Data (~10K examples from three sources)
+
+ **1. [expertai/BUSTER](https://huggingface.co/datasets/expertai/BUSTER) (9,861 examples)**
+ Business transaction documents from SEC EDGAR filings. Entity types were remapped as follows:
+ - `Parties.BUYING_COMPANY` -> `PAYER`
+ - `Parties.SELLING_COMPANY` -> `PAYEE`
+ - `Generic_Info.ANNUAL_REVENUES` -> `AMOUNT`
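The remapping above can be sketched as a lookup over BIO tags. This assumes the source tags look like `B-Parties.BUYING_COMPANY` and that every tag outside the mapping becomes `O`; neither detail is confirmed by the model card, and `remap_tag` is a hypothetical helper.

```python
# Mapping taken from the list above; all other source tags are assumed to become "O".
BUSTER_TO_TARGET = {
    "Parties.BUYING_COMPANY": "PAYER",
    "Parties.SELLING_COMPANY": "PAYEE",
    "Generic_Info.ANNUAL_REVENUES": "AMOUNT",
}


def remap_tag(tag: str) -> str:
    """Convert one BIO tag such as 'B-Parties.BUYING_COMPANY' to the target scheme."""
    if tag == "O":
        return "O"
    prefix, _, source_label = tag.partition("-")
    target = BUSTER_TO_TARGET.get(source_label)
    return f"{prefix}-{target}" if target else "O"


# "SOME.OTHER_TAG" is a hypothetical label standing in for any unmapped entity type.
source_tags = ["B-Parties.BUYING_COMPANY", "I-Parties.BUYING_COMPANY", "O",
               "B-Generic_Info.ANNUAL_REVENUES", "B-SOME.OTHER_TAG"]
remapped = [remap_tag(tag) for tag in source_tags]
print(remapped)
```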
+
+ BUSTER is licensed under Apache 2.0.
+
+ **2. [Kaggle Invoice NER](https://www.kaggle.com/datasets/nikitpatel/invoice-ner-dataset) (64 examples)**
+ Invoice documents with extracted fields (`TOTAL_AMOUNT`, `DUE_AMOUNT`, `ACCOUNT_NAME`) converted to token-level BIO annotations.
+
+ **3. Synthetic Data (2,400 examples)**
+ Programmatically generated transaction sentences that cover patterns underrepresented in the real datasets:
+ - Formal ledger entries: `"Sam supplied $1,200 for Grace."`
+ - Informal/casual language: `"Leo payed Lucy 500 for cleaning."`
+ - Misspellings: `"payed"` instead of `"paid"`
+ - Compound payers/payees: `"Tom and Lucy paid Mike $200."`
+ - Missing amounts: `"Alice covered Bob for dinner."`
+ - Multi-transaction sentences with conjunctions: `"Anna paid John $50 but Tine owes John and Anna for covering her 20."`
+ - Transaction histories (3-8 concatenated transactions)
+
+ The synthetic data generator (`training/data/create_dataset.py`) uses 30+ templates, 60+ party names, and 40+ transaction reasons to produce diverse examples.
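To illustrate the general idea of template-based generation with BIO labels, here is a minimal sketch. It is not the contents of `create_dataset.py`: the templates, names, amounts, and reasons below are hypothetical, and labels are assigned per whitespace-separated word rather than per subword token.

```python
import random

# Hypothetical templates and vocab; not the actual contents of create_dataset.py.
TEMPLATES = [
    "{payer} paid {payee} {amount} for {reason} .",
    "{payer} payed {payee} {amount} for {reason} .",
    "{payer} supplied {amount} for {payee} .",
]
NAMES = ["Alice", "Bob", "Sam", "Grace", "Leo", "Lucy"]
AMOUNTS = ["$500", "$1,200", "20"]
REASONS = ["dinner", "cleaning", "the cinema"]
SLOT_LABELS = {"{payer}": "PAYER", "{payee}": "PAYEE", "{amount}": "AMOUNT"}


def generate_example(rng):
    """Fill one template and return parallel lists of words and BIO tags."""
    payer, payee = rng.sample(NAMES, 2)  # two distinct parties
    values = {"{payer}": payer, "{payee}": payee,
              "{amount}": rng.choice(AMOUNTS), "{reason}": rng.choice(REASONS)}
    tokens, tags = [], []
    for piece in rng.choice(TEMPLATES).split():
        words = values.get(piece, piece).split()
        label = SLOT_LABELS.get(piece)  # None for literal words and {reason}
        for i, word in enumerate(words):
            tokens.append(word)
            tags.append("O" if label is None else ("B-" if i == 0 else "I-") + label)
    return tokens, tags


rng = random.Random(0)
tokens, tags = generate_example(rng)
print(list(zip(tokens, tags)))
```

A real pipeline would also need to align these word-level tags with BERT's subword tokens before training.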
+
+ ### Hyperparameters
+
+ | Parameter | Value |
+ |-----------|-------|
+ | Learning rate | 3e-5 |
+ | Batch size | 16 |
+ | Epochs | 5 |
+ | Warmup ratio | 0.1 |
+ | Weight decay | 0.01 |
+ | Max sequence length | 128 |
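To give a feel for what these values imply, the following back-of-the-envelope calculation estimates the number of optimizer steps and warmup steps. It assumes roughly 10,000 training examples, a single device, no gradient accumulation, and that the warmup step count is rounded up; none of these details are stated in the model card.

```python
import math

# Values from the table above.
batch_size = 16
epochs = 5
warmup_ratio = 0.1

# Assumption: roughly 10,000 training examples, one device, no gradient accumulation.
num_examples = 10_000

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
# Assumption: the warmup step count is rounded up to a whole step.
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the learning rate would ramp up over about the first tenth of training before decaying.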
+
+ ## Intended Use
+
+ Extracting structured (payer, payee, amount) triples from:
+ - Transaction histories for **netting and settlement computation** (canceling circular debts)
+ - Accounting statements and ledger entries
+ - Informal payment descriptions
+ - Multi-party transactions
+
+ This supports tasks where an agent observes a history of transactions (e.g. "A supplied $X for B") between multiple parties and must compute the final settlement after netting.
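A minimal sketch of the downstream netting step is shown below, working on triples that have already been extracted. It assumes each triple means the payer advanced the amount on the payee's behalf, so the payee owes the payer; that sign convention, along with the helper names `parse_amount` and `net_positions`, is an assumption for illustration and not part of the model.

```python
from collections import defaultdict
from decimal import Decimal


def parse_amount(text: str) -> Decimal:
    """Turn strings such as '$1,200' into Decimal('1200')."""
    return Decimal(text.replace("$", "").replace(",", "").strip())


def net_positions(triples):
    """Net amount owed to each party (negative means the party owes money)."""
    balances = defaultdict(Decimal)
    for payer, payee, amount in triples:
        value = parse_amount(amount)
        balances[payer] += value  # the payer is owed this amount
        balances[payee] -= value  # the payee owes it
    return dict(balances)


# A circular chain of equal debts cancels out completely.
circular = [("A", "B", "$100"), ("B", "C", "$100"), ("C", "A", "$100")]
print(net_positions(circular))

# Opposing transactions between two parties reduce to a single net amount.
history = [("Sam", "Grace", "$1,200"), ("Grace", "Sam", "$200")]
print(net_positions(history))
```

`Decimal` is used instead of `float` so that currency amounts are not affected by binary rounding.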
+
+ ## Limitations
+
+ - Trained primarily on English text
+ - Works best on short transaction sentences; longer documents may need chunking (the maximum sequence length is 128 tokens)
+ - Bare numbers without currency context (e.g. "20" at the end of a sentence) may not always be tagged as AMOUNT
+ - Does not distinguish between different currencies in the same text
+ - The PAYER/PAYEE distinction relies on contextual cues (verbs like "paid", "owes", "received"); ambiguous sentences may be misclassified
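For the chunking mentioned above, one common approach is to split a long token sequence into overlapping windows that fit the 128-token limit. The sketch below is one such scheme and not something the model card prescribes; the `overlap` and `reserved` defaults are hypothetical, with `reserved` leaving room for the `[CLS]` and `[SEP]` tokens that BERT adds.

```python
def chunk_tokens(tokens, max_len=128, overlap=16, reserved=2):
    """Split a token list into overlapping windows that fit the model's limit."""
    window = max_len - reserved  # usable tokens per chunk
    if overlap >= window:
        raise ValueError("overlap must be smaller than the usable window")
    step = window - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the end
    return chunks


# Integers stand in for token ids of a 300-token document.
chunks = chunk_tokens(list(range(300)))
print([len(chunk) for chunk in chunks])
```

Entities that fall inside an overlap region may be predicted twice, so results from adjacent chunks would need de-duplication, for example by character offset.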
+
+ ## Citation
+
+ If you use this model, please cite the BUSTER dataset, which contributed the majority of the training data:
+
+ ```bibtex
+ @inproceedings{zugarini-etal-2023-buster,
+     title = "{BUSTER}: a {``}{BUS}iness Transaction Entity Recognition{''} dataset",
+     author = "Zugarini, Andrea and Zamai, Andrew and Ernandes, Marco and Rigutini, Leonardo",
+     booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track",
+     year = "2023",
+     pages = "605--611",
+ }
+ ```