|
|
--- |
|
|
language: |
|
|
- en |
|
|
license: mit |
|
|
library_name: transformers |
|
|
tags: |
|
|
- bert |
|
|
- nlp |
|
|
- ner |
|
|
base_model: |
|
|
- distilbert/distilbert-base-cased |
|
|
pipeline_tag: token-classification |
|
|
--- |
|
|
|
|
|
# Model Card for Bank-Transaction-NER-DistilBERT |
|
|
This model performs token-level Named Entity Recognition (NER) on bank transaction SMS and email messages, identifying entities such as amounts, dates, times, merchants, balances, account references, and reference IDs.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
## Model Details |
|
|
|
|
|
### Model Description |
|
|
This is a DistilBERT-based token classification model fine-tuned for extracting structured information from bank transaction messages. |
|
|
The model identifies entities such as transaction amounts, dates, times, merchant names, account references, and balances from unstructured text. |
|
|
|
|
|
- **Developed by:** Abhijit Das |
|
|
- **Model type:** Token Classification (Named Entity Recognition) |
|
|
- **Language(s):** English |
|
|
- **License:** MIT |
|
|
- **Finetuned from:** distilbert/distilbert-base-cased |
|
|
|
|
|
|
|
|
## Uses |
|
|
|
|
|
|
|
|
|
|
### Direct Use |
|
|
The model can be used to: |
|
|
- Extract entities from bank transaction SMS |
|
|
- Parse financial notification emails |
|
|
- Support expense tracking and personal finance applications |
|
|
- Generate structured data for downstream analytics |
|
|
|
|
|
|
|
|
## Label Schema |
|
|
|
|
|
The model predicts the following BIO-formatted labels: |
|
|
|
|
|
- **B-amount / I-amount** |
|
|
- **B-date / I-date** |
|
|
- **B-time / I-time** |
|
|
- **B-merchant / I-merchant** |
|
|
- **B-balance / I-balance** |
|
|
- **B-account / I-account** |
|
|
- **B-ref / I-ref** |
|
|
- **O** (Outside entity) |
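
For downstream use, per-token BIO predictions must be merged into entity spans. A minimal decoding sketch (not part of the model's released code; the token and label lists below are illustrative):

```python
def bio_to_spans(tokens, labels):
    """Merge BIO-tagged tokens into (entity_type, text) spans."""
    spans, current_type, current_toks = [], None, []
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):
            # A new entity starts; flush any entity in progress.
            if current_type:
                spans.append((current_type, " ".join(current_toks)))
            current_type, current_toks = lab[2:], [tok]
        elif lab.startswith("I-") and current_type == lab[2:]:
            # Continuation of the current entity.
            current_toks.append(tok)
        else:
            # "O" or an inconsistent I- tag ends the current entity.
            if current_type:
                spans.append((current_type, " ".join(current_toks)))
            current_type, current_toks = None, []
    if current_type:
        spans.append((current_type, " ".join(current_toks)))
    return spans

tokens = ["INR", "11025.97", "debited", "at", "Uber"]
labels = ["B-amount", "I-amount", "O", "O", "B-merchant"]
print(bio_to_spans(tokens, labels))
# [('amount', 'INR 11025.97'), ('merchant', 'Uber')]
```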
|
|
|
|
|
|
|
|
### Recommendations |
|
|
|
|
|
|
|
|
|
|
Users (both direct and downstream) should be made aware of the model's risks, biases, and limitations. In particular, because the model was trained on semi-synthetic messages modeled on Indian bank notifications, performance on real-world messages from other banks, locales, or formats may vary and should be validated before use.
|
|
|
|
|
## How to Get Started with the Model |
|
|
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "abhijitnumber1/bert-transaction-token-classifier"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
)

text = "INR 11025.97 debited from your account at Uber on 31.07.2020"
output = ner(text)
print(output)
```
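
The pipeline returns one dict per token with an `entity` label and a confidence `score`. A minimal post-processing sketch that drops low-confidence predictions (the sample output below mimics the pipeline's format but is invented, not actual model output):

```python
# Illustrative per-token pipeline output (scores are invented).
sample_output = [
    {"word": "INR", "entity": "B-amount", "score": 0.99},
    {"word": "11025.97", "entity": "I-amount", "score": 0.98},
    {"word": "Uber", "entity": "B-merchant", "score": 0.62},
]

# Keep only predictions above a chosen confidence threshold.
confident = [t for t in sample_output if t["score"] >= 0.9]
print([(t["word"], t["entity"]) for t in confident])
# [('INR', 'B-amount'), ('11025.97', 'I-amount')]
```

The `0.9` threshold is an arbitrary example; tune it against your own precision/recall requirements.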
|
|
|
|
|
|
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
This model was trained on semi-synthetic bank transaction messages written in English.
|
|
The data includes:

- Automatically generated bank SMS and email messages (randomly generated from templates based on real sample transaction messages)
- Multiple transaction types, including debit, credit, refund, and balance update
- Messages formatted to resemble Indian bank notifications

Each message is labeled dynamically at generation time.
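
A hypothetical sketch of this kind of template-based generation; the template, merchant names, and value ranges below are invented for illustration and are not the actual training templates:

```python
import random

# Hypothetical message template modeled on a real sample notification.
TEMPLATE = "INR {amount} {kind} from your account at {merchant} on {date}"

def make_message(rng):
    """Fill the template with randomly chosen values."""
    return TEMPLATE.format(
        amount=f"{rng.uniform(10, 50000):.2f}",
        kind=rng.choice(["debited", "credited"]),
        merchant=rng.choice(["Uber", "Amazon", "Swiggy"]),
        date=f"{rng.randint(1, 28):02d}.{rng.randint(1, 12):02d}.2020",
    )

print(make_message(random.Random(0)))
```

Because each entity comes from a known template slot, the generator can emit the BIO labels alongside the text, which is what makes dynamic labeling possible.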
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
### Training Procedure |
|
|
The model is based on DistilBERT and was fine-tuned to assign a BIO label to each token in a sentence (token-level Named Entity Recognition).
|
|
|
|
|
|
|
|
|
|
#### Preprocessing
|
|
|
|
|
Before training:

- Text was split into subword tokens using the DistilBERT tokenizer
- Word-level labels were aligned to the resulting tokens
- Special tokens such as `[CLS]` and `[SEP]` were ignored during training
- Padding tokens were excluded from the loss calculation
- Labels follow the BIO format (Beginning, Inside, Outside)
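
The alignment and masking steps above can be sketched as follows, assuming the `word_ids()` mapping that Hugging Face fast tokenizers provide; the label ids and subword split in the example are illustrative:

```python
IGNORE = -100  # label excluded from the loss ([CLS], [SEP], padding)

def align_labels(word_labels, word_ids):
    """Map word-level label ids onto subword tokens."""
    aligned, prev = [], None
    for wid in word_ids:
        if wid is None:
            aligned.append(IGNORE)        # special or padding token
        elif wid != prev:
            aligned.append(word_labels[wid])  # first subword keeps the label
        else:
            aligned.append(IGNORE)        # later subwords are masked
        prev = wid
    return aligned

# "INR 11025.97 debited" with labels [B-amount, I-amount, O] as ids [1, 2, 0];
# word_ids for: [CLS] INR 110 ##25.97 debited [SEP]
word_ids = [None, 0, 1, 1, 2, None]
print(align_labels([1, 2, 0], word_ids))
# [-100, 1, 2, -100, 0, -100]
```

Masking with `-100` matches the default `ignore_index` of PyTorch's cross-entropy loss, so the masked positions contribute nothing to training.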
|
|
|
|
|
|
|
|
|
|
|
#### Speeds, Sizes, Times
|
|
- Training time: about 15 minutes on a single CPU
- Model size: about 261 MB
|
|
|
|
|
|
|
|