---
language:
- en
license: mit
library_name: transformers
tags:
- bert
- nlp
- ner
base_model:
- distilbert/distilbert-base-cased
pipeline_tag: token-classification
---
# Model Card for Bank-Transaction-NER-DistilBERT
This model performs token-level Named Entity Recognition (NER) on bank transaction SMS and email messages, identifying entities such as AMOUNT, DATE, TIME, MERCHANT, ACCOUNT, BALANCE, and REFERENCE ID.
<!-- Provide a quick summary of what the model is/does. -->
## Model Details
### Model Description
This is a DistilBERT-based token classification model fine-tuned for extracting structured information from bank transaction messages.
The model identifies entities such as transaction amounts, dates, times, merchant names, account references, and balances from unstructured text.
- **Developed by:** Abhijit Das
- **Model type:** Token Classification (Named Entity Recognition)
- **Language(s):** English
- **License:** MIT
- **Finetuned from:** distilbert/distilbert-base-cased
## Uses
<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
### Direct Use
The model can be used to:
- Extract entities from bank transaction SMS
- Parse financial notification emails
- Support expense tracking and personal finance applications
- Generate structured data for downstream analytics
## Label Schema
The model predicts the following BIO-formatted labels:
- **B-amount / I-amount**
- **B-date / I-date**
- **B-time / I-time**
- **B-merchant / I-merchant**
- **B-balance / I-balance**
- **B-account / I-account**
- **B-ref / I-ref**
- **O** (Outside entity)
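
The schema above can be written out as an explicit label list with id mappings, which is the form a token-classification config needs. This is a minimal sketch: the label order, and therefore the integer ids, is an assumption for illustration and may differ from the `id2label` mapping stored in the released model's config.

```python
# Entity types taken from the label schema above.
ENTITY_TYPES = ["amount", "date", "time", "merchant", "balance", "account", "ref"]

# "O" first, then B-/I- pairs per entity type. This ordering is an assumption;
# inspect model.config.id2label for the ids the released model actually uses.
LABELS = ["O"] + [
    f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")
]

label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

print(len(LABELS))   # 7 entity types * 2 prefixes + "O"
print(LABELS[:3])
```

Seven entity types with two prefixes each, plus `O`, gives 15 labels in total.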
### Recommendations
<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned tokenizer and model from the Hugging Face Hub
model_name = "abhijitnumber1/bert-transaction-token-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Without an aggregation strategy, predictions are reported per sub-word token
ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer,
)

text = "INR 11025.97 debited from your account at Uber on 31.07.2020"
output = ner(text)
print(output)
```
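
For downstream use, the raw predictions usually need to be grouped into one record per message. The sketch below assumes the pipeline was created with `aggregation_strategy="simple"`, in which case each prediction is a dict with `entity_group`, `start`, and `end` keys. The helper `to_record` and the sample entities are hypothetical: the sample is hand-written to show the shape of the data and is not real output from this model.

```python
from collections import defaultdict


def to_record(text, entities):
    """Group aggregated token-classification output by entity type.

    `entities` is expected in the shape produced by a pipeline built with
    aggregation_strategy="simple": dicts with "entity_group", "start", "end".
    """
    record = defaultdict(list)
    for ent in entities:
        # Slice the original text by character offsets rather than relying on
        # ent["word"], which can carry tokenizer artifacts.
        value = text[ent["start"]:ent["end"]]
        record[ent["entity_group"].lower()].append(value)
    return dict(record)


text = "INR 11025.97 debited from your account at Uber on 31.07.2020"


def fake_entity(group, value):
    # Hand-written stand-in for one pipeline prediction, NOT a model result.
    start = text.index(value)
    return {"entity_group": group, "start": start, "end": start + len(value)}


sample_entities = [
    fake_entity("amount", "INR 11025.97"),
    fake_entity("merchant", "Uber"),
    fake_entity("date", "31.07.2020"),
]

record = to_record(text, sample_entities)
print(record)
```

Keeping a list per entity type lets a message with several amounts, for example a debit plus a remaining balance, be represented without overwriting values.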
## Training Details
### Training Data
This model was trained on semi-synthetic bank transaction messages written in English.
The data includes:
- Automatically generated bank SMS and email messages (randomly generated from a set of real sample transaction messages)
- Different transaction types such as debit, credit, refund, and balance update
- Messages formatted like Indian bank notifications

Each message is labeled dynamically as it is generated.
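
One way such dynamic labeling can work is to build each message from typed segments, so the BIO labels fall out of the generation step itself. The following is a minimal sketch, not the generator actually used for this model: the template, the merchant list, the value ranges, and the choice to tag the currency code as part of the amount are all assumptions for illustration.

```python
import random


def make_example(rng):
    """Build one synthetic debit message with word-level BIO labels."""
    amount = f"{rng.uniform(10, 50000):.2f}"
    merchant = rng.choice(["Uber", "Amazon", "Swiggy"])  # hypothetical merchants
    date = f"{rng.randint(1, 28):02d}.{rng.randint(1, 12):02d}.{rng.randint(2019, 2024)}"

    # Hypothetical template: (text, entity type or None for plain words).
    segments = [
        ("INR " + amount, "amount"),  # assumption: currency code is part of amount
        ("debited from your account at", None),
        (merchant, "merchant"),
        ("on", None),
        (date, "date"),
    ]

    words, labels = [], []
    for text, etype in segments:
        for i, word in enumerate(text.split()):
            words.append(word)
            if etype is None:
                labels.append("O")
            else:
                # First word of an entity gets B-, the rest get I-.
                labels.append(("B-" if i == 0 else "I-") + etype)
    return words, labels


words, labels = make_example(random.Random(0))
for word, label in zip(words, labels):
    print(f"{word}\t{label}")
```

Because the labels are derived from the same segments that produce the text, they stay consistent with the message regardless of the random values drawn.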
<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
### Training Procedure
The model is based on DistilBERT and was fine-tuned for Named Entity Recognition, assigning a BIO label to each token in a message.
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
#### Preprocessing
Before training:
- Text was split into tokens using the DistilBERT tokenizer
- Labels were aligned with the tokens produced by the tokenizer
- Special tokens such as [CLS] and [SEP] were ignored during training
- Padding tokens were excluded from the loss calculation
- Labels follow the BIO format (Beginning, Inside, Outside)
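
The alignment and masking steps above can be sketched as a small function over the `word_ids()` list that fast tokenizers in `transformers` return, where `None` marks special and padding tokens. This is an illustration under assumptions, not the training code for this model: the helper `align_labels` is hypothetical, using `-100` as the ignored label id relies on the default `ignore_index` of PyTorch's cross-entropy loss, and labeling continuation sub-words with `I-` tags is one common choice (another is to mask them as well).

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def align_labels(word_ids, word_labels, label2id):
    """Expand word-level BIO labels to sub-word tokens.

    word_ids: one entry per token, in the form returned by a fast tokenizer's
              BatchEncoding.word_ids(); None marks [CLS], [SEP] and padding.
    word_labels: one BIO label string per original word.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special or padding token: excluded from the loss.
            aligned.append(IGNORE_INDEX)
        else:
            label = word_labels[word_id]
            if word_id == previous and label.startswith("B-"):
                # Continuation piece of a word that begins an entity.
                label = "I-" + label[2:]
            aligned.append(label2id[label])
        previous = word_id
    return aligned


# Hypothetical tokenization of "INR 11025.97 debited" into 5 sub-word tokens,
# wrapped in [CLS] ... [SEP] and followed by one padding token.
word_ids = [None, 0, 1, 1, 1, 2, None, None]
word_labels = ["B-amount", "I-amount", "O"]
label2id = {"O": 0, "B-amount": 1, "I-amount": 2}

print(align_labels(word_ids, word_labels, label2id))
```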
#### Speeds, Sizes, Times
- Training time: around 15 minutes on one CPU
- Model size: about 261 MB
<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->