---
language:
- en
license: mit
library_name: transformers
tags:
- bert
- nlp
- ner
base_model:
- distilbert/distilbert-base-cased
pipeline_tag: token-classification
---

# Model Card for Bank-Transaction-NER-DistilBERT

This model performs token-level Named Entity Recognition (NER) on bank transaction SMS and email messages, identifying entities such as AMOUNT, DATE, TIME, MERCHANT, ACCOUNT, and REFERENCE IDs.

## Model Details

### Model Description

This is a DistilBERT-based token classification model fine-tuned for extracting structured information from bank transaction messages. The model identifies entities such as transaction amounts, dates, times, merchant names, account references, and balances from unstructured text.

- **Developed by:** Abhijit Das
- **Model type:** Token Classification (Named Entity Recognition)
- **Language(s):** English
- **License:** MIT
- **Finetuned from:** distilbert/distilbert-base-cased

## Uses

### Direct Use

The model can be used to:

- Extract entities from bank transaction SMS
- Parse financial notification emails
- Support expense tracking and personal finance applications
- Generate structured data for downstream analytics

## Label Schema

The model predicts the following BIO-formatted labels:

- **B-amount / I-amount**
- **B-date / I-date**
- **B-time / I-time**
- **B-merchant / I-merchant**
- **B-balance / I-balance**
- **B-account / I-account**
- **B-ref / I-ref**
- **O** (Outside entity)

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

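For downstream use, the BIO labels listed in the label schema usually need to be merged back into whole entities. The following is a minimal sketch of one way to do that, not the author's post-processing code. The function name `bio_to_spans` is hypothetical, and the tokens and tags in the example are hand-written for illustration rather than actual model output.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (entity_type, text) pairs.

    Hypothetical helper: an I- tag whose type differs from the open
    entity is treated leniently as the start of a new entity.
    """
    spans: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the entity that is currently open, if any.
        if current_tokens:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        prefix, _, entity = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity != current_type):
            flush()
            current_type, current_tokens = entity, [token]
        elif prefix == "I":
            current_tokens.append(token)
        else:  # "O": outside any entity
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


# Illustrative input only; real tags come from the model's predictions.
tokens = ["INR", "11025.97", "debited", "at", "Uber", "on", "31.07.2020"]
tags = ["B-amount", "I-amount", "O", "O", "B-merchant", "O", "B-date"]
print(bio_to_spans(tokens, tags))
```

The sketch joins tokens with spaces, which is adequate for word-level tokens. For subword tokens it is usually better to rebuild the text from character offsets instead.
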
## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = "abhijitnumber1/bert-transaction-token-classifier"

# Load the fine-tuned tokenizer and token-classification model from the Hub.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name)

# Wrap both in a token-classification (NER) pipeline.
ner = pipeline(
    "token-classification",
    model=model,
    tokenizer=tokenizer
)

text = "INR 11025.97 debited from your account at Uber on 31.07.2020"

# Returns a list of dicts with the predicted label, score, and character offsets.
output = ner(text)
print(output)
```

## Training Details

### Training Data

This model was trained on semi-synthetic bank transaction messages written in English. The data includes:

- Automatically generated bank SMS and email messages (randomly generated from real sample transaction messages)
- Different transaction types such as debit, credit, refund, and balance update
- Messages formatted like Indian bank notifications

Each message is labeled dynamically.

### Training Procedure

The model is based on DistilBERT and was trained to label each word in a sentence (Named Entity Recognition).

#### Preprocessing

Before training:

- Text was split into tokens using the DistilBERT tokenizer
- Labels were matched to each token
- Special tokens like [CLS] and [SEP] were ignored during training
- Padding tokens were excluded from loss calculation
- Labels follow the BIO format (Beginning, Inside, Outside)

#### Speeds, Sizes, Times

- Training time: Around 15 minutes on one CPU
- Model size: About 261 MB
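
The label-matching step listed under Preprocessing can be sketched as follows. This is an illustrative reconstruction, not the training code. It assumes word-level labels are aligned through the word indices that fast Hugging Face tokenizers expose via `word_ids()`, and that ignored positions get the label `-100`, which is the default `ignore_index` of PyTorch's cross-entropy loss. How continuation subwords were labeled is not stated in this card, so the `label_all_subwords` switch, the function name `align_labels`, and the label ids in the example are all hypothetical.

```python
from typing import List, Optional

# Default ignore_index of torch.nn.CrossEntropyLoss; positions with this
# label do not contribute to the loss.
IGNORE_INDEX = -100


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[int],
    label_all_subwords: bool = False,
) -> List[int]:
    """Expand word-level label ids to token-level label ids.

    word_ids holds one entry per token: the index of the source word, or
    None for special tokens ([CLS], [SEP]) and padding.
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            # Special or padding token: excluded from the loss.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First subword of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Continuation subword: policy is an assumption (see lead-in).
            aligned.append(word_labels[word_id] if label_all_subwords else IGNORE_INDEX)
        previous = word_id
    return aligned


# Hand-written example: [CLS], word 0, word 1 (two subwords), word 2, [SEP], [PAD]
example_word_ids = [None, 0, 1, 1, 2, None, None]
example_word_labels = [1, 2, 0]  # hypothetical label ids
print(align_labels(example_word_ids, example_word_labels))
```

In practice, `word_ids` would come from calling the tokenizer with `is_split_into_words=True` and reading `encoding.word_ids()`.
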