---
language:
- en
- id
- my
license: apache-2.0
tags:
- receipt-parsing
- information-extraction
- bart
- ocr
- text-to-text
datasets:
- dhiaznaidi/receiptdatasetssd300v2
library_name: transformers
pipeline_tag: text2text-generation
---

# BART-base Receipt Parser

## Model Description

This model is a fine-tuned version of [facebook/bart-base](https://huggingface.co/facebook/bart-base) for receipt parsing. It takes the OCR text of a receipt as input and extracts three fields:

- **Date**: Transaction date from the receipt
- **Company Name**: Name of the merchant/store
- **Total Amount**: Final amount paid

## Dataset

The model was trained on the [Receipt Dataset SSD300 V2](https://www.kaggle.com/datasets/dhiaznaidi/receiptdatasetssd300v2) from Kaggle, which contains receipt images with corresponding labels.

## Data Processing Pipeline

1. **OCR Processing**: All receipt images in the dataset were processed with [EasyOCR](https://github.com/JaidedAI/EasyOCR) to extract raw text
2. **Input-Output Mapping**: The extracted OCR text serves as the input, and the labels from the Kaggle dataset serve as the target output
3. **Fine-tuning**: Supervised fine-tuning was performed on the facebook/bart-base model

## Usage

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Load model and tokenizer
model = AutoModelForSeq2SeqLM.from_pretrained("farhamu/bart-base-receipt-parser-v1")
tokenizer = AutoTokenizer.from_pretrained("farhamu/bart-base-receipt-parser-v1")

# Example usage: raw receipt text, as an OCR tool would produce it
receipt_text = """
SUPERMARKET ABC
123 Main Street
City, State 12345
Date: 2024-01-15
Item 1: $5.99
Item 2: $3.50
Tax: $0.76
Total: $10.25
Thank you for shopping!
"""

# Tokenize input (text beyond 512 tokens is truncated)
inputs = tokenizer(receipt_text, return_tensors="pt", max_length=512, truncation=True, padding=True)

# Generate output with beam search
outputs = model.generate(
    **inputs,
    max_length=150,
    num_beams=4,
    early_stopping=True
)

# Decode result
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

## Expected Output Format

The model outputs structured information in the following format:

```
Date: [extracted_date]
Company Name: [extracted_company_name]
Total Amount: [extracted_total_amount]
```

## Training Details

- **Base Model**: facebook/bart-base
- **Task**: Text-to-Text Generation (Receipt Information Extraction)
- **Training Data**: OCR-processed receipt text with labeled ground truth
- **Data Source**: [Receipt Dataset SSD300 V2](https://www.kaggle.com/datasets/dhiaznaidi/receiptdatasetssd300v2)
- **OCR Tool**: EasyOCR

## Limitations

- Extraction quality depends on OCR quality: the model only sees the OCR text, so misread or missing characters carry through to the output
- Trained specifically on the format and style of receipts in the training dataset
- May require additional fine-tuning for receipts with significantly different formats or languages

## Use Cases

- Automated receipt processing for expense management
- Financial document digitization
- Retail analytics and data extraction
- Accounting automation

## Citation

If you use this model, please cite the original dataset:

```bibtex
@dataset{dhiaznaidi2024receipt,
  title={Receipt Dataset SSD300 V2},
  author={Dhiaz Naidi},
  year={2024},
  url={https://www.kaggle.com/datasets/dhiaznaidi/receiptdatasetssd300v2}
}
```