| # Invoice Data Extractor Agent | |
| # This notebook demonstrates how to build a free, open-source LLM agent | |
| # for extracting structured data from invoice text using Google Colab (CPU-only). | |
| # --- Markdown Cells --- | |
| # # Invoice Data Extractor LLM Agent | |
| # | |
| # ## Function and Business Value | |
| # | |
| # The **Invoice Data Extractor** agent is designed to automate the extraction of key financial information from invoice text. Manually processing invoices is a time-consuming and error-prone task for businesses, especially those dealing with a high volume of transactions. This agent addresses this pain point by enabling users to simply paste the text content of an invoice and receive structured data in return. | |
| # | |
| # **Key extracted fields include:** | |
| # - Vendor Name | |
| # - Total Amount | |
| # - Invoice Date | |
| # - Item Descriptions | |
| # - Quantities | |
| # - Unit Prices | |
| # - Line Item Totals | |
| # | |
| # This automation can significantly reduce manual data entry, minimize errors, and accelerate financial reconciliation processes for small businesses, accountants, and bookkeepers. | |
| # | |
| # ## Target User | |
| # | |
| # Small Business Owners, Accountants, Bookkeepers, Administrative Assistants. | |
| # | |
| # ## Usage Instructions | |
| # | |
| # 1. **Prepare Invoice Text:** Obtain the text content of your invoice. This can be done by: | |
| # * Copy-pasting text directly from a digital PDF invoice. | |
| # * Using a simple OCR tool (if available, external to this Colab) to convert an image-based invoice into text. | |
| # 2. **Paste Text:** In the Gradio interface, paste the copied invoice text into the designated input box. | |
| # 3. **Run Agent:** Click the "Extract Data" button. | |
| # 4. **Review Output:** The agent will process the text and display the extracted data in a structured JSON format, making it easy to parse and use. | |
| # | |
| # ## How it Works (Technical Overview) | |
| # | |
| # This agent leverages a small, CPU-friendly open-source Large Language Model from Hugging Face. The model is prompted with the invoice text and instructed to extract specific entities into a structured JSON output. This approach is powerful because LLMs can understand the context and nuances of human language, making them robust to variations in invoice formats. | |
| # | |
| # We will use the `transformers` library for model inference and `Gradio` for creating a user-friendly web interface that can be easily deployed to Hugging Face Spaces. | |
| # | |
| # ## Limitations & Considerations | |
| # | |
| # * **CPU Performance:** While optimized for CPU, processing very long or complex invoices might take a few seconds. | |
| # * **Accuracy:** The accuracy depends on the quality of the input text and the LLM's capabilities. Highly unstructured or very poorly scanned invoices may yield imperfect results. | |
| # * **Hallucination:** While efforts are made to minimize hallucination through prompt engineering, LLMs can sometimes generate plausible but incorrect data. Always verify critical extracted information. | |
| # * **No OCR Built-in:** This agent assumes text input. For image-based invoices, an external OCR step is required. | |
| # | |
| # --- Code Cells --- | |
| # 1. Setup and Environment Configuration | |
| # Install necessary libraries. | |
| #!pip install transformers torch gradio -q | |
| # Ensure torch is configured for CPU if it defaults otherwise (usually not an issue on Colab CPU) | |
| import torch | |
| if torch.cuda.is_available(): | |
| print("CUDA is available, but we are forcing CPU usage for this project.") | |
| # This line is primarily for clarity; torch will typically use CPU if no GPU is assigned in Colab. | |
| # We won't explicitly set device = 'cpu' in model loading unless necessary, as transformers handles it. | |
| import gradio as gr | |
| from transformers import AutoTokenizer, AutoModelForCausalLM | |
| import json | |
| import re # For basic text cleaning/pre-processing | |
| # 2. Recommended Open-Source Model (CPU-friendly, <2B preferred) | |
| # For invoice data extraction, a model good at instruction following and structured output is crucial. | |
| # Phi-2 or TinyLlama are good candidates. Mistral-7B-v0.1 quantized is also an option if it fits memory. | |
| # Let's try TinyLlama-1.1B for initial prototyping due to its size and CPU compatibility. | |
| # For better performance on structured extraction, a finetuned version or a slightly larger model | |
| # like Phi-2 could be used. Let's aim for Phi-2 for slightly better robustness. | |
| # Model ID on Hugging Face | |
| # You might need to experiment with different versions or quantized versions for optimal performance. | |
| # For CPU, a smaller, instruction-tuned model is best. | |
| # "microsoft/phi-2" is 2.7B parameters, might be slow/memory intensive without quantization. | |
| # Let's try a pre-quantized small model or focus on very small models. | |
| # For 0 budget CPU, "TinyLlama/TinyLlama-1.1B-Chat-v1.0" is a great starting point. | |
| # If that struggles with structured output, then "microsoft/phi-2" with torch.float32 may be too much. | |
| # Let's target TinyLlama and add a strong prompt. | |
| MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0" | |
| # For CPU, it's essential to load with default float32. low_cpu_mem_usage=True helps. | |
| tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME) | |
| model = AutoModelForCausalLM.from_pretrained( | |
| MODEL_NAME, | |
| torch_dtype=torch.float32, # Ensure CPU compatibility | |
| low_cpu_mem_usage=True # Helps with memory during loading | |
| ) | |
| print(f"Model {MODEL_NAME} loaded successfully on CPU.") | |
| # 3. Lightweight Backend Logic and Inference | |
| def extract_invoice_data(invoice_text): | |
| """ | |
| Extracts key information from invoice text using the LLM. | |
| Args: | |
| invoice_text (str): The raw text content of the invoice. | |
| Returns: | |
| str: A JSON string containing the extracted invoice data. | |
| """ | |
| if not invoice_text or len(invoice_text.strip()) < 50: # Basic input validation | |
| return json.dumps({"error": "Please provide sufficient invoice text for extraction."}, indent=2) | |
| # Clean up common OCR artifacts or formatting issues | |
| invoice_text = invoice_text.strip() | |
| invoice_text = re.sub(r'\s+', ' ', invoice_text) # Replace multiple spaces with single space | |
| # Define the extraction prompt. This is crucial for accurate structured output. | |
| # We instruct the LLM to output JSON and provide a clear schema. | |
| prompt = f""" | |
| You are an expert financial data extractor. Your task is to accurately extract key information from the provided invoice text and format it as a JSON object. | |
| Ensure all fields are correctly identified and populated. If a field is not found, use `null`. | |
| Extract the following: | |
| - `vendor_name`: The name of the company that issued the invoice. | |
| - `invoice_date`: The date the invoice was issued (e.g., YYYY-MM-DD). | |
| - `total_amount`: The total amount due on the invoice. | |
| - `currency`: The currency of the total amount (e.g., USD, EUR). | |
| - `items`: A list of dictionaries, where each dictionary represents a line item with: | |
| - `description`: A brief description of the item. | |
| - `quantity`: The quantity of the item. | |
| - `unit_price`: The price per unit of the item. | |
| - `line_total`: The total for that line item. | |
| Invoice Text: | |
| --- | |
| {invoice_text} | |
| --- | |
| Please provide the extracted data in the following JSON format: | |
| ```json | |
| {{ | |
| "vendor_name": "string", | |
| "invoice_date": "YYYY-MM-DD", | |
| "total_amount": float, | |
| "currency": "string", | |
| "items": [ | |
| {{ | |
| "description": "string", | |
| "quantity": float, | |
| "unit_price": float, | |
| "line_total": float | |
| }} | |
| ] | |
| }} | |
| ``` | |
| """ | |
| # For TinyLlama, we follow the chat template if available, otherwise raw prompt. | |
| # TinyLlama-1.1B-Chat-v1.0 uses a simple chat format. | |
| messages = [ | |
| {"role": "user", "content": prompt} | |
| ] | |
| # Apply chat template for TinyLlama | |
| input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True) | |
| # Tokenize input | |
| inputs = tokenizer(input_text, return_tensors="pt", max_length=2048, truncation=True) | |
| # Move inputs to CPU (they are already on CPU by default if no CUDA) | |
| # inputs = {k: v.to('cpu') for k, v in inputs.items()} # Explicitly move to CPU if needed | |
| # Generate response | |
| # Use torch.no_grad() for inference to save memory and speed up. | |
| with torch.no_grad(): | |
| output = model.generate( | |
| **inputs, | |
| max_new_tokens=512, # Limit output length for structured data | |
| num_return_sequences=1, | |
| do_sample=False, # For more deterministic output | |
| temperature=0.1, # Low temperature for factual extraction | |
| pad_token_id=tokenizer.eos_token_id # Important for generation | |
| ) | |
| # Decode and clean the output | |
| generated_text = tokenizer.decode(output[0], skip_special_tokens=True) | |
| # Post-processing to extract JSON block | |
| json_match = re.search(r"```json\s*(\{.*\})\s*```", generated_text, re.DOTALL) | |
| if json_match: | |
| json_str = json_match.group(1) | |
| try: | |
| # Validate JSON | |
| extracted_data = json.loads(json_str) | |
| return json.dumps(extracted_data, indent=2) | |
| except json.JSONDecodeError: | |
| return json.dumps({"error": "Failed to parse JSON from LLM output.", "raw_output": generated_text}, indent=2) | |
| else: | |
| return json.dumps({"error": "No JSON block found in LLM output.", "raw_output": generated_text}, indent=2) | |
| # 4. Output Examples | |
| # Example Invoice Text | |
| example_invoice_text = """ | |
| INVOICE #INV-2023-001 | |
| Date: 2023-10-26 | |
| Vendor: Tech Solutions Inc. | |
| 123 Tech Lane, Innovation City, CA 90210 | |
| Bill To: | |
| Acme Corp. | |
| 456 Business Blvd, Enterprise Town, NY 10001 | |
| Description Qty Unit Price Total | |
| --- | |
| Software License (Annual) 1 999.00 999.00 | |
| Consulting Services (HR) 10 150.00 1500.00 | |
| Premium Support 1 200.00 200.00 | |
| --- | |
| Subtotal: $2699.00 | |
| Tax (8%): $215.92 | |
| Total Amount Due: $2914.92 USD | |
| Payment Due: 2023-11-26 | |
| """ | |
| # print("--- Example Output ---") | |
| # print(extract_invoice_data(example_invoice_text)) | |
| # print("--- End of Example ---") | |
| # 5. Hosting Instructions (Gradio) | |
| # Create a Gradio interface | |
| gr_interface = gr.Interface( | |
| fn=extract_invoice_data, | |
| inputs=gr.Textbox(lines=15, label="Paste Invoice Text Here", placeholder="Paste the full text of your invoice..."), | |
| outputs=gr.JSON(label="Extracted Invoice Data"), | |
| title="Invoice Data Extractor LLM Agent", | |
| description="Paste invoice text to automatically extract vendor, amount, date, and item details." | |
| ) | |
| gr_interface.launch(server_name="0.0.0.0", server_port=7860) | |
| # Launch the Gradio interface | |
| # For Hugging Face Spaces deployment, this part is crucial. | |
| # The app.py file would simply contain `gr_interface.launch(server_name="0.0.0.0", server_port=7860)` | |
| # when deployed to Hugging Face Spaces. | |
| # For local Colab testing: | |
| # gr_interface.launch(debug=True) | |
| # To deploy on Hugging Face Spaces: | |
| # 1. Save the above code (model loading, function, gradio interface) into a single file named `app.py`. | |
| # 2. Create a new Space on Hugging Face. | |
| # 3. Choose "Gradio" as the SDK. | |
| # 4. Upload your `app.py` file. | |
| # 5. Add a `requirements.txt` file with: | |
| # `transformers` | |
| # `torch` | |
| # `gradio` | |
| # (and any other specific versions if needed, e.g., `torch==2.0.1+cpu`) | |
| # The Space will automatically build and launch your Gradio app. |