# Invoice Data Extractor Agent
# This notebook demonstrates how to build a free, open-source LLM agent
# for extracting structured data from invoice text using Google Colab (CPU-only).

# --- Markdown Cells ---

# # Invoice Data Extractor LLM Agent
#
# ## Function and Business Value
#
# The **Invoice Data Extractor** agent automates the extraction of key financial information from invoice text.
# Manually processing invoices is time-consuming and error-prone, especially for businesses that handle a high
# volume of transactions. This agent addresses that pain point: users paste the text content of an invoice and
# receive structured data in return.
#
# **Key extracted fields include:**
# - Vendor Name
# - Total Amount
# - Currency
# - Invoice Date
# - Item Descriptions
# - Quantities
# - Unit Prices
# - Line Item Totals
#
# This automation can significantly reduce manual data entry, minimize errors, and accelerate financial
# reconciliation for small businesses, accountants, and bookkeepers.
#
# ## Target User
#
# Small Business Owners, Accountants, Bookkeepers, Administrative Assistants.
#
# ## Usage Instructions
#
# 1. **Prepare Invoice Text:** Obtain the text content of your invoice. This can be done by:
#    * Copy-pasting text directly from a digital PDF invoice.
#    * Using a simple OCR tool (if available, external to this Colab) to convert an image-based invoice into text.
# 2. **Paste Text:** In the Gradio interface, paste the copied invoice text into the designated input box.
# 3. **Run Agent:** Click the "Submit" button (the default button label of a `gr.Interface`).
# 4. **Review Output:** The agent processes the text and displays the extracted data as structured JSON,
#    which is easy to parse and reuse.
#
# ## How it Works (Technical Overview)
#
# This agent leverages a small, CPU-friendly open-source Large Language Model from Hugging Face.
# The model is prompted with the invoice text and instructed to extract specific entities into a structured
# JSON output. This approach is powerful because LLMs can use the context and nuances of human language,
# which makes them more robust to variations in invoice formats than fixed rules.
#
# We will use the `transformers` library for model inference and `Gradio` for creating a user-friendly web
# interface that can be easily deployed to Hugging Face Spaces.
#
# ## Limitations & Considerations
#
# * **CPU Performance:** The model is small enough to run on a CPU, but generation is still slow. Processing
#   time grows with the length of the invoice text and with the number of generated tokens (up to 512 here).
# * **Accuracy:** The accuracy depends on the quality of the input text and the LLM's capabilities. Highly
#   unstructured or very poorly scanned invoices may yield imperfect results.
# * **Hallucination:** Prompt engineering reduces hallucination but does not eliminate it. LLMs can generate
#   plausible but incorrect data. Always verify critical extracted information.
# * **No OCR Built-in:** This agent assumes text input. For image-based invoices, an external OCR step is required.

# --- Code Cells ---

# 1. Setup and Environment Configuration

# Install necessary libraries.
# !pip install transformers torch gradio -q

import torch

# The model is loaded on the CPU below. from_pretrained places weights on the CPU by default,
# so no explicit device argument is needed even when a GPU happens to be attached.
if torch.cuda.is_available():
    print("CUDA is available, but this project runs the model on the CPU.")

import gradio as gr
from transformers import AutoTokenizer, AutoModelForCausalLM
import json
import re  # For basic text cleaning/pre-processing

# 2. Recommended Open-Source Model (CPU-friendly, <2B preferred)
# For invoice data extraction, a model good at instruction following and structured output is crucial.
# Phi-2 and TinyLlama are good candidates. A quantized Mistral-7B-v0.1 is also an option if it fits in memory.
# "microsoft/phi-2" has 2.7B parameters and may be slow and memory-intensive in float32 without quantization,
# so this notebook starts with "TinyLlama/TinyLlama-1.1B-Chat-v1.0" and relies on a strong prompt.
# If TinyLlama struggles with structured output, experiment with a fine-tuned, quantized, or slightly larger
# instruction-tuned model such as Phi-2.

# Model ID on Hugging Face
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# On CPU, load the weights in float32. low_cpu_mem_usage=True reduces peak memory while loading.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float32,  # Ensure CPU compatibility
    low_cpu_mem_usage=True,  # Helps with memory during loading
)

print(f"Model {MODEL_NAME} loaded successfully on CPU.")

# 3. Lightweight Backend Logic and Inference
def extract_invoice_data(invoice_text):
    """
    Extracts key information from invoice text using the LLM.

    Args:
        invoice_text (str): The raw text content of the invoice.

    Returns:
        str: A JSON string containing the extracted invoice data.
    """
    # Basic input validation
    if not invoice_text or len(invoice_text.strip()) < 50:
        return json.dumps({"error": "Please provide sufficient invoice text for extraction."}, indent=2)

    # Clean up common OCR artifacts or formatting issues.
    # Only runs of spaces and tabs are collapsed, so the line breaks that separate
    # invoice rows are preserved for the model.
    invoice_text = invoice_text.strip()
    invoice_text = re.sub(r'[ \t]+', ' ', invoice_text)

    # Define the extraction prompt. This is crucial for accurate structured output.
    # We instruct the LLM to output JSON and provide a clear schema.
    # Doubled braces ({{ and }}) are literal braces inside the f-string.
    prompt = f"""
You are an expert financial data extractor. Your task is to accurately extract key information from the provided invoice text and format it as a JSON object.
Ensure all fields are correctly identified and populated. If a field is not found, use `null`.

Extract the following:
- `vendor_name`: The name of the company that issued the invoice.
- `invoice_date`: The date the invoice was issued (e.g., YYYY-MM-DD).
- `total_amount`: The total amount due on the invoice.
- `currency`: The currency of the total amount (e.g., USD, EUR).
- `items`: A list of dictionaries, where each dictionary represents a line item with:
    - `description`: A brief description of the item.
    - `quantity`: The quantity of the item.
    - `unit_price`: The price per unit of the item.
    - `line_total`: The total for that line item.

Invoice Text:
---
{invoice_text}
---

Please provide the extracted data in the following JSON format:
```json
{{
  "vendor_name": "string",
  "invoice_date": "YYYY-MM-DD",
  "total_amount": float,
  "currency": "string",
  "items": [
    {{
      "description": "string",
      "quantity": float,
      "unit_price": float,
      "line_total": float
    }}
  ]
}}
```
"""

    # TinyLlama-1.1B-Chat-v1.0 ships a chat template, so the prompt is wrapped as a user message
    # rather than passed as raw text.
    messages = [
        {"role": "user", "content": prompt}
    ]

    # Apply chat template for TinyLlama
    input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    # Tokenize input
    inputs = tokenizer(input_text, return_tensors="pt", max_length=2048, truncation=True)

    # Inputs are already on the CPU by default when no CUDA device is used.
    # inputs = {k: v.to('cpu') for k, v in inputs.items()}  # Explicitly move to CPU if needed

    # Generate response.
    # torch.no_grad() disables gradient tracking, which saves memory and time during inference.
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=512,  # Limit output length for structured data
            num_return_sequences=1,
            do_sample=False,  # Greedy decoding for deterministic output; temperature has no effect here, so it is omitted
            pad_token_id=tokenizer.eos_token_id,  # Important for generation
        )

    # Decode only the newly generated tokens. The returned sequence starts with the prompt,
    # and the prompt itself contains a ```json schema block that would otherwise be matched below.
    prompt_length = inputs["input_ids"].shape[1]
    generated_text = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)

    # Post-processing to extract the first fenced JSON block
    json_match = re.search(r"```json\s*(\{.*?\})\s*```", generated_text, re.DOTALL)
    if json_match:
        json_str = json_match.group(1)
        try:
            # Validate JSON
            extracted_data = json.loads(json_str)
            return json.dumps(extracted_data, indent=2)
        except json.JSONDecodeError:
            return json.dumps({"error": "Failed to parse JSON from LLM output.", "raw_output": generated_text}, indent=2)
    else:
        return json.dumps({"error": "No JSON block found in LLM output.", "raw_output": generated_text}, indent=2)

# 4. Output Examples

# Example Invoice Text
example_invoice_text = """
INVOICE #INV-2023-001
Date: 2023-10-26

Vendor: Tech Solutions Inc.
123 Tech Lane, Innovation City, CA 90210

Bill To: Acme Corp.
456 Business Blvd, Enterprise Town, NY 10001

Description                  Qty   Unit Price   Total
---
Software License (Annual)    1     999.00       999.00
Consulting Services (HR)     10    150.00       1500.00
Premium Support              1     200.00       200.00
---
Subtotal: $2699.00
Tax (8%): $215.92
Total Amount Due: $2914.92 USD

Payment Due: 2023-11-26
"""

# print("--- Example Output ---")
# print(extract_invoice_data(example_invoice_text))
# print("--- End of Example ---")

# 5. Hosting Instructions (Gradio)

# Create a Gradio interface
gr_interface = gr.Interface(
    fn=extract_invoice_data,
    inputs=gr.Textbox(
        lines=15,
        label="Paste Invoice Text Here",
        placeholder="Paste the full text of your invoice...",
    ),
    outputs=gr.JSON(label="Extracted Invoice Data"),
    title="Invoice Data Extractor LLM Agent",
    description="Paste invoice text to automatically extract vendor, amount, date, and item details.",
)

# Launch the Gradio interface.
# For Hugging Face Spaces deployment, this call is crucial: `app.py` must end with
# `gr_interface.launch(server_name="0.0.0.0", server_port=7860)`.
gr_interface.launch(server_name="0.0.0.0", server_port=7860)

# For local Colab testing, use this instead:
# gr_interface.launch(debug=True)

# To deploy on Hugging Face Spaces:
# 1. Save the above code (model loading, function, Gradio interface) into a single file named `app.py`.
# 2. Create a new Space on Hugging Face.
# 3. Choose "Gradio" as the SDK.
# 4. Upload your `app.py` file.
# 5. Add a `requirements.txt` file with:
#    `transformers`
#    `torch`
#    `gradio`
#    (and any other specific versions if needed, e.g., `torch==2.0.1+cpu`)
# The Space will automatically build and launch your Gradio app.
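
# Optional helpers (a minimal sketch, not part of the original pipeline).
# Small models do not always wrap their answer in a ```json fence, and the limitations above
# recommend verifying extracted values. The two functions below illustrate one possible approach
# under those assumptions. Their names (find_first_json_object, check_line_totals) are hypothetical
# and are not used elsewhere in this notebook. To use them inside extract_invoice_data, they would
# need to be defined before that function is called and before the blocking launch() call.
import json


def find_first_json_object(text):
    """Return the first parseable JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores trailing text.
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Try the next opening brace.
        start = text.find("{", start + 1)
    return None


def check_line_totals(data, tolerance=0.01):
    """Return descriptions of line items where quantity * unit_price differs from line_total."""
    problems = []
    for index, item in enumerate(data.get("items") or []):
        values = (item.get("quantity"), item.get("unit_price"), item.get("line_total"))
        # Skip items with missing or non-numeric fields; they cannot be checked arithmetically.
        if not all(isinstance(v, (int, float)) for v in values):
            continue
        quantity, unit_price, line_total = values
        if abs(quantity * unit_price - line_total) > tolerance:
            problems.append(f"item {index}: {quantity} x {unit_price} != {line_total}")
    return problems

# Possible usage, assuming `generated_text` holds the decoded model output:
# data = find_first_json_object(generated_text)
# if data is not None:
#     print(check_line_totals(data))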