# Invoice Data Extractor Agent
# This notebook demonstrates how to build a free, open-source LLM agent
# for extracting structured data from invoice text using Google Colab (CPU-only).

# --- Markdown Cells ---

# # Invoice Data Extractor LLM Agent
#
# ## Function and Business Value
#
# The **Invoice Data Extractor** agent is designed to automate the extraction of key financial information from invoice text. Manually processing invoices is a time-consuming and error-prone task for businesses, especially those dealing with a high volume of transactions. This agent addresses this pain point by enabling users to simply paste the text content of an invoice and receive structured data in return.
#
# **Key extracted fields include:**
# - Vendor Name
# - Total Amount
# - Invoice Date
# - Item Descriptions
# - Quantities
# - Unit Prices
# - Line Item Totals
#
# This automation can significantly reduce manual data entry, minimize errors, and accelerate financial reconciliation processes for small businesses, accountants, and bookkeepers.
#
# ## Target User
#
# Small Business Owners, Accountants, Bookkeepers, Administrative Assistants.
#
# ## Usage Instructions
#
# 1.  **Prepare Invoice Text:** Obtain the text content of your invoice. This can be done by:
#     * Copy-pasting text directly from a digital PDF invoice.
#     * Using a simple OCR tool (if available, external to this Colab) to convert an image-based invoice into text.
# 2.  **Paste Text:** In the Gradio interface, paste the copied invoice text into the designated input box.
# 3.  **Run Agent:** Click the "Extract Data" button.
# 4.  **Review Output:** The agent will process the text and display the extracted data in a structured JSON format, making it easy to parse and use.
#
# ## How it Works (Technical Overview)
#
# This agent leverages a small, CPU-friendly open-source Large Language Model from Hugging Face. The model is prompted with the invoice text and instructed to extract specific entities into a structured JSON output. This approach is powerful because LLMs can understand the context and nuances of human language, making them robust to variations in invoice formats.
#
# We will use the `transformers` library for model inference and `Gradio` for creating a user-friendly web interface that can be easily deployed to Hugging Face Spaces.
#
# ## Limitations & Considerations
#
# * **CPU Performance:** The model is small enough to run on CPU, but generation is slow: long or complex invoices can take tens of seconds or more, depending on the hardware.
# * **Accuracy:** The accuracy depends on the quality of the input text and the LLM's capabilities. Highly unstructured or very poorly scanned invoices may yield imperfect results.
# * **Hallucination:** While efforts are made to minimize hallucination through prompt engineering, LLMs can sometimes generate plausible but incorrect data. Always verify critical extracted information.
# * **No OCR Built-in:** This agent assumes text input. For image-based invoices, an external OCR step is required.
#

# --- Code Cells ---
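
# Because LLMs can produce plausible but incorrect values (see "Hallucination" under
# Limitations above), a cheap grounding check can flag extracted strings that never
# occur in the invoice text. This is a minimal sketch under stated assumptions:
# `find_ungrounded_fields` is a hypothetical helper, not part of the original agent,
# and matching is a case-insensitive substring test, so it will miss reformatted
# values (for example a date rewritten into YYYY-MM-DD).
def find_ungrounded_fields(data, invoice_text,
                           fields=("vendor_name", "invoice_date", "currency")):
    """Return the names of string fields whose value is absent from the invoice text."""
    # Normalize case and whitespace on both sides before comparing
    haystack = " ".join(invoice_text.lower().split())
    ungrounded = []
    for field in fields:
        value = data.get(field)
        if isinstance(value, str):
            needle = " ".join(value.lower().split())
            if needle not in haystack:
                ungrounded.append(field)
    return ungrounded

# Possible use: run it on the parsed dictionary and the original invoice text, and
# surface any returned field names to the user for manual review.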

# 1. Setup and Environment Configuration
# Install necessary libraries (uncomment when running in Colab).
#!pip install transformers torch gradio -q
import torch
if torch.cuda.is_available():
    # from_pretrained() places the weights on the CPU unless a device is requested,
    # and the tokenizer returns CPU tensors, so no explicit device move is needed.
    print("CUDA is available, but this project runs on the CPU.")

import gradio as gr
from transformers import AutoTokenizer, AutoModelForCausalLM
import json
import re # For basic text cleaning/pre-processing

# 2. Recommended Open-Source Model (CPU-friendly, <2B preferred)
# Invoice extraction needs a model that follows instructions and emits structured output.
# Candidates considered:
#   - "TinyLlama/TinyLlama-1.1B-Chat-v1.0": small enough for a zero-budget CPU setup; used here.
#   - "microsoft/phi-2": 2.7B parameters, likely more robust on structured extraction, but
#     may be slow and memory-intensive on CPU in float32 without quantization.
#   - Mistral-7B-v0.1 (quantized): an option only if it fits in memory.
# We start with TinyLlama and compensate with a strong prompt. If it struggles with
# structured output, try a fine-tuned, quantized, or slightly larger model.

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# For CPU, it's essential to load with default float32. low_cpu_mem_usage=True helps.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float32, # Ensure CPU compatibility
    low_cpu_mem_usage=True # Helps with memory during loading
)

print(f"Model {MODEL_NAME} loaded successfully on CPU.")

# 3. Lightweight Backend Logic and Inference
def extract_invoice_data(invoice_text):
    """
    Extracts key information from invoice text using the LLM.
    Args:
        invoice_text (str): The raw text content of the invoice.
    Returns:
        str: A JSON string containing the extracted invoice data.
    """
    if not invoice_text or len(invoice_text.strip()) < 50: # Basic input validation
        return json.dumps({"error": "Please provide sufficient invoice text for extraction."}, indent=2)

    # Clean up common OCR artifacts or formatting issues
    invoice_text = invoice_text.strip()
    # Collapse runs of spaces/tabs but keep newlines, which separate the line items
    invoice_text = re.sub(r'[ \t]+', ' ', invoice_text)

    # Define the extraction prompt. This is crucial for accurate structured output.
    # We instruct the LLM to output JSON and provide a clear schema.
    prompt = f"""
    You are an expert financial data extractor. Your task is to accurately extract key information from the provided invoice text and format it as a JSON object.
    Ensure all fields are correctly identified and populated. If a field is not found, use `null`.
    Extract the following:
    - `vendor_name`: The name of the company that issued the invoice.
    - `invoice_date`: The date the invoice was issued (e.g., YYYY-MM-DD).
    - `total_amount`: The total amount due on the invoice.
    - `currency`: The currency of the total amount (e.g., USD, EUR).
    - `items`: A list of dictionaries, where each dictionary represents a line item with:
        - `description`: A brief description of the item.
        - `quantity`: The quantity of the item.
        - `unit_price`: The price per unit of the item.
        - `line_total`: The total for that line item.

    Invoice Text:
    ---
    {invoice_text}
    ---

    Please provide the extracted data in the following JSON format:
    ```json
    {{
        "vendor_name": "string",
        "invoice_date": "YYYY-MM-DD",
        "total_amount": float,
        "currency": "string",
        "items": [
            {{
                "description": "string",
                "quantity": float,
                "unit_price": float,
                "line_total": float
            }}
        ]
    }}
    ```
    """

    # For TinyLlama, we follow the chat template if available, otherwise raw prompt.
    # TinyLlama-1.1B-Chat-v1.0 uses a simple chat format.
    messages = [
        {"role": "user", "content": prompt}
    ]
    # Apply chat template for TinyLlama
    input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)


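    # Tokenize input. TinyLlama's context window is 2048 tokens, so cap the prompt at
    # 1536 tokens to leave room for the 512 new tokens requested below.
    inputs = tokenizer(input_text, return_tensors="pt", max_length=1536, truncation=True)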
    # Move inputs to CPU (they are already on CPU by default if no CUDA)
    # inputs = {k: v.to('cpu') for k, v in inputs.items()} # Explicitly move to CPU if needed

    # Generate response
    # Use torch.no_grad() for inference to save memory and speed up.
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=512, # Limit output length for structured data
            num_return_sequences=1,
            do_sample=False, # Greedy decoding for deterministic output; temperature only applies when sampling
            pad_token_id=tokenizer.eos_token_id # Important for generation
        )

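    # Decode only the newly generated tokens. output[0] also contains the prompt,
    # whose example JSON schema block would otherwise be picked up by the regex below.
    prompt_length = inputs["input_ids"].shape[1]
    generated_text = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)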

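    # Post-processing to extract the JSON block. Match non-greedily and take the last
    # fenced block, so that an earlier block in the text cannot swallow the answer.
    json_blocks = re.findall(r"```json\s*(\{.*?\})\s*```", generated_text, re.DOTALL)
    if json_blocks:
        json_str = json_blocks[-1]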
        try:
            # Validate JSON
            extracted_data = json.loads(json_str)
            return json.dumps(extracted_data, indent=2)
        except json.JSONDecodeError:
            return json.dumps({"error": "Failed to parse JSON from LLM output.", "raw_output": generated_text}, indent=2)
    else:
        return json.dumps({"error": "No JSON block found in LLM output.", "raw_output": generated_text}, indent=2)
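
# A minimal sketch (an illustrative addition, not part of the original pipeline) of a
# fallback for the "No JSON block found" case above: small models often emit the JSON
# object without wrapping it in a json code fence. The helper name
# `find_first_json_object` is hypothetical; it relies only on the standard library's
# json.JSONDecoder.raw_decode, which parses one JSON value starting at a given index.
import json

def find_first_json_object(text):
    """Return the first JSON object embedded in `text` as a dict, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # This brace did not start a valid object; try the next one
        start = text.find("{", start + 1)
    return None

# Possible use: call find_first_json_object(generated_text) inside extract_invoice_data
# before falling back to the "No JSON block found" error.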


# 4. Output Examples
# Example Invoice Text
example_invoice_text = """
INVOICE #INV-2023-001
Date: 2023-10-26
Vendor: Tech Solutions Inc.
123 Tech Lane, Innovation City, CA 90210

Bill To:
Acme Corp.
456 Business Blvd, Enterprise Town, NY 10001

Description                 Qty   Unit Price   Total
---
Software License (Annual)   1     999.00       999.00
Consulting Services (HR)    10    150.00       1500.00
Premium Support             1     200.00       200.00
---
Subtotal: $2699.00
Tax (8%): $215.92
Total Amount Due: $2914.92 USD
Payment Due: 2023-11-26
"""

# print("--- Example Output ---")
# print(extract_invoice_data(example_invoice_text))
# print("--- End of Example ---")
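
# A small, hedged sketch of an arithmetic sanity check on extracted data (an illustrative
# addition; `check_invoice_totals`, its tolerance, and `expected_example` are assumptions,
# not part of the original agent). It verifies quantity * unit_price == line_total for
# every item and returns the subtotal, which for the example above is 2699.00 before tax.
def check_invoice_totals(data, tolerance=0.01):
    """Return (subtotal, problems) for a dict shaped like the JSON schema in the prompt."""
    problems = []
    subtotal = 0.0
    for index, item in enumerate(data.get("items") or []):
        quantity = item.get("quantity")
        unit_price = item.get("unit_price")
        line_total = item.get("line_total")
        if None in (quantity, unit_price, line_total):
            problems.append(f"item {index}: missing numeric field")
            continue
        if abs(quantity * unit_price - line_total) > tolerance:
            problems.append(f"item {index}: {quantity} x {unit_price} != {line_total}")
        subtotal += line_total
    return round(subtotal, 2), problems

# Hand-written expected extraction for the example invoice (not model output).
expected_example = {
    "vendor_name": "Tech Solutions Inc.",
    "invoice_date": "2023-10-26",
    "total_amount": 2914.92,
    "currency": "USD",
    "items": [
        {"description": "Software License (Annual)", "quantity": 1, "unit_price": 999.00, "line_total": 999.00},
        {"description": "Consulting Services (HR)", "quantity": 10, "unit_price": 150.00, "line_total": 1500.00},
        {"description": "Premium Support", "quantity": 1, "unit_price": 200.00, "line_total": 200.00},
    ],
}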


# 5. Hosting Instructions (Gradio)
# Create a Gradio interface
gr_interface = gr.Interface(
    fn=extract_invoice_data,
    inputs=gr.Textbox(lines=15, label="Paste Invoice Text Here", placeholder="Paste the full text of your invoice..."),
    outputs=gr.JSON(label="Extracted Invoice Data"),
    title="Invoice Data Extractor LLM Agent",
    description="Paste invoice text to automatically extract vendor, amount, date, and item details."
)

# Launch the Gradio interface.
# On Hugging Face Spaces, `app.py` ends with this call; 0.0.0.0:7860 is the host and
# port the Space serves on.
# For local Colab testing, use instead:
# gr_interface.launch(debug=True)
gr_interface.launch(server_name="0.0.0.0", server_port=7860)

# To deploy on Hugging Face Spaces:
# 1. Save the above code (model loading, function, gradio interface) into a single file named `app.py`.
# 2. Create a new Space on Hugging Face.
# 3. Choose "Gradio" as the SDK.
# 4. Upload your `app.py` file.
# 5. Add a `requirements.txt` file with:
#    `transformers`
#    `torch`
#    `gradio`
#    (pin versions if needed; a CPU-only build such as `torch==2.0.1+cpu` is not on PyPI
#    and also needs the line `--extra-index-url https://download.pytorch.org/whl/cpu`)
# The Space will then build and launch your Gradio app automatically.