# Invoice Data Extractor Agent
# This notebook demonstrates how to build a free, open-source LLM agent
# for extracting structured data from invoice text using Google Colab (CPU-only).
# --- Markdown Cells ---
# # Invoice Data Extractor LLM Agent
#
# ## Function and Business Value
#
# The **Invoice Data Extractor** agent is designed to automate the extraction of key financial information from invoice text. Manually processing invoices is a time-consuming and error-prone task for businesses, especially those dealing with a high volume of transactions. This agent addresses this pain point by enabling users to simply paste the text content of an invoice and receive structured data in return.
#
# **Key extracted fields include:**
# - Vendor Name
# - Total Amount
# - Invoice Date
# - Item Descriptions
# - Quantities
# - Unit Prices
# - Line Item Totals
#
# This automation can significantly reduce manual data entry, minimize errors, and accelerate financial reconciliation processes for small businesses, accountants, and bookkeepers.
#
# ## Target User
#
# Small Business Owners, Accountants, Bookkeepers, Administrative Assistants.
#
# ## Usage Instructions
#
# 1. **Prepare Invoice Text:** Obtain the text content of your invoice. This can be done by:
# * Copy-pasting text directly from a digital PDF invoice.
# * Using a simple OCR tool (if available, external to this Colab) to convert an image-based invoice into text.
# 2. **Paste Text:** In the Gradio interface, paste the copied invoice text into the designated input box.
# 3. **Run Agent:** Click the "Extract Data" button.
# 4. **Review Output:** The agent will process the text and display the extracted data in a structured JSON format, making it easy to parse and use.
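#
# The returned JSON string can be consumed directly with Python's standard
# library; a minimal sketch (the field values below are illustrative, not real
# agent output):

```python
import json

# Illustrative agent output (values made up for demonstration)
agent_output = """
{
  "vendor_name": "Tech Solutions Inc.",
  "invoice_date": "2023-10-26",
  "total_amount": 2914.92,
  "currency": "USD",
  "items": [
    {"description": "Software License (Annual)", "quantity": 1, "unit_price": 999.00, "line_total": 999.00}
  ]
}
"""

data = json.loads(agent_output)
print(data["vendor_name"], data["total_amount"])  # Tech Solutions Inc. 2914.92
```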
#
# ## How it Works (Technical Overview)
#
# This agent leverages a small, CPU-friendly open-source Large Language Model from Hugging Face. The model is prompted with the invoice text and instructed to extract specific entities into a structured JSON output. This approach is powerful because LLMs can understand the context and nuances of human language, making them robust to variations in invoice formats.
#
# We will use the `transformers` library for model inference and `Gradio` for creating a user-friendly web interface that can be easily deployed to Hugging Face Spaces.
#
# ## Limitations & Considerations
#
# * **CPU Performance:** Even with a small model, generating a few hundred tokens on a Colab CPU can take tens of seconds, and long or complex invoices may take a minute or more.
# * **Accuracy:** The accuracy depends on the quality of the input text and the LLM's capabilities. Highly unstructured or very poorly scanned invoices may yield imperfect results.
# * **Hallucination:** While efforts are made to minimize hallucination through prompt engineering, LLMs can sometimes generate plausible but incorrect data. Always verify critical extracted information.
# * **No OCR Built-in:** This agent assumes text input. For image-based invoices, an external OCR step is required.
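#
# Because hallucination is possible, a lightweight arithmetic sanity check on
# the extracted numbers is worthwhile before trusting them. A sketch (the field
# names follow the JSON schema used later in this notebook; `totals_consistent`
# is a hypothetical helper, not part of the agent itself):

```python
def totals_consistent(extracted, tolerance=0.01):
    """Check that quantity * unit_price matches line_total for every item.

    The grand total is deliberately not checked against the item sum,
    since tax or shipping can legitimately make it larger.
    """
    for item in extracted.get("items", []):
        expected = round(item["quantity"] * item["unit_price"], 2)
        if abs(expected - item["line_total"]) > tolerance:
            return False
    return True

sample = {
    "total_amount": 2914.92,
    "items": [
        {"description": "Software License", "quantity": 1, "unit_price": 999.00, "line_total": 999.00},
        {"description": "Consulting", "quantity": 10, "unit_price": 150.00, "line_total": 1500.00},
    ],
}
print(totals_consistent(sample))  # True
```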
#
# --- Code Cells ---
# 1. Setup and Environment Configuration
# Install necessary libraries.
#!pip install transformers torch gradio -q
# Ensure torch is configured for CPU if it defaults otherwise (usually not an issue on Colab CPU)
import torch

# Colab CPU runtimes expose no CUDA device, so transformers will load the model
# on CPU automatically; this check just makes the intent explicit.
if torch.cuda.is_available():
    print("CUDA is available, but this project targets CPU-only inference.")
import gradio as gr
from transformers import AutoTokenizer, AutoModelForCausalLM
import json
import re # For basic text cleaning/pre-processing
# 2. Recommended Open-Source Model (CPU-friendly, <2B preferred)
# For invoice data extraction, a model good at instruction following and structured output is crucial.
# Candidates under our constraints: "TinyLlama/TinyLlama-1.1B-Chat-v1.0" (1.1B parameters, fast
# enough on CPU) and "microsoft/phi-2" (2.7B parameters, more robust at structured extraction but
# slow and memory-hungry on CPU without quantization).
# For a zero-budget, CPU-only Colab runtime, TinyLlama with a strong prompt is the practical
# starting point, so that is what we use here. Swap in a larger or quantized model if your
# hardware allows.
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# For CPU, it's essential to load with default float32. low_cpu_mem_usage=True helps.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float32,  # Full precision for CPU compatibility
    low_cpu_mem_usage=True      # Reduces peak memory during loading
)
print(f"Model {MODEL_NAME} loaded successfully on CPU.")
# 3. Lightweight Backend Logic and Inference
def extract_invoice_data(invoice_text):
    """
    Extract key information from invoice text using the LLM.

    Args:
        invoice_text (str): The raw text content of the invoice.

    Returns:
        str: A JSON string containing the extracted invoice data.
    """
    if not invoice_text or len(invoice_text.strip()) < 50:  # Basic input validation
        return json.dumps({"error": "Please provide sufficient invoice text for extraction."}, indent=2)

    # Clean up common OCR artifacts and formatting issues
    invoice_text = invoice_text.strip()
    invoice_text = re.sub(r'\s+', ' ', invoice_text)  # Collapse runs of whitespace

    # The extraction prompt is crucial for accurate structured output: we instruct
    # the LLM to emit JSON and give it an explicit schema.
    prompt = f"""
You are an expert financial data extractor. Your task is to accurately extract key information from the provided invoice text and format it as a JSON object.
Ensure all fields are correctly identified and populated. If a field is not found, use `null`.
Extract the following:
- `vendor_name`: The name of the company that issued the invoice.
- `invoice_date`: The date the invoice was issued (e.g., YYYY-MM-DD).
- `total_amount`: The total amount due on the invoice.
- `currency`: The currency of the total amount (e.g., USD, EUR).
- `items`: A list of dictionaries, where each dictionary represents a line item with:
  - `description`: A brief description of the item.
  - `quantity`: The quantity of the item.
  - `unit_price`: The price per unit of the item.
  - `line_total`: The total for that line item.

Invoice Text:
---
{invoice_text}
---

Please provide the extracted data in the following JSON format:
```json
{{
  "vendor_name": "string",
  "invoice_date": "YYYY-MM-DD",
  "total_amount": float,
  "currency": "string",
  "items": [
    {{
      "description": "string",
      "quantity": float,
      "unit_price": float,
      "line_total": float
    }}
  ]
}}
```
"""

    # TinyLlama-1.1B-Chat-v1.0 ships a chat template; apply it so the model sees
    # the prompt in the format it was trained on.
    messages = [{"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    # Tokenize, truncating very long invoices to the model's context window
    inputs = tokenizer(input_text, return_tensors="pt", max_length=2048, truncation=True)

    # Generate. torch.no_grad() saves memory and speeds up inference.
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=512,  # Structured output should fit comfortably
            do_sample=False,     # Greedy decoding for deterministic extraction
            pad_token_id=tokenizer.eos_token_id,
        )

    # Decode only the newly generated tokens: decoding the full sequence would echo
    # the prompt, whose own ```json schema block would then match the regex below.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    generated_text = tokenizer.decode(new_tokens, skip_special_tokens=True)

    # Post-processing: pull the JSON block out of the model's reply, falling back
    # to the first {...} span if the model omitted the fence.
    json_match = re.search(r"```json\s*(\{.*\})\s*```", generated_text, re.DOTALL)
    if json_match is None:
        json_match = re.search(r"(\{.*\})", generated_text, re.DOTALL)
    if json_match:
        try:
            extracted_data = json.loads(json_match.group(1))  # Validate JSON
            return json.dumps(extracted_data, indent=2)
        except json.JSONDecodeError:
            return json.dumps({"error": "Failed to parse JSON from LLM output.", "raw_output": generated_text}, indent=2)
    return json.dumps({"error": "No JSON block found in LLM output.", "raw_output": generated_text}, indent=2)
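# The JSON post-processing above can be exercised without loading the model by
# feeding it a canned reply. A sketch (the reply text is fabricated; the fence
# string is built programmatically so the snippet renders cleanly in markdown):

```python
import json
import re

FENCE = "`" * 3  # build the backtick fence programmatically
fake_reply = (
    "Here is the extracted data:\n"
    + FENCE + "json\n"
    + '{"vendor_name": "Tech Solutions Inc.", "total_amount": 2914.92}\n'
    + FENCE
)

pattern = FENCE + r"json\s*(\{.*\})\s*" + FENCE
match = re.search(pattern, fake_reply, re.DOTALL)
data = json.loads(match.group(1))
print(data["vendor_name"])  # Tech Solutions Inc.
```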
# 4. Output Examples
# Example Invoice Text
example_invoice_text = """
INVOICE #INV-2023-001
Date: 2023-10-26
Vendor: Tech Solutions Inc.
123 Tech Lane, Innovation City, CA 90210
Bill To:
Acme Corp.
456 Business Blvd, Enterprise Town, NY 10001
Description Qty Unit Price Total
---
Software License (Annual) 1 999.00 999.00
Consulting Services (HR) 10 150.00 1500.00
Premium Support 1 200.00 200.00
---
Subtotal: $2699.00
Tax (8%): $215.92
Total Amount Due: $2914.92 USD
Payment Due: 2023-11-26
"""
# print("--- Example Output ---")
# print(extract_invoice_data(example_invoice_text))
# print("--- End of Example ---")
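# The prompt asks for `invoice_date` in YYYY-MM-DD form, but real invoices use
# many date formats and the model may echo the source format. A hedged
# normalization sketch (`normalize_date` and its format list are assumptions,
# not part of the agent; extend the list for your invoices):

```python
from datetime import datetime

def normalize_date(raw):
    """Try a few common invoice date formats; return YYYY-MM-DD or None."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%B %d, %Y", "%d %b %Y"):
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

print(normalize_date("26/10/2023"))       # 2023-10-26
print(normalize_date("October 26, 2023")) # 2023-10-26
```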
# 5. Hosting Instructions (Gradio)
# Create a Gradio interface
gr_interface = gr.Interface(
    fn=extract_invoice_data,
    inputs=gr.Textbox(lines=15, label="Paste Invoice Text Here", placeholder="Paste the full text of your invoice..."),
    outputs=gr.JSON(label="Extracted Invoice Data"),
    title="Invoice Data Extractor LLM Agent",
    description="Paste invoice text to automatically extract vendor, amount, date, and item details.",
)
# Launch the Gradio interface. Binding to 0.0.0.0:7860 is what Hugging Face
# Spaces expects; for local Colab testing, `gr_interface.launch(debug=True)`
# also works.
gr_interface.launch(server_name="0.0.0.0", server_port=7860)
# To deploy on Hugging Face Spaces:
# 1. Save the above code (model loading, function, gradio interface) into a single file named `app.py`.
# 2. Create a new Space on Hugging Face.
# 3. Choose "Gradio" as the SDK.
# 4. Upload your `app.py` file.
# 5. Add a `requirements.txt` file with:
# `transformers`
# `torch`
# `gradio`
# (pin specific versions if needed; note that CPU-only torch wheels such as
# `torch==2.0.1+cpu` also require the extra index URL https://download.pytorch.org/whl/cpu)
# The Space will automatically build and launch your Gradio app.