# Invoice Data Extractor Agent
# This notebook demonstrates how to build a free, open-source LLM agent
# for extracting structured data from invoice text using Google Colab (CPU-only).
# --- Markdown Cells ---
# # Invoice Data Extractor LLM Agent
#
# ## Function and Business Value
#
# The **Invoice Data Extractor** agent is designed to automate the extraction of key financial information from invoice text. Manually processing invoices is a time-consuming and error-prone task for businesses, especially those dealing with a high volume of transactions. This agent addresses this pain point by enabling users to simply paste the text content of an invoice and receive structured data in return.
#
# **Key extracted fields include:**
# - Vendor Name
# - Total Amount
# - Invoice Date
# - Item Descriptions
# - Quantities
# - Unit Prices
# - Line Item Totals
#
# This automation can significantly reduce manual data entry, minimize errors, and accelerate financial reconciliation processes for small businesses, accountants, and bookkeepers.
#
# ## Target User
#
# Small Business Owners, Accountants, Bookkeepers, Administrative Assistants.
#
# ## Usage Instructions
#
# 1. **Prepare Invoice Text:** Obtain the text content of your invoice. This can be done by:
#    * Copy-pasting text directly from a digital PDF invoice.
#    * Using a simple OCR tool (if available, external to this Colab) to convert an image-based invoice into text.
# 2. **Paste Text:** In the Gradio interface, paste the copied invoice text into the designated input box.
# 3. **Run Agent:** Click the "Extract Data" button.
# 4. **Review Output:** The agent will process the text and display the extracted data in a structured JSON format, making it easy to parse and use.
#
# ## How it Works (Technical Overview)
#
# This agent leverages a small, CPU-friendly open-source Large Language Model from Hugging Face. The model is prompted with the invoice text and instructed to extract specific entities into a structured JSON output. This approach is powerful because LLMs can understand the context and nuances of human language, making them robust to variations in invoice formats.
#
# We will use the `transformers` library for model inference and `Gradio` for creating a user-friendly web interface that can be easily deployed to Hugging Face Spaces.
#
# ## Limitations & Considerations
#
# * **CPU Performance:** The model is small enough to run on CPU, but generation is slow; long or complex invoices can take noticeably longer than a few seconds.
# * **Accuracy:** The accuracy depends on the quality of the input text and the LLM's capabilities. Highly unstructured or very poorly scanned invoices may yield imperfect results.
# * **Hallucination:** While efforts are made to minimize hallucination through prompt engineering, LLMs can sometimes generate plausible but incorrect data. Always verify critical extracted information.
# * **No OCR Built-in:** This agent assumes text input. For image-based invoices, an external OCR step is required.
#
# --- Code Cells ---
# 1. Setup and Environment Configuration
# Install necessary libraries.
#!pip install transformers torch gradio -q
# Report whether a GPU is visible. This script never moves the model or inputs to CUDA,
# so inference runs on the CPU either way.
import torch
if torch.cuda.is_available():
    print("CUDA is available, but this project runs on CPU.")
# The model is loaded below without a device argument, so transformers keeps it on the CPU.
import gradio as gr
from transformers import AutoTokenizer, AutoModelForCausalLM
import json
import re # For basic text cleaning/pre-processing
# 2. Recommended Open-Source Model (CPU-friendly, <2B preferred)
# Invoice extraction needs a model that follows instructions and emits structured output.
# Candidates considered:
#   - "TinyLlama/TinyLlama-1.1B-Chat-v1.0": small and instruction-tuned; a good starting point on a zero-budget CPU.
#   - "microsoft/phi-2": 2.7B parameters; likely more robust, but may be slow or memory-intensive in float32 without quantization.
#   - Mistral-7B-v0.1 (quantized): an option only if it fits in memory.
# We target TinyLlama and compensate with a strong prompt. If it struggles with structured output,
# experiment with a fine-tuned, quantized, or slightly larger model.
# Model ID on Hugging Face
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# For CPU, it's essential to load with default float32. low_cpu_mem_usage=True helps.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float32,  # Ensure CPU compatibility
    low_cpu_mem_usage=True,  # Helps with memory during loading
)
print(f"Model {MODEL_NAME} loaded successfully on CPU.")
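
# --- Illustrative sketch (not part of the original notebook) ---
# The comments above weigh model size against CPU memory. A rough way to compare candidates
# is parameters x bytes per parameter. `estimate_weight_memory_gib` is a hypothetical helper,
# the parameter counts below are approximate, and the result covers weights only
# (no activations, KV cache, or framework overhead).
def estimate_weight_memory_gib(num_params, bytes_per_param=4):
    """Return the approximate size of the model weights in GiB (float32 = 4 bytes per parameter)."""
    return num_params * bytes_per_param / (1024 ** 3)


# TinyLlama (~1.1B parameters) in float32 needs roughly 4 GiB for weights alone, while
# Phi-2 (~2.7B parameters) needs roughly 10 GiB, which is why it is a tighter fit on a CPU-only runtime.
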
# 3. Lightweight Backend Logic and Inference
def extract_invoice_data(invoice_text):
    """
    Extracts key information from invoice text using the LLM.

    Args:
        invoice_text (str): The raw text content of the invoice.

    Returns:
        str: A JSON string containing the extracted invoice data.
    """
    if not invoice_text or len(invoice_text.strip()) < 50:  # Basic input validation
        return json.dumps({"error": "Please provide sufficient invoice text for extraction."}, indent=2)
    # Clean up common OCR artifacts or formatting issues
    invoice_text = invoice_text.strip()
    invoice_text = re.sub(r'\s+', ' ', invoice_text)  # Collapse every whitespace run (including newlines) into one space
    # Define the extraction prompt. This is crucial for accurate structured output.
    # We instruct the LLM to output JSON and provide a clear schema.
    prompt = f"""
You are an expert financial data extractor. Your task is to accurately extract key information from the provided invoice text and format it as a JSON object.
Ensure all fields are correctly identified and populated. If a field is not found, use `null`.
Extract the following:
- `vendor_name`: The name of the company that issued the invoice.
- `invoice_date`: The date the invoice was issued (e.g., YYYY-MM-DD).
- `total_amount`: The total amount due on the invoice.
- `currency`: The currency of the total amount (e.g., USD, EUR).
- `items`: A list of dictionaries, where each dictionary represents a line item with:
  - `description`: A brief description of the item.
  - `quantity`: The quantity of the item.
  - `unit_price`: The price per unit of the item.
  - `line_total`: The total for that line item.
Invoice Text:
---
{invoice_text}
---
Please provide the extracted data in the following JSON format:
```json
{{
  "vendor_name": "string",
  "invoice_date": "YYYY-MM-DD",
  "total_amount": float,
  "currency": "string",
  "items": [
    {{
      "description": "string",
      "quantity": float,
      "unit_price": float,
      "line_total": float
    }}
  ]
}}
```
"""
    # TinyLlama-1.1B-Chat-v1.0 ships a chat template, so wrap the prompt as a single user turn.
    messages = [
        {"role": "user", "content": prompt}
    ]
    # Apply chat template for TinyLlama
    input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Tokenize input. The prompt is capped at 1536 tokens so that prompt plus 512 new tokens
    # stays within TinyLlama's 2048-token context window.
    inputs = tokenizer(input_text, return_tensors="pt", max_length=1536, truncation=True)
    # The tensors are created on the CPU by default, so no explicit device move is needed.
    # Generate response
    # Use torch.no_grad() for inference to save memory and speed up.
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=512,  # Limit output length for structured data
            num_return_sequences=1,
            do_sample=False,  # Greedy decoding for deterministic output; temperature only applies when sampling
            pad_token_id=tokenizer.eos_token_id,  # Important for generation
        )
    # Decode only the newly generated tokens. Decoding output[0] in full would also return the
    # prompt, whose ```json schema example would then be picked up by the regex below.
    prompt_length = inputs["input_ids"].shape[1]
    generated_text = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)
    # Post-processing to extract JSON block
    json_match = re.search(r"```json\s*(\{.*\})\s*```", generated_text, re.DOTALL)
    if json_match:
        json_str = json_match.group(1)
        try:
            # Validate JSON
            extracted_data = json.loads(json_str)
            return json.dumps(extracted_data, indent=2)
        except json.JSONDecodeError:
            return json.dumps({"error": "Failed to parse JSON from LLM output.", "raw_output": generated_text}, indent=2)
    else:
        return json.dumps({"error": "No JSON block found in LLM output.", "raw_output": generated_text}, indent=2)
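

# --- Illustrative sketch (not part of the original notebook) ---
# Small models do not always wrap their answer in a ```json fence, in which case the regex
# used in extract_invoice_data finds nothing. One possible fallback is to scan for the first
# substring that parses as a JSON object. `find_first_json_object` is a hypothetical helper;
# extract_invoice_data does not call it as written.
import json


def find_first_json_object(text):
    """Return the first JSON object embedded in `text` as a dict, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start` and ignores trailing text.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        start = text.find("{", start + 1)
    return None
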
# 4. Output Examples
# Example Invoice Text
example_invoice_text = """
INVOICE #INV-2023-001
Date: 2023-10-26
Vendor: Tech Solutions Inc.
123 Tech Lane, Innovation City, CA 90210
Bill To:
Acme Corp.
456 Business Blvd, Enterprise Town, NY 10001
Description Qty Unit Price Total
---
Software License (Annual) 1 999.00 999.00
Consulting Services (HR) 10 150.00 1500.00
Premium Support 1 200.00 200.00
---
Subtotal: $2699.00
Tax (8%): $215.92
Total Amount Due: $2914.92 USD
Payment Due: 2023-11-26
"""
# print("--- Example Output ---")
# print(extract_invoice_data(example_invoice_text))
# print("--- End of Example ---")
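
# --- Illustrative sketch (not part of the original notebook) ---
# Because the model can produce plausible but wrong numbers, extracted figures are worth
# cross-checking arithmetically. `check_line_items` is a hypothetical helper that assumes
# the JSON schema requested in the prompt (items with numeric quantity, unit_price, line_total).
def check_line_items(data, tolerance=0.01):
    """Return (subtotal, problems): the sum of line totals and a list of inconsistent lines."""
    problems = []
    subtotal = 0.0
    for index, item in enumerate(data.get("items") or []):
        quantity = item.get("quantity")
        unit_price = item.get("unit_price")
        line_total = item.get("line_total")
        if None in (quantity, unit_price, line_total):
            problems.append(f"item {index}: missing value")
            continue
        # A line is consistent when quantity x unit price matches its stated total.
        if abs(quantity * unit_price - line_total) > tolerance:
            problems.append(f"item {index}: {quantity} x {unit_price} != {line_total}")
        subtotal += line_total
    return round(subtotal, 2), problems


# For the example invoice, a correct extraction has line totals 999.00, 1500.00 and 200.00,
# which sum to the stated subtotal of 2699.00; adding the 215.92 tax gives the 2914.92 total due.
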
# 5. Hosting Instructions (Gradio)
# Create a Gradio interface
gr_interface = gr.Interface(
    fn=extract_invoice_data,
    inputs=gr.Textbox(lines=15, label="Paste Invoice Text Here", placeholder="Paste the full text of your invoice..."),
    outputs=gr.JSON(label="Extracted Invoice Data"),
    title="Invoice Data Extractor LLM Agent",
    description="Paste invoice text to automatically extract vendor, amount, date, and item details."
)
# Launch the Gradio interface
# For Hugging Face Spaces deployment, this call is crucial: app.py ends with it,
# binding the app to 0.0.0.0 on port 7860.
gr_interface.launch(server_name="0.0.0.0", server_port=7860)
# For local Colab testing, use this call instead:
# gr_interface.launch(debug=True)
# To deploy on Hugging Face Spaces:
# 1. Save the above code (model loading, function, gradio interface) into a single file named `app.py`.
# 2. Create a new Space on Hugging Face.
# 3. Choose "Gradio" as the SDK.
# 4. Upload your `app.py` file.
# 5. Add a `requirements.txt` file with:
#    `transformers`
#    `torch`
#    `gradio`
#    `accelerate` (transformers needs it when loading with `low_cpu_mem_usage=True`)
#    (and any other specific versions if needed, e.g., `torch==2.0.1+cpu`)
# The Space will automatically build and launch your Gradio app.