# Invoice Data Extractor Agent
# This notebook demonstrates how to build a free, open-source LLM agent
# for extracting structured data from invoice text using Google Colab (CPU-only).
# --- Markdown Cells ---
# # Invoice Data Extractor LLM Agent
#
# ## Function and Business Value
#
# The **Invoice Data Extractor** agent automates the extraction of key financial information from invoice text. Manually processing invoices is time-consuming and error-prone, especially for businesses handling a high volume of transactions. This agent addresses that pain point: paste the text content of an invoice and receive structured data in return.
#
# **Key extracted fields include:**
# - Vendor Name
# - Total Amount
# - Invoice Date
# - Item Descriptions
# - Quantities
# - Unit Prices
# - Line Item Totals
#
# This automation can significantly reduce manual data entry, minimize errors, and accelerate financial reconciliation for small businesses, accountants, and bookkeepers.
#
# ## Target User
#
# Small Business Owners, Accountants, Bookkeepers, Administrative Assistants.
#
# ## Usage Instructions
#
# 1. **Prepare Invoice Text:** Obtain the text content of your invoice by:
#    * Copy-pasting text directly from a digital PDF invoice.
#    * Using an OCR tool (external to this Colab) to convert an image-based invoice into text.
# 2. **Paste Text:** In the Gradio interface, paste the invoice text into the input box.
# 3. **Run Agent:** Click the "Extract Data" button.
# 4. **Review Output:** The agent processes the text and displays the extracted data as structured JSON, making it easy to parse and use.
#
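# For illustration, an invoice with a single line item might yield output like the following (all values here are hypothetical):
#
# ```json
# {
#   "vendor_name": "Tech Solutions Inc.",
#   "invoice_date": "2023-10-26",
#   "total_amount": 999.00,
#   "currency": "USD",
#   "items": [
#     {"description": "Software License (Annual)", "quantity": 1, "unit_price": 999.00, "line_total": 999.00}
#   ]
# }
# ```
#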
# ## How it Works (Technical Overview)
#
# This agent leverages a small, CPU-friendly open-source Large Language Model from Hugging Face. The model is prompted with the invoice text and instructed to extract specific entities into a structured JSON output. LLMs can understand the context and nuances of human language, which makes this approach robust to variations in invoice formats.
#
# We use the `transformers` library for model inference and Gradio for a user-friendly web interface that can be easily deployed to Hugging Face Spaces.
#
# ## Limitations & Considerations
#
# * **CPU Performance:** Although the setup targets CPU, expect generation to take anywhere from several seconds to a few minutes for long or complex invoices.
# * **Accuracy:** Accuracy depends on the quality of the input text and the LLM's capabilities. Highly unstructured or poorly scanned invoices may yield imperfect results.
# * **Hallucination:** Prompt engineering reduces but does not eliminate hallucination; LLMs can generate plausible but incorrect data. Always verify critical extracted information.
# * **No Built-in OCR:** This agent assumes text input. Image-based invoices require an external OCR step.
#
# --- Code Cells ---
# 1. Setup and Environment Configuration
# Install the necessary libraries.
# !pip install transformers torch gradio -q
import torch
import gradio as gr
from transformers import AutoTokenizer, AutoModelForCausalLM
import json
import re  # For basic text cleaning/pre-processing

# Torch defaults to CPU when no GPU is assigned in Colab, so no explicit device
# placement is needed; we only note it if CUDA happens to be available.
if torch.cuda.is_available():
    print("CUDA is available, but this project runs on CPU.")
# 2. Recommended Open-Source Model (CPU-friendly, <2B parameters preferred)
# Invoice extraction needs a model that follows instructions and emits structured
# output reliably. "TinyLlama/TinyLlama-1.1B-Chat-v1.0" is a good starting point
# for a zero-budget CPU setup: it is small enough to load in float32 on Colab's
# free CPU tier. If it struggles with structured output, "microsoft/phi-2"
# (2.7B parameters) is more robust but slower and more memory-intensive without
# quantization; pre-quantized variants are worth experimenting with.
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

# On CPU, load with the default float32 dtype; low_cpu_mem_usage=True helps with memory during loading.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float32,  # Ensure CPU compatibility
    low_cpu_mem_usage=True      # Helps with memory during loading
)
print(f"Model {MODEL_NAME} loaded successfully on CPU.")
# 3. Lightweight Backend Logic and Inference
def extract_invoice_data(invoice_text):
    """
    Extracts key information from invoice text using the LLM.

    Args:
        invoice_text (str): The raw text content of the invoice.

    Returns:
        str: A JSON string containing the extracted invoice data.
    """
    if not invoice_text or len(invoice_text.strip()) < 50:  # Basic input validation
        return json.dumps({"error": "Please provide sufficient invoice text for extraction."}, indent=2)

    # Clean up common OCR artifacts and formatting issues.
    invoice_text = invoice_text.strip()
    invoice_text = re.sub(r'\s+', ' ', invoice_text)  # Collapse runs of whitespace

    # The extraction prompt is crucial for accurate structured output:
    # we instruct the LLM to output JSON and provide a clear schema.
    prompt = f"""
You are an expert financial data extractor. Your task is to accurately extract key information from the provided invoice text and format it as a JSON object.
Ensure all fields are correctly identified and populated. If a field is not found, use `null`.
Extract the following:
- `vendor_name`: The name of the company that issued the invoice.
- `invoice_date`: The date the invoice was issued (e.g., YYYY-MM-DD).
- `total_amount`: The total amount due on the invoice.
- `currency`: The currency of the total amount (e.g., USD, EUR).
- `items`: A list of dictionaries, where each dictionary represents a line item with:
  - `description`: A brief description of the item.
  - `quantity`: The quantity of the item.
  - `unit_price`: The price per unit of the item.
  - `line_total`: The total for that line item.

Invoice Text:
---
{invoice_text}
---

Please provide the extracted data in the following JSON format:
```json
{{
  "vendor_name": "string",
  "invoice_date": "YYYY-MM-DD",
  "total_amount": float,
  "currency": "string",
  "items": [
    {{
      "description": "string",
      "quantity": float,
      "unit_price": float,
      "line_total": float
    }}
  ]
}}
```
"""

    # TinyLlama-1.1B-Chat-v1.0 expects its chat template.
    messages = [{"role": "user", "content": prompt}]
    input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    # Tokenize; tensors live on CPU by default when no CUDA device is present.
    inputs = tokenizer(input_text, return_tensors="pt", max_length=2048, truncation=True)

    # torch.no_grad() saves memory and speeds up inference.
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=512,              # Limit output length for structured data
            num_return_sequences=1,
            do_sample=False,                 # Greedy decoding for deterministic, factual extraction
            pad_token_id=tokenizer.eos_token_id  # Important for generation
        )

    # Decode only the newly generated tokens. Decoding the full sequence would
    # include the prompt, and the regex below would then match the schema
    # example in the prompt instead of the model's answer.
    new_tokens = output[0][inputs["input_ids"].shape[1]:]
    generated_text = tokenizer.decode(new_tokens, skip_special_tokens=True)

    # Post-processing to extract the JSON block.
    json_match = re.search(r"```json\s*(\{.*\})\s*```", generated_text, re.DOTALL)
    if json_match:
        try:
            extracted_data = json.loads(json_match.group(1))  # Validate JSON
            return json.dumps(extracted_data, indent=2)
        except json.JSONDecodeError:
            return json.dumps({"error": "Failed to parse JSON from LLM output.", "raw_output": generated_text}, indent=2)
    return json.dumps({"error": "No JSON block found in LLM output.", "raw_output": generated_text}, indent=2)
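# The regex above assumes the model wraps its answer in a fenced ```json block,
# but small models do not always comply. A more tolerant fallback is to grab the
# outermost brace pair before giving up. This `extract_json_block` helper is a
# standalone sketch (not wired into the agent above):

```python
import json
import re

def extract_json_block(text):
    """Try a fenced ```json block first, then fall back to the outermost braces."""
    match = re.search(r"```json\s*(\{.*\})\s*```", text, re.DOTALL)
    if match:
        candidate = match.group(1)
    else:
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            return None  # No brace-delimited region at all
        candidate = text[start:end + 1]
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None

# Works with or without the fence:
print(extract_json_block('Here you go: {"total_amount": 42.0}'))  # -> {'total_amount': 42.0}
```

# Returning `None` (rather than raising) keeps the caller's error-handling simple:
# any parse failure funnels into the same "could not extract JSON" path.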
# 4. Output Examples
# Example invoice text
example_invoice_text = """
INVOICE #INV-2023-001
Date: 2023-10-26
Vendor: Tech Solutions Inc.
123 Tech Lane, Innovation City, CA 90210

Bill To:
Acme Corp.
456 Business Blvd, Enterprise Town, NY 10001

Description                 Qty   Unit Price    Total
---
Software License (Annual)     1       999.00   999.00
Consulting Services (HR)     10       150.00  1500.00
Premium Support               1       200.00   200.00
---
Subtotal: $2699.00
Tax (8%): $215.92
Total Amount Due: $2914.92 USD
Payment Due: 2023-11-26
"""

# Uncomment to test the extractor on the example above (slow on CPU):
# print("--- Example Output ---")
# print(extract_invoice_data(example_invoice_text))
# print("--- End of Example ---")
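# As the Limitations section notes, the model can hallucinate numbers, so it is
# worth cross-checking the arithmetic in the extracted data before trusting it.
# A minimal sanity check (assuming the schema above, where quantity * unit_price
# should equal each line_total; `check_line_items` is an illustrative helper, not
# part of the agent):

```python
def check_line_items(data, tol=0.01):
    """Flag line items whose quantity * unit_price disagrees with line_total."""
    problems = []
    for i, item in enumerate(data.get("items") or []):
        qty, price, total = item.get("quantity"), item.get("unit_price"), item.get("line_total")
        if None in (qty, price, total):
            problems.append(f"item {i}: missing numeric field")
        elif abs(qty * price - total) > tol:
            problems.append(f"item {i}: {qty} x {price} != {total}")
    return problems

# Example: the second line item below is internally inconsistent.
sample = {
    "items": [
        {"description": "Software License (Annual)", "quantity": 1, "unit_price": 999.00, "line_total": 999.00},
        {"description": "Consulting Services (HR)", "quantity": 10, "unit_price": 150.00, "line_total": 1400.00},
    ]
}
print(check_line_items(sample))  # -> ['item 1: 10 x 150.0 != 1400.0']
```

# Any flagged item is a cue to re-read the source invoice rather than trust the extraction.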
# 5. Hosting Instructions (Gradio)
# Create the Gradio interface.
gr_interface = gr.Interface(
    fn=extract_invoice_data,
    inputs=gr.Textbox(lines=15, label="Paste Invoice Text Here", placeholder="Paste the full text of your invoice..."),
    outputs=gr.JSON(label="Extracted Invoice Data"),
    title="Invoice Data Extractor LLM Agent",
    description="Paste invoice text to automatically extract vendor, amount, date, and item details."
)

# Launch the Gradio interface. These server settings are what a Hugging Face
# Spaces deployment expects; for local Colab testing,
# `gr_interface.launch(debug=True)` also works.
gr_interface.launch(server_name="0.0.0.0", server_port=7860)

# To deploy on Hugging Face Spaces:
# 1. Save the above code (model loading, extraction function, Gradio interface) into a single file named `app.py`.
# 2. Create a new Space on Hugging Face.
# 3. Choose "Gradio" as the SDK.
# 4. Upload your `app.py` file.
# 5. Add a `requirements.txt` file listing:
#    `transformers`
#    `torch`
#    `gradio`
#    (pin specific versions if needed, e.g., a CPU-only torch build)
# The Space will automatically build and launch your Gradio app.