# Invoice Data Extractor Agent
# This notebook demonstrates how to build a free, open-source LLM agent
# for extracting structured data from invoice text using Google Colab (CPU-only).

# --- Markdown Cells ---

# # Invoice Data Extractor LLM Agent
#
# ## Function and Business Value
#
# The **Invoice Data Extractor** agent is designed to automate the extraction of key financial information from invoice text. Manually processing invoices is a time-consuming and error-prone task for businesses, especially those dealing with a high volume of transactions. This agent addresses this pain point by enabling users to simply paste the text content of an invoice and receive structured data in return.
#
# **Key extracted fields include:**
# - Vendor Name
# - Total Amount
# - Invoice Date
# - Item Descriptions
# - Quantities
# - Unit Prices
# - Line Item Totals
#
# This automation can significantly reduce manual data entry, minimize errors, and accelerate financial reconciliation processes for small businesses, accountants, and bookkeepers.
#
# ## Target User
#
# Small Business Owners, Accountants, Bookkeepers, Administrative Assistants.
#
# ## Usage Instructions
#
# 1.  **Prepare Invoice Text:** Obtain the text content of your invoice. This can be done by:
#     * Copy-pasting text directly from a digital PDF invoice.
#     * Using a simple OCR tool (if available, external to this Colab) to convert an image-based invoice into text.
# 2.  **Paste Text:** In the Gradio interface, paste the copied invoice text into the designated input box.
# 3.  **Run Agent:** Click the "Extract Data" button.
# 4.  **Review Output:** The agent will process the text and display the extracted data in a structured JSON format, making it easy to parse and use.
#
# ## How it Works (Technical Overview)
#
# This agent leverages a small, CPU-friendly open-source Large Language Model from Hugging Face. The model is prompted with the invoice text and instructed to extract specific entities into a structured JSON output. This approach is powerful because LLMs can understand the context and nuances of human language, making them robust to variations in invoice formats.
#
# We will use the `transformers` library for model inference and `Gradio` for creating a user-friendly web interface that can be easily deployed to Hugging Face Spaces.
#
# ## Limitations & Considerations
#
# * **CPU Performance:** The model runs on CPU and generates its output token by token, so long or complex invoices may take noticeably longer than a few seconds.
# * **Accuracy:** The accuracy depends on the quality of the input text and the LLM's capabilities. Highly unstructured or very poorly scanned invoices may yield imperfect results.
# * **Hallucination:** While efforts are made to minimize hallucination through prompt engineering, LLMs can sometimes generate plausible but incorrect data. Always verify critical extracted information.
# * **No OCR Built-in:** This agent assumes text input. For image-based invoices, an external OCR step is required.
#

# --- Code Cells ---

# 1. Setup and Environment Configuration
# Install necessary libraries.
#!pip install transformers torch gradio -q
# Inference runs on CPU: from_pretrained() loads the weights on the CPU unless the model is moved explicitly.
import torch
if torch.cuda.is_available():
    print("CUDA is available, but this project runs the model on CPU.")
    # Informational only. No device is set here; the model is simply never moved to the GPU.

import gradio as gr
from transformers import AutoTokenizer, AutoModelForCausalLM
import json
import re # For basic text cleaning/pre-processing

# 2. Recommended Open-Source Model (CPU-friendly, <2B preferred)
# Invoice data extraction needs a model that follows instructions and emits structured output.
# Candidates considered:
#   - "TinyLlama/TinyLlama-1.1B-Chat-v1.0": small enough for a zero-budget CPU runtime; used here.
#   - "microsoft/phi-2": 2.7B parameters; likely more robust, but may be slow or memory-intensive
#     in torch.float32 without quantization.
#   - Mistral-7B-v0.1 (quantized): an option only if it fits in memory.
# TinyLlama is paired with a strong prompt below. If it struggles with structured output,
# experiment with a fine-tuned, quantized, or slightly larger instruction-tuned model.

MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# For CPU, it's essential to load with default float32. low_cpu_mem_usage=True helps.
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    torch_dtype=torch.float32, # Ensure CPU compatibility
    low_cpu_mem_usage=True # Helps with memory during loading
)

print(f"Model {MODEL_NAME} loaded successfully on CPU.")
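
# A rough sizing sketch for the model choice above. The helper name and the nominal
# parameter counts (1.1B for TinyLlama, 2.7B for Phi-2) are illustrative assumptions.
# The estimate covers the weights only and ignores activations and runtime overhead.
def estimate_weight_memory_gb(num_parameters, bytes_per_parameter=4):
    """Approximate memory for model weights in GB (1 GB = 1e9 bytes); float32 uses 4 bytes."""
    return num_parameters * bytes_per_parameter / 1e9

# Usage (commented out so the app's behavior does not change):
# print(f"TinyLlama float32: ~{estimate_weight_memory_gb(1.1e9):.1f} GB")
# print(f"Phi-2 float32:     ~{estimate_weight_memory_gb(2.7e9):.1f} GB")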

# 3. Lightweight Backend Logic and Inference
def extract_invoice_data(invoice_text):
    """
    Extracts key information from invoice text using the LLM.
    Args:
        invoice_text (str): The raw text content of the invoice.
    Returns:
        str: A JSON string containing the extracted invoice data.
    """
    if not invoice_text or len(invoice_text.strip()) < 50: # Basic input validation
        return json.dumps({"error": "Please provide sufficient invoice text for extraction."}, indent=2)

    # Clean up common OCR artifacts or formatting issues
    invoice_text = invoice_text.strip()
    invoice_text = re.sub(r'\s+', ' ', invoice_text) # Collapse every whitespace run (spaces, tabs, newlines) into a single space

    # Define the extraction prompt. This is crucial for accurate structured output.
    # We instruct the LLM to output JSON and provide a clear schema.
    prompt = f"""
    You are an expert financial data extractor. Your task is to accurately extract key information from the provided invoice text and format it as a JSON object.
    Ensure all fields are correctly identified and populated. If a field is not found, use `null`.
    Extract the following:
    - `vendor_name`: The name of the company that issued the invoice.
    - `invoice_date`: The date the invoice was issued (e.g., YYYY-MM-DD).
    - `total_amount`: The total amount due on the invoice.
    - `currency`: The currency of the total amount (e.g., USD, EUR).
    - `items`: A list of dictionaries, where each dictionary represents a line item with:
        - `description`: A brief description of the item.
        - `quantity`: The quantity of the item.
        - `unit_price`: The price per unit of the item.
        - `line_total`: The total for that line item.

    Invoice Text:
    ---
    {invoice_text}
    ---

    Please provide the extracted data in the following JSON format:
    ```json
    {{
        "vendor_name": "string",
        "invoice_date": "YYYY-MM-DD",
        "total_amount": float,
        "currency": "string",
        "items": [
            {{
                "description": "string",
                "quantity": float,
                "unit_price": float,
                "line_total": float
            }}
        ]
    }}
    ```
    """

    # For TinyLlama, we follow the chat template if available, otherwise raw prompt.
    # TinyLlama-1.1B-Chat-v1.0 uses a simple chat format.
    messages = [
        {"role": "user", "content": prompt}
    ]
    # Apply chat template for TinyLlama
    input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)


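    # Tokenize input. TinyLlama's context window is 2048 tokens, so cap the prompt at 1536
    # to leave room for the 512 new tokens requested below.
    inputs = tokenizer(input_text, return_tensors="pt", max_length=1536, truncation=True)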
    # Move inputs to CPU (they are already on CPU by default if no CUDA)
    # inputs = {k: v.to('cpu') for k, v in inputs.items()} # Explicitly move to CPU if needed

    # Generate response
    # Use torch.no_grad() for inference to save memory and speed up.
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=512, # Limit output length for structured data
            num_return_sequences=1,
            do_sample=False, # Greedy decoding for deterministic output; temperature is unused when sampling is off
            pad_token_id=tokenizer.eos_token_id # Important for generation
        )

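    # Decode only the newly generated tokens. output[0] also contains the prompt, whose
    # JSON schema example would otherwise be picked up by the regex below.
    prompt_length = inputs["input_ids"].shape[1]
    generated_text = tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)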

    # Post-processing to extract JSON block
    json_match = re.search(r"```json\s*(\{.*\})\s*```", generated_text, re.DOTALL)
    if json_match:
        json_str = json_match.group(1)
        try:
            # Validate JSON
            extracted_data = json.loads(json_str)
            return json.dumps(extracted_data, indent=2)
        except json.JSONDecodeError:
            return json.dumps({"error": "Failed to parse JSON from LLM output.", "raw_output": generated_text}, indent=2)
    else:
        return json.dumps({"error": "No JSON block found in LLM output.", "raw_output": generated_text}, indent=2)
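
# A fallback sketch for cases where the model returns JSON without wrapping it in a
# fenced json block. `find_first_json_object` is a hypothetical helper and is not called
# by the function above. It relies on the standard library's json.JSONDecoder.raw_decode,
# which parses one JSON value starting at a given index and ignores trailing text.
import json

def find_first_json_object(text):
    """Return the first parseable JSON object found in `text` as a dict, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            value, _end = decoder.raw_decode(text, start)
            if isinstance(value, dict):
                return value
        except json.JSONDecodeError:
            pass  # Not valid JSON from this brace; try the next one.
        start = text.find("{", start + 1)
    return None

# Usage (commented out; the sample string is made up for illustration):
# find_first_json_object('Sure! {"vendor_name": "Tech Solutions Inc.", "total_amount": 2914.92} Done.')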


# 4. Output Examples
# Example Invoice Text
example_invoice_text = """
INVOICE #INV-2023-001
Date: 2023-10-26
Vendor: Tech Solutions Inc.
123 Tech Lane, Innovation City, CA 90210

Bill To:
Acme Corp.
456 Business Blvd, Enterprise Town, NY 10001

Description                 Qty   Unit Price   Total
---
Software License (Annual)   1     999.00       999.00
Consulting Services (HR)    10    150.00       1500.00
Premium Support             1     200.00       200.00
---
Subtotal: $2699.00
Tax (8%): $215.92
Total Amount Due: $2914.92 USD
Payment Due: 2023-11-26
"""

# print("--- Example Output ---")
# print(extract_invoice_data(example_invoice_text))
# print("--- End of Example ---")
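
# A consistency-check sketch for extracted data, in the spirit of the advice to verify
# critical fields. `check_invoice_arithmetic` is a hypothetical helper; it assumes the
# JSON schema requested in the prompt and numeric field values.
def check_invoice_arithmetic(data, tolerance=0.01):
    """Return a list of human-readable problems found in the extracted figures."""
    problems = []
    items = data.get("items") or []
    for index, item in enumerate(items):
        expected = item["quantity"] * item["unit_price"]
        if abs(expected - item["line_total"]) > tolerance:
            problems.append(
                f"items[{index}]: quantity * unit_price = {expected:.2f}, "
                f"but line_total = {item['line_total']:.2f}"
            )
    items_sum = sum(item["line_total"] for item in items)
    total = data.get("total_amount")
    # Assuming no discounts, a total that includes tax should not be below the sum of line totals.
    if total is not None and total + tolerance < items_sum:
        problems.append(
            f"total_amount {total:.2f} is less than the sum of line totals {items_sum:.2f}"
        )
    return problems

# Usage (commented out; the dict would come from json.loads on the extractor's output):
# problems = check_invoice_arithmetic(json.loads(extract_invoice_data(example_invoice_text)))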


# 5. Hosting Instructions (Gradio)
# Create a Gradio interface
gr_interface = gr.Interface(
    fn=extract_invoice_data,
    inputs=gr.Textbox(lines=15, label="Paste Invoice Text Here", placeholder="Paste the full text of your invoice..."),
    outputs=gr.JSON(label="Extracted Invoice Data"),
    title="Invoice Data Extractor LLM Agent",
    description="Paste invoice text to automatically extract vendor, amount, date, and item details."
)

# Launch the Gradio interface.
# Host 0.0.0.0 and port 7860 are the settings used when this file is deployed as `app.py`
# on Hugging Face Spaces.
# For local Colab testing, use this instead:
# gr_interface.launch(debug=True)
gr_interface.launch(server_name="0.0.0.0", server_port=7860)

# To deploy on Hugging Face Spaces:
# 1. Save the above code (model loading, function, gradio interface) into a single file named `app.py`.
# 2. Create a new Space on Hugging Face.
# 3. Choose "Gradio" as the SDK.
# 4. Upload your `app.py` file.
# 5. Add a `requirements.txt` file with:
#    `transformers`
#    `torch`
#    `gradio`
#    (and any other specific versions if needed, e.g., `torch==2.0.1+cpu`)
# The Space will automatically build and launch your Gradio app.