Jakecole1 committed on
Commit
6c16992
verified
1 Parent(s): 31c1fbd

Upload 11 files

Files changed (11)
  1. .env +1 -0
  2. .gitignore +2 -0
  3. README.md +185 -10
  4. compare_app.py +620 -0
  5. main.py +938 -0
  6. quick_test.py +63 -0
  7. quick_test_results.md +17 -0
  8. requirements.txt +14 -0
  9. test_google_doc_ai.py +183 -0
  10. test_metadata.py +89 -0
  11. test_pdf_requirements.py +74 -0
.env ADDED
@@ -0,0 +1 @@
1
+ CLAUDE_API_KEY=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
.gitignore ADDED
@@ -0,0 +1,2 @@
1
+ .env
2
+ requirements_library/client-requirements/
README.md CHANGED
@@ -1,10 +1,185 @@
1
- ---
2
- title: QC Rules
3
- emoji: 🐢
4
- colorFrom: yellow
5
- colorTo: indigo
6
- sdk: docker
7
- pinned: false
8
- ---
9
-
10
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ # Artwork Comparison Tool
2
+
3
+ A Streamlit-based application for comparing packaging artwork PDFs using AI-powered analysis. This tool extracts text, images, and barcodes from PDF files and provides detailed comparison analysis using Claude AI.
4
+
5
+ ## Features
6
+
7
+ - **PDF Processing**: Extract text, bounding boxes, and convert PDFs to images
8
+ - **Barcode Detection**: Scan and validate barcodes in artwork
9
+ - **AI-Powered Comparison**: Use Claude AI to analyze differences between artworks
10
+ - **Compliance Analysis**: Identify potential compliance impacts of changes
11
+ - **Visual Comparison**: Side-by-side image comparison with extracted data
12
+ - **Client File Management**: Load and compare files from a structured client directory
13
+
14
+ ## Installation
15
+
16
+ ### Prerequisites
17
+
18
+ - Python 3.8 or higher
19
+ - Windows, macOS, or Linux
20
+
21
+ ### Setup
22
+
23
+ 1. **Clone or download the project**
24
+ ```bash
25
+ git clone <repository-url>
26
+ cd SGK-AI-LAB
27
+ ```
28
+
29
+ 2. **Install dependencies**
30
+ ```bash
31
+ pip install -r requirements.txt
32
+ ```
33
+
34
+ 3. **Set up Google Cloud credentials** (for text extraction)
35
+ - Place your Google Cloud service account JSON file at `src/extract_text/photon-services-f0d3ec1417d0.json`
36
+ - Or update the path in `compare_app.py` to point to your credentials file
37
+
38
+ 4. **Set up Anthropic API key** (for AI comparison)
39
+ ```bash
40
+ # Windows
41
+ set CLAUDE_API_KEY=your_api_key_here
42
+
43
+ # macOS/Linux
44
+ export CLAUDE_API_KEY=your_api_key_here
45
+ ```
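The app reads the key from the process environment with `os.getenv('CLAUDE_API_KEY')`, so the variable must be exported before launch. If you would rather keep the key in the repo's `.env` file, a minimal stdlib loader could be called at startup (a sketch; the app does not currently do this):

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: reads KEY=VALUE lines into os.environ.

    Blank lines and '#' comments are ignored; variables already set
    in the environment are not overwritten.
    """
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())

# Would be called before the app reads the key, e.g. at the top of compare_app.py:
# load_env()
```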
46
+
47
+ ## Usage
48
+
49
+ ### Running the Application
50
+
51
+ ```bash
52
+ streamlit run compare_app.py
53
+ ```
54
+
55
+ The application will open in your default web browser at `http://localhost:8501`.
56
+
57
+ ### How to Use
58
+
59
+ 1. **Select Artworks to Compare**
60
+ - Choose from existing client files in the dropdown
61
+ - Or upload new PDF files using the file uploader
62
+
63
+ 2. **View Side-by-Side Comparison**
64
+ - See both artworks displayed side by side
65
+ - View extracted data summaries (text elements, barcodes)
66
+
67
+ 3. **Run AI Analysis**
68
+ - Click "Compare Artworks" to start the AI-powered analysis
69
+ - Wait for Claude to process and analyze the differences
70
+
71
+ 4. **Review Results**
72
+ - **Text Differences**: Missing, added, or changed text elements
73
+ - **Layout Changes**: Repositioned elements and their impact
74
+ - **Barcode Changes**: Differences in barcode data or positioning
75
+ - **Visual Differences**: Design and visual element changes
76
+ - **Compliance Impact**: Potential regulatory compliance issues
77
+ - **Recommendations**: Actionable insights and next steps
78
+
79
+ ### File Structure
80
+
81
+ ```
82
+ SGK-AI-LAB/
83
+ ├── compare_app.py # Main Streamlit application
84
+ ├── requirements.txt # Python dependencies
85
+ ├── test_compare_app.py # Test script
86
+ ├── README.md # This file
87
+ ├── requirements_library/ # Client artwork files
88
+ │ └── client-requirements/
89
+ │ ├── M&S/
90
+ │ │ ├── Curry puff/
91
+ │ │ └── Lemon package/
92
+ │ └── package/
93
+ └── src/
94
+ ├── core/
95
+ │ └── analysis.py
96
+ ├── extract_text/
97
+ │ ├── google_document_api.py
98
+ │ ├── ingest.py
99
+ │ └── photon-services-f0d3ec1417d0.json
100
+ └── utils/
101
+ ├── barcode.py
102
+ └── image_utils.py
103
+ ```
104
+
105
+ ## Configuration
106
+
107
+ ### Google Cloud Document AI
108
+
109
+ The application uses Google Cloud Document AI for text extraction. To set up:
110
+
111
+ 1. Create a Google Cloud project
112
+ 2. Enable Document AI API
113
+ 3. Create a service account and download the JSON credentials
114
+ 4. Place the credentials file in `src/extract_text/` or update the path in the code
115
+
116
+ ### Anthropic Claude API
117
+
118
+ For AI-powered comparison, you need an Anthropic API key:
119
+
120
+ 1. Sign up at [Anthropic Console](https://console.anthropic.com/)
121
+ 2. Create an API key
122
+ 3. Set the `CLAUDE_API_KEY` environment variable
123
+
124
+ ## Dependencies
125
+
126
+ ### Core Dependencies
127
+ - `streamlit` - Web application framework
128
+ - `anthropic` - Claude AI API client
129
+ - `google-cloud-documentai` - Google Document AI
130
+ - `pdf2image` - PDF to image conversion
131
+ - `Pillow` - Image processing
132
+ - `opencv-python` - Computer vision
133
+ - `numpy` - Numerical computing
134
+ - `pandas` - Data manipulation
135
+
136
+ ### Barcode Processing
137
+ - `zxing-cpp` - Barcode detection
138
+ - `barcodenumber` - Barcode validation
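As background on what barcode validation involves, the EAN-13 check-digit test (the sort of check `barcodenumber` wraps; this standalone sketch is not the library's code) fits in a few lines:

```python
def ean13_is_valid(code: str) -> bool:
    """Check an EAN-13 barcode's check digit.

    The first 12 digits are weighted 1,3,1,3,... left to right;
    the 13th digit must bring the weighted sum to a multiple of 10.
    """
    if len(code) != 13 or not code.isdigit():
        return False
    digits = [int(c) for c in code]
    weighted = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits[:12]))
    return (10 - weighted % 10) % 10 == digits[12]
```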
139
+
140
+ ## Troubleshooting
141
+
142
+ - **PDF conversion fails**: `pdf2image` requires Poppler to be installed and available on your PATH.
+ - **Text extraction fails**: verify that the Google Cloud credentials JSON exists at the path configured in `compare_app.py`.
+ - **AI comparison fails**: confirm that the `CLAUDE_API_KEY` environment variable is set to a valid Anthropic API key.
143
+ ## API Response Format
144
+
145
+ The AI comparison returns structured JSON with the following format:
146
+
147
+ ```json
148
+ {
149
+ "overall_similarity": 0.85,
150
+ "comparison_summary": "Brief overview",
151
+ "text_differences": [
152
+ {
153
+ "category": "Missing Text",
154
+ "artwork1_content": "Text in artwork 1",
155
+ "artwork2_content": "Text in artwork 2",
156
+ "significance": "HIGH/MEDIUM/LOW",
157
+ "description": "Detailed explanation"
158
+ }
159
+ ],
160
+ "layout_differences": [...],
161
+ "barcode_differences": [...],
162
+ "visual_differences": [...],
163
+ "compliance_impact": [...],
164
+ "recommendations": ["Action item 1", "Action item 2"]
165
+ }
166
+ ```
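Downstream tooling can filter this structure directly. For example, pulling out only the HIGH-significance findings across all difference categories (a sketch against the format above; the sample values are made up):

```python
import json

# A made-up result in the documented response shape (truncated for brevity).
sample = json.loads("""
{
  "overall_similarity": 0.85,
  "comparison_summary": "Brief overview",
  "text_differences": [
    {"category": "Missing Text", "significance": "HIGH",
     "description": "Allergen statement absent from artwork 2"}
  ],
  "layout_differences": [],
  "barcode_differences": [
    {"category": "Barcode Changes", "significance": "LOW",
     "description": "Barcode repositioned slightly"}
  ],
  "visual_differences": [],
  "compliance_impact": [],
  "recommendations": ["Restore the allergen statement"]
}
""")

DIFF_KEYS = ("text_differences", "layout_differences",
             "barcode_differences", "visual_differences")

def high_impact(results: dict) -> list:
    """Return every difference flagged HIGH, across all categories."""
    return [d for key in DIFF_KEYS for d in results.get(key, [])
            if d.get("significance") == "HIGH"]

for finding in high_impact(sample):
    print(f"[{finding['category']}] {finding['description']}")
```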
167
+
168
+ ## Contributing
169
+
170
+ 1. Fork the repository
171
+ 2. Create a feature branch
172
+ 3. Make your changes
173
+ 4. Add tests if applicable
174
+ 5. Submit a pull request
175
+
176
+ ## License
177
+
178
+ [Add your license information here]
179
+
180
+ ## Support
181
+
182
+ For issues and questions:
183
+ 1. Check the troubleshooting section
184
+ 2. Review the test script output
185
+ 3. Create an issue with detailed error information
compare_app.py ADDED
@@ -0,0 +1,620 @@
1
+ import streamlit as st
2
+ import tempfile
3
+ import os
4
+ import pandas as pd
5
+ from src.extract_text.google_document_api import GoogleDocumentAPI
6
+ from pdf2image import convert_from_path
7
+ from PIL import Image, ImageDraw, ImageFont
8
+ from src.utils.image_utils import ImageUtils
9
+ import base64
10
+ from io import BytesIO
11
+ from src.utils.barcode import Barcode
12
+ import anthropic
13
+ import json
14
+
15
+ def load_client_artwork_files():
16
+ """Load all artwork PDF files from client directory"""
17
+ base_path = "requirements_library/client-requirements"
18
+ artwork_files = []
19
+
20
+ if not os.path.exists(base_path):
21
+ return artwork_files
22
+
23
+ # Walk through all subdirectories
24
+ for root, dirs, files in os.walk(base_path):
25
+ for file in files:
26
+ file_path = os.path.join(root, file)
27
+ relative_path = os.path.relpath(file_path, base_path)
28
+
29
+ if file.lower().endswith('.pdf'):
30
+ artwork_files.append({
31
+ 'name': f"{relative_path}",
32
+ 'path': file_path,
33
+ 'type': 'artwork'
34
+ })
35
+
36
+ return artwork_files
37
+
38
+ def load_artwork_content(file_info):
39
+ """Load artwork content as bytes"""
40
+ try:
41
+ with open(file_info['path'], 'rb') as f:
42
+ return f.read()
43
+ except Exception as e:
44
+ st.error(f"Error loading artwork file {file_info['name']}: {str(e)}")
45
+ return None
46
+
47
+ def extract_pdf_data(pdf_file, file_name):
48
+ """Extract text, bounding boxes, images, and barcodes from PDF"""
49
+ try:
50
+ # Create a temporary file to process the PDF
51
+ with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp_file:
52
+ pdf_file.seek(0)
53
+ tmp_file.write(pdf_file.read())
54
+ tmp_pdf_path = tmp_file.name
55
+
56
+ # Extract text and bounding boxes using Google Document API
57
+ google_document_api = GoogleDocumentAPI(credentials_path="src/extract_text/photon-services-f0d3ec1417d0.json")
58
+ document = google_document_api.process_document(tmp_pdf_path)
59
+ text_content = google_document_api.extract_text_with_markdown_table(document)
60
+ bounding_boxes = google_document_api.extract_text_with_bounding_boxes(document)
61
+
62
+ # Convert PDF to image
63
+ try:
64
+ images = convert_from_path(tmp_pdf_path)
65
+ if not images:
66
+ raise ValueError("No pages found in PDF")
67
+ page_image = images[0] # Assuming single page for now
68
+ except Exception as e:
69
+ st.error(f"Error converting PDF to image: {str(e)}")
70
+ # Create a placeholder image
71
+ page_image = Image.new('RGB', (800, 600), color='white')
72
+ draw = ImageDraw.Draw(page_image)
73
+ draw.text((400, 300), "PDF conversion failed", fill='black', anchor='mm')
74
+
75
+ # Process image for comparison: standardize size and optimize quality
76
+ processed_image, quality, file_size = ImageUtils.process_image_for_comparison(
77
+ page_image,
78
+ target_size=(1200, 1600), # Standard size for comparison
79
+ max_size_bytes=1024 * 1024 # 1MB limit
80
+ )
81
+
82
+ # Convert processed image to base64 for API
83
+ image_base64 = ImageUtils.image_to_base64_optimized(
84
+ page_image,
85
+ target_size=(1200, 1600),
86
+ max_size_bytes=1024 * 1024
87
+ )
88
+
89
+ # Scan for barcodes
90
+ barcode = Barcode()
91
+ barcode_results = barcode.scan_and_validate(page_image)
92
+
93
+ # Clean up temporary file
94
+ if os.path.exists(tmp_pdf_path):
95
+ os.unlink(tmp_pdf_path)
96
+
97
+ return {
98
+ 'text_content': text_content,
99
+ 'bounding_boxes': bounding_boxes,
100
+ 'image': processed_image, # Use the processed image
101
+ 'original_image': page_image, # Keep original for reference
102
+ 'image_base64': image_base64,
103
+ 'barcode_results': barcode_results,
104
+ 'file_name': file_name,
105
+ 'image_quality': quality,
106
+ 'image_size_bytes': file_size
107
+ }
108
+
109
+ except Exception as e:
110
+ st.error(f"Error processing PDF {file_name}: {str(e)}")
111
+ return None
112
+
113
+ def compare_artworks_with_claude(artwork1_data, artwork2_data, model="claude-sonnet-4-20250514"):
114
+ """Compare two artworks using Claude API"""
115
+
116
+ # Prepare the comparison prompt
117
+ prompt = f"""
118
+ You are an expert packaging compliance analyzer. Compare these two artwork PDFs and provide a detailed analysis of their differences and similarities.
119
+
120
+ ## Artwork 1: {artwork1_data['file_name']}
121
+ **Text Content:**
122
+ {artwork1_data['text_content']}
123
+
124
+ **Bounding Box Data:**
125
+ {json.dumps(artwork1_data['bounding_boxes'][:10], indent=2) if artwork1_data['bounding_boxes'] else "No text elements detected"}
126
+
127
+ **Barcode Data:**
128
+ {json.dumps(artwork1_data['barcode_results'], indent=2) if artwork1_data['barcode_results'] else "No barcodes detected"}
129
+
130
+ ## Artwork 2: {artwork2_data['file_name']}
131
+ **Text Content:**
132
+ {artwork2_data['text_content']}
133
+
134
+ **Bounding Box Data:**
135
+ {json.dumps(artwork2_data['bounding_boxes'][:10], indent=2) if artwork2_data['bounding_boxes'] else "No text elements detected"}
136
+
137
+ **Barcode Data:**
138
+ {json.dumps(artwork2_data['barcode_results'], indent=2) if artwork2_data['barcode_results'] else "No barcodes detected"}
139
+
140
+ Please provide a comprehensive comparison analysis in the following JSON format:
141
+
142
+ {{
143
+ "overall_similarity": 0.85,
144
+ "comparison_summary": "Brief overview of the comparison results",
145
+ "text_differences": [
146
+ {{
147
+ "category": "Missing Text",
148
+ "artwork1_content": "Text found only in artwork 1",
149
+ "artwork2_content": "Text found only in artwork 2",
150
+ "significance": "HIGH/MEDIUM/LOW",
151
+ "description": "Detailed explanation of the difference"
152
+ }}
153
+ ],
154
+ "layout_differences": [
155
+ {{
156
+ "category": "Position Changes",
157
+ "element": "Element that moved",
158
+ "artwork1_position": "Description of position in artwork 1",
159
+ "artwork2_position": "Description of position in artwork 2",
160
+ "significance": "HIGH/MEDIUM/LOW",
161
+ "description": "Impact of this change"
162
+ }}
163
+ ],
164
+ "barcode_differences": [
165
+ {{
166
+ "category": "Barcode Changes",
167
+ "artwork1_barcodes": "Description of barcodes in artwork 1",
168
+ "artwork2_barcodes": "Description of barcodes in artwork 2",
169
+ "significance": "HIGH/MEDIUM/LOW",
170
+ "description": "Analysis of barcode differences"
171
+ }}
172
+ ],
173
+ "visual_differences": [
174
+ {{
175
+ "category": "Visual Elements",
176
+ "description": "Description of visual differences observed in the images",
177
+ "significance": "HIGH/MEDIUM/LOW",
178
+ "recommendation": "Suggested action or consideration"
179
+ }}
180
+ ],
181
+ "compliance_impact": [
182
+ {{
183
+ "area": "Regulatory compliance area affected",
184
+ "impact": "Description of potential compliance impact",
185
+ "risk_level": "HIGH/MEDIUM/LOW",
186
+ "recommendation": "Recommended action"
187
+ }}
188
+ ],
189
+ "recommendations": [
190
+ "List of actionable recommendations based on the comparison"
191
+ ]
192
+ }}
193
+
194
+ Analyze both the textual content and visual elements. Pay special attention to:
195
+ 1. Missing or changed text elements
196
+ 2. Repositioned elements that might affect readability
197
+ 3. Barcode differences that could impact functionality
198
+ 4. Visual changes that might affect brand consistency or compliance
199
+ 5. Any changes that could impact regulatory compliance
200
+
201
+ Provide specific, actionable insights that would be valuable for quality control and compliance verification.
202
+ """
203
+
204
+ try:
205
+ # Initialize Anthropic client
206
+ client = anthropic.Anthropic(api_key=os.getenv('CLAUDE_API_KEY'))
207
+
208
+ # Create message with both images
209
+ message = client.messages.create(
210
+ model=model,
211
+ max_tokens=4000,
212
+ messages=[
213
+ {
214
+ "role": "user",
215
+ "content": [
216
+ {
217
+ "type": "text",
218
+ "text": prompt
219
+ },
220
+ {
221
+ "type": "image",
222
+ "source": {
223
+ "type": "base64",
224
+ "media_type": "image/png",
225
+ "data": artwork1_data['image_base64']
226
+ }
227
+ },
228
+ {
229
+ "type": "image",
230
+ "source": {
231
+ "type": "base64",
232
+ "media_type": "image/png",
233
+ "data": artwork2_data['image_base64']
234
+ }
235
+ }
236
+ ]
237
+ }
238
+ ]
239
+ )
240
+
241
+ # Parse the response
242
+ response_text = ""
243
+ for content_block in message.content:
244
+ if hasattr(content_block, 'type') and content_block.type == 'text':
245
+ response_text += content_block.text
246
+
247
+ # Try to extract JSON from the response
248
+ try:
249
+ # Find JSON in the response
250
+ start_idx = response_text.find('{')
251
+ end_idx = response_text.rfind('}') + 1
252
+
253
+ if start_idx != -1 and end_idx > start_idx:  # rfind returns -1 on failure, so end_idx would be 0
254
+ json_str = response_text[start_idx:end_idx]
255
+ comparison_results = json.loads(json_str)
256
+ else:
257
+ # Fallback: create a basic structure with the raw response
258
+ comparison_results = {
259
+ "overall_similarity": 0.5,
260
+ "comparison_summary": "Analysis completed but JSON parsing failed",
261
+ "raw_response": response_text,
262
+ "text_differences": [],
263
+ "layout_differences": [],
264
+ "barcode_differences": [],
265
+ "visual_differences": [],
266
+ "compliance_impact": [],
267
+ "recommendations": ["Review the raw analysis output for detailed insights"]
268
+ }
269
+ except json.JSONDecodeError:
270
+ # Fallback for JSON parsing errors
271
+ comparison_results = {
272
+ "overall_similarity": 0.5,
273
+ "comparison_summary": "Analysis completed but structured parsing failed",
274
+ "raw_response": response_text,
275
+ "text_differences": [],
276
+ "layout_differences": [],
277
+ "barcode_differences": [],
278
+ "visual_differences": [],
279
+ "compliance_impact": [],
280
+ "recommendations": ["Review the raw analysis output for detailed insights"]
281
+ }
282
+
283
+ return comparison_results
284
+
285
+ except Exception as e:
286
+ st.error(f"Error calling Claude API: {str(e)}")
287
+ return None
288
+
289
+ def display_comparison_results(results, artwork1_data, artwork2_data):
290
+ """Display the comparison results in a structured format"""
291
+
292
+ if not results:
293
+ st.error("No comparison results to display")
294
+ return
295
+
296
+ # Overall Summary
297
+ st.markdown("## 📊 Comparison Summary")
298
+
299
+ col1, col2, col3 = st.columns(3)
300
+ with col1:
301
+ similarity = results.get('overall_similarity', 0.5)
302
+ st.metric("Overall Similarity", f"{similarity:.1%}")
303
+
304
+ with col2:
305
+ total_differences = (
306
+ len(results.get('text_differences', [])) +
307
+ len(results.get('layout_differences', [])) +
308
+ len(results.get('barcode_differences', [])) +
309
+ len(results.get('visual_differences', []))
310
+ )
311
+ st.metric("Total Differences", total_differences)
312
+
313
+ with col3:
314
+ compliance_impacts = len(results.get('compliance_impact', []))
315
+ st.metric("Compliance Impacts", compliance_impacts)
316
+
317
+ # Summary description
318
+ if 'comparison_summary' in results:
319
+ st.markdown(f"**Summary:** {results['comparison_summary']}")
320
+
321
+ # Create tabs for different types of differences
322
+ tabs = st.tabs(["📝 Text Differences", "📐 Layout Changes", "📱 Barcode Changes", "🎨 Visual Differences", "⚖️ Compliance Impact", "💡 Recommendations"])
323
+
324
+ with tabs[0]: # Text Differences
325
+ st.markdown("### Text Content Differences")
326
+ text_diffs = results.get('text_differences', [])
327
+ if text_diffs:
328
+ for i, diff in enumerate(text_diffs):
329
+ significance_color = {"HIGH": "🔴", "MEDIUM": "🟡", "LOW": "🟢"}.get(diff.get('significance', 'MEDIUM'), "🟡")
330
+
331
+ with st.expander(f"{significance_color} {diff.get('category', 'Text Difference')} - {diff.get('significance', 'MEDIUM')} Impact"):
332
+ col1, col2 = st.columns(2)
333
+ with col1:
334
+ st.markdown(f"**{artwork1_data['file_name']}:**")
335
+ st.text(diff.get('artwork1_content', 'N/A'))
336
+ with col2:
337
+ st.markdown(f"**{artwork2_data['file_name']}:**")
338
+ st.text(diff.get('artwork2_content', 'N/A'))
339
+
340
+ st.markdown(f"**Description:** {diff.get('description', 'No description available')}")
341
+ else:
342
+ st.info("No significant text differences found")
343
+
344
+ with tabs[1]: # Layout Changes
345
+ st.markdown("### Layout and Positioning Changes")
346
+ layout_diffs = results.get('layout_differences', [])
347
+ if layout_diffs:
348
+ for diff in layout_diffs:
349
+ significance_color = {"HIGH": "🔴", "MEDIUM": "🟡", "LOW": "🟢"}.get(diff.get('significance', 'MEDIUM'), "🟡")
350
+
351
+ with st.expander(f"{significance_color} {diff.get('category', 'Layout Change')} - {diff.get('significance', 'MEDIUM')} Impact"):
352
+ st.markdown(f"**Element:** {diff.get('element', 'Unknown element')}")
353
+
354
+ col1, col2 = st.columns(2)
355
+ with col1:
356
+ st.markdown(f"**Position in {artwork1_data['file_name']}:**")
357
+ st.text(diff.get('artwork1_position', 'N/A'))
358
+ with col2:
359
+ st.markdown(f"**Position in {artwork2_data['file_name']}:**")
360
+ st.text(diff.get('artwork2_position', 'N/A'))
361
+
362
+ st.markdown(f"**Impact:** {diff.get('description', 'No description available')}")
363
+ else:
364
+ st.info("No significant layout differences found")
365
+
366
+ with tabs[2]: # Barcode Changes
367
+ st.markdown("### Barcode Differences")
368
+ barcode_diffs = results.get('barcode_differences', [])
369
+ if barcode_diffs:
370
+ for diff in barcode_diffs:
371
+ significance_color = {"HIGH": "🔴", "MEDIUM": "🟡", "LOW": "🟢"}.get(diff.get('significance', 'MEDIUM'), "🟡")
372
+
373
+ with st.expander(f"{significance_color} {diff.get('category', 'Barcode Change')} - {diff.get('significance', 'MEDIUM')} Impact"):
374
+ col1, col2 = st.columns(2)
375
+ with col1:
376
+ st.markdown(f"**{artwork1_data['file_name']} Barcodes:**")
377
+ st.text(diff.get('artwork1_barcodes', 'N/A'))
378
+ with col2:
379
+ st.markdown(f"**{artwork2_data['file_name']} Barcodes:**")
380
+ st.text(diff.get('artwork2_barcodes', 'N/A'))
381
+
382
+ st.markdown(f"**Analysis:** {diff.get('description', 'No description available')}")
383
+ else:
384
+ st.info("No significant barcode differences found")
385
+
386
+ with tabs[3]: # Visual Differences
387
+ st.markdown("### Visual and Design Differences")
388
+ visual_diffs = results.get('visual_differences', [])
389
+ if visual_diffs:
390
+ for diff in visual_diffs:
391
+ significance_color = {"HIGH": "🔴", "MEDIUM": "🟡", "LOW": "🟢"}.get(diff.get('significance', 'MEDIUM'), "🟡")
392
+
393
+ with st.expander(f"{significance_color} {diff.get('category', 'Visual Change')} - {diff.get('significance', 'MEDIUM')} Impact"):
394
+ st.markdown(f"**Description:** {diff.get('description', 'No description available')}")
395
+ if 'recommendation' in diff:
396
+ st.markdown(f"**Recommendation:** {diff['recommendation']}")
397
+ else:
398
+ st.info("No significant visual differences found")
399
+
400
+ with tabs[4]: # Compliance Impact
401
+ st.markdown("### Compliance and Regulatory Impact")
402
+ compliance_impacts = results.get('compliance_impact', [])
403
+ if compliance_impacts:
404
+ for impact in compliance_impacts:
405
+ risk_color = {"HIGH": "🔴", "MEDIUM": "🟡", "LOW": "🟢"}.get(impact.get('risk_level', 'MEDIUM'), "🟡")
406
+
407
+ with st.expander(f"{risk_color} {impact.get('area', 'Compliance Area')} - {impact.get('risk_level', 'MEDIUM')} Risk"):
408
+ st.markdown(f"**Impact:** {impact.get('impact', 'No description available')}")
409
+ st.markdown(f"**Recommendation:** {impact.get('recommendation', 'No recommendation provided')}")
410
+ else:
411
+ st.success("No compliance impacts identified")
412
+
413
+ with tabs[5]: # Recommendations
414
+ st.markdown("### Action Items and Recommendations")
415
+ recommendations = results.get('recommendations', [])
416
+ if recommendations:
417
+ for i, rec in enumerate(recommendations, 1):
418
+ st.markdown(f"{i}. {rec}")
419
+ else:
420
+ st.info("No specific recommendations provided")
421
+
422
+ # Raw response section (collapsible)
423
+ if 'raw_response' in results:
424
+ with st.expander("🔍 Raw Analysis Output"):
425
+ st.text(results['raw_response'])
426
+
427
+ def display_side_by_side_images(artwork1_data, artwork2_data):
428
+ """Display the two artwork images side by side"""
429
+ st.markdown("## 🖼️ Side-by-Side Comparison")
430
+
431
+ col1, col2 = st.columns(2)
432
+
433
+ with col1:
434
+ st.markdown(f"### {artwork1_data['file_name']}")
435
+ st.image(ImageUtils.crop_image(artwork1_data['image']), caption=artwork1_data['file_name'], use_container_width=True)
436
+
437
+ # Display image processing info
438
+ if 'image_quality' in artwork1_data and 'image_size_bytes' in artwork1_data:
439
+ quality = artwork1_data['image_quality']
440
+ size_mb = artwork1_data['image_size_bytes'] / (1024 * 1024)
441
+ st.info(f"📊 Image Quality: {quality}% | Size: {size_mb:.2f}MB")
442
+
443
+ # Display extracted data summary
444
+ with st.expander("📊 Extracted Data Summary"):
445
+ text_elements = len(artwork1_data['bounding_boxes']) if artwork1_data['bounding_boxes'] else 0
446
+ barcodes = len(artwork1_data['barcode_results']) if artwork1_data['barcode_results'] else 0
447
+ st.metric("Text Elements", text_elements)
448
+ st.metric("Barcodes", barcodes)
449
+
450
+ with col2:
451
+ st.markdown(f"### {artwork2_data['file_name']}")
452
+ st.image(ImageUtils.crop_image(artwork2_data['image']), caption=artwork2_data['file_name'], use_container_width=True)
453
+
454
+ # Display image processing info
455
+ if 'image_quality' in artwork2_data and 'image_size_bytes' in artwork2_data:
456
+ quality = artwork2_data['image_quality']
457
+ size_mb = artwork2_data['image_size_bytes'] / (1024 * 1024)
458
+ st.info(f"📊 Image Quality: {quality}% | Size: {size_mb:.2f}MB")
459
+
460
+ # Display extracted data summary
461
+ with st.expander("📊 Extracted Data Summary"):
462
+ text_elements = len(artwork2_data['bounding_boxes']) if artwork2_data['bounding_boxes'] else 0
463
+ barcodes = len(artwork2_data['barcode_results']) if artwork2_data['barcode_results'] else 0
464
+ st.metric("Text Elements", text_elements)
465
+ st.metric("Barcodes", barcodes)
466
+
467
+ def main():
468
+ st.set_page_config(layout="wide", page_title="Artwork Comparison Tool")
469
+
470
+ # Load client artwork files
471
+ client_artwork_files = load_client_artwork_files()
472
+
473
+ # Initialize session state
474
+ if "artwork1_data" not in st.session_state:
475
+ st.session_state.artwork1_data = None
476
+ if "artwork2_data" not in st.session_state:
477
+ st.session_state.artwork2_data = None
478
+ if "comparison_results" not in st.session_state:
479
+ st.session_state.comparison_results = None
480
+
481
+ st.title("🎨 Artwork Comparison Tool")
482
+ st.write("Compare two packaging artwork PDFs to identify differences in text, layout, barcodes, and visual elements.")
483
+
484
+ # File selection section
485
+ st.markdown("## 📁 Select Artworks to Compare")
486
+
487
+ col1, col2 = st.columns(2)
488
+
489
+ with col1:
490
+ st.markdown("### 🎨 Artwork 1")
491
+
492
+ # Create tabs for client files vs upload
493
+ art1_tab1, art1_tab2 = st.tabs(["📁 Client Files", "📤 Upload New"])
494
+
495
+ with art1_tab1:
496
+ if client_artwork_files:
497
+ art1_options = ["Select artwork 1..."] + [f["name"] for f in client_artwork_files]
498
+ selected_art1_file = st.selectbox("Choose artwork 1:", art1_options, key="art1_select")
499
+
500
+ if selected_art1_file != "Select artwork 1...":
501
+ # Find and load the selected file
502
+ for file_info in client_artwork_files:
503
+ if file_info["name"] == selected_art1_file:
504
+ file_content = load_artwork_content(file_info)
505
+ if file_content:
506
+ import io
507
+ temp_file = io.BytesIO(file_content)
508
+ temp_file.name = file_info["name"]
509
+
510
+ # Extract data from the artwork
511
+ with st.spinner("Processing artwork 1..."):
512
+ st.session_state.artwork1_data = extract_pdf_data(temp_file, file_info["name"])
513
+
514
+ if st.session_state.artwork1_data:
515
+ st.success(f"✅ Loaded artwork 1: {selected_art1_file}")
516
+ break
517
+ else:
518
+ st.info("No client artwork files found")
519
+
520
+ with art1_tab2:
521
+ artwork1_file = st.file_uploader("Upload Artwork 1 (PDF)", type=["pdf"], key="art1_upload")
522
+
523
+ if artwork1_file:
524
+ with st.spinner("Processing artwork 1..."):
525
+ st.session_state.artwork1_data = extract_pdf_data(artwork1_file, artwork1_file.name)
526
+
527
+ if st.session_state.artwork1_data:
528
+ st.success(f"✅ Uploaded artwork 1: {artwork1_file.name}")
529
+
530
+ with col2:
531
+ st.markdown("### 🎨 Artwork 2")
532
+
533
+ # Create tabs for client files vs upload
534
+ art2_tab1, art2_tab2 = st.tabs(["📁 Client Files", "📤 Upload New"])
535
+
536
+ with art2_tab1:
537
+ if client_artwork_files:
538
+ art2_options = ["Select artwork 2..."] + [f["name"] for f in client_artwork_files]
539
+ selected_art2_file = st.selectbox("Choose artwork 2:", art2_options, key="art2_select")
540
+
541
+ if selected_art2_file != "Select artwork 2...":
542
+ # Find and load the selected file
543
+ for file_info in client_artwork_files:
544
+ if file_info["name"] == selected_art2_file:
545
+ file_content = load_artwork_content(file_info)
546
+ if file_content:
547
+ import io
548
+ temp_file = io.BytesIO(file_content)
549
+ temp_file.name = file_info["name"]
550
+
551
+ # Extract data from the artwork
552
+ with st.spinner("Processing artwork 2..."):
553
+ st.session_state.artwork2_data = extract_pdf_data(temp_file, file_info["name"])
554
+
555
+ if st.session_state.artwork2_data:
556
+ st.success(f"✅ Loaded artwork 2: {selected_art2_file}")
557
+ break
558
        else:
            st.info("No client artwork files found")

    with art2_tab2:
        artwork2_file = st.file_uploader("Upload Artwork 2 (PDF)", type=["pdf"], key="art2_upload")

        if artwork2_file:
            with st.spinner("Processing artwork 2..."):
                st.session_state.artwork2_data = extract_pdf_data(artwork2_file, artwork2_file.name)

            if st.session_state.artwork2_data:
                st.success(f"✅ Uploaded artwork 2: {artwork2_file.name}")

    # Display images side by side if both are loaded
    if st.session_state.artwork1_data and st.session_state.artwork2_data:
        display_side_by_side_images(st.session_state.artwork1_data, st.session_state.artwork2_data)

    # Model selection
    model_option = "claude-sonnet-4-20250514"

    # Comparison button
    if st.button("🔍 Compare Artworks", type="primary"):
        if st.session_state.artwork1_data and st.session_state.artwork2_data:
            with st.spinner("Analyzing artworks with Claude..."):
                st.session_state.comparison_results = compare_artworks_with_claude(
                    st.session_state.artwork1_data,
                    st.session_state.artwork2_data,
                    model=model_option
                )

            if st.session_state.comparison_results:
                st.success("✅ Comparison analysis complete!")
            else:
                st.error("❌ Comparison analysis failed")
        else:
            st.warning("⚠️ Please select or upload both artworks before comparing")

    # Display comparison results
    if st.session_state.comparison_results:
        display_comparison_results(
            st.session_state.comparison_results,
            st.session_state.artwork1_data,
            st.session_state.artwork2_data
        )

    # Add helpful information
    st.markdown("---")
    st.markdown("""
    ### 🛠️ How It Works
    1. **Extract Content**: The tool extracts text, bounding boxes, images, and barcodes from both PDFs
    2. **AI Analysis**: Claude analyzes the extracted data and visual elements to identify differences
    3. **Structured Results**: Differences are categorized by type (text, layout, barcode, visual) and significance
    4. **Compliance Assessment**: Potential compliance impacts are identified with risk levels and recommendations

    ### 🎯 Use Cases
    - **Quality Control**: Verify artwork changes between versions
    - **Brand Consistency**: Ensure visual elements remain consistent
    - **Compliance Review**: Identify changes that might affect regulatory compliance
    - **Change Documentation**: Track and document artwork modifications
    """)

if __name__ == "__main__":
    main()
main.py ADDED
@@ -0,0 +1,938 @@
import streamlit as st
import tempfile
import os
import pandas as pd
from src.extract_text.ingest import RequirementsIngest
from src.extract_text.google_document_api import GoogleDocumentAPI
from src.extract_text.extract_meta_data import PDFArtworkMetadataExtractor
from src.core.analysis import ComplianceAnalysis
from pdf2image import convert_from_path
from PIL import Image, ImageDraw, ImageFont
from src.utils.image_utils import ImageUtils
import base64
from io import BytesIO
from src.utils.barcode import Barcode
import glob

def load_client_requirements_files():
    """Load all requirements and packaging files from the client-requirements directory."""
    base_path = "requirements_library/client-requirements"
    requirements_files = []
    packaging_files = []

    if not os.path.exists(base_path):
        return requirements_files, packaging_files

    # Walk through all subdirectories
    for root, dirs, files in os.walk(base_path):
        for file in files:
            file_path = os.path.join(root, file)
            relative_path = os.path.relpath(file_path, base_path)

            if file.lower().endswith('.txt') and 'requirement' in file.lower():
                requirements_files.append({
                    'name': relative_path,
                    'path': file_path,
                    'type': 'requirements'
                })
            elif file.lower().endswith('.pdf') and 'requirement' in file.lower():
                requirements_files.append({
                    'name': relative_path,
                    'path': file_path,
                    'type': 'requirements'
                })
            elif file.lower().endswith('.pdf'):
                packaging_files.append({
                    'name': relative_path,
                    'path': file_path,
                    'type': 'packaging'
                })

    return requirements_files, packaging_files
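The branch order in the walk above matters: a PDF whose name contains "requirement" must be caught before the generic PDF case, or it would be misfiled as packaging. The rule can be isolated for testing; a minimal sketch (the `classify_file` helper is hypothetical, not part of the uploaded app):

```python
from typing import Optional

def classify_file(filename: str) -> Optional[str]:
    """Mirror the walk's branch order: 'requirement' names win over generic PDFs."""
    name = filename.lower()
    if name.endswith(('.txt', '.pdf')) and 'requirement' in name:
        return 'requirements'
    if name.endswith('.pdf'):
        return 'packaging'
    return None  # anything else is ignored by the loader

print(classify_file("Acme_Requirements.pdf"))  # → requirements
print(classify_file("box_art_v2.pdf"))         # → packaging
```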

def load_file_content(file_info):
    """Load content from a file based on its type."""
    try:
        if file_info['type'] == 'requirements':
            # For requirements files, read as text
            with open(file_info['path'], 'r', encoding='utf-8') as f:
                return f.read()
        else:
            # For packaging files, return bytes
            with open(file_info['path'], 'rb') as f:
                return f.read()
    except Exception as e:
        st.error(f"Error loading file {file_info['name']}: {str(e)}")
        return None

def load_requirements_content(file_info):
    """Load requirements content as a string."""
    try:
        with open(file_info['path'], 'r', encoding='utf-8') as f:
            return f.read()
    except Exception as e:
        st.error(f"Error loading requirements file {file_info['name']}: {str(e)}")
        return None

def load_packaging_content(file_info):
    """Load packaging content as bytes."""
    try:
        with open(file_info['path'], 'rb') as f:
            return f.read()
    except Exception as e:
        st.error(f"Error loading packaging file {file_info['name']}: {str(e)}")
        return None

def main():
    st.set_page_config(layout="wide", page_title="Packaging Compliance Checker")

    # Load client requirements files
    client_requirements_files, client_packaging_files = load_client_requirements_files()

    # Initialize session state variables
    if "requirements_text" not in st.session_state:
        st.session_state.requirements_text = None
    if "analysis_results" not in st.session_state:
        st.session_state.analysis_results = None
    if "current_requirements_file" not in st.session_state:
        st.session_state.current_requirements_file = None
    if "uploaded_packaging_files" not in st.session_state:
        st.session_state.uploaded_packaging_files = []
    if "selected_packaging_file" not in st.session_state:
        st.session_state.selected_packaging_file = None
    if "client_requirements_files" not in st.session_state:
        st.session_state.client_requirements_files = client_requirements_files
    if "client_packaging_files" not in st.session_state:
        st.session_state.client_packaging_files = client_packaging_files

    st.title("Packaging Compliance Checker")
    st.write(
        "Upload a requirements document (plain text) that specifies requirements, "
        "and then upload one or more packaging PDFs to check for compliance."
    )

    # Create two columns for the layout
    col1, col2 = st.columns([1, 1])

    with col1:
        # Stylish upload section with custom CSS
        st.markdown("""
        <style>
        .upload-section {
            background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
            padding: 20px;
            border-radius: 15px;
            color: white;
            margin-bottom: 20px;
        }
        .upload-title {
            font-size: 24px;
            font-weight: bold;
            margin-bottom: 15px;
            text-align: center;
        }
        .upload-description {
            font-size: 14px;
            opacity: 0.9;
            margin-bottom: 20px;
            text-align: center;
        }
        .file-uploader {
            background: rgba(255, 255, 255, 0.1);
            border: 2px dashed rgba(255, 255, 255, 0.3);
            border-radius: 10px;
            padding: 15px;
            margin-bottom: 15px;
        }
        .requirements-display {
            background: rgba(255, 255, 255, 0.05);
            border-radius: 10px;
            padding: 15px;
            margin-top: 15px;
        }
        .artwork-display {
            background: rgba(255, 255, 255, 0.05);
            border-radius: 10px;
            padding: 15px;
            margin-top: 15px;
        }
        .image-container {
            max-width: 100%;
            border-radius: 8px;
            overflow: hidden;
            box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);
        }
        </style>
        """, unsafe_allow_html=True)

        # Upload section container
        st.markdown('<div class="upload-section">', unsafe_allow_html=True)
        st.markdown('<div class="upload-title">📄 Document Upload</div>', unsafe_allow_html=True)
        st.markdown('<div class="upload-description">Upload your requirements and packaging documents for compliance analysis</div>', unsafe_allow_html=True)

        # Requirements file selection
        st.markdown('<div class="file-uploader">', unsafe_allow_html=True)
        st.markdown("**📋 Requirements Document**")

        # Create tabs for client files vs upload
        req_tab1, req_tab2 = st.tabs(["📁 Client Files", "📤 Upload New"])

        with req_tab1:
            if st.session_state.client_requirements_files:
                req_options = ["Select a requirements file..."] + [f["name"] for f in st.session_state.client_requirements_files]
                selected_req_file = st.selectbox("Choose from client files:", req_options)

                if selected_req_file != "Select a requirements file...":
                    # Find the selected file
                    selected_file_info = None
                    for file_info in st.session_state.client_requirements_files:
                        if file_info["name"] == selected_req_file:
                            selected_file_info = file_info
                            break

                    if selected_file_info:
                        import io
                        temp_file = None

                        # Load and process the requirements file
                        if selected_file_info["name"].lower().endswith('.pdf'):
                            # Handle PDF file - load as bytes
                            requirements_content = load_packaging_content(selected_file_info)
                            if requirements_content:
                                temp_file = io.BytesIO(requirements_content)
                                temp_file.name = selected_file_info["name"]
                        else:
                            # Handle text file - load as text
                            requirements_content = load_requirements_content(selected_file_info)
                            if requirements_content:
                                temp_file = io.StringIO(requirements_content)
                                temp_file.name = selected_file_info["name"]

                        # Only ingest when the file actually loaded; a failed load
                        # would otherwise leave temp_file undefined
                        if temp_file is not None:
                            st.session_state.requirements_text = RequirementsIngest().ingest_requirements_document(temp_file)
                            st.session_state.current_requirements_file = temp_file
                            st.session_state.analysis_results = None  # Clear previous results

                        # Display file type information
                        if isinstance(st.session_state.requirements_text, dict):
                            file_type = st.session_state.requirements_text.get('type', 'unknown')
                            if file_type == 'pdf':
                                st.success(f"✅ Loaded PDF requirements from: {selected_req_file}")
                                st.info("📄 PDF will be processed natively by Claude for full visual analysis")
                            else:
                                st.success(f"✅ Loaded requirements from: {selected_req_file}")
                        else:
                            st.success(f"✅ Loaded requirements from: {selected_req_file}")
            else:
                st.info("No client requirements files found")

        with req_tab2:
            requirements_file = st.file_uploader("Upload Requirements Document (TXT or PDF)", type=["txt", "pdf"])

            # Only process requirements if a new file is uploaded
            if requirements_file and requirements_file != st.session_state.current_requirements_file:
                st.session_state.requirements_text = RequirementsIngest().ingest_requirements_document(requirements_file)
                st.session_state.current_requirements_file = requirements_file
                st.session_state.analysis_results = None  # Clear previous results

                # Display file type information
                if isinstance(st.session_state.requirements_text, dict):
                    file_type = st.session_state.requirements_text.get('type', 'unknown')
                    file_size = st.session_state.requirements_text.get('file_size', 0)
                    if file_type == 'pdf':
                        st.success(f"✅ Uploaded PDF requirements: {requirements_file.name} ({file_size:,} bytes)")
                        st.info("📄 PDF will be processed natively by Claude for full visual analysis")
                    else:
                        st.success(f"✅ Uploaded requirements: {requirements_file.name}")
                else:
                    st.success(f"✅ Uploaded requirements: {requirements_file.name}")

        st.markdown('</div>', unsafe_allow_html=True)

        # Packaging files selection
        st.markdown('<div class="file-uploader">', unsafe_allow_html=True)
        st.markdown("**📦 Packaging PDFs**")

        # Create tabs for client files vs upload
        pkg_tab1, pkg_tab2 = st.tabs(["📁 Client Files", "📤 Upload New"])

        with pkg_tab1:
            if st.session_state.client_packaging_files:
                pkg_options = ["Select packaging files..."] + [f["name"] for f in st.session_state.client_packaging_files]
                selected_pkg_files = st.multiselect("Choose from client files:", pkg_options[1:])  # Skip the placeholder

                if selected_pkg_files:
                    # Convert selected client files to file-like objects
                    import io
                    client_file_objects = []
                    for selected_file_name in selected_pkg_files:
                        # Find the selected file
                        for file_info in st.session_state.client_packaging_files:
                            if file_info["name"] == selected_file_name:
                                # Create a file-like object
                                file_content = load_packaging_content(file_info)
                                if file_content:
                                    temp_file = io.BytesIO(file_content)
                                    temp_file.name = file_info["name"]
                                    client_file_objects.append(temp_file)
                                break

                    st.session_state.uploaded_packaging_files = client_file_objects
                    # Set the first file as selected if none is selected
                    if not st.session_state.selected_packaging_file and client_file_objects:
                        st.session_state.selected_packaging_file = client_file_objects[0]
                    st.success(f"✅ Loaded {len(client_file_objects)} packaging files from client directory")
            else:
                st.info("No client packaging files found")

        with pkg_tab2:
            packaging_files = st.file_uploader("Upload Packaging PDFs", type=["pdf"], accept_multiple_files=True)

            # Update uploaded files list when new files are uploaded
            if packaging_files:
                st.session_state.uploaded_packaging_files = packaging_files
                # Set the first file as selected if none is selected
                if not st.session_state.selected_packaging_file and packaging_files:
                    st.session_state.selected_packaging_file = packaging_files[0]
                st.success(f"✅ Uploaded {len(packaging_files)} packaging files")
            else:
                # Only clear if no files are selected from client directory either
                if not st.session_state.uploaded_packaging_files:
                    st.session_state.selected_packaging_file = None

        st.markdown('</div>', unsafe_allow_html=True)

        # File selector for multiple packaging files
        if st.session_state.uploaded_packaging_files:
            st.markdown('<div class="file-uploader">', unsafe_allow_html=True)
            file_names = [f.name for f in st.session_state.uploaded_packaging_files]
            # Guard the index lookup: the previously selected file may no longer be in the list
            selected_file_name = st.selectbox(
                "Select packaging file to display:",
                file_names,
                index=file_names.index(st.session_state.selected_packaging_file.name)
                if st.session_state.selected_packaging_file and st.session_state.selected_packaging_file.name in file_names
                else 0
            )

            # Update selected file
            for file in st.session_state.uploaded_packaging_files:
                if file.name == selected_file_name:
                    st.session_state.selected_packaging_file = file
                    break
            st.markdown('</div>', unsafe_allow_html=True)

        st.markdown('</div>', unsafe_allow_html=True)

        # Requirements display section
        if st.session_state.requirements_text:
            st.markdown('<div class="requirements-display">', unsafe_allow_html=True)
            with st.expander("📋 Requirements Document", expanded=True):
                if isinstance(st.session_state.requirements_text, dict):
                    # PDF requirements
                    file_type = st.session_state.requirements_text.get('type', 'unknown')
                    filename = st.session_state.requirements_text.get('filename', 'Unknown')
                    file_size = st.session_state.requirements_text.get('file_size', 0)

                    st.markdown(f"**File Type:** {file_type.upper()}")
                    st.markdown(f"**Filename:** {filename}")
                    st.markdown(f"**File Size:** {file_size:,} bytes")

                    if file_type == 'pdf':
                        st.info("📄 This PDF will be processed natively by Claude for full visual analysis including charts, graphs, and visual layouts.")
                        st.markdown("**Preview Text:**")
                        st.text_area("Requirements Text", st.session_state.requirements_text.get('text_content', ''), height=200)
                    else:
                        st.text_area("Requirements Text", st.session_state.requirements_text.get('text_content', ''), height=200)
                else:
                    # Text requirements (backward compatibility)
                    st.text_area("Requirements Text", st.session_state.requirements_text, height=200)
            st.markdown('</div>', unsafe_allow_html=True)

        # Artwork display section
        if st.session_state.selected_packaging_file:
            st.markdown('<div class="artwork-display">', unsafe_allow_html=True)
            with st.expander("🎨 Package Artwork", expanded=True):
                try:
                    # Reset file pointer to beginning
                    st.session_state.selected_packaging_file.seek(0)

                    # Create a temporary file to process the PDF
                    with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp_file:
                        tmp_file.write(st.session_state.selected_packaging_file.read())
                        tmp_pdf_path = tmp_file.name

                    # Convert PDF to image
                    try:
                        images = convert_from_path(tmp_pdf_path)
                        if not images:
                            raise ValueError("No pages found in PDF")
                        page_image = images[0]  # Assuming single page for now
                    except Exception as e:
                        st.error(f"Error converting PDF to image: {str(e)}")
                        # Create a placeholder image
                        page_image = Image.new('RGB', (800, 600), color='white')
                        draw = ImageDraw.Draw(page_image)
                        draw.text((400, 300), "PDF conversion failed", fill='black', anchor='mm')

                    # Display the image with proportional sizing
                    st.markdown('<div class="image-container">', unsafe_allow_html=True)
                    st.image(page_image, caption=f"Package: {st.session_state.selected_packaging_file.name}", use_container_width=True)
                    st.markdown('</div>', unsafe_allow_html=True)

                    # Clean up temporary file
                    if os.path.exists(tmp_pdf_path):
                        os.unlink(tmp_pdf_path)

                except Exception as e:
                    st.error(f"Error displaying package artwork: {str(e)}")
            st.markdown('</div>', unsafe_allow_html=True)

    with col2:
        # Compliance guidelines section
        st.markdown("""
        <style>
        .compliance-section {
            background: linear-gradient(135deg, #f093fb 0%, #f5576c 100%);
            padding: 20px;
            border-radius: 15px;
            color: white;
            height: 100%;
        }
        .compliance-title {
            font-size: 24px;
            font-weight: bold;
            margin-bottom: 15px;
            text-align: center;
        }
        .compliance-content {
            background: rgba(255, 255, 255, 0.1);
            border-radius: 10px;
            padding: 15px;
            margin-top: 15px;
        }
        .status-compliant {
            background: rgba(76, 175, 80, 0.2);
            border-left: 4px solid #4CAF50;
            padding: 10px;
            margin: 10px 0;
            border-radius: 5px;
        }
        .status-partial {
            background: rgba(255, 193, 7, 0.2);
            border-left: 4px solid #FFC107;
            padding: 10px;
            margin: 10px 0;
            border-radius: 5px;
        }
        .status-non-compliant {
            background: rgba(244, 67, 54, 0.2);
            border-left: 4px solid #F44336;
            padding: 10px;
            margin: 10px 0;
            border-radius: 5px;
        }
        </style>
        """, unsafe_allow_html=True)

        st.markdown('<div class="compliance-section">', unsafe_allow_html=True)
        st.markdown('<div class="compliance-title">📋 Compliance Guidelines</div>', unsafe_allow_html=True)

        # Read and display the compliance outline
        try:
            with open("requirements_library/compliance_outline.txt", "r") as f:
                outline_content = f.read()

            st.markdown('<div class="compliance-content">', unsafe_allow_html=True)

            # Parse and format the content for better display
            lines = outline_content.strip().split('\n')
            current_section = ""

            for line in lines:
                line = line.strip()
                if not line:
                    continue

                if line == "Compliance Outline":
                    st.markdown("**📋 Compliance Outline**")
                elif line == "Compliant":
                    st.markdown('<div class="status-compliant">', unsafe_allow_html=True)
                    st.markdown("🟢 **Compliant**")
                    current_section = "compliant"
                elif line == "Partially Compliant":
                    st.markdown('</div>', unsafe_allow_html=True)
                    st.markdown('<div class="status-partial">', unsafe_allow_html=True)
                    st.markdown("🟡 **Partially Compliant**")
                    current_section = "partial"
                elif line == "Non-Compliant":
                    st.markdown('</div>', unsafe_allow_html=True)
                    st.markdown('<div class="status-non-compliant">', unsafe_allow_html=True)
                    st.markdown("🔴 **Non-Compliant**")
                    current_section = "non_compliant"
                elif line.startswith("> "):
                    # Description line
                    description = line[2:]  # Remove "> "
                    st.markdown(f"*{description}*")
                elif line == "Example Criteria:":
                    st.markdown("**Example Criteria:**")
                elif line.startswith("- "):
                    # Criteria item
                    criteria = line[2:]  # Remove "- "
                    st.markdown(f"• {criteria}")
                elif line and not line.startswith("Example Criteria:"):
                    # Any other content
                    st.markdown(line)

            # Close the last status div
            if current_section:
                st.markdown('</div>', unsafe_allow_html=True)

            st.markdown('</div>', unsafe_allow_html=True)

        except FileNotFoundError:
            st.error("Compliance outline file not found")
        except Exception as e:
            st.error(f"Error reading compliance outline: {e}")

        st.markdown('</div>', unsafe_allow_html=True)

    # Model selection
    # model_option = st.selectbox(
    #     "Select Claude Model",
    #     ["claude-sonnet-4-20250514", "claude-3-5-haiku-20241022"]
    # )
    model_option = "claude-sonnet-4-20250514"

    # Analysis button
    if st.button("Analyze Compliance"):
        if st.session_state.requirements_text and st.session_state.uploaded_packaging_files:
            for packaging_file in st.session_state.uploaded_packaging_files:
                st.markdown(f"## Analyzing: {packaging_file.name}")

                # Create a progress bar
                progress_bar = st.progress(0)
                status_text = st.empty()

                # Save the uploaded PDF temporarily.
                # Reset file pointer to beginning
                packaging_file.seek(0)
                with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp_file:
                    tmp_file.write(packaging_file.read())
                    tmp_pdf_path = tmp_file.name

                try:
                    # Ingest the packaging document.
                    status_text.text("Extracting text from packaging PDF...")
                    google_document_api = GoogleDocumentAPI(credentials_path="src/extract_text/photon-services-f0d3ec1417d0.json")
                    document = google_document_api.process_document(tmp_pdf_path)
                    packaging_text = google_document_api.extract_text_with_markdown_table(document)
                    packaging_data = google_document_api.extract_text_with_bounding_boxes(document)
                    progress_bar.progress(25)

                    # Process image once and store it efficiently
                    status_text.text("Processing packaging image...")
                    try:
                        images = convert_from_path(tmp_pdf_path)
                        if not images:
                            raise ValueError("No pages found in PDF")
                        page_image = images[0]  # Assuming single page for now
                    except Exception as e:
                        st.error(f"Error converting PDF to image: {str(e)}")
                        # Create a placeholder image
                        page_image = Image.new('RGB', (800, 600), color='white')
                        draw = ImageDraw.Draw(page_image)
                        draw.text((400, 300), "PDF conversion failed", fill='black', anchor='mm')

                    # Convert to base64 once for analysis
                    buffer = BytesIO()
                    page_image.save(buffer, format='PNG')
                    image_base64 = base64.b64encode(buffer.getvalue()).decode('utf-8')
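The encoding step above turns the rendered page into a JSON-safe ASCII string for the model call. The round trip can be checked without rendering a real PDF; a sketch using stand-in bytes in place of a saved page:

```python
import base64
from io import BytesIO

# Fake payload standing in for page_image.save(buffer, format='PNG');
# only the bytes -> base64 -> bytes round trip is demonstrated here.
buffer = BytesIO()
buffer.write(b"\x89PNG\r\n\x1a\nfake-page-bytes")
image_base64 = base64.b64encode(buffer.getvalue()).decode('utf-8')

# The encoded form is plain ASCII and decodes back to the exact original bytes.
assert image_base64.isascii()
assert base64.b64decode(image_base64) == buffer.getvalue()
```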

                    # Scan for barcodes
                    status_text.text("Scanning for barcodes...")
                    barcode = Barcode()
                    barcode_results = barcode.scan_and_validate(page_image)

                    progress_bar.progress(40)

                    # Extract metadata from the PDF
                    status_text.text("Extracting metadata from packaging...")
                    metadata_extractor = PDFArtworkMetadataExtractor()
                    metadata_results = metadata_extractor.extract_metadata(tmp_pdf_path)

                    # Convert tuple keys to strings for JSON serialization
                    if metadata_results and not metadata_results.get('error'):
                        if 'text_colors' in metadata_results:
                            # Convert color tuples to string representation
                            text_colors_str = {}
                            for color_tuple, count in metadata_results['text_colors'].items():
                                if isinstance(color_tuple, tuple):
                                    color_str = f"RGB{color_tuple}"
                                else:
                                    color_str = str(color_tuple)
                                text_colors_str[color_str] = count
                            metadata_results['text_colors'] = text_colors_str
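The key conversion above exists because JSON object keys must be strings, while the metadata extractor reports colors as RGB tuples. The same transformation in isolation (hypothetical `stringify_color_keys` helper; sample counts are made up):

```python
import json

def stringify_color_keys(text_colors):
    """Turn tuple color keys into 'RGB(r, g, b)' labels so json.dumps accepts them."""
    converted = {}
    for color, count in text_colors.items():
        key = f"RGB{color}" if isinstance(color, tuple) else str(color)
        converted[key] = count
    return converted

colors = stringify_color_keys({(0, 0, 0): 120, (255, 255, 255): 45})
json.dumps(colors)  # with tuple keys this call would raise TypeError
print(colors)  # → {'RGB(0, 0, 0)': 120, 'RGB(255, 255, 255)': 45}
```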

                    progress_bar.progress(50)

                    # Call the enhanced analyze_compliance method with the raw text documents and metadata
                    status_text.text("Analyzing requirements and compliance...")
                    st.session_state.analysis_results = ComplianceAnalysis().analyze_compliance(
                        st.session_state.requirements_text,
                        packaging_text,
                        packaging_data,
                        image_base64,
                        barcode_results,
                        metadata_results,
                        model=model_option
                    )
                    progress_bar.progress(100)
                    status_text.text("Analysis complete!")

                    # Display the structured results
                    st.markdown("### Extracted Requirements")
                    if "requirements" in st.session_state.analysis_results:
                        req_df = pd.DataFrame(st.session_state.analysis_results["requirements"])
                        st.dataframe(req_df)

                    st.markdown("### Verification Results")
                    if "verifications" in st.session_state.analysis_results:
                        # Create tabs for different views of the results
                        tabs = st.tabs(["Summary", "Detailed Results"])

                        with tabs[0]:
                            # Count compliance statuses
                            if "verifications" in st.session_state.analysis_results:
                                statuses = [v.get("compliance_status", "UNKNOWN") for v in st.session_state.analysis_results["verifications"]]
                                compliant = statuses.count("COMPLIANT")
                                non_compliant = statuses.count("NON-COMPLIANT")
                                partial = statuses.count("PARTIALLY COMPLIANT")
                                error = len(statuses) - compliant - non_compliant - partial
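The tally treats anything outside the three known labels as an error, including a missing status, since the dict lookup falls back to "UNKNOWN". A standalone sketch of the same arithmetic (the `tally_statuses` helper and sample data are illustrative, not part of the app):

```python
def tally_statuses(verifications):
    """Reproduce the Summary tab's counts: three known labels, remainder -> errors."""
    statuses = [v.get("compliance_status", "UNKNOWN") for v in verifications]
    compliant = statuses.count("COMPLIANT")
    non_compliant = statuses.count("NON-COMPLIANT")
    partial = statuses.count("PARTIALLY COMPLIANT")
    error = len(statuses) - compliant - non_compliant - partial
    return compliant, non_compliant, partial, error

sample = [
    {"compliance_status": "COMPLIANT"},
    {"compliance_status": "PARTIALLY COMPLIANT"},
    {},  # no status at all -> "UNKNOWN" -> error bucket
]
print(tally_statuses(sample))  # → (1, 0, 1, 1)
```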

                                # Create columns for status counts
                                col1, col2, col3, col4 = st.columns(4)
                                col1.metric("Compliant", compliant)
                                col2.metric("Non-Compliant", non_compliant)
                                col3.metric("Partially Compliant", partial)
                                col4.metric("Errors", error)

                            # Display the overall compliance report
                            if "compliance_report" in st.session_state.analysis_results:
                                st.markdown(st.session_state.analysis_results["compliance_report"])

                        with tabs[1]:
                            st.markdown("### Barcode Scanning Results")
                            if "barcode_data" in st.session_state.analysis_results and st.session_state.analysis_results["barcode_data"]:
                                barcode_df = pd.DataFrame(st.session_state.analysis_results["barcode_data"])
                                st.dataframe(barcode_df)

                                # Display barcode summary
                                valid_barcodes = sum(1 for barcode in st.session_state.analysis_results["barcode_data"] if barcode["valid"])
                                total_barcodes = len(st.session_state.analysis_results["barcode_data"])
                                st.markdown(f"**Barcode Summary:** {valid_barcodes}/{total_barcodes} valid barcodes found")
                            else:
                                st.info("No barcodes found in the packaging")

                            # Display metadata results
                            st.markdown("### Typography and Design Metadata")
                            if "metadata" in st.session_state.analysis_results and st.session_state.analysis_results["metadata"]:
                                metadata = st.session_state.analysis_results["metadata"]

                                if metadata.get('error'):
                                    st.error(f"Metadata extraction error: {metadata['error']}")
                                else:
                                    # Display metadata summary
                                    col1, col2 = st.columns(2)

                                    with col1:
                                        st.markdown("**Extraction Info:**")
                                        st.write(f"**Method:** {metadata.get('extraction_method', 'Unknown')}")
                                        st.write(f"**Selectable Text:** {'Yes' if metadata.get('has_selectable_text') else 'No'}")
                                        st.write(f"**Pages Processed:** {metadata.get('pages_processed', 0)}")

                                    with col2:
                                        st.markdown("**Dominant Elements:**")
                                        if metadata.get('fonts'):
                                            dominant_font = max(metadata['fonts'].items(), key=lambda x: x[1])[0]
                                            st.write(f"**Font:** {dominant_font}")
                                        if metadata.get('font_sizes'):
                                            dominant_size = max(metadata['font_sizes'].items(), key=lambda x: x[1])[0]
                                            st.write(f"**Font Size:** {dominant_size:.1f}pt")
                                        if metadata.get('text_colors'):
                                            dominant_color = max(metadata['text_colors'].items(), key=lambda x: x[1])[0]
                                            st.write(f"**Text Color:** {dominant_color}")

                                    # Display detailed metadata in expandable sections
                                    with st.expander("📊 Detailed Font Analysis"):
                                        if metadata.get('fonts'):
                                            font_df = pd.DataFrame([
                                                {'Font': font, 'Character Count': count}
                                                for font, count in list(metadata['fonts'].items())[:10]  # Top 10
                                            ])
                                            st.dataframe(font_df)
                                        else:
                                            st.info("No font data available")

                                    with st.expander("📏 Font Size Distribution"):
                                        if metadata.get('font_sizes'):
                                            size_df = pd.DataFrame([
                                                {'Font Size (pt)': f"{size:.1f}", 'Character Count': count}
                                                for size, count in list(metadata['font_sizes'].items())[:10]  # Top 10
                                            ])
                                            st.dataframe(size_df)
                                        else:
                                            st.info("No font size data available")

                                    with st.expander("🎨 Text Color Analysis"):
                                        if metadata.get('text_colors'):
                                            color_df = pd.DataFrame([
                                                {'Color (RGB)': str(color), 'Character Count': count}
                                                for color, count in list(metadata['text_colors'].items())[:10]  # Top 10
                                            ])
                                            st.dataframe(color_df)
                                        else:
                                            st.info("No color data available")
                            else:
                                st.info("No metadata available")

                            # Show detailed verification results
                            for i, verification in enumerate(st.session_state.analysis_results["verifications"]):
                                req_id = verification.get("requirement_id", f"REQ{i+1}")
                                text_id = verification.get("Text ID", "Unknown")
                                status = verification.get("compliance_status", "UNKNOWN")

                                # Color-code status
                                if status == "COMPLIANT":
                                    status_color = "green"
                                elif status == "NON-COMPLIANT":
                                    status_color = "red"
                                elif status == "PARTIALLY COMPLIANT":
                                    status_color = "orange"
                                else:
                                    status_color = "gray"

                                with st.expander(f"{req_id}: {status}", expanded=status != "COMPLIANT"):
                                    # Show confidence score if available
                                    if "confidence" in verification:
                                        st.progress(verification["confidence"])

                                    # Show reasoning
                                    if "reasoning" in verification:
                                        st.markdown(f"**Reasoning:** {verification['reasoning']}")

                                    # Show criteria if available
                                    if "criteria" in verification and verification["criteria"]:
                                        st.markdown("**Criteria:**")
                                        for criterion in verification["criteria"]:
                                            st.markdown(f"- {criterion}")

                                    # Show evidence if available
                                    if "evidence_found" in verification and verification["evidence_found"]:
                                        st.markdown("**Evidence Found:**")

                                        # Separate text, visual, and barcode evidence
                                        text_evidence = []
                                        visual_evidence = []
                                        barcode_evidence = []

                                        for evidence in verification["evidence_found"]:
                                            if "text_id" in evidence and evidence["text_id"] is not None:
                                                text_evidence.append(evidence)
                                            elif "barcode_id" in evidence and evidence["barcode_id"] is not None:
                                                barcode_evidence.append(evidence)
                                            else:
                                                visual_evidence.append(evidence)

                                        # Display text evidence
                                        if text_evidence:
                                            st.markdown("**Text Evidence:**")
                                            for evidence in text_evidence:
                                                text_id = evidence.get("text_id", "Unknown")
                                                evidence_text = evidence.get("evidence_text", "No description")
                                                st.markdown(f"- **Text ID {text_id}:** {evidence_text}")

                                        # Display barcode evidence
                                        if barcode_evidence:
                                            st.markdown("**Barcode Evidence:**")
                                            for evidence in barcode_evidence:
                                                barcode_id = evidence.get("barcode_id", "Unknown")
                                                evidence_text = evidence.get("evidence_text", "No description")
                                                st.markdown(f"- **Barcode ID {barcode_id}:** {evidence_text}")

                                        # Display visual evidence
                                        if visual_evidence:
                                            st.markdown("**Visual Evidence (from image analysis):**")
                                            for idx, evidence in enumerate(visual_evidence, 1):  # idx avoids shadowing the outer loop's i
                                                evidence_text = evidence.get("evidence_text", "Visual element referenced by Claude")
                                                st.markdown(f"- **Visual {idx}:** {evidence_text}")

                                        # Show summary
                                        total_evidence = len(verification["evidence_found"])
                                        st.markdown(f"*Total evidence: {total_evidence} ({len(text_evidence)} text, {len(barcode_evidence)} barcode, {len(visual_evidence)} visual)*")
772
+
773
+ # Individual visualization for this requirement
774
+ if "evidence_found" in verification and verification["evidence_found"]:
775
+ st.markdown(f"### Evidence Visualization for {req_id}")
776
+
777
+ # Create a copy of the image for drawing
778
+ try:
779
+ draw_image = page_image.copy()
780
+ draw = ImageDraw.Draw(draw_image)
781
+ img_width, img_height = draw_image.size
782
+
783
+ # Define colors for different compliance statuses
784
+ status_colors = {
785
+ "COMPLIANT": "green",
786
+ "NON-COMPLIANT": "red",
787
+ "PARTIALLY COMPLIANT": "orange",
788
+ "ERROR": "purple",
789
+ "UNKNOWN": "gray"
790
+ }
791
+
792
+ # Get color for this requirement's status
793
+ color = status_colors.get(status, "gray")
794
+
795
+ # Add a legend for this requirement
796
+ st.markdown(f"**Status:** <span style='color:{color}'>■</span> {status}", unsafe_allow_html=True)
797
+
798
+ # Track evidence types
799
+ text_evidence_count = 0
800
+ visual_evidence_count = 0
801
+ barcode_evidence_count = 0
802
+
803
+ # Draw evidence boxes for this specific requirement
804
+ if "packaging_data" in st.session_state.analysis_results:
805
+ for evidence in verification["evidence_found"]:
806
+ if "text_id" in evidence and evidence["text_id"] is not None:
807
+ # Handle text-based evidence with bounding boxes
808
+ text_id = evidence["text_id"]
809
+ try:
810
+ # Check if text_id is numeric for bounding box lookup
811
+ if isinstance(text_id, (int, float)) or (isinstance(text_id, str) and text_id.isdigit()):
812
+ # Text ID is 1-based, list is 0-based
813
+ numeric_id = int(text_id)
814
+ item = st.session_state.analysis_results["packaging_data"][numeric_id - 1]
815
+ box = item["bounding_box"]
816
+
817
+ # Denormalize vertices
818
+ points = [(v['x'] * img_width, v['y'] * img_height) for v in box]
819
+
820
+ # Draw polygon
821
+ draw.polygon(points, outline=color, width=3)
822
+
823
+ # Add a label with evidence number
824
+ text_evidence_count += 1
825
+ label = f"Text Evidence {text_evidence_count}"
826
+ draw.text(points[0], label, fill="white", stroke_width=2, stroke_fill="black")
827
+ else:
828
+ # Handle non-numeric text IDs (like barcode references)
829
+ text_evidence_count += 1
830
+ st.info(f"Text Evidence {text_evidence_count}: {evidence.get('evidence_text', 'Text element referenced by Claude')} (ID: {text_id})")
831
+
832
+ except (IndexError, KeyError) as e:
833
+ st.warning(f"Could not find bounding box for Text ID {text_id}: {e}")
834
+ elif "barcode_id" in evidence and evidence["barcode_id"] is not None:
835
+ # Handle barcode-based evidence with bounding boxes
836
+ barcode_id = evidence["barcode_id"]
837
+ try:
838
+ # Find the barcode in barcode_data
839
+ barcode_found = None
840
+ for barcode in st.session_state.analysis_results.get("barcode_data", []):
841
+ if barcode["id"] == barcode_id:
842
+ barcode_found = barcode
843
+ break
844
+
845
+ if barcode_found:
846
+ pos = barcode_found["position"]
847
+ x, y = pos["x"], pos["y"]
848
+ w, h = pos["width"], pos["height"]
849
+
850
+ # Draw rectangle for barcode
851
+ draw.rectangle([x, y, x + w, y + h], outline=color, width=3)
852
+
853
+ # Add a label with evidence number
854
+ barcode_evidence_count += 1
855
+ label = f"Barcode Evidence {barcode_evidence_count}"
856
+ draw.text((x, y - 20), label, fill="white", stroke_width=2, stroke_fill="black")
857
+
858
+ # Add barcode info
859
+ barcode_info = f"{barcode_found['type']}: {barcode_found['data']}"
860
+ draw.text((x, y - 40), barcode_info, fill="white", stroke_width=2, stroke_fill="black")
861
+ else:
862
+ st.warning(f"Could not find barcode data for Barcode ID {barcode_id}")
863
+
864
+ except Exception as e:
865
+ st.warning(f"Could not draw barcode bounding box for Barcode ID {barcode_id}: {e}")
866
+ else:
867
+ # Handle visual-only evidence (no text_id or barcode_id)
868
+ visual_evidence_count += 1
869
+ st.info(f"Visual Evidence {visual_evidence_count}: {evidence.get('evidence_text', 'Visual element referenced by Claude')}")
870
+
871
+ # Show the image if we have any evidence
872
+ if text_evidence_count > 0 or visual_evidence_count > 0 or barcode_evidence_count > 0:
873
+ # Add evidence count summary
874
+ evidence_summary = []
875
+ if text_evidence_count > 0:
876
+ evidence_summary.append(f"{text_evidence_count} text")
877
+ if barcode_evidence_count > 0:
878
+ evidence_summary.append(f"{barcode_evidence_count} barcode")
879
+ if visual_evidence_count > 0:
880
+ evidence_summary.append(f"{visual_evidence_count} visual")
881
+
882
+ st.markdown(f"**Evidence Count:** {', '.join(evidence_summary)}")
883
+
884
+ st.image(ImageUtils.crop_image(draw_image), caption=f"Evidence for {req_id} - {status}", use_container_width=True)
885
+ else:
886
+ st.info(f"No visual evidence found for {req_id}")
887
+ else:
888
+ # Handle case where no packaging data is available but we have evidence
889
+ evidence_counts = {
890
+ 'text': len([e for e in verification["evidence_found"] if "text_id" in e and e["text_id"] is not None]),
891
+ 'barcode': len([e for e in verification["evidence_found"] if "barcode_id" in e and e["barcode_id"] is not None]),
892
+ 'visual': len([e for e in verification["evidence_found"] if ("text_id" not in e or e["text_id"] is None) and ("barcode_id" not in e or e["barcode_id"] is None)])
893
+ }
894
+
895
+ total_evidence = sum(evidence_counts.values())
896
+ if total_evidence > 0:
897
+ evidence_summary = []
898
+ if evidence_counts['text'] > 0:
899
+ evidence_summary.append(f"{evidence_counts['text']} text")
900
+ if evidence_counts['barcode'] > 0:
901
+ evidence_summary.append(f"{evidence_counts['barcode']} barcode")
902
+ if evidence_counts['visual'] > 0:
903
+ evidence_summary.append(f"{evidence_counts['visual']} visual")
904
+
905
+ st.info(f"Evidence Count: {', '.join(evidence_summary)} (no bounding box data available)")
906
+ # Show the original image without annotations
907
+ st.image(ImageUtils.crop_image(page_image), caption=f"Original image for {req_id} - {status}", use_container_width=True)
908
+ else:
909
+ st.info("No packaging data available for visualization")
910
+
911
+ except Exception as e:
912
+ st.error(f"Failed to generate visualization for {req_id}: {e}")
913
+ else:
914
+ st.info(f"No evidence found for {req_id}")
915
+
916
+ except Exception as e:
917
+ st.error(f"Error analyzing {packaging_file.name}: {str(e)}")
918
+
919
+ finally:
920
+ # Clean up the temporary file
921
+ if os.path.exists(tmp_pdf_path):
922
+ os.unlink(tmp_pdf_path)
923
+ else:
924
+ st.warning("Please upload a requirements document and at least one packaging PDF.")
925
+
926
+ # Add some helpful information at the bottom
927
+ st.markdown("---")
928
+ st.markdown("""
929
+ ### How It Works
930
+ 1. **Upload Requirements**: The system extracts structured requirements from your document
931
+ 2. **Upload Packaging**: We extract text from PDFs and analyze them against requirements
932
+ 3. **Analysis**: Each requirement is verified using structured reasoning and semantic matching
933
+ """)
934
+
935
+ if __name__ == "__main__":
936
+ # Import pandas here to avoid issues with st.set_page_config
937
+ import pandas as pd
938
+ main()
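
The evidence-drawing loop above scales Document AI's normalized bounding-box vertices by the rendered image's pixel size before drawing. A minimal sketch of that denormalization step, pulled out as a standalone helper (the function name is mine, not from the app):

```python
# Hypothetical helper mirroring the "Denormalize vertices" step in main.py:
# Document AI vertices are normalized to [0, 1], so each one is scaled by the
# rendered page image's pixel dimensions before being drawn as a polygon.
def denormalize_vertices(box, img_width, img_height):
    """Scale normalized {'x': ..., 'y': ...} vertices to pixel coordinates."""
    return [(v['x'] * img_width, v['y'] * img_height) for v in box]

box = [{'x': 0.1, 'y': 0.2}, {'x': 0.5, 'y': 0.2}, {'x': 0.5, 'y': 0.4}]
print(denormalize_vertices(box, 1000, 500))
# → [(100.0, 100.0), (500.0, 100.0), (500.0, 200.0)]
```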
quick_test.py ADDED
@@ -0,0 +1,63 @@
+ #!/usr/bin/env python3
+ """
+ Quick test script for Google Document AI markdown table output.
+ """
+
+ import os
+ import sys
+ from pathlib import Path
+
+ # Add the src directory to the path
+ sys.path.append(str(Path(__file__).parent / "src"))
+
+ from extract_text.google_document_api import GoogleDocumentAPI
+
+ def quick_test():
+     """Quick test of the markdown table generation."""
+
+     credentials_path = "src/extract_text/photon-services-f0d3ec1417d0.json"
+     test_pdf_path = "requirements_library/client-requirements/Kir-Kat/kitkat-f1.pdf"
+
+     if not os.path.exists(credentials_path):
+         print(f"❌ Credentials file not found: {credentials_path}")
+         return
+
+     if not os.path.exists(test_pdf_path):
+         print(f"❌ Test PDF file not found: {test_pdf_path}")
+         return
+
+     try:
+         print("🔍 Quick test of Google Document AI...")
+
+         # Initialize and process
+         doc_api = GoogleDocumentAPI(credentials_path)
+         document = doc_api.process_document(test_pdf_path)
+
+         # Get text blocks with height
+         text_blocks = doc_api.extract_text_with_bounding_boxes(document)
+         print(f"📊 Found {len(text_blocks)} text blocks")
+
+         # Show first few blocks with height
+         print("\n📏 First 5 text blocks with height:")
+         print("-" * 60)
+         for i, block in enumerate(text_blocks[:5]):
+             print(f"Block {i+1}: Height={block['height']:.2f}mm | Text: {block['text'][:50]}...")
+
+         # Generate and display markdown table
+         print("\n📋 Markdown Table Output:")
+         print("=" * 80)
+         markdown_table = doc_api.extract_text_with_markdown_table(document)
+         print(markdown_table)
+
+         # Save to file
+         with open("quick_test_results.md", "w", encoding="utf-8") as f:
+             f.write("# Quick Test Results\n\n")
+             f.write(markdown_table)
+
+         print("\n✅ Results saved to: quick_test_results.md")
+
+     except Exception as e:
+         print(f"❌ Error: {str(e)}")
+
+ if __name__ == "__main__":
+     quick_test()
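
quick_test.py leans on `extract_text_with_markdown_table` from the project's `GoogleDocumentAPI` wrapper. As a rough illustration only — the column set and formatting here are assumptions, not the wrapper's actual output — the one-row-per-block table it prints can be assembled like this:

```python
# Hypothetical sketch of a markdown-table builder over extracted text blocks;
# blocks_to_markdown and its column layout are illustrative, not the real API.
def blocks_to_markdown(blocks):
    """Assemble a markdown table with one row per extracted text block."""
    lines = ["| Text ID | Height (mm) | Text |", "|---------|-------------|------|"]
    for i, block in enumerate(blocks, 1):
        lines.append(f"| {i} | {block['height']:.2f} | {block['text']} |")
    return "\n".join(lines)

print(blocks_to_markdown([{'height': 48.69, 'text': 'Version 1'}]))
```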
quick_test_results.md ADDED
@@ -0,0 +1,17 @@
+ # Quick Test Results
+
+ | Text ID | X | Y | Height (mm) | Style | Text |\n|----|-----|-----|--------|-------|-------------------------------------------------------------------------|\n| 1 | 35 | 935 | 48.69 | N/A | Version 1 |\n| 2 | 113 | 920 | 23.28 | N/A | TRIEZ-MOI |\n| 3 | 101 | 935 | 35.99 | N/A | PAPIER, |\n| 4 | 95 | 936 | 19.05 | N/A | JE SUIS EN |\n| 5 | 96 | 859 | 15.52 | N/A | EXTERIEUR |\n| 6 | 126 | 889 | 38.81 | N/A | ™MPAPER |\n| 7 | 312 | 280 | 2.82 | N/A | Nestlé S.A. |\n| 8 | 331 | 249 | 3.18 | N/A | La Bonne Portion |\n| 9 | 344 | 262 | 5.64 | N/A | 1 |\n| 10 | 333 | 278 | 2.47 | N/A | 1 BARRE = 1 PORTION |\n| 11 | 285 | 330 | 5.29 | N/A | 7613035 365896> |\n| 12 | 366 | 249 | 3.53 | N/A | Le Bon Geste de Tri / Verantwoord |\n| 13 | 367 | 254 | 3.53 | N/A | weggooien / Verantwortungsvoll entsorgen |\n| 14 | 394 | 275 | 3.53 | N/A | ÉLÉMENTS |\n| 15 | 390 | 285 | 2.82 | N/A | D'EMBALLAGE |\n| 16 | 351 | 299 | 4.59 | N/A | Predut par:/Geproduceerd door: Nestlé Deutschland AG, 60523 Frankfurt am |\n| 17 | 380 | 313 | 4.59 | N/A | EMBALLAGE
+ EXTERIEUR |\n| 18 | 401 | 309 | 6.7 | N/A | MIKKEL
+ INDIVIDUELS |\n| 19 | 445 | 247 | 3.18 | N/A | Plus d'Informations/Let's talk Informations nutritionnelles Voedingswaarde informatie Nährwertinformationen |\n| 20 | 445 | 252 | 3.53 | N/A | FR 806 800 363 |\n| 21 | 449 | 257 | 3.53 | N/A | (service |\n| 22 | 458 | 257 | 3.53 | N/A | gratuit +pakappel) |\n| 23 | 445 | 262 | 6.35 | N/A | NL 0205699699
+ BE(+32) 02 529 5525 |\n| 24 | 445 | 272 | 2.82 | N/A | www.kitkat.fr/www.kitkat.nl |\n| 25 | 446 | 278 | 10.94 | N/A | consommer de...
+ préférence avant fin:/
+ en minste houdbaar tot
+ einde: / Mindestens |\n| 26 | 445 | 296 | 3.88 | N/A | haltbar bis Ende |\n| 27 | 549 | 253 | 3.88 | N/A | \\|Parportion\\| Pro Portion \\|\\|%.AR" par partion%RM pro |\n| 28 | 518 | 258 | 3.88 | N/A | Pour 100g\\|Pro 100g\\|Per 100g |\n| 29 | 557 | 258 | 3.88 | N/A | Per portie 41,5 |\n| 30 | 529 | 265 | 3.18 | N/A | 2135 kJ/510 |\n| 31 | 554 | 265 | 3.18 | N/A | 886 kJ/212 kcal |\n| 32 | 598 | 265 | 2.82 | N/A | 115 |\n| 33 | 542 | 271 | 3.53 | N/A | 25,1 |\n| 34 | 565 | 271 | 3.18 | N/A | 10,8g |\n| 35 | 598 | 271 | 3.18 | N/A | 15% |\n| 36 | 567 | 282 | 3.18 | N/A | 6,1g |\n| 37 | 541 | 289 | 3.18 | N/A | 60,6g |\n| 38 | 565 | 289 | 2.82 | N/A | 25,2 g |\n| 39 | 492 | 293 | 3.88 | N/A | - dont sizes/-aren salas(-devon Zackar 43,49\\|\\| |\n| 40 | 565 | 294 | 2.82 | N/A | 18,0 g |\n| 41 | 598 | 295 | 2.47 | N/A | 20% |\n| 42 | 491 | 299 | 3.53 | N/A | Flores alimentabes/Venis/Bolestatale |\n| 43 | 542 | 299 | 3.18 | N/A | 25g |\n| 44 | 566 | 300 | 3.18 | N/A | 1,0g |\n| 45 | 543 | 306 | 3.18 | N/A | 6,69 |\n| 46 | 566 | 306 | 2.82 | N/A | 2,7g |\n| 47 | 542 | 312 | 2.82 | N/A | 005 |\n| 48 | 565 | 312 | 3.18 | N/A | 0,06 g |\n| 49 | 445 | 305 | 3.88 | N/A | de alfafilichting / Siebe unter der Lanche Preis/Bitten/Ed |\n| 50 | 422 | 311 | 3.53 | N/A | Conservation/Bewaaradvies, |\n| 51 | 423 | 315 | 4.59 | N/A | Lagerungshinweise |\n| 52 | 502 | 312 | 2.82 | N/A | Salz |\n| 53 | 422 | 319 | 5.29 | N/A | A corrà Paberi de la la, de draft Apport de diffrence pour adulte-type (8400 kJ/2000 kcal). Catge confitations |\n| 54 | 598 | 312 | 2.47 | N/A | 15 |\n| 55 | 489 | 329 | 3.53 | N/A | E400 2000. |\n| 56 | 526 | 329 | 3.53 | N/A | Diese Packing enthält 6 Portionen. Portaben sollen für Kinder entreched |\n| 57 | 490 | 339 | 4.23 | N/A | Verpakking bevet 6 peries. Parties diesen te worden aangepast in de leeftijd van Modern. 
|\n| 58 | 329 | 507 | 4.59 | N/A | W.I |\n| 59 | 334 | 524 | 5.29 | N/A | ON-ZEINL |\n| 60 | 357 | 548 | 3.53 | N/A | PING G |\n| 61 | 337 | 361 | 12.0 | N/A | x9 |\n| 62 | 390 | 361 | 4.94 | N/A | Ни и чела ил |\n| 63 | 473 | 365 | 3.88 | N/A | D |\n| 64 | 472 | 364 | 8.47 | N/A | в элоура элон |\n| 65 | 520 | 408 | 2.82 | N/A | = ਤ ਖ ਖਰਾਬ |\n| 66 | 524 | 419 | 5.29 | N/A | T |\n| 67 | 558 | 428 | 4.59 | N/A | FREE-PORN |\n| 68 | 337 | 603 | 10.58 | N/A | x9 |\n| 69 | 386 | 604 | 5.64 | N/A | Уни и ела ил |\n| 70 | 473 | 607 | 3.88 | N/A | D |\n| 71 | 562 | 544 | 18.35 | N/A | x9 |\n| 72 | 473 | 606 | 8.11 | N/A | в элоура в элон |\n| 73 | 286 | 652 | 14.46 | N/A | breaks |\n| 74 | 295 | 663 | 12.0 | N/A | FOR GOOD |\n| 75 | 293 | 679 | 3.88 | N/A | "Unbreak plus engage |\n| 76 | 285 | 694 | 3.18 | N/A | Ingrédients/Ingrediënten / Zutaten |\n| 77 | 347 | 652 | 5.64 | N/A | JE SUIS EN |\n| 78 | 348 | 656 | 7.76 | N/A | PAPIER |\n| 79 | 351 | 664 | 3.18 | N/A | I'M PAPER |\n| 80 | 337 | 673 | 3.53 | N/A | "De sachet est désormais comes maart |\n| 81 | 337 | 679 | 3.18 | N/A | de papier, vous pouvez le mettra dans votre bac |\n| 82 | 337 | 684 | 3.18 | N/A | de til sera recyclé dans la filière du papier. |\n| 83 | 386 | 643 | 3.53 | N/A | ET POUR LES EMBALLAGES |\n| 84 | 386 | 648 | 9.17 | N/A | INDIVIDUELS, ILS RESTENT EN
+ PLASTIQUE POUR GARANTIR LA
+ QUALITÉ ET LE GOÛT DE NOS |\n| 85 | 386 | 664 | 3.88 | N/A | BARRES KITKAT NOUS UTILISONS |\n| 86 | 386 | 669 | 2.47 | N/A | DU PLASTIQUE RECYCLE" |\n| 87 | 386 | 674 | 2.82 | N/A | AND FOR |\n| 88 | 398 | 674 | 2.82 | N/A | THE |\n| 89 | 403 | 675 | 2.82 | N/A | WRAPPERS |\n| 90 | 417 | 675 | 3.18 | N/A | INSIDE. HE USE |\n| 91 | 386 | 679 | 2.47 | N/A | RECYCLED PLASTIC TO ENSURE THE QUALITY |\n| 92 | 387 | 683 | 2.82 | N/A | AND TASTINESS OF OUR KITKAT BARS. |\n| 93 | 478 | 676 | 4.94 | N/A | Cocod |\n| 94 | 497 | 648 | 2.47 | N/A | Nestlé acheteune |\n| 95 | 497 | 654 | 2.82 | N/A | quantité de caca |\n| 96 | 497 | 658 | 2.82 | N/A | certifiée Rainforest |\n| 97 | 506 | 664 | 2.47 | N/A | quiv |\n| 98 | 497 | 668 | 3.18 | N/A | à celle cessaire |\n| 99 | 497 | 673 | 3.53 | N/A | pour ce produi |\n| 100 | 522 | 648 | 2.47 | N/A | DÉCOUVREZ-EN PLUS |\n| 101 | 285 | 700 | 21.87 | N/A | Gaufrette croustillante enrobée de chocolat au lait (67%). Le chocolat au lait contient des matières grasses végétales en plus du beurre de cacao. Ingrédients: fra de BLE LAIT écrémén poudre,
+ pile de cacao, moes grass vedtales (entre, fritt), berre de cacae', PETIT-LAIT tiré en poudre, shop de glucose, male om de LAIT anhydre, cacao malgra, émulant (cithins), poudre lever
+ (carbonates de sodium). Bilan massique certiléninforest Allance. www.Krokante wafel omhuld met melkchocolade (67%). De melkchocolade bevat naast cacaoboter ook andere plantaardige vetten.
+ wel, ungern Walkpoeder, cacaoman die wetten (palm, kartié), cacaoboter",\ beeder, glucosestroop, walenbret, spere cacao, emigr
+ Rainforest Allance Lesseer op aar Knusperwaffel in Milchschokolade (67% Milchschokolade enthält neben Kakaobutter auch andere
+ flanzliche Fette. Zutaten: Zucker, WEREMMEN, MAGENT CHPULVER, Kakaomasse planche Fette (Palm, Shee), Kakaobutter, MOLIENERZEUGNIS, Glukosesirup, BUTTERBEINFETT, fettarmer Kaka',
+ Emulgator Lecithine, Backtriebmittel Natriumcarbonnie Rainforest Allance-zertiert. Wehr erfahren untersa.org |\n| 102 | 583 | 645 | 2.47 | N/A | MIXTE |\n| 103 | 557 | 660 | 4.59 | N/A | FSC FSC C149053 |\n| 104 | 570 | 670 | 2.12 | N/A | Вотермия |\n| 105 | 551 | 679 | 3.53 | N/A | *** Kika❤achète du plastique recyclé pour |\n| 106 | 553 | 685 | 3.18 | N/A | couvrir la quantité nécessaire à la production |\n| 107 | 570 | 690 | 3.53 | N/A | KitKat qui partent cette |\n| 108 | 552 | 687 | 8.11 | N/A | desembe |\n| 109 | 552 | 695 | 4.23 | N/A | mention |\n| 110 | 562 | 697 | 3.18 | N/A | en |\n| 111 | 552 | 702 | 2.82 | N/A | Ce volume est susceptible d'être utilisé dans |\n| 112 | 551 | 708 | 2.47 | N/A | dautres |\n| 113 | 561 | 708 | 2.47 | N/A | emballages. Pour en savoir plus |\n| 114 | 552 | 713 | 2.47 | N/A | www.kitkat.fr |\n| 115 | 541 | 722 | 6.35 | N/A | 6x41,5 g = 249 ge |\n
requirements.txt ADDED
@@ -0,0 +1,14 @@
+ google-cloud-documentai
+ streamlit
+ pandas
+ anthropic
+ requests>=2.31.0
+ urllib3>=2.0.7
+ Pillow>=8.3.0
+ pdf2image
+ numpy>=1.21.0
+ opencv-python
+ barcodenumber
+ zxing-cpp
+ PyMuPDF>=1.23.0
+ PyPDF2>=3.0.0
test_google_doc_ai.py ADDED
@@ -0,0 +1,183 @@
+ #!/usr/bin/env python3
+ """
+ Test script for Google Document AI functionality.
+ This script demonstrates the text extraction with bounding boxes and height calculation.
+ """
+
+ import os
+ import sys
+ from pathlib import Path
+
+ # Add the src directory to the path so we can import our modules
+ sys.path.append(str(Path(__file__).parent / "src"))
+
+ from extract_text.google_document_api import GoogleDocumentAPI
+
+ def test_google_doc_ai():
+     """Test the Google Document AI functionality with a sample PDF."""
+
+     # Path to the credentials file
+     credentials_path = "src/extract_text/photon-services-f0d3ec1417d0.json"
+
+     # Path to a test PDF file
+     test_pdf_path = "requirements_library/client-requirements/Kir-Kat/kitkat-f1.pdf"
+
+     # Check if files exist
+     if not os.path.exists(credentials_path):
+         print(f"❌ Credentials file not found: {credentials_path}")
+         print("Please ensure the Google Cloud credentials file is in the correct location.")
+         return
+
+     if not os.path.exists(test_pdf_path):
+         print(f"❌ Test PDF file not found: {test_pdf_path}")
+         print("Please ensure the test PDF file exists.")
+         return
+
+     print("🔍 Testing Google Document AI functionality...")
+     print(f"📄 Using PDF: {test_pdf_path}")
+     print(f"🔑 Using credentials: {credentials_path}")
+     print("-" * 80)
+
+     try:
+         # Initialize the Google Document API
+         print("1. Initializing Google Document API...")
+         doc_api = GoogleDocumentAPI(credentials_path)
+         print("✅ Google Document API initialized successfully")
+
+         # Process the document
+         print("\n2. Processing document...")
+         document = doc_api.process_document(test_pdf_path)
+         print("✅ Document processed successfully")
+
+         # Get basic text
+         print("\n3. Extracting basic text...")
+         basic_text = doc_api.get_document_text(document, page_number=0)
+         print(f"📝 Basic text length: {len(basic_text)} characters")
+         print(f"📝 First 200 characters: {basic_text[:200]}...")
+
+         # Extract text with bounding boxes and height
+         print("\n4. Extracting text with bounding boxes and height...")
+         text_blocks = doc_api.extract_text_with_bounding_boxes(document)
+         print(f"📊 Found {len(text_blocks)} text blocks")
+
+         # Display sample text blocks
+         print("\n5. Sample text blocks with height information:")
+         print("-" * 80)
+         for i, block in enumerate(text_blocks[:10]):  # Show first 10 blocks
+             print(f"Block {i+1}:")
+             print(f"  Page: {block['page_number']}")
+             print(f"  Height: {block['height']:.2f} mm")
+             print(f"  Style: {block['style']}")
+             print(f"  Text: {block['text'][:100]}{'...' if len(block['text']) > 100 else ''}")
+             print(f"  Bounding Box: {block['bounding_box']}")
+             print()
+
+         # Generate markdown table
+         print("\n6. Generating markdown table...")
+         markdown_table = doc_api.extract_text_with_markdown_table(document)
+         print("📋 Markdown table generated successfully")
+
+         # Test the new extract_text_heights_mm function
+         print("\n7. Testing extract_text_heights_mm function...")
+         heights_mm = doc_api.extract_text_heights_mm(document)
+         print(f"📏 Found {len(heights_mm)} lines with height in mm")
+
+         # Display sample heights
+         print("\n📏 Sample line heights (mm):")
+         print("-" * 60)
+         for i, (page_num, line_text, height_mm) in enumerate(heights_mm[:10]):
+             print(f"Line {i+1}: Page {page_num}, Height={height_mm}mm | Text: {line_text[:50]}...")
+
+         # Save results to files
+         print("\n8. Saving results to files...")
+
+         # Save raw text blocks
+         with open("test_results_text_blocks.txt", "w", encoding="utf-8") as f:
+             f.write("Text Blocks with Height Information:\n")
+             f.write("=" * 50 + "\n\n")
+             for i, block in enumerate(text_blocks):
+                 f.write(f"Block {i+1}:\n")
+                 f.write(f"  Page: {block['page_number']}\n")
+                 f.write(f"  Height: {block['height']:.2f} mm\n")
+                 f.write(f"  Style: {block['style']}\n")
+                 f.write(f"  Text: {block['text']}\n")
+                 f.write(f"  Bounding Box: {block['bounding_box']}\n")
+                 f.write("-" * 40 + "\n")
+
+         # Save markdown table
+         with open("test_results_markdown_table.md", "w", encoding="utf-8") as f:
+             f.write("# Google Document AI Results\n\n")
+             f.write("## Text Blocks with Height Information\n\n")
+             f.write(markdown_table)
+
+         # Save basic text
+         with open("test_results_basic_text.txt", "w", encoding="utf-8") as f:
+             f.write("Basic Extracted Text:\n")
+             f.write("=" * 30 + "\n\n")
+             f.write(basic_text)
+
+         print("✅ Results saved to:")
+         print("  - test_results_text_blocks.txt")
+         print("  - test_results_markdown_table.md")
+         print("  - test_results_basic_text.txt")
+
+         # Save heights data
+         with open("test_results_heights_mm.txt", "w", encoding="utf-8") as f:
+             f.write("Line Heights in Millimeters:\n")
+             f.write("=" * 40 + "\n\n")
+             for i, (page_num, line_text, height_mm) in enumerate(heights_mm):
+                 f.write(f"Line {i+1}: Page {page_num}, Height={height_mm}mm\n")
+                 f.write(f"Text: {line_text}\n")
+                 f.write("-" * 40 + "\n")
+
+         print("  - test_results_heights_mm.txt")
+
+         # Display statistics
+         print("\n9. Statistics:")
+         print("-" * 30)
+         heights = [block['height'] for block in text_blocks]
+         if heights:
+             print("📏 Height statistics:")
+             print(f"  Min height: {min(heights):.2f} mm")
+             print(f"  Max height: {max(heights):.2f} mm")
+             print(f"  Average height: {sum(heights)/len(heights):.2f} mm")
+
+         # Count styles
+         styles = {}
+         for block in text_blocks:
+             style = block['style']
+             styles[style] = styles.get(style, 0) + 1
+
+         print("\n🎨 Style distribution:")
+         for style, count in sorted(styles.items(), key=lambda x: x[1], reverse=True):
+             print(f"  {style}: {count} blocks")
+
+         print("\n🎉 Test completed successfully!")
+
+     except Exception as e:
+         print(f"❌ Error during testing: {str(e)}")
+         import traceback
+         traceback.print_exc()
+
+ def display_markdown_preview():
+     """Display a preview of the generated markdown table."""
+     try:
+         with open("test_results_markdown_table.md", "r", encoding="utf-8") as f:
+             content = f.read()
+
+         print("\n📋 Markdown Table Preview:")
+         print("=" * 80)
+         print(content)
+
+     except FileNotFoundError:
+         print("❌ Markdown table file not found. Run the test first.")
+
+ if __name__ == "__main__":
+     print("🚀 Google Document AI Test Script")
+     print("=" * 50)
+
+     # Run the main test
+     test_google_doc_ai()
+
+     # Display markdown preview
+     display_markdown_preview()
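
Step 9 of the script above reduces to a small min/max/mean computation over the `height` field of the extracted blocks. A self-contained sketch of that reduction (the helper name is mine, not from the project):

```python
# Hypothetical helper mirroring the statistics step of test_google_doc_ai.py:
# collect the per-block heights (in mm) and summarize them.
def height_stats(text_blocks):
    """Return (min, max, mean) of block heights in mm, or None if empty."""
    heights = [block['height'] for block in text_blocks]
    if not heights:
        return None
    return min(heights), max(heights), sum(heights) / len(heights)

print(height_stats([{'height': 2.0}, {'height': 4.0}, {'height': 6.0}]))
# → (2.0, 6.0, 4.0)
```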
test_metadata.py ADDED
@@ -0,0 +1,89 @@
+ #!/usr/bin/env python3
+ """
+ Test script for metadata extraction functionality
+ """
+ import os
+ import sys
+ from src.extract_text.extract_meta_data import PDFArtworkMetadataExtractor
+
+ def test_metadata_extraction():
+     """Test the metadata extraction on a sample PDF"""
+
+     # Check if we have any PDF files in the requirements library
+     base_path = "requirements_library/client-requirements"
+
+     if not os.path.exists(base_path):
+         print("❌ No requirements library found")
+         return False
+
+     # Find the first PDF file
+     pdf_file = None
+     for root, dirs, files in os.walk(base_path):
+         for file in files:
+             if file.lower().endswith('.pdf'):
+                 pdf_file = os.path.join(root, file)
+                 break
+         if pdf_file:
+             break
+
+     if not pdf_file:
+         print("❌ No PDF files found in requirements library")
+         return False
+
+     print(f"📄 Testing metadata extraction on: {pdf_file}")
+
+     try:
+         # Initialize the extractor
+         extractor = PDFArtworkMetadataExtractor()
+
+         # Extract metadata
+         metadata = extractor.extract_metadata(pdf_file)
+
+         if 'error' in metadata:
+             print(f"❌ Error extracting metadata: {metadata['error']}")
+             return False
+
+         # Print results
+         print("✅ Metadata extraction successful!")
+         print(f"📊 Pages processed: {metadata.get('pages_processed', 0)}")
+         print(f"📝 Has selectable text: {metadata.get('has_selectable_text', False)}")
+         print(f"🔧 Extraction method: {metadata.get('extraction_method', 'unknown')}")
+
+         # Show top fonts
+         fonts = metadata.get('fonts', {})
+         if fonts:
+             print("\n🔤 Top 3 Fonts:")
+             for i, (font, count) in enumerate(list(fonts.items())[:3]):
+                 print(f"  {i+1}. {font}: {count} characters")
+
+         # Show top font sizes
+         font_sizes = metadata.get('font_sizes', {})
+         if font_sizes:
+             print("\n📏 Top 3 Font Sizes:")
+             for i, (size, count) in enumerate(list(font_sizes.items())[:3]):
+                 print(f"  {i+1}. {size}pt: {count} characters")
+
+         # Show top colors
+         colors = metadata.get('text_colors', {})
+         if colors:
+             print("\n🎨 Top 3 Text Colors:")
+             for i, (color, count) in enumerate(list(colors.items())[:3]):
+                 print(f"  {i+1}. RGB{color}: {count} characters")
+
+         return True
+
+     except Exception as e:
+         print(f"❌ Test failed with error: {str(e)}")
+         return False
+
+ if __name__ == "__main__":
+     print("🧪 Testing Metadata Extraction")
+     print("=" * 40)
+
+     success = test_metadata_extraction()
+
+     if success:
+         print("\n✅ All tests passed! Metadata extraction is working correctly.")
+     else:
+         print("\n❌ Tests failed. Please check the error messages above.")
+         sys.exit(1)
test_pdf_requirements.py ADDED
@@ -0,0 +1,74 @@
+ #!/usr/bin/env python3
+ """
+ Test script for PDF requirements functionality
+ """
+
+ import os
+ import tempfile
+ from src.extract_text.ingest import RequirementsIngest
+
+ def test_pdf_requirements():
+     """Test PDF requirements ingestion"""
+     print("Testing PDF requirements functionality...")
+
+     # Create a simple test PDF (we'll use an existing one if available)
+     test_pdf_path = None
+
+     # Look for any PDF file in the requirements_library
+     for root, dirs, files in os.walk("requirements_library"):
+         for file in files:
+             if file.lower().endswith('.pdf'):
+                 test_pdf_path = os.path.join(root, file)
+                 break
+         if test_pdf_path:
+             break
+
+     if not test_pdf_path:
+         print("No PDF files found for testing. Creating a simple test...")
+         # Create a simple test with text file
+         with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
+             f.write("Test requirement: All products must have allergen information.")
+             test_file_path = f.name
+
+         print(f"Created test text file: {test_file_path}")
+     else:
+         print(f"Using existing PDF for testing: {test_pdf_path}")
+         test_file_path = test_pdf_path
+
+     try:
+         # Test the ingestion
+         ingest = RequirementsIngest()
+
+         # Open the file and test ingestion
+         with open(test_file_path, 'rb') as f:
+             result = ingest.ingest_requirements_document(f)
+
+         print("✅ Ingestion successful!")
+         print(f"Result type: {type(result)}")
+
+         if isinstance(result, dict):
+             print(f"File type: {result.get('type', 'unknown')}")
+             print(f"Filename: {result.get('filename', 'unknown')}")
+             print(f"File size: {result.get('file_size', 0)} bytes")
+             print(f"Text content preview: {result.get('text_content', '')[:200]}...")
+         else:
+             print(f"Text content: {result[:200]}...")
+
+         print("\n✅ PDF requirements functionality is working!")
+
+     except Exception as e:
+         print(f"❌ Error during testing: {e}")
+         import traceback
+         traceback.print_exc()
+
+     finally:
+         # Clean up test file if we created one
+         if test_pdf_path is None and 'test_file_path' in locals():
+             try:
+                 os.unlink(test_file_path)
+                 print(f"Cleaned up test file: {test_file_path}")
+             except OSError:
+                 pass
+
+ if __name__ == "__main__":
+     test_pdf_requirements()
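
Both test_metadata.py and test_pdf_requirements.py locate a sample PDF with the same nested `os.walk` loop. That pattern factors into a small reusable helper — a sketch under the assumption that "first PDF" just means any `.pdf` under the directory (the helper name is mine):

```python
import os

# Hypothetical helper mirroring the PDF-discovery loops in the test scripts:
# walk base_path and return the first file whose name ends in .pdf.
def find_first_pdf(base_path):
    """Return the path of the first *.pdf found under base_path, or None."""
    for root, _dirs, files in os.walk(base_path):
        for name in sorted(files):  # sorted for a deterministic pick
            if name.lower().endswith('.pdf'):
                return os.path.join(root, name)
    return None
```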