Commit f02f2d2
Parent(s): b663e26
Upload full Streamlit app from GitHub
This view is limited to 50 files because it contains too many changes. See raw diff
- requirements.txt +35 -3
- src/README.md +159 -0
- src/TESTING.md +170 -0
- src/__pycache__/config.cpython-312.pyc +0 -0
- src/__pycache__/config.cpython-313.pyc +0 -0
- src/__pycache__/streamlit_app.cpython-313.pyc +0 -0
- src/agents/__init__.py +0 -0
- src/agents/__pycache__/__init__.cpython-312.pyc +0 -0
- src/agents/__pycache__/__init__.cpython-313.pyc +0 -0
- src/agents/__pycache__/base_agent.cpython-312.pyc +0 -0
- src/agents/__pycache__/ingestion_agent.cpython-312.pyc +0 -0
- src/agents/__pycache__/ingestion_agent.cpython-313.pyc +0 -0
- src/agents/__pycache__/table_agent.cpython-312.pyc +0 -0
- src/agents/__pycache__/text_agent.cpython-312.pyc +0 -0
- src/agents/base_agent.py +44 -0
- src/agents/ingestion_agent.py +176 -0
- src/agents/ingestion_agent_alternative.py +167 -0
- src/agents/table_agent.py +156 -0
- src/agents/text_agent.py +137 -0
- src/config.py +49 -0
- src/create_test_docs.py +189 -0
- src/models/__init__.py +14 -0
- src/models/__pycache__/__init__.cpython-312.pyc +0 -0
- src/models/__pycache__/__init__.cpython-313.pyc +0 -0
- src/models/__pycache__/document.cpython-312.pyc +0 -0
- src/models/__pycache__/document.cpython-313.pyc +0 -0
- src/models/__pycache__/similarity.cpython-312.pyc +0 -0
- src/models/__pycache__/similarity.cpython-313.pyc +0 -0
- src/models/document.py +142 -0
- src/models/similarity.py +40 -0
- src/orchestrator/__init__.py +0 -0
- src/orchestrator/__pycache__/__init__.cpython-312.pyc +0 -0
- src/orchestrator/__pycache__/__init__.cpython-313.pyc +0 -0
- src/orchestrator/__pycache__/scorers.cpython-312.pyc +0 -0
- src/orchestrator/__pycache__/scorers.cpython-313.pyc +0 -0
- src/orchestrator/__pycache__/similarity_orchestrator.cpython-312.pyc +0 -0
- src/orchestrator/__pycache__/similarity_orchestrator.cpython-313.pyc +0 -0
- src/orchestrator/scorers.py +197 -0
- src/orchestrator/similarity_orchestrator.py +130 -0
- src/requirements-alternative.txt +37 -0
- src/requirements.txt +35 -0
- src/storage/__init__.py +0 -0
- src/storage/vector_store.py +183 -0
- src/streamlit_app.py +296 -35
- src/utils/__init__.py +0 -0
- src/utils/__pycache__/__init__.cpython-312.pyc +0 -0
- src/utils/__pycache__/__init__.cpython-313.pyc +0 -0
- src/utils/__pycache__/file_handler.cpython-312.pyc +0 -0
- src/utils/__pycache__/visualization.cpython-312.pyc +0 -0
- src/utils/__pycache__/visualization.cpython-313.pyc +0 -0
requirements.txt
CHANGED
@@ -1,3 +1,35 @@
-
-
-
+# Core framework
+streamlit>=1.31.0
+
+# Data models
+pydantic>=2.6.0
+
+# Document parsing - using versions compatible with Python 3.13
+# Use pypdf if PyMuPDF has DLL issues on Windows
+pypdf>=4.0.0  # Fallback PDF parser (pure Python, no DLL dependencies)
+python-docx>=1.1.0
+pdfplumber>=0.10.0
+
+# ML & Embeddings
+sentence-transformers>=2.3.0
+torch>=2.2.0
+
+# Vector storage
+faiss-cpu>=1.7.0
+
+# Data processing
+numpy>=1.26.0
+pandas>=2.2.0
+Pillow>=10.2.0
+
+# Utilities
+python-dotenv>=1.0.0
+
+# Visualization
+plotly>=5.18.0
+
+# Async
+aiofiles>=23.2.0
+
+# Similarity metrics
+scikit-learn>=1.3.0
src/README.md
ADDED
@@ -0,0 +1,159 @@
+# agentic-multimodal-doc-comparator
+
+An agentic system to accurately measure the similarity of two documents containing complex design content
+
+
+
+## Features (Phase 1)
+
+- **Multi-modal document analysis**: Text and table extraction
+- **Semantic similarity**: Uses sentence-transformers for embeddings
+- **Interactive Streamlit UI**: Easy-to-use web interface
+- **Support for PDF and DOCX**: Compare documents in multiple formats
+- **Detailed similarity reports**: Per-modality breakdown and matched sections
+- **Configurable weights**: Adjust importance of text vs. tables
+
+## System Architecture
+
+The system implements a 6-layer architecture:
+
+1. **Input Layer**: Accepts PDF/DOCX documents
+2. **Ingestion Layer**: Extracts raw content (text, tables)
+3. **Modality Extractors**: Specialized agents for text and table processing
+4. **Vector Store**: FAISS-based similarity search
+5. **Orchestrator**: Coordinates comparison and aggregates scores
+6. **Output Layer**: Similarity report with visualizations
+
+## Installation
+
+### Prerequisites
+
+- Python 3.8+
+- pip
+
+### Setup
+
+1. Clone the repository:
+```bash
+git clone <repository-url>
+cd agentic-multimodal-doc-comparator
+```
+
+2. Install dependencies:
+```bash
+pip install -r requirements.txt
+```
+
+3. (Optional) Set up environment variables:
+```bash
+cp .env.example .env
+# Edit .env with your API keys (for Phase 2 features)
+```
+
+## Usage
+
+### Running the Streamlit App
+
+```bash
+streamlit run streamlit_app.py
+```
+
+The app will open in your browser at `http://localhost:8501`.
+
+### Using the App
+
+1. **Upload Documents**: Upload two documents (PDF or DOCX) in the designated areas
+2. **Adjust Weights**: Use the sidebar to adjust the weight given to text vs. table comparison
+3. **Compare**: Click the "Compare Documents" button
+4. **View Results**:
+   - Overall similarity score (0-100%)
+   - Per-modality breakdown (text and table scores)
+   - Top matched sections from both documents
+5. **Download Report**: Export results as JSON for further analysis
+
+## Project Structure
+
+```
+agentic-multimodal-doc-comparator/
+├── agents/                         # Modality extraction agents
+│   ├── base_agent.py               # Abstract base class
+│   ├── ingestion_agent.py          # PDF/DOCX parsing
+│   ├── text_agent.py               # Text chunking & embeddings
+│   └── table_agent.py              # Table extraction & embeddings
+├── orchestrator/                   # Similarity orchestration
+│   ├── scorers.py                  # Per-modality scoring
+│   └── similarity_orchestrator.py  # Main orchestrator
+├── storage/                        # Vector storage
+│   └── vector_store.py             # FAISS wrapper
+├── models/                         # Data models
+│   ├── document.py                 # Document structures
+│   └── similarity.py               # Similarity report structures
+├── utils/                          # Utilities
+│   ├── file_handler.py             # File upload/validation
+│   └── visualization.py            # Result visualization
+├── config.py                       # Configuration
+├── streamlit_app.py                # Main Streamlit UI
+└── requirements.txt                # Dependencies
+```
+
+## Configuration
+
+Edit `config.py` to customize:
+
+- **Embedding model**: Default is `all-MiniLM-L6-v2`
+- **Chunk size**: Default 512 tokens with 50-token overlap
+- **Modality weights**: Default 60% text, 40% tables
+- **File limits**: Default 50MB max file size
+
+## Phase 2 Roadmap
+
+Future enhancements include:
+
+- **Image Agent**: Extract and compare images using CLIP embeddings
+- **Layout Agent**: Analyze document structure and section hierarchy
+- **Meta Agent**: Compare metadata (title, author, date, keywords)
+- **Batch Comparison**: Compare 1 document against N documents
+- **Enhanced UI**: Visual diff, interactive navigation, filtering
+
+## Technical Details
+
+### Models & Libraries
+
+- **Embedding**: sentence-transformers (all-MiniLM-L6-v2, 384 dimensions)
+- **PDF Parsing**: PyMuPDF (text) + pdfplumber (tables)
+- **DOCX Parsing**: python-docx
+- **Vector Search**: FAISS (cosine similarity)
+- **UI**: Streamlit with Plotly visualizations
+
+### Similarity Scoring
+
+- **Text**: Cosine similarity between chunk embeddings, averaged over best matches
+- **Tables**: Schema and content similarity using linearized table embeddings
+- **Overall**: Weighted combination of modality scores
+
+## Troubleshooting
+
+### Common Issues
+
+1. **"Module not found" errors**: Run `pip install -r requirements.txt`
+2. **Large files timing out**: Reduce document size or increase timeout in config
+3. **Memory errors**: Process smaller documents or reduce chunk overlap
+4. **No matches found**: Documents may be too dissimilar or use different terminology
+
+## Contributing
+
+Contributions welcome! Please:
+
+1. Fork the repository
+2. Create a feature branch
+3. Make your changes
+4. Submit a pull request
+
+## License
+
+MIT License
+
+## Acknowledgments
+
+- Architecture inspired by multi-agent RAG systems
+- Built with Streamlit, sentence-transformers, and FAISS
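The scoring scheme the README describes (per-modality cosine similarity averaged over best matches, then a weighted combination) can be sketched in a few lines of NumPy. This is an illustration of the described behavior only, not the repository's `orchestrator/scorers.py`, whose body does not appear in this view; `modality_score` and `overall_score` are illustrative names.

```python
import numpy as np

def modality_score(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Average best-match cosine similarity from doc A's chunks to doc B's."""
    if emb_a.size == 0 or emb_b.size == 0:
        return 0.0
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)  # assumes non-zero rows
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sims = a @ b.T                           # pairwise cosine similarity matrix
    return float(sims.max(axis=1).mean())    # best match per chunk, then average

def overall_score(text_score: float, table_score: float) -> float:
    """Weighted combination using the default 60/40 split from config.py."""
    weights = {"text": 0.60, "table": 0.40}
    return weights["text"] * text_score + weights["table"] * table_score
```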
src/TESTING.md
ADDED
@@ -0,0 +1,170 @@
+# Testing Guide for Document Comparison App
+
+## Quick Start Testing
+
+### 1. Install Dependencies (if not done)
+
+```bash
+pip install -r requirements.txt
+```
+
+**Expected time**: 5-10 minutes (large packages like PyTorch)
+
+### 2. Create Test Documents
+
+```bash
+python create_test_docs.py
+```
+
+This creates three test documents:
+- `test_doc1.docx` - Product requirements document
+- `test_doc2.docx` - Similar document with differences
+- `test_doc3_identical.docx` - Identical to doc1
+
+### 3. Run the App
+
+```bash
+streamlit run streamlit_app.py
+```
+
+The app will open at: `http://localhost:8501`
+
+### 4. Test Scenarios
+
+#### Test Case 1: Similar Documents (Expected: 60-80% similarity)
+- **Document 1**: test_doc1.docx
+- **Document 2**: test_doc2.docx
+- **What to expect**:
+  - Overall similarity: ~65-75%
+  - Text similarity: ~70-80% (similar topics, some wording differences)
+  - Table similarity: ~50-60% (different tech stacks)
+  - Matched sections showing overlapping features and overview
+
+#### Test Case 2: Identical Documents (Expected: ~100% similarity)
+- **Document 1**: test_doc1.docx
+- **Document 2**: test_doc3_identical.docx
+- **What to expect**:
+  - Overall similarity: ~95-100%
+  - Text similarity: ~100%
+  - Table similarity: ~100%
+  - All sections matched
+
+#### Test Case 3: Test with Your Own Documents
+- Upload any two PDF or DOCX files (max 50MB each)
+- Adjust text/table weights in sidebar
+- View detailed comparison results
+
+## What to Look For
+
+### ✅ Successful Run Indicators
+
+1. **Progress bar completes** through all stages:
+   - Ingesting documents
+   - Extracting and embedding text
+   - Extracting and embedding tables
+   - Comparing documents
+
+2. **Results display shows**:
+   - Overall similarity gauge (0-100%)
+   - Bar chart with text and table scores
+   - Matched sections with content snippets
+   - Page numbers for each match
+
+3. **Download button** works and exports JSON report
+
+### ⚠️ Common Issues to Check
+
+1. **"Module not found" errors**
+   - Run: `pip install -r requirements.txt`
+
+2. **Model download on first run**
+   - sentence-transformers will download a ~90MB model the first time
+   - This is normal and only happens once
+
+3. **Memory warnings**
+   - Test with smaller documents first
+   - Close other applications if needed
+
+4. **Table extraction issues**
+   - Some PDFs may have tables in image format (won't extract)
+   - DOCX tables extract more reliably
+
+## Expected Performance
+
+- **Small documents** (< 5 pages): 5-15 seconds
+- **Medium documents** (5-20 pages): 15-45 seconds
+- **Large documents** (> 20 pages): 45+ seconds
+
+## Verifying Results
+
+### Text Similarity
+- Check "Matched Sections" to see side-by-side text comparisons
+- Higher scores = more semantic overlap
+- Look for similar topics even with different wording
+
+### Table Similarity
+- Compares table schemas (headers) and content
+- Identical tables = high score
+- Different schemas = lower score
+
+### Overall Score
+- Weighted combination (default: 60% text, 40% table)
+- Adjust weights in sidebar to change emphasis
+
+## Troubleshooting
+
+### App won't start
+```bash
+# Check Python version (need 3.8+)
+python --version
+
+# Reinstall streamlit
+pip install --upgrade streamlit
+```
+
+### Embeddings slow
+- First run downloads model (~90MB)
+- Subsequent runs use cached model
+- Consider using GPU if available (change to faiss-gpu in requirements)
+
+### No matches found
+- Documents may be too different
+- Try adjusting chunk size in config.py
+- Check if documents have extractable text (not scanned images)
+
+## Advanced Testing
+
+### Modify Configuration
+Edit `config.py` to adjust:
+```python
+TEXT_CHUNK_SIZE = 512  # Increase for longer context
+TEXT_CHUNK_OVERLAP = 50  # Increase for better matching
+MODALITY_WEIGHTS = {"text": 0.60, "table": 0.40}  # Adjust importance
+```
+
+### Test Different Document Types
+1. **Highly similar**: Same document, minor edits
+2. **Moderately similar**: Same topic, different authors
+3. **Dissimilar**: Completely different topics
+
+### Validate Accuracy
+Compare app results with manual review:
+- Do matched sections make sense?
+- Are similarity percentages reasonable?
+- Are table comparisons accurate?
+
+## Next Steps
+
+After successful testing:
+1. Test with your real documents
+2. Adjust weights based on your use case
+3. Consider Phase 2 features (image, layout, metadata comparison)
+4. Provide feedback for improvements
+
+## Support
+
+If you encounter issues:
+1. Check error message in terminal
+2. Verify all dependencies installed
+3. Ensure documents are valid PDF/DOCX
+4. Check file size limits (50MB default)
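For the last troubleshooting point above (checking whether a PDF has extractable text), a quick standalone check with pdfplumber (already in `requirements.txt`) tells you whether a document has a text layer. `your_document.pdf` is a placeholder path:

```python
import pdfplumber

# A scanned PDF typically returns None or an empty string here and would
# need OCR before it can be compared by this app.
with pdfplumber.open("your_document.pdf") as pdf:
    text = pdf.pages[0].extract_text()
    print("text layer found" if text and text.strip() else "no text layer (likely scanned)")
```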
src/__pycache__/config.cpython-312.pyc
ADDED
Binary file (1.39 kB)

src/__pycache__/config.cpython-313.pyc
ADDED
Binary file (1.38 kB)

src/__pycache__/streamlit_app.cpython-313.pyc
ADDED
Binary file (12.2 kB)

src/agents/__init__.py
ADDED
File without changes

src/agents/__pycache__/__init__.cpython-312.pyc
ADDED
Binary file (179 Bytes)

src/agents/__pycache__/__init__.cpython-313.pyc
ADDED
Binary file (179 Bytes)

src/agents/__pycache__/base_agent.cpython-312.pyc
ADDED
Binary file (1.91 kB)

src/agents/__pycache__/ingestion_agent.cpython-312.pyc
ADDED
Binary file (6.74 kB)

src/agents/__pycache__/ingestion_agent.cpython-313.pyc
ADDED
Binary file (6.7 kB)

src/agents/__pycache__/table_agent.cpython-312.pyc
ADDED
Binary file (6.42 kB)

src/agents/__pycache__/text_agent.cpython-312.pyc
ADDED
Binary file (5.14 kB)
src/agents/base_agent.py
ADDED
@@ -0,0 +1,44 @@
+"""
+Abstract base class for all modality agents.
+"""
+from abc import ABC, abstractmethod
+from typing import Any, Dict
+
+
+class BaseAgent(ABC):
+    """Abstract base class for all modality agents in the system."""
+
+    def __init__(self, config: Dict[str, Any] = None):
+        """
+        Initialize the agent with configuration.
+
+        Args:
+            config: Configuration dictionary
+        """
+        self.config = config or {}
+
+    @abstractmethod
+    async def process(self, input_data: Any) -> Any:
+        """
+        Process input data and return structured output.
+
+        Args:
+            input_data: Input data to process
+
+        Returns:
+            Processed output specific to the agent type
+        """
+        pass
+
+    @abstractmethod
+    def get_agent_name(self) -> str:
+        """
+        Return the name of this agent for logging/tracking.
+
+        Returns:
+            Agent name as string
+        """
+        pass
+
+    def __repr__(self) -> str:
+        return f"{self.get_agent_name()}(config={self.config})"
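To make the `BaseAgent` contract concrete, here is a minimal hypothetical subclass; `EchoAgent` does not exist in the repository and is for illustration only:

```python
import asyncio
from typing import Any

from agents.base_agent import BaseAgent

class EchoAgent(BaseAgent):
    """Trivial agent that returns its input unchanged."""

    async def process(self, input_data: Any) -> Any:
        return input_data

    def get_agent_name(self) -> str:
        return "EchoAgent"

agent = EchoAgent({"verbose": True})
print(asyncio.run(agent.process("hello")))  # -> hello
print(agent)                                # -> EchoAgent(config={'verbose': True})
```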
src/agents/ingestion_agent.py
ADDED
@@ -0,0 +1,176 @@
+"""
+Document ingestion agent for extracting content from PDF and DOCX files.
+Supports both PyMuPDF and pypdf for PDF parsing.
+"""
+import pdfplumber
+from docx import Document
+from typing import Dict, List, Any
+from pathlib import Path
+
+from agents.base_agent import BaseAgent
+from models.document import RawDocument
+
+# Try to import PyMuPDF, fallback to pypdf if not available
+try:
+    import fitz  # PyMuPDF
+    USING_PYMUPDF = True
+    print("✓ Using PyMuPDF for PDF text extraction")
+except (ImportError, OSError) as e:
+    print(f"⚠ PyMuPDF not available ({e}), falling back to pypdf")
+    try:
+        from pypdf import PdfReader
+        USING_PYMUPDF = False
+        print("✓ Using pypdf for PDF text extraction")
+    except ImportError:
+        raise ImportError(
+            "Neither PyMuPDF nor pypdf is available. "
+            "Install one of them: pip install PyMuPDF or pip install pypdf"
+        )
+
+
+class IngestionAgent(BaseAgent):
+    """Agent responsible for extracting raw content from documents."""
+
+    def __init__(self, config: Dict[str, Any] = None):
+        super().__init__(config)
+
+    def get_agent_name(self) -> str:
+        return "IngestionAgent"
+
+    async def process(self, file_path: str) -> RawDocument:
+        """
+        Process a document file and extract raw content.
+
+        Args:
+            file_path: Path to PDF or DOCX file
+
+        Returns:
+            RawDocument containing extracted content
+        """
+        file_type = self._detect_file_type(file_path)
+
+        if file_type == "pdf":
+            return await self._ingest_pdf(file_path)
+        elif file_type == "docx":
+            return await self._ingest_docx(file_path)
+        else:
+            raise ValueError(f"Unsupported file type: {file_type}")
+
+    def _detect_file_type(self, file_path: str) -> str:
+        """Detect file type from extension."""
+        extension = Path(file_path).suffix.lower()
+        if extension == ".pdf":
+            return "pdf"
+        elif extension in [".docx", ".doc"]:
+            return "docx"
+        else:
+            raise ValueError(f"Unsupported file extension: {extension}")
+
+    async def _ingest_pdf(self, file_path: str) -> RawDocument:
+        """
+        Extract content from PDF file.
+
+        Args:
+            file_path: Path to PDF file
+
+        Returns:
+            RawDocument with extracted content
+        """
+        pages = []
+        raw_text = ""
+        raw_tables = []
+
+        # Extract text using PyMuPDF or pypdf
+        if USING_PYMUPDF:
+            # Extract text using PyMuPDF
+            with fitz.open(file_path) as pdf_doc:
+                for page_num, page in enumerate(pdf_doc, start=1):
+                    page_text = page.get_text()
+                    raw_text += page_text + "\n"
+                    pages.append({
+                        "page_num": page_num,
+                        "text": page_text
+                    })
+        else:
+            # Extract text using pypdf
+            reader = PdfReader(file_path)
+            for page_num, page in enumerate(reader.pages, start=1):
+                page_text = page.extract_text() or ""
+                raw_text += page_text + "\n"
+                pages.append({
+                    "page_num": page_num,
+                    "text": page_text
+                })
+
+        # Extract tables using pdfplumber (works with both)
+        with pdfplumber.open(file_path) as pdf:
+            for page_num, page in enumerate(pdf.pages, start=1):
+                tables_on_page = page.extract_tables()
+                if tables_on_page:
+                    for table_idx, table in enumerate(tables_on_page):
+                        if table:  # Skip empty tables
+                            raw_tables.append({
+                                "page_num": page_num,
+                                "table_idx": table_idx,
+                                "data": table
+                            })
+
+        return RawDocument(
+            filename=Path(file_path).name,
+            file_type="pdf",
+            pages=pages,
+            raw_text=raw_text.strip(),
+            raw_tables=raw_tables,
+            total_pages=len(pages)
+        )
+
+    async def _ingest_docx(self, file_path: str) -> RawDocument:
+        """
+        Extract content from DOCX file.
+
+        Args:
+            file_path: Path to DOCX file
+
+        Returns:
+            RawDocument with extracted content
+        """
+        doc = Document(file_path)
+        pages = []
+        raw_text = ""
+        raw_tables = []
+
+        # Extract text from paragraphs
+        # Note: DOCX doesn't have "pages" like PDF, so we simulate page 1
+        page_text = ""
+        for para in doc.paragraphs:
+            if para.text.strip():
+                page_text += para.text + "\n"
+                raw_text += para.text + "\n"
+
+        pages.append({
+            "page_num": 1,
+            "text": page_text
+        })
+
+        # Extract tables
+        for table_idx, table in enumerate(doc.tables):
+            table_data = []
+            for row in table.rows:
+                row_data = [cell.text.strip() for cell in row.cells]
+                table_data.append(row_data)
+
+            if table_data:  # Skip empty tables
+                raw_tables.append({
+                    "page_num": 1,
+                    "table_idx": table_idx,
+                    "data": table_data
+                })
+
+        return RawDocument(
+            filename=Path(file_path).name,
+            file_type="docx",
+            pages=pages,
+            raw_text=raw_text.strip(),
+            raw_tables=raw_tables,
+            total_pages=1  # DOCX treated as single page
+        )
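A usage sketch for the agent above: `process()` is a coroutine, so callers must await it. `sample.pdf` is a placeholder path, and the printed attributes assume `RawDocument` stores its constructor arguments as same-named fields (its full definition is truncated in this view):

```python
import asyncio

from agents.ingestion_agent import IngestionAgent

# Run the async ingestion step on its own; in the app this is orchestrated.
raw = asyncio.run(IngestionAgent().process("sample.pdf"))
print(raw.filename, raw.total_pages, len(raw.raw_tables))
```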
src/agents/ingestion_agent_alternative.py
ADDED
@@ -0,0 +1,167 @@
+"""
+Document ingestion agent with fallback support for pypdf.
+Use this version if PyMuPDF installation issues persist.
+"""
+from docx import Document
+from typing import Dict, List, Any
+from pathlib import Path
+
+from agents.base_agent import BaseAgent
+from models.document import RawDocument
+
+# Try to import fitz (PyMuPDF), fallback to pypdf
+try:
+    import fitz  # PyMuPDF
+    USING_PYMUPDF = True
+except ImportError:
+    from pypdf import PdfReader
+    USING_PYMUPDF = False
+    print("Using pypdf (PyMuPDF not available)")
+
+import pdfplumber
+
+
+class IngestionAgent(BaseAgent):
+    """Agent responsible for extracting raw content from documents."""
+
+    def __init__(self, config: Dict[str, Any] = None):
+        super().__init__(config)
+
+    def get_agent_name(self) -> str:
+        return "IngestionAgent"
+
+    async def process(self, file_path: str) -> RawDocument:
+        """
+        Process a document file and extract raw content.
+
+        Args:
+            file_path: Path to PDF or DOCX file
+
+        Returns:
+            RawDocument containing extracted content
+        """
+        file_type = self._detect_file_type(file_path)
+
+        if file_type == "pdf":
+            return await self._ingest_pdf(file_path)
+        elif file_type == "docx":
+            return await self._ingest_docx(file_path)
+        else:
+            raise ValueError(f"Unsupported file type: {file_type}")
+
+    def _detect_file_type(self, file_path: str) -> str:
+        """Detect file type from extension."""
+        extension = Path(file_path).suffix.lower()
+        if extension == ".pdf":
+            return "pdf"
+        elif extension in [".docx", ".doc"]:
+            return "docx"
+        else:
+            raise ValueError(f"Unsupported file extension: {extension}")
+
+    async def _ingest_pdf(self, file_path: str) -> RawDocument:
+        """
+        Extract content from PDF file.
+
+        Args:
+            file_path: Path to PDF file
+
+        Returns:
+            RawDocument with extracted content
+        """
+        pages = []
+        raw_text = ""
+        raw_tables = []
+
+        if USING_PYMUPDF:
+            # Extract text using PyMuPDF (fitz)
+            with fitz.open(file_path) as pdf_doc:
+                for page_num, page in enumerate(pdf_doc, start=1):
+                    page_text = page.get_text()
+                    raw_text += page_text + "\n"
+                    pages.append({
+                        "page_num": page_num,
+                        "text": page_text
+                    })
+        else:
+            # Extract text using pypdf
+            reader = PdfReader(file_path)
+            for page_num, page in enumerate(reader.pages, start=1):
+                page_text = page.extract_text()
+                raw_text += page_text + "\n"
+                pages.append({
+                    "page_num": page_num,
+                    "text": page_text
+                })
+
+        # Extract tables using pdfplumber (works with both)
+        with pdfplumber.open(file_path) as pdf:
+            for page_num, page in enumerate(pdf.pages, start=1):
+                tables_on_page = page.extract_tables()
+                if tables_on_page:
+                    for table_idx, table in enumerate(tables_on_page):
+                        if table:  # Skip empty tables
+                            raw_tables.append({
+                                "page_num": page_num,
+                                "table_idx": table_idx,
+                                "data": table
+                            })
+
+        return RawDocument(
+            filename=Path(file_path).name,
+            file_type="pdf",
+            pages=pages,
+            raw_text=raw_text.strip(),
+            raw_tables=raw_tables,
+            total_pages=len(pages)
+        )
+
+    async def _ingest_docx(self, file_path: str) -> RawDocument:
+        """
+        Extract content from DOCX file.
+
+        Args:
+            file_path: Path to DOCX file
+
+        Returns:
+            RawDocument with extracted content
+        """
+        doc = Document(file_path)
+        pages = []
+        raw_text = ""
+        raw_tables = []
+
+        # Extract text from paragraphs
+        page_text = ""
+        for para in doc.paragraphs:
+            if para.text.strip():
+                page_text += para.text + "\n"
+                raw_text += para.text + "\n"
+
+        pages.append({
+            "page_num": 1,
+            "text": page_text
+        })
+
+        # Extract tables
+        for table_idx, table in enumerate(doc.tables):
+            table_data = []
+            for row in table.rows:
+                row_data = [cell.text.strip() for cell in row.cells]
+                table_data.append(row_data)
+
+            if table_data:  # Skip empty tables
+                raw_tables.append({
+                    "page_num": 1,
+                    "table_idx": table_idx,
+                    "data": table_data
+                })
+
+        return RawDocument(
+            filename=Path(file_path).name,
+            file_type="docx",
+            pages=pages,
+            raw_text=raw_text.strip(),
+            raw_tables=raw_tables,
+            total_pages=1  # DOCX treated as single page
+        )
src/agents/table_agent.py
ADDED
@@ -0,0 +1,156 @@
+"""
+Table agent for extracting and embedding table data.
+"""
+import numpy as np
+from typing import List, Tuple, Dict, Any
+from sentence_transformers import SentenceTransformer
+
+from agents.base_agent import BaseAgent
+from models.document import TableExtraction, RawDocument
+import config
+
+
+class TableAgent(BaseAgent):
+    """Agent responsible for table extraction and embedding generation."""
+
+    def __init__(self, config_dict: Dict[str, Any] = None):
+        super().__init__(config_dict)
+        # Load embedding model (same as text agent for consistency)
+        self.model = SentenceTransformer(config.TEXT_EMBEDDING_MODEL)
+
+    def get_agent_name(self) -> str:
+        return "TableAgent"
+
+    async def process(self, raw_document: RawDocument) -> Tuple[List[TableExtraction], np.ndarray]:
+        """
+        Process raw tables into structured format and embeddings.
+
+        Args:
+            raw_document: Raw document with extracted tables
+
+        Returns:
+            Tuple of (list of TableExtraction objects, numpy array of embeddings)
+        """
+        # Parse tables
+        tables = self.parse_tables(raw_document.raw_tables)
+
+        # Generate embeddings
+        if tables:
+            table_texts = [self.linearize_table(table) for table in tables]
+            embeddings = self.generate_embeddings(table_texts)
+        else:
+            embeddings = np.array([])
+
+        return tables, embeddings
+
+    def parse_tables(self, raw_tables: List[Dict[str, Any]]) -> List[TableExtraction]:
+        """
+        Parse raw table data into structured TableExtraction objects.
+
+        Args:
+            raw_tables: List of raw table dictionaries
+
+        Returns:
+            List of TableExtraction objects
+        """
+        tables = []
+
+        for raw_table in raw_tables:
+            table_data = raw_table.get("data", [])
+            if not table_data or len(table_data) < 1:
+                continue
+
+            # First row is usually headers
+            headers = [str(cell).strip() for cell in table_data[0]] if table_data else []
+
+            # Remaining rows are data
+            rows = []
+            for row_data in table_data[1:]:
+                row = [str(cell).strip() for cell in row_data]
+                rows.append(row)
+
+            # Generate schema summary
+            schema_summary = self._generate_schema_summary(headers, rows)
+
+            table = TableExtraction(
+                headers=headers,
+                rows=rows,
+                page_number=raw_table.get("page_num", 1),
+                schema_summary=schema_summary
+            )
+            tables.append(table)
+
+        return tables
+
+    def _generate_schema_summary(self, headers: List[str], rows: List[List[str]]) -> str:
+        """
+        Generate a summary of the table schema.
+
+        Args:
+            headers: Table headers
+            rows: Table rows
+
+        Returns:
+            Schema summary string
+        """
+        num_columns = len(headers)
+        num_rows = len(rows)
+
+        summary = f"Table with {num_columns} columns and {num_rows} rows. "
+        summary += f"Columns: {', '.join(headers[:5])}"  # Show first 5 headers
+
+        if len(headers) > 5:
+            summary += f" and {len(headers) - 5} more"
+
+        return summary
+
+    def linearize_table(self, table: TableExtraction) -> str:
+        """
+        Convert table to linear text format for embedding.
+
+        Args:
+            table: TableExtraction object
+
+        Returns:
+            Linearized table as string
+        """
+        # Format: "Header1: value1, Header2: value2, ..."
+        lines = []
+
+        # Add schema summary
+        lines.append(table.schema_summary)
+
+        # Add headers
+        if table.headers:
+            lines.append(f"Headers: {' | '.join(table.headers)}")
+
+        # Add rows (sample first few for embedding)
+        max_rows = 10  # Limit to avoid very long text
+        for idx, row in enumerate(table.rows[:max_rows], start=1):
+            if row:
+                # Create row representation
+                row_text = f"Row {idx}: {' | '.join(row)}"
+                lines.append(row_text)
+
+        if len(table.rows) > max_rows:
+            lines.append(f"... and {len(table.rows) - max_rows} more rows")
+
+        return "\n".join(lines)
+
+    def generate_embeddings(self, texts: List[str]) -> np.ndarray:
+        """
+        Generate embeddings for linearized tables.
+
+        Args:
+            texts: List of linearized table texts
+
+        Returns:
+            Numpy array of embeddings (shape: num_tables x embedding_dim)
+        """
+        if not texts:
+            return np.array([])
+
+        # Generate embeddings using sentence-transformers
+        embeddings = self.model.encode(texts, convert_to_numpy=True, show_progress_bar=False)
+
+        return embeddings
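The linearized text that actually gets embedded looks like the output sketched below. The `TableExtraction` constructor arguments follow the call in `parse_tables()` above; note that constructing `TableAgent` loads the sentence-transformers model:

```python
from agents.table_agent import TableAgent
from models.document import TableExtraction

agent = TableAgent()  # loads the embedding model on construction
table = TableExtraction(
    headers=["Component", "Technology", "Version"],
    rows=[["Backend", "Node.js", "18.x"]],
    page_number=1,
    schema_summary="Table with 3 columns and 1 rows. Columns: Component, Technology, Version",
)
print(agent.linearize_table(table))
# Table with 3 columns and 1 rows. Columns: Component, Technology, Version
# Headers: Component | Technology | Version
# Row 1: Backend | Node.js | 18.x
```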
src/agents/text_agent.py
ADDED
@@ -0,0 +1,137 @@
+"""
+Text agent for chunking text and generating embeddings.
+"""
+import numpy as np
+from typing import List, Tuple, Dict, Any
+from sentence_transformers import SentenceTransformer
+
+from agents.base_agent import BaseAgent
+from models.document import DocumentChunk, RawDocument
+import config
+
+
+class TextAgent(BaseAgent):
+    """Agent responsible for text chunking and embedding generation."""
+
+    def __init__(self, config_dict: Dict[str, Any] = None):
+        super().__init__(config_dict)
+        # Load embedding model
+        self.model = SentenceTransformer(config.TEXT_EMBEDDING_MODEL)
+
+    def get_agent_name(self) -> str:
+        return "TextAgent"
+
+    async def process(self, raw_document: RawDocument) -> Tuple[List[DocumentChunk], np.ndarray]:
+        """
+        Process raw document text into chunks and embeddings.
+
+        Args:
+            raw_document: Raw document with extracted text
+
+        Returns:
+            Tuple of (list of DocumentChunks, numpy array of embeddings)
+        """
+        # Chunk the text
+        chunks = self.chunk_text(raw_document.raw_text, raw_document)
+
+        # Generate embeddings
+        if chunks:
+            chunk_texts = [chunk.content for chunk in chunks]
+            embeddings = self.generate_embeddings(chunk_texts)
+        else:
+            embeddings = np.array([])
+
+        return chunks, embeddings
+
+    def chunk_text(self, text: str, raw_document: RawDocument) -> List[DocumentChunk]:
+        """
+        Split text into chunks with overlap.
+
+        Args:
+            text: Text to chunk
+            raw_document: Original document for metadata
+
+        Returns:
+            List of DocumentChunk objects
+        """
+        if not text or not text.strip():
+            return []
+
+        chunks = []
+
+        # Simple character-based chunking (approximate token-based chunking)
+        # Approximate: 1 token ~= 4 characters
+        char_chunk_size = config.TEXT_CHUNK_SIZE * 4
+        char_overlap = config.TEXT_CHUNK_OVERLAP * 4
+
+        text_length = len(text)
+        start = 0
+        chunk_idx = 0
+
+        while start < text_length:
+            end = min(start + char_chunk_size, text_length)
+
+            # Extract chunk
+            chunk_text = text[start:end].strip()
+
+            if chunk_text:
+                # Try to find the page number for this chunk
+                page_num = self._estimate_page_number(start, raw_document)
+
+                chunk = DocumentChunk(
+                    content=chunk_text,
+                    chunk_type="text",
+                    page_number=page_num,
+                    metadata={
+                        "chunk_index": chunk_idx,
+                        "start_char": start,
+                        "end_char": end
+                    }
+                )
+                chunks.append(chunk)
+                chunk_idx += 1
+
+            # Move to next chunk with overlap
+            start = end - char_overlap if end < text_length else text_length
+
+        return chunks
+
+    def _estimate_page_number(self, char_position: int, raw_document: RawDocument) -> int:
+        """
+        Estimate page number based on character position.
+
+        Args:
+            char_position: Character position in full text
+            raw_document: Original document
+
+        Returns:
+            Estimated page number (1-indexed)
+        """
+        # Calculate based on pages
+        current_pos = 0
+        for page in raw_document.pages:
+            page_text = page.get("text", "")
+            current_pos += len(page_text)
+            if char_position < current_pos:
+                return page.get("page_num", 1)
+
+        # Default to last page if not found
+        return raw_document.total_pages if raw_document.total_pages > 0 else 1
+
+    def generate_embeddings(self, texts: List[str]) -> np.ndarray:
+        """
+        Generate embeddings for list of texts.
+
+        Args:
+            texts: List of text strings
+
+        Returns:
+            Numpy array of embeddings (shape: num_texts x embedding_dim)
+        """
+        if not texts:
+            return np.array([])
+
+        # Generate embeddings using sentence-transformers
+        embeddings = self.model.encode(texts, convert_to_numpy=True, show_progress_bar=False)
+
+        return embeddings
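A worked example of the chunking loop above with the defaults from `config.py` (512-token chunks, 50-token overlap, at the ~4 characters-per-token approximation the code uses):

```python
# Defaults from config.py converted to characters.
char_chunk_size = 512 * 4   # 2048 characters per chunk
char_overlap = 50 * 4       # 200 characters of overlap between chunks

# Each window advances by 2048 - 200 = 1848 characters; for a 5000-character
# text this yields chunks [0, 2048), [1848, 3896), [3696, 5000).
start, text_length = 0, 5000
while start < text_length:
    end = min(start + char_chunk_size, text_length)
    print(f"chunk: [{start}, {end})")
    start = end - char_overlap if end < text_length else text_length
```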
src/config.py
ADDED
@@ -0,0 +1,49 @@
+"""
+Central configuration for the multi-agent document comparison system.
+"""
+import os
+from pathlib import Path
+from dotenv import load_dotenv
+
+# Load environment variables
+load_dotenv()
+
+# Paths
+PROJECT_ROOT = Path(__file__).parent
+DATA_DIR = PROJECT_ROOT / "data"
+UPLOAD_DIR = DATA_DIR / "uploads"
+VECTOR_STORE_DIR = DATA_DIR / "vector_stores"
+
+# Create directories if they don't exist
+UPLOAD_DIR.mkdir(parents=True, exist_ok=True)
+VECTOR_STORE_DIR.mkdir(parents=True, exist_ok=True)
+
+# Embedding configuration
+TEXT_EMBEDDING_MODEL = "sentence-transformers/all-MiniLM-L6-v2"
+EMBEDDING_DIMENSION = 384  # MiniLM output dimension
+
+# Chunking parameters
+TEXT_CHUNK_SIZE = 512  # tokens
+TEXT_CHUNK_OVERLAP = 50  # tokens
+
+# Similarity parameters
+TOP_K_MATCHES = 10  # Number of similar chunks to retrieve
+
+# Modality weights (Phase 1: text + tables only)
+# These weights must sum to 1.0
+MODALITY_WEIGHTS = {
+    "text": 0.60,
+    "table": 0.40
+}
+
+# File constraints
+MAX_FILE_SIZE_MB = 50
+ALLOWED_EXTENSIONS = [".pdf", ".docx"]
+
+# Future: LLM API keys (Phase 2)
+OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")
+ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY", "")
+HUGGINGFACE_TOKEN = os.getenv("HUGGINGFACE_TOKEN", "")
+
+# Logging
+LOG_LEVEL = "INFO"
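Two properties of this module worth noting: importing it has side effects (it creates `data/uploads` and `data/vector_stores` under `src/`), and the Phase 2 API keys all default to empty strings, so Phase 1 runs without any `.env` file. A small sanity-check sketch, assuming it is run from `src/` so that `import config` resolves:

```python
import config  # side effect: creates the data/ directories if missing

print(config.TEXT_EMBEDDING_MODEL)   # sentence-transformers/all-MiniLM-L6-v2
print(bool(config.OPENAI_API_KEY))   # False unless set in the environment
# The per-modality weights are expected to sum to 1.0 (see comment above).
assert abs(sum(config.MODALITY_WEIGHTS.values()) - 1.0) < 1e-9
```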
src/create_test_docs.py
ADDED
|
@@ -0,0 +1,189 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Script to create sample test documents for testing the document comparison app.
|
| 3 |
+
"""
|
| 4 |
+
from docx import Document
|
| 5 |
+
from docx.shared import Inches, Pt
|
| 6 |
+
from docx.enum.text import WD_ALIGN_PARAGRAPH
|
| 7 |
+
|
| 8 |
+
|
| 9 |
+
def create_test_doc1():
|
| 10 |
+
"""Create first test document."""
|
| 11 |
+
doc = Document()
|
| 12 |
+
|
| 13 |
+
# Add title
|
| 14 |
+
title = doc.add_heading('Product Requirements Document', 0)
|
| 15 |
+
title.alignment = WD_ALIGN_PARAGRAPH.CENTER
|
| 16 |
+
|
| 17 |
+
# Overview section
|
| 18 |
+
doc.add_heading('1. Overview', 1)
|
| 19 |
+
doc.add_paragraph(
|
| 20 |
+
'This document outlines the requirements for the new mobile application. '
|
| 21 |
+
'The app will provide users with real-time notifications and task management '
|
| 22 |
+
'capabilities. Our goal is to create an intuitive, user-friendly interface '
|
| 23 |
+
'that enhances productivity.'
|
| 24 |
+
)
|
| 25 |
+
|
| 26 |
+
# Features section
|
| 27 |
+
doc.add_heading('2. Features', 1)
|
| 28 |
+
doc.add_paragraph('The application will include the following key features:')
|
| 29 |
+
|
| 30 |
+
features = [
|
| 31 |
+
'User authentication with OAuth2 protocol',
|
| 32 |
+
'Push notifications for task updates and reminders',
|
| 33 |
+
'Calendar integration with Google Calendar and Outlook',
|
| 34 |
+
'Collaborative task sharing with team members',
|
| 35 |
+
'Real-time synchronization across devices'
|
| 36 |
+
]
|
| 37 |
+
|
| 38 |
+
for feature in features:
|
| 39 |
+
doc.add_paragraph(feature, style='List Bullet')
|
| 40 |
+
|
| 41 |
+
# Technical Specifications
|
| 42 |
+
doc.add_heading('3. Technical Specifications', 1)
|
| 43 |
+
doc.add_paragraph('The technology stack for this project:')
|
| 44 |
+
|
| 45 |
+
table = doc.add_table(rows=4, cols=3)
|
| 46 |
+
table.style = 'Medium Grid 1 Accent 1'
|
| 47 |
+
|
| 48 |
+
# Header row
|
| 49 |
+
hdr_cells = table.rows[0].cells
|
| 50 |
+
hdr_cells[0].text = 'Component'
|
| 51 |
+
hdr_cells[1].text = 'Technology'
|
| 52 |
+
hdr_cells[2].text = 'Version'
|
| 53 |
+
|
| 54 |
+
# Data rows
|
| 55 |
+
data = [
|
| 56 |
+
('Frontend', 'React Native', '0.72'),
|
| 57 |
+
('Backend', 'Node.js', '18.x'),
|
| 58 |
+
('Database', 'PostgreSQL', '15.0')
|
| 59 |
+
]
|
| 60 |
+
|
| 61 |
+
for i, (comp, tech, ver) in enumerate(data, start=1):
|
| 62 |
+
row = table.rows[i].cells
|
| 63 |
+
row[0].text = comp
|
| 64 |
+
row[1].text = tech
|
| 65 |
+
row[2].text = ver
|
| 66 |
+
|
| 67 |
+
# Timeline
|
| 68 |
+
doc.add_heading('4. Timeline', 1)
|
| 69 |
+
doc.add_paragraph(
|
| 70 |
+
'Phase 1: Requirements gathering - 2 weeks\n'
|
| 71 |
+
'Phase 2: Design and architecture - 3 weeks\n'
|
| 72 |
+
'Phase 3: Development - 8 weeks\n'
|
| 73 |
+
'Phase 4: Testing and QA - 2 weeks\n'
|
| 74 |
+
            'Phase 5: Deployment - 1 week'
        )

    doc.save('data/uploads/test_doc1.docx')
    print('✅ Created test_doc1.docx')


def create_test_doc2():
    """Create second test document (similar but with differences)."""
    doc = Document()

    # Add title
    title = doc.add_heading('Product Requirements Document', 0)
    title.alignment = WD_ALIGN_PARAGRAPH.CENTER

    # Overview section (similar wording)
    doc.add_heading('1. Overview', 1)
    doc.add_paragraph(
        'This document describes the specifications for a new mobile application. '
        'The application will offer users real-time alerts and project management '
        'features. We aim to build a streamlined, easy-to-use platform that '
        'boosts team efficiency.'
    )

    # Features section (some overlap, some new)
    doc.add_heading('2. Core Features', 1)
    doc.add_paragraph('Key functionality includes:')

    features = [
        'User login with OAuth2 authentication',
        'Real-time push notifications for updates',
        'Calendar synchronization with multiple platforms',
        'Team collaboration tools and shared workspaces',
        'Offline mode support for uninterrupted work',
        'File attachment and sharing capabilities'
    ]

    for feature in features:
        doc.add_paragraph(feature, style='List Bullet')

    # Technical Specifications (different technologies)
    doc.add_heading('3. Technology Stack', 1)
    doc.add_paragraph('Proposed technology choices:')

    table = doc.add_table(rows=5, cols=3)
    table.style = 'Medium Grid 1 Accent 1'

    # Header row
    hdr_cells = table.rows[0].cells
    hdr_cells[0].text = 'Component'
    hdr_cells[1].text = 'Technology'
    hdr_cells[2].text = 'Version'

    # Data rows (some different)
    data = [
        ('Frontend', 'React Native', '0.72'),
        ('Backend', 'Express.js', '4.18'),
        ('Database', 'MongoDB', '6.0'),
        ('Cache', 'Redis', '7.0')
    ]

    for i, (comp, tech, ver) in enumerate(data, start=1):
        row = table.rows[i].cells
        row[0].text = comp
        row[1].text = tech
        row[2].text = ver

    # Project Schedule (different from doc1)
    doc.add_heading('4. Project Schedule', 1)
    doc.add_paragraph(
        'Sprint 1: Planning and setup - 2 weeks\n'
        'Sprint 2-3: Core development - 6 weeks\n'
        'Sprint 4: Feature completion - 3 weeks\n'
        'Sprint 5: Testing phase - 3 weeks\n'
        'Sprint 6: Launch preparation - 1 week'
    )

    # Additional section (not in doc1)
    doc.add_heading('5. Budget Estimates', 1)
    doc.add_paragraph(
        'Development costs: $150,000\n'
        'Infrastructure: $20,000/year\n'
        'Maintenance: $30,000/year'
    )

    doc.save('data/uploads/test_doc2.docx')
    print('✅ Created test_doc2.docx')


def create_identical_doc():
    """Create a third document identical to doc1 for testing perfect match."""
    doc = Document()

    # Same as doc1
    title = doc.add_heading('Product Requirements Document', 0)
    title.alignment = WD_ALIGN_PARAGRAPH.CENTER

    doc.add_heading('1. Overview', 1)
    doc.add_paragraph(
        'This document outlines the requirements for the new mobile application. '
        'The app will provide users with real-time notifications and task management '
        'capabilities. Our goal is to create an intuitive, user-friendly interface '
        'that enhances productivity.'
    )

    doc.save('data/uploads/test_doc3_identical.docx')
    print('✅ Created test_doc3_identical.docx (identical to doc1)')


if __name__ == '__main__':
    print('Creating test documents...')
    create_test_doc1()
    create_test_doc2()
    create_identical_doc()
    print('\n✅ All test documents created successfully!')
    print('Documents saved in: data/uploads/')
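
A minimal smoke test for the script above (a sketch, not part of the commit; it assumes src/ is the working directory so the hard-coded data/uploads/ paths resolve, and that python-docx is installed):

# Hypothetical smoke test: regenerate the fixtures and confirm they exist.
from pathlib import Path

import create_test_docs  # the module shown above

Path('data/uploads').mkdir(parents=True, exist_ok=True)  # doc.save() needs the dir
create_test_docs.create_test_doc1()
create_test_docs.create_test_doc2()
create_test_docs.create_identical_doc()

for name in ('test_doc1.docx', 'test_doc2.docx', 'test_doc3_identical.docx'):
    assert (Path('data/uploads') / name).exists(), f'{name} was not created'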
src/models/__init__.py
ADDED
@@ -0,0 +1,14 @@

"""
Models package for document and chunk data structures.
"""
from models.document import RawDocument, DocumentChunk, ProcessedDocument, TableExtraction
from models.similarity import ModalityScore, SimilarityReport

__all__ = [
    "RawDocument",
    "DocumentChunk",
    "ProcessedDocument",
    "TableExtraction",
    "ModalityScore",
    "SimilarityReport",
]
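
Since the package re-exports everything through __all__, downstream code can import from the package root; a one-line sketch (assumes src/ is on sys.path, as streamlit_app.py arranges later in this commit):

from models import DocumentChunk, ProcessedDocument, SimilarityReport  # resolved via models/__init__.py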
src/models/__pycache__/__init__.cpython-312.pyc
ADDED
Binary file (532 Bytes).

src/models/__pycache__/__init__.cpython-313.pyc
ADDED
Binary file (532 Bytes).

src/models/__pycache__/document.cpython-312.pyc
ADDED
Binary file (6.33 kB).

src/models/__pycache__/document.cpython-313.pyc
ADDED
Binary file (6.28 kB).

src/models/__pycache__/similarity.cpython-312.pyc
ADDED
Binary file (3.3 kB).

src/models/__pycache__/similarity.cpython-313.pyc
ADDED
Binary file (3.35 kB).
src/models/document.py
ADDED
@@ -0,0 +1,142 @@

"""
Data models for documents and document chunks.
"""
from typing import List, Dict, Any
import uuid


class RawDocument:
    """Represents a raw document with extracted content."""

    def __init__(
        self,
        filename: str,
        file_type: str,
        pages: List[Dict[str, Any]],
        raw_text: str,
        raw_tables: List[Dict[str, Any]],
        total_pages: int,
    ):
        """
        Initialize a RawDocument.

        Args:
            filename: Name of the document file
            file_type: Type of file (e.g., 'pdf', 'docx')
            pages: List of page dictionaries with 'page_num' and 'text' keys
            raw_text: Full extracted text from the document
            raw_tables: List of tables extracted from the document
            total_pages: Total number of pages in the document
        """
        self.filename = filename
        self.file_type = file_type
        self.pages = pages
        self.raw_text = raw_text
        self.raw_tables = raw_tables
        self.total_pages = total_pages

    def __repr__(self) -> str:
        return f"RawDocument(filename={self.filename}, pages={self.total_pages})"


class DocumentChunk:
    """Represents a chunk of document content with metadata."""

    def __init__(
        self,
        content: str,
        chunk_type: str,
        page_number: int,
        metadata: Dict[str, Any] = None,
        chunk_id: str = None,
    ):
        """
        Initialize a DocumentChunk.

        Args:
            content: The text content of the chunk
            chunk_type: Type of chunk (e.g., 'text', 'table')
            page_number: Page number where this chunk appears
            metadata: Additional metadata about the chunk
            chunk_id: Unique identifier for the chunk (auto-generated if not provided)
        """
        self.content = content
        self.chunk_type = chunk_type
        self.page_number = page_number
        self.metadata = metadata or {}
        self.chunk_id = chunk_id or str(uuid.uuid4())

    def __repr__(self) -> str:
        return (
            f"DocumentChunk(type={self.chunk_type}, page={self.page_number}, "
            f"length={len(self.content)})"
        )


class TableExtraction:
    """Represents a table extracted from a document."""

    def __init__(
        self,
        headers: List[str],
        rows: List[List[str]],
        page_number: int,
        schema_summary: str,
        table_id: str = None,
    ):
        """
        Initialize a TableExtraction.

        Args:
            headers: List of column headers
            rows: List of rows, each containing cell values
            page_number: Page number where this table appears
            schema_summary: Summary description of the table schema
            table_id: Unique identifier for the table (auto-generated if not provided)
        """
        self.headers = headers
        self.rows = rows
        self.page_number = page_number
        self.schema_summary = schema_summary
        self.table_id = table_id or str(uuid.uuid4())

    def __repr__(self) -> str:
        return (
            f"TableExtraction(columns={len(self.headers)}, "
            f"rows={len(self.rows)}, page={self.page_number})"
        )


class ProcessedDocument:
    """Represents a fully processed document with text chunks and tables."""

    def __init__(
        self,
        filename: str,
        text_chunks: List[DocumentChunk],
        tables: List["TableExtraction"],
        total_pages: int,
        file_type: str,
    ):
        """
        Initialize a ProcessedDocument.

        Args:
            filename: Name of the document file
            text_chunks: List of text chunks extracted from the document
            tables: List of tables extracted from the document
            total_pages: Total number of pages in the document
            file_type: Type of file (e.g., 'pdf', 'docx')
        """
        self.filename = filename
        self.text_chunks = text_chunks
        self.tables = tables
        self.total_pages = total_pages
        self.file_type = file_type

    def __repr__(self) -> str:
        return (
            f"ProcessedDocument(filename={self.filename}, "
            f"text_chunks={len(self.text_chunks)}, "
            f"tables={len(self.tables)})"
        )
src/models/similarity.py
ADDED
@@ -0,0 +1,40 @@

"""
Data models for similarity scoring and comparison results.
"""
from typing import Dict, Any, List
from pydantic import BaseModel, Field
from datetime import datetime


class ModalityScore(BaseModel):
    """Represents similarity score for a specific modality (text, table, etc.)."""

    modality: str = Field(..., description="Type of modality (e.g., 'text', 'table')")
    score: float = Field(..., ge=0.0, le=1.0, description="Similarity score (0.0 to 1.0)")
    details: Dict[str, Any] = Field(default_factory=dict, description="Additional details about the scoring")
    matched_items: List[Dict[str, Any]] = Field(default_factory=list, description="List of matched items between documents")

    def __repr__(self) -> str:
        return f"ModalityScore(modality={self.modality}, score={self.score:.3f})"


class SimilarityReport(BaseModel):
    """Contains comprehensive similarity comparison results between two documents."""

    doc1_name: str = Field(..., description="Name of first document")
    doc2_name: str = Field(..., description="Name of second document")
    overall_score: float = Field(..., ge=0.0, le=1.0, description="Overall similarity score (0.0 to 1.0)")
    text_score: ModalityScore = Field(..., description="ModalityScore for text")
    table_score: ModalityScore = Field(..., description="ModalityScore for tables")
    matched_sections: List[Dict[str, Any]] = Field(default_factory=list, description="List of matched sections with details")
    weights_used: Dict[str, float] = Field(default_factory=dict, description="Weights used for modality scoring")
    timestamp: datetime = Field(default_factory=datetime.now, description="Time when report was generated")

    class Config:
        arbitrary_types_allowed = True

    def __repr__(self) -> str:
        return (
            f"SimilarityReport(docs={self.doc1_name} vs {self.doc2_name}, "
            f"score={self.overall_score:.3f})"
        )
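
A small sketch of the validation those Field constraints buy (invented values, not part of the commit):

# Hypothetical: Field(ge=0.0, le=1.0) rejects out-of-range scores at construction.
from pydantic import ValidationError
from models.similarity import ModalityScore

score = ModalityScore(modality="text", score=0.87)
print(repr(score))  # ModalityScore(modality=text, score=0.870)

try:
    ModalityScore(modality="text", score=1.2)  # violates le=1.0
except ValidationError:
    print("out-of-range score rejected at construction")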
src/orchestrator/__init__.py
ADDED
File without changes

src/orchestrator/__pycache__/__init__.cpython-312.pyc
ADDED
Binary file (185 Bytes).

src/orchestrator/__pycache__/__init__.cpython-313.pyc
ADDED
Binary file (185 Bytes).

src/orchestrator/__pycache__/scorers.cpython-312.pyc
ADDED
Binary file (6.52 kB).

src/orchestrator/__pycache__/scorers.cpython-313.pyc
ADDED
Binary file (6.37 kB).

src/orchestrator/__pycache__/similarity_orchestrator.cpython-312.pyc
ADDED
Binary file (4.64 kB).

src/orchestrator/__pycache__/similarity_orchestrator.cpython-313.pyc
ADDED
Binary file (4.55 kB).
src/orchestrator/scorers.py
ADDED
@@ -0,0 +1,197 @@

"""
Similarity scorers for different modalities.
"""
import numpy as np
from typing import List, Dict, Any, Tuple
from sklearn.metrics.pairwise import cosine_similarity

from models.similarity import ModalityScore
from models.document import DocumentChunk, TableExtraction
import config


def compute_text_similarity(
    doc1_chunks: List[DocumentChunk],
    doc1_embeddings: np.ndarray,
    doc2_chunks: List[DocumentChunk],
    doc2_embeddings: np.ndarray
) -> ModalityScore:
    """
    Compute text similarity between two documents.

    Args:
        doc1_chunks: Chunks from document 1
        doc1_embeddings: Embeddings for document 1
        doc2_chunks: Chunks from document 2
        doc2_embeddings: Embeddings for document 2

    Returns:
        ModalityScore with text similarity details
    """
    if len(doc1_embeddings) == 0 or len(doc2_embeddings) == 0:
        return ModalityScore(
            modality="text",
            score=0.0,
            details={"reason": "One or both documents have no text"},
            matched_items=[]
        )

    # Compute pairwise cosine similarities
    similarities = cosine_similarity(doc1_embeddings, doc2_embeddings)

    # Find best matches for each chunk in doc1
    matched_items = []
    similarity_scores = []

    for i, chunk1 in enumerate(doc1_chunks):
        # Find best matching chunk in doc2
        best_match_idx = np.argmax(similarities[i])
        best_score = similarities[i][best_match_idx]

        if best_score > 0.5:  # Only include matches above threshold
            chunk2 = doc2_chunks[best_match_idx]

            matched_items.append({
                "doc1_chunk_id": chunk1.chunk_id,
                "doc2_chunk_id": chunk2.chunk_id,
                "doc1_content": chunk1.content[:200] + "..." if len(chunk1.content) > 200 else chunk1.content,
                "doc2_content": chunk2.content[:200] + "..." if len(chunk2.content) > 200 else chunk2.content,
                "similarity": float(best_score),
                "doc1_page": chunk1.page_number,
                "doc2_page": chunk2.page_number
            })

        similarity_scores.append(best_score)

    # Overall text score (mean of best matches)
    overall_score = float(np.mean(similarity_scores)) if similarity_scores else 0.0

    # Sort matched items by similarity (descending)
    matched_items.sort(key=lambda x: x["similarity"], reverse=True)

    return ModalityScore(
        modality="text",
        score=overall_score,
        details={
            "num_doc1_chunks": len(doc1_chunks),
            "num_doc2_chunks": len(doc2_chunks),
            "num_matches": len(matched_items),
            "average_similarity": overall_score
        },
        matched_items=matched_items[:config.TOP_K_MATCHES]  # Limit to top K
    )


def compute_table_similarity(
    doc1_tables: List[TableExtraction],
    doc1_embeddings: np.ndarray,
    doc2_tables: List[TableExtraction],
    doc2_embeddings: np.ndarray
) -> ModalityScore:
    """
    Compute table similarity between two documents.

    Args:
        doc1_tables: Tables from document 1
        doc1_embeddings: Embeddings for document 1 tables
        doc2_tables: Tables from document 2
        doc2_embeddings: Embeddings for document 2 tables

    Returns:
        ModalityScore with table similarity details
    """
    if len(doc1_tables) == 0 and len(doc2_tables) == 0:
        # Both documents have no tables - perfectly similar in this modality
        return ModalityScore(
            modality="table",
            score=1.0,
            details={"reason": "Neither document has tables"},
            matched_items=[]
        )

    if len(doc1_embeddings) == 0 or len(doc2_embeddings) == 0:
        # One has tables, the other doesn't
        return ModalityScore(
            modality="table",
            score=0.0,
            details={"reason": "One document has tables, the other doesn't"},
            matched_items=[]
        )

    # Compute pairwise cosine similarities
    similarities = cosine_similarity(doc1_embeddings, doc2_embeddings)

    # Find best matches
    matched_items = []
    similarity_scores = []

    for i, table1 in enumerate(doc1_tables):
        # Find best matching table in doc2
        best_match_idx = np.argmax(similarities[i])
        best_score = similarities[i][best_match_idx]

        if best_score > 0.3:  # Lower threshold for tables
            table2 = doc2_tables[best_match_idx]

            matched_items.append({
                "doc1_table_id": table1.table_id,
                "doc2_table_id": table2.table_id,
                "doc1_schema": table1.schema_summary,
                "doc2_schema": table2.schema_summary,
                "similarity": float(best_score),
                "doc1_page": table1.page_number,
                "doc2_page": table2.page_number
            })

        similarity_scores.append(best_score)

    # Overall table score
    overall_score = float(np.mean(similarity_scores)) if similarity_scores else 0.0

    # Sort matched items by similarity
    matched_items.sort(key=lambda x: x["similarity"], reverse=True)

    return ModalityScore(
        modality="table",
        score=overall_score,
        details={
            "num_doc1_tables": len(doc1_tables),
            "num_doc2_tables": len(doc2_tables),
            "num_matches": len(matched_items),
            "average_similarity": overall_score
        },
        matched_items=matched_items
    )


def compute_weighted_score(
    modality_scores: Dict[str, ModalityScore],
    weights: Dict[str, float] = None
) -> float:
    """
    Compute weighted overall similarity score.

    Args:
        modality_scores: Dictionary of modality -> ModalityScore
        weights: Dictionary of modality -> weight (defaults to config.MODALITY_WEIGHTS)

    Returns:
        Weighted overall score (0.0 to 1.0)
    """
    if weights is None:
        weights = config.MODALITY_WEIGHTS

    total_score = 0.0
    total_weight = 0.0

    for modality, score_obj in modality_scores.items():
        if modality in weights:
            weight = weights[modality]
            total_score += score_obj.score * weight
            total_weight += weight

    # Normalize by total weight
    if total_weight > 0:
        return total_score / total_weight
    else:
        return 0.0
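
To make compute_weighted_score concrete, a worked example with invented scores (the 0.7/0.3 weights here are assumptions for illustration, not necessarily the values in config.MODALITY_WEIGHTS):

# Hypothetical: text scores 0.80 at weight 0.7, tables score 0.50 at weight 0.3.
from models.similarity import ModalityScore
from orchestrator.scorers import compute_weighted_score

scores = {
    "text": ModalityScore(modality="text", score=0.80),
    "table": ModalityScore(modality="table", score=0.50),
}
# (0.80 * 0.7 + 0.50 * 0.3) / (0.7 + 0.3) = 0.71
overall = compute_weighted_score(scores, weights={"text": 0.7, "table": 0.3})
print(f"{overall:.2f}")  # 0.71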
src/orchestrator/similarity_orchestrator.py
ADDED
@@ -0,0 +1,130 @@

"""
Similarity orchestrator for coordinating document comparison across modalities.
"""
from typing import Dict, Tuple
import numpy as np

from models.document import ProcessedDocument
from models.similarity import SimilarityReport, ModalityScore
from orchestrator.scorers import (
    compute_text_similarity,
    compute_table_similarity,
    compute_weighted_score
)
import config


class SimilarityOrchestrator:
    """Orchestrates similarity comparison across multiple modalities."""

    def __init__(self, weights: Dict[str, float] = None):
        """
        Initialize orchestrator.

        Args:
            weights: Custom modality weights (defaults to config.MODALITY_WEIGHTS)
        """
        self.weights = weights or config.MODALITY_WEIGHTS

    async def compare_documents(
        self,
        doc1: ProcessedDocument,
        doc1_text_embeddings: np.ndarray,
        doc1_table_embeddings: np.ndarray,
        doc2: ProcessedDocument,
        doc2_text_embeddings: np.ndarray,
        doc2_table_embeddings: np.ndarray
    ) -> SimilarityReport:
        """
        Compare two processed documents across all modalities.

        Args:
            doc1: First processed document
            doc1_text_embeddings: Text embeddings for doc1
            doc1_table_embeddings: Table embeddings for doc1
            doc2: Second processed document
            doc2_text_embeddings: Text embeddings for doc2
            doc2_table_embeddings: Table embeddings for doc2

        Returns:
            SimilarityReport with overall score and per-modality details
        """
        # Compute text similarity
        text_score = compute_text_similarity(
            doc1.text_chunks,
            doc1_text_embeddings,
            doc2.text_chunks,
            doc2_text_embeddings
        )

        # Compute table similarity
        table_score = compute_table_similarity(
            doc1.tables,
            doc1_table_embeddings,
            doc2.tables,
            doc2_table_embeddings
        )

        # Compute weighted overall score
        modality_scores = {
            "text": text_score,
            "table": table_score
        }

        overall_score = compute_weighted_score(modality_scores, self.weights)

        # Compile matched sections from both modalities
        matched_sections = []

        # Add top text matches
        for match in text_score.matched_items[:5]:  # Top 5 text matches
            matched_sections.append({
                "type": "text",
                "doc1_content": match["doc1_content"],
                "doc2_content": match["doc2_content"],
                "similarity": match["similarity"],
                "doc1_page": match["doc1_page"],
                "doc2_page": match["doc2_page"]
            })

        # Add top table matches
        for match in table_score.matched_items[:3]:  # Top 3 table matches
            matched_sections.append({
                "type": "table",
                "doc1_schema": match["doc1_schema"],
                "doc2_schema": match["doc2_schema"],
                "similarity": match["similarity"],
                "doc1_page": match["doc1_page"],
                "doc2_page": match["doc2_page"]
            })

        # Sort all matched sections by similarity
        matched_sections.sort(key=lambda x: x["similarity"], reverse=True)

        # Create report
        report = SimilarityReport(
            doc1_name=doc1.filename,
            doc2_name=doc2.filename,
            overall_score=overall_score,
            text_score=text_score,
            table_score=table_score,
            matched_sections=matched_sections,
            weights_used=self.weights
        )

        return report

    def adjust_weights(self, new_weights: Dict[str, float]) -> None:
        """
        Adjust modality weights.

        Args:
            new_weights: New weight dictionary
        """
        # Validate weights sum to 1.0
        total = sum(new_weights.values())
        if abs(total - 1.0) > 0.01:
            # Normalize weights
            self.weights = {k: v / total for k, v in new_weights.items()}
        else:
            self.weights = new_weights
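
adjust_weights normalizes any weight set that does not already sum to 1.0; a quick sketch with invented weights:

# Hypothetical: weights summing to 3.0 get rescaled to sum to 1.0.
from orchestrator.similarity_orchestrator import SimilarityOrchestrator

orch = SimilarityOrchestrator(weights={"text": 0.7, "table": 0.3})
orch.adjust_weights({"text": 2.0, "table": 1.0})
print(orch.weights)  # {'text': 0.666..., 'table': 0.333...}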
src/requirements-alternative.txt
ADDED
@@ -0,0 +1,37 @@

# Alternative requirements.txt using pypdf instead of PyMuPDF
# Use this if PyMuPDF installation fails

# Core framework
streamlit>=1.31.0

# Data models
pydantic>=2.6.0

# Document parsing - Alternative approach
pypdf>=4.0.0  # Simpler PDF library with pure Python implementation
python-docx>=1.1.0
pdfplumber>=0.10.0

# ML & Embeddings
sentence-transformers>=2.3.0
torch>=2.2.0

# Vector storage
faiss-cpu>=1.7.0

# Data processing
numpy>=1.26.0
pandas>=2.2.0
Pillow>=10.2.0

# Utilities
python-dotenv>=1.0.0

# Visualization
plotly>=5.18.0

# Async
aiofiles>=23.2.0

# Similarity metrics
scikit-learn>=1.3.0
src/requirements.txt
ADDED
@@ -0,0 +1,35 @@

# Core framework
streamlit>=1.31.0

# Data models
pydantic>=2.6.0

# Document parsing - using versions compatible with Python 3.13
# Use pypdf if PyMuPDF has DLL issues on Windows
pypdf>=4.0.0  # Fallback PDF parser (pure Python, no DLL dependencies)
python-docx>=1.1.0
pdfplumber>=0.10.0

# ML & Embeddings
sentence-transformers>=2.3.0
torch>=2.2.0

# Vector storage
faiss-cpu>=1.7.0

# Data processing
numpy>=1.26.0
pandas>=2.2.0
Pillow>=10.2.0

# Utilities
python-dotenv>=1.0.0

# Visualization
plotly>=5.18.0

# Async
aiofiles>=23.2.0

# Similarity metrics
scikit-learn>=1.3.0
src/storage/__init__.py
ADDED
File without changes
src/storage/vector_store.py
ADDED
@@ -0,0 +1,183 @@

"""
Vector storage using FAISS for similarity search.
"""
import faiss
import numpy as np
from typing import Dict, List, Tuple, Optional, Any
from pathlib import Path
import pickle

import config


class MultiModalVectorStore:
    """Vector store for managing multi-modal embeddings using FAISS."""

    def __init__(self):
        self.indices: Dict[str, faiss.Index] = {}  # modality -> FAISS index
        self.metadata: Dict[str, List[Dict[str, Any]]] = {}  # modality -> list of metadata
        self.dimension = config.EMBEDDING_DIMENSION

    def add_vectors(
        self,
        modality: str,
        embeddings: np.ndarray,
        metadata: List[Dict[str, Any]]
    ) -> None:
        """
        Add vectors to the store for a specific modality.

        Args:
            modality: Modality type ('text' or 'table')
            embeddings: Numpy array of embeddings (num_vectors x dimension)
            metadata: List of metadata dicts for each vector
        """
        if len(embeddings) == 0:
            return

        # Ensure embeddings are float32 (required by FAISS)
        embeddings = embeddings.astype(np.float32)

        # Create index if it doesn't exist
        if modality not in self.indices:
            self.indices[modality] = faiss.IndexFlatL2(self.dimension)
            self.metadata[modality] = []

        # Add vectors to index
        self.indices[modality].add(embeddings)

        # Add metadata
        self.metadata[modality].extend(metadata)

    def query_similar(
        self,
        modality: str,
        query_vector: np.ndarray,
        k: int = 10
    ) -> List[Tuple[int, float, Dict[str, Any]]]:
        """
        Query for similar vectors.

        Args:
            modality: Modality type to search in
            query_vector: Query vector (1D array of dimension)
            k: Number of results to return

        Returns:
            List of (index, distance, metadata) tuples
        """
        if modality not in self.indices or self.indices[modality].ntotal == 0:
            return []

        # Ensure query vector is 2D and float32
        if query_vector.ndim == 1:
            query_vector = query_vector.reshape(1, -1)
        query_vector = query_vector.astype(np.float32)

        # Search
        k = min(k, self.indices[modality].ntotal)
        distances, indices = self.indices[modality].search(query_vector, k)

        # Compile results
        results = []
        for idx, distance in zip(indices[0], distances[0]):
            if idx < len(self.metadata[modality]):
                results.append((
                    int(idx),
                    float(distance),
                    self.metadata[modality][idx]
                ))

        return results

    def get_all_vectors(self, modality: str) -> Tuple[Optional[np.ndarray], List[Dict[str, Any]]]:
        """
        Get all vectors and metadata for a modality.

        Args:
            modality: Modality type

        Returns:
            Tuple of (embeddings array, metadata list)
        """
        if modality not in self.indices or self.indices[modality].ntotal == 0:
            return None, []

        # Reconstruct vectors from index
        num_vectors = self.indices[modality].ntotal
        embeddings = faiss.rev_swig_ptr(
            self.indices[modality].get_xb(),
            num_vectors * self.dimension
        ).reshape(num_vectors, self.dimension)

        return embeddings, self.metadata[modality]

    def get_num_vectors(self, modality: str) -> int:
        """
        Get number of vectors stored for a modality.

        Args:
            modality: Modality type

        Returns:
            Number of vectors
        """
        if modality not in self.indices:
            return 0
        return self.indices[modality].ntotal

    def save(self, filename_prefix: str) -> None:
        """
        Save indices and metadata to disk.

        Args:
            filename_prefix: Prefix for saved files
        """
        save_dir = config.VECTOR_STORE_DIR
        save_dir.mkdir(parents=True, exist_ok=True)

        for modality, index in self.indices.items():
            # Save FAISS index
            index_path = save_dir / f"{filename_prefix}_{modality}.faiss"
            faiss.write_index(index, str(index_path))

            # Save metadata
            metadata_path = save_dir / f"{filename_prefix}_{modality}_metadata.pkl"
            with open(metadata_path, "wb") as f:
                pickle.dump(self.metadata[modality], f)

    def load(self, filename_prefix: str) -> bool:
        """
        Load indices and metadata from disk.

        Args:
            filename_prefix: Prefix of saved files

        Returns:
            True if loaded successfully, False otherwise
        """
        load_dir = config.VECTOR_STORE_DIR

        try:
            # Find all index files with this prefix
            for modality in ["text", "table"]:
                index_path = load_dir / f"{filename_prefix}_{modality}.faiss"
                metadata_path = load_dir / f"{filename_prefix}_{modality}_metadata.pkl"

                if index_path.exists() and metadata_path.exists():
                    # Load FAISS index
                    self.indices[modality] = faiss.read_index(str(index_path))

                    # Load metadata
                    with open(metadata_path, "rb") as f:
                        self.metadata[modality] = pickle.load(f)

            return True
        except Exception as e:
            print(f"Error loading vector store: {e}")
            return False

    def clear(self) -> None:
        """Clear all indices and metadata."""
        self.indices.clear()
        self.metadata.clear()
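
A round-trip sketch of the store (not part of the commit; assumes config.EMBEDDING_DIMENSION matches the test vectors, and the vectors themselves are random):

# Hypothetical usage: add a few vectors, then query with one of them.
import numpy as np
from storage.vector_store import MultiModalVectorStore

store = MultiModalVectorStore()
vecs = np.random.rand(4, store.dimension).astype(np.float32)
store.add_vectors("text", vecs, metadata=[{"chunk_id": i} for i in range(4)])

# Querying with the first vector should return it first, with L2 distance ~0.0.
for idx, dist, meta in store.query_similar("text", vecs[0], k=2):
    print(idx, round(dist, 4), meta)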
src/streamlit_app.py
CHANGED
@@ -1,40 +1,301 @@

Removed: the stock Streamlit demo app (welcome text pointing to https://docs.streamlit.io and https://discuss.streamlit.io, plus the example spiral chart that built x/y points with numpy and rendered them via st.altair_chart). Only `import streamlit as st` carries over into the new file.

Added:

"""
Multi-Agent Document Comparison Streamlit App
"""
import sys
from pathlib import Path

# Add project root to Python path for imports
project_root = Path(__file__).parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

import streamlit as st
import asyncio
import json

# Import agents and utilities
from agents.ingestion_agent import IngestionAgent
from agents.text_agent import TextAgent
from agents.table_agent import TableAgent
from orchestrator.similarity_orchestrator import SimilarityOrchestrator
from utils.file_handler import save_uploaded_file, validate_file, get_file_type
from utils.visualization import (
    create_similarity_gauge,
    create_modality_breakdown_chart,
    format_matched_sections,
    create_score_legend
)
from models.document import ProcessedDocument
import config


# Page configuration
st.set_page_config(
    page_title="Multi-Agent Document Comparator",
    page_icon="📄",
    layout="wide",
    initial_sidebar_state="expanded"
)


def main():
    """Main application function."""

    # Header
    st.title("📄 Multi-Agent Document Comparator")
    st.markdown("**An agentic system to accurately match document similarity**")

    # Show architecture diagram
    with st.expander("🏗️ View System Architecture", expanded=False):
        arch_path = Path("img/multi_agent_doc_similarity_architecture.svg")
        if arch_path.exists():
            st.image(str(arch_path), use_container_width=True)
        else:
            st.info("Architecture diagram not found")

    st.markdown("---")

    # Sidebar configuration
    with st.sidebar:
        st.header("⚙️ Configuration")

        # Modality weights
        st.subheader("Modality Weights")
        text_weight = st.slider(
            "Text Weight",
            min_value=0.0,
            max_value=1.0,
            value=config.MODALITY_WEIGHTS["text"],
            step=0.05
        )
        table_weight = 1.0 - text_weight

        st.write(f"Table Weight: {table_weight:.2f}")

        # Phase info
        st.markdown("---")
        st.subheader("📋 Phase 1 Implementation")
        st.write("✅ Text comparison")
        st.write("✅ Table comparison")
        st.write("⏳ Image comparison (Phase 2)")
        st.write("⏳ Layout comparison (Phase 2)")
        st.write("⏳ Metadata comparison (Phase 2)")

    # Main content area
    col1, col2 = st.columns(2)

    with col1:
        st.subheader("📤 Document 1 (Main)")
        uploaded_file1 = st.file_uploader(
            "Upload PDF or DOCX",
            type=["pdf", "docx"],
            key="file1",
            help="Maximum file size: 50MB"
        )

    with col2:
        st.subheader("📤 Document 2 (Comparison)")
        uploaded_file2 = st.file_uploader(
            "Upload PDF or DOCX",
            type=["pdf", "docx"],
            key="file2",
            help="Maximum file size: 50MB"
        )

    # Compare button
    st.markdown("---")

    if st.button("🔍 Compare Documents", type="primary", use_container_width=True):
        if not uploaded_file1 or not uploaded_file2:
            st.error("Please upload both documents before comparing.")
            return

        # Process documents and compare
        with st.spinner("Processing documents..."):
            try:
                # Save uploaded files
                file1_path = save_uploaded_file(uploaded_file1)
                file2_path = save_uploaded_file(uploaded_file2)

                # Validate files
                valid1, error1 = validate_file(file1_path)
                valid2, error2 = validate_file(file2_path)

                if not valid1:
                    st.error(f"Document 1 error: {error1}")
                    return
                if not valid2:
                    st.error(f"Document 2 error: {error2}")
                    return

                # Process documents
                report = asyncio.run(process_and_compare(
                    file1_path,
                    file2_path,
                    {"text": text_weight, "table": table_weight}
                ))

                # Display results
                display_results(report)

            except Exception as e:
                st.error(f"An error occurred: {str(e)}")
                import traceback
                st.code(traceback.format_exc())


async def process_and_compare(file1_path: str, file2_path: str, weights: dict):
    """
    Process two documents and compare them.

    Args:
        file1_path: Path to first document
        file2_path: Path to second document
        weights: Modality weights

    Returns:
        SimilarityReport
    """
    # Initialize agents
    ingestion_agent = IngestionAgent()
    text_agent = TextAgent()
    table_agent = TableAgent()
    orchestrator = SimilarityOrchestrator(weights=weights)

    # Progress tracking
    progress_bar = st.progress(0)
    status_text = st.empty()

    # Step 1: Ingest documents
    status_text.text("⏳ Ingesting documents...")
    progress_bar.progress(10)

    raw_doc1 = await ingestion_agent.process(file1_path)
    raw_doc2 = await ingestion_agent.process(file2_path)

    progress_bar.progress(25)

    # Step 2: Extract text
    status_text.text("⏳ Extracting and embedding text...")

    text_chunks1, text_embeddings1 = await text_agent.process(raw_doc1)
    text_chunks2, text_embeddings2 = await text_agent.process(raw_doc2)

    progress_bar.progress(50)

    # Step 3: Extract tables
    status_text.text("⏳ Extracting and embedding tables...")

    tables1, table_embeddings1 = await table_agent.process(raw_doc1)
    tables2, table_embeddings2 = await table_agent.process(raw_doc2)

    progress_bar.progress(75)

    # Step 4: Create processed documents
    processed_doc1 = ProcessedDocument(
        filename=raw_doc1.filename,
        text_chunks=text_chunks1,
        tables=tables1,
        total_pages=raw_doc1.total_pages,
        file_type=raw_doc1.file_type
    )

    processed_doc2 = ProcessedDocument(
        filename=raw_doc2.filename,
        text_chunks=text_chunks2,
        tables=tables2,
        total_pages=raw_doc2.total_pages,
        file_type=raw_doc2.file_type
    )

    # Step 5: Compare documents
    status_text.text("⏳ Comparing documents...")

    report = await orchestrator.compare_documents(
        processed_doc1,
        text_embeddings1,
        table_embeddings1,
        processed_doc2,
        text_embeddings2,
        table_embeddings2
    )

    progress_bar.progress(100)
    status_text.text("✅ Comparison complete!")

    return report


def display_results(report):
    """
    Display comparison results.

    Args:
        report: SimilarityReport object
    """
    st.markdown("---")
    st.header("📊 Comparison Results")

    # Overall similarity gauge
    col1, col2 = st.columns([1, 1])

    with col1:
        gauge_fig = create_similarity_gauge(report.overall_score)
        st.plotly_chart(gauge_fig, use_container_width=True)

    with col2:
        st.markdown(create_score_legend())

    # Modality breakdown
    st.markdown("---")
    st.subheader("📊 Per-Modality Breakdown")

    breakdown_fig = create_modality_breakdown_chart(report)
    st.plotly_chart(breakdown_fig, use_container_width=True)

    # Detailed scores
    col1, col2 = st.columns(2)

    with col1:
        if report.text_score:
            st.metric(
                "Text Similarity",
                f"{report.text_score.score:.1%}",
                f"{report.text_score.details.get('num_matches', 0)} matches"
            )

    with col2:
        if report.table_score:
            st.metric(
                "Table Similarity",
                f"{report.table_score.score:.1%}",
                f"{report.table_score.details.get('num_matches', 0)} matches"
            )

    # Matched sections
    st.markdown("---")
    st.subheader("🔍 Top Matched Sections")

    if report.matched_sections:
        formatted_sections = format_matched_sections(report.matched_sections[:5])
        st.markdown(formatted_sections)
    else:
        st.info("No significant matches found between documents.")

    # Download report
    st.markdown("---")
    report_json = json.dumps(report.model_dump(), indent=2, default=str)

    col1, col2, col3 = st.columns([1, 1, 2])

    with col1:
        st.download_button(
            label="📥 Download Report (JSON)",
            data=report_json,
            file_name=f"similarity_report_{report.timestamp.strftime('%Y%m%d_%H%M%S')}.json",
            mime="application/json"
        )


if __name__ == "__main__":
    main()
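
A usage note (not part of the diff): after installing src/requirements.txt, the new app is launched with the standard `streamlit run src/streamlit_app.py`, preferably from the repository root so the relative `img/` and `data/uploads/` paths used above resolve.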
src/utils/__init__.py
ADDED
File without changes

src/utils/__pycache__/__init__.cpython-312.pyc
ADDED
Binary file (178 Bytes).

src/utils/__pycache__/__init__.cpython-313.pyc
ADDED
Binary file (178 Bytes).

src/utils/__pycache__/file_handler.cpython-312.pyc
ADDED
Binary file (4.59 kB).

src/utils/__pycache__/visualization.cpython-312.pyc
ADDED
Binary file (8.29 kB).

src/utils/__pycache__/visualization.cpython-313.pyc
ADDED
Binary file (8.23 kB).