Spaces: sukhrobnurali/financial-document-analyzer

sukhrobnurali committed on Nov 27, 2025

Commit

a921556

1 Parent(s): f10f47f

v1 Application is Ready


Files changed (9)

.dockerignore +39 -0
.env.example +1 -0
.gitignore +9 -0
Dockerfile +41 -0
README.md +130 -6
app.py +297 -0
criteria.py +160 -0
document_processor.py +153 -0
requirements.txt +5 -0

.dockerignore ADDED

	@@ -0,0 +1,39 @@

+# Python
+__pycache__/
+*.py[cod]
+*$py.class
+*.so
+.Python
+# Environment
+.env
+.venv
+venv/
+ENV/
+# IDE
+.vscode/
+.idea/
+*.swp
+*.swo
+# Testing
+.pytest_cache/
+.coverage
+htmlcov/
+# Documentation (not needed in the image)
+QUICKSTART.md
+DEPLOYMENT.md
+# Git
+.git/
+.gitignore
+# PDFs and uploads
+*.pdf
+uploads/
+# OS
+.DS_Store
+Thumbs.db

.env.example ADDED

	@@ -0,0 +1 @@


+OPENAI_API_KEY=your_openai_api_key_here

.gitignore ADDED

	@@ -0,0 +1,9 @@

+__pycache__/
+*.py[cod]
+*$py.class
+.env
+.venv
+venv/
+*.pdf
+.streamlit/
+uploads/

Dockerfile ADDED

	@@ -0,0 +1,41 @@

+# Use Python 3.11 slim image for smaller size
+FROM python:3.11-slim
+# Set working directory
+WORKDIR /app
+# Install system dependencies (build tools for Python packages, curl for the health check)
+RUN apt-get update && apt-get install -y \
+    build-essential \
+    curl \
+    && rm -rf /var/lib/apt/lists/*
+# Copy requirements first for better layer caching
+COPY requirements.txt .
+# Install Python dependencies
+RUN pip install --no-cache-dir -r requirements.txt
+# Copy application files
+COPY app.py .
+COPY document_processor.py .
+COPY criteria.py .
+COPY README.md .
+# Create directory for temporary file uploads
+RUN mkdir -p /app/uploads
+# Expose port 7860 (Hugging Face Spaces standard port)
+EXPOSE 7860
+# Set environment variables for Streamlit
+ENV STREAMLIT_SERVER_PORT=7860
+ENV STREAMLIT_SERVER_ADDRESS=0.0.0.0
+ENV STREAMLIT_SERVER_HEADLESS=true
+ENV STREAMLIT_BROWSER_GATHER_USAGE_STATS=false
+# Health check
+HEALTHCHECK CMD curl --fail http://localhost:7860/_stcore/health || exit 1
+# Run the Streamlit app
+CMD ["streamlit", "run", "app.py", "--server.port=7860", "--server.address=0.0.0.0"]

README.md CHANGED

@@ -1,11 +1,135 @@
 ---
-title: Financial Document Analyzer
-emoji: 🐢
 colorFrom: blue
-colorTo: purple
-sdk: docker
 pinned: false
-short_description: An AI-powered financial document analyzer that screens compa
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: Intelligent Investment Screener
+emoji: 📊
 colorFrom: blue
+colorTo: green
+sdk: streamlit
+sdk_version: 1.31.0
+app_file: app.py
 pinned: false
+license: mit
 ---
+# Intelligent Investment Screener
+An AI-powered financial document analyzer that screens company annual reports against specific investment criteria using RAG (Retrieval-Augmented Generation).
+## What It Does
+Instead of manually reading 100-page annual reports to find specific financial metrics, this tool:
+1. **Accepts** company financial documents (10-K, Annual Reports)
+2. **Extracts** key financial metrics (debt ratios, revenue breakdown, etc.)
+3. **Analyzes** them against customizable investment criteria
+4. **Returns** a Pass/Fail decision with **citations** (page numbers and sections)
+## Key Features
+### Citation-Based Analysis
+Every finding includes:
+- Exact page number
+- Specific section or table name
+- Relevance score
+This transforms the tool from a "magic box" to a **trusted, verifiable assistant**.
+### Multiple Screening Criteria
+1. **Shariah Compliance**: Islamic finance screening
+   - Debt ratio < 33%
+   - Interest income < 5%
+   - No prohibited activities (alcohol, gambling, etc.)
+2. **ESG (Environmental, Social, Governance)**: Sustainable investing
+   - Carbon emissions disclosure
+   - Board diversity > 30%
+   - No environmental violations
+   - Labor practice compliance
+3. **Value Investing**: Traditional value metrics
+   - P/E ratio < 15
+   - Debt to Equity < 0.5
+   - Positive free cash flow
+   - Revenue growth > 5%
+## Technical Architecture
+### Tech Stack
+- **Frontend**: Streamlit
+- **LLM**: OpenAI GPT-4o-mini
+- **RAG Framework**: LlamaIndex
+### How It Works
+```
+PDF Upload → LlamaIndex Parser → Vector Index → OpenAI Analysis → Cited Results
+```
+1. **Document Loading**: LlamaIndex parses the PDF and preserves page metadata
+2. **Vector Indexing**: Creates searchable embeddings of document chunks
+3. **Criteria Analysis**: OpenAI GPT-4o-mini analyzes relevant sections against rules
+4. **Citation Extraction**: Page numbers and sections are tracked throughout
+5. **Results Display**: Pass/Fail with verifiable citations
+## Quick Start
+### Running Locally
+```bash
+# Clone the repository
+git clone <your-repo-url>
+cd investment-screener
+# Install dependencies
+pip install -r requirements.txt
+# Set up environment variables
+cp .env.example .env
+# Add your OPENAI_API_KEY to .env
+# Run the app
+streamlit run app.py
+```
+### Get an OpenAI API Key
+1. Visit [OpenAI Platform](https://platform.openai.com/api-keys)
+2. Sign up or log in
+3. Click "Create new secret key"
+4. Copy the key and add it to `.env` or enter it in the app sidebar
+## Usage
+1. **Select Screening Criteria**: Choose from Shariah, ESG, or Value Investing
+2. **Upload Document**: Upload an annual report or 10-K filing (PDF)
+3. **Analyze**: Click the analyze button
+4. **Review Results**:
+   - Overall Pass/Fail decision
+   - Detailed metric-by-metric breakdown
+   - Citations with page numbers for verification
+## Example Use Cases
+### Shariah Compliance Screening
+Investors following Islamic finance principles need to ensure companies:
+- Don't derive significant income from interest
+- Maintain acceptable debt levels
+- Don't operate in prohibited industries
+### ESG Screening
+Socially responsible investors want to verify:
+- Environmental impact disclosures
+- Corporate governance practices
+- Social responsibility metrics
+### Value Investing
+Traditional investors need quick access to:
+- Valuation ratios (P/E, P/B)
+- Financial health metrics
+- Growth indicators
+---
+**Note**: This tool is for informational purposes only. Always verify financial data and consult with qualified financial advisors before making investment decisions.
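
Review note: the Shariah thresholds listed in the README (debt ratio < 33%, interest income < 5%) reduce to simple ratios. The sketch below shows the arithmetic with made-up figures; the `ratio` helper and all numbers are hypothetical and are not part of this commit.

```python
def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, rejecting a zero denominator."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator


# Hypothetical figures in millions of USD, for illustration only.
total_debt = 250.0
total_assets = 1000.0
interest_income = 12.0
total_revenue = 400.0

debt_ratio = ratio(total_debt, total_assets)            # 250 / 1000
interest_share = ratio(interest_income, total_revenue)  # 12 / 400

# Thresholds taken from the README's Shariah Compliance list.
passes_debt = debt_ratio < 0.33
passes_interest = interest_share < 0.05
```

With these numbers both checks pass; a company with 400 of debt on the same asset base would fail the first one.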

app.py ADDED

	@@ -0,0 +1,297 @@

+"""
+Intelligent Investment Screener
+A RAG-based application for analyzing company financial reports against investment criteria.
+"""
+import streamlit as st
+import os
+import json
+import tempfile
+from pathlib import Path
+from dotenv import load_dotenv
+from document_processor import InvestmentDocumentProcessor
+from criteria import CRITERIA_OPTIONS
+# Load environment variables
+load_dotenv()
+# Page config
+st.set_page_config(
+    page_title="Investment Screener",
+    page_icon="📊",
+    layout="wide"
+)
+# Custom CSS
+st.markdown("""
+<style>
+    .main-header {
+        font-size: 2.5rem;
+        font-weight: bold;
+        margin-bottom: 0.5rem;
+    }
+    .sub-header {
+        font-size: 1.2rem;
+        color: #666;
+        margin-bottom: 2rem;
+    }
+    .pass-badge {
+        background-color: #28a745;
+        color: white;
+        padding: 0.5rem 1rem;
+        border-radius: 0.5rem;
+        font-weight: bold;
+        display: inline-block;
+        margin: 0.5rem 0;
+    }
+    .fail-badge {
+        background-color: #dc3545;
+        color: white;
+        padding: 0.5rem 1rem;
+        border-radius: 0.5rem;
+        font-weight: bold;
+        display: inline-block;
+        margin: 0.5rem 0;
+    }
+    .citation {
+        background-color: #f8f9fa;
+        border-left: 4px solid #007bff;
+        padding: 1rem;
+        margin: 0.5rem 0;
+        border-radius: 0.25rem;
+    }
+    .metric-card {
+        background-color: #ffffff;
+        padding: 1.5rem;
+        border-radius: 0.5rem;
+        border: 1px solid #e0e0e0;
+        margin: 1rem 0;
+    }
+</style>
+""", unsafe_allow_html=True)
+def initialize_session_state():
+    """Initialize Streamlit session state variables."""
+    if 'processor' not in st.session_state:
+        st.session_state.processor = None
+    if 'analysis_result' not in st.session_state:
+        st.session_state.analysis_result = None
+    if 'document_loaded' not in st.session_state:
+        st.session_state.document_loaded = False
+def display_criteria_rules(criteria):
+    """Display the rules for selected criteria."""
+    st.subheader("Screening Rules")
+    for rule in criteria['rules']:
+        st.markdown(f"**{rule['name']}**: {rule['description']}")
+        st.caption(f"Threshold: {rule['threshold']}")
+def display_analysis_result(result, criteria_name):
+    """Display analysis results with citations."""
+    st.markdown("---")
+    st.markdown("## Analysis Results")
+    # Overall pass/fail
+    overall_pass = result.get('overall_pass', False)
+    if overall_pass:
+        st.markdown('<div class="pass-badge">✓ PASSED - Investment Compatible</div>',
+                    unsafe_allow_html=True)
+    else:
+        st.markdown('<div class="fail-badge">✗ FAILED - Does Not Meet Criteria</div>',
+                    unsafe_allow_html=True)
+    # Summary
+    if 'summary' in result:
+        st.markdown("### Summary")
+        st.info(result['summary'])
+    # Detailed metrics
+    st.markdown("### Detailed Analysis")
+    # Remove metadata fields for display
+    metrics = {k: v for k, v in result.items()
+               if k not in ['overall_pass', 'summary', 'citations', 'source_nodes_count', 'parse_error', 'raw_response']}
+    for metric_name, metric_data in metrics.items():
+        if isinstance(metric_data, dict):
+            display_metric_card(metric_name, metric_data)
+    # Citations section
+    if 'citations' in result and result['citations']:
+        st.markdown("### 📚 Citations & Sources")
+        st.caption(f"Analysis based on {result.get('source_nodes_count', 0)} relevant document sections")
+        for citation in result['citations'][:5]:  # Show top 5 citations
+            display_citation(citation)
+    # Debug: Show raw response if parse error
+    if result.get('parse_error'):
+        with st.expander("Raw LLM Response (Debug)"):
+            st.text(result.get('raw_response', 'No response'))
+def display_metric_card(metric_name, metric_data):
+    """Display a single metric card with citation."""
+    # Format metric name
+    formatted_name = metric_name.replace('_', ' ').title()
+    # Determine pass/fail
+    passed = metric_data.get('pass', metric_data.get('compliant', metric_data.get('disclosed', None)))
+    # Build display
+    status_icon = "✓" if passed else "✗"
+    status_color = "green" if passed else "red"
+    st.markdown(f"""
+    <div class="metric-card">
+        <h4 style="color: {status_color};">{status_icon} {formatted_name}</h4>
+    """, unsafe_allow_html=True)
+    # Display metric details
+    for key, value in metric_data.items():
+        if key not in ['pass', 'page', 'location']:
+            if isinstance(value, bool):
+                value = "Yes" if value else "No"
+            st.markdown(f"**{key.replace('_', ' ').title()}**: {value}")
+    # Citation info
+    if 'page' in metric_data and 'location' in metric_data:
+        st.markdown(f"""
+        <div style="margin-top: 1rem; padding: 0.5rem; background-color: #e7f3ff; border-radius: 0.25rem;">
+            📄 <strong>Found on Page {metric_data['page']}</strong><br>
+            📍 Section: {metric_data['location']}
+        </div>
+        """, unsafe_allow_html=True)
+    elif 'page' in metric_data:
+        st.markdown(f"📄 **Page {metric_data['page']}**")
+    st.markdown("</div>", unsafe_allow_html=True)
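+def display_citation(citation):
+    """Display a citation box."""
+    # Retrieval scores may be None; avoid a format error in that case
+    score = citation.get('score')
+    relevance = f"{score:.2%}" if score is not None else "n/a"
+    st.markdown(f"""
+    <div class="citation">
+        <strong>Page {citation['page']}</strong> (Relevance: {relevance})<br>
+        <small>{citation['text_preview']}</small>
+    </div>
+    """, unsafe_allow_html=True)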
+def main():
+    """Main application."""
+    initialize_session_state()
+    # Header
+    st.markdown('<div class="main-header">📊 Intelligent Investment Screener</div>',
+                unsafe_allow_html=True)
+    st.markdown('<div class="sub-header">AI-powered financial document analysis with citations</div>',
+                unsafe_allow_html=True)
+    # Sidebar
+    with st.sidebar:
+        st.markdown("## Configuration")
+        # API Key input
+        api_key = os.getenv('OPENAI_API_KEY', '')
+        if not api_key:
+            api_key = st.text_input(
+                "OpenAI API Key",
+                type="password",
+                help="Get your API key at https://platform.openai.com/api-keys"
+            )
+        if not api_key:
+            st.warning("Please enter your OpenAI API key to continue.")
+            st.stop()
+        # Criteria selection
+        st.markdown("## Screening Criteria")
+        selected_criteria_name = st.selectbox(
+            "Select Investment Strategy",
+            options=list(CRITERIA_OPTIONS.keys())
+        )
+        criteria = CRITERIA_OPTIONS[selected_criteria_name]
+        with st.expander("View Criteria Details"):
+            st.markdown(f"**{criteria['name']}**")
+            st.caption(criteria['description'])
+            display_criteria_rules(criteria)
+        st.markdown("---")
+        st.markdown("### About")
+        st.caption("""
+        This tool uses RAG (Retrieval-Augmented Generation) to analyze
+        financial documents against specific investment criteria.
+        All findings include page citations for verification.
+        """)
+    # Main content
+    col1, col2 = st.columns([1, 1])
+    with col1:
+        st.markdown("### Upload Document")
+        uploaded_file = st.file_uploader(
+            "Upload Annual Report or 10-K Filing (PDF)",
+            type=['pdf'],
+            help="Upload a company's annual report or SEC 10-K filing"
+        )
+        if uploaded_file is not None:
+            # Load document once; later reruns reuse the processor in session state
+            if not st.session_state.document_loaded or st.session_state.processor is None:
+                # Save to temp file only when the document actually needs indexing
+                with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp_file:
+                    tmp_file.write(uploaded_file.getvalue())
+                    tmp_path = tmp_file.name
+                with st.spinner("Loading and indexing document..."):
+                    try:
+                        processor = InvestmentDocumentProcessor(api_key)
+                        processor.load_pdf(tmp_path)
+                        st.session_state.processor = processor
+                        st.session_state.document_loaded = True
+                        # Show document info
+                        doc_info = processor.get_document_summary()
+                        st.success(f"✓ Document loaded: {doc_info['num_pages']} pages")
+                    except Exception as e:
+                        st.error(f"Error loading document: {str(e)}")
+                        st.stop()
+                    finally:
+                        # Clean up temp file, including when loading fails
+                        Path(tmp_path).unlink(missing_ok=True)
+    with col2:
+        st.markdown("### Analysis")
+        if st.session_state.document_loaded:
+            if st.button("🔍 Analyze Document", type="primary", use_container_width=True):
+                with st.spinner(f"Analyzing against {selected_criteria_name} criteria..."):
+                    try:
+                        result = st.session_state.processor.analyze_with_criteria(
+                            criteria['analysis_prompt']
+                        )
+                        st.session_state.analysis_result = result
+                    except Exception as e:
+                        st.error(f"Analysis error: {str(e)}")
+                        st.exception(e)
+        else:
+            st.info("Upload a PDF document to begin analysis")
+    # Display results
+    if st.session_state.analysis_result is not None:
+        display_analysis_result(st.session_state.analysis_result, selected_criteria_name)
+if __name__ == "__main__":
+    main()
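
Review note: `main()` indexes a document only while `document_loaded` is false, so uploading a second PDF in the same session appears to keep the first document's index. One possible approach is to fingerprint the uploaded bytes and re-index when the fingerprint changes. The sketch below is an assumption, not code from this commit: `file_fingerprint`, `needs_reindex`, and the `doc_fingerprint` key are hypothetical names, and a plain dict stands in for `st.session_state`.

```python
import hashlib


def file_fingerprint(data: bytes) -> str:
    """Return a stable SHA-256 hex digest of the uploaded file's bytes."""
    return hashlib.sha256(data).hexdigest()


def needs_reindex(state: dict, data: bytes) -> bool:
    """Return True, and remember the new fingerprint, when the upload changed."""
    fingerprint = file_fingerprint(data)
    if state.get("doc_fingerprint") == fingerprint:
        return False
    state["doc_fingerprint"] = fingerprint
    return True
```

In the app, the result of `needs_reindex(...)` called with `uploaded_file.getvalue()` could be combined with the existing `document_loaded` check to decide whether to rebuild the processor.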

criteria.py ADDED

	@@ -0,0 +1,160 @@

+"""
+Investment screening criteria definitions.
+Each criterion includes rules and the analysis prompt for the LLM.
+"""
+SHARIAH_CRITERIA = {
+    "name": "Shariah Compliance",
+    "description": "Islamic finance screening for halal investments",
+    "rules": [
+        {
+            "name": "Debt Ratio",
+            "threshold": "< 33%",
+            "description": "Total debt must be less than 33% of market capitalization or total assets"
+        },
+        {
+            "name": "Interest Income",
+            "threshold": "< 5%",
+            "description": "Interest-bearing income must be less than 5% of total revenue"
+        },
+        {
+            "name": "Prohibited Activities",
+            "threshold": "0%",
+            "description": "No involvement in alcohol, gambling, pork products, conventional banking, or adult entertainment"
+        },
+        {
+            "name": "Cash & Interest-Bearing Securities",
+            "threshold": "< 33%",
+            "description": "Cash and interest-bearing securities must be less than 33% of market cap or total assets"
+        }
+    ],
+    "analysis_prompt": """You are a Shariah compliance analyst. Analyze this financial document and extract the following:
+1. **Debt Ratio**: Calculate total debt / total assets (or market cap if available). Must be < 33%.
+2. **Interest Income**: Find interest income / total revenue. Must be < 5%.
+3. **Prohibited Activities**: Check for revenue from alcohol, gambling, pork, conventional banking, or adult entertainment.
+4. **Cash Ratio**: Calculate (cash + interest-bearing securities) / total assets. Must be < 33%.
+For EACH finding, you MUST provide:
+- The exact value or percentage
+- The page number where you found it
+- The specific section or table name (e.g., "Balance Sheet, Note 5")
+Format your response as JSON:
+{
+    "debt_ratio": {"value": "X%", "page": N, "location": "Section name", "pass": true/false},
+    "interest_income": {"value": "X%", "page": N, "location": "Section name", "pass": true/false},
+    "prohibited_activities": {"found": true/false, "details": "...", "page": N, "location": "Section name", "pass": true/false},
+    "cash_ratio": {"value": "X%", "page": N, "location": "Section name", "pass": true/false},
+    "overall_pass": true/false,
+    "summary": "Brief explanation"
+}
+If you cannot find specific information, state "Not found in document" but still cite where you looked."""
+}
+ESG_CRITERIA = {
+    "name": "ESG (Environmental, Social, Governance)",
+    "description": "Sustainable and responsible investment screening",
+    "rules": [
+        {
+            "name": "Carbon Emissions Disclosure",
+            "threshold": "Required",
+            "description": "Company must disclose Scope 1 and 2 emissions"
+        },
+        {
+            "name": "Board Diversity",
+            "threshold": "> 30%",
+            "description": "More than 30% of board members should be women or minorities"
+        },
+        {
+            "name": "Environmental Violations",
+            "threshold": "None",
+            "description": "No major environmental fines or violations in past 2 years"
+        },
+        {
+            "name": "Labor Practices",
+            "threshold": "Compliant",
+            "description": "No labor rights violations or controversies"
+        }
+    ],
+    "analysis_prompt": """You are an ESG investment analyst. Analyze this financial document and extract the following:
+1. **Carbon Emissions**: Find Scope 1 and Scope 2 emissions disclosures.
+2. **Board Diversity**: Find percentage of women or minorities on the board.
+3. **Environmental Violations**: Check for environmental fines or legal issues.
+4. **Labor Practices**: Look for labor controversies or violations.
+For EACH finding, you MUST provide:
+- The specific data point
+- The page number where you found it
+- The specific section name (e.g., "Sustainability Report, page 15")
+Format your response as JSON:
+{
+    "carbon_disclosure": {"disclosed": true/false, "scope1": "X tons", "scope2": "Y tons", "page": N, "location": "Section name", "pass": true/false},
+    "board_diversity": {"percentage": "X%", "details": "...", "page": N, "location": "Section name", "pass": true/false},
+    "environmental_violations": {"found": true/false, "details": "...", "page": N, "location": "Section name", "pass": true/false},
+    "labor_practices": {"compliant": true/false, "details": "...", "page": N, "location": "Section name", "pass": true/false},
+    "overall_pass": true/false,
+    "summary": "Brief explanation"
+}
+If you cannot find specific information, state "Not found in document" but still cite where you looked."""
+}
+VALUE_INVESTING_CRITERIA = {
+    "name": "Value Investing",
+    "description": "Traditional value investing metrics",
+    "rules": [
+        {
+            "name": "P/E Ratio",
+            "threshold": "< 15",
+            "description": "Price to Earnings ratio should be below 15"
+        },
+        {
+            "name": "Debt to Equity",
+            "threshold": "< 0.5",
+            "description": "Debt to Equity ratio should be below 0.5"
+        },
+        {
+            "name": "Free Cash Flow",
+            "threshold": "Positive",
+            "description": "Company must have positive free cash flow"
+        },
+        {
+            "name": "Revenue Growth",
+            "threshold": "> 5%",
+            "description": "Year-over-year revenue growth should exceed 5%"
+        }
+    ],
+    "analysis_prompt": """You are a value investing analyst. Analyze this financial document and extract the following:
+1. **P/E Ratio**: Calculate or find Price to Earnings ratio. Should be < 15.
+2. **Debt to Equity**: Calculate total debt / total equity. Should be < 0.5.
+3. **Free Cash Flow**: Find operating cash flow minus capital expenditures. Must be positive.
+4. **Revenue Growth**: Calculate year-over-year revenue growth. Should be > 5%.
+For EACH finding, you MUST provide:
+- The exact value or ratio
+- The page number where you found it
+- The specific section or table name
+Format your response as JSON:
+{
+    "pe_ratio": {"value": X, "page": N, "location": "Section name", "pass": true/false},
+    "debt_to_equity": {"value": X, "page": N, "location": "Section name", "pass": true/false},
+    "free_cash_flow": {"value": "$X", "positive": true/false, "page": N, "location": "Section name", "pass": true/false},
+    "revenue_growth": {"value": "X%", "page": N, "location": "Section name", "pass": true/false},
+    "overall_pass": true/false,
+    "summary": "Brief explanation"
+}
+If you cannot find specific information, state "Not found in document" but still cite where you looked."""
+}
+CRITERIA_OPTIONS = {
+    "Shariah Compliance": SHARIAH_CRITERIA,
+    "ESG Screening": ESG_CRITERIA,
+    "Value Investing": VALUE_INVESTING_CRITERIA
+}
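
Review note: the prompts ask the model to report both a value and a `pass` flag, so the pass/fail decision depends on the model's own arithmetic. A deterministic cross-check against the `threshold` strings in `rules` is one way to reduce that risk. The sketch below is a minimal illustration under stated assumptions: `parse_number` and `meets_threshold` are hypothetical helpers, and only numeric thresholds of the form `< N` or `> N` (optionally with `%`) are handled; thresholds such as "Required" or "None" return `None`.

```python
import operator
import re
from typing import Optional

_OPS = {"<": operator.lt, ">": operator.gt}


def parse_number(value) -> Optional[float]:
    """Extract the first number from values such as '28.5%', '$1,200' or 0.4."""
    if isinstance(value, bool):
        return None
    if isinstance(value, (int, float)):
        return float(value)
    match = re.search(r"-?\d+(?:\.\d+)?", str(value).replace(",", ""))
    return float(match.group()) if match else None


def meets_threshold(value, threshold: str) -> Optional[bool]:
    """Compare a reported value with a threshold like '< 33%' or '> 5%'.

    Returns None when either side is not numeric, for example when the
    model reports 'Not found in document'.
    """
    parts = re.fullmatch(r"\s*([<>])\s*(-?\d+(?:\.\d+)?)\s*%?\s*", threshold)
    number = parse_number(value)
    if parts is None or number is None:
        return None
    return _OPS[parts.group(1)](number, float(parts.group(2)))
```

A caller could flag any metric where `meets_threshold(...)` disagrees with the model's `pass` field instead of silently trusting either one.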

document_processor.py ADDED

	@@ -0,0 +1,153 @@

+"""
+Document processing with LlamaIndex.
+Handles PDF parsing, indexing, and querying with citation tracking.
+"""
+import os
+import json
+from typing import Dict, Any, List
+from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
+from llama_index.llms.openai import OpenAI
+from llama_index.embeddings.openai import OpenAIEmbedding
+from llama_index.core.node_parser import SimpleNodeParser
+from llama_index.core.schema import NodeWithScore
+class InvestmentDocumentProcessor:
+    """Process investment documents (PDFs) and extract information with citations."""
+    def __init__(self, api_key: str):
+        """Initialize the processor with OpenAI API key."""
+        # Configure OpenAI GPT-4o-mini (cheap and fast)
+        self.llm = OpenAI(
+            model="gpt-4o-mini",
+            api_key=api_key,
+            temperature=0.1  # Low temperature for factual extraction
+        )
+        # Set global LLM and embedding model; pass the key explicitly so a key
+        # entered in the sidebar (not set in the environment) also reaches embeddings
+        Settings.llm = self.llm
+        Settings.embed_model = OpenAIEmbedding(api_key=api_key)
+        # Node parser to chunk documents while preserving metadata
+        self.node_parser = SimpleNodeParser.from_defaults(
+            chunk_size=1024,
+            chunk_overlap=200
+        )
+        self.index = None
+        self.documents = None
+    def load_pdf(self, pdf_path: str) -> None:
+        """Load and index a PDF document."""
+        # Load PDF with metadata extraction
+        reader = SimpleDirectoryReader(
+            input_files=[pdf_path],
+            filename_as_id=True
+        )
+        self.documents = reader.load_data()
+        # Add page numbers to metadata if not present
+        for doc in self.documents:
+            if 'page_label' not in doc.metadata:
+                # SimpleDirectoryReader should add page info, but fallback
+                doc.metadata['page_label'] = doc.metadata.get('page', 'Unknown')
+        # Create vector index, chunking with the configured node parser
+        self.index = VectorStoreIndex.from_documents(
+            self.documents,
+            transformations=[self.node_parser],
+            show_progress=True
+        )
+    def analyze_with_criteria(self, criteria_prompt: str) -> Dict[str, Any]:
+        """
+        Analyze the document against investment criteria.
+        Returns analysis with citations.
+        """
+        if self.index is None:
+            raise ValueError("No document loaded. Call load_pdf() first.")
+        # Create query engine with citation tracking
+        query_engine = self.index.as_query_engine(
+            similarity_top_k=10,  # Get more context
+            response_mode="tree_summarize"
+        )
+        # Query with the criteria prompt
+        response = query_engine.query(criteria_prompt)
+        # Extract citations from source nodes
+        citations = self._extract_citations(response.source_nodes)
+        # Parse the response (expecting JSON)
+        try:
+            analysis_result = json.loads(str(response))
+        except json.JSONDecodeError:
+            # If not JSON, wrap in a structure
+            analysis_result = {
+                "raw_response": str(response),
+                "parse_error": True
+            }
+        # Add citations
+        analysis_result['citations'] = citations
+        analysis_result['source_nodes_count'] = len(response.source_nodes)
+        return analysis_result
+    def _extract_citations(self, source_nodes: List[NodeWithScore]) -> List[Dict[str, Any]]:
+        """Extract citation information from source nodes."""
+        citations = []
+        for idx, node in enumerate(source_nodes):
+            page = node.node.metadata.get('page_label',
+                                          node.node.metadata.get('page', 'Unknown'))
+            citation = {
+                "index": idx + 1,
+                "page": page,
+                "score": node.score,
+                "text_preview": node.node.text[:200] + "..." if len(node.node.text) > 200 else node.node.text,
+                "file_name": node.node.metadata.get('file_name', 'Unknown')
+            }
+            citations.append(citation)
+        return citations
+    def get_document_summary(self) -> Dict[str, Any]:
+        """Get basic document information."""
+        if self.documents is None:
+            return {"error": "No document loaded"}
+        return {
+            "num_pages": len(self.documents),
+            "file_name": self.documents[0].metadata.get('file_name', 'Unknown'),
+            "total_chars": sum(len(doc.text) for doc in self.documents)
+        }
+    def quick_search(self, query: str, top_k: int = 5) -> List[Dict[str, Any]]:
+        """
+        Perform a quick search in the document.
+        Useful for finding specific sections or terms.
+        """
+        if self.index is None:
+            raise ValueError("No document loaded. Call load_pdf() first.")
+        query_engine = self.index.as_query_engine(
+            similarity_top_k=top_k,
+            response_mode="no_text"  # Just return nodes, no generation
+        )
+        response = query_engine.query(query)
+        results = []
+        for node in response.source_nodes:
+            page = node.node.metadata.get('page_label',
+                                          node.node.metadata.get('page', 'Unknown'))
+            results.append({
+                "page": page,
+                "text": node.node.text,
+                "score": node.score
+            })
+        return results
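
Review note: `analyze_with_criteria` calls `json.loads(str(response))` directly, which fails whenever the model wraps its JSON in a Markdown code fence or adds a sentence around it. A more tolerant parser is sketched below. `parse_llm_json` is a hypothetical helper, not part of this commit; it returns the same `raw_response` / `parse_error` structure the method already uses on failure.

```python
import json
import re
from typing import Any, Dict

FENCE = "`" * 3  # Markdown code-fence marker, built here to keep this block fence-safe
_FENCED = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)\s*" + re.escape(FENCE), re.DOTALL
)


def parse_llm_json(text: str) -> Dict[str, Any]:
    """Parse a JSON object from an LLM reply, tolerating fences and extra prose."""
    failure = {"raw_response": text, "parse_error": True}
    cleaned = text.strip()
    fenced = _FENCED.fullmatch(cleaned)
    if fenced:
        cleaned = fenced.group(1)
    try:
        result = json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span, if there is one.
        start, end = cleaned.find("{"), cleaned.rfind("}")
        if start == -1 or end <= start:
            return failure
        try:
            result = json.loads(cleaned[start:end + 1])
        except json.JSONDecodeError:
            return failure
    return result if isinstance(result, dict) else failure
```

Replacing the `try`/`except` around `json.loads` with a call like `parse_llm_json(str(response))` would keep the existing debug expander in `app.py` working, since the failure shape is unchanged.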

requirements.txt ADDED

	@@ -0,0 +1,5 @@

+streamlit>=1.31.0
+llama-index>=0.10.0
+openai>=1.0.0
+pypdf>=4.0.0
+python-dotenv>=1.0.0