# Long-Context Document Semantic Analysis System

This intelligent AI system analyzes long documents to automatically detect duplicates, contradictions, and inconsistencies using state-of-the-art Natural Language Processing (NLP) techniques.

## Features

- **Duplicate Detection**: Identifies semantically identical or near-identical text segments using SBERT embeddings and FAISS vector search.
- **Contradiction Detection**: Uses a Cross-Encoder Natural Language Inference (NLI) model to flag logically conflicting statements.
- **Holistic Analysis**: Processes multiple documents (PDF, TXT) to find inconsistencies across the entire corpus.
- **Evidence-Based Reporting**: Generates a downloadable Markdown report with source references and confidence scores.

## Architecture

1. **Document Processing**: Extracts text from PDFs/TXTs and chunks it into overlapping segments.
2. **Embedding Generation**: `sentence-transformers/all-MiniLM-L6-v2` maps chunks to dense vector space.
3. **Similarity Search**: `FAISS` efficiently finds potential duplicate candidates.
4. **Logical Inference**: `cross-encoder/nli-distilroberta-base` verifies logical relationships (Contradiction/Entailment) between similar chunks.

## Installation

1. **Create a Virtual Environment** (Recommended):
   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

2. **Install Dependencies**:
   ```bash
   pip install -r requirements.txt
   ```
   *Note: PyTorch installation might take a few minutes.*

## Usage

1. **Start the Application**:
   ```bash
   streamlit run app.py
   ```
   OR using the venv directly:
   ```bash
   ./venv/bin/streamlit run app.py
   ```

2. **Navigate to the UI**: Open your browser at `http://localhost:8501`.

3. **Analyze**:
   - Upload PDF or TXT files via the sidebar.
   - Click "Analyze Documents".
   - View results on the dashboard and download the report.

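
Step 1 of the Architecture section splits extracted text into overlapping segments. The sketch below illustrates that idea under stated assumptions: it uses word-based windows, and the function name `chunk_text` and the `chunk_size` and `overlap` parameters are hypothetical. The project's actual chunking unit (words, sentences, or tokens) and sizes may differ.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into word windows; consecutive windows share `overlap` words.

    Illustrative only: the word-based unit and default sizes are assumptions.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(12))
    for chunk in chunk_text(sample, chunk_size=5, overlap=2):
        print(chunk)
```

The overlap means a statement that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some words twice.
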
## Verification

To verify the core logic without the UI:

```bash
./venv/bin/python verify_backend.py
```

This generates sample contradictory documents and checks if the system flags them correctly.
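
Steps 2 to 4 of the Architecture section form a two-stage design: a similarity search proposes candidate pairs, then an NLI model judges each pair. The sketch below illustrates only the first stage, using stand-ins so that it runs with the standard library alone. `toy_embed` is a bag-of-words vector, not SBERT, and `candidate_pairs` is a brute-force comparison, not FAISS. All function names and the `threshold` value are hypothetical and are not taken from the project's code.

```python
import math
import re
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def toy_embed(text: str) -> Dict[str, float]:
    """Stand-in for an SBERT embedding: an L2-normalized bag-of-words vector."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {tok: c / norm for tok, c in counts.items()} if norm else {}


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two unit vectors stored as sparse dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


def candidate_pairs(
    chunks: List[str], threshold: float = 0.6
) -> List[Tuple[int, int, float]]:
    """Brute-force stand-in for a FAISS search: pairs scoring >= threshold."""
    vectors = [toy_embed(c) for c in chunks]
    pairs = []
    for i, j in combinations(range(len(chunks)), 2):
        score = cosine(vectors[i], vectors[j])
        if score >= threshold:
            pairs.append((i, j, score))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


if __name__ == "__main__":
    chunks = [
        "The warranty period is 12 months.",
        "The warranty period is 24 months.",
        "Shipping takes five business days.",
    ]
    for i, j, score in candidate_pairs(chunks):
        print(i, j, f"{score:.3f}")  # prints: 0 1 0.833
```

In this example the two warranty statements score high on similarity even though they disagree. That illustrates why similarity alone cannot separate a duplicate from a contradiction, and why the surviving pairs would then go to an NLI model for the logical check.
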