
Long-Context Document Semantic Analysis System

This system analyzes long documents to automatically detect duplicates, contradictions, and inconsistencies. It combines sentence embeddings, vector similarity search, and Natural Language Inference (NLI), all standard Natural Language Processing (NLP) techniques.

Features

  • Duplicate Detection: Identifies semantically identical or near-identical text segments using SBERT embeddings and FAISS vector search.
  • Contradiction Detection: Uses a Cross-Encoder Natural Language Inference (NLI) model to flag logically conflicting statements.
  • Holistic Analysis: Processes multiple documents (PDF, TXT) to find inconsistencies across the entire corpus.
  • Evidence-Based Reporting: Generates a downloadable Markdown report with source references and confidence scores.
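The report format itself is not documented here, so the following is only a sketch of how flagged pairs could be rendered to Markdown with source references and confidence scores. The `Finding` record, its fields, and `render_report` are hypothetical names chosen for illustration, not the application's actual code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    """Hypothetical record for one flagged pair of passages."""
    kind: str          # "duplicate" or "contradiction"
    source_a: str      # source reference, e.g. "policy.pdf, chunk 12"
    source_b: str
    text_a: str
    text_b: str
    confidence: float  # model score, assumed to lie in [0, 1]


def render_report(findings: List[Finding]) -> str:
    """Render findings as Markdown, grouped by kind, highest confidence first."""
    lines = ["# Semantic Analysis Report", ""]
    if not findings:
        lines.append("No duplicates or contradictions were flagged.")
    for kind in ("duplicate", "contradiction"):
        group = sorted(
            (f for f in findings if f.kind == kind),
            key=lambda f: f.confidence,
            reverse=True,
        )
        if not group:
            continue
        lines += [f"## {kind.capitalize()}s ({len(group)})", ""]
        for n, f in enumerate(group, 1):
            lines += [
                f"### {n}. Confidence: {f.confidence:.2f}",
                f"- **{f.source_a}**: {f.text_a}",
                f"- **{f.source_b}**: {f.text_b}",
                "",
            ]
    return "\n".join(lines)
```

Calling `render_report(findings)` returns a single string that could be written to a `.md` file or offered as a download.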

Architecture

  1. Document Processing: Extracts text from PDFs/TXTs and chunks it into overlapping segments.
  2. Embedding Generation: sentence-transformers/all-MiniLM-L6-v2 maps each chunk to a 384-dimensional dense vector.
  3. Similarity Search: FAISS retrieves each chunk's nearest neighbors in embedding space as duplicate candidates.
  4. Logical Inference: cross-encoder/nli-distilroberta-base verifies logical relationships (Contradiction/Entailment) between similar chunks.
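The four steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this repository: the function names, the word-based chunking, the chunk size and overlap, and both similarity thresholds are hypothetical. Only the two model names come from this README. The `sentence-transformers` and `faiss` packages are assumed to be installed.

```python
from typing import List, Tuple


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


def unique_pairs(scores, neighbors, threshold: float) -> List[Tuple[int, int, float]]:
    """Turn per-chunk neighbor lists into unique (i, j, score) pairs with i < j."""
    best = {}
    for i, (row_scores, row_ids) in enumerate(zip(scores, neighbors)):
        for score, j in zip(row_scores, row_ids):
            j, score = int(j), float(score)
            if j < 0 or j == i or score < threshold:
                continue  # skip padding (-1), self-matches, and weak matches
            key = (min(i, j), max(i, j))
            best[key] = max(best.get(key, score), score)
    return sorted((i, j, s) for (i, j), s in best.items())


def analyze(chunks: List[str], top_k: int = 5,
            similar_threshold: float = 0.5, dup_threshold: float = 0.9):
    """Return (duplicates, contradictions) as lists of chunk-index pairs."""
    # Heavy dependencies are imported lazily so the helpers above work without them.
    import faiss
    from sentence_transformers import CrossEncoder, SentenceTransformer

    embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    # With unit-length vectors, inner product equals cosine similarity.
    vectors = embedder.encode(
        chunks, normalize_embeddings=True, convert_to_numpy=True
    ).astype("float32")

    index = faiss.IndexFlatIP(vectors.shape[1])  # exact inner-product search
    index.add(vectors)
    scores, neighbors = index.search(vectors, min(top_k, len(chunks)))

    similar = unique_pairs(scores, neighbors, similar_threshold)
    if not similar:
        return [], []
    duplicates = [(i, j, s) for i, j, s in similar if s >= dup_threshold]

    nli = CrossEncoder("cross-encoder/nli-distilroberta-base")
    id2label = nli.model.config.id2label  # read label order from the model config
    logits = nli.predict([(chunks[i], chunks[j]) for i, j, _ in similar])
    contradictions = [
        (i, j)
        for (i, j, _), label_id in zip(similar, logits.argmax(axis=1))
        if id2label[int(label_id)].lower() == "contradiction"
    ]
    return duplicates, contradictions
```

A call such as `analyze(chunk_text(document_text))` would download both models on first use. Note that NLI is directional (premise, then hypothesis), so a fuller version might score each pair in both orders.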

Installation

  1. Create a Virtual Environment (Recommended):

    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
  2. Install Dependencies:

    pip install -r requirements.txt
    

    Note: PyTorch installation might take a few minutes.

Usage

  1. Start the Application:

    streamlit run app.py
    

    Or run Streamlit from the virtual environment directly, without activating it:

    ./venv/bin/streamlit run app.py
    
  2. Navigate to the UI: Open your browser at http://localhost:8501.

  3. Analyze:

    • Upload PDF or TXT files via the sidebar.
    • Click "Analyze Documents".
    • View results on the dashboard and download the report.

Verification

To verify the core logic without the UI:

./venv/bin/python verify_backend.py

This generates sample documents that contradict each other and checks whether the system flags them correctly.
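The contents of verify_backend.py are not shown here. As a rough idea of what such a check needs as input, the sketch below writes two small TXT files that share one sentence (a duplicate) and disagree on another (a contradiction). The file names, the sample text, and the `write_sample_documents` helper are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Dict


def write_sample_documents(directory: Path) -> Dict[str, Path]:
    """Write two short, deliberately conflicting TXT files and return their paths."""
    samples = {
        # Both files share the first sentence and disagree on the second.
        "policy_a.txt": (
            "Employees must submit expense reports within 30 days. "
            "Remote work is permitted on Fridays."
        ),
        "policy_b.txt": (
            "Employees must submit expense reports within 30 days. "
            "Remote work is not permitted on any day."
        ),
    }
    paths = {}
    for name, text in samples.items():
        path = directory / name
        path.write_text(text, encoding="utf-8")
        paths[name] = path
    return paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        for name, path in write_sample_documents(Path(tmp)).items():
            print(name, "->", path.read_text(encoding="utf-8"))
```

A backend check could then feed these files through the analysis pipeline and assert that at least one duplicate and one contradiction are reported.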