Long-Context Document Semantic Analysis System
This system analyzes long documents to automatically detect duplicates, contradictions, and inconsistencies. It combines sentence embeddings, vector similarity search, and Natural Language Inference (NLI).
Features
- Duplicate Detection: Identifies semantically identical or near-identical text segments using SBERT embeddings and FAISS vector search.
- Contradiction Detection: Uses a Cross-Encoder Natural Language Inference (NLI) model to flag logically conflicting statements.
- Holistic Analysis: Processes multiple documents (PDF, TXT) to find inconsistencies across the entire corpus.
- Evidence-Based Reporting: Generates a downloadable Markdown report with source references and confidence scores.
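To make the reporting feature concrete, here is a minimal sketch of how flagged pairs could be rendered as a Markdown report with source references and confidence scores. The `Finding` structure and `render_report` function are hypothetical names chosen for illustration. They are not taken from the application's code, and the real report layout may differ.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """One flagged pair of chunks (hypothetical structure, not the app's own)."""
    kind: str          # "duplicate" or "contradiction"
    source_a: str      # source reference, e.g. "policy.pdf, chunk 3"
    source_b: str
    text_a: str
    text_b: str
    confidence: float  # model score in [0, 1]


def render_report(findings: list[Finding]) -> str:
    """Render findings as Markdown, grouped by kind, highest confidence first."""
    lines = ["# Document Analysis Report", ""]
    if not findings:
        lines.append("No duplicates or contradictions were flagged.")
        return "\n".join(lines)
    for kind in ("duplicate", "contradiction"):
        group = sorted(
            (f for f in findings if f.kind == kind),
            key=lambda f: f.confidence,
            reverse=True,
        )
        if not group:
            continue
        lines.append(f"## {kind.capitalize()}s ({len(group)})")
        lines.append("")
        for f in group:
            lines.append(f"- **Confidence:** {f.confidence:.2f}")
            lines.append(f"  - {f.source_a}: {f.text_a}")
            lines.append(f"  - {f.source_b}: {f.text_b}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

In a Streamlit app, the returned string could be offered for download through `st.download_button`. Whether this project wires it up that way is an assumption.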
Architecture
- Document Processing: Extracts text from PDFs/TXTs and chunks it into overlapping segments.
- Embedding Generation: sentence-transformers/all-MiniLM-L6-v2 maps chunks to a dense vector space.
- Similarity Search: FAISS efficiently finds potential duplicate candidates.
- Logical Inference: cross-encoder/nli-distilroberta-base verifies logical relationships (Contradiction/Entailment) between similar chunks.
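The four stages above can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the application's actual code. The function names, the word-based chunking, the chunk size and overlap, `k`, and the similarity threshold are all hypothetical choices. The model names come from the list above, and the NLI label order follows the published model card for cross-encoder/nli-distilroberta-base.

```python
def chunk_words(text: str, size: int = 120, overlap: int = 30) -> list[str]:
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


def candidate_pairs(scores, neighbors, threshold: float) -> list[tuple[int, int, float]]:
    """Turn nearest-neighbor search results into unique (i, j, score) pairs with i < j."""
    best = {}
    for i, (row_scores, row_ids) in enumerate(zip(scores, neighbors)):
        for score, j in zip(row_scores, row_ids):
            j, score = int(j), float(score)
            if j < 0 or j == i or score < threshold:
                continue  # skip padding (-1), self-matches, and weak matches
            key = (min(i, j), max(i, j))
            best[key] = max(best.get(key, score), score)
    return sorted(((a, b, s) for (a, b), s in best.items()), key=lambda p: -p[2])


NLI_LABELS = ["contradiction", "entailment", "neutral"]  # order from the model card


def analyze(chunks: list[str], k: int = 5, threshold: float = 0.6):
    """Embed chunks, find similar pairs with FAISS, and label each pair with NLI."""
    if len(chunks) < 2:
        return []
    # Heavy dependencies are imported lazily so the helpers above work without them.
    import faiss
    from sentence_transformers import CrossEncoder, SentenceTransformer

    embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    # With normalized vectors, inner product equals cosine similarity.
    vectors = embedder.encode(chunks, normalize_embeddings=True, convert_to_numpy=True)
    vectors = vectors.astype("float32")

    index = faiss.IndexFlatIP(vectors.shape[1])  # exact inner-product search
    index.add(vectors)
    # Ask for k + 1 neighbors because each chunk also retrieves itself.
    scores, neighbors = index.search(vectors, min(k + 1, len(chunks)))

    pairs = candidate_pairs(scores, neighbors, threshold)
    if not pairs:
        return []
    nli = CrossEncoder("cross-encoder/nli-distilroberta-base")
    logits = nli.predict([(chunks[i], chunks[j]) for i, j, _ in pairs])
    return [
        (i, j, similarity, NLI_LABELS[int(row.argmax())])
        for (i, j, similarity), row in zip(pairs, logits)
    ]
```

Running the NLI model only on pairs that pass the similarity search keeps the expensive cross-encoder away from the full set of chunk pairs. In a setup like this, pairs with very high similarity would be reported as duplicates, while similar pairs labeled "contradiction" would be reported as conflicts. The exact cutoffs are a tuning decision.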
Installation
Create a Virtual Environment (Recommended):

    python3 -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate

Install Dependencies:

    pip install -r requirements.txt

Note: PyTorch installation might take a few minutes.
Usage
Start the Application:

    streamlit run app.py

OR using the venv directly:

    ./venv/bin/streamlit run app.py

Navigate to the UI: Open your browser at http://localhost:8501.

Analyze:
- Upload PDF or TXT files via the sidebar.
- Click "Analyze Documents".
- View results on the dashboard and download the report.
Verification
To verify the core logic without the UI:

    ./venv/bin/python verify_backend.py

This generates sample contradictory documents and checks whether the system flags them correctly.