# Long-Context Document Semantic Analysis System
This system analyzes long documents and automatically detects duplicates, contradictions, and inconsistencies. It combines sentence embeddings, vector similarity search, and Natural Language Inference (NLI).
## Features
- **Duplicate Detection**: Identifies semantically identical or near-identical text segments using SBERT embeddings and FAISS vector search.
- **Contradiction Detection**: Uses a Cross-Encoder Natural Language Inference (NLI) model to flag logically conflicting statements.
- **Holistic Analysis**: Processes multiple documents (PDF, TXT) to find inconsistencies across the entire corpus.
- **Evidence-Based Reporting**: Generates a downloadable Markdown report with source references and confidence scores.
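To illustrate the idea behind duplicate detection, here is a minimal sketch that scores every pair of chunk embeddings by cosine similarity and keeps the pairs above a threshold. It is not the project's code: the brute-force loop stands in for the FAISS search, the toy vectors stand in for SBERT embeddings, and the function names and the `0.9` threshold are assumptions chosen for illustration.

```python
from __future__ import annotations

import math
from itertools import combinations


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def find_duplicate_candidates(
    embeddings: list[list[float]], threshold: float = 0.9
) -> list[tuple[int, int, float]]:
    """Return (i, j, score) for every pair whose cosine similarity >= threshold.

    Brute force, O(n^2) pairs: a stand-in for an approximate or indexed
    vector search such as FAISS. The threshold is a hypothetical default.
    """
    unit = [normalize(v) for v in embeddings]
    pairs = []
    for i, j in combinations(range(len(unit)), 2):
        score = sum(a * b for a, b in zip(unit[i], unit[j]))
        if score >= threshold:
            pairs.append((i, j, score))
    # Highest-scoring (most similar) pairs first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


if __name__ == "__main__":
    # Toy 2-D vectors; real SBERT embeddings have hundreds of dimensions.
    toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
    for i, j, score in find_duplicate_candidates(toy):
        print(f"chunks {i} and {j} look like duplicates (cosine={score:.3f})")
```

In a pipeline like the one described here, pairs that pass this similarity filter would then be handed to the NLI model, since a contradiction is usually only meaningful between statements about the same topic.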
## Architecture
1. **Document Processing**: Extracts text from PDFs/TXTs and chunks it into overlapping segments.
2. **Embedding Generation**: `sentence-transformers/all-MiniLM-L6-v2` maps chunks to dense vector space.
3. **Similarity Search**: `FAISS` efficiently finds potential duplicate candidates.
4. **Logical Inference**: `cross-encoder/nli-distilroberta-base` verifies logical relationships (Contradiction/Entailment) between similar chunks.
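The chunking in step 1 can be sketched as a sliding window over words, where each chunk repeats the last few words of the previous one so that a sentence cut at a boundary still appears whole in at least one chunk. This is a sketch under stated assumptions, not the project's implementation: the function name `chunk_text`, the word-based splitting, and the default sizes are hypothetical.

```python
from __future__ import annotations


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into word-based chunks; consecutive chunks share `overlap` words.

    chunk_size and overlap are counted in whitespace-separated words.
    The defaults are illustrative assumptions, not values from the project.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_text(sample, chunk_size=4, overlap=2):
        print(chunk)
```

A larger overlap lowers the chance of splitting a statement across chunks, at the cost of more chunks to embed and compare.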
## Installation
1. **Create a Virtual Environment** (Recommended):
```bash
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
```
2. **Install Dependencies**:
```bash
pip install -r requirements.txt
```
*Note: PyTorch installation might take a few minutes.*
## Usage
1. **Start the Application**:
```bash
streamlit run app.py
```
Or, without activating the virtual environment, call its executable directly:
```bash
./venv/bin/streamlit run app.py
```
2. **Navigate to the UI**:
Open your browser at `http://localhost:8501`.
3. **Analyze**:
- Upload PDF or TXT files via the sidebar.
- Click "Analyze Documents".
- View results on the dashboard and download the report.
## Verification
To verify the core logic without the UI:
```bash
./venv/bin/python verify_backend.py
```
The script generates sample documents that contradict each other and checks whether the system flags them correctly.