# Long-Context Document Semantic Analysis System

This system analyzes long documents to automatically detect duplicates, contradictions, and inconsistencies using Natural Language Processing (NLP) techniques: sentence embeddings, vector similarity search, and natural language inference.

## Features
- **Duplicate Detection**: Identifies semantically identical or near-identical text segments using SBERT embeddings and FAISS vector search.
- **Contradiction Detection**: Uses a Cross-Encoder Natural Language Inference (NLI) model to flag logically conflicting statements.
- **Holistic Analysis**: Processes multiple documents (PDF, TXT) to find inconsistencies across the entire corpus.
- **Evidence-Based Reporting**: Generates a downloadable Markdown report with source references and confidence scores.
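
As an illustration of the reporting feature, here is a minimal sketch of how findings could be rendered as Markdown. The `Finding` dataclass, its field names, and `render_report` are hypothetical and are not taken from this project's code; the actual report layout may differ.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """One flagged pair of text segments (hypothetical structure)."""
    kind: str          # e.g. "duplicate" or "contradiction"
    source_a: str      # source reference, e.g. "policy.pdf, chunk 3"
    source_b: str
    text_a: str
    text_b: str
    confidence: float  # model score, assumed to lie in [0, 1]


def render_report(findings: list[Finding]) -> str:
    """Render findings as a Markdown string, highest confidence first."""
    lines = ["# Analysis Report", "", f"Total findings: {len(findings)}", ""]
    ordered = sorted(findings, key=lambda f: f.confidence, reverse=True)
    for i, f in enumerate(ordered, start=1):
        lines += [
            f"## {i}. {f.kind.title()} (confidence {f.confidence:.2f})",
            f"- **{f.source_a}**: {f.text_a}",
            f"- **{f.source_b}**: {f.text_b}",
            "",
        ]
    return "\n".join(lines)
```

The returned string can be written to a `.md` file or handed to a download widget in the UI.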

## Architecture
1. **Document Processing**: Extracts text from PDFs/TXTs and chunks it into overlapping segments.
2. **Embedding Generation**: `sentence-transformers/all-MiniLM-L6-v2` maps chunks to dense vector space.
3. **Similarity Search**: `FAISS` efficiently finds potential duplicate candidates.
4. **Logical Inference**: `cross-encoder/nli-distilroberta-base` verifies logical relationships (Contradiction/Entailment) between similar chunks.
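
The four steps above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the project's implementation: chunk sizes are counted in words, a brute-force cosine comparison stands in for the FAISS index, embeddings are passed in as plain lists of floats, and `score_fn` is a hypothetical stand-in for the cross-encoder that returns three scores per pair. The label order in `LABELS` is an assumption and should be checked against the NLI model's card.

```python
import math
from typing import Callable, Sequence

# Assumed label order of the NLI scores; verify against the model card.
LABELS = ("contradiction", "entailment", "neutral")


def chunk_text(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    chunks = []
    for start in range(0, len(words), size - overlap):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):  # last window reached the end
            break
    return chunks


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def candidate_pairs(vectors: Sequence[Sequence[float]],
                    threshold: float = 0.8) -> list[tuple[int, int, float]]:
    """Brute-force all-pairs search (stand-in for a FAISS index lookup)."""
    pairs = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            sim = cosine(vectors[i], vectors[j])
            if sim >= threshold:
                pairs.append((i, j, sim))
    return pairs


def flag_contradictions(
    chunks: Sequence[str],
    pairs: Sequence[tuple[int, int, float]],
    score_fn: Callable[[str, str], Sequence[float]],
    min_score: float = 0.5,
) -> list[tuple[int, int, float]]:
    """Keep similar pairs whose top NLI label is 'contradiction'."""
    flagged = []
    for i, j, _ in pairs:
        scores = score_fn(chunks[i], chunks[j])
        best = max(range(len(LABELS)), key=lambda k: scores[k])
        if LABELS[best] == "contradiction" and scores[best] >= min_score:
            flagged.append((i, j, scores[best]))
    return flagged
```

Running the NLI check only on pairs that pass the similarity threshold keeps the expensive cross-encoder off the full set of chunk pairs, which grows quadratically with the number of chunks.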

## Installation

1. **Create a Virtual Environment** (Recommended):
   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

2. **Install Dependencies**:
   ```bash
   pip install -r requirements.txt
   ```
   *Note: PyTorch installation might take a few minutes.*

## Usage

1. **Start the Application**:
   ```bash
   streamlit run app.py
   ```
   Or call the virtual environment's Streamlit executable directly:
   ```bash
   ./venv/bin/streamlit run app.py
   ```

2. **Navigate to the UI**:
   Open your browser at `http://localhost:8501`.

3. **Analyze**:
   - Upload PDF or TXT files via the sidebar.
   - Click "Analyze Documents".
   - View results on the dashboard and download the report.

## Verification
To verify the core logic without the UI:
```bash
./venv/bin/python verify_backend.py
```
The script generates sample contradictory documents and checks whether the system flags them correctly.