Spaces:
Configuration error
Configuration error
Upload 8 files
Browse files- README.md +55 -19
- app.py +166 -0
- backend.py +200 -0
- doc_a.txt +3 -0
- doc_b.txt +3 -0
- requirements.txt +6 -2
- run_app.sh +12 -0
- verify_backend.py +55 -0
README.md
CHANGED
|
@@ -1,19 +1,55 @@
|
|
| 1 |
-
-
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
|
| 16 |
-
|
| 17 |
-
|
| 18 |
-
|
| 19 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Long-Context Document Semantic Analysis System
|
| 2 |
+
|
| 3 |
+
This intelligent AI system analyzes long documents to automatically detect duplicates, contradictions, and inconsistencies using state-of-the-art Natural Language Processing (NLP) techniques.
|
| 4 |
+
|
| 5 |
+
## Features
|
| 6 |
+
- **Duplicate Detection**: Identifies semantically identical or near-identical text segments using SBERT embeddings and FAISS vector search.
|
| 7 |
+
- **Contradiction Detection**: Uses a Cross-Encoder Natural Language Inference (NLI) model to flag logically conflicting statements.
|
| 8 |
+
- **Holistic Analysis**: Processes multiple documents (PDF, TXT) to find inconsistencies across the entire corpus.
|
| 9 |
+
- **Evidence-Based Reporting**: Generates a downloadable Markdown report with source references and confidence scores.
|
| 10 |
+
|
| 11 |
+
## Architecture
|
| 12 |
+
1. **Document Processing**: Extracts text from PDFs/TXTs and chunks it into overlapping segments.
|
| 13 |
+
2. **Embedding Generation**: `sentence-transformers/all-MiniLM-L6-v2` maps chunks to dense vector space.
|
| 14 |
+
3. **Similarity Search**: `FAISS` efficiently finds potential duplicate candidates.
|
| 15 |
+
4. **Logical Inference**: `cross-encoder/nli-distilroberta-base` verifies logical relationships (Contradiction/Entailment) between similar chunks.
|
| 16 |
+
|
| 17 |
+
## Installation
|
| 18 |
+
|
| 19 |
+
1. **Create a Virtual Environment** (Recommended):
|
| 20 |
+
```bash
|
| 21 |
+
python3 -m venv venv
|
| 22 |
+
source venv/bin/activate # On Windows: venv\Scripts\activate
|
| 23 |
+
```
|
| 24 |
+
|
| 25 |
+
2. **Install Dependencies**:
|
| 26 |
+
```bash
|
| 27 |
+
pip install -r requirements.txt
|
| 28 |
+
```
|
| 29 |
+
*Note: PyTorch installation might take a few minutes.*
|
| 30 |
+
|
| 31 |
+
## Usage
|
| 32 |
+
|
| 33 |
+
1. **Start the Application**:
|
| 34 |
+
```bash
|
| 35 |
+
streamlit run app.py
|
| 36 |
+
```
|
| 37 |
+
OR using the venv directly:
|
| 38 |
+
```bash
|
| 39 |
+
./venv/bin/streamlit run app.py
|
| 40 |
+
```
|
| 41 |
+
|
| 42 |
+
2. **Navigate to the UI**:
|
| 43 |
+
Open your browser at `http://localhost:8501`.
|
| 44 |
+
|
| 45 |
+
3. **Analyze**:
|
| 46 |
+
- Upload PDF or TXT files via the sidebar.
|
| 47 |
+
- Click "Analyze Documents".
|
| 48 |
+
- View results on the dashboard and download the report.
|
| 49 |
+
|
| 50 |
+
## Verification
|
| 51 |
+
To verify the core logic without the UI:
|
| 52 |
+
```bash
|
| 53 |
+
./venv/bin/python verify_backend.py
|
| 54 |
+
```
|
| 55 |
+
This generates sample contradictory documents and checks if the system flags them correctly.
|
app.py
ADDED
|
@@ -0,0 +1,166 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import streamlit as st
import pandas as pd
import os
import tempfile
from backend import SemanticAnalyzer

st.set_page_config(page_title="Semantic Document Analyzer", layout="wide")

# Global CSS overrides: gradient background, styled buttons, metrics and headings.
st.markdown("""
<style>
/* Premium Look & Feel */
.stApp {
    background: linear-gradient(to right, #f8f9fa, #e9ecef);
    font-family: 'Inter', sans-serif;
}
.stButton>button {
    background: linear-gradient(45deg, #4f46e5, #7c3aed);
    color: white;
    border: none;
    border-radius: 8px;
    padding: 0.75rem 1.5rem;
    font-weight: 600;
    transition: all 0.3s ease;
}
.stButton>button:hover {
    transform: translateY(-2px);
    box-shadow: 0 4px 12px rgba(79, 70, 229, 0.3);
}
div[data-testid="stMetricValue"] {
    color: #111827;
    font-weight: 700;
}
h1 {
    background: -webkit-linear-gradient(45deg, #1e3a8a, #3b82f6);
    -webkit-background-clip: text;
    -webkit-text-fill-color: transparent;
    font-weight: 800 !important;
}
.css-1d391kg {
    background-color: #ffffff;
    border-radius: 12px;
    padding: 1.5rem;
    box-shadow: 0 4px 6px -1px rgba(0, 0, 0, 0.1);
}
</style>
""", unsafe_allow_html=True)

st.title("🧠 Semantic Document Analyzer")
st.markdown("""
<div style='background-color: white; padding: 1.5rem; border-radius: 10px; box-shadow: 0 2px 5px rgba(0,0,0,0.05); margin-bottom: 2rem;'>
<h4 style='margin-top:0'>Holistic Document Understanding</h4>
<p style='color: #4b5563;'>
This AI system leverages <b>Sentence-BERT</b> and <b>Cross-Encoders</b> to perform deep semantic analysis across long documents.
It goes beyond simple keyword matching to understand context, detecting subtle contradictions and semantic duplicates.
</p>
</div>
""", unsafe_allow_html=True)

# Sidebar: file upload plus the analysis trigger.
with st.sidebar:
    st.header("Upload Documents")
    # FIX: backend.DocumentProcessor.extract_text supports .txt as well as .pdf
    # (and the README advertises both), but the uploader previously only
    # accepted PDFs. Widening `type` is backward-compatible.
    uploaded_files = st.file_uploader("Upload PDF or TXT files", type=['pdf', 'txt'], accept_multiple_files=True)
    analyze_btn = st.button("Analyze Documents", type="primary")

if analyze_btn and uploaded_files:
    # NOTE: a previous `if len(uploaded_files) == 0` branch here was unreachable
    # (the outer condition already requires a non-empty upload list); removed.
    with st.spinner("Processing documents... This may take a while for large files."):
        # Persist the uploads to a temp dir so the backend can read them by path.
        temp_dir = tempfile.mkdtemp()
        file_paths = []
        for uploaded_file in uploaded_files:
            path = os.path.join(temp_dir, uploaded_file.name)
            with open(path, "wb") as f:
                f.write(uploaded_file.getbuffer())
            file_paths.append(path)

        try:
            analyzer = SemanticAnalyzer()
            results = analyzer.analyze_documents(file_paths)

            # Cleanup intentionally left disabled while results are on screen.
            # for path in file_paths: os.remove(path)
            # os.rmdir(temp_dir)

            if "error" in results:
                st.error(results["error"])
            else:
                # Dashboard layout: headline metrics first.
                col1, col2 = st.columns(2)
                with col1:
                    st.metric("Total Documents", results['stats']['total_docs'])
                with col2:
                    st.metric("Total Text Chunks", results['stats']['total_chunks'])

                st.divider()

                # 1. Duplicates: side-by-side chunk pair per expander.
                st.subheader(f"⚠️ Potential Duplicates Detected ({len(results['duplicates'])})")
                if results['duplicates']:
                    for dup in results['duplicates']:
                        with st.expander(f"Similarity Score: {dup['score']:.4f}"):
                            c1, c2 = st.columns(2)
                            with c1:
                                st.caption(f"Source: {dup['chunk_a']['source']}")
                                st.info(dup['chunk_a']['text'])
                            with c2:
                                st.caption(f"Source: {dup['chunk_b']['source']}")
                                st.info(dup['chunk_b']['text'])
                else:
                    st.success("No duplicates found.")

                st.divider()

                # 2. Contradictions: same layout, warning styling.
                st.subheader(f"🛑 Contradictions / Inconsistencies ({len(results['contradictions'])})")
                if results['contradictions']:
                    for contra in results['contradictions']:
                        with st.expander(f"Contradiction Confidence: {contra['confidence']:.4f}"):
                            c1, c2 = st.columns(2)
                            with c1:
                                st.caption(f"Source: {contra['chunk_a']['source']}")
                                st.warning(contra['chunk_a']['text'])
                            with c2:
                                st.caption(f"Source: {contra['chunk_b']['source']}")
                                st.warning(contra['chunk_b']['text'])
                else:
                    # CONSISTENCY: mirror the duplicates section's empty-state message.
                    st.success("No contradictions found.")

                # Export a Markdown report mirroring the on-screen findings.
                report_text = f"# Semantic Analysis Report\n\n"
                report_text += f"Total Documents: {results['stats']['total_docs']}\n"
                report_text += f"Total Chunks: {results['stats']['total_chunks']}\n\n"

                report_text += "## Duplicates\n"
                if results['duplicates']:
                    for d in results['duplicates']:
                        report_text += f"- Score: {d['score']:.4f}\n"
                        report_text += f"  - Source A: {d['chunk_a']['source']} | \"{d['chunk_a']['text'][:100]}...\"\n"
                        report_text += f"  - Source B: {d['chunk_b']['source']} | \"{d['chunk_b']['text'][:100]}...\"\n\n"
                else:
                    report_text += "No duplicates found.\n\n"

                report_text += "## Contradictions\n"
                if results['contradictions']:
                    for c in results['contradictions']:
                        report_text += f"- Confidence: {c['confidence']:.4f}\n"
                        report_text += f"  - Source A: {c['chunk_a']['source']} | \"{c['chunk_a']['text']}\"\n"
                        report_text += f"  - Source B: {c['chunk_b']['source']} | \"{c['chunk_b']['text']}\"\n\n"
                else:
                    report_text += "No contradictions found.\n"

                st.download_button(
                    label="Download Report (Markdown)",
                    data=report_text,
                    file_name="analysis_report.md",
                    mime="text/markdown"
                )

        except Exception as e:
            st.error(f"An error occurred during analysis: {str(e)}")
            import traceback
            st.write(traceback.format_exc())

else:
    st.info("Upload documents and click Analyze to start.")
|
backend.py
ADDED
|
@@ -0,0 +1,200 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import os
|
| 2 |
+
from typing import List, Dict, Tuple
|
| 3 |
+
import pypdf
|
| 4 |
+
import numpy as np
|
| 5 |
+
import faiss
|
| 6 |
+
import torch
|
| 7 |
+
from sentence_transformers import SentenceTransformer, CrossEncoder
|
| 8 |
+
|
| 9 |
+
class DocumentProcessor:
    """Static helpers: load raw text from a file and split it into overlapping chunks."""

    @staticmethod
    def extract_text(file_path: str) -> str:
        """Return the plain text of a .pdf or .txt file.

        Unknown extensions yield an empty string rather than raising.
        """
        ext = os.path.splitext(file_path)[1].lower()
        if ext == '.pdf':
            with open(file_path, 'rb') as f:
                reader = pypdf.PdfReader(f)
                text = ""
                for page in reader.pages:
                    page_text = page.extract_text()
                    # extract_text() may return None/"" for image-only pages.
                    if page_text:
                        text += page_text + "\n"
                return text
        elif ext == '.txt':
            with open(file_path, 'r', encoding='utf-8') as f:
                return f.read()
        else:
            return ""

    @staticmethod
    def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[Dict]:
        """
        Split *text* into overlapping character-window chunks.

        Returns a list of dicts with 'id', 'text' (stripped), 'start_char',
        'end_char'. A simple character sliding window is used for simplicity;
        ideally this would be token- or sentence-based.
        """
        chunks = []
        text_len = len(text)
        start = 0
        chunk_id = 0

        while start < text_len:
            end = min(start + chunk_size, text_len)
            piece = text[start:end]  # renamed: previously shadowed the method name

            # Prefer to cut at the last period/newline so chunks end cleanly,
            # but only if that boundary is past the window midpoint.
            if end < text_len:
                last_period = piece.rfind('.')
                last_newline = piece.rfind('\n')
                break_point = max(last_period, last_newline)
                if break_point != -1 and break_point > chunk_size * 0.5:
                    end = start + break_point + 1
                    piece = text[start:end]

            chunks.append({
                'id': chunk_id,
                'text': piece.strip(),
                'start_char': start,
                'end_char': end
            })
            chunk_id += 1

            # BUG FIX: the original advanced `start = end - overlap` unconditionally,
            # so once `end` reached the end of the text any remaining tail of length
            # <= overlap was re-chunked forever (infinite loop for every text shorter
            # than chunk_size but longer than overlap). Stop once the text is consumed.
            if end >= text_len:
                break
            start = end - overlap

        return chunks
|
| 67 |
+
|
| 68 |
+
class EmbeddingEngine:
    """Thin wrapper around a SentenceTransformer that emits L2-normalised vectors."""

    def __init__(self, model_name: str = 'all-MiniLM-L6-v2'):
        # Prefer the GPU when available; SentenceTransformer would usually
        # auto-detect, but being explicit keeps the choice visible.
        backend = 'cuda' if torch.cuda.is_available() else 'cpu'
        self.model = SentenceTransformer(model_name, device=backend)

    def encode(self, texts: List[str]) -> np.ndarray:
        """Embed *texts*; vectors are normalised in place so that inner
        product equals cosine similarity when used with FAISS."""
        vectors = self.model.encode(texts, convert_to_numpy=True)
        faiss.normalize_L2(vectors)
        return vectors
|
| 79 |
+
|
| 80 |
+
class VectorStore:
    """Exact nearest-neighbour store over L2-normalised embeddings.

    Uses a flat inner-product FAISS index; with normalised inputs the inner
    product is exactly the cosine similarity.
    """

    def __init__(self, dimension: int):
        self.dimension = dimension
        self.index = faiss.IndexFlatIP(dimension)

    def add(self, embeddings: np.ndarray):
        """Append embedding rows to the index."""
        self.index.add(embeddings)

    def search(self, query_embeddings: np.ndarray, k: int = 5) -> Tuple[np.ndarray, np.ndarray]:
        """Return FAISS (similarities, indices) arrays for the top-*k* matches per query row."""
        return self.index.search(query_embeddings, k)
|
| 90 |
+
|
| 91 |
+
class SemanticAnalyzer:
    """End-to-end pipeline: chunk documents, embed, index, then flag
    near-duplicate and mutually contradictory chunk pairs."""

    def __init__(self):
        self.embedding_engine = EmbeddingEngine()
        # Cross-encoder NLI model used to classify the relation between two
        # similar chunks. Assumed label order for
        # cross-encoder/nli-distilroberta-base: 0=contradiction, 1=entailment,
        # 2=neutral — TODO confirm against the model card.
        self.nli_model = CrossEncoder('cross-encoder/nli-distilroberta-base')

    def analyze_documents(self, file_paths: List[str]) -> Dict:
        """
        Run the full analysis pipeline over *file_paths*.

        Returns a dict with keys 'duplicates', 'contradictions' and 'stats',
        or {"error": ...} when no text could be extracted from any input.
        """
        all_chunks = []
        doc_map = {} # chunk_id -> source_doc (currently unused)

        # 1. Load and chunk each document, tagging every chunk with a
        #    corpus-wide id and its source filename.
        global_chunk_id = 0
        for fpath in file_paths:
            fname = os.path.basename(fpath)
            raw_text = DocumentProcessor.extract_text(fpath)
            chunks = DocumentProcessor.chunk_text(raw_text)
            for c in chunks:
                c['global_id'] = global_chunk_id
                c['source'] = fname
                all_chunks.append(c)
                global_chunk_id += 1

        if not all_chunks:
            return {"error": "No text extracted"}

        texts = [c['text'] for c in all_chunks]

        # 2. Embed all chunks (the engine L2-normalises the vectors).
        embeddings = self.embedding_engine.encode(texts)

        # 3. Build an exact inner-product index over the embeddings.
        d = embeddings.shape[1]
        vector_store = VectorStore(d)
        vector_store.add(embeddings)

        results = {
            "duplicates": [],
            "contradictions": [],
            "stats": {
                "total_docs": len(file_paths),
                "total_chunks": len(all_chunks)
            }
        }

        # 4. Detect duplicates & contradictions: query every chunk against the
        #    index for its nearest neighbours (k capped by corpus size).
        D, I = vector_store.search(embeddings, k=min(10, len(all_chunks)))

        checked_pairs = set()

        for i in range(len(all_chunks)):
            for rank, j in enumerate(I[i]):
                if i == j: continue # Skip self-match

                sim_score = D[i][rank]
                if sim_score < 0.5: continue # below this, neither check is worth running

                # Canonicalise the pair so (i, j) and (j, i) are checked only once.
                pair = tuple(sorted((i, j)))
                if pair in checked_pairs:
                    continue
                checked_pairs.add(pair)

                chunk_a = all_chunks[i]
                chunk_b = all_chunks[j]

                # DUPLICATE DETECTION: cosine similarity above 0.95 is treated
                # as a near-duplicate pair.
                if sim_score > 0.95:
                    results["duplicates"].append({
                        "score": float(sim_score),
                        "chunk_a": chunk_a,
                        "chunk_b": chunk_b
                    })
                    continue # near-duplicates are not also tested for contradiction

                # CONTRADICTION DETECTION: same topic (high similarity) but not
                # identical — run NLI on the pair.
                if sim_score > 0.65:
                    # CrossEncoder.predict takes a list of (text_a, text_b)
                    # pairs and returns one logit row per pair.
                    scores = self.nli_model.predict([(chunk_a['text'], chunk_b['text'])])
                    # argmax picks the predicted NLI label; index 0 is assumed
                    # to be 'contradiction' (see note in __init__).
                    label = scores[0].argmax()

                    if label == 0: # Contradiction
                        results["contradictions"].append({
                            "similarity": float(sim_score),
                            # NOTE(review): raw contradiction *logit*, not a
                            # probability — apply softmax if a calibrated
                            # confidence is ever needed.
                            "confidence": float(scores[0][0]),
                            "chunk_a": chunk_a,
                            "chunk_b": chunk_b
                        })

        return results
|
doc_a.txt
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
The software release is scheduled for Q3 2024.
|
| 2 |
+
Machine learning models require vast amounts of data.
|
| 3 |
+
This is a generic statement about AI capabilities.
|
doc_b.txt
ADDED
|
@@ -0,0 +1,3 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
The software release is strictly scheduled for Q4 2025.
|
| 2 |
+
Machine learning models require vast amounts of data.
|
| 3 |
+
AI generates images from text prompts.
|
requirements.txt
CHANGED
|
@@ -1,3 +1,7 @@
|
|
| 1 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 2 |
pandas
|
| 3 |
-
|
|
|
|
| 1 |
+
streamlit
|
| 2 |
+
sentence-transformers
|
| 3 |
+
faiss-cpu
|
| 4 |
+
torch
|
| 5 |
+
numpy
|
| 6 |
pandas
|
| 7 |
+
pypdf
|
run_app.sh
ADDED
|
@@ -0,0 +1,12 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/bin/bash
# Launch the Semantic Analyzer with the virtual environment's streamlit so
# venv-installed packages are found (avoids ModuleNotFoundError).

# Guard clause: refuse to run without the venv.
# FIX: the original fell through after the error messages and exited with the
# status of the last `echo` (0), masking the failure for callers/CI.
if [ ! -d "venv" ]; then
    echo "Error: Virtual environment 'venv' not found."
    echo "Please run: python3 -m venv venv && ./venv/bin/pip install -r requirements.txt"
    exit 1
fi

echo "Starting Semantic Analyzer from venv..."
./venv/bin/streamlit run app.py
|
verify_backend.py
ADDED
|
@@ -0,0 +1,55 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import os
|
| 2 |
+
import shutil
|
| 3 |
+
from backend import SemanticAnalyzer
|
| 4 |
+
|
| 5 |
+
def create_dummy_files():
    """Recreate test_data/ with two small documents that overlap and conflict.

    doc_a and doc_b share one identical sentence (duplicate case) and state
    conflicting release dates (contradiction case); the remaining lines are
    unrelated filler.
    """
    if os.path.exists("test_data"):
        shutil.rmtree("test_data")
    os.makedirs("test_data")

    doc_a_lines = [
        "The software release is scheduled for Q3 2024.\n",
        "Machine learning models require vast amounts of data.\n",
        "This is a generic statement about AI capabilities.\n",
    ]
    doc_b_lines = [
        "The software release is strictly scheduled for Q4 2025.\n",  # contradicts doc_a
        "Machine learning models require vast amounts of data.\n",    # duplicates doc_a
        "AI generates images from text prompts.\n",                   # unrelated
    ]

    with open("test_data/doc_a.txt", "w") as f:
        f.writelines(doc_a_lines)
    with open("test_data/doc_b.txt", "w") as f:
        f.writelines(doc_b_lines)
|
| 21 |
+
|
| 22 |
+
def run_test():
    """Smoke-test the backend: build fixtures, analyze them, print a verdict."""
    create_dummy_files()
    files = ["test_data/doc_a.txt", "test_data/doc_b.txt"]

    print("Initializing Analyzer...")
    analyzer = SemanticAnalyzer()

    print("Analyzing...")
    results = analyzer.analyze_documents(files)

    print("\n=== RESULTS ===")
    print(f"Duplicates found: {len(results['duplicates'])}")
    for d in results['duplicates']:
        print(f"  [Match] ({d['score']:.4f})")
        print(f"    A: {d['chunk_a']['text']}")
        print(f"    B: {d['chunk_b']['text']}")

    print(f"\nContradictions found: {len(results['contradictions'])}")
    for c in results['contradictions']:
        print(f"  [Conflict] (Conf: {c['confidence']:.4f})")
        print(f"    A: {c['chunk_a']['text']}")
        print(f"    B: {c['chunk_b']['text']}")

    # Validation: the shared sentence must be flagged as a duplicate and the
    # conflicting release dates as a contradiction.
    dup_hit = any("vast amounts of data" in d['chunk_a']['text'] for d in results['duplicates'])
    contra_hit = any("software release" in c['chunk_a']['text'] for c in results['contradictions'])

    verdict = ("\n✅ VERIFICATION PASSED: Core logic works."
               if dup_hit and contra_hit
               else "\n❌ VERIFICATION FAILED: Missing expected detections.")
    print(verdict)
|
| 53 |
+
|
| 54 |
+
if __name__ == "__main__":
|
| 55 |
+
run_test()
|