Spaces: Krish-05/text-extraction-api


krishnachoudhary-hclguvi committed on Apr 2

Commit

52a0fe9

unverified ·

1 Parent(s): d1e9916

Deploy text extraction API files


Files changed (23)

DEPLOYMENT.md +64 -0
Dockerfile +40 -0
README.md +64 -10
analyzers/__init__.py +1 -0
analyzers/ner_extractor.py +145 -0
analyzers/sentiment.py +103 -0
analyzers/summarizer.py +98 -0
config.py +77 -0
docker-compose.yml +13 -0
extractors/__init__.py +1 -0
extractors/docx_extractor.py +95 -0
extractors/ocr_extractor.py +245 -0
extractors/pdf_extractor.py +79 -0
extractors/url_extractor.py +108 -0
main.py +307 -0
models/__init__.py +1 -0
models/schemas.py +116 -0
requirements.txt +16 -0
static/app.js +586 -0
static/index.html +268 -0
static/styles.css +1156 -0
test_api.py +78 -0
test_simple.py +49 -0

DEPLOYMENT.md ADDED

	@@ -0,0 +1,64 @@

+# Alldocex - Deployment Guide
+This guide provides three main options for deploying the Alldocex application to a production environment.
+## 🏗️ Option 1: Docker (Recommended)
+Docker is the best choice because it packages all the AI models, dependencies, and system libraries into a single container.
+### 1. Build the image
+```bash
+docker build -t alldocex-app .
+```
+### 2. Run with Docker Compose
+```bash
+docker-compose up -d
+```
+The application will be available at `http://localhost:7860` (the port published by `docker-compose.yml`).
+---
+## ☁️ Option 2: Cloud Deployment (Render / Railway / Fly.io)
+### **Render Deployment (Recommended)**
+1.  **Connect GitHub**: Push your code to a GitHub repository.
+2.  **Create Web Service**: Select "Web Service" in Render.
+3.  **Docker Environment**: Render will automatically detect the `Dockerfile`.
+4.  **Resource Plan**: Ensure you select a plan with at least **4GB RAM** (e.g., Starter or Pro).
+5.  **Environment Variables**: If required, set `PORT = 7860` to match the port the `Dockerfile` listens on.
+---
+## 🖥️ Option 3: Manual Deployment (Ubuntu/Debian Server)
+If you are deploying directly to a Linux VPS (without Docker):
+### 1. Install System Dependencies
+```bash
+sudo apt-get update
+sudo apt-get install -y tesseract-ocr libgl1-mesa-glx libglib2.0-0 build-essential python3-venv
+```
+### 2. Set Up Environment
+```bash
+python3 -m venv venv
+source venv/bin/activate
+pip install -r requirements.txt
+python -m spacy download en_core_web_sm
+```
+### 3. Run with Gunicorn (Production Server)
+```bash
+pip install gunicorn
+gunicorn -w 4 -k uvicorn.workers.UvicornWorker main:app --bind 0.0.0.0:8000
+```
+---
+## ⚠️ Important Considerations
+*   **RAM**: AI models (EasyOCR, Torch, spaCy) are memory-intensive. Do NOT deploy on a "Free Tier" instance with only 512MB or 1GB of RAM.
+*   **Disk Space**: The first time you run the app, it will download several hundred megabytes of model weights.
+*   **Permissions**: Ensure the `uploads/` directory has write permissions for the user running the application.
+*   **Reverse Proxy**: For public deployment, it is highly recommended to use **Nginx** as a reverse proxy with SSL (Let's Encrypt).
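
The permissions note above can be checked before the server starts. The following is a minimal sketch, assuming a hypothetical helper `is_writable_dir` (not part of this commit) that tries to create a temporary file instead of trusting mode bits:

```python
import os
import tempfile


def is_writable_dir(path: str) -> bool:
    """Return True if a file can actually be created inside `path`.

    Hypothetical helper for illustration; it is not part of the commit.
    """
    if not os.path.isdir(path):
        return False
    try:
        # Creating (and automatically deleting) a temp file is the most
        # direct proof that the current user can write to the directory.
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False
```

Running `is_writable_dir("uploads")` as the service user before launch would surface a permissions problem early instead of on the first upload.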

Dockerfile ADDED

	@@ -0,0 +1,40 @@

+# Use a lean Python 3.10 image as base
+FROM python:3.10-slim
+# Set environment variables
+ENV PYTHONDONTWRITEBYTECODE=1
+ENV PYTHONUNBUFFERED=1
+ENV DEBIAN_FRONTEND=noninteractive
+# Set working directory
+WORKDIR /app
+# Install system dependencies for OCR and NLP
+RUN apt-get update && apt-get install -y \
+    tesseract-ocr \
+    libgl1-mesa-glx \
+    libglib2.0-0 \
+    build-essential \
+    && apt-get clean \
+    && rm -rf /var/lib/apt/lists/*
+# Copy requirements file
+COPY requirements.txt .
+# Install Python dependencies
+RUN pip install --no-cache-dir -r requirements.txt
+# Download spaCy model during build to improve runtime performance
+RUN python -m spacy download en_core_web_sm
+# Create uploads directory
+RUN mkdir -p /app/uploads
+# Copy the rest of the application code
+COPY . .
+# Expose the API port for Hugging Face Spaces
+EXPOSE 7860
+# Start the application using Uvicorn
+CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"]

README.md CHANGED

@@ -1,11 +1,65 @@
----
-title: Text Extraction Api
-emoji: 🦀
-colorFrom: blue
-colorTo: indigo
-sdk: docker
-pinned: false
-license: mit
----
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# Alldocex — Intelligent Document Processing System
+![Version](https://img.shields.io/badge/version-1.1.0-blue)
+![License](https://img.shields.io/badge/license-MIT-green)
+**Alldocex** is a high-performance, professional-grade document intelligence platform that extracts, analyzes, and summarizes content from various document formats using state-of-the-art AI.
+## 🚀 Key Features
+*   **Multi-Format Extraction**: Supports PDF, DOCX, and high-resolution images (PNG, JPG, TIFF, etc.).
+*   **Layout-Aware PDF Engine**: Uses advanced 'layout' mode to preserve columns, tables, and physical text positioning.
+*   **Intelligent OCR**: Powered by **EasyOCR** (Deep Learning based) for superior accuracy in scanned documents.
+*   **Web URL Summarization**: Paste any web link to instantly extract and analyze its core content.
+*   **AI Analysis Suite**:
+    *   **Extractive Summarization**: Condenses long documents into key highlights.
+    *   **Named Entity Recognition (NER)**: Detects People, Organizations, Dates, and more via **spaCy**.
+    *   **Sentiment Analysis**: Analyzes emotional tone using the **VADER** algorithm.
+*   **Downloadable Results**: Export extracted text as clean `.txt` files.
+*   **Corporate UI**: A professional Blue & White dashboard with smooth animations and intuitive navigation.
+## 🛠️ Technology Stack
+*   **Backend**: [FastAPI](https://fastapi.tiangolo.com/) (Async Python)
+*   **PDF Processing**: [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) (Layout Mode)
+*   **OCR**: [EasyOCR](https://github.com/JaidedAI/EasyOCR) & [Tesseract](https://github.com/tesseract-ocr/tesseract)
+*   **NLP**: [spaCy](https://spacy.io/) & [Sumy](https://github.com/miso-belica/sumy)
+*   **Frontend**: Vanilla HTML5, CSS3 (Modern UI), and JavaScript (ES6+)
+## 📦 Installation
+### 1. Clone the repository
+```bash
+git clone <your-repo-url>
+cd guvi-extraction
+```
+### 2. Install dependencies
+```bash
+pip install -r requirements.txt
+```
+### 3. Install NLP model
+```bash
+python -m spacy download en_core_web_sm
+```
+## 🏃 Getting Started
+1.  Start the backend server:
+    ```bash
+    python main.py
+    ```
+2.  Open your browser and navigate to:
+    `http://localhost:7860`
+## 📘 Usage
+1.  **Direct Upload**: Drag and drop your PDFs or images into the dashboard.
+2.  **Format Selection**: Click on specific badges (PDF, PNG, JPG) to open a filtered file picker.
+3.  **URL Entry**: Paste a web link to summarize online articles instantly.
+4.  **Download**: Once processing is complete, use the **Download** button to save the extracted text.
+---

analyzers/__init__.py ADDED

	@@ -0,0 +1 @@


+# Analyzers package

analyzers/ner_extractor.py ADDED

	@@ -0,0 +1,145 @@

+"""
+Named Entity Recognition using spaCy.
+Extracts persons, organizations, dates, monetary amounts, locations, and more.
+Also uses regex patterns for additional entity types.
+"""
+import re
+from collections import Counter
+from typing import List, Dict
+from models.schemas import Entity, EntityResult
+from config import SPACY_MODEL, NER_ENTITY_TYPES
+# Try to load spaCy model
+try:
+    import spacy
+    nlp = spacy.load(SPACY_MODEL)
+    SPACY_AVAILABLE = True
+except (ImportError, OSError):
+    SPACY_AVAILABLE = False
+    nlp = None
+# Entity label descriptions
+LABEL_DESCRIPTIONS = {
+    "PERSON": "Person name",
+    "ORG": "Organization",
+    "GPE": "Country / City / State",
+    "DATE": "Date or period",
+    "MONEY": "Monetary value",
+    "TIME": "Time expression",
+    "PERCENT": "Percentage",
+    "EVENT": "Named event",
+    "PRODUCT": "Product name",
+    "LAW": "Law or regulation",
+    "NORP": "Nationality / Group",
+    "FAC": "Facility / Building",
+    "LOC": "Non-GPE location",
+    "WORK_OF_ART": "Title of work",
+    "LANGUAGE": "Language name",
+    "CARDINAL": "Number",
+    "ORDINAL": "Ordinal number",
+    "QUANTITY": "Measurement",
+    "EMAIL": "Email address",
+    "PHONE": "Phone number",
+    "URL": "Web URL",
+}
+# Regex patterns for additional entity types
+REGEX_PATTERNS = {
+    "EMAIL": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
+    "PHONE": r'(?:\+?\d{1,3}[-.\s]?)?\(?\d{2,4}\)?[-.\s]?\d{3,4}[-.\s]?\d{3,4}',
+    "URL": r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+[/\w\-._~:/?#\[\]@!$&\'()*+,;=%]*',
+}
+def _extract_regex_entities(text: str) -> List[Entity]:
+    """Extract entities using regex patterns."""
+    entities = []
+    for label, pattern in REGEX_PATTERNS.items():
+        matches = re.findall(pattern, text)
+        if matches:
+            counted = Counter(matches)
+            for match_text, count in counted.most_common():
+                entities.append(Entity(
+                    text=match_text,
+                    label=label,
+                    label_description=LABEL_DESCRIPTIONS.get(label, label),
+                    count=count,
+                ))
+    return entities
+def _extract_spacy_entities(text: str) -> List[Entity]:
+    """Extract entities using spaCy NER."""
+    if not SPACY_AVAILABLE or nlp is None:
+        return []
+    # Process text (long inputs are truncated to max_length characters, not chunked)
+    max_length = 100000
+    if len(text) > max_length:
+        text = text[:max_length]
+    doc = nlp(text)
+    # Collect and deduplicate entities
+    entity_map: Dict[str, Dict] = {}
+    for ent in doc.ents:
+        if ent.label_ not in NER_ENTITY_TYPES:
+            continue
+        clean_text = ent.text.strip()
+        if not clean_text or len(clean_text) < 2:
+            continue
+        key = f"{ent.label_}:{clean_text.lower()}"
+        if key in entity_map:
+            entity_map[key]["count"] += 1
+            entity_map[key]["positions"].append(ent.start_char)
+        else:
+            entity_map[key] = {
+                "text": clean_text,
+                "label": ent.label_,
+                "label_description": LABEL_DESCRIPTIONS.get(ent.label_, ent.label_),
+                "count": 1,
+                "positions": [ent.start_char],
+            }
+    # Convert to Entity objects and sort by count
+    entities = [
+        Entity(**data)
+        for data in sorted(entity_map.values(), key=lambda x: x["count"], reverse=True)
+    ]
+    return entities
+def extract_entities(text: str) -> EntityResult:
+    """
+    Extract named entities from text using spaCy and regex patterns.
+    Args:
+        text: The input text to analyze.
+    Returns:
+        EntityResult with all found entities and statistics.
+    """
+    if not text.strip():
+        return EntityResult(entities=[], entity_counts={}, total_entities=0)
+    # Get entities from both sources
+    spacy_entities = _extract_spacy_entities(text)
+    regex_entities = _extract_regex_entities(text)
+    # Combine (spaCy entities first, then regex)
+    all_entities = spacy_entities + regex_entities
+    # Count by category
+    entity_counts: Dict[str, int] = {}
+    for ent in all_entities:
+        entity_counts[ent.label] = entity_counts.get(ent.label, 0) + ent.count
+    return EntityResult(
+        entities=all_entities,
+        entity_counts=entity_counts,
+        total_entities=sum(ent.count for ent in all_entities),
+    )
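
The regex pass in this file can be illustrated in isolation. The following is a minimal standard-library sketch, not the commit's code: `count_regex_entities` is a hypothetical name, the URL pattern is deliberately simplified, and the email pattern uses the character class `[A-Za-z]` for the top-level domain.

```python
import re
from collections import Counter

# Simplified stand-ins for REGEX_PATTERNS; the URL pattern is an assumption
# made for brevity and is looser than the one in the diff.
PATTERNS = {
    "EMAIL": r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b",
    "URL": r"https?://[^\s<>\"]+",
}


def count_regex_entities(text: str) -> dict:
    """Return {label: Counter(match -> occurrences)} for each pattern."""
    found = {}
    for label, pattern in PATTERNS.items():
        # findall returns whole matches here because the patterns
        # contain no capturing groups.
        matches = re.findall(pattern, text)
        if matches:
            found[label] = Counter(matches)
    return found


sample = "Mail ana@example.com or ana@example.com; docs at https://example.org/a"
result = count_regex_entities(sample)
```

Counting with `Counter` is what lets repeated mentions collapse into a single entity with a `count`, which mirrors how `_extract_regex_entities` builds its `Entity` objects.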

analyzers/sentiment.py ADDED

	@@ -0,0 +1,103 @@

+"""
+Sentiment analysis using NLTK's VADER (Valence Aware Dictionary and sEntiment Reasoner).
+Provides both overall and sentence-level sentiment analysis.
+"""
+import nltk
+from nltk.sentiment.vader import SentimentIntensityAnalyzer
+from nltk.tokenize import sent_tokenize
+from models.schemas import SentimentResult, SentimentBreakdown
+from config import SENTIMENT_THRESHOLDS
+from typing import List
+# Download required NLTK data
+try:
+    nltk.data.find("sentiment/vader_lexicon.zip")
+except LookupError:
+    nltk.download("vader_lexicon", quiet=True)
+try:
+    nltk.data.find("tokenizers/punkt_tab")
+except LookupError:
+    nltk.download("punkt_tab", quiet=True)
+# Initialize analyzer
+sia = SentimentIntensityAnalyzer()
+def _get_sentiment_label(compound: float) -> str:
+    """Convert compound score to human-readable label."""
+    if compound >= 0.5:
+        return "Very Positive"
+    elif compound >= SENTIMENT_THRESHOLDS["positive"]:
+        return "Positive"
+    elif compound <= -0.5:
+        return "Very Negative"
+    elif compound <= SENTIMENT_THRESHOLDS["negative"]:
+        return "Negative"
+    else:
+        return "Neutral"
+def analyze_sentiment(text: str) -> SentimentResult:
+    """
+    Perform sentiment analysis on the given text.
+    Returns overall sentiment scores and sentence-level breakdown.
+    Args:
+        text: The input text to analyze.
+    Returns:
+        SentimentResult with overall and per-sentence sentiment analysis.
+    """
+    if not text.strip():
+        return SentimentResult(
+            overall_compound=0.0,
+            overall_positive=0.0,
+            overall_negative=0.0,
+            overall_neutral=1.0,
+            overall_label="Neutral",
+            sentence_breakdown=[],
+            confidence=0.0,
+        )
+    # Overall sentiment
+    overall_scores = sia.polarity_scores(text)
+    # Sentence-level breakdown
+    sentences = sent_tokenize(text)
+    sentence_breakdown: List[SentimentBreakdown] = []
+    # Limit to first 50 sentences for performance
+    for sent in sentences[:50]:
+        sent = sent.strip()
+        if not sent or len(sent) < 5:
+            continue
+        scores = sia.polarity_scores(sent)
+        sentence_breakdown.append(SentimentBreakdown(
+            text=sent[:200],  # Truncate very long sentences
+            compound=round(scores["compound"], 4),
+            positive=round(scores["pos"], 4),
+            negative=round(scores["neg"], 4),
+            neutral=round(scores["neu"], 4),
+            label=_get_sentiment_label(scores["compound"]),
+        ))
+    # Estimate confidence from the average magnitude of the sentence-level compound scores
+    if sentence_breakdown:
+        compounds = [sb.compound for sb in sentence_breakdown]
+        avg_magnitude = sum(abs(c) for c in compounds) / len(compounds)
+        confidence = min(avg_magnitude * 2, 1.0)  # Scale to 0-1
+    else:
+        confidence = abs(overall_scores["compound"])
+    return SentimentResult(
+        overall_compound=round(overall_scores["compound"], 4),
+        overall_positive=round(overall_scores["pos"], 4),
+        overall_negative=round(overall_scores["neg"], 4),
+        overall_neutral=round(overall_scores["neu"], 4),
+        overall_label=_get_sentiment_label(overall_scores["compound"]),
+        sentence_breakdown=sentence_breakdown,
+        confidence=round(confidence, 4),
+    )
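
The labelling and confidence rules in this file do not depend on NLTK, so they can be examined on their own. The following is a minimal sketch under that assumption; `sentiment_label` and `magnitude_confidence` are illustrative names, and returning `0.0` for an empty list is a simplification (the diff falls back to the overall compound score instead).

```python
from typing import List

# Same cut-offs as SENTIMENT_THRESHOLDS in config.py.
THRESHOLDS = {"positive": 0.05, "negative": -0.05}


def sentiment_label(compound: float) -> str:
    """Map a VADER-style compound score in [-1, 1] to a label."""
    if compound >= 0.5:
        return "Very Positive"
    if compound >= THRESHOLDS["positive"]:
        return "Positive"
    if compound <= -0.5:
        return "Very Negative"
    if compound <= THRESHOLDS["negative"]:
        return "Negative"
    return "Neutral"


def magnitude_confidence(compounds: List[float]) -> float:
    """Average absolute compound score, doubled and capped at 1.0."""
    if not compounds:
        return 0.0  # simplification for this sketch
    avg_magnitude = sum(abs(c) for c in compounds) / len(compounds)
    return min(avg_magnitude * 2, 1.0)
```

Because the score uses absolute values, two sentences scoring `0.5` and `-0.5` yield a confidence of `1.0` even though they disagree. The value therefore measures how strongly worded the sentences are, not how consistent they are.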

analyzers/summarizer.py ADDED

	@@ -0,0 +1,98 @@

+"""
+Extractive text summarization using sumy library.
+Uses LexRank algorithm by default for graph-based sentence ranking.
+"""
+from sumy.parsers.plaintext import PlaintextParser
+from sumy.nlp.tokenizers import Tokenizer
+from sumy.summarizers.lex_rank import LexRankSummarizer
+from sumy.summarizers.lsa import LsaSummarizer
+from sumy.summarizers.luhn import LuhnSummarizer
+from sumy.nlp.stemmers import Stemmer
+from sumy.utils import get_stop_words
+from models.schemas import SummaryResult
+from config import SUMMARY_SENTENCE_COUNT, SUMMARY_ALGORITHM
+LANGUAGE = "english"
+def _get_summarizer(algorithm: str):
+    """Get the appropriate summarizer based on algorithm name."""
+    stemmer = Stemmer(LANGUAGE)
+    if algorithm == "lsa":
+        summarizer = LsaSummarizer(stemmer)
+    elif algorithm == "luhn":
+        summarizer = LuhnSummarizer(stemmer)
+    else:  # default to lex-rank
+        summarizer = LexRankSummarizer(stemmer)
+    summarizer.stop_words = get_stop_words(LANGUAGE)
+    return summarizer
+def summarize_text(text: str, sentence_count: int = None, algorithm: str = None) -> SummaryResult:
+    """
+    Generate an extractive summary of the given text.
+    Args:
+        text: The input text to summarize.
+        sentence_count: Number of sentences in the summary (default from config).
+        algorithm: Summarization algorithm to use (default from config).
+    Returns:
+        SummaryResult with the summary and statistics.
+    """
+    if sentence_count is None:
+        sentence_count = SUMMARY_SENTENCE_COUNT
+    if algorithm is None:
+        algorithm = SUMMARY_ALGORITHM
+    # Handle short texts
+    sentences_in_text = [s.strip() for s in text.replace("\n", " ").split(".") if s.strip()]
+    if len(sentences_in_text) <= sentence_count:
+        # Text is already short enough
+        clean_text = " ".join(text.split())
+        return SummaryResult(
+            summary=clean_text,
+            original_length=len(text),
+            summary_length=len(clean_text),
+            compression_ratio=1.0,
+            sentence_count=len(sentences_in_text),
+            algorithm=algorithm,
+        )
+    try:
+        # Parse the text
+        parser = PlaintextParser.from_string(text, Tokenizer(LANGUAGE))
+        summarizer = _get_summarizer(algorithm)
+        # Generate summary
+        summary_sentences = summarizer(parser.document, sentence_count)
+        summary = " ".join(str(sentence) for sentence in summary_sentences)
+        if not summary.strip():
+            # Fallback: return first N sentences
+            summary = ". ".join(sentences_in_text[:sentence_count]) + "."
+        compression_ratio = len(summary) / len(text) if len(text) > 0 else 1.0
+        return SummaryResult(
+            summary=summary,
+            original_length=len(text),
+            summary_length=len(summary),
+            compression_ratio=round(compression_ratio, 4),
+            sentence_count=sentence_count,
+            algorithm=algorithm,
+        )
+    except Exception as e:
+        # Fallback: return first few sentences
+        fallback = ". ".join(sentences_in_text[:sentence_count]) + "."
+        return SummaryResult(
+            summary=fallback,
+            original_length=len(text),
+            summary_length=len(fallback),
+            compression_ratio=round(len(fallback) / len(text), 4) if len(text) > 0 else 1.0,
+            sentence_count=sentence_count,
+            algorithm=f"{algorithm} (fallback)",
+        )
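
The short-text check and the first-N-sentences fallback share the same naive splitter. The following is a minimal sketch of that path without sumy; `fallback_summary` is a hypothetical name, and the function returns a plain tuple rather than a `SummaryResult`.

```python
from typing import Tuple


def fallback_summary(text: str, sentence_count: int = 5) -> Tuple[str, float]:
    """Return (summary, compression_ratio) using period-based splitting."""
    # Same splitter as the diff: newlines become spaces, then split on ".".
    sentences = [s.strip() for s in text.replace("\n", " ").split(".") if s.strip()]
    if len(sentences) <= sentence_count:
        # Already short enough: only normalise whitespace.
        summary = " ".join(text.split())
    else:
        summary = ". ".join(sentences[:sentence_count]) + "."
    ratio = round(len(summary) / len(text), 4) if text else 1.0
    return summary, ratio
```

Splitting on `"."` also cuts inside decimals and abbreviations, so `"Pi is 3.14."` counts as two sentences. A tokenizer-based split would be more accurate if the sentence count matters.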

config.py ADDED

	@@ -0,0 +1,77 @@

+"""
+Configuration settings for the Document Processing System.
+"""
+import os
+import shutil
+# --- Paths ---
+BASE_DIR = os.path.dirname(os.path.abspath(__file__))
+UPLOAD_DIR = os.path.join(BASE_DIR, "uploads")
+STATIC_DIR = os.path.join(BASE_DIR, "static")
+# Create uploads directory if it doesn't exist
+os.makedirs(UPLOAD_DIR, exist_ok=True)
+# --- File Upload Settings ---
+MAX_FILE_SIZE_MB = 50
+MAX_FILE_SIZE_BYTES = MAX_FILE_SIZE_MB * 1024 * 1024
+ALLOWED_EXTENSIONS = {
+    "pdf": "application/pdf",
+    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+    "png": "image/png",
+    "jpg": "image/jpeg",
+    "jpeg": "image/jpeg",
+    "tiff": "image/tiff",
+    "bmp": "image/bmp",
+    "webp": "image/webp",
+}
+# --- OCR Configuration ---
+# EasyOCR settings
+EASYOCR_LANGS = ["en"]  # Languages to support
+EASYOCR_GPU = False      # Set to True if NVIDIA GPU is available and CUDA is installed
+# Keep Tesseract as fallback if needed, but prioritize EasyOCR for accuracy
+def find_tesseract():
+    """Auto-detect the Tesseract binary: check PATH first, then common Windows install paths."""
+    tesseract_in_path = shutil.which("tesseract")
+    if tesseract_in_path:
+        return tesseract_in_path
+    common_paths = [
+        r"C:\Program Files\Tesseract-OCR\tesseract.exe",
+        r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
+        r"C:\Users\{}\AppData\Local\Tesseract-OCR\tesseract.exe".format(os.getenv("USERNAME", "")),
+    ]
+    for path in common_paths:
+        if os.path.isfile(path):
+            return path
+    return None
+TESSERACT_CMD = find_tesseract()
+TESSERACT_LANG = "eng"
+def check_ocr_availability():
+    """Check if any OCR engine is available."""
+    try:
+        import easyocr
+        return "available"
+    except ImportError:
+        if TESSERACT_CMD:
+            return "tesseract-only"
+        return "not-found"
+# --- Summarization Settings ---
+SUMMARY_SENTENCE_COUNT = 5
+SUMMARY_ALGORITHM = "lex-rank"  # Options: lex-rank, lsa, luhn (any other value falls back to lex-rank)
+# --- NER Settings ---
+SPACY_MODEL = "en_core_web_sm"
+NER_ENTITY_TYPES = ["PERSON", "ORG", "DATE", "MONEY", "GPE", "EVENT", "PRODUCT", "LAW", "NORP"]
+# --- Sentiment Settings ---
+SENTIMENT_THRESHOLDS = {
+    "positive": 0.05,
+    "negative": -0.05,
+}
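
The upload limits defined in this file could be enforced along the following lines. This is a sketch only: `validate_upload` is a hypothetical helper, and how `main.py` actually applies these settings is not shown in this part of the diff.

```python
import os
from typing import Tuple

MAX_FILE_SIZE_MB = 50
MAX_FILE_SIZE_BYTES = MAX_FILE_SIZE_MB * 1024 * 1024
# Keys of ALLOWED_EXTENSIONS from config.py (MIME types omitted here).
ALLOWED_EXTENSIONS = {"pdf", "docx", "png", "jpg", "jpeg", "tiff", "bmp", "webp"}


def validate_upload(filename: str, size_bytes: int) -> Tuple[bool, str]:
    """Return (ok, reason); reason is empty when the upload is acceptable."""
    # Compare case-insensitively and without the leading dot.
    ext = os.path.splitext(filename)[1].lower().lstrip(".")
    if ext not in ALLOWED_EXTENSIONS:
        return False, f"Unsupported file type: .{ext}"
    if size_bytes > MAX_FILE_SIZE_BYTES:
        return False, f"File exceeds the {MAX_FILE_SIZE_MB} MB limit"
    return True, ""
```

With the extension set as configured, a file named `scan.tif` is rejected because only `tiff` is listed.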

docker-compose.yml ADDED

	@@ -0,0 +1,13 @@

+version: '3.8'
+services:
+  alldocex:
+    build: .
+    container_name: alldocex-app
+    ports:
+      - "7860:7860"
+    volumes:
+      - ./uploads:/app/uploads
+    environment:
+      - PORT=7860
+    restart: always
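
The Dockerfile starts Uvicorn with a fixed `--port 7860`, so a `PORT` variable set in the compose file only matters if the application reads it. `main.py` is not shown in this part of the diff, so the following is a sketch under that assumption, using a hypothetical `resolve_port` helper:

```python
import os
from typing import Mapping, Optional

DEFAULT_PORT = 7860  # matches EXPOSE and CMD in the Dockerfile


def resolve_port(env: Optional[Mapping[str, str]] = None) -> int:
    """Read PORT from the environment, falling back to DEFAULT_PORT.

    Hypothetical helper; the commit's entry point may not do this.
    """
    env = os.environ if env is None else env
    try:
        port = int(env.get("PORT", ""))
    except ValueError:
        # Missing or non-numeric value.
        return DEFAULT_PORT
    # Reject values outside the valid TCP port range.
    return port if 0 < port < 65536 else DEFAULT_PORT
```

If an entry point used such a helper, the published port in `ports:` and the `PORT` value would need to agree for the service to be reachable.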

extractors/__init__.py ADDED

	@@ -0,0 +1 @@


+# Extractors package

extractors/docx_extractor.py ADDED

	@@ -0,0 +1,95 @@

+"""
+DOCX text extraction using python-docx.
+Extracts text preserving paragraph structure, tables, and document properties.
+"""
+import time
+import os
+from docx import Document
+from models.schemas import ExtractionResult, DocumentMetadata
+def extract_docx(file_path: str) -> ExtractionResult:
+    """Extract text and metadata from a DOCX file."""
+    start_time = time.time()
+    try:
+        doc = Document(file_path)
+        # Extract paragraphs
+        paragraphs = []
+        for para in doc.paragraphs:
+            text = para.text.strip()
+            if text:
+                # Preserve heading structure
+                if para.style and para.style.name.startswith("Heading"):
+                    level = para.style.name.replace("Heading ", "").strip()
+                    prefix = "#" * int(level) if level.isdigit() else "##"
+                    paragraphs.append(f"{prefix} {text}")
+                else:
+                    paragraphs.append(text)
+        # Extract tables
+        tables_text = []
+        for table_idx, table in enumerate(doc.tables):
+            table_data = []
+            for row in table.rows:
+                row_data = [cell.text.strip() for cell in row.cells]
+                table_data.append(" | ".join(row_data))
+            if table_data:
+                tables_text.append(f"\n[Table {table_idx + 1}]\n" + "\n".join(table_data))
+        # Combine all text
+        full_text = "\n\n".join(paragraphs)
+        if tables_text:
+            full_text += "\n\n" + "\n".join(tables_text)
+        # Extract metadata from core properties
+        props = doc.core_properties
+        metadata = DocumentMetadata(
+            title=props.title or os.path.basename(file_path),
+            author=props.author or "Unknown",
+            creation_date=str(props.created) if props.created else "",
+            modification_date=str(props.modified) if props.modified else "",
+            page_count=None,  # DOCX doesn't expose page count easily
+            word_count=len(full_text.split()) if full_text else 0,
+            character_count=len(full_text),
+            file_type="DOCX",
+            extra={
+                "category": props.category or "",
+                "comments": props.comments or "",
+                "last_modified_by": props.last_modified_by or "",
+                "revision": props.revision,
+                "subject": props.subject or "",
+                "keywords": props.keywords or "",
+                "paragraph_count": len(doc.paragraphs),
+                "table_count": len(doc.tables),
+            }
+        )
+        elapsed = (time.time() - start_time) * 1000
+        if not full_text.strip():
+            return ExtractionResult(
+                raw_text="",
+                metadata=metadata,
+                success=False,
+                error_message="No text content found in the DOCX file.",
+                extraction_time_ms=elapsed,
+            )
+        return ExtractionResult(
+            raw_text=full_text,
+            metadata=metadata,
+            success=True,
+            extraction_time_ms=elapsed,
+        )
+    except Exception as e:
+        elapsed = (time.time() - start_time) * 1000
+        return ExtractionResult(
+            raw_text="",
+            metadata=DocumentMetadata(file_type="DOCX"),
+            success=False,
+            error_message=f"DOCX extraction failed: {str(e)}",
+            extraction_time_ms=elapsed,
+        )
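
The heading handling in `extract_docx` maps Word style names to Markdown-style prefixes. The following is a minimal sketch of that mapping without python-docx; `heading_prefix` and `format_paragraph` are hypothetical names used for illustration.

```python
def heading_prefix(style_name: str) -> str:
    """Return a '#' prefix for 'Heading N' styles, or '' for other styles."""
    if not style_name.startswith("Heading"):
        return ""
    level = style_name.replace("Heading ", "").strip()
    # Styles without a numeric level (e.g. plain "Heading") default to "##".
    return "#" * int(level) if level.isdigit() else "##"


def format_paragraph(text: str, style_name: str) -> str:
    """Prefix heading paragraphs; leave body text unchanged."""
    prefix = heading_prefix(style_name)
    return f"{prefix} {text}" if prefix else text
```

The mapping relies on the English built-in style names, so documents that use localised or custom heading styles would pass through as plain paragraphs.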

extractors/ocr_extractor.py ADDED

	@@ -0,0 +1,245 @@

+"""
+Image OCR extraction using EasyOCR (primary) and Tesseract (fallback).
+Includes advanced image preprocessing for maximum accuracy.
+"""
+import time
+import os
+import numpy as np
+from PIL import Image, ImageEnhance, ImageFilter, ImageOps
+from models.schemas import ExtractionResult, DocumentMetadata
+import config
+# --- OCR Engine Detection ---
+try:
+    import easyocr
+    EASYOCR_AVAILABLE = True
+except ImportError:
+    EASYOCR_AVAILABLE = False
+try:
+    import pytesseract
+    TESSERACT_AVAILABLE = True
+except ImportError:
+    TESSERACT_AVAILABLE = False
+# Global reader instance for EasyOCR (lazy loaded)
+_EASY_READER = None
+def get_easyocr_reader():
+    """Get or create the EasyOCR reader instance."""
+    global _EASY_READER
+    if _EASY_READER is None and EASYOCR_AVAILABLE:
+        try:
+            # Initialize with configured languages and GPU setting
+            _EASY_READER = easyocr.Reader(config.EASYOCR_LANGS, gpu=config.EASYOCR_GPU)
+        except Exception as e:
+            print(f"Error initializing EasyOCR: {e}")
+            return None
+    return _EASY_READER
+def _configure_tesseract():
+    """Configure tesseract path from config."""
+    if config.TESSERACT_CMD and TESSERACT_AVAILABLE:
+        pytesseract.pytesseract.tesseract_cmd = config.TESSERACT_CMD
+        return True
+    elif TESSERACT_AVAILABLE:
+        try:
+            pytesseract.get_tesseract_version()
+            return True
+        except Exception:
+            return False
+    return False
+def _preprocess_image(image: Image.Image) -> Image.Image:
+    """Preprocess image for maximum OCR accuracy."""
+    # 1. Convert to grayscale
+    if image.mode != "L":
+        image = image.convert("L")
+    # 2. Dynamic Contrast / Lighting correction
+    image = ImageOps.autocontrast(image)
+    # 3. Resize to optimal DPI (approx 300)
+    width, height = image.size
+    if width < 1500 or height < 1500:
+        scale = max(1800 / width, 1800 / height, 2.0)
+        new_size = (int(width * scale), int(height * scale))
+        image = image.resize(new_size, Image.Resampling.LANCZOS)
+    # 4. Sharpening (Unsharp Mask equivalent)
+    image = image.filter(ImageFilter.SHARPEN)
+    enhancer = ImageEnhance.Contrast(image)
+    image = enhancer.enhance(1.8)
+    # 5. Denoising
+    image = image.filter(ImageFilter.MedianFilter(size=3))
+    return image
+def _reconstruct_from_boxes(results: list) -> str:
+    """ Reconstruct text layout from bounding boxes.
+        Sort by top, then group by 'lines' based on y-coordinate.
+    """
+    if not results:
+        return ""
+    # Sort results by top y-coordinate
+    results.sort(key=lambda x: x[0][0][1])
+    lines = []
+    if results:
+        current_line = [results[0]]
+        for i in range(1, len(results)):
+            # If the current block's mid-y is within the previous block's height range
+            prev_box = results[i-1][0]
+            curr_box = results[i][0]
+            prev_y_center = (prev_box[0][1] + prev_box[2][1]) / 2
+            curr_y_center = (curr_box[0][1] + curr_box[2][1]) / 2
+            # Threshold for 'same line' is half of the previous box's height
+            height = prev_box[2][1] - prev_box[0][1]
+            if abs(curr_y_center - prev_y_center) < (height * 0.5):
+                current_line.append(results[i])
+            else:
+                lines.append(current_line)
+                current_line = [results[i]]
+        lines.append(current_line)
+    final_text = []
+    for line in lines:
+        # Sort each line by left x-coordinate
+        line.sort(key=lambda x: x[0][0][0])
+        line_text = []
+        for i, res in enumerate(line):
+            # Add relative spacing based on horizontal gap
+            if i > 0:
+                prev_right = line[i-1][0][1][0]
+                curr_left = res[0][0][0]
+                gap = curr_left - prev_right
+                # If gap is significant, add spaces
+                char_width = (res[0][1][0] - res[0][0][0]) / (len(res[1]) or 1)
+                num_spaces = int(gap / (char_width * 1.5))
+                line_text.append(" " * max(1, num_spaces))
+            line_text.append(res[1])
+        final_text.append("".join(line_text))
+    return "\n".join(final_text)
+def extract_image(file_path: str) -> ExtractionResult:
+    """Extract text from an image using the best available OCR engine."""
+    start_time = time.time()
+    # 1. Check for EasyOCR (Preferred)
+    if EASYOCR_AVAILABLE:
+        try:
+            reader = get_easyocr_reader()
+            if reader:
+                # Record the image dimensions up front; the metadata below reports them
+                with Image.open(file_path) as img:
+                    original_size = img.size
+                # EasyOCR reads the original file directly (no preprocessing is applied);
+                # thresholds are adjusted for better numeric and tabular capture
+                results = reader.readtext(
+                    file_path,
+                    detail=1,
+                    paragraph=False, # We want individual boxes for layout reconstruction
+                    width_ths=0.7,   # Better for long numbers/strings
+                    height_ths=0.7,
+                    contrast_ths=0.3
+                )
+                # Reconstruct full layout from bounding boxes
+                text = _reconstruct_from_boxes(results)
+                if text.strip():
+                    elapsed = (time.time() - start_time) * 1000
+                    metadata = DocumentMetadata(
+                        title=os.path.basename(file_path),
+                        page_count=1,
+                        word_count=len(text.split()),
+                        character_count=len(text),
+                        file_type="Image (EasyOCR)",
+                        extra={
+                            "image_width": original_size[0],
+                            "image_height": original_size[1],
+                            "ocr_engine": "EasyOCR",
+                            "accuracy": "High (Deep Learning)"
+                        }
+                    )
+                    return ExtractionResult(
+                        raw_text=text.strip(),
+                        metadata=metadata,
+                        success=True,
+                        extraction_time_ms=elapsed
+                    )
+        except Exception as e:
+            print(f"EasyOCR extraction failed, falling back to Tesseract: {e}")
+    # 2. Fallback to Tesseract
+    if TESSERACT_AVAILABLE and _configure_tesseract():
+        try:
+            image = Image.open(file_path)
+            original_size = image.size
+            processed_image = _preprocess_image(image)
+            custom_config = f"--oem 3 --psm 6 -l {config.TESSERACT_LANG}"
+            text = pytesseract.image_to_string(processed_image, config=custom_config)
+            # Confidence
+            try:
+                data = pytesseract.image_to_data(processed_image, config=custom_config, output_type=pytesseract.Output.DICT)
+                confidences = [int(c) for c in data["conf"] if int(c) > 0]
+                avg_confidence = sum(confidences) / len(confidences) if confidences else 0
+            except Exception:
+                avg_confidence = 0
+            elapsed = (time.time() - start_time) * 1000
+            if text.strip():
+                metadata = DocumentMetadata(
+                    title=os.path.basename(file_path),
+                    page_count=1,
+                    word_count=len(text.split()),
+                    character_count=len(text),
+                    file_type="Image (Tesseract)",
+                    extra={
+                        "image_width": original_size[0],
+                        "image_height": original_size[1],
+                        "ocr_confidence": round(avg_confidence, 2),
+                        "ocr_engine": "Tesseract"
+                    }
+                )
+                return ExtractionResult(
+                    raw_text=text.strip(),
+                    metadata=metadata,
+                    success=True,
+                    extraction_time_ms=elapsed
+                )
+        except Exception as e:
+            print(f"Tesseract extraction failed: {e}")
+    # 3. Failure cases
+    elapsed = (time.time() - start_time) * 1000
+    if not EASYOCR_AVAILABLE and not TESSERACT_AVAILABLE:
+        error_msg = "No OCR libraries installed. Please run 'pip install easyocr'."
+    elif not EASYOCR_AVAILABLE and TESSERACT_AVAILABLE:
+        error_msg = "EasyOCR is not installed, and Tesseract binary was not found or failed. Please run 'pip install easyocr' for best results."
+    elif EASYOCR_AVAILABLE and not TESSERACT_AVAILABLE:
+        error_msg = "EasyOCR failed to extract text, and Tesseract is not installed."
+    else:
+        error_msg = "OCR extraction failed. Both EasyOCR and Tesseract engines were unable to extract text from this image."
+    return ExtractionResult(
+        raw_text="",
+        metadata=DocumentMetadata(file_type="Image (OCR)"),
+        success=False,
+        error_message=error_msg,
+        extraction_time_ms=elapsed,
+    )
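
The extractor above tries EasyOCR first, falls back to Tesseract, and only then reports failure. A minimal sketch of that fallback chain follows, assuming each engine can be wrapped as a callable that takes a file path and returns text; the names `first_successful`, `broken_engine`, and `working_engine` are hypothetical and are not part of the repository.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical engine signature: takes a file path, returns the recognized text.
Engine = Callable[[str], str]


def first_successful(engines: List[Tuple[str, Engine]], path: str) -> Tuple[Optional[str], str]:
    """Try each engine in order and return (engine_name, text) for the first non-empty result."""
    for name, run in engines:
        try:
            text = run(path)
        except Exception as exc:
            # A failing engine is logged and skipped, so the next one gets a chance.
            print(f"{name} failed: {exc}")
            continue
        if text.strip():
            return name, text.strip()
    # Every engine either raised or returned only whitespace.
    return None, ""


def broken_engine(path: str) -> str:
    """Stand-in for an engine that is installed but fails at runtime."""
    raise RuntimeError("model could not be loaded")


def working_engine(path: str) -> str:
    """Stand-in for an engine that succeeds."""
    return "  Invoice 42  "


if __name__ == "__main__":
    print(first_successful([("primary", broken_engine), ("fallback", working_engine)], "scan.png"))
```

Keeping the engines in an ordered list makes the preference explicit, and the `(None, "")` result corresponds to the "Failure cases" branch where an error message is built.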

extractors/pdf_extractor.py ADDED

	@@ -0,0 +1,79 @@

+"""
+PDF text extraction using PyMuPDF (fitz).
+Extracts text with layout preservation and document metadata.
+"""
+import fitz  # PyMuPDF
+import time
+import os
+from models.schemas import ExtractionResult, DocumentMetadata
+def extract_pdf(file_path: str) -> ExtractionResult:
+    """Extract text and metadata from a PDF file."""
+    start_time = time.time()
+    try:
+        doc = fitz.open(file_path)
+        # Extract text from all pages with full layout preservation
+        pages_text = []
+        for page_num in range(len(doc)):
+            page = doc[page_num]
+            # Note: "layout" is not among the documented Page.get_text() options, so request
+            # plain text and sort the blocks into reading order (top-left to bottom-right).
+            text = page.get_text("text", sort=True)
+            if text.strip():
+                pages_text.append(f"--- Page {page_num + 1} ---\n{text}")
+        full_text = "\n\n".join(pages_text)
+        # Extract metadata
+        meta = doc.metadata
+        metadata = DocumentMetadata(
+            title=meta.get("title", "") or os.path.basename(file_path),
+            author=meta.get("author", "") or "Unknown",
+            creation_date=meta.get("creationDate", ""),
+            modification_date=meta.get("modDate", ""),
+            page_count=len(doc),
+            word_count=len(full_text.split()) if full_text else 0,
+            character_count=len(full_text),
+            file_type="PDF",
+            extra={
+                "producer": meta.get("producer", ""),
+                "creator": meta.get("creator", ""),
+                "subject": meta.get("subject", ""),
+                "keywords": meta.get("keywords", ""),
+                "format": meta.get("format", ""),
+                "encryption": doc.is_encrypted,
+            }
+        )
+        doc.close()
+        elapsed = (time.time() - start_time) * 1000
+        if not full_text.strip():
+            return ExtractionResult(
+                raw_text="",
+                metadata=metadata,
+                success=False,
+                error_message="No extractable text found in PDF. The document may contain only images — try uploading as an image for OCR processing.",
+                extraction_time_ms=elapsed,
+            )
+        return ExtractionResult(
+            raw_text=full_text,
+            metadata=metadata,
+            success=True,
+            extraction_time_ms=elapsed,
+        )
+    except Exception as e:
+        elapsed = (time.time() - start_time) * 1000
+        return ExtractionResult(
+            raw_text="",
+            metadata=DocumentMetadata(file_type="PDF"),
+            success=False,
+            error_message=f"PDF extraction failed: {str(e)}",
+            extraction_time_ms=elapsed,
+        )
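
The page loop above skips pages with no text but still numbers the remaining pages by their original position. A small sketch of that joining step in isolation, assuming the per-page strings have already been extracted; `join_pages` is a hypothetical helper name used only for illustration.

```python
from typing import List


def join_pages(pages: List[str]) -> str:
    """Join non-empty page texts, labelling each with its 1-based page number."""
    labelled = [
        f"--- Page {number} ---\n{text}"
        for number, text in enumerate(pages, start=1)
        if text.strip()  # blank pages are dropped but still consume a page number
    ]
    return "\n\n".join(labelled)


if __name__ == "__main__":
    print(join_pages(["first page", "   ", "third page"]))
```

Because the numbering comes from `enumerate` over all pages, a blank second page leaves a visible gap (Page 1, then Page 3), which matches the original document's pagination.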

extractors/url_extractor.py ADDED

	@@ -0,0 +1,108 @@

+"""
+Web content extraction from URLs using requests and BeautifulSoup4.
+Extracts title and main text content from HTML pages.
+"""
+import time
+import requests
+from bs4 import BeautifulSoup
+from models.schemas import ExtractionResult, DocumentMetadata
+def extract_url(url: str) -> ExtractionResult:
+    """Fetch and extract text content from a web URL."""
+    start_time = time.time()
+    try:
+        # 1. Fetch content
+        headers = {
+            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
+        }
+        response = requests.get(url, headers=headers, timeout=10)
+        response.raise_for_status()
+        # 2. Parse HTML
+        soup = BeautifulSoup(response.text, 'html.parser')
+        # 3. Remove scripts, styles, and page chrome (nav, footer, header, aside)
+        for script_or_style in soup(["script", "style", "nav", "footer", "header", "aside"]):
+            script_or_style.decompose()
+        # 4. Get text
+        # Try to find the title
+        title = soup.title.string.strip() if soup.title and soup.title.string else url
+        # Get main text - simple heuristic: look for <article> or just <body>
+        content_area = soup.find('article') or soup.body
+        if not content_area:
+            content_area = soup
+        # Extract text while preserving some paragraph structure
+        lines = []
+        for element in content_area.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'li']):
+            text = element.get_text().strip()
+            if text:
+                if element.name.startswith('h'):
+                    prefix = '#' * int(element.name[1])
+                    lines.append(f"\n{prefix} {text}\n")
+                else:
+                    lines.append(text)
+        full_text = "\n\n".join(lines)
+        if not full_text.strip():
+            # Fallback to general text extraction
+            full_text = soup.get_text(separator='\n\n', strip=True)
+        # 5. Build metadata
+        metadata = DocumentMetadata(
+            title=title,
+            author="Web Content",
+            creation_date="",
+            modification_date="",
+            page_count=None,
+            word_count=len(full_text.split()),
+            character_count=len(full_text),
+            file_type="URL",
+            extra={
+                "url": url,
+                "domain": url.split('/')[2] if '//' in url else url.split('/')[0],
+                "status_code": response.status_code,
+                "content_type": response.headers.get('Content-Type', '')
+            }
+        )
+        elapsed = (time.time() - start_time) * 1000
+        if not full_text.strip():
+            return ExtractionResult(
+                raw_text="",
+                metadata=metadata,
+                success=False,
+                error_message="Could not extract any meaningful text from the provided URL.",
+                extraction_time_ms=elapsed,
+            )
+        return ExtractionResult(
+            raw_text=full_text,
+            metadata=metadata,
+            success=True,
+            extraction_time_ms=elapsed,
+        )
+    except requests.exceptions.RequestException as e:
+        elapsed = (time.time() - start_time) * 1000
+        return ExtractionResult(
+            raw_text="",
+            metadata=DocumentMetadata(file_type="URL", extra={"url": url}),
+            success=False,
+            error_message=f"Failed to fetch URL: {str(e)}",
+            extraction_time_ms=elapsed,
+        )
+    except Exception as e:
+        elapsed = (time.time() - start_time) * 1000
+        return ExtractionResult(
+            raw_text="",
+            metadata=DocumentMetadata(file_type="URL", extra={"url": url}),
+            success=False,
+            error_message=f"Web extraction failed: {str(e)}",
+            extraction_time_ms=elapsed,
+        )
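
The `domain` metadata above is derived by splitting the URL on `/`, which keeps any query string attached when the URL has no path (for example `https://example.com?x=1`). A hedged alternative sketch using the standard library's `urllib.parse.urlsplit`; the helper name `domain_of` is hypothetical and the extractor itself does not use it.

```python
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Return the network location of a URL, tolerating input without a scheme."""
    parts = urlsplit(url)
    if parts.netloc:
        return parts.netloc
    # Without a scheme, urlsplit puts everything into `path`; take its first segment.
    return parts.path.split("/")[0]


if __name__ == "__main__":
    for candidate in ("https://example.com/a/b", "https://example.com?x=1", "example.com/path"):
        print(candidate, "->", domain_of(candidate))
```

Like the original expression, this keeps a port if one is present (`localhost:8000`); `urlsplit(url).hostname` would be the choice if the port should be dropped.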

main.py ADDED

	@@ -0,0 +1,307 @@

+"""
+Intelligent Document Processing System
+FastAPI backend with async document processing.
+"""
+import os
+import uuid
+import time
+import asyncio
+from typing import Dict
+from fastapi import FastAPI, UploadFile, File, HTTPException
+from fastapi.staticfiles import StaticFiles
+from fastapi.responses import FileResponse, JSONResponse
+from fastapi.middleware.cors import CORSMiddleware
+from config import UPLOAD_DIR, STATIC_DIR, MAX_FILE_SIZE_BYTES, ALLOWED_EXTENSIONS
+from models.schemas import (
+    UploadResponse, ProcessingResult, TaskStatus,
+    ExtractionResult, DocumentMetadata,
+)
+from extractors.pdf_extractor import extract_pdf
+from extractors.docx_extractor import extract_docx
+from extractors.ocr_extractor import extract_image
+from extractors.url_extractor import extract_url
+from analyzers.summarizer import summarize_text
+from analyzers.ner_extractor import extract_entities
+from analyzers.sentiment import analyze_sentiment
+# --- App Setup ---
+app = FastAPI(
+    title="Alldocex - Intelligent Document Processing",
+    description="Extract, analyze, and summarize content from PDF, DOCX, and image files and web URLs using AI.",
+    version="1.0.0",
+)
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_credentials=True,
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+# In-memory task store
+tasks: Dict[str, ProcessingResult] = {}
+# --- Utility Functions ---
+def _human_readable_size(size_bytes: int) -> str:
+    """Convert bytes to human readable string."""
+    for unit in ["B", "KB", "MB", "GB"]:
+        if size_bytes < 1024:
+            return f"{size_bytes:.1f} {unit}"
+        size_bytes /= 1024
+    return f"{size_bytes:.1f} TB"
+def _get_file_type(filename: str) -> str:
+    """Determine file type from extension."""
+    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
+    if ext == "pdf":
+        return "pdf"
+    elif ext == "docx":
+        return "docx"
+    elif ext in ("png", "jpg", "jpeg", "tiff", "bmp", "webp"):
+        return "image"
+    return "unknown"
+def _process_document(file_path: str, file_type: str, task_id: str):
+    """
+    Process a document: extract text, then run all analyzers.
+    This runs in a thread pool to avoid blocking the event loop.
+    """
+    start_time = time.time()
+    task = tasks[task_id]
+    task.status = TaskStatus.PROCESSING
+    try:
+        # Step 1: Extract text based on file type
+        if file_type == "pdf":
+            extraction = extract_pdf(file_path)
+        elif file_type == "docx":
+            extraction = extract_docx(file_path)
+        elif file_type == "image":
+            extraction = extract_image(file_path)
+        elif file_type == "url":
+            # file_path is the URL string here
+            extraction = extract_url(file_path)
+        else:
+            raise ValueError(f"Unsupported file type: {file_type}")
+        task.extraction = extraction
+        if not extraction.success or not extraction.raw_text.strip():
+            task.status = TaskStatus.COMPLETED
+            task.error_message = extraction.error_message or "No text could be extracted."
+            task.processing_time_ms = (time.time() - start_time) * 1000
+            return
+        raw_text = extraction.raw_text
+        # Step 2: Summarization
+        try:
+            task.summary = summarize_text(raw_text)
+        except Exception as e:
+            print(f"Summarization error: {e}")
+        # Step 3: Named Entity Recognition
+        try:
+            task.entities = extract_entities(raw_text)
+        except Exception as e:
+            print(f"NER error: {e}")
+        # Step 4: Sentiment Analysis
+        try:
+            task.sentiment = analyze_sentiment(raw_text)
+        except Exception as e:
+            print(f"Sentiment error: {e}")
+        task.status = TaskStatus.COMPLETED
+        task.processing_time_ms = (time.time() - start_time) * 1000
+    except Exception as e:
+        task.status = TaskStatus.ERROR
+        task.error_message = str(e)
+        task.processing_time_ms = (time.time() - start_time) * 1000
+    finally:
+        # Clean up uploaded file
+        try:
+            if os.path.exists(file_path):
+                os.remove(file_path)
+        except Exception:
+            pass
+# --- API Endpoints ---
+@app.post("/api/upload", response_model=ProcessingResult)
+async def upload_and_process(file: UploadFile = File(...)):
+    """
+    Upload a document and start processing.
+    Supports PDF, DOCX, PNG, JPG, JPEG, TIFF, BMP, WEBP.
+    """
+    # Validate file extension
+    filename = file.filename or "unknown"
+    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
+    if ext not in ALLOWED_EXTENSIONS:
+        raise HTTPException(
+            status_code=400,
+            detail=f"Unsupported file type: .{ext}. Supported: {', '.join(ALLOWED_EXTENSIONS.keys())}"
+        )
+    # Read file content
+    content = await file.read()
+    file_size = len(content)
+    # Validate file size
+    if file_size > MAX_FILE_SIZE_BYTES:
+        raise HTTPException(
+            status_code=400,
+            detail=f"File too large. Maximum size: {MAX_FILE_SIZE_BYTES // (1024*1024)}MB"
+        )
+    if file_size == 0:
+        raise HTTPException(status_code=400, detail="Empty file uploaded.")
+    # Save file
+    file_id = str(uuid.uuid4())[:8]
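+    # Keep only the final path component so a crafted filename cannot escape UPLOAD_DIR
+    safe_filename = f"{file_id}_{os.path.basename(filename)}"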
+    file_path = os.path.join(UPLOAD_DIR, safe_filename)
+    with open(file_path, "wb") as f:
+        f.write(content)
+    # Determine file type
+    file_type = _get_file_type(filename)
+    # Create task
+    task = ProcessingResult.create_pending(
+        file_id=file_id,
+        filename=filename,
+        file_type=file_type,
+    )
+    tasks[file_id] = task
+    # Start processing in a worker thread so the event loop stays responsive
+    asyncio.get_running_loop().run_in_executor(
+        None, _process_document, file_path, file_type, file_id
+    )
+    return task
+@app.post("/api/extract/url", response_model=ProcessingResult)
+async def extract_from_url(data: Dict[str, str]):
+    """
+    Extract content from a web URL and process it.
+    """
+    url = data.get("url")
+    if not url:
+        raise HTTPException(status_code=400, detail="URL is required.")
+    if not url.startswith(("http://", "https://")):
+        raise HTTPException(status_code=400, detail="Invalid URL format. Must start with http:// or https://")
+    # Create task
+    file_id = str(uuid.uuid4())[:8]
+    # For URLs, we use the domain as the "filename"
+    filename = url.split('/')[2] if '//' in url else url.split('/')[0]
+    task = ProcessingResult.create_pending(
+        file_id=file_id,
+        filename=filename,
+        file_type="url",
+    )
+    tasks[file_id] = task
+    # Start processing in a worker thread so the event loop stays responsive
+    asyncio.get_running_loop().run_in_executor(
+        None, _process_document, url, "url", file_id
+    )
+    return task
+@app.get("/api/status/{task_id}")
+async def get_task_status(task_id: str):
+    """Get the processing status and results for a task."""
+    if task_id not in tasks:
+        raise HTTPException(status_code=404, detail="Task not found.")
+    return tasks[task_id]
+@app.get("/api/download/{task_id}")
+async def download_results(task_id: str):
+    """Download the extracted text as a .txt file."""
+    if task_id not in tasks:
+        raise HTTPException(status_code=404, detail="Task not found.")
+    task = tasks[task_id]
+    if not task.extraction or not task.extraction.raw_text:
+        raise HTTPException(status_code=400, detail="No text available for download.")
+    # Create temporary file path
+    filename = f"extracted_{task.filename}.txt"
+    temp_path = os.path.join(UPLOAD_DIR, filename)
+    try:
+        with open(temp_path, "w", encoding="utf-8") as f:
+            f.write(task.extraction.raw_text)
+        return FileResponse(
+            temp_path,
+            filename=filename,
+            media_type="text/plain",
+            background=None # Note: ideally we'd use BackgroundTask to delete this file later
+        )
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=f"Failed to generate download: {str(e)}")
+@app.get("/api/health")
+async def health_check():
+    """Health check endpoint."""
+    from config import check_ocr_availability
+    # Check OCR status
+    ocr_status = check_ocr_availability()
+    # Check spaCy
+    try:
+        import spacy
+        spacy.load("en_core_web_sm")
+        spacy_status = "available"
+    except Exception:
+        spacy_status = "not available"
+    return {
+        "status": "healthy",
+        "ocr": ocr_status,
+        "tesseract": "available" if ocr_status in ("available", "tesseract-only") else "not found",
+        "spacy": spacy_status,
+        "version": app.version,
+    }
+# --- Static Files ---
+# Serve the main page
+@app.get("/")
+async def serve_index():
+    index_path = os.path.join(STATIC_DIR, "index.html")
+    if os.path.exists(index_path):
+        return FileResponse(index_path)
+    return JSONResponse({"message": "Alldocex API is running. Frontend not found."})
+# Mount static files
+app.mount("/static", StaticFiles(directory=STATIC_DIR), name="static")
+if __name__ == "__main__":
+    import uvicorn
+    print("\n🚀 Alldocex - Intelligent Document Processing System")
+    print("📄 Open http://localhost:7860 in your browser\n")
+    uvicorn.run(app, host="0.0.0.0", port=7860)
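
The upload and URL endpoints above return a pending task immediately and push the blocking work onto a thread pool, while the client polls `/api/status/{task_id}`. A minimal, framework-free sketch of that pattern, assuming a plain dict as the task store; `task_status`, `blocking_job`, and `submit` are hypothetical stand-ins for `tasks`, `_process_document`, and the endpoint bodies.

```python
import asyncio
import time
from typing import Dict

# Hypothetical in-memory store, standing in for the `tasks` dict in main.py.
task_status: Dict[str, str] = {}


def blocking_job(task_id: str) -> None:
    """Simulate slow, blocking work (for example OCR) running in a worker thread."""
    task_status[task_id] = "processing"
    time.sleep(0.05)
    task_status[task_id] = "completed"


async def submit(task_id: str) -> "asyncio.Future[None]":
    """Register the task and hand the blocking job to the default thread pool."""
    task_status[task_id] = "pending"
    loop = asyncio.get_running_loop()
    return loop.run_in_executor(None, blocking_job, task_id)


async def main() -> str:
    future = await submit("demo")
    # The event loop is free to serve other requests here; a client would poll the store.
    await future
    return task_status["demo"]


if __name__ == "__main__":
    print(asyncio.run(main()))
```

An in-memory dict keeps the sketch short, but it is lost on restart and is not shared between worker processes, which is worth keeping in mind when running more than one server process.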

models/__init__.py ADDED

	@@ -0,0 +1 @@


+# Models package

models/schemas.py ADDED

	@@ -0,0 +1,116 @@

+"""
+Pydantic models for request/response schemas.
+"""
+from pydantic import BaseModel
+from typing import Optional, List, Dict, Any
+from enum import Enum
+import time
+import uuid
+class TaskStatus(str, Enum):
+    PENDING = "pending"
+    PROCESSING = "processing"
+    COMPLETED = "completed"
+    ERROR = "error"
+class FileType(str, Enum):
+    PDF = "pdf"
+    DOCX = "docx"
+    IMAGE = "image"
+class UploadResponse(BaseModel):
+    file_id: str
+    filename: str
+    file_type: str
+    size_bytes: int
+    size_human: str
+    message: str
+class DocumentMetadata(BaseModel):
+    title: Optional[str] = None
+    author: Optional[str] = None
+    creation_date: Optional[str] = None
+    modification_date: Optional[str] = None
+    page_count: Optional[int] = None
+    word_count: int = 0
+    character_count: int = 0
+    file_type: str = ""
+    extra: Dict[str, Any] = {}
+class ExtractionResult(BaseModel):
+    raw_text: str
+    metadata: DocumentMetadata
+    success: bool = True
+    error_message: Optional[str] = None
+    extraction_time_ms: float = 0
+class SummaryResult(BaseModel):
+    summary: str
+    original_length: int
+    summary_length: int
+    compression_ratio: float
+    sentence_count: int
+    algorithm: str
+class Entity(BaseModel):
+    text: str
+    label: str
+    label_description: str
+    count: int = 1
+    positions: List[int] = []
+class EntityResult(BaseModel):
+    entities: List[Entity]
+    entity_counts: Dict[str, int]
+    total_entities: int
+class SentimentBreakdown(BaseModel):
+    text: str
+    compound: float
+    positive: float
+    negative: float
+    neutral: float
+    label: str
+class SentimentResult(BaseModel):
+    overall_compound: float
+    overall_positive: float
+    overall_negative: float
+    overall_neutral: float
+    overall_label: str
+    sentence_breakdown: List[SentimentBreakdown]
+    confidence: float
+class ProcessingResult(BaseModel):
+    file_id: str
+    filename: str
+    file_type: str
+    status: TaskStatus
+    extraction: Optional[ExtractionResult] = None
+    summary: Optional[SummaryResult] = None
+    entities: Optional[EntityResult] = None
+    sentiment: Optional[SentimentResult] = None
+    processing_time_ms: float = 0
+    error_message: Optional[str] = None
+    timestamp: float = 0
+    @staticmethod
+    def create_pending(file_id: str, filename: str, file_type: str) -> "ProcessingResult":
+        return ProcessingResult(
+            file_id=file_id,
+            filename=filename,
+            file_type=file_type,
+            status=TaskStatus.PENDING,
+            timestamp=time.time(),
+        )
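
`TaskStatus` above mixes `str` into `Enum`, which is what lets the frontend compare `data.status` against plain strings such as `'completed'`. A short standard-library sketch of that behaviour, re-declaring the enum so the example is self-contained; Pydantic itself is not used here.

```python
import json
from enum import Enum


class TaskStatus(str, Enum):
    """Same members as the schema above, re-declared for a standalone example."""
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    ERROR = "error"


if __name__ == "__main__":
    # Members compare equal to their string values ...
    print(TaskStatus.PENDING == "pending")
    # ... serialize as plain strings ...
    print(json.dumps({"status": TaskStatus.COMPLETED}))
    # ... and can be rebuilt from a string received over the wire.
    print(TaskStatus("error") is TaskStatus.ERROR)
```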

requirements.txt ADDED

	@@ -0,0 +1,16 @@

+fastapi==0.115.0
+uvicorn[standard]==0.30.0
+python-multipart==0.0.9
+PyMuPDF==1.24.0
+python-docx==1.1.0
+easyocr==1.7.1
+torch
+torchvision
+pytesseract==0.3.13
+Pillow==10.4.0
+spacy==3.7.5
+sumy==0.11.0
+nltk==3.9.0
+aiofiles==24.1.0
+requests==2.32.3
+beautifulsoup4==4.12.3

static/app.js ADDED

	@@ -0,0 +1,586 @@

+/**
+ * Alldocex - Intelligent Document Processing
+ * Frontend application logic
+ */
+// ===== State =====
+let currentTaskId = null;
+let pollInterval = null;
+// ===== DOM Elements =====
+const $ = (sel) => document.querySelector(sel);
+const $$ = (sel) => document.querySelectorAll(sel);
+const dropZone = $('#dropZone');
+const fileInput = $('#fileInput');
+const uploadSection = $('#uploadSection');
+const processingSection = $('#processingSection');
+const resultsSection = $('#resultsSection');
+const toastContainer = $('#toastContainer');
+const btnExtractUrl = $('#btnExtractUrl');
+const urlInput = $('#urlInput');
+// ===== Init =====
+document.addEventListener('DOMContentLoaded', () => {
+    initUpload();
+    initTabs();
+    initButtons();
+});
+// ===== Health Check =====
+// ===== Upload =====
+function initUpload() {
+    // Click to upload
+    dropZone.addEventListener('click', () => fileInput.click());
+    // File selected
+    fileInput.addEventListener('change', (e) => {
+        if (e.target.files.length > 0) {
+            handleFile(e.target.files[0]);
+        }
+    });
+    // URL input
+    btnExtractUrl.addEventListener('click', () => {
+        const url = urlInput.value.trim();
+        if (url) {
+            handleUrl(url);
+        } else {
+            showToast('Please enter a valid URL', 'error');
+        }
+    });
+    // Drag and drop
+    dropZone.addEventListener('dragover', (e) => {
+        e.preventDefault();
+        dropZone.classList.add('drag-over');
+    });
+    dropZone.addEventListener('dragleave', (e) => {
+        e.preventDefault();
+        dropZone.classList.remove('drag-over');
+    });
+    dropZone.addEventListener('drop', (e) => {
+        e.preventDefault();
+        dropZone.classList.remove('drag-over');
+        if (e.dataTransfer.files.length > 0) {
+            handleFile(e.dataTransfer.files[0]);
+        }
+    });
+    // Format badge filters
+    $$('.format-badge').forEach(badge => {
+        badge.addEventListener('click', (e) => {
+            e.stopPropagation(); // Don't trigger the main dropZone click
+            const format = badge.textContent.trim().toLowerCase();
+            openFilteredPicker(format);
+        });
+    });
+}
+function openFilteredPicker(format) {
+    const defaultAccept = fileInput.accept;
+    // Map of extensions
+    const extMap = {
+        pdf: '.pdf',
+        docx: '.docx',
+        png: '.png',
+        jpg: '.jpg,.jpeg',
+        jpeg: '.jpg,.jpeg',
+        tiff: '.tiff',
+        bmp: '.bmp',
+        webp: '.webp'
+    };
+    if (extMap[format]) {
+        fileInput.accept = extMap[format];
+    }
+    fileInput.click();
+    // Reset accept after a short delay so the main zone works normally
+    setTimeout(() => {
+        fileInput.accept = defaultAccept;
+    }, 500);
+}
+async function handleFile(file) {
+    // Validate extension
+    const validExts = ['pdf', 'docx', 'png', 'jpg', 'jpeg', 'tiff', 'bmp', 'webp'];
+    const ext = file.name.split('.').pop().toLowerCase();
+    if (!validExts.includes(ext)) {
+        showToast(`Unsupported file type: .${ext}`, 'error');
+        return;
+    }
+    // Validate size (20MB)
+    if (file.size > 20 * 1024 * 1024) {
+        showToast('File too large. Maximum size: 20MB', 'error');
+        return;
+    }
+    // Show processing UI
+    showSection('processing');
+    resetProcessingSteps();
+    // Upload
+    const formData = new FormData();
+    formData.append('file', file);
+    try {
+        const res = await fetch('/api/upload', {
+            method: 'POST',
+            body: formData,
+        });
+        if (!res.ok) {
+            const err = await res.json();
+            throw new Error(err.detail || 'Upload failed');
+        }
+        const data = await res.json();
+        currentTaskId = data.file_id;
+        // Start polling for results
+        updateStep('stepExtract', 'active');
+        startPolling(data.file_id);
+    } catch (e) {
+        showToast(e.message || 'Upload failed', 'error');
+        showSection('upload');
+    }
+}
+async function handleUrl(url) {
+    if (!/^https?:\/\//i.test(url)) {
+        showToast('URL must start with http:// or https://', 'error');
+        return;
+    }
+    try {
+        resetAll();
+        showSection('processing');
+        updateStep('stepExtract', 'active');
+        const response = await fetch('/api/extract/url', {
+            method: 'POST',
+            headers: { 'Content-Type': 'application/json' },
+            body: JSON.stringify({ url: url })
+        });
+        if (!response.ok) {
+            const error = await response.json();
+            throw new Error(error.detail || 'Failed to start URL extraction');
+        }
+        const data = await response.json();
+        currentTaskId = data.file_id;
+        // Polling results
+        startPolling(data.file_id);
+    } catch (error) {
+        showSection('upload');
+        showToast(error.message, 'error');
+    }
+}
+// ===== Polling =====
+function startPolling(taskId) {
+    if (pollInterval) clearInterval(pollInterval);
+    pollInterval = setInterval(async () => {
+        try {
+            const res = await fetch(`/api/status/${taskId}`);
+            if (!res.ok) throw new Error(`Status request failed: ${res.status}`);
+            const data = await res.json();
+            if (data.status === 'processing') {
+                // Update steps based on available data
+                if (data.extraction) {
+                    updateStep('stepExtract', 'done');
+                    updateStep('stepSummary', 'active');
+                }
+                if (data.summary) {
+                    updateStep('stepSummary', 'done');
+                    updateStep('stepEntities', 'active');
+                }
+                if (data.entities) {
+                    updateStep('stepEntities', 'done');
+                    updateStep('stepSentiment', 'active');
+                }
+                if (data.sentiment) {
+                    updateStep('stepSentiment', 'done');
+                }
+            }
+            if (data.status === 'completed' || data.status === 'error') {
+                clearInterval(pollInterval);
+                pollInterval = null;
+                // Mark all steps as done
+                updateStep('stepExtract', 'done');
+                updateStep('stepSummary', 'done');
+                updateStep('stepEntities', 'done');
+                updateStep('stepSentiment', 'done');
+                // Short delay to show completion
+                setTimeout(() => {
+                    if (data.status === 'error' && !data.extraction) {
+                        showToast(data.error_message || 'Processing failed', 'error');
+                        showSection('upload');
+                    } else {
+                        displayResults(data);
+                        showSection('results');
+                    }
+                }, 600);
+            }
+        } catch (e) {
+            clearInterval(pollInterval);
+            pollInterval = null;
+            showToast('Lost connection to server', 'error');
+            showSection('upload');
+        }
+    }, 800);
+}
+// ===== Display Results =====
+function displayResults(data) {
+    // File info bar
+    const typeIcons = { pdf: '📕', docx: '📘', image: '🖼️', url: '🌐' };
+    $('#fileTypeIcon').textContent = typeIcons[data.file_type] || '📄';
+    $('#fileName').textContent = data.filename;
+    const meta = data.extraction?.metadata;
+    const parts = [data.file_type.toUpperCase()];
+    if (meta?.word_count) parts.push(`${meta.word_count.toLocaleString()} words`);
+    if (meta?.page_count) parts.push(`${meta.page_count} pages`);
+    $('#fileMeta').textContent = parts.join(' • ');
+    const timeSeconds = (data.processing_time_ms / 1000).toFixed(1);
+    $('#processingTime').textContent = `⏱ ${timeSeconds}s`;
+    // Extracted Text
+    const textEl = $('#extractedText');
+    if (data.extraction?.raw_text) {
+        textEl.textContent = data.extraction.raw_text;
+    } else {
+        textEl.innerHTML = `<p class="placeholder">${escapeHtml(data.extraction?.error_message || 'No text extracted.')}</p>`;
+    }
+    // Summary
+    if (data.summary) {
+        $('#summaryContent').textContent = data.summary.summary;
+        $('#summaryStats').classList.remove('hidden');
+        $('#statOriginalLen').textContent = data.summary.original_length.toLocaleString();
+        $('#statSummaryLen').textContent = data.summary.summary_length.toLocaleString();
+        const pct = Math.round((1 - data.summary.compression_ratio) * 100);
+        $('#statCompression').textContent = `${pct}%`;
+        $('#statAlgorithm').textContent = data.summary.algorithm;
+    } else {
+        $('#summaryContent').innerHTML = '<p class="placeholder">Summarization not available.</p>';
+        $('#summaryStats').classList.add('hidden');
+    }
+    // Entities
+    displayEntities(data.entities);
+    // Sentiment
+    displaySentiment(data.sentiment);
+    // Metadata
+    displayMetadata(data.extraction?.metadata);
+    // Activate first tab
+    activateTab('extracted');
+}
+function displayEntities(entityData) {
+    const catEl = $('#entityCategories');
+    const listEl = $('#entityList');
+    const countEl = $('#entityCount');
+    if (!entityData || entityData.entities.length === 0) {
+        catEl.innerHTML = '<p class="placeholder">No entities detected in this document.</p>';
+        listEl.innerHTML = '';
+        countEl.textContent = '0 entities found';
+        return;
+    }
+    countEl.textContent = `${entityData.total_entities} entities found`;
+    // Category badges
+    const catColors = {
+        PERSON: '#ec4899', ORG: '#3b82f6', GPE: '#10b981', DATE: '#f59e0b',
+        MONEY: '#8b5cf6', EVENT: '#06b6d4', PRODUCT: '#fb923c', LAW: '#a855f7',
+        NORP: '#f472b6', EMAIL: '#06b6d4', PHONE: '#3b82f6', URL: '#10b981',
+        TIME: '#f59e0b', PERCENT: '#8b5cf6', CARDINAL: '#94a3b8',
+    };
+    catEl.innerHTML = Object.entries(entityData.entity_counts)
+        .sort((a, b) => b[1] - a[1])
+        .map(([label, count]) => `
+            <div class="entity-category-badge">
+                <span class="cat-dot" style="background: ${catColors[label] || '#94a3b8'}"></span>
+                ${escapeHtml(label)}
+                <span class="cat-count">${count}</span>
+            </div>
+        `).join('');
+    // Entity list
+    listEl.innerHTML = entityData.entities
+        .slice(0, 100)
+        .map(ent => `
+            <div class="entity-item">
+                <div class="entity-item-left">
+                    <span class="entity-type-badge badge-${escapeHtml(ent.label)}">${escapeHtml(ent.label)}</span>
+                    <span class="entity-text" title="${escapeHtml(ent.text)}">${escapeHtml(ent.text)}</span>
+                </div>
+                ${ent.count > 1 ? `<span class="entity-item-count">×${ent.count}</span>` : ''}
+            </div>
+        `).join('');
+}
+function displaySentiment(sentData) {
+    const overviewEl = $('#sentimentOverview');
+    if (!sentData) {
+        overviewEl.innerHTML = '<p class="placeholder">Sentiment analysis not available.</p>';
+        return;
+    }
+    const score = sentData.overall_compound;
+    const label = sentData.overall_label;
+    const posW = Math.round(sentData.overall_positive * 100);
+    const neuW = Math.round(sentData.overall_neutral * 100);
+    const negW = Math.round(sentData.overall_negative * 100);
+    // Label color
+    let labelColor;
+    if (score >= 0.05) labelColor = 'var(--accent-green)';
+    else if (score <= -0.05) labelColor = 'var(--accent-red)';
+    else labelColor = 'var(--text-muted)';
+    let html = `
+        <div class="sentiment-gauge-container">
+            <div class="sentiment-label-display" style="color: ${labelColor}">${escapeHtml(label)}</div>
+            <div class="sentiment-score">${score >= 0 ? '+' : ''}${score.toFixed(3)}</div>
+            <div class="sentiment-bar-container">
+                <div class="sentiment-bar">
+                    <div class="sentiment-bar-positive" style="width: ${posW}%"></div>
+                    <div class="sentiment-bar-neutral" style="width: ${neuW}%"></div>
+                    <div class="sentiment-bar-negative" style="width: ${negW}%"></div>
+                </div>
+                <div class="sentiment-bar-labels">
+                    <span><span class="dot dot-pos"></span> Positive ${posW}%</span>
+                    <span><span class="dot dot-neu"></span> Neutral ${neuW}%</span>
+                    <span><span class="dot dot-neg"></span> Negative ${negW}%</span>
+                </div>
+            </div>
+        </div>
+    `;
+    // Sentence breakdown
+    if (sentData.sentence_breakdown && sentData.sentence_breakdown.length > 0) {
+        html += `
+            <div class="sentiment-sentences">
+                <h4>Sentence-Level Breakdown (top ${Math.min(sentData.sentence_breakdown.length, 20)})</h4>
+                ${sentData.sentence_breakdown.slice(0, 20).map(s => {
+                    let cls = 'sent-neutral';
+                    if (s.compound >= 0.05) cls = 'sent-positive';
+                    else if (s.compound <= -0.05) cls = 'sent-negative';
+                    return `
+                        <div class="sentence-item">
+                            <span class="sentence-sentiment-badge ${cls}">${escapeHtml(s.label)}</span>
+                            <span class="sentence-text">${escapeHtml(s.text)}</span>
+                        </div>
+                    `;
+                }).join('')}
+            </div>
+        `;
+    }
+    overviewEl.innerHTML = html;
+}
+function displayMetadata(meta) {
+    const metaEl = $('#metadataContent');
+    if (!meta) {
+        metaEl.innerHTML = '<p class="placeholder">No metadata available.</p>';
+        return;
+    }
+    const rows = [
+        ['Title', meta.title],
+        ['Author', meta.author],
+        ['File Type', meta.file_type],
+        ['Page Count', meta.page_count],
+        ['Word Count', meta.word_count?.toLocaleString()],
+        ['Character Count', meta.character_count?.toLocaleString()],
+        ['Created', meta.creation_date],
+        ['Modified', meta.modification_date],
+    ];
+    // Add extra metadata
+    if (meta.extra) {
+        for (const [key, value] of Object.entries(meta.extra)) {
+            if (value) {
+                const label = key.replace(/_/g, ' ').replace(/\b\w/g, c => c.toUpperCase());
+                rows.push([label, String(value)]);
+            }
+        }
+    }
+    metaEl.innerHTML = `
+        <table class="metadata-table">
+            ${rows.filter(([, v]) => v && v !== 'None' && v !== 'null' && v !== '')
+              .map(([k, v]) => `<tr><td>${escapeHtml(k)}</td><td>${escapeHtml(String(v))}</td></tr>`)
+              .join('')}
+        </table>
+    `;
+}
+// ===== Tabs =====
+function initTabs() {
+    $$('.tab').forEach(tab => {
+        tab.addEventListener('click', () => {
+            activateTab(tab.dataset.tab);
+        });
+    });
+}
+function activateTab(tabName) {
+    $$('.tab').forEach(t => t.classList.remove('active'));
+    $$('.tab-panel').forEach(p => p.classList.remove('active'));
+    const tab = $(`.tab[data-tab="${tabName}"]`);
+    const panel = $(`#panel${tabName.charAt(0).toUpperCase() + tabName.slice(1)}`);
+    if (tab) tab.classList.add('active');
+    if (panel) panel.classList.add('active');
+}
+// ===== Buttons =====
+function initButtons() {
+    // New upload
+    $('#btnNewUpload').addEventListener('click', () => {
+        resetAll();
+        showSection('upload');
+    });
+    // Back to upload: currently behaves the same as "New Upload". Results are
+    // cleared rather than preserved to avoid mixing data between documents.
+    $('#btnBackToUpload').addEventListener('click', () => {
+        resetAll();
+        showSection('upload');
+    });
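+    // Cancel processing: stops polling on the client; this handler sends no
+    // cancel request to the server
+    $('#btnCancelProcessing').addEventListener('click', () => {
+        if (pollInterval) {
+            clearInterval(pollInterval);
+            pollInterval = null;
+        }
+        // Forget the abandoned task so later actions cannot refer to it
+        currentTaskId = null;
+        showSection('upload');
+        showToast('Processing cancelled', 'info');
+    });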
+    // Copy buttons
+    $('#btnCopyText').addEventListener('click', () => {
+        copyToClipboard($('#extractedText').textContent, '#btnCopyText');
+    });
+    $('#btnCopySummary').addEventListener('click', () => {
+        copyToClipboard($('#summaryContent').textContent, '#btnCopySummary');
+    });
+    // Download button
+    $('#btnDownloadText').addEventListener('click', () => {
+        if (currentTaskId) {
+            window.location.href = `/api/download/${currentTaskId}`;
+        } else {
+            showToast('No active document to download', 'error');
+        }
+    });
+}
+async function copyToClipboard(text, btnSelector) {
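+    try {
+        await navigator.clipboard.writeText(text);
+        const btn = $(btnSelector);
+        // Ignore repeat clicks while "Copied!" is showing; otherwise that
+        // temporary markup would be captured as the original and never restored
+        if (btn.classList.contains('copied')) return;
+        btn.classList.add('copied');
+        const originalHTML = btn.innerHTML;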
+        btn.innerHTML = `<svg width="16" height="16" viewBox="0 0 16 16" fill="none"><path d="M3 8l3 3 7-7" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/></svg> Copied!`;
+        setTimeout(() => {
+            btn.classList.remove('copied');
+            btn.innerHTML = originalHTML;
+        }, 2000);
+    } catch (e) {
+        showToast('Failed to copy to clipboard', 'error');
+    }
+}
+// ===== UI Helpers =====
+function showSection(sectionId) {
+    [uploadSection, processingSection, resultsSection].forEach(s => s.classList.add('hidden'));
+    if (sectionId === 'upload') {
+        uploadSection.classList.remove('hidden');
+    } else if (sectionId === 'processing') {
+        processingSection.classList.remove('hidden');
+    } else if (sectionId === 'results') {
+        resultsSection.classList.remove('hidden');
+    }
+}
+function resetProcessingSteps() {
+    ['stepExtract', 'stepSummary', 'stepEntities', 'stepSentiment'].forEach(id => {
+        const el = $(`#${id}`);
+        el.classList.remove('active', 'done');
+        el.querySelector('.step-status').textContent = '⏳';
+    });
+}
+function updateStep(stepId, state) {
+    const el = $(`#${stepId}`);
+    el.classList.remove('active', 'done');
+    el.classList.add(state);
+    el.querySelector('.step-status').textContent = state === 'done' ? '✅' : '⚡';
+}
+function resetAll() {
+    currentTaskId = null;
+    if (pollInterval) {
+        clearInterval(pollInterval);
+        pollInterval = null;
+    }
+    fileInput.value = '';
+    $('#extractedText').innerHTML = '<p class="placeholder">No text extracted yet.</p>';
+    $('#summaryContent').innerHTML = '<p class="placeholder">No summary available.</p>';
+    $('#summaryStats').classList.add('hidden');
+    $('#entityCategories').innerHTML = '<p class="placeholder">No entities detected.</p>';
+    $('#entityList').innerHTML = '';
+    $('#sentimentOverview').innerHTML = '<p class="placeholder">No sentiment data available.</p>';
+    $('#metadataContent').innerHTML = '<p class="placeholder">No metadata available.</p>';
+}
+function showToast(message, type = 'info') {
+    const icons = { info: 'ℹ️', error: '❌', success: '✅' };
+    const toast = document.createElement('div');
+    toast.className = `toast toast-${type}`;
+    toast.innerHTML = `<span class="toast-icon">${icons[type] || icons.info}</span><span>${escapeHtml(message)}</span>`;
+    toastContainer.appendChild(toast);
+    setTimeout(() => {
+        if (toast.parentNode) toast.remove();
+    }, 4000);
+}
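+function escapeHtml(text) {
+    if (!text) return '';
+    const div = document.createElement('div');
+    div.textContent = text;
+    // innerHTML escapes &, < and > in text but not quotes; encode those too so
+    // the result is also safe inside attribute values such as title="..."
+    return div.innerHTML.replace(/"/g, '&quot;').replace(/'/g, '&#39;');
+}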
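
Because `displayEntities` interpolates entity text into a `title="..."` attribute as well as into element content, the escaping helper has to encode quotes in addition to `&`, `<`, and `>`. The following is a minimal DOM-free sketch of such a helper. The names `HTML_ESCAPES` and `escapeHtmlAttr` are hypothetical and do not appear in `app.js`, and unlike `escapeHtml` this sketch keeps falsy values such as `0` instead of returning an empty string for them.

```javascript
// Hypothetical helper (not part of app.js): escapes text for use in HTML
// element content and in single- or double-quoted attribute values.
const HTML_ESCAPES = {
    '&': '&amp;',
    '<': '&lt;',
    '>': '&gt;',
    '"': '&quot;',
    "'": '&#39;',
};

function escapeHtmlAttr(text) {
    // Treat null/undefined as empty, but keep other falsy values such as 0.
    if (text === null || text === undefined) return '';
    // Replace each special character with its entity in a single pass.
    return String(text).replace(/[&<>"']/g, ch => HTML_ESCAPES[ch]);
}

console.log(escapeHtmlAttr('He said "hi" & left'));
// prints: He said &quot;hi&quot; &amp; left
```

A string-based helper like this can also run outside the browser (for example in Node-based unit tests), since it does not need `document`.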

static/index.html ADDED

	@@ -0,0 +1,268 @@

+<!DOCTYPE html>
+<html lang="en">
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>Alldocex — Intelligent Document Processing</title>
+    <meta name="description" content="Extract, analyse, and summarize content from PDF, DOCX, and image files using AI-powered document processing.">
+    <link rel="preconnect" href="https://fonts.googleapis.com">
+    <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
+    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&display=swap" rel="stylesheet">
+    <link rel="stylesheet" href="/static/styles.css">
+</head>
+<body>
+    <!-- subtle background -->
+    <div class="bg-orbs"></div>
+    <div class="app-container">
+        <!-- Header -->
+        <header class="header">
+            <div class="logo">
+                <div class="logo-icon">
+                    <svg width="32" height="32" viewBox="0 0 32 32" fill="none">
+                        <rect x="4" y="2" width="18" height="24" rx="3" stroke="currentColor" stroke-width="2.5"/>
+                        <path d="M10 8h8M10 12h10M10 16h6" stroke="currentColor" stroke-width="2" stroke-linecap="round"/>
+                        <circle cx="22" cy="22" r="8" fill="var(--accent-blue)" opacity="0.9"/>
+                        <path d="M20 22l1.5 1.5L24 20" stroke="white" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/>
+                    </svg>
+                </div>
+                <div>
+                    <h1>Alldocex</h1>
+                    <p class="logo-subtitle">Intelligent Document Processing</p>
+                </div>
+            </div>
+        </header>
+        <!-- Main Content -->
+        <main class="main-content">
+            <!-- Upload Section -->
+            <section class="upload-section" id="uploadSection">
+                <div class="upload-zone" id="dropZone">
+                    <div class="upload-icon">
+                        <svg width="64" height="64" viewBox="0 0 64 64" fill="none">
+                            <path d="M32 44V20M32 20L22 30M32 20L42 30" stroke="var(--accent-blue)" stroke-width="3" stroke-linecap="round" stroke-linejoin="round"/>
+                            <path d="M12 40v6a6 6 0 006 6h28a6 6 0 006-6v-6" stroke="var(--accent-blue)" stroke-width="3" stroke-linecap="round"/>
+                        </svg>
+                    </div>
+                    <h2 class="upload-title">Drop your document here</h2>
+                    <p class="upload-subtitle">or click to browse files</p>
+                    <div class="upload-formats">
+                        <span class="format-badge">PDF</span>
+                        <span class="format-badge">DOCX</span>
+                        <span class="format-badge">PNG</span>
+                        <span class="format-badge">JPG</span>
+                        <span class="format-badge">TIFF</span>
+                        <span class="format-badge">BMP</span>
+                    </div>
+                    <p class="upload-limit">Maximum file size: 50MB</p>
+                    <input type="file" id="fileInput" accept=".pdf,.docx,.png,.jpg,.jpeg,.tiff,.bmp,.webp" hidden>
+                </div>
+                <div class="url-section">
+                    <div class="divider">
+                        <span>OR</span>
+                    </div>
+                    <div class="url-input-container">
+                        <div class="url-icon-subtle">
+                            <svg width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07l-1.72 1.71"></path><path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 0 0 7.07 7.07l1.71-1.71"></path></svg>
+                        </div>
+                        <input type="text" id="urlInput" placeholder="Paste a web URL here (e.g. https://wikipedia.org/...)">
+                        <button class="btn-url" id="btnExtractUrl">
+                            Summarize URL
+                        </button>
+                    </div>
+                </div>
+            </section>
+            <!-- Processing Indicator -->
+            <section class="processing-section hidden" id="processingSection">
+                <div class="processing-card">
+                    <div class="processing-spinner">
+                        <div class="spinner-ring"></div>
+                        <div class="spinner-ring ring-inner"></div>
+                    </div>
+                    <h3 class="processing-title" id="processingTitle">Processing document...</h3>
+                    <p class="processing-subtitle" id="processingSubtitle">Extracting text and running AI analysis</p>
+                    <div class="processing-steps">
+                        <div class="step" id="stepExtract">
+                            <span class="step-icon">📄</span>
+                            <span>Text Extraction</span>
+                            <span class="step-status">⏳</span>
+                        </div>
+                        <div class="step" id="stepSummary">
+                            <span class="step-icon">📝</span>
+                            <span>Summarization</span>
+                            <span class="step-status">⏳</span>
+                        </div>
+                        <div class="step" id="stepEntities">
+                            <span class="step-icon">🏷️</span>
+                            <span>Entity Recognition</span>
+                            <span class="step-status">⏳</span>
+                        </div>
+                        <div class="step" id="stepSentiment">
+                            <span class="step-icon">💭</span>
+                            <span>Sentiment Analysis</span>
+                            <span class="step-status">⏳</span>
+                        </div>
+                    </div>
+                    <!-- Cancel Button -->
+                    <button class="btn-cancel" id="btnCancelProcessing" title="Cancel and return to upload">
+                        <svg width="18" height="18" viewBox="0 0 18 18" fill="none"><path d="M13.5 4.5l-9 9M4.5 4.5l9 9" stroke="currentColor" stroke-width="1.8" stroke-linecap="round" stroke-linejoin="round"/></svg>
+                        Cancel
+                    </button>
+                </div>
+            </section>
+            <!-- Results Section -->
+            <section class="results-section hidden" id="resultsSection">
+                <!-- File Info Bar -->
+                <div class="file-info-bar" id="fileInfoBar">
+                    <div class="file-info-left">
+                        <span class="file-type-icon" id="fileTypeIcon">📄</span>
+                        <div>
+                            <h3 class="file-name" id="fileName">document.pdf</h3>
+                            <p class="file-meta" id="fileMeta">PDF • 2.3 MB • 15 pages</p>
+                        </div>
+                    </div>
+                    <div class="file-info-right">
+                        <span class="processing-time" id="processingTime">⏱ 1.2s</span>
+                        <button class="btn-back" id="btnBackToUpload" title="Back to upload">
+                            <svg width="18" height="18" viewBox="0 0 20 20" fill="none">
+                                <path d="M15 10H5M10 15l-5-5 5-5" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/>
+                            </svg>
+                            Back
+                        </button>
+                        <button class="btn-new" id="btnNewUpload" title="Upload new document">
+                            <svg width="18" height="18" viewBox="0 0 20 20" fill="none">
+                                <path d="M10 4v12M4 10h12" stroke="currentColor" stroke-width="2" stroke-linecap="round"/>
+                            </svg>
+                            New Upload
+                        </button>
+                    </div>
+                </div>
+                <!-- Tabs -->
+                <div class="tabs">
+                    <button class="tab active" data-tab="extracted" id="tabExtracted">
+                        <svg width="18" height="18" viewBox="0 0 18 18" fill="none"><rect x="3" y="3" width="12" height="12" rx="2" stroke="currentColor" stroke-width="1.5"/><path d="M6 7h6M6 10h4" stroke="currentColor" stroke-width="1.5" stroke-linecap="round"/></svg>
+                        Extracted Text
+                    </button>
+                    <button class="tab" data-tab="summary" id="tabSummary">
+                        <svg width="18" height="18" viewBox="0 0 18 18" fill="none"><path d="M3 5h12M3 9h8M3 13h10" stroke="currentColor" stroke-width="1.5" stroke-linecap="round"/></svg>
+                        Summary
+                    </button>
+                    <button class="tab" data-tab="entities" id="tabEntities">
+                        <svg width="18" height="18" viewBox="0 0 18 18" fill="none"><circle cx="7" cy="7" r="4" stroke="currentColor" stroke-width="1.5"/><path d="M14 14l-3-3" stroke="currentColor" stroke-width="1.5" stroke-linecap="round"/></svg>
+                        Entities
+                    </button>
+                    <button class="tab" data-tab="sentiment" id="tabSentiment">
+                        <svg width="18" height="18" viewBox="0 0 18 18" fill="none"><circle cx="9" cy="9" r="7" stroke="currentColor" stroke-width="1.5"/><path d="M6 11c.5 1 1.5 2 3 2s2.5-1 3-2" stroke="currentColor" stroke-width="1.5" stroke-linecap="round"/><circle cx="6.5" cy="7.5" r="1" fill="currentColor"/><circle cx="11.5" cy="7.5" r="1" fill="currentColor"/></svg>
+                        Sentiment
+                    </button>
+                    <button class="tab" data-tab="metadata" id="tabMetadata">
+                        <svg width="18" height="18" viewBox="0 0 18 18" fill="none"><circle cx="9" cy="9" r="7" stroke="currentColor" stroke-width="1.5"/><path d="M9 6v3l2 2" stroke="currentColor" stroke-width="1.5" stroke-linecap="round" stroke-linejoin="round"/></svg>
+                        Metadata
+                    </button>
+                </div>
+                <!-- Tab Content -->
+                <div class="tab-content">
+                    <!-- Extracted Text -->
+                    <div class="tab-panel active" id="panelExtracted">
+                        <div class="panel-header">
+                            <h3>Extracted Text</h3>
+                            <div class="panel-actions">
+                                <button class="btn-copy" id="btnCopyText" title="Copy to clipboard">
+                                    <svg width="16" height="16" viewBox="0 0 16 16" fill="none"><rect x="5" y="5" width="9" height="9" rx="1.5" stroke="currentColor" stroke-width="1.5"/><path d="M3 11V3a1 1 0 011-1h8" stroke="currentColor" stroke-width="1.5" stroke-linecap="round"/></svg>
+                                    Copy
+                                </button>
+                                <button class="btn-download" id="btnDownloadText" title="Download as .txt">
+                                    <svg width="16" height="16" viewBox="0 0 16 16" fill="none"><path d="M8 2v9M4 7l4 4 4-4M2 14h12" stroke="currentColor" stroke-width="1.5" stroke-linecap="round" stroke-linejoin="round"/></svg>
+                                    Download
+                                </button>
+                            </div>
+                        </div>
+                        <div class="text-content" id="extractedText">
+                            <p class="placeholder">No text extracted yet.</p>
+                        </div>
+                    </div>
+                    <!-- Summary -->
+                    <div class="tab-panel" id="panelSummary">
+                        <div class="panel-header">
+                            <h3>AI Summary</h3>
+                            <button class="btn-copy" id="btnCopySummary" title="Copy to clipboard">
+                                <svg width="16" height="16" viewBox="0 0 16 16" fill="none"><rect x="5" y="5" width="9" height="9" rx="1.5" stroke="currentColor" stroke-width="1.5"/><path d="M3 11V3a1 1 0 011-1h8" stroke="currentColor" stroke-width="1.5" stroke-linecap="round"/></svg>
+                                Copy
+                            </button>
+                        </div>
+                        <div class="summary-content" id="summaryContent">
+                            <p class="placeholder">No summary available.</p>
+                        </div>
+                        <div class="summary-stats hidden" id="summaryStats">
+                            <div class="stat-card">
+                                <span class="stat-value" id="statOriginalLen">0</span>
+                                <span class="stat-label">Original chars</span>
+                            </div>
+                            <div class="stat-card">
+                                <span class="stat-value" id="statSummaryLen">0</span>
+                                <span class="stat-label">Summary chars</span>
+                            </div>
+                            <div class="stat-card">
+                                <span class="stat-value" id="statCompression">0%</span>
+                                <span class="stat-label">Compression</span>
+                            </div>
+                            <div class="stat-card">
+                                <span class="stat-value" id="statAlgorithm">—</span>
+                                <span class="stat-label">Algorithm</span>
+                            </div>
+                        </div>
+                    </div>
+                    <!-- Entities -->
+                    <div class="tab-panel" id="panelEntities">
+                        <div class="panel-header">
+                            <h3>Named Entities</h3>
+                            <span class="entity-count" id="entityCount">0 entities found</span>
+                        </div>
+                        <div class="entity-categories" id="entityCategories">
+                            <p class="placeholder">No entities detected.</p>
+                        </div>
+                        <div class="entity-list" id="entityList"></div>
+                    </div>
+                    <!-- Sentiment -->
+                    <div class="tab-panel" id="panelSentiment">
+                        <div class="panel-header">
+                            <h3>Sentiment Analysis</h3>
+                        </div>
+                        <div class="sentiment-overview" id="sentimentOverview">
+                            <p class="placeholder">No sentiment data available.</p>
+                        </div>
+                    </div>
+                    <!-- Metadata -->
+                    <div class="tab-panel" id="panelMetadata">
+                        <div class="panel-header">
+                            <h3>Document Metadata</h3>
+                        </div>
+                        <div class="metadata-content" id="metadataContent">
+                            <p class="placeholder">No metadata available.</p>
+                        </div>
+                    </div>
+                </div>
+            </section>
+        </main>
+        <!-- Footer -->
+        <footer class="footer">
+            <p>Alldocex v1.0 — Powered by FastAPI, spaCy, VADER &amp; Tesseract OCR</p>
+        </footer>
+    </div>
+    <!-- Toast Container -->
+    <div class="toast-container" id="toastContainer"></div>
+    <script src="/static/app.js"></script>
+</body>
+</html>
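
The tab markup above depends on a naming convention that `activateTab` in `static/app.js` relies on: each button's `data-tab` value, with its first letter capitalized and prefixed by `panel`, must equal the `id` of the matching `.tab-panel`. A small sketch of that mapping follows; `panelIdFor` is a hypothetical name used only for illustration, and the expression mirrors the one in `activateTab`.

```javascript
// Hypothetical helper (not part of app.js) mirroring the id expression used
// in activateTab: "summary" -> "panelSummary".
function panelIdFor(tabName) {
    return `panel${tabName.charAt(0).toUpperCase()}${tabName.slice(1)}`;
}

// Tab names taken from the data-tab attributes in index.html.
const tabNames = ['extracted', 'summary', 'entities', 'sentiment', 'metadata'];

// Each result should match a tab-panel id in the markup.
console.log(tabNames.map(panelIdFor));
```

Adding a new tab therefore means adding both a button with `data-tab="name"` and a panel with `id="panelName"`; if the two do not line up, `activateTab` highlights the tab but finds no panel to show.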

static/styles.css ADDED

	@@ -0,0 +1,1156 @@

+/* --- CSS Variables / Design Tokens (Corporate Blue & White) --- */
+:root {
+    /* Primary Colors */
+    --bg-primary: #f8fafc;
+    --bg-secondary: #ffffff;
+    --bg-card: #ffffff;
+    --bg-subtle: #f1f5f9;
+    /* Borders & Accents */
+    --border-light: #e2e8f0;
+    --border-accent: rgba(37, 99, 235, 0.2);
+    /* Text */
+    --text-primary: #1e293b;
+    --text-secondary: #475569;
+    --text-muted: #94a3b8;
+    /* Corporate Blue Palette */
+    --accent-blue-deep: #1e40af;
+    --accent-blue: #2563eb;
+    --accent-blue-light: #60a5fa;
+    --accent-blue-subtle: #dbeafe;
+    /* Functional Colors */
+    --accent-green: #10b981;
+    --accent-yellow: #f59e0b;
+    --accent-red: #ef4444;
+    /* Gradients */
+    --gradient-primary: linear-gradient(135deg, #1e40af, #2563eb);
+    --gradient-professional: linear-gradient(135deg, #2563eb, #60a5fa);
+    /* Shadows (Clean & Soft) */
+    --shadow-sm: 0 1px 3px rgba(0, 0, 0, 0.1);
+    --shadow-md: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06);
+    --shadow-lg: 0 10px 15px -3px rgba(0, 0, 0, 0.1), 0 4px 6px -2px rgba(0, 0, 0, 0.05);
+    --shadow-glow: 0 0 20px rgba(37, 99, 235, 0.1);
+    /* Radius */
+    --radius-sm: 6px;
+    --radius-md: 8px;
+    --radius-lg: 12px;
+    --radius-xl: 16px;
+    /* Font */
+    --font-main: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
+    /* Transitions */
+    --transition-fast: 0.15s ease;
+    --transition-normal: 0.25s ease;
+    --transition-smooth: 0.4s cubic-bezier(0.4, 0, 0.2, 1);
+}
+/* --- Reset & Base --- */
+*, *::before, *::after {
+    box-sizing: border-box;
+    margin: 0;
+    padding: 0;
+}
+html {
+    font-size: 16px;
+    scroll-behavior: smooth;
+}
+body {
+    font-family: var(--font-main);
+    background: var(--bg-primary);
+    color: var(--text-primary);
+    min-height: 100vh;
+    line-height: 1.6;
+    -webkit-font-smoothing: antialiased;
+}
+/* --- Background Decorations --- */
+.bg-orbs {
+    position: fixed;
+    inset: 0;
+    z-index: -1;
+    background: radial-gradient(circle at top right, var(--accent-blue-subtle), transparent 400px),
+                radial-gradient(circle at bottom left, #f1f5f9, transparent 400px);
+    opacity: 0.5;
+}
+/* --- App Container --- */
+.app-container {
+    max-width: 1100px;
+    margin: 0 auto;
+    padding: 32px 20px;
+    min-height: 100vh;
+    display: flex;
+    flex-direction: column;
+}
+/* --- Header --- */
+.header {
+    display: flex;
+    align-items: center;
+    justify-content: space-between;
+    padding: 20px 32px;
+    background: var(--bg-secondary);
+    border: 1px solid var(--border-light);
+    border-radius: var(--radius-lg);
+    box-shadow: var(--shadow-sm);
+    margin-bottom: 32px;
+}
+.logo {
+    display: flex;
+    align-items: center;
+    gap: 16px;
+}
+.logo-icon {
+    display: flex;
+    align-items: center;
+    justify-content: center;
+    width: 44px;
+    height: 44px;
+    background: var(--accent-blue-subtle);
+    border-radius: var(--radius-md);
+    color: var(--accent-blue);
+}
+.logo h1 {
+    font-size: 1.5rem;
+    font-weight: 800;
+    color: var(--accent-blue-deep);
+    letter-spacing: -0.5px;
+}
+.logo-subtitle {
+    font-size: 0.75rem;
+    color: var(--text-secondary);
+    font-weight: 500;
+    text-transform: uppercase;
+    letter-spacing: 0.5px;
+}
+/* --- Main Content --- */
+.main-content {
+    flex: 1;
+}
+/* --- Upload Section --- */
+.upload-zone {
+    display: flex;
+    flex-direction: column;
+    align-items: center;
+    justify-content: center;
+    padding: 64px 40px;
+    background: var(--bg-secondary);
+    border: 2px dashed var(--border-light);
+    border-radius: var(--radius-xl);
+    cursor: pointer;
+    transition: var(--transition-smooth);
+    box-shadow: var(--shadow-sm);
+}
+.upload-zone:hover {
+    border-color: var(--accent-blue-light);
+    background: var(--accent-blue-subtle);
+    box-shadow: var(--shadow-glow);
+    transform: translateY(-2px);
+}
+.upload-zone.drag-over {
+    border-color: var(--accent-blue);
+    background: var(--accent-blue-subtle);
+    box-shadow: 0 0 20px rgba(37, 99, 235, 0.15);
+}
+.upload-icon {
+    margin-bottom: 24px;
+    color: var(--accent-blue);
+}
+.upload-title {
+    font-size: 1.5rem;
+    font-weight: 700;
+    margin-bottom: 8px;
+    color: var(--text-primary);
+}
+.upload-subtitle {
+    color: var(--text-secondary);
+    font-size: 0.95rem;
+    margin-bottom: 24px;
+}
+.upload-formats {
+    display: flex;
+    gap: 8px;
+    flex-wrap: wrap;
+    justify-content: center;
+    margin-bottom: 16px;
+}
+.format-badge {
+    padding: 6px 14px;
+    font-size: 0.75rem;
+    font-weight: 600;
+    color: var(--accent-blue);
+    background: var(--accent-blue-subtle);
+    border-radius: 100px;
+    text-transform: uppercase;
+    cursor: pointer;
+    transition: var(--transition-fast);
+}
+.format-badge:hover {
+    background: var(--accent-blue);
+    color: white;
+    transform: translateY(-1px);
+    box-shadow: var(--shadow-sm);
+}
+.upload-limit {
+    font-size: 0.75rem;
+    color: var(--text-muted);
+}
+/* --- URL Section --- */
+.url-section {
+    margin-top: 32px;
+    width: 100%;
+}
+.divider {
+    display: flex;
+    align-items: center;
+    text-align: center;
+    margin-bottom: 24px;
+    color: var(--text-muted);
+    font-size: 0.75rem;
+    font-weight: 600;
+    letter-spacing: 1px;
+}
+.divider::before, .divider::after {
+    content: '';
+    flex: 1;
+    border-bottom: 1px solid var(--border-light);
+}
+.divider:not(:empty)::before {
+    margin-right: 16px;
+}
+.divider:not(:empty)::after {
+    margin-left: 16px;
+}
+.url-input-container {
+    display: flex;
+    gap: 12px;
+    background: var(--bg-secondary);
+    border: 1px solid var(--border-light);
+    padding: 8px;
+    border-radius: var(--radius-lg);
+    box-shadow: var(--shadow-sm);
+    transition: var(--transition-normal);
+}
+.url-input-container:focus-within {
+    border-color: var(--accent-blue);
+    box-shadow: var(--shadow-glow);
+}
+.url-icon-subtle {
+    display: flex;
+    align-items: center;
+    padding-left: 12px;
+    color: var(--text-muted);
+}
+.url-input-container input {
+    flex: 1;
+    border: none;
+    background: transparent;
+    font-family: var(--font-main);
+    font-size: 0.95rem;
+    color: var(--text-primary);
+    outline: none;
+}
+.btn-url {
+    background: var(--gradient-primary);
+    color: white;
+    border: none;
+    padding: 10px 24px;
+    border-radius: var(--radius-md);
+    font-family: var(--font-main);
+    font-size: 0.85rem;
+    font-weight: 600;
+    cursor: pointer;
+    transition: var(--transition-normal);
+    white-space: nowrap;
+}
+.btn-url:hover {
+    transform: translateY(-1px);
+    box-shadow: 0 4px 12px rgba(37, 99, 235, 0.3);
+}
+/* --- Transitions & Animations --- */
+@keyframes fadeInUp {
+    from { opacity: 0; transform: translateY(15px); }
+    to { opacity: 1; transform: translateY(0); }
+}
+@keyframes fadeIn {
+    from { opacity: 0; }
+    to { opacity: 1; }
+}
+.upload-section, .processing-section, .results-section {
+    animation: fadeInUp 0.5s var(--transition-smooth);
+}
+.hidden {
+    display: none !important;
+}
+/* --- Processing Section --- */
+.processing-card {
+    display: flex;
+    flex-direction: column;
+    align-items: center;
+    padding: 60px 40px;
+    background: var(--bg-secondary);
+    border: 1px solid var(--border-light);
+    border-radius: var(--radius-xl);
+    box-shadow: var(--shadow-md);
+}
+.processing-spinner {
+    position: relative;
+    width: 80px;
+    height: 80px;
+    margin-bottom: 24px;
+}
+.spinner-ring {
+    position: absolute;
+    inset: 0;
+    border-radius: 50%;
+    border: 3px solid #e2e8f0;
+    border-top-color: var(--accent-blue);
+    animation: spin 1s linear infinite;
+}
+.ring-inner {
+    inset: 10px;
+    border-top-color: var(--accent-blue-light);
+    animation-direction: reverse;
+    animation-duration: 0.8s;
+}
+@keyframes spin {
+    to { transform: rotate(360deg); }
+}
+.processing-title {
+    font-size: 1.2rem;
+    font-weight: 600;
+    margin-bottom: 6px;
+}
+.processing-subtitle {
+    color: var(--text-secondary);
+    font-size: 0.85rem;
+    margin-bottom: 30px;
+}
+.processing-steps {
+    display: flex;
+    flex-direction: column;
+    gap: 12px;
+    width: 100%;
+    max-width: 360px;
+}
+.step {
+    display: flex;
+    align-items: center;
+    gap: 12px;
+    padding: 10px 16px;
+    background: var(--bg-primary);
+    border-radius: var(--radius-sm);
+    border: 1px solid var(--border-light);
+    font-size: 0.85rem;
+    color: var(--text-secondary);
+    transition: var(--transition-normal);
+}
+.step.active {
+    border-color: var(--accent-blue);
+    color: var(--accent-blue-deep);
+    background: var(--accent-blue-subtle);
+}
+.step.done {
+    border-color: rgba(16, 185, 129, 0.3);
+    color: var(--accent-green);
+}
+.step-icon {
+    font-size: 1.1rem;
+}
+.step-status {
+    margin-left: auto;
+    font-size: 0.9rem;
+}
+/* --- Results Section --- */
+.results-section {
+    animation: fadeInUp 0.5s ease;
+}
+/* File Info Bar */
+.file-info-bar {
+    display: flex;
+    align-items: center;
+    justify-content: space-between;
+    padding: 16px 24px;
+    background: var(--bg-secondary);
+    border: 1px solid var(--border-light);
+    border-radius: var(--radius-lg);
+    box-shadow: var(--shadow-sm);
+    margin-bottom: 20px;
+}
+.file-info-left {
+    display: flex;
+    align-items: center;
+    gap: 14px;
+}
+.file-type-icon {
+    font-size: 2rem;
+}
+.file-name {
+    font-size: 1rem;
+    font-weight: 600;
+}
+.file-meta {
+    font-size: 0.8rem;
+    color: var(--text-secondary);
+}
+.file-info-right {
+    display: flex;
+    align-items: center;
+    gap: 12px;
+}
+.processing-time {
+    font-size: 0.8rem;
+    color: var(--accent-blue);
+    padding: 4px 12px;
+    background: var(--accent-blue-subtle);
+    border-radius: 100px;
+}
+.btn-new, .btn-back {
+    display: flex;
+    align-items: center;
+    gap: 8px;
+    padding: 8px 18px;
+    border: none;
+    border-radius: var(--radius-sm);
+    font-family: var(--font-main);
+    font-size: 0.85rem;
+    font-weight: 600;
+    cursor: pointer;
+    transition: var(--transition-normal);
+}
+.btn-new {
+    background: var(--gradient-primary);
+    color: white;
+}
+.btn-back {
+    background: var(--bg-glass-strong);
+    color: var(--text-secondary);
+    border: 1px solid var(--border-glass);
+}
+.btn-new:hover, .btn-back:hover {
+    transform: translateY(-1px);
+    box-shadow: var(--shadow-md);
+}
+.btn-new:hover {
+    box-shadow: 0 4px 20px rgba(139, 92, 246, 0.4);
+}
+.btn-back:hover {
+    color: var(--text-primary);
+    border-color: var(--border-accent);
+}
+.btn-cancel {
+    margin-top: 24px;
+    display: flex;
+    align-items: center;
+    gap: 8px;
+    padding: 10px 24px;
+    background: rgba(239, 68, 68, 0.1);
+    border: 1px solid rgba(239, 68, 68, 0.2);
+    border-radius: var(--radius-sm);
+    color: var(--accent-red);
+    font-family: var(--font-main);
+    font-size: 0.85rem;
+    font-weight: 600;
+    cursor: pointer;
+    transition: var(--transition-normal);
+}
+.btn-cancel:hover {
+    background: rgba(239, 68, 68, 0.2);
+    border-color: rgba(239, 68, 68, 0.4);
+    transform: translateY(-1px);
+}
+/* --- Tabs --- */
+.tabs {
+    display: flex;
+    gap: 4px;
+    padding: 4px;
+    background: var(--bg-subtle);
+    border: 1px solid var(--border-light);
+    border-radius: var(--radius-lg);
+    margin-bottom: 20px;
+    overflow-x: auto;
+}
+.tab {
+    display: flex;
+    align-items: center;
+    gap: 8px;
+    padding: 10px 18px;
+    background: transparent;
+    border: 1px solid transparent;
+    border-radius: var(--radius-md);
+    color: var(--text-secondary);
+    font-family: var(--font-main);
+    font-size: 0.82rem;
+    font-weight: 500;
+    cursor: pointer;
+    transition: var(--transition-normal);
+    white-space: nowrap;
+    flex: 1;
+    justify-content: center;
+}
+.tab:hover {
+    color: var(--accent-blue);
+    background: #fff;
+}
+.tab.active {
+    color: var(--accent-blue);
+    background: #fff;
+    border-color: var(--accent-blue);
+    box-shadow: var(--shadow-sm);
+}
+.tab svg {
+    opacity: 0.7;
+    flex-shrink: 0;
+}
+.tab.active svg {
+    opacity: 1;
+}
+/* --- Tab Panels --- */
+.tab-content {
+    position: relative;
+}
+.tab-panel {
+    display: none;
+    animation: fadeIn 0.3s ease;
+}
+.tab-panel.active {
+    display: block;
+}
+.panel-header {
+    display: flex;
+    align-items: center;
+    justify-content: space-between;
+    margin-bottom: 16px;
+}
+.panel-header h3 {
+    font-size: 1.1rem;
+    font-weight: 600;
+}
+.panel-actions {
+    display: flex;
+    gap: 8px;
+}
+.btn-copy, .btn-download {
+    display: flex;
+    align-items: center;
+    gap: 6px;
+    padding: 6px 14px;
+    background: var(--bg-glass-strong);
+    border: 1px solid var(--border-glass);
+    border-radius: var(--radius-sm);
+    color: var(--text-secondary);
+    font-family: var(--font-main);
+    font-size: 0.8rem;
+    cursor: pointer;
+    transition: var(--transition-normal);
+}
+.btn-copy:hover, .btn-download:hover {
+    color: var(--accent-cyan);
+    border-color: rgba(6, 182, 212, 0.3);
+    background: rgba(6, 182, 212, 0.05);
+}
+.btn-copy.copied {
+    color: var(--accent-green);
+    border-color: rgba(16, 185, 129, 0.3);
+}
+/* --- Text Content --- */
+.text-content, .summary-content {
+    padding: 24px;
+    background: #ffffff;
+    border: 1px solid var(--border-light);
+    border-radius: var(--radius-lg);
+    color: var(--text-primary);
+    box-shadow: var(--shadow-sm);
+    max-height: 500px;
+    overflow-y: auto;
+    font-size: 0.9rem;
+    line-height: 1.8;
+    white-space: pre-wrap;
+    overflow-wrap: break-word;
+}
+.summary-content {
+    border-left: 4px solid var(--accent-blue);
+    background: #fafbff;
+    font-size: 0.95rem;
+}
+.placeholder {
+    color: var(--text-muted);
+    font-style: italic;
+    text-align: center;
+    padding: 30px 0;
+}
+/* --- Summary Stats --- */
+.summary-stats {
+    display: grid;
+    grid-template-columns: repeat(4, 1fr);
+    gap: 12px;
+    margin-top: 16px;
+}
+.stat-card {
+    display: flex;
+    flex-direction: column;
+    align-items: center;
+    gap: 4px;
+    padding: 16px 12px;
+    background: var(--bg-glass);
+    border: 1px solid var(--border-glass);
+    border-radius: var(--radius-md);
+    text-align: center;
+}
+.stat-value {
+    font-size: 1.2rem;
+    font-weight: 700;
+    background: var(--gradient-primary);
+    -webkit-background-clip: text;
+    -webkit-text-fill-color: transparent;
+    background-clip: text;
+}
+.stat-label {
+    font-size: 0.7rem;
+    color: var(--text-muted);
+    text-transform: uppercase;
+    letter-spacing: 0.5px;
+}
+/* --- Entities --- */
+.entity-count {
+    font-size: 0.85rem;
+    color: var(--accent-cyan);
+    padding: 4px 12px;
+    background: rgba(6, 182, 212, 0.1);
+    border-radius: 100px;
+}
+.entity-categories {
+    display: flex;
+    flex-wrap: wrap;
+    gap: 8px;
+    margin-bottom: 20px;
+}
+.entity-category-badge {
+    display: flex;
+    align-items: center;
+    gap: 6px;
+    padding: 6px 14px;
+    background: var(--bg-glass);
+    border: 1px solid var(--border-glass);
+    border-radius: 100px;
+    font-size: 0.78rem;
+    font-weight: 500;
+    cursor: pointer;
+    transition: var(--transition-normal);
+}
+.entity-category-badge:hover {
+    background: var(--bg-glass-strong);
+}
+.entity-category-badge .cat-dot {
+    width: 8px;
+    height: 8px;
+    border-radius: 50%;
+}
+.entity-category-badge .cat-count {
+    font-weight: 700;
+    margin-left: 4px;
+}
+.entity-list {
+    display: grid;
+    grid-template-columns: repeat(auto-fill, minmax(280px, 1fr));
+    gap: 10px;
+}
+.entity-item {
+    display: flex;
+    align-items: center;
+    justify-content: space-between;
+    padding: 12px 16px;
+    background: var(--bg-glass);
+    border: 1px solid var(--border-glass);
+    border-radius: var(--radius-md);
+    transition: var(--transition-normal);
+}
+.entity-item:hover {
+    background: var(--bg-glass-strong);
+    border-color: var(--border-accent);
+}
+.entity-item-left {
+    display: flex;
+    align-items: center;
+    gap: 10px;
+    min-width: 0;
+}
+.entity-type-badge {
+    padding: 2px 8px;
+    font-size: 0.65rem;
+    font-weight: 700;
+    border-radius: 4px;
+    letter-spacing: 0.5px;
+    text-transform: uppercase;
+    white-space: nowrap;
+    flex-shrink: 0;
+}
+.entity-text {
+    font-size: 0.88rem;
+    font-weight: 500;
+    white-space: nowrap;
+    overflow: hidden;
+    text-overflow: ellipsis;
+}
+.entity-item-count {
+    font-size: 0.75rem;
+    color: var(--text-muted);
+    padding: 2px 8px;
+    background: var(--bg-glass-strong);
+    border-radius: 100px;
+    flex-shrink: 0;
+    margin-left: 8px;
+}
+/* Entity color mapping */
+.badge-PERSON { background: rgba(236, 72, 153, 0.15); color: var(--accent-pink); }
+.badge-ORG { background: rgba(59, 130, 246, 0.15); color: var(--accent-blue); }
+.badge-GPE { background: rgba(16, 185, 129, 0.15); color: var(--accent-green); }
+.badge-DATE { background: rgba(245, 158, 11, 0.15); color: var(--accent-yellow); }
+.badge-MONEY { background: rgba(139, 92, 246, 0.15); color: var(--accent-purple); }
+.badge-EVENT { background: rgba(6, 182, 212, 0.15); color: var(--accent-cyan); }
+.badge-PRODUCT { background: rgba(251, 146, 60, 0.15); color: #fb923c; }
+.badge-LAW { background: rgba(168, 85, 247, 0.15); color: #a855f7; }
+.badge-NORP { background: rgba(244, 114, 182, 0.15); color: #f472b6; }
+.badge-EMAIL { background: rgba(6, 182, 212, 0.15); color: var(--accent-cyan); }
+.badge-PHONE { background: rgba(59, 130, 246, 0.15); color: var(--accent-blue); }
+.badge-URL { background: rgba(16, 185, 129, 0.15); color: var(--accent-green); }
+.badge-TIME { background: rgba(245, 158, 11, 0.15); color: var(--accent-yellow); }
+.badge-PERCENT { background: rgba(139, 92, 246, 0.15); color: var(--accent-purple); }
+.badge-CARDINAL { background: rgba(100, 116, 139, 0.15); color: #94a3b8; }
+.badge-ORDINAL { background: rgba(100, 116, 139, 0.15); color: #94a3b8; }
+.badge-QUANTITY { background: rgba(251, 146, 60, 0.15); color: #fb923c; }
+/* --- Sentiment --- */
+.sentiment-overview {
+    display: flex;
+    flex-direction: column;
+    gap: 20px;
+}
+.sentiment-gauge-container {
+    display: flex;
+    flex-direction: column;
+    align-items: center;
+    gap: 16px;
+    padding: 30px;
+    background: var(--bg-glass);
+    border: 1px solid var(--border-glass);
+    border-radius: var(--radius-lg);
+    -webkit-backdrop-filter: blur(20px);
+    backdrop-filter: blur(20px);
+}
+.sentiment-label-display {
+    font-size: 1.6rem;
+    font-weight: 800;
+    letter-spacing: -0.5px;
+}
+.sentiment-score {
+    font-size: 3rem;
+    font-weight: 800;
+    background: var(--gradient-primary);
+    -webkit-background-clip: text;
+    -webkit-text-fill-color: transparent;
+    background-clip: text;
+    line-height: 1;
+}
+.sentiment-bar-container {
+    width: 100%;
+    max-width: 500px;
+}
+.sentiment-bar {
+    width: 100%;
+    height: 12px;
+    border-radius: 6px;
+    background: var(--bg-glass-strong);
+    overflow: hidden;
+    display: flex;
+}
+.sentiment-bar-positive {
+    background: var(--accent-green);
+    transition: width 0.8s ease;
+}
+.sentiment-bar-neutral {
+    background: var(--text-muted);
+    transition: width 0.8s ease;
+}
+.sentiment-bar-negative {
+    background: var(--accent-red);
+    transition: width 0.8s ease;
+}
+.sentiment-bar-labels {
+    display: flex;
+    justify-content: space-between;
+    margin-top: 8px;
+    font-size: 0.75rem;
+    color: var(--text-secondary);
+}
+.sentiment-bar-labels span {
+    display: flex;
+    align-items: center;
+    gap: 6px;
+}
+.sentiment-bar-labels .dot {
+    width: 8px;
+    height: 8px;
+    border-radius: 50%;
+}
+.dot-pos { background: var(--accent-green); }
+.dot-neu { background: var(--text-muted); }
+.dot-neg { background: var(--accent-red); }
+.sentiment-sentences {
+    background: var(--bg-glass);
+    border: 1px solid var(--border-glass);
+    border-radius: var(--radius-lg);
+    padding: 20px;
+    max-height: 400px;
+    overflow-y: auto;
+}
+.sentiment-sentences h4 {
+    font-size: 0.9rem;
+    font-weight: 600;
+    margin-bottom: 12px;
+    color: var(--text-secondary);
+}
+.sentence-item {
+    display: flex;
+    align-items: flex-start;
+    gap: 12px;
+    padding: 10px 0;
+    border-bottom: 1px solid var(--border-glass);
+    font-size: 0.85rem;
+}
+.sentence-item:last-child {
+    border-bottom: none;
+}
+.sentence-sentiment-badge {
+    padding: 2px 8px;
+    font-size: 0.65rem;
+    font-weight: 700;
+    border-radius: 4px;
+    white-space: nowrap;
+    flex-shrink: 0;
+    margin-top: 2px;
+}
+.sent-positive { background: rgba(16, 185, 129, 0.15); color: var(--accent-green); }
+.sent-negative { background: rgba(239, 68, 68, 0.15); color: var(--accent-red); }
+.sent-neutral { background: rgba(100, 116, 139, 0.15); color: var(--text-muted); }
+.sentence-text {
+    color: var(--text-secondary);
+    line-height: 1.5;
+}
+/* --- Metadata --- */
+.metadata-content {
+    background: var(--bg-glass);
+    border: 1px solid var(--border-glass);
+    border-radius: var(--radius-lg);
+    overflow: hidden;
+}
+.metadata-table {
+    width: 100%;
+    border-collapse: collapse;
+}
+.metadata-table tr {
+    border-bottom: 1px solid var(--border-glass);
+}
+.metadata-table tr:last-child {
+    border-bottom: none;
+}
+.metadata-table td {
+    padding: 12px 20px;
+    font-size: 0.88rem;
+}
+.metadata-table td:first-child {
+    font-weight: 600;
+    color: var(--text-secondary);
+    width: 200px;
+    white-space: nowrap;
+}
+.metadata-table td:last-child {
+    color: var(--text-primary);
+}
+/* --- Footer --- */
+.footer {
+    text-align: center;
+    padding: 24px 0 12px;
+    font-size: 0.75rem;
+    color: var(--text-muted);
+}
+/* --- Toast Notifications --- */
+.toast-container {
+    position: fixed;
+    bottom: 24px;
+    right: 24px;
+    z-index: 1000;
+    display: flex;
+    flex-direction: column;
+    gap: 10px;
+}
+.toast {
+    display: flex;
+    align-items: center;
+    gap: 10px;
+    padding: 14px 20px;
+    background: var(--bg-card);
+    border: 1px solid var(--border-glass);
+    border-radius: var(--radius-md);
+    -webkit-backdrop-filter: blur(20px);
+    backdrop-filter: blur(20px);
+    color: var(--text-primary);
+    font-size: 0.85rem;
+    box-shadow: var(--shadow-lg);
+    animation: slideInRight 0.3s ease, fadeOut 0.5s ease 3.5s forwards;
+    max-width: 380px;
+}
+.toast.toast-error {
+    border-color: rgba(239, 68, 68, 0.3);
+}
+.toast.toast-success {
+    border-color: rgba(16, 185, 129, 0.3);
+}
+.toast-icon {
+    font-size: 1.2rem;
+    flex-shrink: 0;
+}
+/* --- Animations --- */
+@keyframes fadeInUp {
+    from {
+        opacity: 0;
+        transform: translateY(20px);
+    }
+    to {
+        opacity: 1;
+        transform: translateY(0);
+    }
+}
+@keyframes fadeIn {
+    from { opacity: 0; }
+    to { opacity: 1; }
+}
+@keyframes slideInRight {
+    from {
+        opacity: 0;
+        transform: translateX(40px);
+    }
+    to {
+        opacity: 1;
+        transform: translateX(0);
+    }
+}
+@keyframes fadeOut {
+    from { opacity: 1; }
+    to { opacity: 0; }
+}
+/* --- Responsive --- */
+@media (max-width: 768px) {
+    .app-container {
+        padding: 12px;
+    }
+    .header {
+        flex-direction: column;
+        gap: 12px;
+        padding: 14px;
+    }
+    .upload-zone {
+        padding: 40px 24px;
+    }
+    .upload-title {
+        font-size: 1.1rem;
+    }
+    .tabs {
+        overflow-x: auto;
+    }
+    .tab {
+        padding: 8px 12px;
+        font-size: 0.75rem;
+    }
+    .tab svg {
+        display: none;
+    }
+    .summary-stats {
+        grid-template-columns: repeat(2, 1fr);
+    }
+    .entity-list {
+        grid-template-columns: 1fr;
+    }
+    .file-info-bar {
+        flex-direction: column;
+        gap: 12px;
+        align-items: flex-start;
+    }
+    .file-info-right {
+        width: 100%;
+        justify-content: flex-end;
+    }
+    .metadata-table td:first-child {
+        width: 140px;
+    }
+    .sentiment-score {
+        font-size: 2rem;
+    }
+}
+@media (max-width: 480px) {
+    .summary-stats {
+        grid-template-columns: 1fr 1fr;
+    }
+    .format-badge {
+        font-size: 0.65rem;
+        padding: 3px 8px;
+    }
+}

test_api.py ADDED Viewed

	@@ -0,0 +1,78 @@

+"""Quick API test script for Alldocex."""
+import requests
+import time
+import json
+BASE_URL = "http://localhost:8000"
+# Upload the test document
+print("Uploading test_document.docx...")
+with open("test_document.docx", "rb") as f:
+    res = requests.post(
+        f"{BASE_URL}/api/upload",
+        files={"file": ("test_document.docx", f, "application/vnd.openxmlformats-officedocument.wordprocessingml.document")},
+        timeout=60,
+    )
+res.raise_for_status()  # fail fast if the upload was rejected
+data = res.json()
+print(f"Upload response: {data['status']} - File ID: {data['file_id']}")
+task_id = data["file_id"]
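+# Poll for results (up to 30 attempts, one second apart)
+print("Waiting for processing...")
+for i in range(30):
+    time.sleep(1)
+    res = requests.get(f"{BASE_URL}/api/status/{task_id}", timeout=10)
+    result = res.json()
+    status = result["status"]
+    print(f"  Poll {i+1}: {status}")
+    if status in ("completed", "error"):
+        break
+else:
+    # The loop ended without reaching a terminal state
+    print("  Timed out after 30 polls; reporting the last status seen.")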
+print(f"\n{'='*50}")
+print(f"STATUS: {result['status']}")
+print(f"Processing time: {round(result.get('processing_time_ms', 0), 1)} ms")
+print(f"{'='*50}")
+# Extraction
+if result.get("extraction"):
+    ext = result["extraction"]
+    print("\n--- EXTRACTION ---")
+    print(f"Success: {ext['success']}")
+    print(f"Word count: {ext['metadata']['word_count']}")
+    print(f"Char count: {ext['metadata']['character_count']}")
+    print(f"File type: {ext['metadata']['file_type']}")
+    print(f"First 300 chars:\n{ext['raw_text'][:300]}")
+# Summary
+if result.get("summary"):
+    s = result["summary"]
+    print("\n--- SUMMARY ---")
+    print(f"Algorithm: {s['algorithm']}")
+    print(f"Original length: {s['original_length']}")
+    print(f"Summary length: {s['summary_length']}")
+    print(f"Compression: {round((1 - s['compression_ratio']) * 100, 1)}%")
+    print(f"Summary:\n{s['summary'][:500]}")
+# Entities
+if result.get("entities"):
+    e = result["entities"]
+    print("\n--- ENTITIES ---")
+    print(f"Total entities: {e['total_entities']}")
+    print(f"Categories: {json.dumps(e['entity_counts'], indent=2)}")
+    for ent in e["entities"][:20]:
+        print(f"  [{ent['label']:8s}] {ent['text']} (x{ent['count']})")
+# Sentiment
+if result.get("sentiment"):
+    sent = result["sentiment"]
+    print("\n--- SENTIMENT ---")
+    print(f"Label: {sent['overall_label']}")
+    print(f"Compound: {sent['overall_compound']}")
+    print(f"Positive: {sent['overall_positive']}")
+    print(f"Negative: {sent['overall_negative']}")
+    print(f"Neutral: {sent['overall_neutral']}")
+    print(f"Sentence breakdowns: {len(sent['sentence_breakdown'])}")
+    for sb in sent["sentence_breakdown"][:5]:
+        print(f"  [{sb['label']:15s}] {sb['text'][:80]}...")
+print("\n=== TEST COMPLETE ===")
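Note on the polling loop in `test_api.py`: it makes at most 30 requests, one second apart, and stops early on `completed` or `error`. A minimal sketch of the same logic as a reusable helper is shown here. The names `poll_until_done`, `fetch_status`, and `TERMINAL_STATES` are hypothetical and not part of this commit, and raising `TimeoutError` when the attempts run out is an assumption about the desired behavior, not something the script does.

```python
import time
from typing import Any, Callable, Dict

# The two states both test scripts treat as final.
TERMINAL_STATES = ("completed", "error")


def poll_until_done(
    fetch_status: Callable[[], Dict[str, Any]],
    max_attempts: int = 30,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status until it reports a terminal state.

    fetch_status is any zero-argument callable returning the decoded
    status payload (hypothetical hook; lets the loop run without a server).
    Raises TimeoutError if no terminal state is seen in max_attempts polls.
    """
    for _ in range(max_attempts):
        sleep(delay_s)
        result = fetch_status()
        if result.get("status") in TERMINAL_STATES:
            return result
    raise TimeoutError(f"task did not finish after {max_attempts} polls")
```

In `test_api.py` this could be called as `poll_until_done(lambda: requests.get(f"{BASE_URL}/api/status/{task_id}", timeout=10).json())`. Passing the fetcher as a callable keeps the retry logic independent of `requests`, so it can be exercised with canned responses.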

test_simple.py ADDED Viewed

	@@ -0,0 +1,49 @@

+"""Simple API test - writes results to file."""
+import json
+import time
+import requests
+BASE = "http://localhost:8000"
+out = []
+with open("test_document.docx", "rb") as f:
+    res = requests.post(f"{BASE}/api/upload", files={"file": ("test_document.docx", f)}, timeout=60)
+res.raise_for_status()  # stop here if the upload was rejected
+data = res.json()
+task_id = data["file_id"]
+out.append(f"Upload: {data['status']} (ID: {task_id})")
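+# Poll up to 30 times, one second apart, until the task reaches a terminal state
+for i in range(30):
+    time.sleep(1)
+    res = requests.get(f"{BASE}/api/status/{task_id}", timeout=10)
+    result = res.json()
+    if result["status"] in ("completed", "error"):
+        break
+else:
+    out.append("Timed out after 30 polls; reporting the last status seen")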
+out.append(f"Status: {result['status']}")
+out.append(f"Time: {round(result.get('processing_time_ms', 0))}ms")
+if result.get("extraction"):
+    e = result["extraction"]
+    out.append(f"\nEXTRACTION: success={e['success']}, words={e['metadata']['word_count']}")
+    out.append(f"Text preview: {e['raw_text'][:200]}...")
+if result.get("summary"):
+    s = result["summary"]
+    out.append(f"\nSUMMARY ({s['algorithm']}): compression={round((1-s['compression_ratio'])*100)}%")
+    out.append(s["summary"][:400])
+if result.get("entities"):
+    ent = result["entities"]
+    out.append(f"\nENTITIES: {ent['total_entities']} found")
+    out.append(f"Categories: {json.dumps(ent['entity_counts'])}")
+    for item in ent["entities"][:15]:  # avoid reusing `e` from the extraction block
+        out.append(f"  [{item['label']}] {item['text']} (x{item['count']})")
+if result.get("sentiment"):
+    s = result["sentiment"]
+    out.append(f"\nSENTIMENT: {s['overall_label']} (compound={s['overall_compound']})")
+    out.append(f"Pos={s['overall_positive']} Neg={s['overall_negative']} Neu={s['overall_neutral']}")
+out.append("\nDONE")
+text = "\n".join(out)
+with open("test_output.txt", "w", encoding="utf-8") as f:
+    f.write(text)
+print(text)
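Note on both test scripts: they print or write the response fields but contain no assertions, and all of their code runs at module level. With pytest's default discovery, files named `test_*.py` are imported during collection, so the upload and polling would run on import. One possible direction is sketched here: a pure function that checks a decoded status payload and can be called from a real test. The function name `check_result` is hypothetical, and treating all four sections as required is an assumption; the keys themselves (`status`, `extraction`, `summary`, `entities`, `sentiment`, and `extraction.success`) are the ones the scripts already read.

```python
from typing import Any, Dict, List

# Sections the scripts look for in a finished result.
# Assumption: a successful run is expected to contain all four.
EXPECTED_SECTIONS = ("extraction", "summary", "entities", "sentiment")


def check_result(result: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; empty means the payload looks fine."""
    problems: List[str] = []

    status = result.get("status")
    if status != "completed":
        problems.append(f"status is {status!r}, expected 'completed'")

    for section in EXPECTED_SECTIONS:
        if not result.get(section):
            problems.append(f"missing section: {section}")

    # Only inspect the success flag when an extraction section is present.
    extraction = result.get("extraction") or {}
    if extraction and not extraction.get("success"):
        problems.append("extraction.success is false")

    return problems
```

A test could then end with `assert check_result(result) == []`, which turns the printed report into a pass/fail signal.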