krishnachoudhary-hclguvi committed on
Commit
52a0fe9
·
unverified ·
1 Parent(s): d1e9916

Deploy text extraction API files

DEPLOYMENT.md ADDED
@@ -0,0 +1,64 @@
1
+ # Alldocex - Deployment Guide
2
+
3
+ This guide provides three main options for deploying the Alldocex application to a production environment.
4
+
5
+ ## 🏗️ Option 1: Docker (Recommended)
6
+
7
+ Docker is the best choice because it packages all the AI models, dependencies, and system libraries into a single container.
8
+
9
+ ### 1. Build the image
10
+ ```bash
11
+ docker build -t alldocex-app .
12
+ ```
13
+
14
+ ### 2. Run with Docker Compose
15
+ ```bash
16
+ docker-compose up -d
17
+ ```
18
+ The application will be available at `http://localhost:7860`, the port published by `docker-compose.yml` and exposed by the `Dockerfile`.
19
+
20
+ ---
21
+
22
+ ## ☁️ Option 2: Cloud Deployment (Render / Railway / Fly.io)
23
+
24
+ ### **Render Deployment (Recommended)**
25
+ 1. **Connect GitHub**: Push your code to a GitHub repository.
26
+ 2. **Create Web Service**: Select "Web Service" in Render.
27
+ 3. **Docker Environment**: Render will automatically detect the `Dockerfile`.
28
+ 4. **Resource Plan**: Ensure you select a plan with at least **4GB RAM** (e.g., Starter or Pro).
29
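+ 5. **Environment Variables**: The provided `Dockerfile` starts Uvicorn on port 7860, so configure the service port as 7860; a `PORT` variable has no effect unless the start command reads it.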
30
+
31
+ ---
32
+
33
+ ## 🖥️ Option 3: Manual Deployment (Ubuntu/Debian Server)
34
+
35
+ If you are deploying directly to a Linux VPS (without Docker):
36
+
37
+ ### 1. Install System Dependencies
38
+ ```bash
39
+ sudo apt-get update
40
+ sudo apt-get install -y tesseract-ocr libgl1 libglib2.0-0 build-essential python3-venv
41
+ ```
42
+
43
+ ### 2. Set Up Environment
44
+ ```bash
45
+ python3 -m venv venv
46
+ source venv/bin/activate
47
+ pip install -r requirements.txt
48
+ python -m spacy download en_core_web_sm
49
+ ```
50
+
51
+ ### 3. Run with Gunicorn (Production Server)
52
+ ```bash
53
+ pip install gunicorn
54
+ gunicorn -w 4 -k uvicorn.workers.UvicornWorker main:app --bind 0.0.0.0:8000
55
+ ```
56
+
57
+ ---
58
+
59
+ ## ⚠️ Important Considerations
60
+
61
+ * **RAM**: AI models (EasyOCR, Torch, spaCy) are memory-intensive. Do NOT deploy on a "Free Tier" instance with only 512MB or 1GB of RAM.
62
+ * **Disk Space**: The first time you run the app, it will download several hundred megabytes of model weights.
63
+ * **Permissions**: Ensure the `uploads/` directory has write permissions for the user running the application.
64
+ * **Reverse Proxy**: For public deployment, it is highly recommended to use **Nginx** as a reverse proxy with SSL (Let's Encrypt).
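Several files in this commit mention a `PORT` setting alongside a fixed container port. One way to keep them aligned is to resolve the port in a single place. The following is a minimal sketch, not part of this commit: `resolve_port` and `DEFAULT_PORT` are hypothetical names, and it assumes the application would read a `PORT` environment variable at startup.

```python
import os
from typing import Mapping, Optional

DEFAULT_PORT = 7860  # assumption: matches "EXPOSE 7860" in the Dockerfile


def resolve_port(env: Optional[Mapping[str, str]] = None) -> int:
    """Return the port from the PORT variable, falling back to DEFAULT_PORT."""
    env = os.environ if env is None else env
    raw = env.get("PORT", "").strip()
    if not raw:
        return DEFAULT_PORT
    port = int(raw)  # raises ValueError on non-numeric input
    if not 1 <= port <= 65535:
        raise ValueError(f"PORT out of range: {port}")
    return port
```

A start script could then pass `resolve_port()` to the server instead of hard-coding the number in more than one file.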
Dockerfile ADDED
@@ -0,0 +1,40 @@
1
+ # Use a lean Python 3.10 image as base
2
+ FROM python:3.10-slim
3
+
4
+ # Set environment variables
5
+ ENV PYTHONDONTWRITEBYTECODE=1
6
+ ENV PYTHONUNBUFFERED=1
7
+ ENV DEBIAN_FRONTEND=noninteractive
8
+
9
+ # Set working directory
10
+ WORKDIR /app
11
+
12
+ # Install system dependencies for OCR and NLP
13
+ RUN apt-get update && apt-get install -y \
14
+ tesseract-ocr \
15
+ libgl1 \
16
+ libglib2.0-0 \
17
+ build-essential \
18
+ && apt-get clean \
19
+ && rm -rf /var/lib/apt/lists/*
20
+
21
+ # Copy requirements file
22
+ COPY requirements.txt .
23
+
24
+ # Install Python dependencies
25
+ RUN pip install --no-cache-dir -r requirements.txt
26
+
27
+ # Download spaCy model during build to improve runtime performance
28
+ RUN python -m spacy download en_core_web_sm
29
+
30
+ # Create uploads directory
31
+ RUN mkdir -p /app/uploads
32
+
33
+ # Copy the rest of the application code
34
+ COPY . .
35
+
36
+ # Expose the API port for Hugging Face Spaces
37
+ EXPOSE 7860
38
+
39
+ # Start the application using Uvicorn
40
+ CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -1,11 +1,65 @@
1
- ---
2
- title: Text Extraction Api
3
- emoji: 🦀
4
- colorFrom: blue
5
- colorTo: indigo
6
- sdk: docker
7
- pinned: false
8
- license: mit
9
- ---
10
 
11
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
1
+ # Alldocex — Intelligent Document Processing System
2
+
3
+ ![Version](https://img.shields.io/badge/version-1.1.0-blue)
4
+ ![License](https://img.shields.io/badge/license-MIT-green)
5
+
6
+ **Alldocex** is a high-performance, professional-grade document intelligence platform that extracts, analyzes, and summarizes content from various document formats using state-of-the-art AI.
7
+
8
+ ## 🚀 Key Features
9
+
10
+ * **Multi-Format Extraction**: Supports PDF, DOCX, and high-resolution images (PNG, JPG, TIFF, etc.).
11
+ * **Layout-Aware PDF Engine**: Uses advanced 'layout' mode to preserve columns, tables, and physical text positioning.
12
+ * **Intelligent OCR**: Powered by **EasyOCR** (Deep Learning based) for superior accuracy in scanned documents.
13
+ * **Web URL Summarization**: Paste any web link to instantly extract and analyze its core content.
14
+ * **AI Analysis Suite**:
15
+ * **Extractive Summarization**: Condenses long documents into key highlights.
16
+ * **Named Entity Recognition (NER)**: Detects People, Organizations, Dates, and more via **spaCy**.
17
+ * **Sentiment Analysis**: Analyzes emotional tone using the **VADER** algorithm.
18
+ * **Downloadable Results**: Export extracted text as clean `.txt` files.
19
+ * **Corporate UI**: A professional Blue & White dashboard with smooth animations and intuitive navigation.
20
+
21
+ ## 🛠️ Technology Stack
22
+
23
+ * **Backend**: [FastAPI](https://fastapi.tiangolo.com/) (Async Python)
24
+ * **PDF Processing**: [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) (Layout Mode)
25
+ * **OCR**: [EasyOCR](https://github.com/JaidedAI/EasyOCR) & [Tesseract](https://github.com/tesseract-ocr/tesseract)
26
+ * **NLP**: [spaCy](https://spacy.io/) & [Sumy](https://github.com/miso-belica/sumy)
27
+ * **Frontend**: Vanilla HTML5, CSS3 (Modern UI), and JavaScript (ES6+)
28
+
29
+ ## 📦 Installation
30
 
31
+ ### 1. Clone the repository
32
+ ```bash
33
+ git clone <your-repo-url>
34
+ cd guvi-extraction
35
+ ```
36
+
37
+ ### 2. Install dependencies
38
+ ```bash
39
+ pip install -r requirements.txt
40
+ ```
41
+
42
+ ### 3. Install NLP model
43
+ ```bash
44
+ python -m spacy download en_core_web_sm
45
+ ```
46
+
47
+ ## 🏃 Getting Started
48
+
49
+ 1. Start the backend server:
50
+ ```bash
51
+ python main.py
52
+ ```
53
+ 2. Open your browser and navigate to:
54
+ `http://localhost:7860`
56
+
57
+ ## 📘 Usage
58
+
59
+ 1. **Direct Upload**: Drag and drop your PDFs or images into the dashboard.
60
+ 2. **Format Selection**: Click on specific badges (PDF, PNG, JPG) to open a filtered file picker.
61
+ 3. **URL Entry**: Paste a web link to summarize online articles instantly.
62
+ 4. **Download**: Once processing is complete, use the **Download** button to save the extracted text.
63
+
64
+
65
+ ---
analyzers/__init__.py ADDED
@@ -0,0 +1 @@
1
+ # Analyzers package
analyzers/ner_extractor.py ADDED
@@ -0,0 +1,145 @@
1
+ """
2
+ Named Entity Recognition using spaCy.
3
+ Extracts persons, organizations, dates, monetary amounts, locations, and more.
4
+ Also uses regex patterns for additional entity types.
5
+ """
6
+ import re
7
+ from collections import Counter
8
+ from typing import List, Dict
9
+ from models.schemas import Entity, EntityResult
10
+ from config import SPACY_MODEL, NER_ENTITY_TYPES
11
+
12
+ # Try to load spaCy model
13
+ try:
14
+ import spacy
15
+ nlp = spacy.load(SPACY_MODEL)
16
+ SPACY_AVAILABLE = True
17
+ except (ImportError, OSError):
18
+ SPACY_AVAILABLE = False
19
+ nlp = None
20
+
21
+ # Entity label descriptions
22
+ LABEL_DESCRIPTIONS = {
23
+ "PERSON": "Person name",
24
+ "ORG": "Organization",
25
+ "GPE": "Country / City / State",
26
+ "DATE": "Date or period",
27
+ "MONEY": "Monetary value",
28
+ "TIME": "Time expression",
29
+ "PERCENT": "Percentage",
30
+ "EVENT": "Named event",
31
+ "PRODUCT": "Product name",
32
+ "LAW": "Law or regulation",
33
+ "NORP": "Nationality / Group",
34
+ "FAC": "Facility / Building",
35
+ "LOC": "Non-GPE location",
36
+ "WORK_OF_ART": "Title of work",
37
+ "LANGUAGE": "Language name",
38
+ "CARDINAL": "Number",
39
+ "ORDINAL": "Ordinal number",
40
+ "QUANTITY": "Measurement",
41
+ "EMAIL": "Email address",
42
+ "PHONE": "Phone number",
43
+ "URL": "Web URL",
44
+ }
45
+
46
+ # Regex patterns for additional entity types
47
+ REGEX_PATTERNS = {
48
+ "EMAIL": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
49
+ "PHONE": r'(?:\+?\d{1,3}[-.\s]?)?\(?\d{2,4}\)?[-.\s]?\d{3,4}[-.\s]?\d{3,4}',
50
+ "URL": r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+[/\w\-._~:/?#\[\]@!$&\'()*+,;=%]*',
51
+ }
52
+
53
+
54
+ def _extract_regex_entities(text: str) -> List[Entity]:
55
+ """Extract entities using regex patterns."""
56
+ entities = []
57
+ for label, pattern in REGEX_PATTERNS.items():
58
+ matches = re.findall(pattern, text)
59
+ if matches:
60
+ counted = Counter(matches)
61
+ for match_text, count in counted.most_common():
62
+ entities.append(Entity(
63
+ text=match_text,
64
+ label=label,
65
+ label_description=LABEL_DESCRIPTIONS.get(label, label),
66
+ count=count,
67
+ ))
68
+ return entities
69
+
70
+
71
+ def _extract_spacy_entities(text: str) -> List[Entity]:
72
+ """Extract entities using spaCy NER."""
73
+ if not SPACY_AVAILABLE or nlp is None:
74
+ return []
75
+
76
+ # Process text (long texts are truncated to the first max_length characters)
77
+ max_length = 100000
78
+ if len(text) > max_length:
79
+ text = text[:max_length]
80
+
81
+ doc = nlp(text)
82
+
83
+ # Collect and deduplicate entities
84
+ entity_map: Dict[str, Dict] = {}
85
+
86
+ for ent in doc.ents:
87
+ if ent.label_ not in NER_ENTITY_TYPES:
88
+ continue
89
+
90
+ clean_text = ent.text.strip()
91
+ if not clean_text or len(clean_text) < 2:
92
+ continue
93
+
94
+ key = f"{ent.label_}:{clean_text.lower()}"
95
+ if key in entity_map:
96
+ entity_map[key]["count"] += 1
97
+ entity_map[key]["positions"].append(ent.start_char)
98
+ else:
99
+ entity_map[key] = {
100
+ "text": clean_text,
101
+ "label": ent.label_,
102
+ "label_description": LABEL_DESCRIPTIONS.get(ent.label_, ent.label_),
103
+ "count": 1,
104
+ "positions": [ent.start_char],
105
+ }
106
+
107
+ # Convert to Entity objects and sort by count
108
+ entities = [
109
+ Entity(**data)
110
+ for data in sorted(entity_map.values(), key=lambda x: x["count"], reverse=True)
111
+ ]
112
+
113
+ return entities
114
+
115
+
116
+ def extract_entities(text: str) -> EntityResult:
117
+ """
118
+ Extract named entities from text using spaCy and regex patterns.
119
+
120
+ Args:
121
+ text: The input text to analyze.
122
+
123
+ Returns:
124
+ EntityResult with all found entities and statistics.
125
+ """
126
+ if not text.strip():
127
+ return EntityResult(entities=[], entity_counts={}, total_entities=0)
128
+
129
+ # Get entities from both sources
130
+ spacy_entities = _extract_spacy_entities(text)
131
+ regex_entities = _extract_regex_entities(text)
132
+
133
+ # Combine (spaCy entities first, then regex)
134
+ all_entities = spacy_entities + regex_entities
135
+
136
+ # Count by category
137
+ entity_counts: Dict[str, int] = {}
138
+ for ent in all_entities:
139
+ entity_counts[ent.label] = entity_counts.get(ent.label, 0) + ent.count
140
+
141
+ return EntityResult(
142
+ entities=all_entities,
143
+ entity_counts=entity_counts,
144
+ total_entities=sum(ent.count for ent in all_entities),
145
+ )
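`_extract_spacy_entities` caps its input at 100,000 characters, so entities that appear later in a long document are never seen. One possible alternative, sketched here under assumptions, is to split the text into chunks at space boundaries and run the pipeline on each chunk. `chunk_text` is a hypothetical helper and is not part of this commit.

```python
from typing import Iterator


def chunk_text(text: str, max_length: int = 100_000) -> Iterator[str]:
    """Yield consecutive pieces of `text`, each at most `max_length` characters.

    When a cut is needed, it is made after the last space inside the window
    (if there is one) so that words are not split in half.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    start = 0
    while start < len(text):
        end = min(start + max_length, len(text))
        if end < len(text):
            cut = text.rfind(" ", start, end)
            if cut > start:
                end = cut + 1  # keep the space with the left-hand chunk
        yield text[start:end]
        start = end
```

If this were combined with spaCy, each chunk's starting offset would have to be added to `ent.start_char` so that the recorded positions still refer to the full text.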
analyzers/sentiment.py ADDED
@@ -0,0 +1,103 @@
1
+ """
2
+ Sentiment analysis using NLTK's VADER (Valence Aware Dictionary and sEntiment Reasoner).
3
+ Provides both overall and sentence-level sentiment analysis.
4
+ """
5
+ import nltk
6
+ from nltk.sentiment.vader import SentimentIntensityAnalyzer
7
+ from nltk.tokenize import sent_tokenize
8
+ from models.schemas import SentimentResult, SentimentBreakdown
9
+ from config import SENTIMENT_THRESHOLDS
10
+ from typing import List
11
+
12
+ # Download required NLTK data
13
+ try:
14
+ nltk.data.find("sentiment/vader_lexicon.zip")
15
+ except LookupError:
16
+ nltk.download("vader_lexicon", quiet=True)
17
+
18
+ try:
19
+ nltk.data.find("tokenizers/punkt_tab")
20
+ except LookupError:
21
+ nltk.download("punkt_tab", quiet=True)
22
+
23
+ # Initialize analyzer
24
+ sia = SentimentIntensityAnalyzer()
25
+
26
+
27
+ def _get_sentiment_label(compound: float) -> str:
28
+ """Convert compound score to human-readable label."""
29
+ if compound >= 0.5:
30
+ return "Very Positive"
31
+ elif compound >= SENTIMENT_THRESHOLDS["positive"]:
32
+ return "Positive"
33
+ elif compound <= -0.5:
34
+ return "Very Negative"
35
+ elif compound <= SENTIMENT_THRESHOLDS["negative"]:
36
+ return "Negative"
37
+ else:
38
+ return "Neutral"
39
+
40
+
41
+ def analyze_sentiment(text: str) -> SentimentResult:
42
+ """
43
+ Perform sentiment analysis on the given text.
44
+
45
+ Returns overall sentiment scores and sentence-level breakdown.
46
+
47
+ Args:
48
+ text: The input text to analyze.
49
+
50
+ Returns:
51
+ SentimentResult with overall and per-sentence sentiment analysis.
52
+ """
53
+ if not text.strip():
54
+ return SentimentResult(
55
+ overall_compound=0.0,
56
+ overall_positive=0.0,
57
+ overall_negative=0.0,
58
+ overall_neutral=1.0,
59
+ overall_label="Neutral",
60
+ sentence_breakdown=[],
61
+ confidence=0.0,
62
+ )
63
+
64
+ # Overall sentiment
65
+ overall_scores = sia.polarity_scores(text)
66
+
67
+ # Sentence-level breakdown
68
+ sentences = sent_tokenize(text)
69
+ sentence_breakdown: List[SentimentBreakdown] = []
70
+
71
+ # Limit to first 50 sentences for performance
72
+ for sent in sentences[:50]:
73
+ sent = sent.strip()
74
+ if not sent or len(sent) < 5:
75
+ continue
76
+
77
+ scores = sia.polarity_scores(sent)
78
+ sentence_breakdown.append(SentimentBreakdown(
79
+ text=sent[:200], # Truncate very long sentences
80
+ compound=round(scores["compound"], 4),
81
+ positive=round(scores["pos"], 4),
82
+ negative=round(scores["neg"], 4),
83
+ neutral=round(scores["neu"], 4),
84
+ label=_get_sentiment_label(scores["compound"]),
85
+ ))
86
+
87
+ # Estimate confidence from the average magnitude of the sentence compound scores
88
+ if sentence_breakdown:
89
+ compounds = [sb.compound for sb in sentence_breakdown]
90
+ avg_magnitude = sum(abs(c) for c in compounds) / len(compounds)
91
+ confidence = min(avg_magnitude * 2, 1.0) # Scale to 0-1
92
+ else:
93
+ confidence = abs(overall_scores["compound"])
94
+
95
+ return SentimentResult(
96
+ overall_compound=round(overall_scores["compound"], 4),
97
+ overall_positive=round(overall_scores["pos"], 4),
98
+ overall_negative=round(overall_scores["neg"], 4),
99
+ overall_neutral=round(overall_scores["neu"], 4),
100
+ overall_label=_get_sentiment_label(overall_scores["compound"]),
101
+ sentence_breakdown=sentence_breakdown,
102
+ confidence=round(confidence, 4),
103
+ )
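The labeling and confidence logic in `analyze_sentiment` can be exercised without NLTK. The sketch below restates the thresholds from `_get_sentiment_label` (with 0.05 and -0.05 taken from `SENTIMENT_THRESHOLDS` in `config.py`) and the confidence heuristic: the mean absolute compound score, doubled and capped at 1.0. The names `label_for` and `confidence_from` are hypothetical, and returning 0.0 for an empty list is a simplification; the committed code falls back to the overall compound score in that case.

```python
from typing import Sequence

POSITIVE_THRESHOLD = 0.05   # mirrors SENTIMENT_THRESHOLDS["positive"]
NEGATIVE_THRESHOLD = -0.05  # mirrors SENTIMENT_THRESHOLDS["negative"]


def label_for(compound: float) -> str:
    """Map a VADER compound score in [-1, 1] to a five-level label."""
    if compound >= 0.5:
        return "Very Positive"
    if compound >= POSITIVE_THRESHOLD:
        return "Positive"
    if compound <= -0.5:
        return "Very Negative"
    if compound <= NEGATIVE_THRESHOLD:
        return "Negative"
    return "Neutral"


def confidence_from(compounds: Sequence[float]) -> float:
    """Mean absolute compound score, doubled and capped at 1.0."""
    if not compounds:
        return 0.0  # simplification, see the note above
    avg_magnitude = sum(abs(c) for c in compounds) / len(compounds)
    return min(avg_magnitude * 2, 1.0)
```

Note that this heuristic measures how strongly the sentences are polarized, not whether they agree with each other: one strongly positive and one strongly negative sentence still yield a high value.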
analyzers/summarizer.py ADDED
@@ -0,0 +1,98 @@
1
+ """
2
+ Extractive text summarization using sumy library.
3
+ Uses LexRank algorithm by default for graph-based sentence ranking.
4
+ """
5
+ from sumy.parsers.plaintext import PlaintextParser
6
+ from sumy.nlp.tokenizers import Tokenizer
7
+ from sumy.summarizers.lex_rank import LexRankSummarizer
8
+ from sumy.summarizers.lsa import LsaSummarizer
9
+ from sumy.summarizers.luhn import LuhnSummarizer
10
+ from sumy.nlp.stemmers import Stemmer
11
+ from sumy.utils import get_stop_words
12
+ from models.schemas import SummaryResult
13
+ from config import SUMMARY_SENTENCE_COUNT, SUMMARY_ALGORITHM
14
+
15
+ LANGUAGE = "english"
16
+
17
+
18
+ def _get_summarizer(algorithm: str):
19
+ """Get the appropriate summarizer based on algorithm name."""
20
+ stemmer = Stemmer(LANGUAGE)
21
+
22
+ if algorithm == "lsa":
23
+ summarizer = LsaSummarizer(stemmer)
24
+ elif algorithm == "luhn":
25
+ summarizer = LuhnSummarizer(stemmer)
26
+ else: # default to lex-rank
27
+ summarizer = LexRankSummarizer(stemmer)
28
+
29
+ summarizer.stop_words = get_stop_words(LANGUAGE)
30
+ return summarizer
31
+
32
+
33
+ def summarize_text(text: str, sentence_count: int = None, algorithm: str = None) -> SummaryResult:
34
+ """
35
+ Generate an extractive summary of the given text.
36
+
37
+ Args:
38
+ text: The input text to summarize.
39
+ sentence_count: Number of sentences in the summary (default from config).
40
+ algorithm: Summarization algorithm to use (default from config).
41
+
42
+ Returns:
43
+ SummaryResult with the summary and statistics.
44
+ """
45
+ if sentence_count is None:
46
+ sentence_count = SUMMARY_SENTENCE_COUNT
47
+ if algorithm is None:
48
+ algorithm = SUMMARY_ALGORITHM
49
+
50
+ # Handle short texts
51
+ sentences_in_text = [s.strip() for s in text.replace("\n", " ").split(".") if s.strip()]
52
+ if len(sentences_in_text) <= sentence_count:
53
+ # Text is already short enough
54
+ clean_text = " ".join(text.split())
55
+ return SummaryResult(
56
+ summary=clean_text,
57
+ original_length=len(text),
58
+ summary_length=len(clean_text),
59
+ compression_ratio=1.0,
60
+ sentence_count=len(sentences_in_text),
61
+ algorithm=algorithm,
62
+ )
63
+
64
+ try:
65
+ # Parse the text
66
+ parser = PlaintextParser.from_string(text, Tokenizer(LANGUAGE))
67
+ summarizer = _get_summarizer(algorithm)
68
+
69
+ # Generate summary
70
+ summary_sentences = summarizer(parser.document, sentence_count)
71
+ summary = " ".join(str(sentence) for sentence in summary_sentences)
72
+
73
+ if not summary.strip():
74
+ # Fallback: return first N sentences
75
+ summary = ". ".join(sentences_in_text[:sentence_count]) + "."
76
+
77
+ compression_ratio = len(summary) / len(text) if len(text) > 0 else 1.0
78
+
79
+ return SummaryResult(
80
+ summary=summary,
81
+ original_length=len(text),
82
+ summary_length=len(summary),
83
+ compression_ratio=round(compression_ratio, 4),
84
+ sentence_count=sentence_count,
85
+ algorithm=algorithm,
86
+ )
87
+
88
+ except Exception as e:
89
+ # Fallback: return first few sentences
90
+ fallback = ". ".join(sentences_in_text[:sentence_count]) + "."
91
+ return SummaryResult(
92
+ summary=fallback,
93
+ original_length=len(text),
94
+ summary_length=len(fallback),
95
+ compression_ratio=round(len(fallback) / len(text), 4) if len(text) > 0 else 1.0,
96
+ sentence_count=sentence_count,
97
+ algorithm=f"{algorithm} (fallback)",
98
+ )
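`summarize_text` counts sentences by splitting on every period, which also splits decimals and version strings such as `1.1.0`. A regex-based splitter is one possible alternative. The sketch below assumes English text, and `split_sentences` is a hypothetical helper rather than part of this commit. It is still a heuristic: an abbreviation followed by a capitalized word (for example "Dr. Smith") will be split.

```python
import re
from typing import List

# A boundary is whitespace that follows ., ! or ? and precedes an
# uppercase letter, a digit, or an opening quote.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z0-9\"'])")


def split_sentences(text: str) -> List[str]:
    """Split text into sentences without breaking on periods inside numbers."""
    normalized = " ".join(text.split())  # collapse newlines and repeated spaces
    if not normalized:
        return []
    return [s for s in _SENTENCE_BOUNDARY.split(normalized) if s]
```

For example, `split_sentences("Version 1.1.0 is out.\nIt costs $3.50! Done?")` returns `["Version 1.1.0 is out.", "It costs $3.50!", "Done?"]`, whereas splitting on `"."` would produce fragments such as `"1"` and `"0 is out"`.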
config.py ADDED
@@ -0,0 +1,77 @@
1
+ """
2
+ Configuration settings for the Document Processing System.
3
+ """
4
+ import os
5
+ import shutil
6
+
7
+ # --- Paths ---
8
+ BASE_DIR = os.path.dirname(os.path.abspath(__file__))
9
+ UPLOAD_DIR = os.path.join(BASE_DIR, "uploads")
10
+ STATIC_DIR = os.path.join(BASE_DIR, "static")
11
+
12
+ # Create uploads directory if it doesn't exist
13
+ os.makedirs(UPLOAD_DIR, exist_ok=True)
14
+
15
+ # --- File Upload Settings ---
16
+ MAX_FILE_SIZE_MB = 50
17
+ MAX_FILE_SIZE_BYTES = MAX_FILE_SIZE_MB * 1024 * 1024
18
+ ALLOWED_EXTENSIONS = {
19
+ "pdf": "application/pdf",
20
+ "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
21
+ "png": "image/png",
22
+ "jpg": "image/jpeg",
23
+ "jpeg": "image/jpeg",
24
+ "tiff": "image/tiff",
25
+ "bmp": "image/bmp",
26
+ "webp": "image/webp",
27
+ }
28
+
29
+ # --- OCR Configuration ---
30
+ # EasyOCR settings
31
+ EASYOCR_LANGS = ["en"] # Languages to support
32
+ EASYOCR_GPU = False # Set to True if NVIDIA GPU is available and CUDA is installed
33
+
34
+ # Keep Tesseract as fallback if needed, but prioritize EasyOCR for accuracy
35
+ def find_tesseract():
36
+ """Locate the Tesseract binary: check PATH first, then common Windows install paths."""
37
+ import shutil
38
+ tesseract_in_path = shutil.which("tesseract")
39
+ if tesseract_in_path:
40
+ return tesseract_in_path
41
+
42
+ common_paths = [
43
+ r"C:\Program Files\Tesseract-OCR\tesseract.exe",
44
+ r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
45
+ r"C:\Users\{}\AppData\Local\Tesseract-OCR\tesseract.exe".format(os.getenv("USERNAME", "")),
46
+ ]
47
+ for path in common_paths:
48
+ if os.path.isfile(path):
49
+ return path
50
+ return None
51
+
52
+ TESSERACT_CMD = find_tesseract()
53
+ TESSERACT_LANG = "eng"
54
+
55
+ def check_ocr_availability():
56
+ """Check if any OCR engine is available."""
57
+ try:
58
+ import easyocr
59
+ return "available"
60
+ except ImportError:
61
+ if TESSERACT_CMD:
62
+ return "tesseract-only"
63
+ return "not-found"
64
+
65
+ # --- Summarization Settings ---
66
+ SUMMARY_SENTENCE_COUNT = 5
67
+ SUMMARY_ALGORITHM = "lex-rank" # Options: lex-rank, lsa, luhn (any other value falls back to lex-rank)
68
+
69
+ # --- NER Settings ---
70
+ SPACY_MODEL = "en_core_web_sm"
71
+ NER_ENTITY_TYPES = ["PERSON", "ORG", "DATE", "MONEY", "GPE", "EVENT", "PRODUCT", "LAW", "NORP"]
72
+
73
+ # --- Sentiment Settings ---
74
+ SENTIMENT_THRESHOLDS = {
75
+ "positive": 0.05,
76
+ "negative": -0.05,
77
+ }
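The upload settings above imply a validation step before any extractor runs. A minimal sketch of such a check follows, assuming validation by file extension and byte size only. `validate_upload` is a hypothetical function, and the two constants are copied from `config.py` so the block runs on its own.

```python
import os
from typing import Optional, Tuple

MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024  # copied from config.py (50 MB)
ALLOWED_EXTENSIONS = {  # copied from config.py
    "pdf": "application/pdf",
    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "png": "image/png",
    "jpg": "image/jpeg",
    "jpeg": "image/jpeg",
    "tiff": "image/tiff",
    "bmp": "image/bmp",
    "webp": "image/webp",
}


def validate_upload(filename: str, size_bytes: int) -> Tuple[bool, Optional[str]]:
    """Return (ok, error_message) for an uploaded file name and size."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    if ext not in ALLOWED_EXTENSIONS:
        return False, f"Unsupported file type: '{ext or 'none'}'"
    if size_bytes > MAX_FILE_SIZE_BYTES:
        return False, "File exceeds the maximum upload size"
    return True, None
```

An extension check alone does not confirm the file's actual content, so a production service would likely also inspect the file header before processing.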
docker-compose.yml ADDED
@@ -0,0 +1,13 @@
1
+ version: '3.8'
2
+
3
+ services:
4
+ alldocex:
5
+ build: .
6
+ container_name: alldocex-app
7
+ ports:
8
+ - "7860:7860"
9
+ volumes:
10
+ - ./uploads:/app/uploads
11
+ environment:
12
+ - PORT=7860
13
+ restart: always
extractors/__init__.py ADDED
@@ -0,0 +1 @@
1
+ # Extractors package
extractors/docx_extractor.py ADDED
@@ -0,0 +1,95 @@
1
+ """
2
+ DOCX text extraction using python-docx.
3
+ Extracts text preserving paragraph structure, tables, and document properties.
4
+ """
5
+ import time
6
+ import os
7
+ from docx import Document
8
+ from models.schemas import ExtractionResult, DocumentMetadata
9
+
10
+
11
+ def extract_docx(file_path: str) -> ExtractionResult:
12
+ """Extract text and metadata from a DOCX file."""
13
+ start_time = time.time()
14
+
15
+ try:
16
+ doc = Document(file_path)
17
+
18
+ # Extract paragraphs
19
+ paragraphs = []
20
+ for para in doc.paragraphs:
21
+ text = para.text.strip()
22
+ if text:
23
+ # Preserve heading structure
24
+ if para.style and para.style.name.startswith("Heading"):
25
+ level = para.style.name.replace("Heading ", "").strip()
26
+ prefix = "#" * int(level) if level.isdigit() else "##"
27
+ paragraphs.append(f"{prefix} {text}")
28
+ else:
29
+ paragraphs.append(text)
30
+
31
+ # Extract tables
32
+ tables_text = []
33
+ for table_idx, table in enumerate(doc.tables):
34
+ table_data = []
35
+ for row in table.rows:
36
+ row_data = [cell.text.strip() for cell in row.cells]
37
+ table_data.append(" | ".join(row_data))
38
+ if table_data:
39
+ tables_text.append(f"\n[Table {table_idx + 1}]\n" + "\n".join(table_data))
40
+
41
+ # Combine all text
42
+ full_text = "\n\n".join(paragraphs)
43
+ if tables_text:
44
+ full_text += "\n\n" + "\n".join(tables_text)
45
+
46
+ # Extract metadata from core properties
47
+ props = doc.core_properties
48
+ metadata = DocumentMetadata(
49
+ title=props.title or os.path.basename(file_path),
50
+ author=props.author or "Unknown",
51
+ creation_date=str(props.created) if props.created else "",
52
+ modification_date=str(props.modified) if props.modified else "",
53
+ page_count=None, # DOCX doesn't expose page count easily
54
+ word_count=len(full_text.split()) if full_text else 0,
55
+ character_count=len(full_text),
56
+ file_type="DOCX",
57
+ extra={
58
+ "category": props.category or "",
59
+ "comments": props.comments or "",
60
+ "last_modified_by": props.last_modified_by or "",
61
+ "revision": props.revision,
62
+ "subject": props.subject or "",
63
+ "keywords": props.keywords or "",
64
+ "paragraph_count": len(doc.paragraphs),
65
+ "table_count": len(doc.tables),
66
+ }
67
+ )
68
+
69
+ elapsed = (time.time() - start_time) * 1000
70
+
71
+ if not full_text.strip():
72
+ return ExtractionResult(
73
+ raw_text="",
74
+ metadata=metadata,
75
+ success=False,
76
+ error_message="No text content found in the DOCX file.",
77
+ extraction_time_ms=elapsed,
78
+ )
79
+
80
+ return ExtractionResult(
81
+ raw_text=full_text,
82
+ metadata=metadata,
83
+ success=True,
84
+ extraction_time_ms=elapsed,
85
+ )
86
+
87
+ except Exception as e:
88
+ elapsed = (time.time() - start_time) * 1000
89
+ return ExtractionResult(
90
+ raw_text="",
91
+ metadata=DocumentMetadata(file_type="DOCX"),
92
+ success=False,
93
+ error_message=f"DOCX extraction failed: {str(e)}",
94
+ extraction_time_ms=elapsed,
95
+ )
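The heading handling in `extract_docx` maps Word style names such as "Heading 2" to Markdown-style prefixes. That mapping can be isolated in a small helper for testing. In the sketch below, `heading_prefix` is a hypothetical name, it takes the style name as a plain string so it runs without python-docx, and the clamp to levels 1 through 6 is an added assumption (the committed code does not clamp).

```python
def heading_prefix(style_name: str) -> str:
    """Return a Markdown-style prefix for a Word style name, or "" for body text."""
    if not style_name.startswith("Heading"):
        return ""
    level = style_name.replace("Heading ", "").strip()
    if level.isdigit():
        # Assumption: clamp to the six heading levels Markdown defines.
        return "#" * min(max(int(level), 1), 6)
    return "##"  # same default as the committed code for unnumbered heading styles
```

For instance, `heading_prefix("Heading 3")` gives `"###"` and `heading_prefix("Normal")` gives an empty string, in which case the paragraph text would be kept as-is.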
extractors/ocr_extractor.py ADDED
@@ -0,0 +1,245 @@
1
+ """
2
+ Image OCR extraction using EasyOCR (primary) and Tesseract (fallback).
3
+ Includes advanced image preprocessing for maximum accuracy.
4
+ """
5
+ import time
6
+ import os
7
+ import numpy as np
8
+ from PIL import Image, ImageEnhance, ImageFilter, ImageOps
9
+ from models.schemas import ExtractionResult, DocumentMetadata
10
+ import config
11
+
12
+ # --- OCR Engine Detection ---
13
+
14
+ try:
15
+ import easyocr
16
+ EASYOCR_AVAILABLE = True
17
+ except ImportError:
18
+ EASYOCR_AVAILABLE = False
19
+
20
+ try:
21
+ import pytesseract
22
+ TESSERACT_AVAILABLE = True
23
+ except ImportError:
24
+ TESSERACT_AVAILABLE = False
25
+
26
+
27
+ # Global reader instance for EasyOCR (lazy loaded)
28
+ _EASY_READER = None
29
+
30
+ def get_easyocr_reader():
31
+ """Get or create the EasyOCR reader instance."""
32
+ global _EASY_READER
33
+ if _EASY_READER is None and EASYOCR_AVAILABLE:
34
+ try:
35
+ # Initialize with configured languages and GPU setting
36
+ _EASY_READER = easyocr.Reader(config.EASYOCR_LANGS, gpu=config.EASYOCR_GPU)
37
+ except Exception as e:
38
+ print(f"Error initializing EasyOCR: {e}")
39
+ return None
40
+ return _EASY_READER
41
+
42
+
43
+ def _configure_tesseract():
44
+ """Configure tesseract path from config."""
45
+ if config.TESSERACT_CMD and TESSERACT_AVAILABLE:
46
+ pytesseract.pytesseract.tesseract_cmd = config.TESSERACT_CMD
47
+ return True
48
+ elif TESSERACT_AVAILABLE:
49
+ try:
50
+ pytesseract.get_tesseract_version()
51
+ return True
52
+ except Exception:
53
+ return False
54
+ return False
55
+
56
+
57
+ def _preprocess_image(image: Image.Image) -> Image.Image:
58
+ """Preprocess image for maximum OCR accuracy."""
59
+ # 1. Convert to grayscale
60
+ if image.mode != "L":
61
+ image = image.convert("L")
62
+
63
+ # 2. Dynamic Contrast / Lighting correction
64
+ image = ImageOps.autocontrast(image)
65
+
66
+ # 3. Upscale small images (either side under 1500 px) by at least 2x
67
+ width, height = image.size
68
+ if width < 1500 or height < 1500:
69
+ scale = max(1800 / width, 1800 / height, 2.0)
70
+ new_size = (int(width * scale), int(height * scale))
71
+ image = image.resize(new_size, Image.Resampling.LANCZOS)
72
+
73
+ # 4. Sharpen, then boost contrast
74
+ image = image.filter(ImageFilter.SHARPEN)
75
+ enhancer = ImageEnhance.Contrast(image)
76
+ image = enhancer.enhance(1.8)
77
+
78
+ # 5. Denoising
79
+ image = image.filter(ImageFilter.MedianFilter(size=3))
80
+
81
+ return image
82
+
83
+
84
+ def _reconstruct_from_boxes(results: list) -> str:
85
+ """ Reconstruct text layout from bounding boxes.
86
+ Sort by top, then group by 'lines' based on y-coordinate.
87
+ """
88
+ if not results:
89
+ return ""
90
+
91
+ # Sort results by top y-coordinate
92
+ results.sort(key=lambda x: x[0][0][1])
93
+
94
+ lines = []
95
+ if results:
96
+ current_line = [results[0]]
97
+ for i in range(1, len(results)):
98
+ # If the current block's mid-y is within the previous block's height range
99
+ prev_box = results[i-1][0]
100
+ curr_box = results[i][0]
101
+
102
+ prev_y_center = (prev_box[0][1] + prev_box[2][1]) / 2
103
+ curr_y_center = (curr_box[0][1] + curr_box[2][1]) / 2
104
+
105
+ # Treat boxes as the same line when their centers differ by less than half the previous box's height
106
+ height = prev_box[2][1] - prev_box[0][1]
107
+ if abs(curr_y_center - prev_y_center) < (height * 0.5):
108
+ current_line.append(results[i])
109
+ else:
110
+ lines.append(current_line)
111
+ current_line = [results[i]]
112
+ lines.append(current_line)
113
+
114
+ final_text = []
115
+ for line in lines:
116
+ # Sort each line by left x-coordinate
117
+ line.sort(key=lambda x: x[0][0][0])
118
+ line_text = []
119
+ for i, res in enumerate(line):
120
+ # Add relative spacing based on horizontal gap
121
+ if i > 0:
122
+ prev_right = line[i-1][0][1][0]
123
+ curr_left = res[0][0][0]
124
+ gap = curr_left - prev_right
125
+ # If gap is significant, add spaces
126
+ char_width = (res[0][1][0] - res[0][0][0]) / (len(res[1]) or 1)
127
+ num_spaces = int(gap / (char_width * 1.5))
128
+ line_text.append(" " * max(1, num_spaces))
129
+
130
+ line_text.append(res[1])
131
+ final_text.append(" ".join(line_text))
132
+
133
+ return "\n".join(final_text)
134
+
135
+
136
+ def extract_image(file_path: str) -> ExtractionResult:
137
+ """Extract text from an image using the best available OCR engine."""
138
+ start_time = time.time()
139
+
140
+ # 1. Check for EasyOCR (Preferred)
141
+ if EASYOCR_AVAILABLE:
142
+ try:
143
+ reader = get_easyocr_reader()
144
+ if reader:
145
+ # Read the image size first; the metadata below reports it
146
+ with Image.open(file_path) as img:
147
+     original_size = img.size
148
+ # Run OCR on the original file, with thresholds tuned for numeric and tabular text
149
+ results = reader.readtext(
150
+ file_path,
151
+ detail=1,
152
+ paragraph=False, # We want individual boxes for layout reconstruction
153
+ width_ths=0.7, # Better for long numbers/strings
154
+ height_ths=0.7,
155
+ contrast_ths=0.3
156
+ )
157
+
158
+ # Reconstruct full layout from bounding boxes
159
+ text = _reconstruct_from_boxes(results)
160
+
161
+ if text.strip():
162
+ elapsed = (time.time() - start_time) * 1000
163
+ metadata = DocumentMetadata(
164
+ title=os.path.basename(file_path),
165
+ page_count=1,
166
+ word_count=len(text.split()),
167
+ character_count=len(text),
168
+ file_type="Image (EasyOCR)",
169
+ extra={
170
+ "image_width": original_size[0],
171
+ "image_height": original_size[1],
172
+ "ocr_engine": "EasyOCR",
173
+ "accuracy": "High (Deep Learning)"
174
+ }
175
+ )
176
+ return ExtractionResult(
177
+ raw_text=text.strip(),
178
+ metadata=metadata,
179
+ success=True,
180
+ extraction_time_ms=elapsed
181
+ )
182
+ except Exception as e:
183
+ print(f"EasyOCR extraction failed, falling back to Tesseract: {e}")
184
+
185
+ # 2. Fallback to Tesseract
186
+ if TESSERACT_AVAILABLE and _configure_tesseract():
187
+ try:
188
+ image = Image.open(file_path)
189
+ original_size = image.size
190
+ processed_image = _preprocess_image(image)
191
+
192
+ custom_config = f"--oem 3 --psm 6 -l {config.TESSERACT_LANG}"
193
+ text = pytesseract.image_to_string(processed_image, config=custom_config)
194
+
195
+ # Confidence
196
+ try:
197
+ data = pytesseract.image_to_data(processed_image, config=custom_config, output_type=pytesseract.Output.DICT)
198
+ confidences = [int(c) for c in data["conf"] if int(c) > 0]
199
+ avg_confidence = sum(confidences) / len(confidences) if confidences else 0
200
+ except Exception:
201
+ avg_confidence = 0
202
+
203
+ elapsed = (time.time() - start_time) * 1000
204
+ if text.strip():
205
+ metadata = DocumentMetadata(
206
+ title=os.path.basename(file_path),
207
+ page_count=1,
208
+ word_count=len(text.split()),
209
+ character_count=len(text),
210
+ file_type="Image (Tesseract)",
211
+ extra={
212
+ "image_width": original_size[0],
213
+ "image_height": original_size[1],
214
+ "ocr_confidence": round(avg_confidence, 2),
215
+ "ocr_engine": "Tesseract"
216
+ }
217
+ )
218
+ return ExtractionResult(
219
+ raw_text=text.strip(),
220
+ metadata=metadata,
221
+ success=True,
222
+ extraction_time_ms=elapsed
223
+ )
224
+ except Exception as e:
225
+ print(f"Tesseract extraction failed: {e}")
226
+
227
+ # 3. Failure cases
228
+ elapsed = (time.time() - start_time) * 1000
229
+
230
+ if not EASYOCR_AVAILABLE and not TESSERACT_AVAILABLE:
231
+ error_msg = "No OCR libraries installed. Please run 'pip install easyocr'."
232
+ elif not EASYOCR_AVAILABLE and TESSERACT_AVAILABLE:
233
+ error_msg = "EasyOCR is not installed, and Tesseract binary was not found or failed. Please run 'pip install easyocr' for best results."
234
+ elif EASYOCR_AVAILABLE and not TESSERACT_AVAILABLE:
235
+ error_msg = "EasyOCR failed to extract text, and Tesseract is not installed."
236
+ else:
237
+ error_msg = "OCR extraction failed. Both EasyOCR and Tesseract engines were unable to extract text from this image."
238
+
239
+ return ExtractionResult(
240
+ raw_text="",
241
+ metadata=DocumentMetadata(file_type="Image (OCR)"),
242
+ success=False,
243
+ error_message=error_msg,
244
+ extraction_time_ms=elapsed,
245
+ )
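
The engine order in `extract_image` (EasyOCR first, Tesseract as fallback, then an error result) can be sketched as a generic fallback chain. This is only an illustration under stated assumptions: `first_successful`, `broken_engine`, and `working_engine` are hypothetical names and are not part of this commit, and the engines here are plain callables rather than real OCR libraries.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical engine type: takes a file path and returns the extracted text.
Engine = Callable[[str], str]


def first_successful(
    engines: Sequence[Tuple[str, Engine]], file_path: str
) -> Tuple[Optional[str], str, List[str]]:
    """Try each (name, engine) pair in order; return (engine_name, text, errors)."""
    errors: List[str] = []
    for name, engine in engines:
        try:
            text = engine(file_path).strip()
        except Exception as exc:  # any engine failure falls through to the next one
            errors.append(f"{name}: {exc}")
            continue
        if text:
            return name, text, errors
        errors.append(f"{name}: no text found")
    return None, "", errors


def broken_engine(path: str) -> str:
    """Stand-in for a preferred engine that fails at runtime."""
    raise RuntimeError("model not loaded")


def working_engine(path: str) -> str:
    """Stand-in for a fallback engine that returns text."""
    return "  Invoice 42  "


name, text, errors = first_successful(
    [("primary", broken_engine), ("fallback", working_engine)], "scan.png"
)
print(name, repr(text), errors)
```

Collecting the per-engine errors in one list makes it possible to build a single failure message at the end instead of branching on availability flags.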
extractors/pdf_extractor.py ADDED
@@ -0,0 +1,79 @@
1
+ """
2
+ PDF text extraction using PyMuPDF (fitz).
3
+ Extracts text with layout preservation and document metadata.
4
+ """
5
+ import fitz # PyMuPDF
6
+ import time
7
+ import os
8
+ from models.schemas import ExtractionResult, DocumentMetadata
9
+
10
+
11
+ def extract_pdf(file_path: str) -> ExtractionResult:
12
+ """Extract text and metadata from a PDF file."""
13
+ start_time = time.time()
14
+
15
+ try:
16
+ doc = fitz.open(file_path)
17
+
18
+ # Extract text from all pages with full layout preservation
19
+ pages_text = []
20
+ for page_num in range(len(doc)):
21
+ page = doc[page_num]
22
+ # sort=True orders the text top-left to bottom-right, which approximates reading order.
23
+ # Note: "layout" is not among the documented Page.get_text() output options, so plain "text" is requested explicitly.
24
+ text = page.get_text("text", sort=True)
25
+ if text.strip():
26
+ pages_text.append(f"--- Page {page_num + 1} ---\n{text}")
27
+
28
+ full_text = "\n\n".join(pages_text)
29
+
30
+ # Extract metadata
31
+ meta = doc.metadata
32
+ metadata = DocumentMetadata(
33
+ title=meta.get("title", "") or os.path.basename(file_path),
34
+ author=meta.get("author", "") or "Unknown",
35
+ creation_date=meta.get("creationDate", ""),
36
+ modification_date=meta.get("modDate", ""),
37
+ page_count=len(doc),
38
+ word_count=len(full_text.split()) if full_text else 0,
39
+ character_count=len(full_text),
40
+ file_type="PDF",
41
+ extra={
42
+ "producer": meta.get("producer", ""),
43
+ "creator": meta.get("creator", ""),
44
+ "subject": meta.get("subject", ""),
45
+ "keywords": meta.get("keywords", ""),
46
+ "format": meta.get("format", ""),
47
+ "encryption": doc.is_encrypted,
48
+ }
49
+ )
50
+
51
+ doc.close()
52
+
53
+ elapsed = (time.time() - start_time) * 1000
54
+
55
+ if not full_text.strip():
56
+ return ExtractionResult(
57
+ raw_text="",
58
+ metadata=metadata,
59
+ success=False,
60
+ error_message="No extractable text found in PDF. The document may contain only images — try uploading as an image for OCR processing.",
61
+ extraction_time_ms=elapsed,
62
+ )
63
+
64
+ return ExtractionResult(
65
+ raw_text=full_text,
66
+ metadata=metadata,
67
+ success=True,
68
+ extraction_time_ms=elapsed,
69
+ )
70
+
71
+ except Exception as e:
72
+ elapsed = (time.time() - start_time) * 1000
73
+ return ExtractionResult(
74
+ raw_text="",
75
+ metadata=DocumentMetadata(file_type="PDF"),
76
+ success=False,
77
+ error_message=f"PDF extraction failed: {str(e)}",
78
+ extraction_time_ms=elapsed,
79
+ )
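
The page loop in `extract_pdf` skips blank pages but keeps the original page numbers in the markers. A minimal sketch of just that joining step, with PyMuPDF left out so it runs on plain strings; `join_pages` is a hypothetical helper name, not something defined in this commit.

```python
from typing import Iterable


def join_pages(page_texts: Iterable[str]) -> str:
    """Join per-page text with '--- Page N ---' markers, skipping blank pages."""
    sections = []
    for number, text in enumerate(page_texts, start=1):
        if text.strip():
            # The marker keeps the page's original position, even if earlier pages were blank.
            sections.append(f"--- Page {number} ---\n{text}")
    return "\n\n".join(sections)


sample = ["Intro\n", "   ", "Results\n"]
combined = join_pages(sample)
print(combined)
```

If every page is blank the result is an empty string, which is the case `extract_pdf` reports as "No extractable text found".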
extractors/url_extractor.py ADDED
@@ -0,0 +1,108 @@
1
+ """
2
+ Web content extraction from URLs using requests and BeautifulSoup4.
3
+ Extracts title and main text content from HTML pages.
4
+ """
5
+ import time
6
+ import requests
7
+ from bs4 import BeautifulSoup
8
+ from models.schemas import ExtractionResult, DocumentMetadata
9
+
10
+
11
+ def extract_url(url: str) -> ExtractionResult:
12
+ """Fetch and extract text content from a web URL."""
13
+ start_time = time.time()
14
+
15
+ try:
16
+ # 1. Fetch content
17
+ headers = {
18
+ 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
19
+ }
20
+ response = requests.get(url, headers=headers, timeout=10)
21
+ response.raise_for_status()
22
+
23
+ # 2. Parse HTML
24
+ soup = BeautifulSoup(response.text, 'html.parser')
25
+
26
+ # 3. Remove scripts, styles, and page chrome (nav, footer, header, aside)
27
+ for script_or_style in soup(["script", "style", "nav", "footer", "header", "aside"]):
28
+ script_or_style.decompose()
29
+
30
+ # 4. Get text
31
+ # Try to find the title
32
+ title = soup.title.string.strip() if soup.title and soup.title.string else url
33
+
34
+ # Get main text - simple heuristic: look for <article> or just <body>
35
+ content_area = soup.find('article') or soup.body
36
+ if not content_area:
37
+ content_area = soup
38
+
39
+ # Extract text while preserving some paragraph structure
40
+ lines = []
41
+ for element in content_area.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'li']):
42
+ text = element.get_text().strip()
43
+ if text:
44
+ if element.name.startswith('h'):
45
+ prefix = '#' * int(element.name[1])
46
+ lines.append(f"\n{prefix} {text}\n")
47
+ else:
48
+ lines.append(text)
49
+
50
+ full_text = "\n\n".join(lines)
51
+ if not full_text.strip():
52
+ # Fallback to general text extraction
53
+ full_text = soup.get_text(separator='\n\n', strip=True)
54
+
55
+ # 5. Build metadata
56
+ metadata = DocumentMetadata(
57
+ title=title,
58
+ author="Web Content",
59
+ creation_date="",
60
+ modification_date="",
61
+ page_count=None,
62
+ word_count=len(full_text.split()),
63
+ character_count=len(full_text),
64
+ file_type="URL",
65
+ extra={
66
+ "url": url,
67
+ "domain": url.split('/')[2] if '//' in url else url.split('/')[0],
68
+ "status_code": response.status_code,
69
+ "content_type": response.headers.get('Content-Type', '')
70
+ }
71
+ )
72
+
73
+ elapsed = (time.time() - start_time) * 1000
74
+
75
+ if not full_text.strip():
76
+ return ExtractionResult(
77
+ raw_text="",
78
+ metadata=metadata,
79
+ success=False,
80
+ error_message="Could not extract any meaningful text from the provided URL.",
81
+ extraction_time_ms=elapsed,
82
+ )
83
+
84
+ return ExtractionResult(
85
+ raw_text=full_text,
86
+ metadata=metadata,
87
+ success=True,
88
+ extraction_time_ms=elapsed,
89
+ )
90
+
91
+ except requests.exceptions.RequestException as e:
92
+ elapsed = (time.time() - start_time) * 1000
93
+ return ExtractionResult(
94
+ raw_text="",
95
+ metadata=DocumentMetadata(file_type="URL", extra={"url": url}),
96
+ success=False,
97
+ error_message=f"Failed to fetch URL: {str(e)}",
98
+ extraction_time_ms=elapsed,
99
+ )
100
+ except Exception as e:
101
+ elapsed = (time.time() - start_time) * 1000
102
+ return ExtractionResult(
103
+ raw_text="",
104
+ metadata=DocumentMetadata(file_type="URL", extra={"url": url}),
105
+ success=False,
106
+ error_message=f"Web extraction failed: {str(e)}",
107
+ extraction_time_ms=elapsed,
108
+ )
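
The `domain` value above is taken with `url.split('/')[2]`, which also returns any port or credentials embedded in the URL. A hedged alternative using only the standard library is sketched below; `domain_of` and `is_http_url` are hypothetical helper names, not part of this commit.

```python
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lowercase host of a URL, or '' when no host is present."""
    return (urlparse(url).hostname or "").lower()


def is_http_url(url: str) -> bool:
    """Accept only http/https URLs that actually carry a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.hostname)


print(domain_of("https://Example.com:8080/a/b?x=1"))
print(is_http_url("ftp://example.com"))
```

`urlparse(...).hostname` strips the port and any `user:password@` prefix, so the stored domain stays comparable across URLs.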
main.py ADDED
@@ -0,0 +1,307 @@
1
+ """
2
+ Intelligent Document Processing System
3
+ FastAPI backend with async document processing.
4
+ """
5
+ import os
6
+ import uuid
7
+ import time
8
+ import asyncio
9
+ from typing import Dict
10
+ from fastapi import FastAPI, UploadFile, File, HTTPException
11
+ from fastapi.staticfiles import StaticFiles
12
+ from fastapi.responses import FileResponse, JSONResponse
13
+ from fastapi.middleware.cors import CORSMiddleware
14
+
15
+ from config import UPLOAD_DIR, STATIC_DIR, MAX_FILE_SIZE_BYTES, ALLOWED_EXTENSIONS
16
+ from models.schemas import (
17
+ UploadResponse, ProcessingResult, TaskStatus,
18
+ ExtractionResult, DocumentMetadata,
19
+ )
20
+ from extractors.pdf_extractor import extract_pdf
21
+ from extractors.docx_extractor import extract_docx
22
+ from extractors.ocr_extractor import extract_image
23
+ from extractors.url_extractor import extract_url
24
+ from analyzers.summarizer import summarize_text
25
+ from analyzers.ner_extractor import extract_entities
26
+ from analyzers.sentiment import analyze_sentiment
27
+
28
+ # --- App Setup ---
29
+ app = FastAPI(
30
+ title="Alldocex - Intelligent Document Processing",
31
+ description="Extract, analyze, and summarize content from PDF, DOCX, image files, and web URLs using AI.",
32
+ version="1.0.0",
33
+ )
34
+
35
+ app.add_middleware(
36
+ CORSMiddleware,
37
+ allow_origins=["*"],
38
+ allow_credentials=True,
39
+ allow_methods=["*"],
40
+ allow_headers=["*"],
41
+ )
42
+
43
+ # In-memory task store
44
+ tasks: Dict[str, ProcessingResult] = {}
45
+
46
+ # --- Utility Functions ---
47
+
48
+ def _human_readable_size(size_bytes: int) -> str:
49
+ """Convert bytes to human readable string."""
50
+ for unit in ["B", "KB", "MB", "GB"]:
51
+ if size_bytes < 1024:
52
+ return f"{size_bytes:.1f} {unit}"
53
+ size_bytes /= 1024
54
+ return f"{size_bytes:.1f} TB"
55
+
56
+
57
+ def _get_file_type(filename: str) -> str:
58
+ """Determine file type from extension."""
59
+ ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
60
+ if ext == "pdf":
61
+ return "pdf"
62
+ elif ext == "docx":
63
+ return "docx"
64
+ elif ext in ("png", "jpg", "jpeg", "tiff", "bmp", "webp"):
65
+ return "image"
66
+ return "unknown"
67
+
68
+
69
+ def _process_document(file_path: str, file_type: str, task_id: str):
70
+ """
71
+ Process a document: extract text, then run all analyzers.
72
+ This runs in a thread pool to avoid blocking the event loop.
73
+ """
74
+ start_time = time.time()
75
+ task = tasks[task_id]
76
+ task.status = TaskStatus.PROCESSING
77
+
78
+ try:
79
+ # Step 1: Extract text based on file type
80
+ if file_type == "pdf":
81
+ extraction = extract_pdf(file_path)
82
+ elif file_type == "docx":
83
+ extraction = extract_docx(file_path)
84
+ elif file_type == "image":
85
+ extraction = extract_image(file_path)
86
+ elif file_type == "url":
87
+ # file_path is the URL string here
88
+ extraction = extract_url(file_path)
89
+ else:
90
+ raise ValueError(f"Unsupported file type: {file_type}")
91
+
92
+ task.extraction = extraction
93
+
94
+ if not extraction.success or not extraction.raw_text.strip():
95
+ task.status = TaskStatus.COMPLETED
96
+ task.error_message = extraction.error_message or "No text could be extracted."
97
+ task.processing_time_ms = (time.time() - start_time) * 1000
98
+ return
99
+
100
+ raw_text = extraction.raw_text
101
+
102
+ # Step 2: Summarization
103
+ try:
104
+ task.summary = summarize_text(raw_text)
105
+ except Exception as e:
106
+ print(f"Summarization error: {e}")
107
+
108
+ # Step 3: Named Entity Recognition
109
+ try:
110
+ task.entities = extract_entities(raw_text)
111
+ except Exception as e:
112
+ print(f"NER error: {e}")
113
+
114
+ # Step 4: Sentiment Analysis
115
+ try:
116
+ task.sentiment = analyze_sentiment(raw_text)
117
+ except Exception as e:
118
+ print(f"Sentiment error: {e}")
119
+
120
+ task.status = TaskStatus.COMPLETED
121
+ task.processing_time_ms = (time.time() - start_time) * 1000
122
+
123
+ except Exception as e:
124
+ task.status = TaskStatus.ERROR
125
+ task.error_message = str(e)
126
+ task.processing_time_ms = (time.time() - start_time) * 1000
127
+
128
+ finally:
129
+ # Clean up uploaded file
130
+ try:
131
+ if os.path.exists(file_path):
132
+ os.remove(file_path)
133
+ except Exception:
134
+ pass
135
+
136
+
137
+ # --- API Endpoints ---
138
+
139
+ @app.post("/api/upload", response_model=ProcessingResult)
140
+ async def upload_and_process(file: UploadFile = File(...)):
141
+ """
142
+ Upload a document and start processing.
143
+ Supports PDF, DOCX, PNG, JPG, JPEG, TIFF, BMP, WEBP.
144
+ """
145
+ # Validate file extension
146
+ filename = file.filename or "unknown"
147
+ ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
148
+ if ext not in ALLOWED_EXTENSIONS:
149
+ raise HTTPException(
150
+ status_code=400,
151
+ detail=f"Unsupported file type: .{ext}. Supported: {', '.join(ALLOWED_EXTENSIONS.keys())}"
152
+ )
153
+
154
+ # Read file content
155
+ content = await file.read()
156
+ file_size = len(content)
157
+
158
+ # Validate file size
159
+ if file_size > MAX_FILE_SIZE_BYTES:
160
+ raise HTTPException(
161
+ status_code=400,
162
+ detail=f"File too large. Maximum size: {MAX_FILE_SIZE_BYTES // (1024*1024)}MB"
163
+ )
164
+
165
+ if file_size == 0:
166
+ raise HTTPException(status_code=400, detail="Empty file uploaded.")
167
+
168
+ # Save file
169
+ file_id = str(uuid.uuid4())[:8]
170
+ safe_filename = f"{file_id}_{os.path.basename(filename)}"  # drop any client-supplied directory parts
171
+ file_path = os.path.join(UPLOAD_DIR, safe_filename)
172
+
173
+ with open(file_path, "wb") as f:
174
+ f.write(content)
175
+
176
+ # Determine file type
177
+ file_type = _get_file_type(filename)
178
+
179
+ # Create task
180
+ task = ProcessingResult.create_pending(
181
+ file_id=file_id,
182
+ filename=filename,
183
+ file_type=file_type,
184
+ )
185
+ tasks[file_id] = task
186
+
187
+ # Start async processing in a thread
188
+ asyncio.get_running_loop().run_in_executor(
189
+ None, _process_document, file_path, file_type, file_id
190
+ )
191
+
192
+ return task
193
+
194
+
195
+ @app.post("/api/extract/url", response_model=ProcessingResult)
196
+ async def extract_from_url(data: Dict[str, str]):
197
+ """
198
+ Extract content from a web URL and process it.
199
+ """
200
+ url = data.get("url")
201
+ if not url:
202
+ raise HTTPException(status_code=400, detail="URL is required.")
203
+
204
+ if not url.startswith(("http://", "https://")):
205
+ raise HTTPException(status_code=400, detail="Invalid URL format. Must start with http:// or https://")
206
+
207
+ # Create task
208
+ file_id = str(uuid.uuid4())[:8]
209
+ # For URLs, we use the domain as the "filename"
210
+ filename = url.split('/')[2] if '//' in url else url.split('/')[0]
211
+
212
+ task = ProcessingResult.create_pending(
213
+ file_id=file_id,
214
+ filename=filename,
215
+ file_type="url",
216
+ )
217
+ tasks[file_id] = task
218
+
219
+ # Start async processing in a thread
220
+ asyncio.get_running_loop().run_in_executor(
221
+ None, _process_document, url, "url", file_id
222
+ )
223
+
224
+ return task
225
+
226
+
227
+ @app.get("/api/status/{task_id}")
228
+ async def get_task_status(task_id: str):
229
+ """Get the processing status and results for a task."""
230
+ if task_id not in tasks:
231
+ raise HTTPException(status_code=404, detail="Task not found.")
232
+ return tasks[task_id]
233
+
234
+
235
+ @app.get("/api/download/{task_id}")
236
+ async def download_results(task_id: str):
237
+ """Download the extracted text as a .txt file."""
238
+ if task_id not in tasks:
239
+ raise HTTPException(status_code=404, detail="Task not found.")
240
+
241
+ task = tasks[task_id]
242
+ if not task.extraction or not task.extraction.raw_text:
243
+ raise HTTPException(status_code=400, detail="No text available for download.")
244
+
245
+ # Create temporary file path
246
+ filename = f"extracted_{os.path.basename(task.filename)}.txt"
247
+ temp_path = os.path.join(UPLOAD_DIR, filename)
248
+
249
+ try:
250
+ with open(temp_path, "w", encoding="utf-8") as f:
251
+ f.write(task.extraction.raw_text)
252
+
253
+ return FileResponse(
254
+ temp_path,
255
+ filename=filename,
256
+ media_type="text/plain",
257
+ background=None # Note: ideally we'd use BackgroundTask to delete this file later
258
+ )
259
+ except Exception as e:
260
+ raise HTTPException(status_code=500, detail=f"Failed to generate download: {str(e)}")
261
+
262
+
263
+ @app.get("/api/health")
264
+ async def health_check():
265
+ """Health check endpoint."""
266
+ from config import check_ocr_availability
267
+
268
+ # Check OCR status
269
+ ocr_status = check_ocr_availability()
270
+
271
+ # Check spaCy
272
+ try:
273
+ import spacy
274
+ spacy.load("en_core_web_sm")
275
+ spacy_status = "available"
276
+ except Exception:
277
+ spacy_status = "not available"
278
+
279
+ return {
280
+ "status": "healthy",
281
+ "ocr": ocr_status,
282
+ "tesseract": "available" if ocr_status in ("available", "tesseract-only") else "not found",
283
+ "spacy": spacy_status,
284
+ "version": "1.1.0",
285
+ }
286
+
287
+
288
+ # --- Static Files ---
289
+
290
+ # Serve the main page
291
+ @app.get("/")
292
+ async def serve_index():
293
+ index_path = os.path.join(STATIC_DIR, "index.html")
294
+ if os.path.exists(index_path):
295
+ return FileResponse(index_path)
296
+ return JSONResponse({"message": "Alldocex API is running. Frontend not found."})
297
+
298
+
299
+ # Mount static files
300
+ app.mount("/static", StaticFiles(directory=STATIC_DIR), name="static")
301
+
302
+
303
+ if __name__ == "__main__":
304
+ import uvicorn
305
+ print("\n🚀 Alldocex - Intelligent Document Processing System")
306
+ print("📄 Open http://localhost:7860 in your browser\n")
307
+ uvicorn.run(app, host="0.0.0.0", port=7860)
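
The upload handler builds its on-disk name from the client-supplied filename. A minimal sketch of a stricter sanitizer follows, assuming a conservative ASCII character set is acceptable for stored files; `safe_upload_name` is a hypothetical helper and not part of this commit.

```python
import os
import re
import uuid


def safe_upload_name(client_filename: str) -> str:
    """Build a storage name from an untrusted client filename."""
    # Drop directory components, including Windows-style separators.
    base = os.path.basename(client_filename.replace("\\", "/"))
    # Keep a conservative character set; everything else becomes "_".
    # Leading dots are removed so the result is never "." or "..".
    base = re.sub(r"[^A-Za-z0-9._-]", "_", base).lstrip(".") or "upload"
    # Prefix with a short random id so repeated uploads do not collide.
    return f"{uuid.uuid4().hex[:8]}_{base}"


print(safe_upload_name("../../etc/passwd"))
print(safe_upload_name("report final.pdf"))
```

The random prefix plays the same role as `file_id` in `upload_and_process`; only the trailing part is derived from user input.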
models/__init__.py ADDED
@@ -0,0 +1 @@
1
+ # Models package
models/schemas.py ADDED
@@ -0,0 +1,116 @@
1
+ """
2
+ Pydantic models for request/response schemas.
3
+ """
4
+ from pydantic import BaseModel
5
+ from typing import Optional, List, Dict, Any
6
+ from enum import Enum
7
+ import time
8
+ import uuid
9
+
10
+
11
+ class TaskStatus(str, Enum):
12
+ PENDING = "pending"
13
+ PROCESSING = "processing"
14
+ COMPLETED = "completed"
15
+ ERROR = "error"
16
+
17
+
18
+ class FileType(str, Enum):
19
+ PDF = "pdf"
20
+ DOCX = "docx"
21
+ IMAGE = "image"
22
+
23
+
24
+ class UploadResponse(BaseModel):
25
+ file_id: str
26
+ filename: str
27
+ file_type: str
28
+ size_bytes: int
29
+ size_human: str
30
+ message: str
31
+
32
+
33
+ class DocumentMetadata(BaseModel):
34
+ title: Optional[str] = None
35
+ author: Optional[str] = None
36
+ creation_date: Optional[str] = None
37
+ modification_date: Optional[str] = None
38
+ page_count: Optional[int] = None
39
+ word_count: int = 0
40
+ character_count: int = 0
41
+ file_type: str = ""
42
+ extra: Dict[str, Any] = {}
43
+
44
+
45
+ class ExtractionResult(BaseModel):
46
+ raw_text: str
47
+ metadata: DocumentMetadata
48
+ success: bool = True
49
+ error_message: Optional[str] = None
50
+ extraction_time_ms: float = 0
51
+
52
+
53
+ class SummaryResult(BaseModel):
54
+ summary: str
55
+ original_length: int
56
+ summary_length: int
57
+ compression_ratio: float
58
+ sentence_count: int
59
+ algorithm: str
60
+
61
+
62
+ class Entity(BaseModel):
63
+ text: str
64
+ label: str
65
+ label_description: str
66
+ count: int = 1
67
+ positions: List[int] = []
68
+
69
+
70
+ class EntityResult(BaseModel):
71
+ entities: List[Entity]
72
+ entity_counts: Dict[str, int]
73
+ total_entities: int
74
+
75
+
76
+ class SentimentBreakdown(BaseModel):
77
+ text: str
78
+ compound: float
79
+ positive: float
80
+ negative: float
81
+ neutral: float
82
+ label: str
83
+
84
+
85
+ class SentimentResult(BaseModel):
86
+ overall_compound: float
87
+ overall_positive: float
88
+ overall_negative: float
89
+ overall_neutral: float
90
+ overall_label: str
91
+ sentence_breakdown: List[SentimentBreakdown]
92
+ confidence: float
93
+
94
+
95
+ class ProcessingResult(BaseModel):
96
+ file_id: str
97
+ filename: str
98
+ file_type: str
99
+ status: TaskStatus
100
+ extraction: Optional[ExtractionResult] = None
101
+ summary: Optional[SummaryResult] = None
102
+ entities: Optional[EntityResult] = None
103
+ sentiment: Optional[SentimentResult] = None
104
+ processing_time_ms: float = 0
105
+ error_message: Optional[str] = None
106
+ timestamp: float = 0
107
+
108
+ @staticmethod
109
+ def create_pending(file_id: str, filename: str, file_type: str) -> "ProcessingResult":
110
+ return ProcessingResult(
111
+ file_id=file_id,
112
+ filename=filename,
113
+ file_type=file_type,
114
+ status=TaskStatus.PENDING,
115
+ timestamp=time.time(),
116
+ )
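
The status values in `TaskStatus` describe a small lifecycle: a task starts as pending, moves to processing, and ends as completed or error. The sketch below mirrors that flow with standard-library dataclasses instead of Pydantic so it runs without dependencies; `Status`, `Task`, and `run_task` are hypothetical stand-ins, not the models defined above.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, Optional


class Status(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    ERROR = "error"


@dataclass
class Task:
    file_id: str
    status: Status = Status.PENDING
    error_message: Optional[str] = None
    # default_factory gives each task its own creation time.
    timestamp: float = field(default_factory=time.time)


def run_task(store: Dict[str, Task], file_id: str, work: Callable[[], None]) -> Task:
    """Run `work` and record the outcome on the stored task."""
    task = store[file_id]
    task.status = Status.PROCESSING
    try:
        work()
        task.status = Status.COMPLETED
    except Exception as exc:
        task.status = Status.ERROR
        task.error_message = str(exc)
    return task


def failing_work() -> None:
    raise ValueError("bad file")


store = {"a": Task("a"), "b": Task("b")}
print(run_task(store, "a", lambda: None).status.value)
print(run_task(store, "b", failing_work).status.value)
```

Because the enum subclasses `str`, the status serializes as a plain string, which is what a polling client would compare against.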
requirements.txt ADDED
@@ -0,0 +1,16 @@
1
+ fastapi==0.115.0
2
+ uvicorn[standard]==0.30.0
3
+ python-multipart==0.0.9
4
+ PyMuPDF==1.24.0
5
+ python-docx==1.1.0
6
+ easyocr==1.7.1
7
+ torch
8
+ torchvision
9
+ pytesseract==0.3.13
10
+ Pillow==10.4.0
11
+ spacy==3.7.5
12
+ sumy==0.11.0
13
+ nltk==3.9.0
14
+ aiofiles==24.1.0
15
+ requests==2.32.3
16
+ beautifulsoup4==4.12.3
static/app.js ADDED
@@ -0,0 +1,586 @@
1
+ /**
2
+ * Alldocex - Intelligent Document Processing
3
+ * Frontend application logic
4
+ */
5
+
6
+ // ===== State =====
7
+ let currentTaskId = null;
8
+ let pollInterval = null;
9
+
10
+ // ===== DOM Elements =====
11
+ const $ = (sel) => document.querySelector(sel);
12
+ const $$ = (sel) => document.querySelectorAll(sel);
13
+
14
+ const dropZone = $('#dropZone');
15
+ const fileInput = $('#fileInput');
16
+ const uploadSection = $('#uploadSection');
17
+ const processingSection = $('#processingSection');
18
+ const resultsSection = $('#resultsSection');
19
+ const toastContainer = $('#toastContainer');
20
+ const btnExtractUrl = $('#btnExtractUrl');
21
+ const urlInput = $('#urlInput');
22
+
23
+ // ===== Init =====
24
+ document.addEventListener('DOMContentLoaded', () => {
25
+ initUpload();
26
+ initTabs();
27
+ initButtons();
28
+ });
29
+
30
+ // ===== Health Check =====
31
+
32
+ // ===== Upload =====
33
+ function initUpload() {
34
+ // Click to upload
35
+ dropZone.addEventListener('click', () => fileInput.click());
36
+
37
+ // File selected
38
+ fileInput.addEventListener('change', (e) => {
39
+ if (e.target.files.length > 0) {
40
+ handleFile(e.target.files[0]);
41
+ }
42
+ });
43
+
44
+ // URL input
45
+ btnExtractUrl.addEventListener('click', () => {
46
+ const url = urlInput.value.trim();
47
+ if (url) {
48
+ handleUrl(url);
49
+ } else {
50
+ showToast('Please enter a valid URL', 'error');
51
+ }
52
+ });
53
+
54
+ // Drag and drop
55
+ dropZone.addEventListener('dragover', (e) => {
56
+ e.preventDefault();
57
+ dropZone.classList.add('drag-over');
58
+ });
59
+
60
+ dropZone.addEventListener('dragleave', (e) => {
61
+ e.preventDefault();
62
+ dropZone.classList.remove('drag-over');
63
+ });
64
+
65
+ dropZone.addEventListener('drop', (e) => {
66
+ e.preventDefault();
67
+ dropZone.classList.remove('drag-over');
68
+ if (e.dataTransfer.files.length > 0) {
69
+ handleFile(e.dataTransfer.files[0]);
70
+ }
71
+ });
72
+
73
+ // Format badge filters
74
+ $$('.format-badge').forEach(badge => {
75
+ badge.addEventListener('click', (e) => {
76
+ e.stopPropagation(); // Don't trigger the main dropZone click
77
+ const format = badge.textContent.trim().toLowerCase();
78
+ openFilteredPicker(format);
79
+ });
80
+ });
81
+ }
82
+
83
+ function openFilteredPicker(format) {
84
+ const defaultAccept = fileInput.accept;
85
+
86
+ // Map of extensions
87
+ const extMap = {
88
+ pdf: '.pdf',
89
+ docx: '.docx',
90
+ png: '.png',
91
+ jpg: '.jpg,.jpeg',
92
+ jpeg: '.jpg,.jpeg',
93
+ tiff: '.tiff',
94
+ bmp: '.bmp',
95
+ webp: '.webp'
96
+ };
97
+
98
+ if (extMap[format]) {
99
+ fileInput.accept = extMap[format];
100
+ }
101
+
102
+ fileInput.click();
103
+
104
+ // Reset accept after a short delay so the main zone works normally
105
+ setTimeout(() => {
106
+ fileInput.accept = defaultAccept;
107
+ }, 500);
108
+ }
109
+
110
+ async function handleFile(file) {
111
+ // Validate extension
112
+ const validExts = ['pdf', 'docx', 'png', 'jpg', 'jpeg', 'tiff', 'bmp', 'webp'];
113
+ const ext = file.name.split('.').pop().toLowerCase();
114
+ if (!validExts.includes(ext)) {
115
+ showToast(`Unsupported file type: .${ext}`, 'error');
116
+ return;
117
+ }
118
+
119
+ // Validate size (20MB)
120
+ if (file.size > 20 * 1024 * 1024) {
121
+ showToast('File too large. Maximum size: 20MB', 'error');
122
+ return;
123
+ }
124
+
125
+ // Show processing UI
126
+ showSection('processing');
127
+ resetProcessingSteps();
128
+
129
+ // Upload
130
+ const formData = new FormData();
131
+ formData.append('file', file);
132
+
133
+ try {
134
+ const res = await fetch('/api/upload', {
135
+ method: 'POST',
136
+ body: formData,
137
+ });
138
+
139
+ if (!res.ok) {
140
+ const err = await res.json();
141
+ throw new Error(err.detail || 'Upload failed');
142
+ }
143
+
144
+ const data = await res.json();
145
+ currentTaskId = data.file_id;
146
+
147
+ // Start polling for results
148
+ updateStep('stepExtract', 'active');
149
+ startPolling(data.file_id);
150
+
151
+ } catch (e) {
152
+ showToast(e.message || 'Upload failed', 'error');
153
+ showSection('upload');
154
+ }
155
+ }
156
+
157
+ async function handleUrl(url) {
158
+ if (!/^https?:\/\//.test(url)) {
159
+ showToast('URL must start with http:// or https://', 'error');
160
+ return;
161
+ }
162
+
163
+ try {
164
+ resetAll();
165
+ showSection('processing');
166
+ updateStep('stepExtract', 'active');
167
+
168
+ const response = await fetch('/api/extract/url', {
169
+ method: 'POST',
170
+ headers: { 'Content-Type': 'application/json' },
171
+ body: JSON.stringify({ url: url })
172
+ });
173
+
174
+ if (!response.ok) {
175
+ const error = await response.json();
176
+ throw new Error(error.detail || 'Failed to start URL extraction');
177
+ }
178
+
179
+ const data = await response.json();
180
+ currentTaskId = data.file_id;
181
+
182
+ // Start polling for results
183
+ startPolling(data.file_id);
184
+ } catch (error) {
185
+ showSection('upload');
186
+ showToast(error.message, 'error');
187
+ }
188
+ }
189
+
190
+ // ===== Polling =====
191
+ function startPolling(taskId) {
192
+ if (pollInterval) clearInterval(pollInterval);
193
+
194
+ pollInterval = setInterval(async () => {
195
+ try {
196
+ const res = await fetch(`/api/status/${taskId}`);
197
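+ if (!res.ok) throw new Error(`Status check failed: ${res.status}`); // e.g. 404 when the task is unknown
+ const data = await res.json();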
198
+
199
+ if (data.status === 'processing') {
200
+ // Update steps based on available data
201
+ if (data.extraction) {
202
+ updateStep('stepExtract', 'done');
203
+ updateStep('stepSummary', 'active');
204
+ }
205
+ if (data.summary) {
206
+ updateStep('stepSummary', 'done');
207
+ updateStep('stepEntities', 'active');
208
+ }
209
+ if (data.entities) {
210
+ updateStep('stepEntities', 'done');
211
+ updateStep('stepSentiment', 'active');
212
+ }
213
+ if (data.sentiment) {
214
+ updateStep('stepSentiment', 'done');
215
+ }
216
+ }
217
+
218
+ if (data.status === 'completed' || data.status === 'error') {
219
+ clearInterval(pollInterval);
220
+ pollInterval = null;
221
+
222
+ // Mark all steps as done
223
+ updateStep('stepExtract', 'done');
224
+ updateStep('stepSummary', 'done');
225
+ updateStep('stepEntities', 'done');
226
+ updateStep('stepSentiment', 'done');
227
+
228
+ // Short delay to show completion
229
+ setTimeout(() => {
230
+ if (data.status === 'error' && !data.extraction) {
231
+ showToast(data.error_message || 'Processing failed', 'error');
232
+ showSection('upload');
233
+ } else {
234
+ displayResults(data);
235
+ showSection('results');
236
+ }
237
+ }, 600);
238
+ }
239
+ } catch (e) {
240
+ clearInterval(pollInterval);
241
+ pollInterval = null;
242
+ showToast('Lost connection to server', 'error');
243
+ showSection('upload');
244
+ }
245
+ }, 800);
246
+ }
247
+
248
+ // ===== Display Results =====
249
+ function displayResults(data) {
250
+ // File info bar
251
+ const typeIcons = { pdf: '📕', docx: '📘', image: '🖼️' };
252
+ $('#fileTypeIcon').textContent = typeIcons[data.file_type] || '📄';
253
+ $('#fileName').textContent = data.filename;
254
+
255
+ const meta = data.extraction?.metadata;
256
+ const parts = [data.file_type.toUpperCase()];
257
+ if (meta?.word_count) parts.push(`${meta.word_count.toLocaleString()} words`);
258
+ if (meta?.page_count) parts.push(`${meta.page_count} pages`);
259
+ $('#fileMeta').textContent = parts.join(' • ');
260
+
261
+ const timeSeconds = (data.processing_time_ms / 1000).toFixed(1);
262
+ $('#processingTime').textContent = `⏱ ${timeSeconds}s`;
263
+
264
+ // Extracted Text
265
+ const textEl = $('#extractedText');
266
+ if (data.extraction?.raw_text) {
267
+ textEl.textContent = data.extraction.raw_text;
268
+ } else {
269
+ textEl.innerHTML = `<p class="placeholder">${escapeHtml(data.extraction?.error_message || 'No text extracted.')}</p>`;
270
+ }
271
+
272
+ // Summary
273
+ if (data.summary) {
274
+ $('#summaryContent').textContent = data.summary.summary;
275
+ $('#summaryStats').classList.remove('hidden');
276
+ $('#statOriginalLen').textContent = data.summary.original_length.toLocaleString();
277
+ $('#statSummaryLen').textContent = data.summary.summary_length.toLocaleString();
278
+ const pct = Math.round((1 - data.summary.compression_ratio) * 100);
279
+ $('#statCompression').textContent = `${pct}%`;
280
+ $('#statAlgorithm').textContent = data.summary.algorithm;
281
+ } else {
282
+ $('#summaryContent').innerHTML = '<p class="placeholder">Summarization not available.</p>';
283
+ $('#summaryStats').classList.add('hidden');
284
+ }
285
+
286
+ // Entities
287
+ displayEntities(data.entities);
288
+
289
+ // Sentiment
290
+ displaySentiment(data.sentiment);
291
+
292
+ // Metadata
293
+ displayMetadata(data.extraction?.metadata);
294
+
295
+ // Activate first tab
296
+ activateTab('extracted');
297
+ }
298
+
299
+ function displayEntities(entityData) {
300
+ const catEl = $('#entityCategories');
301
+ const listEl = $('#entityList');
302
+ const countEl = $('#entityCount');
303
+
304
+ if (!entityData?.entities?.length) {
305
+ catEl.innerHTML = '<p class="placeholder">No entities detected in this document.</p>';
306
+ listEl.innerHTML = '';
307
+ countEl.textContent = '0 entities found';
308
+ return;
309
+ }
310
+
311
+ countEl.textContent = `${entityData.total_entities} entities found`;
312
+
313
+ // Category badges
314
+ const catColors = {
315
+ PERSON: '#ec4899', ORG: '#3b82f6', GPE: '#10b981', DATE: '#f59e0b',
316
+ MONEY: '#8b5cf6', EVENT: '#06b6d4', PRODUCT: '#fb923c', LAW: '#a855f7',
317
+ NORP: '#f472b6', EMAIL: '#06b6d4', PHONE: '#3b82f6', URL: '#10b981',
318
+ TIME: '#f59e0b', PERCENT: '#8b5cf6', CARDINAL: '#94a3b8',
319
+ };
320
+
321
+ catEl.innerHTML = Object.entries(entityData.entity_counts)
322
+ .sort((a, b) => b[1] - a[1])
323
+ .map(([label, count]) => `
324
+ <div class="entity-category-badge">
325
+ <span class="cat-dot" style="background: ${catColors[label] || '#94a3b8'}"></span>
326
+ ${escapeHtml(label)}
327
+ <span class="cat-count">${count}</span>
328
+ </div>
329
+ `).join('');
330
+
331
+ // Entity list
332
+ listEl.innerHTML = entityData.entities
333
+ .slice(0, 100)
334
+ .map(ent => `
335
+ <div class="entity-item">
336
+ <div class="entity-item-left">
337
+ <span class="entity-type-badge badge-${escapeHtml(ent.label)}">${escapeHtml(ent.label)}</span>
338
+ <span class="entity-text" title="${escapeHtml(ent.text)}">${escapeHtml(ent.text)}</span>
339
+ </div>
340
+ ${ent.count > 1 ? `<span class="entity-item-count">×${ent.count}</span>` : ''}
341
+ </div>
342
+ `).join('');
343
+ }
344
+
345
+ function displaySentiment(sentData) {
346
+ const overviewEl = $('#sentimentOverview');
347
+
348
+ if (!sentData) {
349
+ overviewEl.innerHTML = '<p class="placeholder">Sentiment analysis not available.</p>';
350
+ return;
351
+ }
352
+
353
+ const score = sentData.overall_compound;
354
+ const label = sentData.overall_label;
355
+ const posW = Math.round(sentData.overall_positive * 100);
356
+ const neuW = Math.round(sentData.overall_neutral * 100);
357
+ const negW = Math.round(sentData.overall_negative * 100);
358
+
359
+ // Label color
360
+ let labelColor;
361
+ if (score >= 0.05) labelColor = 'var(--accent-green)';
362
+ else if (score <= -0.05) labelColor = 'var(--accent-red)';
363
+ else labelColor = 'var(--text-muted)';
364
+
365
+ let html = `
366
+ <div class="sentiment-gauge-container">
367
+ <div class="sentiment-label-display" style="color: ${labelColor}">${escapeHtml(label)}</div>
368
+ <div class="sentiment-score">${score >= 0 ? '+' : ''}${score.toFixed(3)}</div>
369
+ <div class="sentiment-bar-container">
370
+ <div class="sentiment-bar">
371
+ <div class="sentiment-bar-positive" style="width: ${posW}%"></div>
372
+ <div class="sentiment-bar-neutral" style="width: ${neuW}%"></div>
373
+ <div class="sentiment-bar-negative" style="width: ${negW}%"></div>
374
+ </div>
375
+ <div class="sentiment-bar-labels">
376
+ <span><span class="dot dot-pos"></span> Positive ${posW}%</span>
377
+ <span><span class="dot dot-neu"></span> Neutral ${neuW}%</span>
378
+ <span><span class="dot dot-neg"></span> Negative ${negW}%</span>
379
+ </div>
380
+ </div>
381
+ </div>
382
+ `;
383
+
384
+ // Sentence breakdown
385
+ if (sentData.sentence_breakdown && sentData.sentence_breakdown.length > 0) {
386
+ html += `
387
+ <div class="sentiment-sentences">
388
+ <h4>Sentence-Level Breakdown (first ${Math.min(sentData.sentence_breakdown.length, 20)})</h4>
389
+ ${sentData.sentence_breakdown.slice(0, 20).map(s => {
390
+ let cls = 'sent-neutral';
391
+ if (s.compound >= 0.05) cls = 'sent-positive';
392
+ else if (s.compound <= -0.05) cls = 'sent-negative';
393
+ return `
394
+ <div class="sentence-item">
395
+ <span class="sentence-sentiment-badge ${cls}">${escapeHtml(s.label)}</span>
396
+ <span class="sentence-text">${escapeHtml(s.text)}</span>
397
+ </div>
398
+ `;
399
+ }).join('')}
400
+ </div>
401
+ `;
402
+ }
403
+
404
+ overviewEl.innerHTML = html;
405
+ }
406
+
407
+ function displayMetadata(meta) {
408
+ const metaEl = $('#metadataContent');
409
+
410
+ if (!meta) {
411
+ metaEl.innerHTML = '<p class="placeholder">No metadata available.</p>';
412
+ return;
413
+ }
414
+
415
+ const rows = [
416
+ ['Title', meta.title],
417
+ ['Author', meta.author],
418
+ ['File Type', meta.file_type],
419
+ ['Page Count', meta.page_count],
420
+ ['Word Count', meta.word_count?.toLocaleString()],
421
+ ['Character Count', meta.character_count?.toLocaleString()],
422
+ ['Created', meta.creation_date],
423
+ ['Modified', meta.modification_date],
424
+ ];
425
+
426
+ // Add extra metadata
427
+ if (meta.extra) {
428
+ for (const [key, value] of Object.entries(meta.extra)) {
429
+ if (value) {
430
+ const label = key.replace(/_/g, ' ').replace(/\b\w/g, c => c.toUpperCase());
431
+ rows.push([label, String(value)]);
432
+ }
433
+ }
434
+ }
435
+
436
+ metaEl.innerHTML = `
437
+ <table class="metadata-table">
438
+ ${rows.filter(([, v]) => v && v !== 'None' && v !== 'null' && v !== '')
439
+ .map(([k, v]) => `<tr><td>${escapeHtml(k)}</td><td>${escapeHtml(String(v))}</td></tr>`)
440
+ .join('')}
441
+ </table>
442
+ `;
443
+ }
444
+
445
+ // ===== Tabs =====
446
+ function initTabs() {
447
+ $$('.tab').forEach(tab => {
448
+ tab.addEventListener('click', () => {
449
+ activateTab(tab.dataset.tab);
450
+ });
451
+ });
452
+ }
453
+
454
+ function activateTab(tabName) {
455
+ $$('.tab').forEach(t => t.classList.remove('active'));
456
+ $$('.tab-panel').forEach(p => p.classList.remove('active'));
457
+
458
+ const tab = $(`.tab[data-tab="${tabName}"]`);
459
+ const panel = $(`#panel${tabName.charAt(0).toUpperCase() + tabName.slice(1)}`);
460
+
461
+ if (tab) tab.classList.add('active');
462
+ if (panel) panel.classList.add('active');
463
+ }
464
+
465
+ // ===== Buttons =====
466
+ function initButtons() {
467
+ // New upload
468
+ $('#btnNewUpload').addEventListener('click', () => {
469
+ resetAll();
470
+ showSection('upload');
471
+ });
472
+
473
+ // Back to upload (without full reset if possible, or just same as New)
474
+ $('#btnBackToUpload').addEventListener('click', () => {
475
+ // For now this resets everything to avoid data conflicts,
476
+ // even though the button is labelled "Back".
477
+ resetAll();
478
+ showSection('upload');
479
+ });
480
+
481
+ // Cancel processing
482
+ $('#btnCancelProcessing').addEventListener('click', () => {
483
+ if (pollInterval) {
484
+ clearInterval(pollInterval);
485
+ pollInterval = null;
486
+ }
487
+ showSection('upload');
488
+ showToast('Processing cancelled', 'info');
489
+ });
490
+
491
+ // Copy buttons
492
+ $('#btnCopyText').addEventListener('click', () => {
493
+ copyToClipboard($('#extractedText').textContent, '#btnCopyText');
494
+ });
495
+
496
+ $('#btnCopySummary').addEventListener('click', () => {
497
+ copyToClipboard($('#summaryContent').textContent, '#btnCopySummary');
498
+ });
499
+
500
+ // Download button
501
+ $('#btnDownloadText').addEventListener('click', () => {
502
+ if (currentTaskId) {
503
+ window.location.href = `/api/download/${currentTaskId}`;
504
+ } else {
505
+ showToast('No active document to download', 'error');
506
+ }
507
+ });
508
+ }
509
+
510
+ async function copyToClipboard(text, btnSelector) {
511
+ try {
512
+ await navigator.clipboard.writeText(text);
513
+ const btn = $(btnSelector);
514
+ btn.classList.add('copied');
515
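+ const originalHTML = (btn.dataset.originalHtml ??= btn.innerHTML); // cache once so a second click within 2 s cannot capture the "Copied!" markup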
516
+ btn.innerHTML = `<svg width="16" height="16" viewBox="0 0 16 16" fill="none"><path d="M3 8l3 3 7-7" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/></svg> Copied!`;
517
+ setTimeout(() => {
518
+ btn.classList.remove('copied');
519
+ btn.innerHTML = originalHTML;
520
+ }, 2000);
521
+ } catch (e) {
522
+ showToast('Failed to copy to clipboard', 'error');
523
+ }
524
+ }
525
+
526
+ // ===== UI Helpers =====
527
+ function showSection(sectionId) {
528
+ [uploadSection, processingSection, resultsSection].forEach(s => s.classList.add('hidden'));
529
+
530
+ if (sectionId === 'upload') {
531
+ uploadSection.classList.remove('hidden');
532
+ } else if (sectionId === 'processing') {
533
+ processingSection.classList.remove('hidden');
534
+ } else if (sectionId === 'results') {
535
+ resultsSection.classList.remove('hidden');
536
+ }
537
+ }
538
+ function resetProcessingSteps() {
539
+ ['stepExtract', 'stepSummary', 'stepEntities', 'stepSentiment'].forEach(id => {
540
+ const el = $(`#${id}`);
541
+ el.classList.remove('active', 'done');
542
+ el.querySelector('.step-status').textContent = '⏳';
543
+ });
544
+ }
545
+
546
+ function updateStep(stepId, state) {
547
+ const el = $(`#${stepId}`);
548
+ el.classList.remove('active', 'done');
549
+ el.classList.add(state);
550
+ el.querySelector('.step-status').textContent = state === 'done' ? '✅' : '⚡';
551
+ }
552
+
553
+ function resetAll() {
554
+ currentTaskId = null;
555
+ if (pollInterval) {
556
+ clearInterval(pollInterval);
557
+ pollInterval = null;
558
+ }
559
+ fileInput.value = '';
560
+ $('#extractedText').innerHTML = '<p class="placeholder">No text extracted yet.</p>';
561
+ $('#summaryContent').innerHTML = '<p class="placeholder">No summary available.</p>';
562
+ $('#summaryStats').classList.add('hidden');
563
+ $('#entityCategories').innerHTML = '<p class="placeholder">No entities detected.</p>';
564
+ $('#entityList').innerHTML = '';
565
+ $('#sentimentOverview').innerHTML = '<p class="placeholder">No sentiment data available.</p>';
566
+ $('#metadataContent').innerHTML = '<p class="placeholder">No metadata available.</p>';
567
+ }
568
+
569
+ function showToast(message, type = 'info') {
570
+ const icons = { info: 'ℹ️', error: '❌', success: '✅' };
571
+ const toast = document.createElement('div');
572
+ toast.className = `toast toast-${type}`;
573
+ toast.innerHTML = `<span class="toast-icon">${icons[type] || icons.info}</span><span>${escapeHtml(message)}</span>`;
574
+ toastContainer.appendChild(toast);
575
+
576
+ setTimeout(() => {
577
+ if (toast.parentNode) toast.remove();
578
+ }, 4000);
579
+ }
580
+
581
+ function escapeHtml(text) {
582
+ if (!text) return '';
583
+ const div = document.createElement('div');
584
+ div.textContent = text;
585
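+ return div.innerHTML.replace(/"/g, '&quot;').replace(/'/g, '&#39;'); // innerHTML leaves quotes as-is; escape them for attribute contexts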
586
+ }
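The DOM-based `escapeHtml` helper depends on `document`, so it cannot be exercised outside a browser. A minimal DOM-free sketch follows, assuming a hypothetical name `escapeHtmlPure` that is not part of this commit. It escapes the five characters that matter in HTML text and quoted attribute values, which is relevant because entity text is interpolated into a `title="..."` attribute.

```javascript
// Hypothetical DOM-free variant of escapeHtml (the name is illustrative only).
const HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

function escapeHtmlPure(value) {
  // Treat null/undefined as empty; unlike a plain falsy check, 0 is kept as "0".
  if (value === null || value === undefined) return '';
  return String(value).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

// Usage: the same call is safe in text and in quoted-attribute positions.
const sample = 'Acme "R&D" <Labs>';
const html = `<span title="${escapeHtmlPure(sample)}">${escapeHtmlPure(sample)}</span>`;
console.log(html);
```

A string-based helper like this trades the browser's built-in serializer for an explicit character table, which makes the escaping rules visible and testable in Node.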
static/index.html ADDED
@@ -0,0 +1,268 @@
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8">
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
+ <title>Alldocex — Intelligent Document Processing</title>
7
+ <meta name="description" content="Extract, analyse, and summarize content from PDF, DOCX, and image files using AI-powered document processing.">
8
+ <link rel="preconnect" href="https://fonts.googleapis.com">
9
+ <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
10
+ <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&display=swap" rel="stylesheet">
11
+ <link rel="stylesheet" href="/static/styles.css">
12
+ </head>
13
+ <body>
14
+ <!-- subtle background -->
15
+ <div class="bg-orbs"></div>
16
+
17
+ <div class="app-container">
18
+ <!-- Header -->
19
+ <header class="header">
20
+ <div class="logo">
21
+ <div class="logo-icon">
22
+ <svg width="32" height="32" viewBox="0 0 32 32" fill="none">
23
+ <rect x="4" y="2" width="18" height="24" rx="3" stroke="currentColor" stroke-width="2.5"/>
24
+ <path d="M10 8h8M10 12h10M10 16h6" stroke="currentColor" stroke-width="2" stroke-linecap="round"/>
25
+ <circle cx="22" cy="22" r="8" fill="var(--accent-blue)" opacity="0.9"/>
26
+ <path d="M20 22l1.5 1.5L24 20" stroke="white" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/>
27
+ </svg>
28
+ </div>
29
+ <div>
30
+ <h1>Alldocex</h1>
31
+ <p class="logo-subtitle">Intelligent Document Processing</p>
32
+ </div>
33
+ </div>
34
+ </header>
35
+
36
+ <!-- Main Content -->
37
+ <main class="main-content">
38
+ <!-- Upload Section -->
39
+ <section class="upload-section" id="uploadSection">
40
+ <div class="upload-zone" id="dropZone">
41
+ <div class="upload-icon">
42
+ <svg width="64" height="64" viewBox="0 0 64 64" fill="none">
43
+ <path d="M32 44V20M32 20L22 30M32 20L42 30" stroke="var(--accent-blue)" stroke-width="3" stroke-linecap="round" stroke-linejoin="round"/>
44
+ <path d="M12 40v6a6 6 0 006 6h28a6 6 0 006-6v-6" stroke="var(--accent-blue)" stroke-width="3" stroke-linecap="round"/>
45
+ </svg>
46
+ </div>
47
+ <h2 class="upload-title">Drop your document here</h2>
48
+ <p class="upload-subtitle">or click to browse files</p>
49
+ <div class="upload-formats">
50
+ <span class="format-badge">PDF</span>
51
+ <span class="format-badge">DOCX</span>
52
+ <span class="format-badge">PNG</span>
53
+ <span class="format-badge">JPG</span>
54
+ <span class="format-badge">TIFF</span>
55
+ <span class="format-badge">BMP</span>
56
+ </div>
57
+ <p class="upload-limit">Maximum file size: 50MB</p>
58
+ <input type="file" id="fileInput" accept=".pdf,.docx,.png,.jpg,.jpeg,.tiff,.bmp,.webp" hidden>
59
+ </div>
60
+
61
+ <div class="url-section">
62
+ <div class="divider">
63
+ <span>OR</span>
64
+ </div>
65
+ <div class="url-input-container">
66
+ <div class="url-icon-subtle">
67
+ <svg width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07l-1.72 1.71"></path><path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 0 0 7.07 7.07l1.71-1.71"></path></svg>
68
+ </div>
69
+ <input type="text" id="urlInput" placeholder="Paste a web URL here (e.g. https://wikipedia.org/...)">
70
+ <button class="btn-url" id="btnExtractUrl">
71
+ Summarize URL
72
+ </button>
73
+ </div>
74
+ </div>
75
+ </section>
76
+
77
+ <!-- Processing Indicator -->
78
+ <section class="processing-section hidden" id="processingSection">
79
+ <div class="processing-card">
80
+ <div class="processing-spinner">
81
+ <div class="spinner-ring"></div>
82
+ <div class="spinner-ring ring-inner"></div>
83
+ </div>
84
+ <h3 class="processing-title" id="processingTitle">Processing document...</h3>
85
+ <p class="processing-subtitle" id="processingSubtitle">Extracting text and running AI analysis</p>
86
+ <div class="processing-steps">
87
+ <div class="step" id="stepExtract">
88
+ <span class="step-icon">📄</span>
89
+ <span>Text Extraction</span>
90
+ <span class="step-status">⏳</span>
91
+ </div>
92
+ <div class="step" id="stepSummary">
93
+ <span class="step-icon">📝</span>
94
+ <span>Summarization</span>
95
+ <span class="step-status">⏳</span>
96
+ </div>
97
+ <div class="step" id="stepEntities">
98
+ <span class="step-icon">🏷️</span>
99
+ <span>Entity Recognition</span>
100
+ <span class="step-status">⏳</span>
101
+ </div>
102
+ <div class="step" id="stepSentiment">
103
+ <span class="step-icon">💭</span>
104
+ <span>Sentiment Analysis</span>
105
+ <span class="step-status">⏳</span>
106
+ </div>
107
+ </div>
108
+ <!-- Cancel Button -->
109
+ <button class="btn-cancel" id="btnCancelProcessing" title="Cancel and return to upload">
110
+ <svg width="18" height="18" viewBox="0 0 18 18" fill="none"><path d="M13.5 4.5l-9 9M4.5 4.5l9 9" stroke="currentColor" stroke-width="1.8" stroke-linecap="round" stroke-linejoin="round"/></svg>
111
+ Cancel
112
+ </button>
113
+ </div>
114
+ </section>
115
+
116
+ <!-- Results Section -->
117
+ <section class="results-section hidden" id="resultsSection">
118
+ <!-- File Info Bar -->
119
+ <div class="file-info-bar" id="fileInfoBar">
120
+ <div class="file-info-left">
121
+ <span class="file-type-icon" id="fileTypeIcon">📄</span>
122
+ <div>
123
+ <h3 class="file-name" id="fileName">document.pdf</h3>
124
+ <p class="file-meta" id="fileMeta">PDF • 2.3 MB • 15 pages</p>
125
+ </div>
126
+ </div>
127
+ <div class="file-info-right">
128
+ <span class="processing-time" id="processingTime">⏱ 1.2s</span>
129
+ <button class="btn-back" id="btnBackToUpload" title="Back to upload">
130
+ <svg width="18" height="18" viewBox="0 0 20 20" fill="none">
131
+ <path d="M15 10H5M10 15l-5-5 5-5" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/>
132
+ </svg>
133
+ Back
134
+ </button>
135
+ <button class="btn-new" id="btnNewUpload" title="Upload new document">
136
+ <svg width="18" height="18" viewBox="0 0 20 20" fill="none">
137
+ <path d="M10 4v12M4 10h12" stroke="currentColor" stroke-width="2" stroke-linecap="round"/>
138
+ </svg>
139
+ New Upload
140
+ </button>
141
+ </div>
142
+ </div>
143
+
144
+ <!-- Tabs -->
145
+ <div class="tabs">
146
+ <button class="tab active" data-tab="extracted" id="tabExtracted">
147
+ <svg width="18" height="18" viewBox="0 0 18 18" fill="none"><rect x="3" y="3" width="12" height="12" rx="2" stroke="currentColor" stroke-width="1.5"/><path d="M6 7h6M6 10h4" stroke="currentColor" stroke-width="1.5" stroke-linecap="round"/></svg>
148
+ Extracted Text
149
+ </button>
150
+ <button class="tab" data-tab="summary" id="tabSummary">
151
+ <svg width="18" height="18" viewBox="0 0 18 18" fill="none"><path d="M3 5h12M3 9h8M3 13h10" stroke="currentColor" stroke-width="1.5" stroke-linecap="round"/></svg>
152
+ Summary
153
+ </button>
154
+ <button class="tab" data-tab="entities" id="tabEntities">
155
+ <svg width="18" height="18" viewBox="0 0 18 18" fill="none"><circle cx="7" cy="7" r="4" stroke="currentColor" stroke-width="1.5"/><path d="M14 14l-3-3" stroke="currentColor" stroke-width="1.5" stroke-linecap="round"/></svg>
156
+ Entities
157
+ </button>
158
+ <button class="tab" data-tab="sentiment" id="tabSentiment">
159
+ <svg width="18" height="18" viewBox="0 0 18 18" fill="none"><circle cx="9" cy="9" r="7" stroke="currentColor" stroke-width="1.5"/><path d="M6 11c.5 1 1.5 2 3 2s2.5-1 3-2" stroke="currentColor" stroke-width="1.5" stroke-linecap="round"/><circle cx="6.5" cy="7.5" r="1" fill="currentColor"/><circle cx="11.5" cy="7.5" r="1" fill="currentColor"/></svg>
160
+ Sentiment
161
+ </button>
162
+ <button class="tab" data-tab="metadata" id="tabMetadata">
163
+ <svg width="18" height="18" viewBox="0 0 18 18" fill="none"><circle cx="9" cy="9" r="7" stroke="currentColor" stroke-width="1.5"/><path d="M9 6v3l2 2" stroke="currentColor" stroke-width="1.5" stroke-linecap="round" stroke-linejoin="round"/></svg>
164
+ Metadata
165
+ </button>
166
+ </div>
167
+
168
+ <!-- Tab Content -->
169
+ <div class="tab-content">
170
+ <!-- Extracted Text -->
171
+ <div class="tab-panel active" id="panelExtracted">
172
+ <div class="panel-header">
173
+ <h3>Extracted Text</h3>
174
+ <div class="panel-actions">
175
+ <button class="btn-copy" id="btnCopyText" title="Copy to clipboard">
176
+ <svg width="16" height="16" viewBox="0 0 16 16" fill="none"><rect x="5" y="5" width="9" height="9" rx="1.5" stroke="currentColor" stroke-width="1.5"/><path d="M3 11V3a1 1 0 011-1h8" stroke="currentColor" stroke-width="1.5" stroke-linecap="round"/></svg>
177
+ Copy
178
+ </button>
179
+ <button class="btn-download" id="btnDownloadText" title="Download as .txt">
180
+ <svg width="16" height="16" viewBox="0 0 16 16" fill="none"><path d="M8 2v9M4 7l4 4 4-4M2 14h12" stroke="currentColor" stroke-width="1.5" stroke-linecap="round" stroke-linejoin="round"/></svg>
181
+ Download
182
+ </button>
183
+ </div>
184
+ </div>
185
+ <div class="text-content" id="extractedText">
186
+ <p class="placeholder">No text extracted yet.</p>
187
+ </div>
188
+ </div>
189
+
190
+ <!-- Summary -->
191
+ <div class="tab-panel" id="panelSummary">
192
+ <div class="panel-header">
193
+ <h3>AI Summary</h3>
194
+ <button class="btn-copy" id="btnCopySummary" title="Copy to clipboard">
195
+ <svg width="16" height="16" viewBox="0 0 16 16" fill="none"><rect x="5" y="5" width="9" height="9" rx="1.5" stroke="currentColor" stroke-width="1.5"/><path d="M3 11V3a1 1 0 011-1h8" stroke="currentColor" stroke-width="1.5" stroke-linecap="round"/></svg>
196
+ Copy
197
+ </button>
198
+ </div>
199
+ <div class="summary-content" id="summaryContent">
200
+ <p class="placeholder">No summary available.</p>
201
+ </div>
202
+ <div class="summary-stats hidden" id="summaryStats">
203
+ <div class="stat-card">
204
+ <span class="stat-value" id="statOriginalLen">0</span>
205
+ <span class="stat-label">Original chars</span>
206
+ </div>
207
+ <div class="stat-card">
208
+ <span class="stat-value" id="statSummaryLen">0</span>
209
+ <span class="stat-label">Summary chars</span>
210
+ </div>
211
+ <div class="stat-card">
212
+ <span class="stat-value" id="statCompression">0%</span>
213
+ <span class="stat-label">Compression</span>
214
+ </div>
215
+ <div class="stat-card">
216
+ <span class="stat-value" id="statAlgorithm">—</span>
217
+ <span class="stat-label">Algorithm</span>
218
+ </div>
219
+ </div>
220
+ </div>
221
+
222
+ <!-- Entities -->
223
+ <div class="tab-panel" id="panelEntities">
224
+ <div class="panel-header">
225
+ <h3>Named Entities</h3>
226
+ <span class="entity-count" id="entityCount">0 entities found</span>
227
+ </div>
228
+ <div class="entity-categories" id="entityCategories">
229
+ <p class="placeholder">No entities detected.</p>
230
+ </div>
231
+ <div class="entity-list" id="entityList"></div>
232
+ </div>
233
+
234
+ <!-- Sentiment -->
235
+ <div class="tab-panel" id="panelSentiment">
236
+ <div class="panel-header">
237
+ <h3>Sentiment Analysis</h3>
238
+ </div>
239
+ <div class="sentiment-overview" id="sentimentOverview">
240
+ <p class="placeholder">No sentiment data available.</p>
241
+ </div>
242
+ </div>
243
+
244
+ <!-- Metadata -->
245
+ <div class="tab-panel" id="panelMetadata">
246
+ <div class="panel-header">
247
+ <h3>Document Metadata</h3>
248
+ </div>
249
+ <div class="metadata-content" id="metadataContent">
250
+ <p class="placeholder">No metadata available.</p>
251
+ </div>
252
+ </div>
253
+ </div>
254
+ </section>
255
+ </main>
256
+
257
+ <!-- Footer -->
258
+ <footer class="footer">
259
+ <p>Alldocex v1.0 — Powered by FastAPI, spaCy, VADER &amp; Tesseract OCR</p>
260
+ </footer>
261
+ </div>
262
+
263
+ <!-- Toast Container -->
264
+ <div class="toast-container" id="toastContainer"></div>
265
+
266
+ <script src="/static/app.js"></script>
267
+ </body>
268
+ </html>
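The tab buttons and panels in this markup are linked only by a naming convention: each `data-tab` value maps to a panel whose `id` is `panel` plus the capitalized tab name, which is how `activateTab` in `app.js` finds the panel to show. A minimal sketch of that convention, using a hypothetical helper name `panelIdFor` that does not appear in the commit:

```javascript
// Hypothetical helper (not in this commit): derive a panel id from a data-tab value.
function panelIdFor(tabName) {
  return `panel${tabName.charAt(0).toUpperCase()}${tabName.slice(1)}`;
}

// The five tab names declared in the markup's data-tab attributes.
const TAB_NAMES = ['extracted', 'summary', 'entities', 'sentiment', 'metadata'];

for (const name of TAB_NAMES) {
  console.log(`${name} -> #${panelIdFor(name)}`);
}
```

Because nothing enforces the pairing, renaming a `data-tab` value or a panel `id` on one side only would leave the tab with no panel to activate.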
static/styles.css ADDED
@@ -0,0 +1,1156 @@
1
+ /* --- CSS Variables / Design Tokens (Corporate Blue & White) --- */
2
+ :root {
3
+ /* Primary Colors */
4
+ --bg-primary: #f8fafc;
5
+ --bg-secondary: #ffffff;
6
+ --bg-card: #ffffff;
7
+ --bg-subtle: #f1f5f9;
8
+
9
+ /* Borders & Accents */
10
+ --border-light: #e2e8f0;
11
+ --border-accent: rgba(37, 99, 235, 0.2);
12
+
13
+ /* Text */
14
+ --text-primary: #1e293b;
15
+ --text-secondary: #475569;
16
+ --text-muted: #94a3b8;
17
+
18
+ /* Corporate Blue Palette */
19
+ --accent-blue-deep: #1e40af;
20
+ --accent-blue: #2563eb;
21
+ --accent-blue-light: #60a5fa;
22
+ --accent-blue-subtle: #dbeafe;
23
+
24
+ /* Functional Colors */
25
+ --accent-green: #10b981;
26
+ --accent-yellow: #f59e0b;
27
+ --accent-red: #ef4444;
28
+
29
+ /* Gradients */
30
+ --gradient-primary: linear-gradient(135deg, #1e40af, #2563eb);
31
+ --gradient-professional: linear-gradient(135deg, #2563eb, #60a5fa);
32
+
33
+ /* Shadows (Clean & Soft) */
34
+ --shadow-sm: 0 1px 3px rgba(0, 0, 0, 0.1);
35
+ --shadow-md: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06);
36
+ --shadow-lg: 0 10px 15px -3px rgba(0, 0, 0, 0.1), 0 4px 6px -2px rgba(0, 0, 0, 0.05);
37
+ --shadow-glow: 0 0 20px rgba(37, 99, 235, 0.1);
38
+
39
+ /* Radius */
40
+ --radius-sm: 6px;
41
+ --radius-md: 8px;
42
+ --radius-lg: 12px;
43
+ --radius-xl: 16px;
44
+
45
+ /* Font */
46
+ --font-main: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
47
+
48
+ /* Transitions */
49
+ --transition-fast: 0.15s ease;
50
+ --transition-normal: 0.25s ease;
51
+ --transition-smooth: 0.4s cubic-bezier(0.4, 0, 0.2, 1);
52
+ }
53
+
54
+ /* --- Reset & Base --- */
55
+ *, *::before, *::after {
56
+ box-sizing: border-box;
57
+ margin: 0;
58
+ padding: 0;
59
+ }
60
+
61
+ html {
62
+ font-size: 16px;
63
+ scroll-behavior: smooth;
64
+ }
65
+
66
+ body {
67
+ font-family: var(--font-main);
68
+ background: var(--bg-primary);
69
+ color: var(--text-primary);
70
+ min-height: 100vh;
71
+ line-height: 1.6;
72
+ -webkit-font-smoothing: antialiased;
73
+ }
74
+
75
+ /* --- Background Decorations --- */
76
+ .bg-orbs {
77
+ position: fixed;
78
+ inset: 0;
79
+ z-index: -1;
80
+ background: radial-gradient(circle at top right, var(--accent-blue-subtle), transparent 400px),
81
+ radial-gradient(circle at bottom left, #f1f5f9, transparent 400px);
82
+ opacity: 0.5;
83
+ }
84
+
85
+ /* --- App Container --- */
86
+ .app-container {
87
+ max-width: 1100px;
88
+ margin: 0 auto;
89
+ padding: 32px 20px;
90
+ min-height: 100vh;
91
+ display: flex;
92
+ flex-direction: column;
93
+ }
94
+
95
+ /* --- Header --- */
96
+ .header {
97
+ display: flex;
98
+ align-items: center;
99
+ justify-content: space-between;
100
+ padding: 20px 32px;
101
+ background: var(--bg-secondary);
102
+ border: 1px solid var(--border-light);
103
+ border-radius: var(--radius-lg);
104
+ box-shadow: var(--shadow-sm);
105
+ margin-bottom: 32px;
106
+ }
107
+
108
+ .logo {
109
+ display: flex;
110
+ align-items: center;
111
+ gap: 16px;
112
+ }
113
+
114
+ .logo-icon {
115
+ display: flex;
116
+ align-items: center;
117
+ justify-content: center;
118
+ width: 44px;
119
+ height: 44px;
120
+ background: var(--accent-blue-subtle);
121
+ border-radius: var(--radius-md);
122
+ color: var(--accent-blue);
123
+ }
124
+
125
+ .logo h1 {
126
+ font-size: 1.5rem;
127
+ font-weight: 800;
128
+ color: var(--accent-blue-deep);
129
+ letter-spacing: -0.5px;
130
+ }
131
+
132
+ .logo-subtitle {
133
+ font-size: 0.75rem;
134
+ color: var(--text-secondary);
135
+ font-weight: 500;
136
+ text-transform: uppercase;
137
+ letter-spacing: 0.5px;
138
+ }
139
+
140
+ /* --- Main Content --- */
141
+ .main-content {
142
+ flex: 1;
143
+ }
144
+
145
+ /* --- Upload Section --- */
146
+ .upload-zone {
147
+ display: flex;
148
+ flex-direction: column;
149
+ align-items: center;
150
+ justify-content: center;
151
+ padding: 64px 40px;
152
+ background: var(--bg-secondary);
153
+ border: 2px dashed var(--border-light);
154
+ border-radius: var(--radius-xl);
155
+ cursor: pointer;
156
+ transition: var(--transition-smooth);
157
+ box-shadow: var(--shadow-sm);
158
+ }
159
+
160
+ .upload-zone:hover {
161
+ border-color: var(--accent-blue-light);
162
+ background: var(--accent-blue-subtle);
163
+ box-shadow: var(--shadow-glow);
164
+ transform: translateY(-2px);
165
+ }
166
+
167
+ .upload-zone.drag-over {
168
+ border-color: var(--accent-blue);
169
+ background: var(--accent-blue-subtle);
170
+ box-shadow: 0 0 20px rgba(37,99,235, 0.15);
171
+ }
172
+
173
+ .upload-icon {
174
+ margin-bottom: 24px;
175
+ color: var(--accent-blue);
176
+ }
177
+
178
+ .upload-title {
179
+ font-size: 1.5rem;
180
+ font-weight: 700;
181
+ margin-bottom: 8px;
182
+ color: var(--text-primary);
183
+ }
184
+
185
+ .upload-subtitle {
186
+ color: var(--text-secondary);
187
+ font-size: 0.95rem;
188
+ margin-bottom: 24px;
189
+ }
190
+
191
+ .upload-formats {
192
+ display: flex;
193
+ gap: 8px;
194
+ flex-wrap: wrap;
195
+ justify-content: center;
196
+ margin-bottom: 16px;
197
+ }
198
+
199
+ .format-badge {
200
+ padding: 6px 14px;
201
+ font-size: 0.75rem;
202
+ font-weight: 600;
203
+ color: var(--accent-blue);
204
+ background: var(--accent-blue-subtle);
205
+ border-radius: 100px;
206
+ text-transform: uppercase;
207
+ cursor: pointer;
208
+ transition: var(--transition-fast);
209
+ }
210
+
211
+ .format-badge:hover {
212
+ background: var(--accent-blue);
213
+ color: white;
214
+ transform: translateY(-1px);
215
+ box-shadow: var(--shadow-sm);
216
+ }
217
+
218
+ .upload-limit {
219
+ font-size: 0.75rem;
220
+ color: var(--text-muted);
221
+ }
222
+
223
+ /* --- URL Section --- */
224
+ .url-section {
225
+ margin-top: 32px;
226
+ width: 100%;
227
+ }
228
+
229
+ .divider {
230
+ display: flex;
231
+ align-items: center;
232
+ text-align: center;
233
+ margin-bottom: 24px;
234
+ color: var(--text-muted);
235
+ font-size: 0.75rem;
236
+ font-weight: 600;
237
+ letter-spacing: 1px;
238
+ }
239
+
240
+ .divider::before, .divider::after {
241
+ content: '';
242
+ flex: 1;
243
+ border-bottom: 1px solid var(--border-light);
244
+ }
245
+
246
+ .divider:not(:empty)::before {
247
+ margin-right: 16px;
248
+ }
249
+
250
+ .divider:not(:empty)::after {
251
+ margin-left: 16px;
252
+ }
253
+
254
+ .url-input-container {
255
+ display: flex;
256
+ gap: 12px;
257
+ background: var(--bg-secondary);
258
+ border: 1px solid var(--border-light);
259
+ padding: 8px;
260
+ border-radius: var(--radius-lg);
261
+ box-shadow: var(--shadow-sm);
262
+ transition: var(--transition-normal);
263
+ }
264
+
265
+ .url-input-container:focus-within {
266
+ border-color: var(--accent-blue);
267
+ box-shadow: var(--shadow-glow);
268
+ }
269
+
270
+ .url-icon-subtle {
271
+ display: flex;
272
+ align-items: center;
273
+ padding-left: 12px;
274
+ color: var(--text-muted);
275
+ }
276
+
277
+ .url-input-container input {
278
+ flex: 1;
279
+ border: none;
280
+ background: transparent;
281
+ font-family: var(--font-main);
282
+ font-size: 0.95rem;
283
+ color: var(--text-primary);
284
+ outline: none;
285
+ }
286
+
287
+ .btn-url {
288
+ background: var(--gradient-primary);
289
+ color: white;
290
+ border: none;
291
+ padding: 10px 24px;
292
+ border-radius: var(--radius-md);
293
+ font-family: var(--font-main);
294
+ font-size: 0.85rem;
295
+ font-weight: 600;
296
+ cursor: pointer;
297
+ transition: var(--transition-normal);
298
+ white-space: nowrap;
299
+ }
300
+
301
+ .btn-url:hover {
302
+ transform: translateY(-1px);
303
+ box-shadow: 0 4px 12px rgba(37, 99, 235, 0.3);
304
+ }
305
+
306
+ /* --- Transitions & Animations --- */
307
+ @keyframes fadeInUp {
308
+ from { opacity: 0; transform: translateY(15px); }
309
+ to { opacity: 1; transform: translateY(0); }
310
+ }
311
+
312
+ @keyframes fadeIn {
313
+ from { opacity: 0; }
314
+ to { opacity: 1; }
315
+ }
316
+
317
+ .upload-section, .processing-section, .results-section {
318
+ animation: fadeInUp 0.5s var(--transition-smooth);
319
+ }
320
+
321
+ .hidden {
322
+ display: none !important;
323
+ }
324
+
325
+ /* --- Processing Section --- */
326
+ .processing-card {
327
+ display: flex;
328
+ flex-direction: column;
329
+ align-items: center;
330
+ padding: 60px 40px;
331
+ background: var(--bg-secondary);
332
+ border: 1px solid var(--border-light);
333
+ border-radius: var(--radius-xl);
334
+ box-shadow: var(--shadow-md);
335
+ }
336
+
337
+ .processing-spinner {
338
+ position: relative;
339
+ width: 80px;
340
+ height: 80px;
341
+ margin-bottom: 24px;
342
+ }
343
+
344
+ .spinner-ring {
345
+ position: absolute;
346
+ inset: 0;
347
+ border-radius: 50%;
348
+ border: 3px solid #e2e8f0;
349
+ border-top-color: var(--accent-blue);
350
+ animation: spin 1s linear infinite;
351
+ }
352
+
353
+ .ring-inner {
354
+ inset: 10px;
355
+ border-top-color: var(--accent-blue-light);
356
+ animation-direction: reverse;
357
+ animation-duration: 0.8s;
358
+ }
359
+
360
+ @keyframes spin {
361
+ to { transform: rotate(360deg); }
362
+ }
363
+
364
+ .processing-title {
365
+ font-size: 1.2rem;
366
+ font-weight: 600;
367
+ margin-bottom: 6px;
368
+ }
369
+
370
+ .processing-subtitle {
371
+ color: var(--text-secondary);
372
+ font-size: 0.85rem;
373
+ margin-bottom: 30px;
374
+ }
375
+
376
+ .processing-steps {
377
+ display: flex;
378
+ flex-direction: column;
379
+ gap: 12px;
380
+ width: 100%;
381
+ max-width: 360px;
382
+ }
383
+
384
+ .step {
385
+ display: flex;
386
+ align-items: center;
387
+ gap: 12px;
388
+ padding: 10px 16px;
389
+ background: var(--bg-primary);
390
+ border-radius: var(--radius-sm);
391
+ border: 1px solid var(--border-light);
392
+ font-size: 0.85rem;
393
+ color: var(--text-secondary);
394
+ transition: var(--transition-normal);
395
+ }
396
+
397
+ .step.active {
398
+ border-color: var(--accent-blue);
399
+ color: var(--accent-blue-deep);
400
+ background: var(--accent-blue-subtle);
401
+ }
402
+
403
+ .step.done {
404
+ border-color: rgba(16, 185, 129, 0.3);
405
+ color: var(--accent-green);
406
+ }
407
+
408
+ .step-icon {
409
+ font-size: 1.1rem;
410
+ }
411
+
412
+ .step-status {
413
+ margin-left: auto;
414
+ font-size: 0.9rem;
415
+ }
416
+
417
+ /* --- Results Section --- */
418
+ .results-section {
419
+ animation: fadeInUp 0.5s ease;
420
+ }
421
+
422
+ /* File Info Bar */
423
+ .file-info-bar {
424
+ display: flex;
425
+ align-items: center;
426
+ justify-content: space-between;
427
+ padding: 16px 24px;
428
+ background: var(--bg-secondary);
429
+ border: 1px solid var(--border-light);
430
+ border-radius: var(--radius-lg);
431
+ box-shadow: var(--shadow-sm);
432
+ margin-bottom: 20px;
433
+ }
434
+
435
+ .file-info-left {
436
+ display: flex;
437
+ align-items: center;
438
+ gap: 14px;
439
+ }
440
+
441
+ .file-type-icon {
442
+ font-size: 2rem;
443
+ }
444
+
445
+ .file-name {
446
+ font-size: 1rem;
447
+ font-weight: 600;
448
+ }
449
+
450
+ .file-meta {
451
+ font-size: 0.8rem;
452
+ color: var(--text-secondary);
453
+ }
454
+
455
+ .file-info-right {
456
+ display: flex;
457
+ align-items: center;
458
+ gap: 12px;
459
+ }
460
+
461
+ .processing-time {
462
+ font-size: 0.8rem;
463
+ color: var(--accent-blue);
464
+ padding: 4px 12px;
465
+ background: var(--accent-blue-subtle);
466
+ border-radius: 100px;
467
+ }
468
+
469
+ .btn-new, .btn-back {
470
+ display: flex;
471
+ align-items: center;
472
+ gap: 8px;
473
+ padding: 8px 18px;
474
+ border: none;
475
+ border-radius: var(--radius-sm);
476
+ font-family: var(--font-main);
477
+ font-size: 0.85rem;
478
+ font-weight: 600;
479
+ cursor: pointer;
480
+ transition: var(--transition-normal);
481
+ }
482
+
483
+ .btn-new {
484
+ background: var(--gradient-primary);
485
+ color: white;
486
+ }
487
+
488
+ .btn-back {
489
+ background: var(--bg-glass-strong);
490
+ color: var(--text-secondary);
491
+ border: 1px solid var(--border-glass);
492
+ }
493
+
494
+ .btn-new:hover, .btn-back:hover {
495
+ transform: translateY(-1px);
496
+ box-shadow: var(--shadow-md);
497
+ }
498
+
499
+ .btn-new:hover {
500
+ box-shadow: 0 4px 20px rgba(139, 92, 246, 0.4);
501
+ }
502
+
503
+ .btn-back:hover {
504
+ color: var(--text-primary);
505
+ border-color: var(--border-accent);
506
+ }
507
+
508
+ .btn-cancel {
509
+ margin-top: 24px;
510
+ display: flex;
511
+ align-items: center;
512
+ gap: 8px;
513
+ padding: 10px 24px;
514
+ background: rgba(239, 68, 68, 0.1);
515
+ border: 1px solid rgba(239, 68, 68, 0.2);
516
+ border-radius: var(--radius-sm);
517
+ color: var(--accent-red);
518
+ font-family: var(--font-main);
519
+ font-size: 0.85rem;
520
+ font-weight: 600;
521
+ cursor: pointer;
522
+ transition: var(--transition-normal);
523
+ }
524
+
525
+ .btn-cancel:hover {
526
+ background: rgba(239, 68, 68, 0.2);
527
+ border-color: rgba(239, 68, 68, 0.4);
528
+ transform: translateY(-1px);
529
+ }
530
+
531
+ /* --- Tabs --- */
532
+ .tabs {
533
+ display: flex;
534
+ gap: 4px;
535
+ padding: 4px;
536
+ background: var(--bg-subtle);
537
+ border: 1px solid var(--border-light);
538
+ border-radius: var(--radius-lg);
539
+ margin-bottom: 20px;
540
+ overflow-x: auto;
541
+ }
542
+
543
+ .tab {
544
+ display: flex;
545
+ align-items: center;
546
+ gap: 8px;
547
+ padding: 10px 18px;
548
+ background: transparent;
549
+ border: 1px solid transparent;
550
+ border-radius: var(--radius-md);
551
+ color: var(--text-secondary);
552
+ font-family: var(--font-main);
553
+ font-size: 0.82rem;
554
+ font-weight: 500;
555
+ cursor: pointer;
556
+ transition: var(--transition-normal);
557
+ white-space: nowrap;
558
+ flex: 1;
559
+ justify-content: center;
560
+ }
561
+
562
+ .tab:hover {
563
+ color: var(--accent-blue);
564
+ background: #fff;
565
+ }
566
+
567
+ .tab.active {
568
+ color: var(--accent-blue);
569
+ background: #fff;
570
+ border-color: var(--accent-blue);
571
+ box-shadow: var(--shadow-sm);
572
+ }
573
+
574
+ .tab svg {
575
+ opacity: 0.7;
576
+ flex-shrink: 0;
577
+ }
578
+
579
+ .tab.active svg {
580
+ opacity: 1;
581
+ }
582
+
583
+ /* --- Tab Panels --- */
584
+ .tab-content {
585
+ position: relative;
586
+ }
587
+
588
+ .tab-panel {
589
+ display: none;
590
+ animation: fadeIn 0.3s ease;
591
+ }
592
+
593
+ .tab-panel.active {
594
+ display: block;
595
+ }
596
+
597
+ .panel-header {
598
+ display: flex;
599
+ align-items: center;
600
+ justify-content: space-between;
601
+ margin-bottom: 16px;
602
+ }
603
+
604
+ .panel-header h3 {
605
+ font-size: 1.1rem;
606
+ font-weight: 600;
607
+ }
608
+
609
+ .panel-actions {
610
+ display: flex;
611
+ gap: 8px;
612
+ }
613
+
614
+ .btn-copy, .btn-download {
615
+ display: flex;
616
+ align-items: center;
617
+ gap: 6px;
618
+ padding: 6px 14px;
619
+ background: var(--bg-glass-strong);
620
+ border: 1px solid var(--border-glass);
621
+ border-radius: var(--radius-sm);
622
+ color: var(--text-secondary);
623
+ font-family: var(--font-main);
624
+ font-size: 0.8rem;
625
+ cursor: pointer;
626
+ transition: var(--transition-normal);
627
+ }
628
+
629
+ .btn-copy:hover, .btn-download:hover {
630
+ color: var(--accent-cyan);
631
+ border-color: rgba(6, 182, 212, 0.3);
632
+ background: rgba(6, 182, 212, 0.05);
633
+ }
634
+
635
+ .btn-copy.copied {
636
+ color: var(--accent-green);
637
+ border-color: rgba(16, 185, 129, 0.3);
638
+ }
639
+
640
+ /* --- Text Content --- */
641
+ .text-content, .summary-content {
642
+ padding: 24px;
643
+ background: #ffffff;
644
+ border: 1px solid var(--border-light);
645
+ border-radius: var(--radius-lg);
646
+ color: var(--text-primary);
647
+ box-shadow: var(--shadow-sm);
648
+ max-height: 500px;
649
+ overflow-y: auto;
650
+ font-size: 0.9rem;
651
+ line-height: 1.8;
652
+ white-space: pre-wrap;
653
+ overflow-wrap: break-word;
654
+ }
655
+
656
+ .summary-content {
657
+ border-left: 4px solid var(--accent-blue);
658
+ background: #fafbff;
659
+ font-size: 0.95rem;
660
+ }
661
+
662
+ .placeholder {
663
+ color: var(--text-muted);
664
+ font-style: italic;
665
+ text-align: center;
666
+ padding: 30px 0;
667
+ }
668
+
669
+ /* --- Summary Stats --- */
670
+ .summary-stats {
671
+ display: grid;
672
+ grid-template-columns: repeat(4, 1fr);
673
+ gap: 12px;
674
+ margin-top: 16px;
675
+ }
676
+
677
+ .stat-card {
678
+ display: flex;
679
+ flex-direction: column;
680
+ align-items: center;
681
+ gap: 4px;
682
+ padding: 16px 12px;
683
+ background: var(--bg-glass);
684
+ border: 1px solid var(--border-glass);
685
+ border-radius: var(--radius-md);
686
+ text-align: center;
687
+ }
688
+
689
+ .stat-value {
690
+ font-size: 1.2rem;
691
+ font-weight: 700;
692
+ background: var(--gradient-primary);
693
+ -webkit-background-clip: text;
694
+ -webkit-text-fill-color: transparent;
695
+ background-clip: text;
696
+ }
697
+
698
+ .stat-label {
699
+ font-size: 0.7rem;
700
+ color: var(--text-muted);
701
+ text-transform: uppercase;
702
+ letter-spacing: 0.5px;
703
+ }
704
+
705
+ /* --- Entities --- */
706
+ .entity-count {
707
+ font-size: 0.85rem;
708
+ color: var(--accent-cyan);
709
+ padding: 4px 12px;
710
+ background: rgba(6, 182, 212, 0.1);
711
+ border-radius: 100px;
712
+ }
713
+
714
+ .entity-categories {
715
+ display: flex;
716
+ flex-wrap: wrap;
717
+ gap: 8px;
718
+ margin-bottom: 20px;
719
+ }
720
+
721
+ .entity-category-badge {
722
+ display: flex;
723
+ align-items: center;
724
+ gap: 6px;
725
+ padding: 6px 14px;
726
+ background: var(--bg-glass);
727
+ border: 1px solid var(--border-glass);
728
+ border-radius: 100px;
729
+ font-size: 0.78rem;
730
+ font-weight: 500;
731
+ cursor: pointer;
732
+ transition: var(--transition-normal);
733
+ }
734
+
735
+ .entity-category-badge:hover {
736
+ background: var(--bg-glass-strong);
737
+ }
738
+
739
+ .entity-category-badge .cat-dot {
740
+ width: 8px;
741
+ height: 8px;
742
+ border-radius: 50%;
743
+ }
744
+
745
+ .entity-category-badge .cat-count {
746
+ font-weight: 700;
747
+ margin-left: 4px;
748
+ }
749
+
750
+ .entity-list {
751
+ display: grid;
752
+ grid-template-columns: repeat(auto-fill, minmax(280px, 1fr));
753
+ gap: 10px;
754
+ }
755
+
756
+ .entity-item {
757
+ display: flex;
758
+ align-items: center;
759
+ justify-content: space-between;
760
+ padding: 12px 16px;
761
+ background: var(--bg-glass);
762
+ border: 1px solid var(--border-glass);
763
+ border-radius: var(--radius-md);
764
+ transition: var(--transition-normal);
765
+ }
766
+
767
+ .entity-item:hover {
768
+ background: var(--bg-glass-strong);
769
+ border-color: var(--border-accent);
770
+ }
771
+
772
+ .entity-item-left {
773
+ display: flex;
774
+ align-items: center;
775
+ gap: 10px;
776
+ min-width: 0;
777
+ }
778
+
779
+ .entity-type-badge {
780
+ padding: 2px 8px;
781
+ font-size: 0.65rem;
782
+ font-weight: 700;
783
+ border-radius: 4px;
784
+ letter-spacing: 0.5px;
785
+ text-transform: uppercase;
786
+ white-space: nowrap;
787
+ flex-shrink: 0;
788
+ }
789
+
790
+ .entity-text {
791
+ font-size: 0.88rem;
792
+ font-weight: 500;
793
+ white-space: nowrap;
794
+ overflow: hidden;
795
+ text-overflow: ellipsis;
796
+ }
797
+
798
+ .entity-item-count {
799
+ font-size: 0.75rem;
800
+ color: var(--text-muted);
801
+ padding: 2px 8px;
802
+ background: var(--bg-glass-strong);
803
+ border-radius: 100px;
804
+ flex-shrink: 0;
805
+ margin-left: 8px;
806
+ }
807
+
808
+ /* Entity color mapping */
809
+ .badge-PERSON { background: rgba(236, 72, 153, 0.15); color: var(--accent-pink); }
810
+ .badge-ORG { background: rgba(59, 130, 246, 0.15); color: var(--accent-blue); }
811
+ .badge-GPE { background: rgba(16, 185, 129, 0.15); color: var(--accent-green); }
812
+ .badge-DATE { background: rgba(245, 158, 11, 0.15); color: var(--accent-yellow); }
813
+ .badge-MONEY { background: rgba(139, 92, 246, 0.15); color: var(--accent-purple); }
814
+ .badge-EVENT { background: rgba(6, 182, 212, 0.15); color: var(--accent-cyan); }
815
+ .badge-PRODUCT { background: rgba(251, 146, 60, 0.15); color: #fb923c; }
816
+ .badge-LAW { background: rgba(168, 85, 247, 0.15); color: #a855f7; }
817
+ .badge-NORP { background: rgba(244, 114, 182, 0.15); color: #f472b6; }
818
+ .badge-EMAIL { background: rgba(6, 182, 212, 0.15); color: var(--accent-cyan); }
819
+ .badge-PHONE { background: rgba(59, 130, 246, 0.15); color: var(--accent-blue); }
820
+ .badge-URL { background: rgba(16, 185, 129, 0.15); color: var(--accent-green); }
821
+ .badge-TIME { background: rgba(245, 158, 11, 0.15); color: var(--accent-yellow); }
822
+ .badge-PERCENT { background: rgba(139, 92, 246, 0.15); color: var(--accent-purple); }
823
+ .badge-CARDINAL { background: rgba(100, 116, 139, 0.15); color: #94a3b8; }
824
+ .badge-ORDINAL { background: rgba(100, 116, 139, 0.15); color: #94a3b8; }
825
+ .badge-QUANTITY { background: rgba(251, 146, 60, 0.15); color: #fb923c; }
826
+
827
+ /* --- Sentiment --- */
828
+ .sentiment-overview {
829
+ display: flex;
830
+ flex-direction: column;
831
+ gap: 20px;
832
+ }
833
+
834
+ .sentiment-gauge-container {
835
+ display: flex;
836
+ flex-direction: column;
837
+ align-items: center;
838
+ gap: 16px;
839
+ padding: 30px;
840
+ background: var(--bg-glass);
841
+ border: 1px solid var(--border-glass);
842
+ border-radius: var(--radius-lg);
843
+ backdrop-filter: blur(20px);
844
+ }
845
+
846
+ .sentiment-label-display {
847
+ font-size: 1.6rem;
848
+ font-weight: 800;
849
+ letter-spacing: -0.5px;
850
+ }
851
+
852
+ .sentiment-score {
853
+ font-size: 3rem;
854
+ font-weight: 800;
855
+ background: var(--gradient-primary);
856
+ -webkit-background-clip: text;
857
+ -webkit-text-fill-color: transparent;
858
+ background-clip: text;
859
+ line-height: 1;
860
+ }
861
+
862
+ .sentiment-bar-container {
863
+ width: 100%;
864
+ max-width: 500px;
865
+ }
866
+
867
+ .sentiment-bar {
868
+ width: 100%;
869
+ height: 12px;
870
+ border-radius: 6px;
871
+ background: var(--bg-glass-strong);
872
+ overflow: hidden;
873
+ display: flex;
874
+ }
875
+
876
+ .sentiment-bar-positive {
877
+ background: var(--accent-green);
878
+ transition: width 0.8s ease;
879
+ }
880
+
881
+ .sentiment-bar-neutral {
882
+ background: var(--text-muted);
883
+ transition: width 0.8s ease;
884
+ }
885
+
886
+ .sentiment-bar-negative {
887
+ background: var(--accent-red);
888
+ transition: width 0.8s ease;
889
+ }
890
+
891
+ .sentiment-bar-labels {
892
+ display: flex;
893
+ justify-content: space-between;
894
+ margin-top: 8px;
895
+ font-size: 0.75rem;
896
+ color: var(--text-secondary);
897
+ }
898
+
899
+ .sentiment-bar-labels span {
900
+ display: flex;
901
+ align-items: center;
902
+ gap: 6px;
903
+ }
904
+
905
+ .sentiment-bar-labels .dot {
906
+ width: 8px;
907
+ height: 8px;
908
+ border-radius: 50%;
909
+ }
910
+
911
+ .dot-pos { background: var(--accent-green); }
912
+ .dot-neu { background: var(--text-muted); }
913
+ .dot-neg { background: var(--accent-red); }
914
+
915
+ .sentiment-sentences {
916
+ background: var(--bg-glass);
917
+ border: 1px solid var(--border-glass);
918
+ border-radius: var(--radius-lg);
919
+ padding: 20px;
920
+ max-height: 400px;
921
+ overflow-y: auto;
922
+ }
923
+
924
+ .sentiment-sentences h4 {
925
+ font-size: 0.9rem;
926
+ font-weight: 600;
927
+ margin-bottom: 12px;
928
+ color: var(--text-secondary);
929
+ }
930
+
931
+ .sentence-item {
932
+ display: flex;
933
+ align-items: flex-start;
934
+ gap: 12px;
935
+ padding: 10px 0;
936
+ border-bottom: 1px solid var(--border-glass);
937
+ font-size: 0.85rem;
938
+ }
939
+
940
+ .sentence-item:last-child {
941
+ border-bottom: none;
942
+ }
943
+
944
+ .sentence-sentiment-badge {
945
+ padding: 2px 8px;
946
+ font-size: 0.65rem;
947
+ font-weight: 700;
948
+ border-radius: 4px;
949
+ white-space: nowrap;
950
+ flex-shrink: 0;
951
+ margin-top: 2px;
952
+ }
953
+
954
+ .sent-positive { background: rgba(16, 185, 129, 0.15); color: var(--accent-green); }
955
+ .sent-negative { background: rgba(239, 68, 68, 0.15); color: var(--accent-red); }
956
+ .sent-neutral { background: rgba(100, 116, 139, 0.15); color: var(--text-muted); }
957
+
958
+ .sentence-text {
959
+ color: var(--text-secondary);
960
+ line-height: 1.5;
961
+ }
962
+
963
+ /* --- Metadata --- */
964
+ .metadata-content {
965
+ background: var(--bg-glass);
966
+ border: 1px solid var(--border-glass);
967
+ border-radius: var(--radius-lg);
968
+ overflow: hidden;
969
+ }
970
+
971
+ .metadata-table {
972
+ width: 100%;
973
+ border-collapse: collapse;
974
+ }
975
+
976
+ .metadata-table tr {
977
+ border-bottom: 1px solid var(--border-glass);
978
+ }
979
+
980
+ .metadata-table tr:last-child {
981
+ border-bottom: none;
982
+ }
983
+
984
+ .metadata-table td {
985
+ padding: 12px 20px;
986
+ font-size: 0.88rem;
987
+ }
988
+
989
+ .metadata-table td:first-child {
990
+ font-weight: 600;
991
+ color: var(--text-secondary);
992
+ width: 200px;
993
+ white-space: nowrap;
994
+ }
995
+
996
+ .metadata-table td:last-child {
997
+ color: var(--text-primary);
998
+ }
999
+
1000
+ /* --- Footer --- */
1001
+ .footer {
1002
+ text-align: center;
1003
+ padding: 24px 0 12px;
1004
+ font-size: 0.75rem;
1005
+ color: var(--text-muted);
1006
+ }
1007
+
1008
+ /* --- Toast Notifications --- */
1009
+ .toast-container {
1010
+ position: fixed;
1011
+ bottom: 24px;
1012
+ right: 24px;
1013
+ z-index: 1000;
1014
+ display: flex;
1015
+ flex-direction: column;
1016
+ gap: 10px;
1017
+ }
1018
+
1019
+ .toast {
1020
+ display: flex;
1021
+ align-items: center;
1022
+ gap: 10px;
1023
+ padding: 14px 20px;
1024
+ background: var(--bg-card);
1025
+ border: 1px solid var(--border-glass);
1026
+ border-radius: var(--radius-md);
1027
+ backdrop-filter: blur(20px);
1028
+ color: var(--text-primary);
1029
+ font-size: 0.85rem;
1030
+ box-shadow: var(--shadow-lg);
1031
+ animation: slideInRight 0.3s ease, fadeOut 0.5s ease 3.5s forwards;
1032
+ max-width: 380px;
1033
+ }
1034
+
1035
+ .toast.toast-error {
1036
+ border-color: rgba(239, 68, 68, 0.3);
1037
+ }
1038
+
1039
+ .toast.toast-success {
1040
+ border-color: rgba(16, 185, 129, 0.3);
1041
+ }
1042
+
1043
+ .toast-icon {
1044
+ font-size: 1.2rem;
1045
+ flex-shrink: 0;
1046
+ }
1047
+
1048
+ /* --- Utility Classes --- */
1049
+ .hidden {
1050
+ display: none !important;
1051
+ }
1052
+
1053
+ /* --- Animations --- */
1054
+ @keyframes fadeInUp {
1055
+ from {
1056
+ opacity: 0;
1057
+ transform: translateY(20px);
1058
+ }
1059
+ to {
1060
+ opacity: 1;
1061
+ transform: translateY(0);
1062
+ }
1063
+ }
1064
+
1065
+ @keyframes fadeIn {
1066
+ from { opacity: 0; }
1067
+ to { opacity: 1; }
1068
+ }
1069
+
1070
+ @keyframes slideInRight {
1071
+ from {
1072
+ opacity: 0;
1073
+ transform: translateX(40px);
1074
+ }
1075
+ to {
1076
+ opacity: 1;
1077
+ transform: translateX(0);
1078
+ }
1079
+ }
1080
+
1081
+ @keyframes fadeOut {
1082
+ from { opacity: 1; }
1083
+ to { opacity: 0; }
1084
+ }
1085
+
1086
+ /* --- Responsive --- */
1087
+ @media (max-width: 768px) {
1088
+ .app-container {
1089
+ padding: 12px;
1090
+ }
1091
+
1092
+ .header {
1093
+ flex-direction: column;
1094
+ gap: 12px;
1095
+ padding: 14px;
1096
+ }
1097
+
1098
+ .upload-zone {
1099
+ padding: 40px 24px;
1100
+ }
1101
+
1102
+ .upload-title {
1103
+ font-size: 1.1rem;
1104
+ }
1105
+
1106
+ .tabs {
1107
+ overflow-x: auto;
1108
+ }
1109
+
1110
+ .tab {
1111
+ padding: 8px 12px;
1112
+ font-size: 0.75rem;
1113
+ }
1114
+
1115
+ .tab svg {
1116
+ display: none;
1117
+ }
1118
+
1119
+ .summary-stats {
1120
+ grid-template-columns: repeat(2, 1fr);
1121
+ }
1122
+
1123
+ .entity-list {
1124
+ grid-template-columns: 1fr;
1125
+ }
1126
+
1127
+ .file-info-bar {
1128
+ flex-direction: column;
1129
+ gap: 12px;
1130
+ align-items: flex-start;
1131
+ }
1132
+
1133
+ .file-info-right {
1134
+ width: 100%;
1135
+ justify-content: flex-end;
1136
+ }
1137
+
1138
+ .metadata-table td:first-child {
1139
+ width: 140px;
1140
+ }
1141
+
1142
+ .sentiment-score {
1143
+ font-size: 2rem;
1144
+ }
1145
+ }
1146
+
1147
+ @media (max-width: 480px) {
1148
+ .summary-stats {
1149
+ grid-template-columns: 1fr 1fr;
1150
+ }
1151
+
1152
+ .format-badge {
1153
+ font-size: 0.65rem;
1154
+ padding: 3px 8px;
1155
+ }
1156
+ }
test_api.py ADDED
@@ -0,0 +1,78 @@
+ """Quick API test script for Alldocex."""
+ import requests
+ import time
+ import json
+
+ BASE_URL = "http://localhost:8000"
+
+ # Upload the test document
+ print("Uploading test_document.docx...")
+ with open("test_document.docx", "rb") as f:
+     res = requests.post(
+         f"{BASE_URL}/api/upload",
+         files={"file": ("test_document.docx", f, "application/vnd.openxmlformats-officedocument.wordprocessingml.document")}
+     )
+
+ data = res.json()
+ print(f"Upload response: {data['status']} - File ID: {data['file_id']}")
+ task_id = data["file_id"]
+
+ # Poll for results
+ print("Waiting for processing...")
+ for i in range(30):
+     time.sleep(1)
+     res = requests.get(f"{BASE_URL}/api/status/{task_id}")
+     result = res.json()
+     status = result["status"]
+     print(f" Poll {i+1}: {status}")
+     if status in ("completed", "error"):
+         break
+
+ print(f"\n{'='*50}")
+ print(f"STATUS: {result['status']}")
+ print(f"Processing time: {round(result.get('processing_time_ms', 0), 1)} ms")
+ print(f"{'='*50}")
+
+ # Extraction
+ if result.get("extraction"):
+     ext = result["extraction"]
+     print(f"\n--- EXTRACTION ---")
+     print(f"Success: {ext['success']}")
+     print(f"Word count: {ext['metadata']['word_count']}")
+     print(f"Char count: {ext['metadata']['character_count']}")
+     print(f"File type: {ext['metadata']['file_type']}")
+     print(f"First 300 chars:\n{ext['raw_text'][:300]}")
+
+ # Summary
+ if result.get("summary"):
+     s = result["summary"]
+     print(f"\n--- SUMMARY ---")
+     print(f"Algorithm: {s['algorithm']}")
+     print(f"Original length: {s['original_length']}")
+     print(f"Summary length: {s['summary_length']}")
+     print(f"Compression: {round((1 - s['compression_ratio']) * 100, 1)}%")
+     print(f"Summary:\n{s['summary'][:500]}")
+
+ # Entities
+ if result.get("entities"):
+     e = result["entities"]
+     print(f"\n--- ENTITIES ---")
+     print(f"Total entities: {e['total_entities']}")
+     print(f"Categories: {json.dumps(e['entity_counts'], indent=2)}")
+     for ent in e["entities"][:20]:
+         print(f" [{ent['label']:8s}] {ent['text']} (x{ent['count']})")
+
+ # Sentiment
+ if result.get("sentiment"):
+     sent = result["sentiment"]
+     print(f"\n--- SENTIMENT ---")
+     print(f"Label: {sent['overall_label']}")
+     print(f"Compound: {sent['overall_compound']}")
+     print(f"Positive: {sent['overall_positive']}")
+     print(f"Negative: {sent['overall_negative']}")
+     print(f"Neutral: {sent['overall_neutral']}")
+     print(f"Sentence breakdowns: {len(sent['sentence_breakdown'])}")
+     for sb in sent["sentence_breakdown"][:5]:
+         print(f" [{sb['label']:15s}] {sb['text'][:80]}...")
+
+ print("\n=== TEST COMPLETE ===")
test_simple.py ADDED
@@ -0,0 +1,49 @@
+ """Simple API test - writes results to file."""
+ import requests, time, json
+
+ BASE = "http://localhost:8000"
+ out = []
+
+ with open("test_document.docx", "rb") as f:
+     res = requests.post(f"{BASE}/api/upload", files={"file": ("test_document.docx", f)})
+ data = res.json()
+ task_id = data["file_id"]
+ out.append(f"Upload: {data['status']} (ID: {task_id})")
+
+ for i in range(30):
+     time.sleep(1)
+     res = requests.get(f"{BASE}/api/status/{task_id}")
+     result = res.json()
+     if result["status"] in ("completed", "error"):
+         break
+
+ out.append(f"Status: {result['status']}")
+ out.append(f"Time: {round(result.get('processing_time_ms', 0))}ms")
+
+ if result.get("extraction"):
+     e = result["extraction"]
+     out.append(f"\nEXTRACTION: success={e['success']}, words={e['metadata']['word_count']}")
+     out.append(f"Text preview: {e['raw_text'][:200]}...")
+
+ if result.get("summary"):
+     s = result["summary"]
+     out.append(f"\nSUMMARY ({s['algorithm']}): compression={round((1-s['compression_ratio'])*100)}%")
+     out.append(s["summary"][:400])
+
+ if result.get("entities"):
+     ent = result["entities"]
+     out.append(f"\nENTITIES: {ent['total_entities']} found")
+     out.append(f"Categories: {json.dumps(ent['entity_counts'])}")
+     for e in ent["entities"][:15]:
+         out.append(f" [{e['label']}] {e['text']} (x{e['count']})")
+
+ if result.get("sentiment"):
+     s = result["sentiment"]
+     out.append(f"\nSENTIMENT: {s['overall_label']} (compound={s['overall_compound']})")
+     out.append(f"Pos={s['overall_positive']} Neg={s['overall_negative']} Neu={s['overall_neutral']}")
+
+ out.append("\nDONE")
+ text = "\n".join(out)
+ with open("test_output.txt", "w", encoding="utf-8") as f:
+     f.write(text)
+ print(text)
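
Note on the polling loops: both test scripts poll `/api/status/{task_id}` up to 30 times and then carry on even if the task never reached `completed` or `error`. The sketch below is one possible way to factor that loop into a reusable helper that raises instead of continuing silently. It is not part of this commit. The names `poll_until_done`, `fetch_status`, and `TERMINAL_STATES` are hypothetical, and the status fetcher is passed in as a callable so the helper can be exercised without a running server.

```python
import time
from typing import Any, Callable, Dict

# The two states both test scripts treat as "stop polling".
TERMINAL_STATES = ("completed", "error")


def poll_until_done(
    fetch_status: Callable[[str], Dict[str, Any]],
    task_id: str,
    max_polls: int = 30,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status(task_id) until it reports a terminal state.

    fetch_status is a hypothetical callable returning the decoded JSON body
    of the status endpoint as a dict. Raises TimeoutError if no terminal
    state is seen within max_polls attempts.
    """
    for _ in range(max_polls):
        sleep(interval)  # same wait-then-poll order as the scripts
        result = fetch_status(task_id)
        if result.get("status") in TERMINAL_STATES:
            return result
    raise TimeoutError(f"task {task_id} still running after {max_polls} polls")
```

With `requests`, the fetcher could be something like `lambda tid: requests.get(f"{BASE_URL}/api/status/{tid}", timeout=10).json()`. The 10-second timeout is an illustrative value, not one taken from the scripts.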