krishnachoudhary-hclguvi committed on
Deploy text extraction API files
- DEPLOYMENT.md +64 -0
- Dockerfile +40 -0
- README.md +64 -10
- analyzers/__init__.py +1 -0
- analyzers/ner_extractor.py +145 -0
- analyzers/sentiment.py +103 -0
- analyzers/summarizer.py +98 -0
- config.py +77 -0
- docker-compose.yml +13 -0
- extractors/__init__.py +1 -0
- extractors/docx_extractor.py +95 -0
- extractors/ocr_extractor.py +245 -0
- extractors/pdf_extractor.py +79 -0
- extractors/url_extractor.py +108 -0
- main.py +307 -0
- models/__init__.py +1 -0
- models/schemas.py +116 -0
- requirements.txt +16 -0
- static/app.js +586 -0
- static/index.html +268 -0
- static/styles.css +1156 -0
- test_api.py +78 -0
- test_simple.py +49 -0
DEPLOYMENT.md
ADDED
@@ -0,0 +1,64 @@
+# Alldocex - Deployment Guide
+
+This guide provides three main options for deploying the Alldocex application to a production environment.
+
+## 🏗️ Option 1: Docker (Recommended)
+
+Docker is the recommended option because it packages the Python dependencies, system libraries, and the spaCy model into a single image.
+
+### 1. Build the image
+```bash
+docker build -t alldocex-app .
+```
+
+### 2. Run with Docker Compose
+```bash
+docker-compose up -d
+```
+The application will be available at `http://localhost:7860`, the port published in `docker-compose.yml`.
+
+---
+
+## ☁️ Option 2: Cloud Deployment (Render / Railway / Fly.io)
+
+### **Render Deployment (Recommended)**
+1. **Connect GitHub**: Push your code to a GitHub repository.
+2. **Create Web Service**: Select "Web Service" in Render.
+3. **Docker Environment**: Render will automatically detect the `Dockerfile`.
+4. **Resource Plan**: Ensure you select a plan with at least **4GB RAM** (e.g., Starter or Pro).
+5. **Environment Variables**: Add `PORT = 8000` if required.
+
+---
+
+## 🖥️ Option 3: Manual Deployment (Ubuntu/Debian Server)
+
+If you are deploying directly to a Linux VPS (without Docker):
+
+### 1. Install System Dependencies
+```bash
+sudo apt-get update
+sudo apt-get install -y tesseract-ocr libgl1-mesa-glx libglib2.0-0 build-essential python3-venv
+```
+
+### 2. Set Up Environment
+```bash
+python3 -m venv venv
+source venv/bin/activate
+pip install -r requirements.txt
+python -m spacy download en_core_web_sm
+```
+
+### 3. Run with Gunicorn (Production Server)
+```bash
+pip install gunicorn
+gunicorn -w 4 -k uvicorn.workers.UvicornWorker main:app --bind 0.0.0.0:8000
+```
+
+---
+
+## ⚠️ Important Considerations
+
+* **RAM**: AI models (EasyOCR, Torch, spaCy) are memory-intensive. Do NOT deploy on a "Free Tier" instance with only 512MB or 1GB of RAM.
+* **Disk Space**: The first time you run the app, it will download several hundred megabytes of model weights.
+* **Permissions**: Ensure the `uploads/` directory has write permissions for the user running the application.
+* **Reverse Proxy**: For public deployment, it is highly recommended to use **Nginx** as a reverse proxy with SSL (Let's Encrypt).
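Review note (not part of the commit): after `docker-compose up -d`, it can be useful to confirm that the published port accepts connections before opening a browser. The following is a minimal standard-library sketch; the function name `is_port_open`, the host, and the port value are assumptions for illustration, and the port should be whichever one the container actually publishes.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Assumed values: adjust to the port mapping used in docker-compose.yml.
    print(is_port_open("127.0.0.1", 7860))
```

A plain TCP check only shows that something is listening; it does not confirm that the models have finished loading.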
Dockerfile
ADDED
@@ -0,0 +1,40 @@
+# Use a lean Python 3.10 image as base
+FROM python:3.10-slim
+
+# Set environment variables
+ENV PYTHONDONTWRITEBYTECODE=1
+ENV PYTHONUNBUFFERED=1
+ENV DEBIAN_FRONTEND=noninteractive
+
+# Set working directory
+WORKDIR /app
+
+# Install system dependencies for OCR and NLP
+RUN apt-get update && apt-get install -y \
+    tesseract-ocr \
+    libgl1-mesa-glx \
+    libglib2.0-0 \
+    build-essential \
+    && apt-get clean \
+    && rm -rf /var/lib/apt/lists/*
+
+# Copy requirements file
+COPY requirements.txt .
+
+# Install Python dependencies
+RUN pip install --no-cache-dir -r requirements.txt
+
+# Download spaCy model during build to improve runtime performance
+RUN python -m spacy download en_core_web_sm
+
+# Create uploads directory
+RUN mkdir -p /app/uploads
+
+# Copy the rest of the application code
+COPY . .
+
+# Expose the API port for Hugging Face Spaces
+EXPOSE 7860
+
+# Start the application using Uvicorn
+CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"]
README.md
CHANGED
@@ -1,11 +1,65 @@
+# Alldocex - Intelligent Document Processing System
+
+
+
+
+**Alldocex** is a high-performance, professional-grade document intelligence platform that extracts, analyzes, and summarizes content from various document formats using state-of-the-art AI.
+
+## 🚀 Key Features
+
+* **Multi-Format Extraction**: Supports PDF, DOCX, and high-resolution images (PNG, JPG, TIFF, etc.).
+* **Layout-Aware PDF Engine**: Uses advanced 'layout' mode to preserve columns, tables, and physical text positioning.
+* **Intelligent OCR**: Powered by **EasyOCR** (Deep Learning based) for superior accuracy in scanned documents.
+* **Web URL Summarization**: Paste any web link to instantly extract and analyze its core content.
+* **AI Analysis Suite**:
+    * **Extractive Summarization**: Condenses long documents into key highlights.
+    * **Named Entity Recognition (NER)**: Detects People, Organizations, Dates, and more via **spaCy**.
+    * **Sentiment Analysis**: Analyzes emotional tone using the **VADER** algorithm.
+* **Downloadable Results**: Export extracted text as clean `.txt` files.
+* **Corporate UI**: A professional Blue & White dashboard with smooth animations and intuitive navigation.
+
+## 🛠️ Technology Stack
+
+* **Backend**: [FastAPI](https://fastapi.tiangolo.com/) (Async Python)
+* **PDF Processing**: [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) (Layout Mode)
+* **OCR**: [EasyOCR](https://github.com/JaidedAI/EasyOCR) & [Tesseract](https://github.com/tesseract-ocr/tesseract)
+* **NLP**: [spaCy](https://spacy.io/) & [Sumy](https://github.com/miso-belica/sumy)
+* **Frontend**: Vanilla HTML5, CSS3 (Modern UI), and JavaScript (ES6+)
+
+## 📦 Installation
 
+### 1. Clone the repository
+```bash
+git clone <your-repo-url>
+cd guvi-extraction
+```
+
+### 2. Install dependencies
+```bash
+pip install -r requirements.txt
+```
+
+### 3. Install NLP model
+```bash
+python -m spacy download en_core_web_sm
+```
+
+## 🏃 Getting Started
+
+1. Start the backend server:
+   ```bash
+   python main.py
+   ```
+2. Open your browser and navigate to:
+   `http://localhost:7860`
+
+
+## 📘 Usage
+
+1. **Direct Upload**: Drag and drop your PDFs or images into the dashboard.
+2. **Format Selection**: Click on specific badges (PDF, PNG, JPG) to open a filtered file picker.
+3. **URL Entry**: Paste a web link to summarize online articles instantly.
+4. **Download**: Once processing is complete, use the **Download** button to save the extracted text.
+
+
+---
analyzers/__init__.py
ADDED
@@ -0,0 +1 @@
+# Analyzers package
analyzers/ner_extractor.py
ADDED
@@ -0,0 +1,145 @@
+"""
+Named Entity Recognition using spaCy.
+Extracts persons, organizations, dates, monetary amounts, locations, and more.
+Also uses regex patterns for additional entity types.
+"""
+import re
+from collections import Counter
+from typing import List, Dict
+from models.schemas import Entity, EntityResult
+from config import SPACY_MODEL, NER_ENTITY_TYPES
+
+# Try to load spaCy model
+try:
+    import spacy
+    nlp = spacy.load(SPACY_MODEL)
+    SPACY_AVAILABLE = True
+except (ImportError, OSError):
+    SPACY_AVAILABLE = False
+    nlp = None
+
+# Entity label descriptions
+LABEL_DESCRIPTIONS = {
+    "PERSON": "Person name",
+    "ORG": "Organization",
+    "GPE": "Country / City / State",
+    "DATE": "Date or period",
+    "MONEY": "Monetary value",
+    "TIME": "Time expression",
+    "PERCENT": "Percentage",
+    "EVENT": "Named event",
+    "PRODUCT": "Product name",
+    "LAW": "Law or regulation",
+    "NORP": "Nationality / Group",
+    "FAC": "Facility / Building",
+    "LOC": "Non-GPE location",
+    "WORK_OF_ART": "Title of work",
+    "LANGUAGE": "Language name",
+    "CARDINAL": "Number",
+    "ORDINAL": "Ordinal number",
+    "QUANTITY": "Measurement",
+    "EMAIL": "Email address",
+    "PHONE": "Phone number",
+    "URL": "Web URL",
+}
+
+# Regex patterns for additional entity types
+REGEX_PATTERNS = {
+    "EMAIL": r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
+    "PHONE": r'(?:\+?\d{1,3}[-.\s]?)?\(?\d{2,4}\)?[-.\s]?\d{3,4}[-.\s]?\d{3,4}',
+    "URL": r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+[/\w\-._~:/?#\[\]@!$&\'()*+,;=%]*',
+}
+
+
+def _extract_regex_entities(text: str) -> List[Entity]:
+    """Extract entities using regex patterns."""
+    entities = []
+    for label, pattern in REGEX_PATTERNS.items():
+        matches = re.findall(pattern, text)
+        if matches:
+            counted = Counter(matches)
+            for match_text, count in counted.most_common():
+                entities.append(Entity(
+                    text=match_text,
+                    label=label,
+                    label_description=LABEL_DESCRIPTIONS.get(label, label),
+                    count=count,
+                ))
+    return entities
+
+
+def _extract_spacy_entities(text: str) -> List[Entity]:
+    """Extract entities using spaCy NER."""
+    if not SPACY_AVAILABLE or nlp is None:
+        return []
+
+    # Process text (long inputs are truncated to max_length characters)
+    max_length = 100000
+    if len(text) > max_length:
+        text = text[:max_length]
+
+    doc = nlp(text)
+
+    # Collect and deduplicate entities
+    entity_map: Dict[str, Dict] = {}
+
+    for ent in doc.ents:
+        if ent.label_ not in NER_ENTITY_TYPES:
+            continue
+
+        clean_text = ent.text.strip()
+        if not clean_text or len(clean_text) < 2:
+            continue
+
+        key = f"{ent.label_}:{clean_text.lower()}"
+        if key in entity_map:
+            entity_map[key]["count"] += 1
+            entity_map[key]["positions"].append(ent.start_char)
+        else:
+            entity_map[key] = {
+                "text": clean_text,
+                "label": ent.label_,
+                "label_description": LABEL_DESCRIPTIONS.get(ent.label_, ent.label_),
+                "count": 1,
+                "positions": [ent.start_char],
+            }
+
+    # Convert to Entity objects and sort by count
+    entities = [
+        Entity(**data)
+        for data in sorted(entity_map.values(), key=lambda x: x["count"], reverse=True)
+    ]
+
+    return entities
+
+
+def extract_entities(text: str) -> EntityResult:
+    """
+    Extract named entities from text using spaCy and regex patterns.
+
+    Args:
+        text: The input text to analyze.
+
+    Returns:
+        EntityResult with all found entities and statistics.
+    """
+    if not text.strip():
+        return EntityResult(entities=[], entity_counts={}, total_entities=0)
+
+    # Get entities from both sources
+    spacy_entities = _extract_spacy_entities(text)
+    regex_entities = _extract_regex_entities(text)
+
+    # Combine (spaCy entities first, then regex)
+    all_entities = spacy_entities + regex_entities
+
+    # Count by category
+    entity_counts: Dict[str, int] = {}
+    for ent in all_entities:
+        entity_counts[ent.label] = entity_counts.get(ent.label, 0) + ent.count
+
+    return EntityResult(
+        entities=all_entities,
+        entity_counts=entity_counts,
+        total_entities=sum(ent.count for ent in all_entities),
+    )
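Review note (not part of the commit): the regex branch of `analyzers/ner_extractor.py` counts duplicate matches with `Counter` and emits one entity per distinct string. The sketch below isolates that idea for email addresses only. `SimpleEntity` is a hypothetical stand-in for `models.schemas.Entity`, whose real fields are not shown in this diff, and the character class `[A-Za-z]` is used for the top-level domain so that a literal `|` is not accepted.

```python
import re
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class SimpleEntity:
    """Hypothetical stand-in for models.schemas.Entity."""
    text: str
    label: str
    count: int


# Email pattern with a plain letter class for the top-level domain.
EMAIL_PATTERN = r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"


def extract_emails(text: str) -> List[SimpleEntity]:
    """Return one entity per distinct address, most frequent first."""
    counted = Counter(re.findall(EMAIL_PATTERN, text))
    return [
        SimpleEntity(text=match, label="EMAIL", count=count)
        for match, count in counted.most_common()
    ]


sample = "Contact ana@example.com or bob@example.org; ana@example.com replies fastest."
entities = extract_emails(sample)
```

Because `re.findall` returns whole matches only when the pattern has no capturing groups, any pattern added alongside this one would need non-capturing groups, as the patterns in the commit already use.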
analyzers/sentiment.py
ADDED
@@ -0,0 +1,103 @@
+"""
+Sentiment analysis using NLTK's VADER (Valence Aware Dictionary and sEntiment Reasoner).
+Provides both overall and sentence-level sentiment analysis.
+"""
+import nltk
+from nltk.sentiment.vader import SentimentIntensityAnalyzer
+from nltk.tokenize import sent_tokenize
+from models.schemas import SentimentResult, SentimentBreakdown
+from config import SENTIMENT_THRESHOLDS
+from typing import List
+
+# Download required NLTK data
+try:
+    nltk.data.find("sentiment/vader_lexicon.zip")
+except LookupError:
+    nltk.download("vader_lexicon", quiet=True)
+
+try:
+    nltk.data.find("tokenizers/punkt_tab")
+except LookupError:
+    nltk.download("punkt_tab", quiet=True)
+
+# Initialize analyzer
+sia = SentimentIntensityAnalyzer()
+
+
+def _get_sentiment_label(compound: float) -> str:
+    """Convert compound score to human-readable label."""
+    if compound >= 0.5:
+        return "Very Positive"
+    elif compound >= SENTIMENT_THRESHOLDS["positive"]:
+        return "Positive"
+    elif compound <= -0.5:
+        return "Very Negative"
+    elif compound <= SENTIMENT_THRESHOLDS["negative"]:
+        return "Negative"
+    else:
+        return "Neutral"
+
+
+def analyze_sentiment(text: str) -> SentimentResult:
+    """
+    Perform sentiment analysis on the given text.
+
+    Returns overall sentiment scores and sentence-level breakdown.
+
+    Args:
+        text: The input text to analyze.
+
+    Returns:
+        SentimentResult with overall and per-sentence sentiment analysis.
+    """
+    if not text.strip():
+        return SentimentResult(
+            overall_compound=0.0,
+            overall_positive=0.0,
+            overall_negative=0.0,
+            overall_neutral=1.0,
+            overall_label="Neutral",
+            sentence_breakdown=[],
+            confidence=0.0,
+        )
+
+    # Overall sentiment
+    overall_scores = sia.polarity_scores(text)
+
+    # Sentence-level breakdown
+    sentences = sent_tokenize(text)
+    sentence_breakdown: List[SentimentBreakdown] = []
+
+    # Limit to first 50 sentences for performance
+    for sent in sentences[:50]:
+        sent = sent.strip()
+        if not sent or len(sent) < 5:
+            continue
+
+        scores = sia.polarity_scores(sent)
+        sentence_breakdown.append(SentimentBreakdown(
+            text=sent[:200],  # Truncate very long sentences
+            compound=round(scores["compound"], 4),
+            positive=round(scores["pos"], 4),
+            negative=round(scores["neg"], 4),
+            neutral=round(scores["neu"], 4),
+            label=_get_sentiment_label(scores["compound"]),
+        ))
+
+    # Estimate confidence from the average magnitude of sentence-level compound scores
+    if sentence_breakdown:
+        compounds = [sb.compound for sb in sentence_breakdown]
+        avg_magnitude = sum(abs(c) for c in compounds) / len(compounds)
+        confidence = min(avg_magnitude * 2, 1.0)  # Scale to 0-1
+    else:
+        confidence = abs(overall_scores["compound"])
+
+    return SentimentResult(
+        overall_compound=round(overall_scores["compound"], 4),
+        overall_positive=round(overall_scores["pos"], 4),
+        overall_negative=round(overall_scores["neg"], 4),
+        overall_neutral=round(overall_scores["neu"], 4),
+        overall_label=_get_sentiment_label(overall_scores["compound"]),
+        sentence_breakdown=sentence_breakdown,
+        confidence=round(confidence, 4),
+    )
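Review note (not part of the commit): the label thresholds and the confidence heuristic in `analyzers/sentiment.py` can be exercised without NLTK. The sketch below re-implements both in plain Python under assumed names (`label_for`, `confidence_from`); the threshold constants mirror the values of `SENTIMENT_THRESHOLDS` in `config.py`.

```python
from typing import List

# Assumed to mirror SENTIMENT_THRESHOLDS in config.py.
POSITIVE_THRESHOLD = 0.05
NEGATIVE_THRESHOLD = -0.05


def label_for(compound: float) -> str:
    """Map a VADER-style compound score in [-1, 1] to a five-way label."""
    if compound >= 0.5:
        return "Very Positive"
    if compound >= POSITIVE_THRESHOLD:
        return "Positive"
    if compound <= -0.5:
        return "Very Negative"
    if compound <= NEGATIVE_THRESHOLD:
        return "Negative"
    return "Neutral"


def confidence_from(compounds: List[float]) -> float:
    """Average absolute compound score, doubled and capped at 1.0."""
    if not compounds:
        return 0.0
    avg_magnitude = sum(abs(c) for c in compounds) / len(compounds)
    return min(avg_magnitude * 2, 1.0)
```

Note that this heuristic measures how strong the sentence scores are, not whether they agree: `confidence_from([0.6, -0.6])` returns `1.0` even though the two sentences point in opposite directions.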
analyzers/summarizer.py
ADDED
@@ -0,0 +1,98 @@
+"""
+Extractive text summarization using sumy library.
+Uses LexRank algorithm by default for graph-based sentence ranking.
+"""
+from sumy.parsers.plaintext import PlaintextParser
+from sumy.nlp.tokenizers import Tokenizer
+from sumy.summarizers.lex_rank import LexRankSummarizer
+from sumy.summarizers.lsa import LsaSummarizer
+from sumy.summarizers.luhn import LuhnSummarizer
+from sumy.nlp.stemmers import Stemmer
+from sumy.utils import get_stop_words
+from models.schemas import SummaryResult
+from config import SUMMARY_SENTENCE_COUNT, SUMMARY_ALGORITHM
+
+LANGUAGE = "english"
+
+
+def _get_summarizer(algorithm: str):
+    """Get the appropriate summarizer based on algorithm name."""
+    stemmer = Stemmer(LANGUAGE)
+
+    if algorithm == "lsa":
+        summarizer = LsaSummarizer(stemmer)
+    elif algorithm == "luhn":
+        summarizer = LuhnSummarizer(stemmer)
+    else:  # default to lex-rank
+        summarizer = LexRankSummarizer(stemmer)
+
+    summarizer.stop_words = get_stop_words(LANGUAGE)
+    return summarizer
+
+
+def summarize_text(text: str, sentence_count: int = None, algorithm: str = None) -> SummaryResult:
+    """
+    Generate an extractive summary of the given text.
+
+    Args:
+        text: The input text to summarize.
+        sentence_count: Number of sentences in the summary (default from config).
+        algorithm: Summarization algorithm to use (default from config).
+
+    Returns:
+        SummaryResult with the summary and statistics.
+    """
+    if sentence_count is None:
+        sentence_count = SUMMARY_SENTENCE_COUNT
+    if algorithm is None:
+        algorithm = SUMMARY_ALGORITHM
+
+    # Handle short texts
+    sentences_in_text = [s.strip() for s in text.replace("\n", " ").split(".") if s.strip()]
+    if len(sentences_in_text) <= sentence_count:
+        # Text is already short enough
+        clean_text = " ".join(text.split())
+        return SummaryResult(
+            summary=clean_text,
+            original_length=len(text),
+            summary_length=len(clean_text),
+            compression_ratio=1.0,
+            sentence_count=len(sentences_in_text),
+            algorithm=algorithm,
+        )
+
+    try:
+        # Parse the text
+        parser = PlaintextParser.from_string(text, Tokenizer(LANGUAGE))
+        summarizer = _get_summarizer(algorithm)
+
+        # Generate summary
+        summary_sentences = summarizer(parser.document, sentence_count)
+        summary = " ".join(str(sentence) for sentence in summary_sentences)
+
+        if not summary.strip():
+            # Fallback: return first N sentences
+            summary = ". ".join(sentences_in_text[:sentence_count]) + "."
+
+        compression_ratio = len(summary) / len(text) if len(text) > 0 else 1.0
+
+        return SummaryResult(
+            summary=summary,
+            original_length=len(text),
+            summary_length=len(summary),
+            compression_ratio=round(compression_ratio, 4),
+            sentence_count=sentence_count,
+            algorithm=algorithm,
+        )
+
+    except Exception:
+        # Fallback: return first few sentences
+        fallback = ". ".join(sentences_in_text[:sentence_count]) + "."
+        return SummaryResult(
+            summary=fallback,
+            original_length=len(text),
+            summary_length=len(fallback),
+            compression_ratio=round(len(fallback) / len(text), 4) if len(text) > 0 else 1.0,
+            sentence_count=sentence_count,
+            algorithm=f"{algorithm} (fallback)",
+        )
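Review note (not part of the commit): the short-text check and the fallback path in `analyzers/summarizer.py` both split on `"."` to approximate sentences. The sketch below reproduces that split and the compression-ratio arithmetic in isolation so the behavior is easy to inspect; the helper names `naive_sentences` and `compression_ratio` are assumptions, not functions from the commit.

```python
from typing import List


def naive_sentences(text: str) -> List[str]:
    """Approximate sentence split: break on every period, drop empty pieces."""
    return [s.strip() for s in text.replace("\n", " ").split(".") if s.strip()]


def compression_ratio(summary: str, original: str) -> float:
    """Length of the summary relative to the original, rounded to 4 places."""
    return round(len(summary) / len(original), 4) if original else 1.0


# A version number contains a period, so it is counted as a sentence break.
pieces = naive_sentences("Use Python 3.10. It works.")
```

Here `pieces` has three items rather than two, because `3.10` is split at its period. Text with decimals or abbreviations can therefore be treated as longer than it is, which may matter near the `sentence_count` boundary.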
config.py
ADDED
@@ -0,0 +1,77 @@
+"""
+Configuration settings for the Document Processing System.
+"""
+import os
+import shutil
+
+# --- Paths ---
+BASE_DIR = os.path.dirname(os.path.abspath(__file__))
+UPLOAD_DIR = os.path.join(BASE_DIR, "uploads")
+STATIC_DIR = os.path.join(BASE_DIR, "static")
+
+# Create uploads directory if it doesn't exist
+os.makedirs(UPLOAD_DIR, exist_ok=True)
+
+# --- File Upload Settings ---
+MAX_FILE_SIZE_MB = 50
+MAX_FILE_SIZE_BYTES = MAX_FILE_SIZE_MB * 1024 * 1024
+ALLOWED_EXTENSIONS = {
+    "pdf": "application/pdf",
+    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
+    "png": "image/png",
+    "jpg": "image/jpeg",
+    "jpeg": "image/jpeg",
+    "tiff": "image/tiff",
+    "bmp": "image/bmp",
+    "webp": "image/webp",
+}
+
+# --- OCR Configuration ---
+# EasyOCR settings
+EASYOCR_LANGS = ["en"]  # Languages to support
+EASYOCR_GPU = False  # Set to True if NVIDIA GPU is available and CUDA is installed
+
+# Keep Tesseract as fallback if needed, but prioritize EasyOCR for accuracy
+def find_tesseract():
+    """Auto-detect the Tesseract binary: check PATH first, then common Windows install paths."""
+    # shutil is already imported at module level; a PATH lookup works on any OS
+    tesseract_in_path = shutil.which("tesseract")
+    if tesseract_in_path:
+        return tesseract_in_path
+
+    common_paths = [
+        r"C:\Program Files\Tesseract-OCR\tesseract.exe",
+        r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
+        r"C:\Users\{}\AppData\Local\Tesseract-OCR\tesseract.exe".format(os.getenv("USERNAME", "")),
+    ]
+    for path in common_paths:
+        if os.path.isfile(path):
+            return path
+    return None
+
+TESSERACT_CMD = find_tesseract()
+TESSERACT_LANG = "eng"
+
+def check_ocr_availability():
+    """Check if any OCR engine is available."""
+    try:
+        import easyocr
+        return "available"
+    except ImportError:
+        if TESSERACT_CMD:
+            return "tesseract-only"
+        return "not-found"
+
+# --- Summarization Settings ---
+SUMMARY_SENTENCE_COUNT = 5
+SUMMARY_ALGORITHM = "lex-rank"  # Options: lex-rank, lsa, luhn (any other value falls back to lex-rank)
+
+# --- NER Settings ---
+SPACY_MODEL = "en_core_web_sm"
+NER_ENTITY_TYPES = ["PERSON", "ORG", "DATE", "MONEY", "GPE", "EVENT", "PRODUCT", "LAW", "NORP"]
+
+# --- Sentiment Settings ---
+SENTIMENT_THRESHOLDS = {
+    "positive": 0.05,
+    "negative": -0.05,
+}
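Review note (not part of the commit): `config.py` defines `ALLOWED_EXTENSIONS` and `MAX_FILE_SIZE_BYTES`, but the code that applies them lives in `main.py`, which is outside this excerpt. The sketch below shows one way such limits could be enforced. The function `validate_upload` and its error messages are hypothetical, and the extension set is reduced to the keys of the config mapping.

```python
import os
from typing import Optional

# Values copied from config.py; only the extension keys are needed here.
MAX_FILE_SIZE_BYTES = 50 * 1024 * 1024
ALLOWED_EXTENSIONS = {"pdf", "docx", "png", "jpg", "jpeg", "tiff", "bmp", "webp"}


def validate_upload(filename: str, size_bytes: int) -> Optional[str]:
    """Return an error message, or None when the upload passes both checks."""
    # Compare the extension case-insensitively, without the leading dot.
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    if ext not in ALLOWED_EXTENSIONS:
        return f"Unsupported file type: .{ext or '?'}"
    if size_bytes > MAX_FILE_SIZE_BYTES:
        return "File exceeds the 50 MB limit"
    return None
```

Checking the extension alone trusts the client-supplied filename; a stricter variant could also compare the uploaded content type against the MIME values stored in the config mapping.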
docker-compose.yml
ADDED
@@ -0,0 +1,13 @@
+version: '3.8'
+
+services:
+  alldocex:
+    build: .
+    container_name: alldocex-app
+    ports:
+      - "7860:7860"
+    volumes:
+      - ./uploads:/app/uploads
+    environment:
+      - PORT=7860
+    restart: always
extractors/__init__.py
ADDED
@@ -0,0 +1 @@
+# Extractors package
extractors/docx_extractor.py
ADDED
@@ -0,0 +1,95 @@
| 1 |
+
"""
|
| 2 |
+
DOCX text extraction using python-docx.
|
| 3 |
+
Extracts text preserving paragraph structure, tables, and document properties.
|
| 4 |
+
"""
|
| 5 |
+
import time
|
| 6 |
+
import os
|
| 7 |
+
from docx import Document
|
| 8 |
+
from models.schemas import ExtractionResult, DocumentMetadata
|
| 9 |
+
|
| 10 |
+
|
| 11 |
+
def extract_docx(file_path: str) -> ExtractionResult:
|
| 12 |
+
"""Extract text and metadata from a DOCX file."""
|
| 13 |
+
start_time = time.time()
|
| 14 |
+
|
| 15 |
+
try:
|
| 16 |
+
doc = Document(file_path)
|
| 17 |
+
|
| 18 |
+
# Extract paragraphs
|
| 19 |
+
paragraphs = []
|
| 20 |
+
for para in doc.paragraphs:
|
| 21 |
+
text = para.text.strip()
|
| 22 |
+
if text:
|
| 23 |
+
# Preserve heading structure
|
| 24 |
+
if para.style and para.style.name.startswith("Heading"):
|
| 25 |
+
level = para.style.name.replace("Heading ", "").strip()
|
| 26 |
+
prefix = "#" * int(level) if level.isdigit() else "##"
|
| 27 |
+
paragraphs.append(f"{prefix} {text}")
|
| 28 |
+
else:
|
| 29 |
+
paragraphs.append(text)
|
| 30 |
+
|
| 31 |
+
# Extract tables
|
| 32 |
+
tables_text = []
|
| 33 |
+
for table_idx, table in enumerate(doc.tables):
|
| 34 |
+
table_data = []
|
| 35 |
+
for row in table.rows:
|
| 36 |
+
row_data = [cell.text.strip() for cell in row.cells]
|
| 37 |
+
table_data.append(" | ".join(row_data))
|
| 38 |
+
if table_data:
|
| 39 |
+
tables_text.append(f"\n[Table {table_idx + 1}]\n" + "\n".join(table_data))
|
| 40 |
+
|
| 41 |
+
# Combine all text
|
| 42 |
+
full_text = "\n\n".join(paragraphs)
|
| 43 |
+
if tables_text:
|
| 44 |
+
full_text += "\n\n" + "\n".join(tables_text)
|
| 45 |
+
|
| 46 |
+
# Extract metadata from core properties
|
| 47 |
+
props = doc.core_properties
|
| 48 |
+
metadata = DocumentMetadata(
|
| 49 |
+
title=props.title or os.path.basename(file_path),
|
| 50 |
+
author=props.author or "Unknown",
|
| 51 |
+
creation_date=str(props.created) if props.created else "",
|
| 52 |
+
modification_date=str(props.modified) if props.modified else "",
|
| 53 |
+
page_count=None, # DOCX doesn't expose page count easily
|
| 54 |
+
word_count=len(full_text.split()) if full_text else 0,
|
| 55 |
+
character_count=len(full_text),
|
| 56 |
+
file_type="DOCX",
|
| 57 |
+
extra={
|
| 58 |
+
"category": props.category or "",
|
| 59 |
+
"comments": props.comments or "",
|
| 60 |
+
"last_modified_by": props.last_modified_by or "",
|
| 61 |
+
"revision": props.revision,
|
| 62 |
+
"subject": props.subject or "",
|
| 63 |
+
"keywords": props.keywords or "",
|
| 64 |
+
"paragraph_count": len(doc.paragraphs),
|
| 65 |
+
"table_count": len(doc.tables),
|
| 66 |
+
}
|
| 67 |
+
)
|
| 68 |
+
|
| 69 |
+
elapsed = (time.time() - start_time) * 1000
|
| 70 |
+
|
| 71 |
+
if not full_text.strip():
|
| 72 |
+
return ExtractionResult(
|
| 73 |
+
raw_text="",
|
| 74 |
+
metadata=metadata,
|
| 75 |
+
success=False,
|
| 76 |
+
error_message="No text content found in the DOCX file.",
|
| 77 |
+
extraction_time_ms=elapsed,
|
| 78 |
+
)
|
| 79 |
+
|
| 80 |
+
return ExtractionResult(
|
| 81 |
+
raw_text=full_text,
|
| 82 |
+
metadata=metadata,
|
| 83 |
+
success=True,
|
| 84 |
+
extraction_time_ms=elapsed,
|
| 85 |
+
)
|
| 86 |
+
|
| 87 |
+
except Exception as e:
|
| 88 |
+
elapsed = (time.time() - start_time) * 1000
|
| 89 |
+
return ExtractionResult(
|
| 90 |
+
raw_text="",
|
| 91 |
+
metadata=DocumentMetadata(file_type="DOCX"),
|
| 92 |
+
success=False,
|
| 93 |
+
error_message=f"DOCX extraction failed: {str(e)}",
|
| 94 |
+
extraction_time_ms=elapsed,
|
| 95 |
+
)
|
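Note on the heading handling in `extract_docx` above: the style-name-to-prefix mapping can be isolated and checked on its own. The sketch below is illustrative only; `heading_prefix` and its `default` parameter are hypothetical names, not part of the committed file.

```python
def heading_prefix(style_name: str, default: str = "##") -> str:
    """Map a Word style name such as 'Heading 2' to a Markdown prefix ('##')."""
    # Same logic as the extractor: strip the "Heading " label and keep the level.
    level = style_name.replace("Heading ", "").strip()
    # Style names without a numeric level fall back to the default prefix.
    return "#" * int(level) if level.isdigit() else default


for name in ("Heading 1", "Heading 3", "Heading"):
    print(name, "->", heading_prefix(name))
```

Keeping the mapping in a small pure function would make the fallback case (a style name without a numeric level) easy to test without opening a DOCX file.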
extractors/ocr_extractor.py
ADDED
|
@@ -0,0 +1,245 @@
|
| 1 |
+
"""
|
| 2 |
+
Image OCR extraction using EasyOCR (primary) and Tesseract (fallback).
|
| 3 |
+
Applies image preprocessing (grayscale, contrast, upscaling, denoising) before the Tesseract fallback.
|
| 4 |
+
"""
|
| 5 |
+
import time
|
| 6 |
+
import os
|
| 7 |
+
import numpy as np
|
| 8 |
+
from PIL import Image, ImageEnhance, ImageFilter, ImageOps
|
| 9 |
+
from models.schemas import ExtractionResult, DocumentMetadata
|
| 10 |
+
import config
|
| 11 |
+
|
| 12 |
+
# --- OCR Engine Detection ---
|
| 13 |
+
|
| 14 |
+
try:
|
| 15 |
+
import easyocr
|
| 16 |
+
EASYOCR_AVAILABLE = True
|
| 17 |
+
except ImportError:
|
| 18 |
+
EASYOCR_AVAILABLE = False
|
| 19 |
+
|
| 20 |
+
try:
|
| 21 |
+
import pytesseract
|
| 22 |
+
TESSERACT_AVAILABLE = True
|
| 23 |
+
except ImportError:
|
| 24 |
+
TESSERACT_AVAILABLE = False
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
# Global reader instance for EasyOCR (lazy loaded)
|
| 28 |
+
_EASY_READER = None
|
| 29 |
+
|
| 30 |
+
def get_easyocr_reader():
|
| 31 |
+
"""Get or create the EasyOCR reader instance."""
|
| 32 |
+
global _EASY_READER
|
| 33 |
+
if _EASY_READER is None and EASYOCR_AVAILABLE:
|
| 34 |
+
try:
|
| 35 |
+
# Initialize with configured languages and GPU setting
|
| 36 |
+
_EASY_READER = easyocr.Reader(config.EASYOCR_LANGS, gpu=config.EASYOCR_GPU)
|
| 37 |
+
except Exception as e:
|
| 38 |
+
print(f"Error initializing EasyOCR: {e}")
|
| 39 |
+
return None
|
| 40 |
+
return _EASY_READER
|
| 41 |
+
|
| 42 |
+
|
| 43 |
+
def _configure_tesseract():
|
| 44 |
+
"""Configure tesseract path from config."""
|
| 45 |
+
if config.TESSERACT_CMD and TESSERACT_AVAILABLE:
|
| 46 |
+
pytesseract.pytesseract.tesseract_cmd = config.TESSERACT_CMD
|
| 47 |
+
return True
|
| 48 |
+
elif TESSERACT_AVAILABLE:
|
| 49 |
+
try:
|
| 50 |
+
pytesseract.get_tesseract_version()
|
| 51 |
+
return True
|
| 52 |
+
except Exception:
|
| 53 |
+
return False
|
| 54 |
+
return False
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
def _preprocess_image(image: Image.Image) -> Image.Image:
|
| 58 |
+
"""Preprocess image for maximum OCR accuracy."""
|
| 59 |
+
# 1. Convert to grayscale
|
| 60 |
+
if image.mode != "L":
|
| 61 |
+
image = image.convert("L")
|
| 62 |
+
|
| 63 |
+
# 2. Dynamic Contrast / Lighting correction
|
| 64 |
+
image = ImageOps.autocontrast(image)
|
| 65 |
+
|
| 66 |
+
# 3. Upscale small images (at least 2x) so glyphs are large enough for OCR
|
| 67 |
+
width, height = image.size
|
| 68 |
+
if width < 1500 or height < 1500:
|
| 69 |
+
scale = max(1800 / width, 1800 / height, 2.0)
|
| 70 |
+
new_size = (int(width * scale), int(height * scale))
|
| 71 |
+
image = image.resize(new_size, Image.Resampling.LANCZOS)
|
| 72 |
+
|
| 73 |
+
# 4. Sharpen and boost contrast
|
| 74 |
+
image = image.filter(ImageFilter.SHARPEN)
|
| 75 |
+
enhancer = ImageEnhance.Contrast(image)
|
| 76 |
+
image = enhancer.enhance(1.8)
|
| 77 |
+
|
| 78 |
+
# 5. Denoising
|
| 79 |
+
image = image.filter(ImageFilter.MedianFilter(size=3))
|
| 80 |
+
|
| 81 |
+
return image
|
| 82 |
+
|
| 83 |
+
|
| 84 |
+
def _reconstruct_from_boxes(results: list) -> str:
|
| 85 |
+
""" Reconstruct text layout from bounding boxes.
|
| 86 |
+
Sort by top, then group by 'lines' based on y-coordinate.
|
| 87 |
+
"""
|
| 88 |
+
if not results:
|
| 89 |
+
return ""
|
| 90 |
+
|
| 91 |
+
# Sort results by top y-coordinate
|
| 92 |
+
results.sort(key=lambda x: x[0][0][1])
|
| 93 |
+
|
| 94 |
+
lines = []
|
| 95 |
+
if results:
|
| 96 |
+
current_line = [results[0]]
|
| 97 |
+
for i in range(1, len(results)):
|
| 98 |
+
# If the current block's mid-y is within the previous block's height range
|
| 99 |
+
prev_box = results[i-1][0]
|
| 100 |
+
curr_box = results[i][0]
|
| 101 |
+
|
| 102 |
+
prev_y_center = (prev_box[0][1] + prev_box[2][1]) / 2
|
| 103 |
+
curr_y_center = (curr_box[0][1] + curr_box[2][1]) / 2
|
| 104 |
+
|
| 105 |
+
# Same line if the vertical centers differ by less than half the previous box's height
|
| 106 |
+
height = prev_box[2][1] - prev_box[0][1]
|
| 107 |
+
if abs(curr_y_center - prev_y_center) < (height * 0.5):
|
| 108 |
+
current_line.append(results[i])
|
| 109 |
+
else:
|
| 110 |
+
lines.append(current_line)
|
| 111 |
+
current_line = [results[i]]
|
| 112 |
+
lines.append(current_line)
|
| 113 |
+
|
| 114 |
+
final_text = []
|
| 115 |
+
for line in lines:
|
| 116 |
+
# Sort each line by left x-coordinate
|
| 117 |
+
line.sort(key=lambda x: x[0][0][0])
|
| 118 |
+
line_text = []
|
| 119 |
+
for i, res in enumerate(line):
|
| 120 |
+
# Add relative spacing based on horizontal gap
|
| 121 |
+
if i > 0:
|
| 122 |
+
prev_right = line[i-1][0][1][0]
|
| 123 |
+
curr_left = res[0][0][0]
|
| 124 |
+
gap = curr_left - prev_right
|
| 125 |
+
# If gap is significant, add spaces
|
| 126 |
+
char_width = (res[0][1][0] - res[0][0][0]) / (len(res[1]) or 1)
|
| 127 |
+
num_spaces = int(gap / (char_width * 1.5))
|
| 128 |
+
line_text.append(" " * max(1, num_spaces))
|
| 129 |
+
|
| 130 |
+
line_text.append(res[1])
|
| 131 |
+
final_text.append("".join(line_text))  # spacing strings are already in line_text
|
| 132 |
+
|
| 133 |
+
return "\n".join(final_text)
|
| 134 |
+
|
| 135 |
+
|
| 136 |
+
def extract_image(file_path: str) -> ExtractionResult:
|
| 137 |
+
"""Extract text from an image using the best available OCR engine."""
|
| 138 |
+
start_time = time.time()
|
| 139 |
+
|
| 140 |
+
# 1. Check for EasyOCR (Preferred)
|
| 141 |
+
if EASYOCR_AVAILABLE:
|
| 142 |
+
try:
|
| 143 |
+
reader = get_easyocr_reader()
|
| 144 |
+
if reader:
|
| 145 |
+
# Read the image dimensions up front; the metadata below reports them
|
| 146 |
+
with Image.open(file_path) as img: original_size = img.size
|
| 147 |
+
# Perform OCR with layout awareness
|
| 148 |
+
# Adjusting thresholds for better numeric and tabular capture
|
| 149 |
+
results = reader.readtext(
|
| 150 |
+
file_path,
|
| 151 |
+
detail=1,
|
| 152 |
+
paragraph=False, # We want individual boxes for layout reconstruction
|
| 153 |
+
width_ths=0.7, # Better for long numbers/strings
|
| 154 |
+
height_ths=0.7,
|
| 155 |
+
contrast_ths=0.3
|
| 156 |
+
)
|
| 157 |
+
|
| 158 |
+
# Reconstruct full layout from bounding boxes
|
| 159 |
+
text = _reconstruct_from_boxes(results)
|
| 160 |
+
|
| 161 |
+
if text.strip():
|
| 162 |
+
elapsed = (time.time() - start_time) * 1000
|
| 163 |
+
metadata = DocumentMetadata(
|
| 164 |
+
title=os.path.basename(file_path),
|
| 165 |
+
page_count=1,
|
| 166 |
+
word_count=len(text.split()),
|
| 167 |
+
character_count=len(text),
|
| 168 |
+
file_type="Image (EasyOCR)",
|
| 169 |
+
extra={
|
| 170 |
+
"image_width": original_size[0],
|
| 171 |
+
"image_height": original_size[1],
|
| 172 |
+
"ocr_engine": "EasyOCR",
|
| 173 |
+
"accuracy": "High (Deep Learning)"
|
| 174 |
+
}
|
| 175 |
+
)
|
| 176 |
+
return ExtractionResult(
|
| 177 |
+
raw_text=text.strip(),
|
| 178 |
+
metadata=metadata,
|
| 179 |
+
success=True,
|
| 180 |
+
extraction_time_ms=elapsed
|
| 181 |
+
)
|
| 182 |
+
except Exception as e:
|
| 183 |
+
print(f"EasyOCR extraction failed, falling back to Tesseract: {e}")
|
| 184 |
+
|
| 185 |
+
# 2. Fallback to Tesseract
|
| 186 |
+
if TESSERACT_AVAILABLE and _configure_tesseract():
|
| 187 |
+
try:
|
| 188 |
+
image = Image.open(file_path)
|
| 189 |
+
original_size = image.size
|
| 190 |
+
processed_image = _preprocess_image(image)
|
| 191 |
+
|
| 192 |
+
custom_config = f"--oem 3 --psm 6 -l {config.TESSERACT_LANG}"
|
| 193 |
+
text = pytesseract.image_to_string(processed_image, config=custom_config)
|
| 194 |
+
|
| 195 |
+
# Confidence
|
| 196 |
+
try:
|
| 197 |
+
data = pytesseract.image_to_data(processed_image, config=custom_config, output_type=pytesseract.Output.DICT)
|
| 198 |
+
confidences = [float(c) for c in data["conf"] if float(c) > 0]  # conf may be a float or a float-like string
|
| 199 |
+
avg_confidence = sum(confidences) / len(confidences) if confidences else 0
|
| 200 |
+
except Exception:
|
| 201 |
+
avg_confidence = 0
|
| 202 |
+
|
| 203 |
+
elapsed = (time.time() - start_time) * 1000
|
| 204 |
+
if text.strip():
|
| 205 |
+
metadata = DocumentMetadata(
|
| 206 |
+
title=os.path.basename(file_path),
|
| 207 |
+
page_count=1,
|
| 208 |
+
word_count=len(text.split()),
|
| 209 |
+
character_count=len(text),
|
| 210 |
+
file_type="Image (Tesseract)",
|
| 211 |
+
extra={
|
| 212 |
+
"image_width": original_size[0],
|
| 213 |
+
"image_height": original_size[1],
|
| 214 |
+
"ocr_confidence": round(avg_confidence, 2),
|
| 215 |
+
"ocr_engine": "Tesseract"
|
| 216 |
+
}
|
| 217 |
+
)
|
| 218 |
+
return ExtractionResult(
|
| 219 |
+
raw_text=text.strip(),
|
| 220 |
+
metadata=metadata,
|
| 221 |
+
success=True,
|
| 222 |
+
extraction_time_ms=elapsed
|
| 223 |
+
)
|
| 224 |
+
except Exception as e:
|
| 225 |
+
print(f"Tesseract extraction failed: {e}")
|
| 226 |
+
|
| 227 |
+
# 3. Failure cases
|
| 228 |
+
elapsed = (time.time() - start_time) * 1000
|
| 229 |
+
|
| 230 |
+
if not EASYOCR_AVAILABLE and not TESSERACT_AVAILABLE:
|
| 231 |
+
error_msg = "No OCR libraries installed. Please run 'pip install easyocr'."
|
| 232 |
+
elif not EASYOCR_AVAILABLE and TESSERACT_AVAILABLE:
|
| 233 |
+
error_msg = "EasyOCR is not installed, and Tesseract binary was not found or failed. Please run 'pip install easyocr' for best results."
|
| 234 |
+
elif EASYOCR_AVAILABLE and not TESSERACT_AVAILABLE:
|
| 235 |
+
error_msg = "EasyOCR failed to extract text, and Tesseract is not installed."
|
| 236 |
+
else:
|
| 237 |
+
error_msg = "OCR extraction failed. Both EasyOCR and Tesseract engines were unable to extract text from this image."
|
| 238 |
+
|
| 239 |
+
return ExtractionResult(
|
| 240 |
+
raw_text="",
|
| 241 |
+
metadata=DocumentMetadata(file_type="Image (OCR)"),
|
| 242 |
+
success=False,
|
| 243 |
+
error_message=error_msg,
|
| 244 |
+
extraction_time_ms=elapsed,
|
| 245 |
+
)
|
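Note on `_reconstruct_from_boxes` above: the line-grouping step can be sketched in isolation. This is a simplified illustration, assuming detections are `(box, text)` pairs with four `[x, y]` corners ordered top-left, top-right, bottom-right, bottom-left (the confidence value is omitted). `group_into_lines` and `tolerance` are hypothetical names, and the gap-based spacing from the original is left out.

```python
from typing import List, Tuple

Box = List[List[float]]       # four [x, y] corners: TL, TR, BR, BL
Detection = Tuple[Box, str]   # simplified: (box, text), confidence omitted


def group_into_lines(detections: List[Detection], tolerance: float = 0.5) -> List[str]:
    """Group detections whose vertical centers are close into text lines."""
    if not detections:
        return []
    # Sort a copy by top-left y so the caller's list is not mutated.
    ordered = sorted(detections, key=lambda d: d[0][0][1])
    lines = [[ordered[0]]]
    for prev, curr in zip(ordered, ordered[1:]):
        prev_box, curr_box = prev[0], curr[0]
        prev_center = (prev_box[0][1] + prev_box[2][1]) / 2
        curr_center = (curr_box[0][1] + curr_box[2][1]) / 2
        height = prev_box[2][1] - prev_box[0][1]
        if abs(curr_center - prev_center) < height * tolerance:
            lines[-1].append(curr)      # same visual line
        else:
            lines.append([curr])        # start a new line
    # Within each line, read left to right by top-left x.
    return [
        " ".join(text for _, text in sorted(line, key=lambda d: d[0][0][0]))
        for line in lines
    ]


sample = [
    ([[120, 12], [200, 12], [200, 32], [120, 32]], "Total"),
    ([[10, 10], [100, 10], [100, 30], [10, 30]], "Invoice"),
    ([[10, 60], [80, 60], [80, 80], [10, 80]], "Date"),
]
print(group_into_lines(sample))
```

Using `sorted` rather than `list.sort` avoids reordering the caller's result list in place, which the committed function does.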
extractors/pdf_extractor.py
ADDED
|
@@ -0,0 +1,79 @@
|
| 1 |
+
"""
|
| 2 |
+
PDF text extraction using PyMuPDF (fitz).
|
| 3 |
+
Extracts text with layout preservation and document metadata.
|
| 4 |
+
"""
|
| 5 |
+
import fitz # PyMuPDF
|
| 6 |
+
import time
|
| 7 |
+
import os
|
| 8 |
+
from models.schemas import ExtractionResult, DocumentMetadata
|
| 9 |
+
|
| 10 |
+
|
| 11 |
+
def extract_pdf(file_path: str) -> ExtractionResult:
|
| 12 |
+
"""Extract text and metadata from a PDF file."""
|
| 13 |
+
start_time = time.time()
|
| 14 |
+
|
| 15 |
+
try:
|
| 16 |
+
doc = fitz.open(file_path)
|
| 17 |
+
|
| 18 |
+
# Extract text from all pages with full layout preservation
|
| 19 |
+
pages_text = []
|
| 20 |
+
for page_num in range(len(doc)):
|
| 21 |
+
page = doc[page_num]
|
| 22 |
+
# PyMuPDF's documented get_text() options do not include "layout"; unrecognized
|
| 23 |
+
# values fall back to plain text, so request "text" explicitly, sorted in reading order.
|
| 24 |
+
text = page.get_text("text", sort=True)
|
| 25 |
+
if text.strip():
|
| 26 |
+
pages_text.append(f"--- Page {page_num + 1} ---\n{text}")
|
| 27 |
+
|
| 28 |
+
full_text = "\n\n".join(pages_text)
|
| 29 |
+
|
| 30 |
+
# Extract metadata
|
| 31 |
+
meta = doc.metadata
|
| 32 |
+
metadata = DocumentMetadata(
|
| 33 |
+
title=meta.get("title", "") or os.path.basename(file_path),
|
| 34 |
+
author=meta.get("author", "") or "Unknown",
|
| 35 |
+
creation_date=meta.get("creationDate", ""),
|
| 36 |
+
modification_date=meta.get("modDate", ""),
|
| 37 |
+
page_count=len(doc),
|
| 38 |
+
word_count=len(full_text.split()) if full_text else 0,
|
| 39 |
+
character_count=len(full_text),
|
| 40 |
+
file_type="PDF",
|
| 41 |
+
extra={
|
| 42 |
+
"producer": meta.get("producer", ""),
|
| 43 |
+
"creator": meta.get("creator", ""),
|
| 44 |
+
"subject": meta.get("subject", ""),
|
| 45 |
+
"keywords": meta.get("keywords", ""),
|
| 46 |
+
"format": meta.get("format", ""),
|
| 47 |
+
"encryption": doc.is_encrypted,
|
| 48 |
+
}
|
| 49 |
+
)
|
| 50 |
+
|
| 51 |
+
doc.close()
|
| 52 |
+
|
| 53 |
+
elapsed = (time.time() - start_time) * 1000
|
| 54 |
+
|
| 55 |
+
if not full_text.strip():
|
| 56 |
+
return ExtractionResult(
|
| 57 |
+
raw_text="",
|
| 58 |
+
metadata=metadata,
|
| 59 |
+
success=False,
|
| 60 |
+
error_message="No extractable text found in PDF. The document may contain only images — try uploading as an image for OCR processing.",
|
| 61 |
+
extraction_time_ms=elapsed,
|
| 62 |
+
)
|
| 63 |
+
|
| 64 |
+
return ExtractionResult(
|
| 65 |
+
raw_text=full_text,
|
| 66 |
+
metadata=metadata,
|
| 67 |
+
success=True,
|
| 68 |
+
extraction_time_ms=elapsed,
|
| 69 |
+
)
|
| 70 |
+
|
| 71 |
+
except Exception as e:
|
| 72 |
+
elapsed = (time.time() - start_time) * 1000
|
| 73 |
+
return ExtractionResult(
|
| 74 |
+
raw_text="",
|
| 75 |
+
metadata=DocumentMetadata(file_type="PDF"),
|
| 76 |
+
success=False,
|
| 77 |
+
error_message=f"PDF extraction failed: {str(e)}",
|
| 78 |
+
extraction_time_ms=elapsed,
|
| 79 |
+
)
|
extractors/url_extractor.py
ADDED
|
@@ -0,0 +1,108 @@
|
| 1 |
+
"""
|
| 2 |
+
Web content extraction from URLs using requests and BeautifulSoup4.
|
| 3 |
+
Extracts title and main text content from HTML pages.
|
| 4 |
+
"""
|
| 5 |
+
import time
|
| 6 |
+
import requests
|
| 7 |
+
from bs4 import BeautifulSoup
|
| 8 |
+
from models.schemas import ExtractionResult, DocumentMetadata
|
| 9 |
+
|
| 10 |
+
|
| 11 |
+
def extract_url(url: str) -> ExtractionResult:
|
| 12 |
+
"""Fetch and extract text content from a web URL."""
|
| 13 |
+
start_time = time.time()
|
| 14 |
+
|
| 15 |
+
try:
|
| 16 |
+
# 1. Fetch content
|
| 17 |
+
headers = {
|
| 18 |
+
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
|
| 19 |
+
}
|
| 20 |
+
response = requests.get(url, headers=headers, timeout=10)
|
| 21 |
+
response.raise_for_status()
|
| 22 |
+
|
| 23 |
+
# 2. Parse HTML
|
| 24 |
+
soup = BeautifulSoup(response.text, 'html.parser')
|
| 25 |
+
|
| 26 |
+
# 3. Remove scripts, styles, and page chrome (nav, header, footer, aside)
|
| 27 |
+
for script_or_style in soup(["script", "style", "nav", "footer", "header", "aside"]):
|
| 28 |
+
script_or_style.decompose()
|
| 29 |
+
|
| 30 |
+
# 4. Get text
|
| 31 |
+
# Try to find the title
|
| 32 |
+
title = soup.title.string.strip() if soup.title and soup.title.string else url
|
| 33 |
+
|
| 34 |
+
# Get main text - simple heuristic: look for <article> or just <body>
|
| 35 |
+
content_area = soup.find('article') or soup.body
|
| 36 |
+
if not content_area:
|
| 37 |
+
content_area = soup
|
| 38 |
+
|
| 39 |
+
# Extract text while preserving some paragraph structure
|
| 40 |
+
lines = []
|
| 41 |
+
for element in content_area.find_all(['p', 'h1', 'h2', 'h3', 'h4', 'li']):
|
| 42 |
+
text = element.get_text().strip()
|
| 43 |
+
if text:
|
| 44 |
+
if element.name.startswith('h'):
|
| 45 |
+
prefix = '#' * int(element.name[1])
|
| 46 |
+
lines.append(f"\n{prefix} {text}\n")
|
| 47 |
+
else:
|
| 48 |
+
lines.append(text)
|
| 49 |
+
|
| 50 |
+
full_text = "\n\n".join(lines)
|
| 51 |
+
if not full_text.strip():
|
| 52 |
+
# Fallback to general text extraction
|
| 53 |
+
full_text = soup.get_text(separator='\n\n', strip=True)
|
| 54 |
+
|
| 55 |
+
# 5. Build metadata
|
| 56 |
+
metadata = DocumentMetadata(
|
| 57 |
+
title=title,
|
| 58 |
+
author="Web Content",
|
| 59 |
+
creation_date="",
|
| 60 |
+
modification_date="",
|
| 61 |
+
page_count=None,
|
| 62 |
+
word_count=len(full_text.split()),
|
| 63 |
+
character_count=len(full_text),
|
| 64 |
+
file_type="URL",
|
| 65 |
+
extra={
|
| 66 |
+
"url": url,
|
| 67 |
+
"domain": url.split('/')[2] if '//' in url else url.split('/')[0],
|
| 68 |
+
"status_code": response.status_code,
|
| 69 |
+
"content_type": response.headers.get('Content-Type', '')
|
| 70 |
+
}
|
| 71 |
+
)
|
| 72 |
+
|
| 73 |
+
elapsed = (time.time() - start_time) * 1000
|
| 74 |
+
|
| 75 |
+
if not full_text.strip():
|
| 76 |
+
return ExtractionResult(
|
| 77 |
+
raw_text="",
|
| 78 |
+
metadata=metadata,
|
| 79 |
+
success=False,
|
| 80 |
+
error_message="Could not extract any meaningful text from the provided URL.",
|
| 81 |
+
extraction_time_ms=elapsed,
|
| 82 |
+
)
|
| 83 |
+
|
| 84 |
+
return ExtractionResult(
|
| 85 |
+
raw_text=full_text,
|
| 86 |
+
metadata=metadata,
|
| 87 |
+
success=True,
|
| 88 |
+
extraction_time_ms=elapsed,
|
| 89 |
+
)
|
| 90 |
+
|
| 91 |
+
except requests.exceptions.RequestException as e:
|
| 92 |
+
elapsed = (time.time() - start_time) * 1000
|
| 93 |
+
return ExtractionResult(
|
| 94 |
+
raw_text="",
|
| 95 |
+
metadata=DocumentMetadata(file_type="URL", extra={"url": url}),
|
| 96 |
+
success=False,
|
| 97 |
+
error_message=f"Failed to fetch URL: {str(e)}",
|
| 98 |
+
extraction_time_ms=elapsed,
|
| 99 |
+
)
|
| 100 |
+
except Exception as e:
|
| 101 |
+
elapsed = (time.time() - start_time) * 1000
|
| 102 |
+
return ExtractionResult(
|
| 103 |
+
raw_text="",
|
| 104 |
+
metadata=DocumentMetadata(file_type="URL", extra={"url": url}),
|
| 105 |
+
success=False,
|
| 106 |
+
error_message=f"Web extraction failed: {str(e)}",
|
| 107 |
+
extraction_time_ms=elapsed,
|
| 108 |
+
)
|
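Note on the `"domain"` field in `extract_url` above: splitting on `/` keeps any credentials and port in the result. A possible alternative, sketched here with the standard library's `urllib.parse.urlsplit`, returns only the host. `domain_of` is a hypothetical helper name, not part of the committed file.

```python
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    """Return the host part of a URL, tolerating a missing scheme."""
    # urlsplit only recognises a network location after "//", so add it if absent.
    parts = urlsplit(url if "//" in url else f"//{url}")
    # hostname drops user info and port, and is lower-cased by the parser.
    return parts.hostname or ""


for candidate in ("https://example.com/a/b", "example.com/page",
                  "https://user@example.com:8080/x"):
    print(candidate, "->", domain_of(candidate))
```

The same helper could serve both the metadata field here and the task "filename" derived from the URL in `main.py`.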
main.py
ADDED
|
@@ -0,0 +1,307 @@
|
| 1 |
+
"""
|
| 2 |
+
Intelligent Document Processing System
|
| 3 |
+
FastAPI backend with async document processing.
|
| 4 |
+
"""
|
| 5 |
+
import os
|
| 6 |
+
import uuid
|
| 7 |
+
import time
|
| 8 |
+
import asyncio
|
| 9 |
+
from typing import Dict
|
| 10 |
+
from fastapi import FastAPI, UploadFile, File, HTTPException
|
| 11 |
+
from fastapi.staticfiles import StaticFiles
|
| 12 |
+
from fastapi.responses import FileResponse, JSONResponse
|
| 13 |
+
from fastapi.middleware.cors import CORSMiddleware
|
| 14 |
+
|
| 15 |
+
from config import UPLOAD_DIR, STATIC_DIR, MAX_FILE_SIZE_BYTES, ALLOWED_EXTENSIONS
|
| 16 |
+
from models.schemas import (
|
| 17 |
+
UploadResponse, ProcessingResult, TaskStatus,
|
| 18 |
+
ExtractionResult, DocumentMetadata,
|
| 19 |
+
)
|
| 20 |
+
from extractors.pdf_extractor import extract_pdf
|
| 21 |
+
from extractors.docx_extractor import extract_docx
|
| 22 |
+
from extractors.ocr_extractor import extract_image
|
| 23 |
+
from extractors.url_extractor import extract_url
|
| 24 |
+
from analyzers.summarizer import summarize_text
|
| 25 |
+
from analyzers.ner_extractor import extract_entities
|
| 26 |
+
from analyzers.sentiment import analyze_sentiment
|
| 27 |
+
|
| 28 |
+
# --- App Setup ---
|
| 29 |
+
app = FastAPI(
|
| 30 |
+
title="Alldocex - Intelligent Document Processing",
|
| 31 |
+
description="Extract, analyse, and summarize content from PDF, DOCX, and image files using AI.",
|
| 32 |
+
version="1.0.0",
|
| 33 |
+
)
|
| 34 |
+
|
| 35 |
+
app.add_middleware(
|
| 36 |
+
CORSMiddleware,
|
| 37 |
+
allow_origins=["*"],
|
| 38 |
+
allow_credentials=True,
|
| 39 |
+
allow_methods=["*"],
|
| 40 |
+
allow_headers=["*"],
|
| 41 |
+
)
|
| 42 |
+
|
| 43 |
+
# In-memory task store
|
| 44 |
+
tasks: Dict[str, ProcessingResult] = {}
|
| 45 |
+
|
| 46 |
+
# --- Utility Functions ---
|
| 47 |
+
|
| 48 |
+
def _human_readable_size(size_bytes: int) -> str:
|
| 49 |
+
"""Convert bytes to human readable string."""
|
| 50 |
+
for unit in ["B", "KB", "MB", "GB"]:
|
| 51 |
+
if size_bytes < 1024:
|
| 52 |
+
return f"{size_bytes:.1f} {unit}"
|
| 53 |
+
size_bytes /= 1024
|
| 54 |
+
return f"{size_bytes:.1f} TB"
|
| 55 |
+
|
| 56 |
+
|
| 57 |
+
def _get_file_type(filename: str) -> str:
|
| 58 |
+
"""Determine file type from extension."""
|
| 59 |
+
ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
|
| 60 |
+
if ext == "pdf":
|
| 61 |
+
return "pdf"
|
| 62 |
+
elif ext == "docx":
|
| 63 |
+
return "docx"
|
| 64 |
+
elif ext in ("png", "jpg", "jpeg", "tiff", "bmp", "webp"):
|
| 65 |
+
return "image"
|
| 66 |
+
return "unknown"
|
| 67 |
+
|
| 68 |
+
|
| 69 |
+
def _process_document(file_path: str, file_type: str, task_id: str):
|
| 70 |
+
"""
|
| 71 |
+
Process a document: extract text, then run all analyzers.
|
| 72 |
+
This runs in a thread pool to avoid blocking the event loop.
|
| 73 |
+
"""
|
| 74 |
+
start_time = time.time()
|
| 75 |
+
task = tasks[task_id]
|
| 76 |
+
task.status = TaskStatus.PROCESSING
|
| 77 |
+
|
| 78 |
+
try:
|
| 79 |
+
# Step 1: Extract text based on file type
|
| 80 |
+
if file_type == "pdf":
|
| 81 |
+
extraction = extract_pdf(file_path)
|
| 82 |
+
elif file_type == "docx":
|
| 83 |
+
extraction = extract_docx(file_path)
|
| 84 |
+
elif file_type == "image":
|
| 85 |
+
extraction = extract_image(file_path)
|
| 86 |
+
elif file_type == "url":
|
| 87 |
+
# file_path is the URL string here
|
| 88 |
+
extraction = extract_url(file_path)
|
| 89 |
+
else:
|
| 90 |
+
raise ValueError(f"Unsupported file type: {file_type}")
|
| 91 |
+
|
| 92 |
+
task.extraction = extraction
|
| 93 |
+
|
| 94 |
+
if not extraction.success or not extraction.raw_text.strip():
|
| 95 |
+
task.status = TaskStatus.COMPLETED
|
| 96 |
+
task.error_message = extraction.error_message or "No text could be extracted."
|
| 97 |
+
task.processing_time_ms = (time.time() - start_time) * 1000
|
| 98 |
+
return
|
| 99 |
+
|
| 100 |
+
raw_text = extraction.raw_text
|
| 101 |
+
|
| 102 |
+
# Step 2: Summarization
|
| 103 |
+
try:
|
| 104 |
+
task.summary = summarize_text(raw_text)
|
| 105 |
+
except Exception as e:
|
| 106 |
+
print(f"Summarization error: {e}")
|
| 107 |
+
|
| 108 |
+
# Step 3: Named Entity Recognition
|
| 109 |
+
try:
|
| 110 |
+
task.entities = extract_entities(raw_text)
|
| 111 |
+
except Exception as e:
|
| 112 |
+
print(f"NER error: {e}")
|
| 113 |
+
|
| 114 |
+
# Step 4: Sentiment Analysis
|
| 115 |
+
try:
|
| 116 |
+
task.sentiment = analyze_sentiment(raw_text)
|
| 117 |
+
except Exception as e:
|
| 118 |
+
print(f"Sentiment error: {e}")
|
| 119 |
+
|
| 120 |
+
task.status = TaskStatus.COMPLETED
|
| 121 |
+
task.processing_time_ms = (time.time() - start_time) * 1000
|
| 122 |
+
|
| 123 |
+
except Exception as e:
|
| 124 |
+
task.status = TaskStatus.ERROR
|
| 125 |
+
task.error_message = str(e)
|
| 126 |
+
task.processing_time_ms = (time.time() - start_time) * 1000
|
| 127 |
+
|
| 128 |
+
finally:
|
| 129 |
+
# Clean up uploaded file
|
| 130 |
+
try:
|
| 131 |
+
if os.path.exists(file_path):
|
| 132 |
+
os.remove(file_path)
|
| 133 |
+
except Exception:
|
| 134 |
+
pass
|
| 135 |
+
|
| 136 |
+
|
| 137 |
+
# --- API Endpoints ---
|
| 138 |
+
|
| 139 |
+
@app.post("/api/upload", response_model=ProcessingResult)
|
| 140 |
+
async def upload_and_process(file: UploadFile = File(...)):
|
| 141 |
+
"""
|
| 142 |
+
Upload a document and start processing.
|
| 143 |
+
Supports PDF, DOCX, PNG, JPG, JPEG, TIFF, BMP, WEBP.
|
| 144 |
+
"""
|
| 145 |
+
# Validate file extension
|
| 146 |
+
filename = file.filename or "unknown"
|
| 147 |
+
ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
|
| 148 |
+
if ext not in ALLOWED_EXTENSIONS:
|
| 149 |
+
raise HTTPException(
|
| 150 |
+
status_code=400,
|
| 151 |
+
detail=f"Unsupported file type: .{ext}. Supported: {', '.join(ALLOWED_EXTENSIONS.keys())}"
|
| 152 |
+
)
|
| 153 |
+
|
| 154 |
+
# Read file content
|
| 155 |
+
content = await file.read()
|
| 156 |
+
file_size = len(content)
|
| 157 |
+
|
| 158 |
+
# Validate file size
|
| 159 |
+
if file_size > MAX_FILE_SIZE_BYTES:
|
| 160 |
+
raise HTTPException(
|
| 161 |
+
status_code=400,
|
| 162 |
+
detail=f"File too large. Maximum size: {MAX_FILE_SIZE_BYTES // (1024*1024)}MB"
|
| 163 |
+
)
|
| 164 |
+
|
| 165 |
+
if file_size == 0:
|
| 166 |
+
raise HTTPException(status_code=400, detail="Empty file uploaded.")
|
| 167 |
+
|
| 168 |
+
# Save file
|
| 169 |
+
file_id = str(uuid.uuid4())[:8]
|
| 170 |
+
safe_filename = f"{file_id}_{os.path.basename(filename)}"  # drop any client-supplied directory parts
|
| 171 |
+
file_path = os.path.join(UPLOAD_DIR, safe_filename)
|
| 172 |
+
|
| 173 |
+
with open(file_path, "wb") as f:
|
| 174 |
+
f.write(content)
|
| 175 |
+
|
| 176 |
+
# Determine file type
|
| 177 |
+
file_type = _get_file_type(filename)
|
| 178 |
+
|
| 179 |
+
# Create task
|
| 180 |
+
task = ProcessingResult.create_pending(
|
| 181 |
+
file_id=file_id,
|
| 182 |
+
filename=filename,
|
| 183 |
+
file_type=file_type,
|
| 184 |
+
)
|
| 185 |
+
tasks[file_id] = task
|
| 186 |
+
|
| 187 |
+
# Start async processing in a thread
|
| 188 |
+
asyncio.get_running_loop().run_in_executor(
|
| 189 |
+
None, _process_document, file_path, file_type, file_id
|
| 190 |
+
)
|
| 191 |
+
|
| 192 |
+
return task
|
| 193 |
+
|
| 194 |
+
|
| 195 |
+
@app.post("/api/extract/url", response_model=ProcessingResult)
|
| 196 |
+
async def extract_from_url(data: Dict[str, str]):
|
| 197 |
+
"""
|
| 198 |
+
Extract content from a web URL and process it.
|
| 199 |
+
"""
|
| 200 |
+
url = data.get("url")
|
| 201 |
+
if not url:
|
| 202 |
+
raise HTTPException(status_code=400, detail="URL is required.")
|
| 203 |
+
|
| 204 |
+
if not url.startswith(("http://", "https://")):
|
| 205 |
+
raise HTTPException(status_code=400, detail="Invalid URL format. Must start with http:// or https://")
|
| 206 |
+
|
| 207 |
+
# Create task
|
| 208 |
+
file_id = str(uuid.uuid4())[:8]
|
| 209 |
+
# For URLs, we use the domain as the "filename"
|
| 210 |
+
filename = url.split('/')[2] if '//' in url else url.split('/')[0]
|
| 211 |
+
|
| 212 |
+
task = ProcessingResult.create_pending(
|
| 213 |
+
file_id=file_id,
|
| 214 |
+
filename=filename,
|
| 215 |
+
file_type="url",
|
| 216 |
+
)
|
| 217 |
+
tasks[file_id] = task
|
| 218 |
+
|
| 219 |
+
# Start async processing in a thread
|
| 220 |
+
asyncio.get_running_loop().run_in_executor(
|
| 221 |
+
None, _process_document, url, "url", file_id
|
| 222 |
+
)
|
| 223 |
+
|
| 224 |
+
return task
|
| 225 |
+
|
| 226 |
+
|
| 227 |
+
@app.get("/api/status/{task_id}")
|
| 228 |
+
async def get_task_status(task_id: str):
|
| 229 |
+
"""Get the processing status and results for a task."""
|
| 230 |
+
if task_id not in tasks:
|
| 231 |
+
raise HTTPException(status_code=404, detail="Task not found.")
|
| 232 |
+
return tasks[task_id]
|
| 233 |
+
|
| 234 |
+
|
| 235 |
+
@app.get("/api/download/{task_id}")
|
| 236 |
+
async def download_results(task_id: str):
|
| 237 |
+
"""Download the extracted text as a .txt file."""
|
| 238 |
+
if task_id not in tasks:
|
| 239 |
+
raise HTTPException(status_code=404, detail="Task not found.")
|
| 240 |
+
|
| 241 |
+
task = tasks[task_id]
|
| 242 |
+
if not task.extraction or not task.extraction.raw_text:
|
| 243 |
+
raise HTTPException(status_code=400, detail="No text available for download.")
|
| 244 |
+
|
| 245 |
+
# Create temporary file path
|
| 246 |
+
filename = f"extracted_{os.path.basename(task.filename)}.txt"
|
| 247 |
+
temp_path = os.path.join(UPLOAD_DIR, filename)
|
| 248 |
+
|
| 249 |
+
try:
|
| 250 |
+
with open(temp_path, "w", encoding="utf-8") as f:
|
| 251 |
+
f.write(task.extraction.raw_text)
|
| 252 |
+
|
| 253 |
+
return FileResponse(
|
| 254 |
+
temp_path,
|
| 255 |
+
filename=filename,
|
| 256 |
+
media_type="text/plain",
|
| 257 |
+
background=None # Note: ideally we'd use BackgroundTask to delete this file later
|
| 258 |
+
)
|
| 259 |
+
except Exception as e:
|
| 260 |
+
raise HTTPException(status_code=500, detail=f"Failed to generate download: {str(e)}")
|
| 261 |
+
|
| 262 |
+
|
| 263 |
+
@app.get("/api/health")
|
| 264 |
+
async def health_check():
|
| 265 |
+
"""Health check endpoint."""
|
| 266 |
+
from config import check_ocr_availability
|
| 267 |
+
|
| 268 |
+
# Check OCR status
|
| 269 |
+
ocr_status = check_ocr_availability()
|
| 270 |
+
|
| 271 |
+
# Check spaCy
|
| 272 |
+
try:
|
| 273 |
+
import spacy
|
| 274 |
+
spacy.load("en_core_web_sm")
|
| 275 |
+
spacy_status = "available"
|
| 276 |
+
except Exception:
|
| 277 |
+
spacy_status = "not available"
|
| 278 |
+
|
| 279 |
+
return {
|
| 280 |
+
"status": "healthy",
|
| 281 |
+
"ocr": ocr_status,
|
| 282 |
+
"tesseract": "available" if ocr_status in ("available", "tesseract-only") else "not found",
|
| 283 |
+
"spacy": spacy_status,
|
| 284 |
+
"version": "1.1.0",
|
| 285 |
+
}
|
| 286 |
+
|
| 287 |
+
|
| 288 |
+
# --- Static Files ---
|
| 289 |
+
|
| 290 |
+
# Serve the main page
|
| 291 |
+
@app.get("/")
|
| 292 |
+
async def serve_index():
|
| 293 |
+
index_path = os.path.join(STATIC_DIR, "index.html")
|
| 294 |
+
if os.path.exists(index_path):
|
| 295 |
+
return FileResponse(index_path)
|
| 296 |
+
return JSONResponse({"message": "Alldocex API is running. Frontend not found."})
|
| 297 |
+
|
| 298 |
+
|
| 299 |
+
# Mount static files
|
| 300 |
+
app.mount("/static", StaticFiles(directory=STATIC_DIR), name="static")
|
| 301 |
+
|
| 302 |
+
|
| 303 |
+
if __name__ == "__main__":
|
| 304 |
+
import uvicorn
|
| 305 |
+
print("\n🚀 Alldocex - Intelligent Document Processing System")
|
| 306 |
+
print("📄 Open http://localhost:7860 in your browser\n")
|
| 307 |
+
uvicorn.run(app, host="0.0.0.0", port=7860)
|
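Note on main.py: the URL endpoint derives a display name by splitting the URL on "/", and the download endpoint builds a file path directly from `task.filename`. The sketch below shows one possible way to make both steps more robust using only the standard library. The helper names `display_name_for_url` and `safe_download_name` are hypothetical and do not appear in the commit. Whether uploaded filenames can actually contain path separators depends on the upload handler, which is not shown here, so treat the sanitising step as a precaution rather than a confirmed fix.

```python
import re
from urllib.parse import urlparse


def display_name_for_url(url: str) -> str:
    """Return the host part of a URL, falling back to the raw input."""
    # Prefix scheme-less input with "//" so urlparse treats the first
    # segment as the network location rather than as a path.
    parsed = urlparse(url if "//" in url else "//" + url)
    return parsed.hostname or url


def safe_download_name(original: str) -> str:
    """Build an 'extracted_<name>.txt' filename with no path components."""
    # Keep only the last path segment, whichever separator was used.
    base = original.replace("\\", "/").rsplit("/", 1)[-1]
    # Replace anything outside a conservative character set.
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", base).strip("._")
    return f"extracted_{cleaned or 'document'}.txt"
```

Unlike the plain split, `urlparse(...).hostname` drops any port or user-info part and lowercases the host, so the name stored on the task stays short and predictable.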
models/__init__.py
ADDED
@@ -0,0 +1 @@
# Models package
models/schemas.py
ADDED
@@ -0,0 +1,116 @@
"""
Pydantic models for request/response schemas.
"""
from pydantic import BaseModel
from typing import Optional, List, Dict, Any
from enum import Enum
import time
import uuid


class TaskStatus(str, Enum):
    PENDING = "pending"
    PROCESSING = "processing"
    COMPLETED = "completed"
    ERROR = "error"


class FileType(str, Enum):
    PDF = "pdf"
    DOCX = "docx"
    IMAGE = "image"


class UploadResponse(BaseModel):
    file_id: str
    filename: str
    file_type: str
    size_bytes: int
    size_human: str
    message: str


class DocumentMetadata(BaseModel):
    title: Optional[str] = None
    author: Optional[str] = None
    creation_date: Optional[str] = None
    modification_date: Optional[str] = None
    page_count: Optional[int] = None
    word_count: int = 0
    character_count: int = 0
    file_type: str = ""
    extra: Dict[str, Any] = {}


class ExtractionResult(BaseModel):
    raw_text: str
    metadata: DocumentMetadata
    success: bool = True
    error_message: Optional[str] = None
    extraction_time_ms: float = 0


class SummaryResult(BaseModel):
    summary: str
    original_length: int
    summary_length: int
    compression_ratio: float
    sentence_count: int
    algorithm: str


class Entity(BaseModel):
    text: str
    label: str
    label_description: str
    count: int = 1
    positions: List[int] = []


class EntityResult(BaseModel):
    entities: List[Entity]
    entity_counts: Dict[str, int]
    total_entities: int


class SentimentBreakdown(BaseModel):
    text: str
    compound: float
    positive: float
    negative: float
    neutral: float
    label: str


class SentimentResult(BaseModel):
    overall_compound: float
    overall_positive: float
    overall_negative: float
    overall_neutral: float
    overall_label: str
    sentence_breakdown: List[SentimentBreakdown]
    confidence: float


class ProcessingResult(BaseModel):
    file_id: str
    filename: str
    file_type: str
    status: TaskStatus
    extraction: Optional[ExtractionResult] = None
    summary: Optional[SummaryResult] = None
    entities: Optional[EntityResult] = None
    sentiment: Optional[SentimentResult] = None
    processing_time_ms: float = 0
    error_message: Optional[str] = None
    timestamp: float = 0

    @staticmethod
    def create_pending(file_id: str, filename: str, file_type: str) -> "ProcessingResult":
        return ProcessingResult(
            file_id=file_id,
            filename=filename,
            file_type=file_type,
            status=TaskStatus.PENDING,
            timestamp=time.time(),
        )
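Note on models/schemas.py: `SummaryResult` carries `original_length`, `summary_length` and `compression_ratio`, but the schema alone does not say how they relate. The sketch below shows one plausible relationship. It assumes that lengths are character counts and that `compression_ratio` is `summary_length / original_length`; the frontend's `1 - compression_ratio` display is consistent with the second assumption, but neither is confirmed by this file. The function name `summary_stats` and the `reduction_percent` key are hypothetical.

```python
def summary_stats(original_text: str, summary_text: str) -> dict:
    """Compute length statistics for a summary (assumed definitions)."""
    original_length = len(original_text)
    summary_length = len(summary_text)
    # Assumption: ratio = summary / original. An empty original is treated
    # as "no compression" to avoid dividing by zero.
    ratio = summary_length / original_length if original_length else 1.0
    return {
        "original_length": original_length,
        "summary_length": summary_length,
        "compression_ratio": ratio,
        # Share of the text removed, as a whole-number percentage.
        "reduction_percent": round((1 - ratio) * 100),
    }
```

For example, a 250-character summary of a 1000-character text gives a ratio of 0.25, which a display like the frontend's would show as a 75% reduction.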
requirements.txt
ADDED
@@ -0,0 +1,16 @@
fastapi==0.115.0
uvicorn[standard]==0.30.0
python-multipart==0.0.9
PyMuPDF==1.24.0
python-docx==1.1.0
easyocr==1.7.1
torch
torchvision
pytesseract==0.3.13
Pillow==10.4.0
spacy==3.7.5
sumy==0.11.0
nltk==3.9.0
aiofiles==24.1.0
requests==2.32.3
beautifulsoup4==4.12.3
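Note on requirements.txt: every dependency is pinned with `==` except `torch` and `torchvision`, so those two may resolve to different versions on each build. The following sketch is a small standard-library check that lists entries lacking an exact pin. The function name `unpinned` is hypothetical, and the parsing is deliberately simple: it does not handle the full requirements-file syntax.

```python
import re

REQUIREMENTS = """\
fastapi==0.115.0
uvicorn[standard]==0.30.0
python-multipart==0.0.9
PyMuPDF==1.24.0
python-docx==1.1.0
easyocr==1.7.1
torch
torchvision
pytesseract==0.3.13
Pillow==10.4.0
spacy==3.7.5
sumy==0.11.0
nltk==3.9.0
aiofiles==24.1.0
requests==2.32.3
beautifulsoup4==4.12.3
"""


def unpinned(requirements_text: str) -> list:
    """Return package names that have no exact '==' version pin."""
    names = []
    for raw in requirements_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        if "==" not in line:
            # Keep only the bare name, before any specifier or extras.
            names.append(re.split(r"[<>=!~\[; ]", line, maxsplit=1)[0])
    return names


print(unpinned(REQUIREMENTS))
```

A check like this could run in CI to flag newly added dependencies that lack a pin.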
static/app.js
ADDED
@@ -0,0 +1,586 @@
/**
 * Alldocex - Intelligent Document Processing
 * Frontend application logic
 */

// ===== State =====
let currentTaskId = null;
let pollInterval = null;

// ===== DOM Elements =====
const $ = (sel) => document.querySelector(sel);
const $$ = (sel) => document.querySelectorAll(sel);

const dropZone = $('#dropZone');
const fileInput = $('#fileInput');
const uploadSection = $('#uploadSection');
const processingSection = $('#processingSection');
const resultsSection = $('#resultsSection');
const toastContainer = $('#toastContainer');
const btnExtractUrl = $('#btnExtractUrl');
const urlInput = $('#urlInput');

// ===== Init =====
document.addEventListener('DOMContentLoaded', () => {
    initUpload();
    initTabs();
    initButtons();
});

// ===== Health Check =====

// ===== Upload =====
function initUpload() {
    // Click to upload
    dropZone.addEventListener('click', () => fileInput.click());

    // File selected
    fileInput.addEventListener('change', (e) => {
        if (e.target.files.length > 0) {
            handleFile(e.target.files[0]);
        }
    });

    // URL input
    btnExtractUrl.addEventListener('click', () => {
        const url = urlInput.value.trim();
        if (url) {
            handleUrl(url);
        } else {
            showToast('Please enter a valid URL', 'error');
        }
    });

    // Drag and drop
    dropZone.addEventListener('dragover', (e) => {
        e.preventDefault();
        dropZone.classList.add('drag-over');
    });

    dropZone.addEventListener('dragleave', (e) => {
        e.preventDefault();
        dropZone.classList.remove('drag-over');
    });

    dropZone.addEventListener('drop', (e) => {
        e.preventDefault();
        dropZone.classList.remove('drag-over');
        if (e.dataTransfer.files.length > 0) {
            handleFile(e.dataTransfer.files[0]);
        }
    });

    // Format badge filters
    $$('.format-badge').forEach(badge => {
        badge.addEventListener('click', (e) => {
            e.stopPropagation(); // Don't trigger the main dropZone click
            const format = badge.textContent.trim().toLowerCase();
            openFilteredPicker(format);
        });
    });
}

function openFilteredPicker(format) {
    const defaultAccept = fileInput.accept;

    // Map of extensions
    const extMap = {
        pdf: '.pdf',
        docx: '.docx',
        png: '.png',
        jpg: '.jpg,.jpeg',
        jpeg: '.jpg,.jpeg',
        tiff: '.tiff',
        bmp: '.bmp',
        webp: '.webp'
    };

    if (extMap[format]) {
        fileInput.accept = extMap[format];
    }

    fileInput.click();

    // Reset accept after a short delay so the main zone works normally
    setTimeout(() => {
        fileInput.accept = defaultAccept;
    }, 500);
}

async function handleFile(file) {
    // Validate extension
    const validExts = ['pdf', 'docx', 'png', 'jpg', 'jpeg', 'tiff', 'bmp', 'webp'];
    const ext = file.name.split('.').pop().toLowerCase();
    if (!validExts.includes(ext)) {
        showToast(`Unsupported file type: .${ext}`, 'error');
        return;
    }

    // Validate size (20MB)
    if (file.size > 20 * 1024 * 1024) {
        showToast('File too large. Maximum size: 20MB', 'error');
        return;
    }

    // Show processing UI
    showSection('processing');
    resetProcessingSteps();

    // Upload
    const formData = new FormData();
    formData.append('file', file);

    try {
        const res = await fetch('/api/upload', {
            method: 'POST',
            body: formData,
        });

        if (!res.ok) {
            const err = await res.json();
            throw new Error(err.detail || 'Upload failed');
        }

        const data = await res.json();
        currentTaskId = data.file_id;

        // Start polling for results
        updateStep('stepExtract', 'active');
        startPolling(data.file_id);

    } catch (e) {
        showToast(e.message || 'Upload failed', 'error');
        showSection('upload');
    }
}

async function handleUrl(url) {
    if (!url.startsWith('http')) {
        showToast('URL must start with http:// or https://', 'error');
        return;
    }

    try {
        resetAll();
        showSection('processing');
        updateStep('stepExtract', 'active');

        const response = await fetch('/api/extract/url', {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify({ url: url })
        });

        if (!response.ok) {
            const error = await response.json();
            throw new Error(error.detail || 'Failed to start URL extraction');
        }

        const data = await response.json();
        currentTaskId = data.file_id;

        // Polling results
        startPolling(data.file_id);
    } catch (error) {
        showSection('upload');
        showToast(error.message, 'error');
    }
}

// ===== Polling =====
function startPolling(taskId) {
    if (pollInterval) clearInterval(pollInterval);

    pollInterval = setInterval(async () => {
        try {
            const res = await fetch(`/api/status/${taskId}`);
            const data = await res.json();

            if (data.status === 'processing') {
                // Update steps based on available data
                if (data.extraction) {
                    updateStep('stepExtract', 'done');
                    updateStep('stepSummary', 'active');
                }
                if (data.summary) {
                    updateStep('stepSummary', 'done');
                    updateStep('stepEntities', 'active');
                }
                if (data.entities) {
                    updateStep('stepEntities', 'done');
                    updateStep('stepSentiment', 'active');
                }
                if (data.sentiment) {
                    updateStep('stepSentiment', 'done');
                }
            }

            if (data.status === 'completed' || data.status === 'error') {
                clearInterval(pollInterval);
                pollInterval = null;

                // Mark all steps as done
                updateStep('stepExtract', 'done');
                updateStep('stepSummary', 'done');
                updateStep('stepEntities', 'done');
                updateStep('stepSentiment', 'done');

                // Short delay to show completion
                setTimeout(() => {
                    if (data.status === 'error' && !data.extraction) {
                        showToast(data.error_message || 'Processing failed', 'error');
                        showSection('upload');
                    } else {
                        displayResults(data);
                        showSection('results');
                    }
                }, 600);
            }
        } catch (e) {
            clearInterval(pollInterval);
            pollInterval = null;
            showToast('Lost connection to server', 'error');
            showSection('upload');
        }
    }, 800);
}

// ===== Display Results =====
function displayResults(data) {
    // File info bar
    const typeIcons = { pdf: '📕', docx: '📘', image: '🖼️' };
    $('#fileTypeIcon').textContent = typeIcons[data.file_type] || '📄';
    $('#fileName').textContent = data.filename;

    const meta = data.extraction?.metadata;
    const parts = [data.file_type.toUpperCase()];
    if (meta?.word_count) parts.push(`${meta.word_count.toLocaleString()} words`);
    if (meta?.page_count) parts.push(`${meta.page_count} pages`);
    $('#fileMeta').textContent = parts.join(' • ');

    const timeSeconds = (data.processing_time_ms / 1000).toFixed(1);
    $('#processingTime').textContent = `⏱ ${timeSeconds}s`;

    // Extracted Text
    const textEl = $('#extractedText');
    if (data.extraction?.raw_text) {
        textEl.textContent = data.extraction.raw_text;
    } else {
        textEl.innerHTML = `<p class="placeholder">${data.extraction?.error_message || 'No text extracted.'}</p>`;
    }

    // Summary
    if (data.summary) {
        $('#summaryContent').textContent = data.summary.summary;
        $('#summaryStats').classList.remove('hidden');
        $('#statOriginalLen').textContent = data.summary.original_length.toLocaleString();
        $('#statSummaryLen').textContent = data.summary.summary_length.toLocaleString();
        const pct = Math.round((1 - data.summary.compression_ratio) * 100);
        $('#statCompression').textContent = `${pct}%`;
        $('#statAlgorithm').textContent = data.summary.algorithm;
    } else {
        $('#summaryContent').innerHTML = '<p class="placeholder">Summarization not available.</p>';
        $('#summaryStats').classList.add('hidden');
    }

    // Entities
    displayEntities(data.entities);

    // Sentiment
    displaySentiment(data.sentiment);

    // Metadata
    displayMetadata(data.extraction?.metadata);

    // Activate first tab
    activateTab('extracted');
}

function displayEntities(entityData) {
    const catEl = $('#entityCategories');
    const listEl = $('#entityList');
    const countEl = $('#entityCount');

    if (!entityData || entityData.entities.length === 0) {
        catEl.innerHTML = '<p class="placeholder">No entities detected in this document.</p>';
        listEl.innerHTML = '';
        countEl.textContent = '0 entities found';
        return;
    }

    countEl.textContent = `${entityData.total_entities} entities found`;

    // Category badges
    const catColors = {
        PERSON: '#ec4899', ORG: '#3b82f6', GPE: '#10b981', DATE: '#f59e0b',
        MONEY: '#8b5cf6', EVENT: '#06b6d4', PRODUCT: '#fb923c', LAW: '#a855f7',
        NORP: '#f472b6', EMAIL: '#06b6d4', PHONE: '#3b82f6', URL: '#10b981',
        TIME: '#f59e0b', PERCENT: '#8b5cf6', CARDINAL: '#94a3b8',
    };

    catEl.innerHTML = Object.entries(entityData.entity_counts)
        .sort((a, b) => b[1] - a[1])
        .map(([label, count]) => `
            <div class="entity-category-badge">
                <span class="cat-dot" style="background: ${catColors[label] || '#94a3b8'}"></span>
                ${label}
                <span class="cat-count">${count}</span>
            </div>
        `).join('');

    // Entity list
    listEl.innerHTML = entityData.entities
        .slice(0, 100)
        .map(ent => `
            <div class="entity-item">
                <div class="entity-item-left">
                    <span class="entity-type-badge badge-${ent.label}">${ent.label}</span>
                    <span class="entity-text" title="${escapeHtml(ent.text)}">${escapeHtml(ent.text)}</span>
                </div>
                ${ent.count > 1 ? `<span class="entity-item-count">×${ent.count}</span>` : ''}
            </div>
        `).join('');
}

function displaySentiment(sentData) {
    const overviewEl = $('#sentimentOverview');

    if (!sentData) {
        overviewEl.innerHTML = '<p class="placeholder">Sentiment analysis not available.</p>';
        return;
    }

    const score = sentData.overall_compound;
    const label = sentData.overall_label;
    const posW = Math.round(sentData.overall_positive * 100);
    const neuW = Math.round(sentData.overall_neutral * 100);
    const negW = Math.round(sentData.overall_negative * 100);

    // Label color
    let labelColor;
    if (score >= 0.05) labelColor = 'var(--accent-green)';
    else if (score <= -0.05) labelColor = 'var(--accent-red)';
    else labelColor = 'var(--text-muted)';

    let html = `
        <div class="sentiment-gauge-container">
            <div class="sentiment-label-display" style="color: ${labelColor}">${label}</div>
            <div class="sentiment-score">${score >= 0 ? '+' : ''}${score.toFixed(3)}</div>
            <div class="sentiment-bar-container">
                <div class="sentiment-bar">
                    <div class="sentiment-bar-positive" style="width: ${posW}%"></div>
                    <div class="sentiment-bar-neutral" style="width: ${neuW}%"></div>
                    <div class="sentiment-bar-negative" style="width: ${negW}%"></div>
                </div>
                <div class="sentiment-bar-labels">
                    <span><span class="dot dot-pos"></span> Positive ${posW}%</span>
                    <span><span class="dot dot-neu"></span> Neutral ${neuW}%</span>
                    <span><span class="dot dot-neg"></span> Negative ${negW}%</span>
                </div>
            </div>
        </div>
    `;

    // Sentence breakdown
    if (sentData.sentence_breakdown && sentData.sentence_breakdown.length > 0) {
        html += `
            <div class="sentiment-sentences">
                <h4>Sentence-Level Breakdown (top ${Math.min(sentData.sentence_breakdown.length, 20)})</h4>
                ${sentData.sentence_breakdown.slice(0, 20).map(s => {
                    let cls = 'sent-neutral';
                    if (s.compound >= 0.05) cls = 'sent-positive';
                    else if (s.compound <= -0.05) cls = 'sent-negative';
                    return `
                        <div class="sentence-item">
                            <span class="sentence-sentiment-badge ${cls}">${s.label}</span>
                            <span class="sentence-text">${escapeHtml(s.text)}</span>
                        </div>
                    `;
                }).join('')}
            </div>
        `;
    }

    overviewEl.innerHTML = html;
}

function displayMetadata(meta) {
    const metaEl = $('#metadataContent');

    if (!meta) {
        metaEl.innerHTML = '<p class="placeholder">No metadata available.</p>';
        return;
    }

    const rows = [
        ['Title', meta.title],
        ['Author', meta.author],
        ['File Type', meta.file_type],
        ['Page Count', meta.page_count],
        ['Word Count', meta.word_count?.toLocaleString()],
        ['Character Count', meta.character_count?.toLocaleString()],
        ['Created', meta.creation_date],
        ['Modified', meta.modification_date],
    ];

    // Add extra metadata
    if (meta.extra) {
        for (const [key, value] of Object.entries(meta.extra)) {
            if (value && value !== '' && value !== 0 && value !== false) {
                const label = key.replace(/_/g, ' ').replace(/\b\w/g, c => c.toUpperCase());
                rows.push([label, String(value)]);
            }
        }
    }

    metaEl.innerHTML = `
        <table class="metadata-table">
            ${rows.filter(([, v]) => v && v !== 'None' && v !== 'null' && v !== '')
                .map(([k, v]) => `<tr><td>${k}</td><td>${escapeHtml(String(v))}</td></tr>`)
                .join('')}
        </table>
    `;
}

// ===== Tabs =====
function initTabs() {
    $$('.tab').forEach(tab => {
        tab.addEventListener('click', () => {
            activateTab(tab.dataset.tab);
        });
    });
}

function activateTab(tabName) {
    $$('.tab').forEach(t => t.classList.remove('active'));
    $$('.tab-panel').forEach(p => p.classList.remove('active'));

    const tab = $(`.tab[data-tab="${tabName}"]`);
    const panel = $(`#panel${tabName.charAt(0).toUpperCase() + tabName.slice(1)}`);

    if (tab) tab.classList.add('active');
    if (panel) panel.classList.add('active');
}

// ===== Buttons =====
function initButtons() {
    // New upload
    $('#btnNewUpload').addEventListener('click', () => {
        resetAll();
        showSection('upload');
    });

    // Back to upload (without full reset if possible, or just same as New)
    $('#btnBackToUpload').addEventListener('click', () => {
        // We reset anyway for now to avoid data conflicts,
        // but user specifically asked for "Back"
        resetAll();
        showSection('upload');
    });

    // Cancel processing
    $('#btnCancelProcessing').addEventListener('click', () => {
        if (pollInterval) {
            clearInterval(pollInterval);
            pollInterval = null;
        }
        showSection('upload');
        showToast('Processing cancelled', 'info');
    });

    // Copy buttons
    $('#btnCopyText').addEventListener('click', () => {
        copyToClipboard($('#extractedText').textContent, '#btnCopyText');
    });

    $('#btnCopySummary').addEventListener('click', () => {
        copyToClipboard($('#summaryContent').textContent, '#btnCopySummary');
    });

    // Download button
    $('#btnDownloadText').addEventListener('click', () => {
        if (currentTaskId) {
            window.location.href = `/api/download/${currentTaskId}`;
        } else {
            showToast('No active document to download', 'error');
        }
    });
}

async function copyToClipboard(text, btnSelector) {
    try {
        await navigator.clipboard.writeText(text);
        const btn = $(btnSelector);
        btn.classList.add('copied');
        const originalHTML = btn.innerHTML;
        btn.innerHTML = `<svg width="16" height="16" viewBox="0 0 16 16" fill="none"><path d="M3 8l3 3 7-7" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/></svg> Copied!`;
        setTimeout(() => {
            btn.classList.remove('copied');
            btn.innerHTML = originalHTML;
        }, 2000);
    } catch (e) {
        showToast('Failed to copy to clipboard', 'error');
    }
}

| 526 |
+
// ===== UI Helpers =====
|
| 527 |
+
function showSection(sectionId) {
|
| 528 |
+
[uploadSection, processingSection, resultsSection].forEach(s => s.classList.add('hidden'));
|
| 529 |
+
|
| 530 |
+
if (sectionId === 'upload') {
|
| 531 |
+
uploadSection.classList.remove('hidden');
|
| 532 |
+
} else if (sectionId === 'processing') {
|
| 533 |
+
processingSection.classList.remove('hidden');
|
| 534 |
+
} else if (sectionId === 'results') {
|
| 535 |
+
resultsSection.classList.remove('hidden');
|
| 536 |
+
}
|
| 537 |
+
}
|
| 538 |
+
function resetProcessingSteps() {
|
| 539 |
+
['stepExtract', 'stepSummary', 'stepEntities', 'stepSentiment'].forEach(id => {
|
| 540 |
+
const el = $(`#${id}`);
|
| 541 |
+
el.classList.remove('active', 'done');
|
| 542 |
+
el.querySelector('.step-status').textContent = '⏳';
|
| 543 |
+
});
|
| 544 |
+
}
|
| 545 |
+
|
| 546 |
+
function updateStep(stepId, state) {
|
| 547 |
+
const el = $(`#${stepId}`);
|
| 548 |
+
el.classList.remove('active', 'done');
|
| 549 |
+
el.classList.add(state);
|
| 550 |
+
el.querySelector('.step-status').textContent = state === 'done' ? '✅' : '⚡';
|
| 551 |
+
}
|
| 552 |
+
|
| 553 |
+
function resetAll() {
|
| 554 |
+
currentTaskId = null;
|
| 555 |
+
if (pollInterval) {
|
| 556 |
+
clearInterval(pollInterval);
|
| 557 |
+
pollInterval = null;
|
| 558 |
+
}
|
| 559 |
+
fileInput.value = '';
|
| 560 |
+
$('#extractedText').innerHTML = '<p class="placeholder">No text extracted yet.</p>';
|
| 561 |
+
$('#summaryContent').innerHTML = '<p class="placeholder">No summary available.</p>';
|
| 562 |
+
$('#summaryStats').classList.add('hidden');
|
| 563 |
+
$('#entityCategories').innerHTML = '<p class="placeholder">No entities detected.</p>';
|
| 564 |
+
$('#entityList').innerHTML = '';
|
| 565 |
+
$('#sentimentOverview').innerHTML = '<p class="placeholder">No sentiment data available.</p>';
|
| 566 |
+
$('#metadataContent').innerHTML = '<p class="placeholder">No metadata available.</p>';
|
| 567 |
+
}
|
| 568 |
+
|
| 569 |
+
function showToast(message, type = 'info') {
|
| 570 |
+
const icons = { info: 'ℹ️', error: '❌', success: '✅' };
|
| 571 |
+
const toast = document.createElement('div');
|
| 572 |
+
toast.className = `toast toast-${type}`;
|
| 573 |
+
toast.innerHTML = `<span class="toast-icon">${icons[type] || icons.info}</span><span>${escapeHtml(message)}</span>`;
|
| 574 |
+
toastContainer.appendChild(toast);
|
| 575 |
+
|
| 576 |
+
setTimeout(() => {
|
| 577 |
+
if (toast.parentNode) toast.remove();
|
| 578 |
+
}, 4000);
|
| 579 |
+
}
|
| 580 |
+
|
| 581 |
+
function escapeHtml(text) {
|
| 582 |
+
if (text === null || text === undefined) return '';
|
| 583 |
+
const div = document.createElement('div');
|
| 584 |
+
div.textContent = text;
|
| 585 |
+
return div.innerHTML;
|
| 586 |
+
}
|
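The `escapeHtml` helper in `static/app.js` relies on the browser DOM (`document.createElement`), so it cannot be exercised outside a page. The following is a minimal DOM-free sketch of the same idea, not part of the commit. The names `escapeHtmlPure` and `HTML_ESCAPES` are hypothetical, and unlike the DOM-based helper this version also escapes quote characters, which is an assumption made so the output is safe inside attribute values as well.

```javascript
// Hypothetical DOM-free counterpart to escapeHtml; illustration only.
const HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

function escapeHtmlPure(text) {
  // Only null/undefined map to an empty string, so 0 and false still render.
  if (text === null || text === undefined) return '';
  // Replace each special character with its entity in a single pass.
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

// Usage: escape a toast message before interpolating it into markup.
const safeMessage = escapeHtmlPure('<img src=x onerror=alert(1)>');
console.log(safeMessage);
```

A lookup table keeps the replacement to one regular-expression pass and avoids double-escaping, because each character is visited once.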
static/index.html
ADDED
|
@@ -0,0 +1,268 @@
|
| 1 |
+
<!DOCTYPE html>
|
| 2 |
+
<html lang="en">
|
| 3 |
+
<head>
|
| 4 |
+
<meta charset="UTF-8">
|
| 5 |
+
<meta name="viewport" content="width=device-width, initial-scale=1.0">
|
| 6 |
+
<title>Alldocex — Intelligent Document Processing</title>
|
| 7 |
+
<meta name="description" content="Extract, analyze, and summarize content from PDF, DOCX, and image files using AI-powered document processing.">
|
| 8 |
+
<link rel="preconnect" href="https://fonts.googleapis.com">
|
| 9 |
+
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
|
| 10 |
+
<link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&display=swap" rel="stylesheet">
|
| 11 |
+
<link rel="stylesheet" href="/static/styles.css">
|
| 12 |
+
</head>
|
| 13 |
+
<body>
|
| 14 |
+
<!-- subtle background -->
|
| 15 |
+
<div class="bg-orbs"></div>
|
| 16 |
+
|
| 17 |
+
<div class="app-container">
|
| 18 |
+
<!-- Header -->
|
| 19 |
+
<header class="header">
|
| 20 |
+
<div class="logo">
|
| 21 |
+
<div class="logo-icon">
|
| 22 |
+
<svg width="32" height="32" viewBox="0 0 32 32" fill="none">
|
| 23 |
+
<rect x="4" y="2" width="18" height="24" rx="3" stroke="currentColor" stroke-width="2.5"/>
|
| 24 |
+
<path d="M10 8h8M10 12h10M10 16h6" stroke="currentColor" stroke-width="2" stroke-linecap="round"/>
|
| 25 |
+
<circle cx="22" cy="22" r="8" fill="var(--accent-blue)" opacity="0.9"/>
|
| 26 |
+
<path d="M20 22l1.5 1.5L24 20" stroke="white" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/>
|
| 27 |
+
</svg>
|
| 28 |
+
</div>
|
| 29 |
+
<div>
|
| 30 |
+
<h1>Alldocex</h1>
|
| 31 |
+
<p class="logo-subtitle">Intelligent Document Processing</p>
|
| 32 |
+
</div>
|
| 33 |
+
</div>
|
| 34 |
+
</header>
|
| 35 |
+
|
| 36 |
+
<!-- Main Content -->
|
| 37 |
+
<main class="main-content">
|
| 38 |
+
<!-- Upload Section -->
|
| 39 |
+
<section class="upload-section" id="uploadSection">
|
| 40 |
+
<div class="upload-zone" id="dropZone">
|
| 41 |
+
<div class="upload-icon">
|
| 42 |
+
<svg width="64" height="64" viewBox="0 0 64 64" fill="none">
|
| 43 |
+
<path d="M32 44V20M32 20L22 30M32 20L42 30" stroke="var(--accent-blue)" stroke-width="3" stroke-linecap="round" stroke-linejoin="round"/>
|
| 44 |
+
<path d="M12 40v6a6 6 0 006 6h28a6 6 0 006-6v-6" stroke="var(--accent-blue)" stroke-width="3" stroke-linecap="round"/>
|
| 45 |
+
</svg>
|
| 46 |
+
</div>
|
| 47 |
+
<h2 class="upload-title">Drop your document here</h2>
|
| 48 |
+
<p class="upload-subtitle">or click to browse files</p>
|
| 49 |
+
<div class="upload-formats">
|
| 50 |
+
<span class="format-badge">PDF</span>
|
| 51 |
+
<span class="format-badge">DOCX</span>
|
| 52 |
+
<span class="format-badge">PNG</span>
|
| 53 |
+
<span class="format-badge">JPG</span>
|
| 54 |
+
<span class="format-badge">TIFF</span>
|
| 55 |
+
<span class="format-badge">BMP</span>
|
| 56 |
+
</div>
|
| 57 |
+
<p class="upload-limit">Maximum file size: 50MB</p>
|
| 58 |
+
<input type="file" id="fileInput" accept=".pdf,.docx,.png,.jpg,.jpeg,.tiff,.bmp,.webp" hidden>
|
| 59 |
+
</div>
|
| 60 |
+
|
| 61 |
+
<div class="url-section">
|
| 62 |
+
<div class="divider">
|
| 63 |
+
<span>OR</span>
|
| 64 |
+
</div>
|
| 65 |
+
<div class="url-input-container">
|
| 66 |
+
<div class="url-icon-subtle">
|
| 67 |
+
<svg width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"><path d="M10 13a5 5 0 0 0 7.54.54l3-3a5 5 0 0 0-7.07-7.07l-1.72 1.71"></path><path d="M14 11a5 5 0 0 0-7.54-.54l-3 3a5 5 0 0 0 7.07 7.07l1.71-1.71"></path></svg>
|
| 68 |
+
</div>
|
| 69 |
+
<input type="text" id="urlInput" placeholder="Paste a web URL here (e.g. https://wikipedia.org/...)">
|
| 70 |
+
<button class="btn-url" id="btnExtractUrl">
|
| 71 |
+
Summarize URL
|
| 72 |
+
</button>
|
| 73 |
+
</div>
|
| 74 |
+
</div>
|
| 75 |
+
</section>
|
| 76 |
+
|
| 77 |
+
<!-- Processing Indicator -->
|
| 78 |
+
<section class="processing-section hidden" id="processingSection">
|
| 79 |
+
<div class="processing-card">
|
| 80 |
+
<div class="processing-spinner">
|
| 81 |
+
<div class="spinner-ring"></div>
|
| 82 |
+
<div class="spinner-ring ring-inner"></div>
|
| 83 |
+
</div>
|
| 84 |
+
<h3 class="processing-title" id="processingTitle">Processing document...</h3>
|
| 85 |
+
<p class="processing-subtitle" id="processingSubtitle">Extracting text and running AI analysis</p>
|
| 86 |
+
<div class="processing-steps">
|
| 87 |
+
<div class="step" id="stepExtract">
|
| 88 |
+
<span class="step-icon">📄</span>
|
| 89 |
+
<span>Text Extraction</span>
|
| 90 |
+
<span class="step-status">⏳</span>
|
| 91 |
+
</div>
|
| 92 |
+
<div class="step" id="stepSummary">
|
| 93 |
+
<span class="step-icon">📝</span>
|
| 94 |
+
<span>Summarization</span>
|
| 95 |
+
<span class="step-status">⏳</span>
|
| 96 |
+
</div>
|
| 97 |
+
<div class="step" id="stepEntities">
|
| 98 |
+
<span class="step-icon">🏷️</span>
|
| 99 |
+
<span>Entity Recognition</span>
|
| 100 |
+
<span class="step-status">⏳</span>
|
| 101 |
+
</div>
|
| 102 |
+
<div class="step" id="stepSentiment">
|
| 103 |
+
<span class="step-icon">💭</span>
|
| 104 |
+
<span>Sentiment Analysis</span>
|
| 105 |
+
<span class="step-status">⏳</span>
|
| 106 |
+
</div>
|
| 107 |
+
</div>
|
| 108 |
+
<!-- Cancel Button -->
|
| 109 |
+
<button class="btn-cancel" id="btnCancelProcessing" title="Cancel and return to upload">
|
| 110 |
+
<svg width="18" height="18" viewBox="0 0 18 18" fill="none"><path d="M13.5 4.5l-9 9M4.5 4.5l9 9" stroke="currentColor" stroke-width="1.8" stroke-linecap="round" stroke-linejoin="round"/></svg>
|
| 111 |
+
Cancel
|
| 112 |
+
</button>
|
| 113 |
+
</div>
|
| 114 |
+
</section>
|
| 115 |
+
|
| 116 |
+
<!-- Results Section -->
|
| 117 |
+
<section class="results-section hidden" id="resultsSection">
|
| 118 |
+
<!-- File Info Bar -->
|
| 119 |
+
<div class="file-info-bar" id="fileInfoBar">
|
| 120 |
+
<div class="file-info-left">
|
| 121 |
+
<span class="file-type-icon" id="fileTypeIcon">📄</span>
|
| 122 |
+
<div>
|
| 123 |
+
<h3 class="file-name" id="fileName">document.pdf</h3>
|
| 124 |
+
<p class="file-meta" id="fileMeta">PDF • 2.3 MB • 15 pages</p>
|
| 125 |
+
</div>
|
| 126 |
+
</div>
|
| 127 |
+
<div class="file-info-right">
|
| 128 |
+
<span class="processing-time" id="processingTime">⏱ 1.2s</span>
|
| 129 |
+
<button class="btn-back" id="btnBackToUpload" title="Back to upload">
|
| 130 |
+
<svg width="18" height="18" viewBox="0 0 20 20" fill="none">
|
| 131 |
+
<path d="M15 10H5M10 15l-5-5 5-5" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round"/>
|
| 132 |
+
</svg>
|
| 133 |
+
Back
|
| 134 |
+
</button>
|
| 135 |
+
<button class="btn-new" id="btnNewUpload" title="Upload new document">
|
| 136 |
+
<svg width="18" height="18" viewBox="0 0 20 20" fill="none">
|
| 137 |
+
<path d="M10 4v12M4 10h12" stroke="currentColor" stroke-width="2" stroke-linecap="round"/>
|
| 138 |
+
</svg>
|
| 139 |
+
New Upload
|
| 140 |
+
</button>
|
| 141 |
+
</div>
|
| 142 |
+
</div>
|
| 143 |
+
|
| 144 |
+
<!-- Tabs -->
|
| 145 |
+
<div class="tabs">
|
| 146 |
+
<button class="tab active" data-tab="extracted" id="tabExtracted">
|
| 147 |
+
<svg width="18" height="18" viewBox="0 0 18 18" fill="none"><path d="M3 3h12v12H3z" stroke="currentColor" stroke-width="1.5" rx="2"/><path d="M6 7h6M6 10h4" stroke="currentColor" stroke-width="1.5" stroke-linecap="round"/></svg>
|
| 148 |
+
Extracted Text
|
| 149 |
+
</button>
|
| 150 |
+
<button class="tab" data-tab="summary" id="tabSummary">
|
| 151 |
+
<svg width="18" height="18" viewBox="0 0 18 18" fill="none"><path d="M3 5h12M3 9h8M3 13h10" stroke="currentColor" stroke-width="1.5" stroke-linecap="round"/></svg>
|
| 152 |
+
Summary
|
| 153 |
+
</button>
|
| 154 |
+
<button class="tab" data-tab="entities" id="tabEntities">
|
| 155 |
+
<svg width="18" height="18" viewBox="0 0 18 18" fill="none"><circle cx="7" cy="7" r="4" stroke="currentColor" stroke-width="1.5"/><path d="M14 14l-3-3" stroke="currentColor" stroke-width="1.5" stroke-linecap="round"/></svg>
|
| 156 |
+
Entities
|
| 157 |
+
</button>
|
| 158 |
+
<button class="tab" data-tab="sentiment" id="tabSentiment">
|
| 159 |
+
<svg width="18" height="18" viewBox="0 0 18 18" fill="none"><circle cx="9" cy="9" r="7" stroke="currentColor" stroke-width="1.5"/><path d="M6 11c.5 1 1.5 2 3 2s2.5-1 3-2" stroke="currentColor" stroke-width="1.5" stroke-linecap="round"/><circle cx="6.5" cy="7.5" r="1" fill="currentColor"/><circle cx="11.5" cy="7.5" r="1" fill="currentColor"/></svg>
|
| 160 |
+
Sentiment
|
| 161 |
+
</button>
|
| 162 |
+
<button class="tab" data-tab="metadata" id="tabMetadata">
|
| 163 |
+
<svg width="18" height="18" viewBox="0 0 18 18" fill="none"><circle cx="9" cy="9" r="7" stroke="currentColor" stroke-width="1.5"/><path d="M9 6v3l2 2" stroke="currentColor" stroke-width="1.5" stroke-linecap="round" stroke-linejoin="round"/></svg>
|
| 164 |
+
Metadata
|
| 165 |
+
</button>
|
| 166 |
+
</div>
|
| 167 |
+
|
| 168 |
+
<!-- Tab Content -->
|
| 169 |
+
<div class="tab-content">
|
| 170 |
+
<!-- Extracted Text -->
|
| 171 |
+
<div class="tab-panel active" id="panelExtracted">
|
| 172 |
+
<div class="panel-header">
|
| 173 |
+
<h3>Extracted Text</h3>
|
| 174 |
+
<div class="panel-actions">
|
| 175 |
+
<button class="btn-copy" id="btnCopyText" title="Copy to clipboard">
|
| 176 |
+
<svg width="16" height="16" viewBox="0 0 16 16" fill="none"><rect x="5" y="5" width="9" height="9" rx="1.5" stroke="currentColor" stroke-width="1.5"/><path d="M3 11V3a1 1 0 011-1h8" stroke="currentColor" stroke-width="1.5" stroke-linecap="round"/></svg>
|
| 177 |
+
Copy
|
| 178 |
+
</button>
|
| 179 |
+
<button class="btn-download" id="btnDownloadText" title="Download as .txt">
|
| 180 |
+
<svg width="16" height="16" viewBox="0 0 16 16" fill="none"><path d="M8 2v9M4 7l4 4 4-4M2 14h12" stroke="currentColor" stroke-width="1.5" stroke-linecap="round" stroke-linejoin="round"/></svg>
|
| 181 |
+
Download
|
| 182 |
+
</button>
|
| 183 |
+
</div>
|
| 184 |
+
</div>
|
| 185 |
+
<div class="text-content" id="extractedText">
|
| 186 |
+
<p class="placeholder">No text extracted yet.</p>
|
| 187 |
+
</div>
|
| 188 |
+
</div>
|
| 189 |
+
|
| 190 |
+
<!-- Summary -->
|
| 191 |
+
<div class="tab-panel" id="panelSummary">
|
| 192 |
+
<div class="panel-header">
|
| 193 |
+
<h3>AI Summary</h3>
|
| 194 |
+
<button class="btn-copy" id="btnCopySummary" title="Copy to clipboard">
|
| 195 |
+
<svg width="16" height="16" viewBox="0 0 16 16" fill="none"><rect x="5" y="5" width="9" height="9" rx="1.5" stroke="currentColor" stroke-width="1.5"/><path d="M3 11V3a1 1 0 011-1h8" stroke="currentColor" stroke-width="1.5" stroke-linecap="round"/></svg>
|
| 196 |
+
Copy
|
| 197 |
+
</button>
|
| 198 |
+
</div>
|
| 199 |
+
<div class="summary-content" id="summaryContent">
|
| 200 |
+
<p class="placeholder">No summary available.</p>
|
| 201 |
+
</div>
|
| 202 |
+
<div class="summary-stats hidden" id="summaryStats">
|
| 203 |
+
<div class="stat-card">
|
| 204 |
+
<span class="stat-value" id="statOriginalLen">0</span>
|
| 205 |
+
<span class="stat-label">Original chars</span>
|
| 206 |
+
</div>
|
| 207 |
+
<div class="stat-card">
|
| 208 |
+
<span class="stat-value" id="statSummaryLen">0</span>
|
| 209 |
+
<span class="stat-label">Summary chars</span>
|
| 210 |
+
</div>
|
| 211 |
+
<div class="stat-card">
|
| 212 |
+
<span class="stat-value" id="statCompression">0%</span>
|
| 213 |
+
<span class="stat-label">Compression</span>
|
| 214 |
+
</div>
|
| 215 |
+
<div class="stat-card">
|
| 216 |
+
<span class="stat-value" id="statAlgorithm">—</span>
|
| 217 |
+
<span class="stat-label">Algorithm</span>
|
| 218 |
+
</div>
|
| 219 |
+
</div>
|
| 220 |
+
</div>
|
| 221 |
+
|
| 222 |
+
<!-- Entities -->
|
| 223 |
+
<div class="tab-panel" id="panelEntities">
|
| 224 |
+
<div class="panel-header">
|
| 225 |
+
<h3>Named Entities</h3>
|
| 226 |
+
<span class="entity-count" id="entityCount">0 entities found</span>
|
| 227 |
+
</div>
|
| 228 |
+
<div class="entity-categories" id="entityCategories">
|
| 229 |
+
<p class="placeholder">No entities detected.</p>
|
| 230 |
+
</div>
|
| 231 |
+
<div class="entity-list" id="entityList"></div>
|
| 232 |
+
</div>
|
| 233 |
+
|
| 234 |
+
<!-- Sentiment -->
|
| 235 |
+
<div class="tab-panel" id="panelSentiment">
|
| 236 |
+
<div class="panel-header">
|
| 237 |
+
<h3>Sentiment Analysis</h3>
|
| 238 |
+
</div>
|
| 239 |
+
<div class="sentiment-overview" id="sentimentOverview">
|
| 240 |
+
<p class="placeholder">No sentiment data available.</p>
|
| 241 |
+
</div>
|
| 242 |
+
</div>
|
| 243 |
+
|
| 244 |
+
<!-- Metadata -->
|
| 245 |
+
<div class="tab-panel" id="panelMetadata">
|
| 246 |
+
<div class="panel-header">
|
| 247 |
+
<h3>Document Metadata</h3>
|
| 248 |
+
</div>
|
| 249 |
+
<div class="metadata-content" id="metadataContent">
|
| 250 |
+
<p class="placeholder">No metadata available.</p>
|
| 251 |
+
</div>
|
| 252 |
+
</div>
|
| 253 |
+
</div>
|
| 254 |
+
</section>
|
| 255 |
+
</main>
|
| 256 |
+
|
| 257 |
+
<!-- Footer -->
|
| 258 |
+
<footer class="footer">
|
| 259 |
+
<p>Alldocex v1.0 — Powered by FastAPI, spaCy, VADER & Tesseract OCR</p>
|
| 260 |
+
</footer>
|
| 261 |
+
</div>
|
| 262 |
+
|
| 263 |
+
<!-- Toast Container -->
|
| 264 |
+
<div class="toast-container" id="toastContainer"></div>
|
| 265 |
+
|
| 266 |
+
<script src="/static/app.js"></script>
|
| 267 |
+
</body>
|
| 268 |
+
</html>
|
static/styles.css
ADDED
|
@@ -0,0 +1,1156 @@
|
| 1 |
+
/* --- CSS Variables / Design Tokens (Corporate Blue & White) --- */
|
| 2 |
+
:root {
|
| 3 |
+
/* Primary Colors */
|
| 4 |
+
--bg-primary: #f8fafc;
|
| 5 |
+
--bg-secondary: #ffffff;
|
| 6 |
+
--bg-card: #ffffff;
|
| 7 |
+
--bg-subtle: #f1f5f9;
|
| 8 |
+
|
| 9 |
+
/* Borders & Accents */
|
| 10 |
+
--border-light: #e2e8f0;
|
| 11 |
+
--border-accent: rgba(37, 99, 235, 0.2);
|
| 12 |
+
|
| 13 |
+
/* Text */
|
| 14 |
+
--text-primary: #1e293b;
|
| 15 |
+
--text-secondary: #475569;
|
| 16 |
+
--text-muted: #94a3b8;
|
| 17 |
+
|
| 18 |
+
/* Corporate Blue Palette */
|
| 19 |
+
--accent-blue-deep: #1e40af;
|
| 20 |
+
--accent-blue: #2563eb;
|
| 21 |
+
--accent-blue-light: #60a5fa;
|
| 22 |
+
--accent-blue-subtle: #dbeafe;
|
| 23 |
+
|
| 24 |
+
/* Functional Colors */
|
| 25 |
+
--accent-green: #10b981;
|
| 26 |
+
--accent-yellow: #f59e0b;
|
| 27 |
+
--accent-red: #ef4444;
|
| 28 |
+
|
| 29 |
+
/* Gradients */
|
| 30 |
+
--gradient-primary: linear-gradient(135deg, #1e40af, #2563eb);
|
| 31 |
+
--gradient-professional: linear-gradient(135deg, #2563eb, #60a5fa);
|
| 32 |
+
|
| 33 |
+
/* Shadows (Clean & Soft) */
|
| 34 |
+
--shadow-sm: 0 1px 3px rgba(0, 0, 0, 0.1);
|
| 35 |
+
--shadow-md: 0 4px 6px -1px rgba(0, 0, 0, 0.1), 0 2px 4px -1px rgba(0, 0, 0, 0.06);
|
| 36 |
+
--shadow-lg: 0 10px 15px -3px rgba(0, 0, 0, 0.1), 0 4px 6px -2px rgba(0, 0, 0, 0.05);
|
| 37 |
+
--shadow-glow: 0 0 20px rgba(37, 99, 235, 0.1);
|
| 38 |
+
|
| 39 |
+
/* Radius */
|
| 40 |
+
--radius-sm: 6px;
|
| 41 |
+
--radius-md: 8px;
|
| 42 |
+
--radius-lg: 12px;
|
| 43 |
+
--radius-xl: 16px;
|
| 44 |
+
|
| 45 |
+
/* Font */
|
| 46 |
+
--font-main: 'Inter', -apple-system, BlinkMacSystemFont, sans-serif;
|
| 47 |
+
|
| 48 |
+
/* Transitions */
|
| 49 |
+
--transition-fast: 0.15s ease;
|
| 50 |
+
--transition-normal: 0.25s ease;
|
| 51 |
+
--transition-smooth: 0.4s cubic-bezier(0.4, 0, 0.2, 1);
|
| 52 |
+
}
|
| 53 |
+
|
| 54 |
+
/* --- Reset & Base --- */
|
| 55 |
+
*, *::before, *::after {
|
| 56 |
+
box-sizing: border-box;
|
| 57 |
+
margin: 0;
|
| 58 |
+
padding: 0;
|
| 59 |
+
}
|
| 60 |
+
|
| 61 |
+
html {
|
| 62 |
+
font-size: 16px;
|
| 63 |
+
scroll-behavior: smooth;
|
| 64 |
+
}
|
| 65 |
+
|
| 66 |
+
body {
|
| 67 |
+
font-family: var(--font-main);
|
| 68 |
+
background: var(--bg-primary);
|
| 69 |
+
color: var(--text-primary);
|
| 70 |
+
min-height: 100vh;
|
| 71 |
+
line-height: 1.6;
|
| 72 |
+
-webkit-font-smoothing: antialiased;
|
| 73 |
+
}
|
| 74 |
+
|
| 75 |
+
/* --- Background Decorations --- */
|
| 76 |
+
.bg-orbs {
|
| 77 |
+
position: fixed;
|
| 78 |
+
inset: 0;
|
| 79 |
+
z-index: -1;
|
| 80 |
+
background: radial-gradient(circle at top right, var(--accent-blue-subtle), transparent 400px),
|
| 81 |
+
radial-gradient(circle at bottom left, #f1f5f9, transparent 400px);
|
| 82 |
+
opacity: 0.5;
|
| 83 |
+
}
|
| 84 |
+
|
| 85 |
+
/* --- App Container --- */
|
| 86 |
+
.app-container {
|
| 87 |
+
max-width: 1100px;
|
| 88 |
+
margin: 0 auto;
|
| 89 |
+
padding: 32px 20px;
|
| 90 |
+
min-height: 100vh;
|
| 91 |
+
display: flex;
|
| 92 |
+
flex-direction: column;
|
| 93 |
+
}
|
| 94 |
+
|
| 95 |
+
/* --- Header --- */
|
| 96 |
+
.header {
|
| 97 |
+
display: flex;
|
| 98 |
+
align-items: center;
|
| 99 |
+
justify-content: space-between;
|
| 100 |
+
padding: 20px 32px;
|
| 101 |
+
background: var(--bg-secondary);
|
| 102 |
+
border: 1px solid var(--border-light);
|
| 103 |
+
border-radius: var(--radius-lg);
|
| 104 |
+
box-shadow: var(--shadow-sm);
|
| 105 |
+
margin-bottom: 32px;
|
| 106 |
+
}
|
| 107 |
+
|
| 108 |
+
.logo {
|
| 109 |
+
display: flex;
|
| 110 |
+
align-items: center;
|
| 111 |
+
gap: 16px;
|
| 112 |
+
}
|
| 113 |
+
|
| 114 |
+
.logo-icon {
|
| 115 |
+
display: flex;
|
| 116 |
+
align-items: center;
|
| 117 |
+
justify-content: center;
|
| 118 |
+
width: 44px;
|
| 119 |
+
height: 44px;
|
| 120 |
+
background: var(--accent-blue-subtle);
|
| 121 |
+
border-radius: var(--radius-md);
|
| 122 |
+
color: var(--accent-blue);
|
| 123 |
+
}
|
| 124 |
+
|
| 125 |
+
.logo h1 {
|
| 126 |
+
font-size: 1.5rem;
|
| 127 |
+
font-weight: 800;
|
| 128 |
+
color: var(--accent-blue-deep);
|
| 129 |
+
letter-spacing: -0.5px;
|
| 130 |
+
}
|
| 131 |
+
|
| 132 |
+
.logo-subtitle {
|
| 133 |
+
font-size: 0.75rem;
|
| 134 |
+
color: var(--text-secondary);
|
| 135 |
+
font-weight: 500;
|
| 136 |
+
text-transform: uppercase;
|
| 137 |
+
letter-spacing: 0.5px;
|
| 138 |
+
}
|
| 139 |
+
|
| 140 |
+
/* --- Main Content --- */
|
| 141 |
+
.main-content {
|
| 142 |
+
flex: 1;
|
| 143 |
+
}
|
| 144 |
+
|
| 145 |
+
/* --- Upload Section --- */
|
| 146 |
+
.upload-zone {
|
| 147 |
+
display: flex;
|
| 148 |
+
flex-direction: column;
|
| 149 |
+
align-items: center;
|
| 150 |
+
justify-content: center;
|
| 151 |
+
padding: 64px 40px;
|
| 152 |
+
background: var(--bg-secondary);
|
| 153 |
+
border: 2px dashed var(--border-light);
|
| 154 |
+
border-radius: var(--radius-xl);
|
| 155 |
+
cursor: pointer;
|
| 156 |
+
transition: var(--transition-smooth);
|
| 157 |
+
box-shadow: var(--shadow-sm);
|
| 158 |
+
}
|
| 159 |
+
|
| 160 |
+
.upload-zone:hover {
|
| 161 |
+
border-color: var(--accent-blue-light);
|
| 162 |
+
background: var(--accent-blue-subtle);
|
| 163 |
+
box-shadow: var(--shadow-glow);
|
| 164 |
+
transform: translateY(-2px);
|
| 165 |
+
}
|
| 166 |
+
|
| 167 |
+
.upload-zone.drag-over {
|
| 168 |
+
border-color: var(--accent-blue);
|
| 169 |
+
background: var(--accent-blue-subtle);
|
| 170 |
+
box-shadow: 0 0 20px rgba(37,99,235, 0.15);
|
| 171 |
+
}
|
| 172 |
+
|
| 173 |
+
.upload-icon {
|
| 174 |
+
margin-bottom: 24px;
|
| 175 |
+
color: var(--accent-blue);
|
| 176 |
+
}
|
| 177 |
+
|
| 178 |
+
.upload-title {
|
| 179 |
+
font-size: 1.5rem;
|
| 180 |
+
font-weight: 700;
|
| 181 |
+
margin-bottom: 8px;
|
| 182 |
+
color: var(--text-primary);
|
| 183 |
+
}
|
| 184 |
+
|
| 185 |
+
.upload-subtitle {
|
| 186 |
+
color: var(--text-secondary);
|
| 187 |
+
font-size: 0.95rem;
|
| 188 |
+
margin-bottom: 24px;
|
| 189 |
+
}
|
| 190 |
+
|
| 191 |
+
.upload-formats {
|
| 192 |
+
display: flex;
|
| 193 |
+
gap: 8px;
|
| 194 |
+
flex-wrap: wrap;
|
| 195 |
+
justify-content: center;
|
| 196 |
+
margin-bottom: 16px;
|
| 197 |
+
}
|
| 198 |
+
|
| 199 |
+
.format-badge {
|
| 200 |
+
padding: 6px 14px;
|
| 201 |
+
font-size: 0.75rem;
|
| 202 |
+
font-weight: 600;
|
| 203 |
+
color: var(--accent-blue);
|
| 204 |
+
background: var(--accent-blue-subtle);
|
| 205 |
+
border-radius: 100px;
|
| 206 |
+
text-transform: uppercase;
|
| 207 |
+
cursor: pointer;
|
| 208 |
+
transition: var(--transition-fast);
|
| 209 |
+
}
|
| 210 |
+
|
| 211 |
+
.format-badge:hover {
|
| 212 |
+
background: var(--accent-blue);
|
| 213 |
+
color: white;
|
| 214 |
+
transform: translateY(-1px);
|
| 215 |
+
box-shadow: var(--shadow-sm);
|
| 216 |
+
}
|
| 217 |
+
|
| 218 |
+
.upload-limit {
|
| 219 |
+
font-size: 0.75rem;
|
| 220 |
+
color: var(--text-muted);
|
| 221 |
+
}
|
| 222 |
+
|
| 223 |
+
/* --- URL Section --- */
|
| 224 |
+
.url-section {
|
| 225 |
+
margin-top: 32px;
|
| 226 |
+
width: 100%;
|
| 227 |
+
}
|
| 228 |
+
|
| 229 |
+
.divider {
|
| 230 |
+
display: flex;
|
| 231 |
+
align-items: center;
    text-align: center;
    margin-bottom: 24px;
    color: var(--text-muted);
    font-size: 0.75rem;
    font-weight: 600;
    letter-spacing: 1px;
}

.divider::before, .divider::after {
    content: '';
    flex: 1;
    border-bottom: 1px solid var(--border-light);
}

.divider:not(:empty)::before {
    margin-right: 16px;
}

.divider:not(:empty)::after {
    margin-left: 16px;
}

.url-input-container {
    display: flex;
    gap: 12px;
    background: var(--bg-secondary);
    border: 1px solid var(--border-light);
    padding: 8px;
    border-radius: var(--radius-lg);
    box-shadow: var(--shadow-sm);
    transition: var(--transition-normal);
}

.url-input-container:focus-within {
    border-color: var(--accent-blue);
    box-shadow: var(--shadow-glow);
}

.url-icon-subtle {
    display: flex;
    align-items: center;
    padding-left: 12px;
    color: var(--text-muted);
}

.url-input-container input {
    flex: 1;
    border: none;
    background: transparent;
    font-family: var(--font-main);
    font-size: 0.95rem;
    color: var(--text-primary);
    outline: none;
}

.btn-url {
    background: var(--gradient-primary);
    color: white;
    border: none;
    padding: 10px 24px;
    border-radius: var(--radius-md);
    font-family: var(--font-main);
    font-size: 0.85rem;
    font-weight: 600;
    cursor: pointer;
    transition: var(--transition-normal);
    white-space: nowrap;
}

.btn-url:hover {
    transform: translateY(-1px);
    box-shadow: 0 4px 12px rgba(37, 99, 235, 0.3);
}

/* --- Transitions & Animations --- */
@keyframes fadeInUp {
    from { opacity: 0; transform: translateY(15px); }
    to { opacity: 1; transform: translateY(0); }
}

@keyframes fadeIn {
    from { opacity: 0; }
    to { opacity: 1; }
}

.upload-section, .processing-section, .results-section {
    animation: fadeInUp 0.5s var(--transition-smooth);
}

.hidden {
    display: none !important;
}

/* --- Processing Section --- */
.processing-card {
    display: flex;
    flex-direction: column;
    align-items: center;
    padding: 60px 40px;
    background: var(--bg-secondary);
    border: 1px solid var(--border-light);
    border-radius: var(--radius-xl);
    box-shadow: var(--shadow-md);
}

.processing-spinner {
    position: relative;
    width: 80px;
    height: 80px;
    margin-bottom: 24px;
}

.spinner-ring {
    position: absolute;
    inset: 0;
    border-radius: 50%;
    border: 3px solid #e2e8f0;
    border-top-color: var(--accent-blue);
    animation: spin 1s linear infinite;
}

.ring-inner {
    inset: 10px;
    border-top-color: var(--accent-blue-light);
    animation-direction: reverse;
    animation-duration: 0.8s;
}

@keyframes spin {
    to { transform: rotate(360deg); }
}

.processing-title {
    font-size: 1.2rem;
    font-weight: 600;
    margin-bottom: 6px;
}

.processing-subtitle {
    color: var(--text-secondary);
    font-size: 0.85rem;
    margin-bottom: 30px;
}

.processing-steps {
    display: flex;
    flex-direction: column;
    gap: 12px;
    width: 100%;
    max-width: 360px;
}

.step {
    display: flex;
    align-items: center;
    gap: 12px;
    padding: 10px 16px;
    background: var(--bg-primary);
    border-radius: var(--radius-sm);
    border: 1px solid var(--border-light);
    font-size: 0.85rem;
    color: var(--text-secondary);
    transition: var(--transition-normal);
}

.step.active {
    border-color: var(--accent-blue);
    color: var(--accent-blue-deep);
    background: var(--accent-blue-subtle);
}

.step.done {
    border-color: rgba(16, 185, 129, 0.3);
    color: var(--accent-green);
}

.step-icon {
    font-size: 1.1rem;
}

.step-status {
    margin-left: auto;
    font-size: 0.9rem;
}

/* --- Results Section --- */
.results-section {
    animation: fadeInUp 0.5s ease;
}

/* File Info Bar */
.file-info-bar {
    display: flex;
    align-items: center;
    justify-content: space-between;
    padding: 16px 24px;
    background: var(--bg-secondary);
    border: 1px solid var(--border-light);
    border-radius: var(--radius-lg);
    box-shadow: var(--shadow-sm);
    margin-bottom: 20px;
}

.file-info-left {
    display: flex;
    align-items: center;
    gap: 14px;
}

.file-type-icon {
    font-size: 2rem;
}

.file-name {
    font-size: 1rem;
    font-weight: 600;
}

.file-meta {
    font-size: 0.8rem;
    color: var(--text-secondary);
}

.file-info-right {
    display: flex;
    align-items: center;
    gap: 12px;
}

.processing-time {
    font-size: 0.8rem;
    color: var(--accent-blue);
    padding: 4px 12px;
    background: var(--accent-blue-subtle);
    border-radius: 100px;
}

.btn-new, .btn-back {
    display: flex;
    align-items: center;
    gap: 8px;
    padding: 8px 18px;
    border: none;
    border-radius: var(--radius-sm);
    font-family: var(--font-main);
    font-size: 0.85rem;
    font-weight: 600;
    cursor: pointer;
    transition: var(--transition-normal);
}

.btn-new {
    background: var(--gradient-primary);
    color: white;
}

.btn-back {
    background: var(--bg-glass-strong);
    color: var(--text-secondary);
    border: 1px solid var(--border-glass);
}

.btn-new:hover, .btn-back:hover {
    transform: translateY(-1px);
    box-shadow: var(--shadow-md);
}

.btn-new:hover {
    box-shadow: 0 4px 20px rgba(139, 92, 246, 0.4);
}

.btn-back:hover {
    color: var(--text-primary);
    border-color: var(--border-accent);
}

.btn-cancel {
    margin-top: 24px;
    display: flex;
    align-items: center;
    gap: 8px;
    padding: 10px 24px;
    background: rgba(239, 68, 68, 0.1);
    border: 1px solid rgba(239, 68, 68, 0.2);
    border-radius: var(--radius-sm);
    color: var(--accent-red);
    font-family: var(--font-main);
    font-size: 0.85rem;
    font-weight: 600;
    cursor: pointer;
    transition: var(--transition-normal);
}

.btn-cancel:hover {
    background: rgba(239, 68, 68, 0.2);
    border-color: rgba(239, 68, 68, 0.4);
    transform: translateY(-1px);
}

/* --- Tabs --- */
.tabs {
    display: flex;
    gap: 4px;
    padding: 4px;
    background: var(--bg-subtle);
    border: 1px solid var(--border-light);
    border-radius: var(--radius-lg);
    margin-bottom: 20px;
    overflow-x: auto;
}

.tab {
    display: flex;
    align-items: center;
    gap: 8px;
    padding: 10px 18px;
    background: transparent;
    border: 1px solid transparent;
    border-radius: var(--radius-md);
    color: var(--text-secondary);
    font-family: var(--font-main);
    font-size: 0.82rem;
    font-weight: 500;
    cursor: pointer;
    transition: var(--transition-normal);
    white-space: nowrap;
    flex: 1;
    justify-content: center;
}

.tab:hover {
    color: var(--accent-blue);
    background: #fff;
}

.tab.active {
    color: var(--accent-blue);
    background: #fff;
    border-color: var(--accent-blue);
    box-shadow: var(--shadow-sm);
}

.tab svg {
    opacity: 0.7;
    flex-shrink: 0;
}

.tab.active svg {
    opacity: 1;
}

/* --- Tab Panels --- */
.tab-content {
    position: relative;
}

.tab-panel {
    display: none;
    animation: fadeIn 0.3s ease;
}

.tab-panel.active {
    display: block;
}

.panel-header {
    display: flex;
    align-items: center;
    justify-content: space-between;
    margin-bottom: 16px;
}

.panel-header h3 {
    font-size: 1.1rem;
    font-weight: 600;
}

.panel-actions {
    display: flex;
    gap: 8px;
}

.btn-copy, .btn-download {
    display: flex;
    align-items: center;
    gap: 6px;
    padding: 6px 14px;
    background: var(--bg-glass-strong);
    border: 1px solid var(--border-glass);
    border-radius: var(--radius-sm);
    color: var(--text-secondary);
    font-family: var(--font-main);
    font-size: 0.8rem;
    cursor: pointer;
    transition: var(--transition-normal);
}

.btn-copy:hover, .btn-download:hover {
    color: var(--accent-cyan);
    border-color: rgba(6, 182, 212, 0.3);
    background: rgba(6, 182, 212, 0.05);
}

.btn-copy.copied {
    color: var(--accent-green);
    border-color: rgba(16, 185, 129, 0.3);
}

/* --- Text Content --- */
.text-content, .summary-content {
    padding: 24px;
    background: #ffffff;
    border: 1px solid var(--border-light);
    border-radius: var(--radius-lg);
    color: var(--text-primary);
    box-shadow: var(--shadow-sm);
    max-height: 500px;
    overflow-y: auto;
    font-size: 0.9rem;
    line-height: 1.8;
    white-space: pre-wrap;
    word-wrap: break-word;
}

.summary-content {
    border-left: 4px solid var(--accent-blue);
    background: #fafbff;
    font-size: 0.95rem;
}

.placeholder {
    color: var(--text-muted);
    font-style: italic;
    text-align: center;
    padding: 30px 0;
}

/* --- Summary Stats --- */
.summary-stats {
    display: grid;
    grid-template-columns: repeat(4, 1fr);
    gap: 12px;
    margin-top: 16px;
}

.stat-card {
    display: flex;
    flex-direction: column;
    align-items: center;
    gap: 4px;
    padding: 16px 12px;
    background: var(--bg-glass);
    border: 1px solid var(--border-glass);
    border-radius: var(--radius-md);
    text-align: center;
}

.stat-value {
    font-size: 1.2rem;
    font-weight: 700;
    background: var(--gradient-primary);
    -webkit-background-clip: text;
    -webkit-text-fill-color: transparent;
    background-clip: text;
}

.stat-label {
    font-size: 0.7rem;
    color: var(--text-muted);
    text-transform: uppercase;
    letter-spacing: 0.5px;
}

/* --- Entities --- */
.entity-count {
    font-size: 0.85rem;
    color: var(--accent-cyan);
    padding: 4px 12px;
    background: rgba(6, 182, 212, 0.1);
    border-radius: 100px;
}

.entity-categories {
    display: flex;
    flex-wrap: wrap;
    gap: 8px;
    margin-bottom: 20px;
}

.entity-category-badge {
    display: flex;
    align-items: center;
    gap: 6px;
    padding: 6px 14px;
    background: var(--bg-glass);
    border: 1px solid var(--border-glass);
    border-radius: 100px;
    font-size: 0.78rem;
    font-weight: 500;
    cursor: pointer;
    transition: var(--transition-normal);
}

.entity-category-badge:hover {
    background: var(--bg-glass-strong);
}

.entity-category-badge .cat-dot {
    width: 8px;
    height: 8px;
    border-radius: 50%;
}

.entity-category-badge .cat-count {
    font-weight: 700;
    margin-left: 4px;
}

.entity-list {
    display: grid;
    grid-template-columns: repeat(auto-fill, minmax(280px, 1fr));
    gap: 10px;
}

.entity-item {
    display: flex;
    align-items: center;
    justify-content: space-between;
    padding: 12px 16px;
    background: var(--bg-glass);
    border: 1px solid var(--border-glass);
    border-radius: var(--radius-md);
    transition: var(--transition-normal);
}

.entity-item:hover {
    background: var(--bg-glass-strong);
    border-color: var(--border-accent);
}

.entity-item-left {
    display: flex;
    align-items: center;
    gap: 10px;
    min-width: 0;
}

.entity-type-badge {
    padding: 2px 8px;
    font-size: 0.65rem;
    font-weight: 700;
    border-radius: 4px;
    letter-spacing: 0.5px;
    text-transform: uppercase;
    white-space: nowrap;
    flex-shrink: 0;
}

.entity-text {
    font-size: 0.88rem;
    font-weight: 500;
    white-space: nowrap;
    overflow: hidden;
    text-overflow: ellipsis;
}

.entity-item-count {
    font-size: 0.75rem;
    color: var(--text-muted);
    padding: 2px 8px;
    background: var(--bg-glass-strong);
    border-radius: 100px;
    flex-shrink: 0;
    margin-left: 8px;
}

/* Entity color mapping */
.badge-PERSON { background: rgba(236, 72, 153, 0.15); color: var(--accent-pink); }
.badge-ORG { background: rgba(59, 130, 246, 0.15); color: var(--accent-blue); }
.badge-GPE { background: rgba(16, 185, 129, 0.15); color: var(--accent-green); }
.badge-DATE { background: rgba(245, 158, 11, 0.15); color: var(--accent-yellow); }
.badge-MONEY { background: rgba(139, 92, 246, 0.15); color: var(--accent-purple); }
.badge-EVENT { background: rgba(6, 182, 212, 0.15); color: var(--accent-cyan); }
.badge-PRODUCT { background: rgba(251, 146, 60, 0.15); color: #fb923c; }
.badge-LAW { background: rgba(168, 85, 247, 0.15); color: #a855f7; }
.badge-NORP { background: rgba(244, 114, 182, 0.15); color: #f472b6; }
.badge-EMAIL { background: rgba(6, 182, 212, 0.15); color: var(--accent-cyan); }
.badge-PHONE { background: rgba(59, 130, 246, 0.15); color: var(--accent-blue); }
.badge-URL { background: rgba(16, 185, 129, 0.15); color: var(--accent-green); }
.badge-TIME { background: rgba(245, 158, 11, 0.15); color: var(--accent-yellow); }
.badge-PERCENT { background: rgba(139, 92, 246, 0.15); color: var(--accent-purple); }
.badge-CARDINAL { background: rgba(100, 116, 139, 0.15); color: #94a3b8; }
.badge-ORDINAL { background: rgba(100, 116, 139, 0.15); color: #94a3b8; }
.badge-QUANTITY { background: rgba(251, 146, 60, 0.15); color: #fb923c; }

/* --- Sentiment --- */
.sentiment-overview {
    display: flex;
    flex-direction: column;
    gap: 20px;
}

.sentiment-gauge-container {
    display: flex;
    flex-direction: column;
    align-items: center;
    gap: 16px;
    padding: 30px;
    background: var(--bg-glass);
    border: 1px solid var(--border-glass);
    border-radius: var(--radius-lg);
    backdrop-filter: blur(20px);
}

.sentiment-label-display {
    font-size: 1.6rem;
    font-weight: 800;
    letter-spacing: -0.5px;
}

.sentiment-score {
    font-size: 3rem;
    font-weight: 800;
    background: var(--gradient-primary);
    -webkit-background-clip: text;
    -webkit-text-fill-color: transparent;
    background-clip: text;
    line-height: 1;
}

.sentiment-bar-container {
    width: 100%;
    max-width: 500px;
}

.sentiment-bar {
    width: 100%;
    height: 12px;
    border-radius: 6px;
    background: var(--bg-glass-strong);
    overflow: hidden;
    display: flex;
}

.sentiment-bar-positive {
    background: var(--accent-green);
    transition: width 0.8s ease;
}

.sentiment-bar-neutral {
    background: var(--text-muted);
    transition: width 0.8s ease;
}

.sentiment-bar-negative {
    background: var(--accent-red);
    transition: width 0.8s ease;
}

.sentiment-bar-labels {
    display: flex;
    justify-content: space-between;
    margin-top: 8px;
    font-size: 0.75rem;
    color: var(--text-secondary);
}

.sentiment-bar-labels span {
    display: flex;
    align-items: center;
    gap: 6px;
}

.sentiment-bar-labels .dot {
    width: 8px;
    height: 8px;
    border-radius: 50%;
}

.dot-pos { background: var(--accent-green); }
.dot-neu { background: var(--text-muted); }
.dot-neg { background: var(--accent-red); }

.sentiment-sentences {
    background: var(--bg-glass);
    border: 1px solid var(--border-glass);
    border-radius: var(--radius-lg);
    padding: 20px;
    max-height: 400px;
    overflow-y: auto;
}

.sentiment-sentences h4 {
    font-size: 0.9rem;
    font-weight: 600;
    margin-bottom: 12px;
    color: var(--text-secondary);
}

.sentence-item {
    display: flex;
    align-items: flex-start;
    gap: 12px;
    padding: 10px 0;
    border-bottom: 1px solid var(--border-glass);
    font-size: 0.85rem;
}

.sentence-item:last-child {
    border-bottom: none;
}

.sentence-sentiment-badge {
    padding: 2px 8px;
    font-size: 0.65rem;
    font-weight: 700;
    border-radius: 4px;
    white-space: nowrap;
    flex-shrink: 0;
    margin-top: 2px;
}

.sent-positive { background: rgba(16, 185, 129, 0.15); color: var(--accent-green); }
.sent-negative { background: rgba(239, 68, 68, 0.15); color: var(--accent-red); }
.sent-neutral { background: rgba(100, 116, 139, 0.15); color: var(--text-muted); }

.sentence-text {
    color: var(--text-secondary);
    line-height: 1.5;
}

/* --- Metadata --- */
.metadata-content {
    background: var(--bg-glass);
    border: 1px solid var(--border-glass);
    border-radius: var(--radius-lg);
    overflow: hidden;
}

.metadata-table {
    width: 100%;
    border-collapse: collapse;
}

.metadata-table tr {
    border-bottom: 1px solid var(--border-glass);
}

.metadata-table tr:last-child {
    border-bottom: none;
}

.metadata-table td {
    padding: 12px 20px;
    font-size: 0.88rem;
}

.metadata-table td:first-child {
    font-weight: 600;
    color: var(--text-secondary);
    width: 200px;
    white-space: nowrap;
}

.metadata-table td:last-child {
    color: var(--text-primary);
}

/* --- Footer --- */
.footer {
    text-align: center;
    padding: 24px 0 12px;
    font-size: 0.75rem;
    color: var(--text-muted);
}

/* --- Toast Notifications --- */
.toast-container {
    position: fixed;
    bottom: 24px;
    right: 24px;
    z-index: 1000;
    display: flex;
    flex-direction: column;
    gap: 10px;
}

.toast {
    display: flex;
    align-items: center;
    gap: 10px;
    padding: 14px 20px;
    background: var(--bg-card);
    border: 1px solid var(--border-glass);
    border-radius: var(--radius-md);
    backdrop-filter: blur(20px);
    color: var(--text-primary);
    font-size: 0.85rem;
    box-shadow: var(--shadow-lg);
    animation: slideInRight 0.3s ease, fadeOut 0.5s ease 3.5s forwards;
    max-width: 380px;
}

.toast.toast-error {
    border-color: rgba(239, 68, 68, 0.3);
}

.toast.toast-success {
    border-color: rgba(16, 185, 129, 0.3);
}

.toast-icon {
    font-size: 1.2rem;
    flex-shrink: 0;
}

/* --- Utility Classes --- */
.hidden {
    display: none !important;
}

/* --- Animations --- */
@keyframes fadeInUp {
    from {
        opacity: 0;
        transform: translateY(20px);
    }
    to {
        opacity: 1;
        transform: translateY(0);
    }
}

@keyframes fadeIn {
    from { opacity: 0; }
    to { opacity: 1; }
}

@keyframes slideInRight {
    from {
        opacity: 0;
        transform: translateX(40px);
    }
    to {
        opacity: 1;
        transform: translateX(0);
    }
}

@keyframes fadeOut {
    from { opacity: 1; }
    to { opacity: 0; }
}

/* --- Responsive --- */
@media (max-width: 768px) {
    .app-container {
        padding: 12px;
    }

    .header {
        flex-direction: column;
        gap: 12px;
        padding: 14px;
    }

    .upload-zone {
        padding: 40px 24px;
    }

    .upload-title {
        font-size: 1.1rem;
    }

    .tabs {
        overflow-x: auto;
    }

    .tab {
        padding: 8px 12px;
        font-size: 0.75rem;
    }

    .tab svg {
        display: none;
    }

    .summary-stats {
        grid-template-columns: repeat(2, 1fr);
    }

    .entity-list {
        grid-template-columns: 1fr;
    }

    .file-info-bar {
        flex-direction: column;
        gap: 12px;
        align-items: flex-start;
    }

    .file-info-right {
        width: 100%;
        justify-content: flex-end;
    }

    .metadata-table td:first-child {
        width: 140px;
    }

    .sentiment-score {
        font-size: 2rem;
    }
}

@media (max-width: 480px) {
    .summary-stats {
        grid-template-columns: 1fr 1fr;
    }

    .format-badge {
        font-size: 0.65rem;
        padding: 3px 8px;
    }
}

test_api.py
ADDED
|
@@ -0,0 +1,78 @@
"""Quick API test script for Alldocex."""
import requests
import time
import json

BASE_URL = "http://localhost:8000"

# Upload the test document
print("Uploading test_document.docx...")
with open("test_document.docx", "rb") as f:
    res = requests.post(
        f"{BASE_URL}/api/upload",
        files={"file": ("test_document.docx", f, "application/vnd.openxmlformats-officedocument.wordprocessingml.document")}
    )

data = res.json()
print(f"Upload response: {data['status']} - File ID: {data['file_id']}")
task_id = data["file_id"]

# Poll for results
print("Waiting for processing...")
for i in range(30):
    time.sleep(1)
    res = requests.get(f"{BASE_URL}/api/status/{task_id}")
    result = res.json()
    status = result["status"]
    print(f"  Poll {i+1}: {status}")
    if status in ("completed", "error"):
        break

print(f"\n{'='*50}")
print(f"STATUS: {result['status']}")
print(f"Processing time: {round(result.get('processing_time_ms', 0), 1)} ms")
print(f"{'='*50}")

# Extraction
if result.get("extraction"):
    ext = result["extraction"]
    print(f"\n--- EXTRACTION ---")
    print(f"Success: {ext['success']}")
    print(f"Word count: {ext['metadata']['word_count']}")
    print(f"Char count: {ext['metadata']['character_count']}")
    print(f"File type: {ext['metadata']['file_type']}")
    print(f"First 300 chars:\n{ext['raw_text'][:300]}")

# Summary
if result.get("summary"):
    s = result["summary"]
    print(f"\n--- SUMMARY ---")
    print(f"Algorithm: {s['algorithm']}")
    print(f"Original length: {s['original_length']}")
    print(f"Summary length: {s['summary_length']}")
    print(f"Compression: {round((1 - s['compression_ratio']) * 100, 1)}%")
    print(f"Summary:\n{s['summary'][:500]}")

# Entities
if result.get("entities"):
    e = result["entities"]
    print(f"\n--- ENTITIES ---")
    print(f"Total entities: {e['total_entities']}")
    print(f"Categories: {json.dumps(e['entity_counts'], indent=2)}")
    for ent in e["entities"][:20]:
        print(f"  [{ent['label']:8s}] {ent['text']} (x{ent['count']})")

# Sentiment
if result.get("sentiment"):
    sent = result["sentiment"]
    print(f"\n--- SENTIMENT ---")
    print(f"Label: {sent['overall_label']}")
    print(f"Compound: {sent['overall_compound']}")
    print(f"Positive: {sent['overall_positive']}")
    print(f"Negative: {sent['overall_negative']}")
    print(f"Neutral: {sent['overall_neutral']}")
    print(f"Sentence breakdowns: {len(sent['sentence_breakdown'])}")
    for sb in sent["sentence_breakdown"][:5]:
        print(f"  [{sb['label']:15s}] {sb['text'][:80]}...")

print("\n=== TEST COMPLETE ===")

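The polling loop in test_api.py stops after 30 attempts whether or not the task reached a terminal state, so a slow job is reported with whatever status came back last. A minimal sketch of one way to make that explicit is below. The helper name `poll_until_done`, its parameters, and the `TERMINAL_STATES` constant are hypothetical and not part of the Alldocex code; only the `"completed"` and `"error"` status values come from the script above.

```python
import time

# Status values the test script treats as final.
TERMINAL_STATES = ("completed", "error")


def poll_until_done(fetch_status, max_attempts=30, delay_s=1.0, sleep=time.sleep):
    """Call fetch_status() until it reports a terminal status.

    fetch_status is any zero-argument callable returning a dict with a
    "status" key (hypothetical interface, chosen for illustration).
    Raises TimeoutError if no terminal status is seen in max_attempts polls.
    """
    last = {}
    for _ in range(max_attempts):
        sleep(delay_s)  # wait before each poll, as the script does
        last = fetch_status()
        if last.get("status") in TERMINAL_STATES:
            return last
    raise TimeoutError(
        f"no terminal status after {max_attempts} polls; "
        f"last status was {last.get('status')!r}"
    )
```

With `requests`, the callable could be something like `lambda: requests.get(f"{BASE_URL}/api/status/{task_id}", timeout=10).json()`. Passing `sleep` as a parameter keeps the helper testable without real delays.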
test_simple.py
ADDED
|
@@ -0,0 +1,49 @@
"""Simple API test - writes results to file."""
import requests, time, json

BASE = "http://localhost:8000"
out = []

with open("test_document.docx", "rb") as f:
    res = requests.post(f"{BASE}/api/upload", files={"file": ("test_document.docx", f)})
data = res.json()
task_id = data["file_id"]
out.append(f"Upload: {data['status']} (ID: {task_id})")

for i in range(30):
    time.sleep(1)
    res = requests.get(f"{BASE}/api/status/{task_id}")
    result = res.json()
    if result["status"] in ("completed", "error"):
        break

out.append(f"Status: {result['status']}")
out.append(f"Time: {round(result.get('processing_time_ms', 0))}ms")

if result.get("extraction"):
    e = result["extraction"]
    out.append(f"\nEXTRACTION: success={e['success']}, words={e['metadata']['word_count']}")
    out.append(f"Text preview: {e['raw_text'][:200]}...")

if result.get("summary"):
    s = result["summary"]
    out.append(f"\nSUMMARY ({s['algorithm']}): compression={round((1-s['compression_ratio'])*100)}%")
    out.append(s["summary"][:400])

if result.get("entities"):
    ent = result["entities"]
    out.append(f"\nENTITIES: {ent['total_entities']} found")
    out.append(f"Categories: {json.dumps(ent['entity_counts'])}")
    for e in ent["entities"][:15]:
        out.append(f"  [{e['label']}] {e['text']} (x{e['count']})")

if result.get("sentiment"):
    s = result["sentiment"]
    out.append(f"\nSENTIMENT: {s['overall_label']} (compound={s['overall_compound']})")
    out.append(f"Pos={s['overall_positive']} Neg={s['overall_negative']} Neu={s['overall_neutral']}")

out.append("\nDONE")
text = "\n".join(out)
with open("test_output.txt", "w", encoding="utf-8") as f:
    f.write(text)
print(text)
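Both test scripts print a compression figure as `(1 - compression_ratio) * 100`. The sketch below spells out that arithmetic under the assumption that `compression_ratio` is the summary length divided by the original length; the diff shown here does not confirm how the API computes it. The function name `compression_percent` is hypothetical.

```python
def compression_percent(original_length, summary_length, ndigits=1):
    """Return how much shorter the summary is, as a percentage.

    Assumes compression_ratio = summary_length / original_length,
    which is an assumption about the API, not something the scripts state.
    """
    if original_length <= 0:
        raise ValueError("original_length must be positive")
    ratio = summary_length / original_length
    # Same expression the test scripts apply to the reported ratio.
    return round((1 - ratio) * 100, ndigits)
```

Under that assumption, a 250-unit summary of a 1000-unit text gives a ratio of 0.25 and a reported compression of 75%.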