Spaces: Krish-05/text-extraction-api

krishnachoudhary-hclguvi committed on Apr 4

Commit 38365d2 (unverified) · 1 Parent(s): 02b5142

Sync GitHub commit 5677d7c production and endpoint fixes

Files changed (5)

README.md +93 -43
config.py +10 -0
main.py +156 -24
models/schemas.py +5 -0
test_sync_api.py +57 -0

README.md CHANGED

@@ -1,85 +1,135 @@
----
-title: Text Extraction Api
-emoji: 🏃
-colorFrom: blue
-colorTo: indigo
-sdk: docker
-pinned: false
----
----
-title: Text Extraction Api
-emoji: ??
-colorFrom: blue
-colorTo: white
-sdk: docker
-pinned: false
----
-# Alldocex â€” Intelligent Document Processing System
 ![Version](https://img.shields.io/badge/version-1.1.0-blue)
 ![License](https://img.shields.io/badge/license-MIT-green)
 **Alldocex** is a high-performance, professional-grade document intelligence platform that extracts, analyzes, and summarizes content from various document formats using state-of-the-art AI.
-## ðŸš€ Key Features
 *   **Multi-Format Extraction**: Supports PDF, DOCX, and high-resolution images (PNG, JPG, TIFF, etc.).
-*   **Layout-Aware PDF Engine**: Uses advanced 'layout' mode to preserve columns, tables, and physical text positioning.
-*   **Intelligent OCR**: Powered by **EasyOCR** (Deep Learning based) for superior accuracy in scanned documents.
 *   **Web URL Summarization**: Paste any web link to instantly extract and analyze its core content.
-*   **AI Analysis Suite**:
-    *   **Extractive Summarization**: Condenses long documents into key highlights.
-    *   **Named Entity Recognition (NER)**: Detects People, Organizations, Dates, and more via **spaCy**.
-    *   **Sentiment Analysis**: Analyzes emotional tone using the **VADER** algorithm.
-*   **Downloadable Results**: Export extracted text as clean `.txt` files.
-*   **Corporate UI**: A professional Blue & White dashboard with smooth animations and intuitive navigation.
-## ðŸ› ï¸ Technology Stack
 *   **Backend**: [FastAPI](https://fastapi.tiangolo.com/) (Async Python)
-*   **PDF Processing**: [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) (Layout Mode)
-*   **OCR**: [EasyOCR](https://github.com/JaidedAI/EasyOCR) & [Tesseract](https://github.com/tesseract-ocr/tesseract)
-*   **NLP**: [spaCy](https://spacy.io/) & [Sumy](https://github.com/miso-belica/sumy)
-*   **Frontend**: Vanilla HTML5, CSS3 (Modern UI), and JavaScript (ES6+)
-## ðŸ“¦ Installation
 ### 1. Clone the repository
 ```bash
 git clone <your-repo-url>
-cd guvi-extraction
 ```
-### 2. Install dependencies
 ```bash
 pip install -r requirements.txt
 ```
-### 3. Install NLP model
 ```bash
 python -m spacy download en_core_web_sm
 ```
-## ðŸƒ Getting Started
 1.  Start the backend server:
     ```bash
     python main.py
     ```
-2.  Open your browser and navigate to:
-    `http://localhost:7860
-    `
-## ðŸ“˜ Usage
 1.  **Direct Upload**: Drag and drop your PDFs or images into the dashboard.
 2.  **Format Selection**: Click on specific badges (PDF, PNG, JPG) to open a filtered file picker.
 3.  **URL Entry**: Paste a web link to summarize online articles instantly.
 4.  **Download**: Once processing is complete, use the **Download** button to save the extracted text.
----

+# Alldocex — Intelligent Document Processing System
 ![Version](https://img.shields.io/badge/version-1.1.0-blue)
 ![License](https://img.shields.io/badge/license-MIT-green)
 **Alldocex** is a high-performance, professional-grade document intelligence platform that extracts, analyzes, and summarizes content from various document formats using state-of-the-art AI.
+## 🚀 Key Features
 *   **Multi-Format Extraction**: Supports PDF, DOCX, and high-resolution images (PNG, JPG, TIFF, etc.).
+*   **Gemini AI-Powered Extraction**: Integrates **Gemini 1.5 Flash** for high-precision, layout-aware OCR and structured data extraction.
+*   **Structured AI Analysis**:
+    *   Generates clean, structured output combining high-level key points and explicitly extracted details (names, phone numbers, contact info).
+    *   **Extractive Summarization**: Condenses long documents into bulleted top highlights.
+    *   **Named Entity Recognition (NER)** & **Sentiment Analysis**: Detailed semantic NLP via **spaCy** and **VADER**.
+*   **Robust Fallback Mechanisms**: Deep scan OCR recovery using **EasyOCR** and **Tesseract** locally when AI processing fails or hits quota limits.
+*   **Readable Document Rendering**: Renders extracted text as Markdown with **Marked.js** for consistent alignment and human-readable formatting.
 *   **Web URL Summarization**: Paste any web link to instantly extract and analyze its core content.
+*   **Downloadable & Exportable Results**: Export raw structured summaries and text as clean `.txt` files.
+*   **Corporate UI**: A premium Blue & White dashboard with smooth user flows and dynamic interactions.
+*   **Cloud Ready**: Specifically tailored and tested for automated deployment to **Hugging Face Spaces**.
+## 🛠️ Technology Stack
 *   **Backend**: [FastAPI](https://fastapi.tiangolo.com/) (Async Python)
+*   **AI Engine**: [Google Gemini API](https://aistudio.google.com/) (Gemini 1.5 Flash)
+*   **OCR & Layout Recovery**: [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/), [EasyOCR](https://github.com/JaidedAI/EasyOCR), & [Tesseract](https://github.com/tesseract-ocr/tesseract)
+*   **NLP Processing**: [spaCy](https://spacy.io/) & [Sumy](https://github.com/miso-belica/sumy)
+*   **Frontend**: Vanilla HTML5, CSS3, ES6 JavaScript, and [Marked.js](https://marked.js.org/) for rendering.
+## 📦 Installation & Setup
 ### 1. Clone the repository
 ```bash
 git clone <your-repo-url>
+cd <repo-folder>
 ```
+### 2. Environment Variables
+Create a `.env` file in the root directory and add your Google Gemini API key plus the deployment API access key:
+```env
+GEMINI_API_KEY=your_gemini_api_key_here
+API_ACCESS_KEY=your_deployment_api_key_here
+```
+The deployed API expects a valid key in one of these headers:
+- `x-api-key: your_deployment_api_key_here`
+- `Authorization: Bearer your_deployment_api_key_here`
+### 3. Install dependencies
 ```bash
 pip install -r requirements.txt
 ```
+### 4. Install NLP model & OS Dependencies (if missing)
 ```bash
 python -m spacy download en_core_web_sm
+# Note: the Tesseract OCR binary must be installed at the operating-system level.
 ```
+## 🏃 Getting Started
 1.  Start the backend server:
     ```bash
     python main.py
     ```
+2.  Open your browser and navigate to the indicated localhost address (e.g., `http://localhost:7860`).
+## API Endpoints
+The deployment exposes these authenticated API endpoints:
+- `POST /api/upload`
+  - Upload a document file and start processing.
+  - Content type: `multipart/form-data`
+  - Header: `x-api-key` or `Authorization: Bearer <key>`
+- `POST /api/extract/url`
+  - Send a JSON payload with a URL to extract content.
+  - Example body: `{ "url": "https://example.com/article" }`
+- `GET /api/status/{task_id}`
+  - Poll task status and receive extracted text, summary, entities, and sentiment.
+- `GET /api/download/{task_id}`
+  - Download extracted text as a `.txt` file.
+- `GET /api/health`
+  - Check service health and dependency availability.
+### Example curl calls
+Upload a file:
+```bash
+curl -X POST "http://localhost:7860/api/upload" \
+  -H "x-api-key: your_deployment_api_key_here" \
+  -F "file=@/path/to/document.pdf"
+```
+Extract from a URL:
+```bash
+curl -X POST "http://localhost:7860/api/extract/url" \
+  -H "Content-Type: application/json" \
+  -H "Authorization: Bearer your_deployment_api_key_here" \
+  -d '{"url": "https://example.com/article"}'
+```
+Check status:
+```bash
+curl -H "x-api-key: your_deployment_api_key_here" "http://localhost:7860/api/status/<task_id>"
+```
+Download text:
+```bash
+curl -H "x-api-key: your_deployment_api_key_here" "http://localhost:7860/api/download/<task_id>" -o output.txt
+```
+## 📘 Usage
 1.  **Direct Upload**: Drag and drop your PDFs or images into the dashboard.
 2.  **Format Selection**: Click on specific badges (PDF, PNG, JPG) to open a filtered file picker.
 3.  **URL Entry**: Paste a web link to summarize online articles instantly.
 4.  **Download**: Once processing is complete, use the **Download** button to save the extracted text.
+## 🤖 AI Tools Used
+- **Gemini 1.5 Flash**: Primary AI model for high-precision OCR and structured data extraction.
+- **spaCy (en_core_web_sm)**: Used for Named Entity Recognition (NER).
+- **VADER**: Rule-based sentiment analysis tool, used alongside spaCy.
+- **Sumy**: Library for extractive summarization of documents.
+- **EasyOCR**: Fallback OCR engine for image processing.
+- **Tesseract**: Additional OCR engine for text recovery.
+- **PyMuPDF**: PDF parsing and layout analysis.
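
Note on the API Endpoints section: the README describes an asynchronous flow (upload, poll `/api/status/{task_id}`, then download), and `main.py` in this same commit also adds `POST /api/v1/extract`, which the README list does not mention. The polling half of that flow might look roughly like the sketch below. It uses only the standard library; the terminal status strings and the helper names (`build_auth_headers`, `fetch_status`, `poll_until_done`) are assumptions for illustration, since the `TaskStatus` values are not visible in this diff.

```python
import json
import time
import urllib.request
from typing import Any, Callable, Dict

# Assumed terminal states; the real TaskStatus values are not shown in this diff.
TERMINAL_STATUSES = {"completed", "error"}


def build_auth_headers(api_key: str, use_bearer: bool = False) -> Dict[str, str]:
    """Return one of the two header styles the README documents."""
    if use_bearer:
        return {"Authorization": f"Bearer {api_key}"}
    return {"x-api-key": api_key}


def fetch_status(base_url: str, task_id: str, api_key: str) -> Dict[str, Any]:
    """GET /api/status/{task_id} and decode the JSON body."""
    request = urllib.request.Request(
        f"{base_url}/api/status/{task_id}", headers=build_auth_headers(api_key)
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def poll_until_done(
    get_status: Callable[[], Dict[str, Any]],
    interval_s: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call get_status until it reports a terminal state or attempts run out."""
    for _ in range(max_attempts):
        result = get_status()
        if str(result.get("status", "")).lower() in TERMINAL_STATUSES:
            return result
        sleep(interval_s)
    raise TimeoutError(f"task not finished after {max_attempts} attempts")
```

A caller would wire the two together with something like `poll_until_done(lambda: fetch_status("http://localhost:7860", task_id, key))`. Passing the status fetcher as a callable keeps the loop testable without a running server.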

config.py CHANGED

@@ -81,6 +81,16 @@ SENTIMENT_THRESHOLDS = {
 GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
 GEMINI_MODEL_NAME = os.getenv("GEMINI_MODEL", "gemini-2.5-flash")
 # Flag to check if Gemini is configured
 def is_gemini_available():
     return bool(GEMINI_API_KEY)

 GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
 GEMINI_MODEL_NAME = os.getenv("GEMINI_MODEL", "gemini-2.5-flash")
+# API access key for external clients
+API_ACCESS_KEY = (
+    os.getenv("API_ACCESS_KEY") or
+    os.getenv("VALID_API_KEY") or
+    os.getenv("API_KEY")
+)
+def is_api_key_valid(key: str) -> bool:
+    return bool(API_ACCESS_KEY and key and key.strip() == API_ACCESS_KEY)
 # Flag to check if Gemini is configured
 def is_gemini_available():
     return bool(GEMINI_API_KEY)
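
Note on `is_api_key_valid`: the added code reads the key from three environment variables in order and compares with `==`. One possible hardening, which is not part of this commit, is a constant-time comparison via `hmac.compare_digest`. The sketch below mirrors the fallback order from the diff; the function names `load_api_access_key` and `is_api_key_valid_ct` are hypothetical.

```python
import hmac
import os
from typing import Mapping, Optional

# Same lookup order as the diff: API_ACCESS_KEY, then VALID_API_KEY, then API_KEY.
ENV_NAMES = ("API_ACCESS_KEY", "VALID_API_KEY", "API_KEY")


def load_api_access_key(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the first non-empty key found, or None if none is configured."""
    for name in ENV_NAMES:
        value = env.get(name)
        if value:
            return value
    return None


def is_api_key_valid_ct(key: Optional[str], expected: Optional[str]) -> bool:
    """Compare a stripped client key to the expected key in constant time."""
    if not key or not expected:
        return False
    return hmac.compare_digest(key.strip().encode("utf-8"), expected.encode("utf-8"))
```

As in the original, an unset or empty configuration rejects every request rather than accepting all of them.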

main.py CHANGED

@@ -6,12 +6,65 @@ import os
 import uuid
 import time
 import asyncio
-from typing import Dict
-from fastapi import FastAPI, UploadFile, File, HTTPException
 from fastapi.staticfiles import StaticFiles
 from fastapi.responses import FileResponse, JSONResponse
 from fastapi.middleware.cors import CORSMiddleware
 from config import UPLOAD_DIR, STATIC_DIR, MAX_FILE_SIZE_BYTES, ALLOWED_EXTENSIONS
 from models.schemas import (
     UploadResponse, ProcessingResult, TaskStatus,
@@ -67,15 +120,29 @@ def _get_file_type(filename: str) -> str:
     return "unknown"
-def _process_document(file_path: str, file_type: str, task_id: str):
     """
-    Process a document: extract text, then run all analyzers.
-    This runs in a thread pool to avoid blocking the event loop.
     """
-    start_time = time.time()
-    task = tasks[task_id]
-    task.status = TaskStatus.PROCESSING
     try:
         # Step 1: Extract text based on file type
         if file_type == "pdf":
@@ -97,19 +164,21 @@ def _process_document(file_path: str, file_type: str, task_id: str):
             task.error_message = extraction.error_message or "No text could be extracted."
             task.processing_time_ms = (time.time() - start_time) * 1000
             return
         raw_text = extraction.raw_text
         # Intelligent Formatting Pass via Gemini
-        formatted_text = clean_format_text(raw_text)
-        if formatted_text == raw_text:
-            # Fallback cleanup for broken line breaks if Gemini was unavailable
-            import re
-            formatted_text = re.sub(r'(?<!\n)\n(?!\n)', ' ', formatted_text)
-            formatted_text = re.sub(r'[ \t]+', ' ', formatted_text)
-        extraction.raw_text = formatted_text.strip()
-        raw_text = extraction.raw_text
         # Step 2: Summarization
         try:
@@ -137,10 +206,22 @@ def _process_document(file_path: str, file_type: str, task_id: str):
         task.error_message = str(e)
         task.processing_time_ms = (time.time() - start_time) * 1000
     finally:
         # Clean up uploaded file
         try:
-            if os.path.exists(file_path):
                 os.remove(file_path)
         except Exception:
             pass
@@ -148,7 +229,7 @@ def _process_document(file_path: str, file_type: str, task_id: str):
 # --- API Endpoints ---
-@app.post("/api/upload", response_model=ProcessingResult)
 async def upload_and_process(file: UploadFile = File(...)):
     """
     Upload a document and start processing.
@@ -204,7 +285,58 @@ async def upload_and_process(file: UploadFile = File(...)):
     return task
-@app.post("/api/extract/url", response_model=ProcessingResult)
 async def extract_from_url(data: Dict[str, str]):
     """
     Extract content from a web URL and process it.
@@ -236,7 +368,7 @@ async def extract_from_url(data: Dict[str, str]):
     return task
-@app.get("/api/status/{task_id}")
 async def get_task_status(task_id: str):
     """Get the processing status and results for a task."""
     if task_id not in tasks:
@@ -244,7 +376,7 @@ async def get_task_status(task_id: str):
     return tasks[task_id]
-@app.get("/api/download/{task_id}")
 async def download_results(task_id: str):
     """Download the extracted text as a .txt file."""
     if task_id not in tasks:

 import uuid
 import time
 import asyncio
+from typing import Dict, Optional
+from fastapi import FastAPI, UploadFile, File, HTTPException, Depends, Header
 from fastapi.staticfiles import StaticFiles
 from fastapi.responses import FileResponse, JSONResponse
 from fastapi.middleware.cors import CORSMiddleware
+import ssl
+# --- CRITICAL: Setup NLP models BEFORE importing analyzers/extractors ---
+def _setup_nlp_models():
+    """Download NLTK and spaCy models on startup."""
+    print("=" * 60)
+    print("Initializing NLP models (this may take a few minutes)...")
+    print("=" * 60)
+    # Fix SSL for NLTK downloads
+    try:
+        if hasattr(ssl, '_create_unverified_context'):
+            ssl._create_default_https_context = ssl._create_unverified_context
+    except Exception:
+        pass
+    # Download NLTK data
+    try:
+        import nltk
+        print("[1/3] NLTK resources...", end=" ", flush=True)
+        nltk.download('wordnet', quiet=True)
+        nltk.download('punkt', quiet=True)
+        nltk.download('omw-1.4', quiet=True)
+        nltk.download('averaged_perceptron_tagger', quiet=True)
+        print("✓")
+    except Exception as e:
+        print(f"⚠ ({e})")
+    # Download spaCy model
+    try:
+        import spacy
+        print("[2/3] spaCy en_core_web_sm...", end=" ", flush=True)
+        try:
+            spacy.load('en_core_web_sm')
+            print("✓")
+        except OSError:
+            print("downloading...", end=" ", flush=True)
+            import subprocess
+            subprocess.run([sys.executable, "-m", "spacy", "download", "en_core_web_sm"], capture_output=True)
+            print("✓")
+    except Exception as e:
+        print(f"⚠ ({e})")
+    print("[3/3] App initialization...", end=" ", flush=True)
+    print("✓")
+    print("=" * 60)
+    print("NLP setup complete! App is ready.")
+    print("=" * 60 + "\n")
+# Setup models IMMEDIATELY
+import sys
+_setup_nlp_models()
+import config
 from config import UPLOAD_DIR, STATIC_DIR, MAX_FILE_SIZE_BYTES, ALLOWED_EXTENSIONS
 from models.schemas import (
     UploadResponse, ProcessingResult, TaskStatus,
     return "unknown"
+async def get_api_key(
+    x_api_key: Optional[str] = Header(None, alias="x-api-key"),
+    authorization: Optional[str] = Header(None, alias="Authorization"),
+) -> str:
+    """Validate incoming API key from header or bearer auth."""
+    token = x_api_key
+    if authorization:
+        bearer_prefix = "Bearer "
+        if authorization.startswith(bearer_prefix):
+            token = authorization[len(bearer_prefix) :].strip()
+        else:
+            token = authorization.strip()
+    if not token or not config.is_api_key_valid(token):
+        raise HTTPException(status_code=401, detail="Unauthorized. Invalid API key.")
+    return token
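
Note on `get_api_key`: the header precedence is easy to miss when reading the dependency. The sketch below extracts the same token-selection logic into a pure function (the name `resolve_token` is hypothetical) so the behavior can be checked without FastAPI. It is a restatement of the lines above, not a replacement for them.

```python
from typing import Optional

BEARER_PREFIX = "Bearer "


def resolve_token(x_api_key: Optional[str], authorization: Optional[str]) -> Optional[str]:
    """Pick the token the way the dependency above does.

    A non-empty Authorization header always overrides x-api-key. The
    "Bearer " prefix is matched case-sensitively; any other Authorization
    value is used as the token as-is after stripping whitespace.
    """
    token = x_api_key
    if authorization:
        if authorization.startswith(BEARER_PREFIX):
            token = authorization[len(BEARER_PREFIX):].strip()
        else:
            token = authorization.strip()
    return token or None


if __name__ == "__main__":
    print(resolve_token("key-a", "Bearer key-b"))
```

One consequence worth knowing: a request that sends a valid `x-api-key` together with an unrelated `Authorization` header is validated against the `Authorization` value.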
+def _perform_extraction_and_analysis(task: ProcessingResult, file_path: str, file_type: str, start_time: float):
     """
+    Common logic for document processing: extraction, summarization, NER, and sentiment.
     """
     try:
         # Step 1: Extract text based on file type
         if file_type == "pdf":
             task.error_message = extraction.error_message or "No text could be extracted."
             task.processing_time_ms = (time.time() - start_time) * 1000
             return
         raw_text = extraction.raw_text
         # Intelligent Formatting Pass via Gemini
+        try:
+            formatted_text = clean_format_text(raw_text)
+            if formatted_text == raw_text:
+                # Fallback cleanup for broken line breaks if Gemini was unavailable
+                import re
+                formatted_text = re.sub(r'(?<!\n)\n(?!\n)', ' ', formatted_text)
+                formatted_text = re.sub(r'[ \t]+', ' ', formatted_text)
+            extraction.raw_text = formatted_text.strip()
+            raw_text = extraction.raw_text
+        except Exception as e:
+            print(f"Text cleanup error: {e}")
         # Step 2: Summarization
         try:
         task.error_message = str(e)
         task.processing_time_ms = (time.time() - start_time) * 1000
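
Note on the fallback cleanup inside `_perform_extraction_and_analysis`: the two `re.sub` calls join hard-wrapped lines while keeping paragraph breaks. The small sketch below applies the same two patterns to a sample string; the wrapper name `fallback_cleanup` is hypothetical.

```python
import re


def fallback_cleanup(text: str) -> str:
    """Apply the same two substitutions as the fallback branch in the diff."""
    # A newline with no newline on either side is a soft wrap: turn it into a space.
    text = re.sub(r'(?<!\n)\n(?!\n)', ' ', text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r'[ \t]+', ' ', text)
    return text.strip()


if __name__ == "__main__":
    sample = "Hello\nworld.\n\nNew   para\tgraph"
    print(repr(fallback_cleanup(sample)))
```

Blank-line paragraph breaks (two or more consecutive newlines) are left untouched, because each of those newlines has another newline as a neighbor.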
+def _process_document(file_path: str, file_type: str, task_id: str):
+    """
+    Process a document: extract text, then run all analyzers.
+    This runs in a thread pool to avoid blocking the event loop.
+    """
+    start_time = time.time()
+    task = tasks[task_id]
+    task.status = TaskStatus.PROCESSING
+    try:
+        _perform_extraction_and_analysis(task, file_path, file_type, start_time)
     finally:
         # Clean up uploaded file
         try:
+            if os.path.exists(file_path) and file_type != "url":
                 os.remove(file_path)
         except Exception:
             pass
 # --- API Endpoints ---
+@app.post("/api/upload", response_model=ProcessingResult, dependencies=[Depends(get_api_key)])
 async def upload_and_process(file: UploadFile = File(...)):
     """
     Upload a document and start processing.
     return task
+@app.post("/api/v1/extract", response_model=ProcessingResult, dependencies=[Depends(get_api_key)])
+async def synchronous_extract(file: UploadFile = File(...)):
+    """
+    Synchronous extraction endpoint for API testers and bots.
+    Directly returns the extraction results.
+    """
+    # 1. Validation
+    filename = file.filename or "unknown"
+    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
+    if ext not in ALLOWED_EXTENSIONS:
+        raise HTTPException(status_code=400, detail=f"Unsupported file type: .{ext}")
+    content = await file.read()
+    if len(content) > MAX_FILE_SIZE_BYTES:
+        raise HTTPException(status_code=400, detail="File too large.")
+    if len(content) == 0:
+        raise HTTPException(status_code=400, detail="Empty file.")
+    # 2. Save temporary file
+    file_id = f"sync_{str(uuid.uuid4())[:8]}"
+    file_path = os.path.join(UPLOAD_DIR, f"{file_id}_{filename}")
+    with open(file_path, "wb") as f:
+        f.write(content)
+    # 3. Process
+    file_type = _get_file_type(filename)
+    start_time = time.time()
+    # Create the result object
+    task = ProcessingResult.create_pending(file_id=file_id, filename=filename, file_type=file_type)
+    # Run the blocking pipeline in the default thread pool and await its completion,
+    # so the event loop stays responsive while this request is being processed.
+    await asyncio.get_event_loop().run_in_executor(
+        None, _perform_extraction_and_analysis, task, file_path, file_type, start_time
+    )
+    # 4. Cleanup
+    try:
+        if os.path.exists(file_path):
+            os.remove(file_path)
+    except Exception:
+        pass
+    if task.status == TaskStatus.ERROR:
+        raise HTTPException(status_code=500, detail=task.error_message or "Processing failed.")
+    return task
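
Note on `synchronous_extract`: the endpoint offloads blocking work to the default executor and awaits the result. A minimal, framework-free sketch of that pattern follows. `blocking_work` is a hypothetical stand-in for `_perform_extraction_and_analysis`, and `asyncio.get_running_loop()` is used here as the usual way to obtain the loop from inside a coroutine.

```python
import asyncio
import time
from typing import Tuple


def blocking_work(label: str, delay_s: float) -> str:
    """Hypothetical stand-in for a CPU- or IO-bound extraction step."""
    time.sleep(delay_s)
    return f"{label} done"


async def run_blocking(label: str, delay_s: float) -> str:
    """Run blocking_work in the default thread pool and await its result."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_work, label, delay_s)


async def main() -> Tuple[str, int]:
    ticks = 0

    async def heartbeat() -> None:
        # Keeps running on the event loop while the worker thread sleeps.
        nonlocal ticks
        for _ in range(3):
            await asyncio.sleep(0.01)
            ticks += 1

    result, _ = await asyncio.gather(run_blocking("extract", 0.05), heartbeat())
    return result, ticks


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The request still takes as long as the extraction itself; the benefit is only that other requests on the same event loop are not stalled in the meantime.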
+@app.post("/api/extract/url", response_model=ProcessingResult, dependencies=[Depends(get_api_key)])
 async def extract_from_url(data: Dict[str, str]):
     """
     Extract content from a web URL and process it.
     return task
+@app.get("/api/status/{task_id}", dependencies=[Depends(get_api_key)])
 async def get_task_status(task_id: str):
     """Get the processing status and results for a task."""
     if task_id not in tasks:
     return tasks[task_id]
+@app.get("/api/download/{task_id}", dependencies=[Depends(get_api_key)])
 async def download_results(task_id: str):
     """Download the extracted text as a .txt file."""
     if task_id not in tasks:

models/schemas.py CHANGED

@@ -96,6 +96,7 @@ class SentimentResult(BaseModel):
 class ProcessingResult(BaseModel):
     file_id: str
     filename: str
     file_type: str
     status: TaskStatus
     extraction: Optional[ExtractionResult] = None
@@ -106,11 +107,15 @@ class ProcessingResult(BaseModel):
     error_message: Optional[str] = None
     timestamp: float = 0
     @staticmethod
     def create_pending(file_id: str, filename: str, file_type: str) -> "ProcessingResult":
         return ProcessingResult(
             file_id=file_id,
             filename=filename,
             file_type=file_type,
             status=TaskStatus.PENDING,
             timestamp=time.time(),

 class ProcessingResult(BaseModel):
     file_id: str
     filename: str
+    fileName: Optional[str] = None  # CamelCase for external testers
     file_type: str
     status: TaskStatus
     extraction: Optional[ExtractionResult] = None
     error_message: Optional[str] = None
     timestamp: float = 0
+    class Config:
+        allow_population_by_field_name = True
     @staticmethod
     def create_pending(file_id: str, filename: str, file_type: str) -> "ProcessingResult":
         return ProcessingResult(
             file_id=file_id,
             filename=filename,
+            fileName=filename,
             file_type=file_type,
             status=TaskStatus.PENDING,
             timestamp=time.time(),
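
Note on the `fileName` field: the commit stores the filename twice and fills both copies in `create_pending`, so any `ProcessingResult` built another way would carry `fileName=None`. Also, `allow_population_by_field_name` is a Pydantic v1 setting that concerns fields declared with an alias; no alias appears in the lines shown, so it may have no effect here. An alternative is to keep one stored field and derive the camelCase key when serializing. The sketch below illustrates that idea with a standard-library dataclass instead of Pydantic; `ResultStub` and `to_payload` are hypothetical names and only three fields are modeled.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class ResultStub:
    """Hypothetical stand-in for ProcessingResult (not the real model)."""

    file_id: str
    filename: str
    file_type: str

    def to_payload(self) -> Dict[str, Any]:
        """Serialize with both snake_case and camelCase filename keys."""
        payload = asdict(self)
        # The camelCase mirror is derived here, so the two can never disagree.
        payload["fileName"] = self.filename
        return payload


if __name__ == "__main__":
    print(ResultStub("sync_1234abcd", "report.pdf", "pdf").to_payload())
```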

test_sync_api.py ADDED

@@ -0,0 +1,57 @@

+import requests
+import json
+import sys
+import os
+BASE_URL = "http://127.0.0.1:7860"
+API_KEY = "alldocex-test-key-2024"
+def test_sync_extract(file_path):
+    print(f"Testing synchronous extraction for: {file_path}")
+    if not os.path.exists(file_path):
+        print(f"Error: File not found: {file_path}")
+        return
+    url = f"{BASE_URL}/api/v1/extract"
+    headers = {
+        "x-api-key": API_KEY
+    }
+    try:
+        # Open the file in a context manager so the handle is always closed.
+        with open(file_path, "rb") as fh:
+            files = {"file": (os.path.basename(file_path), fh, "application/octet-stream")}
+            response = requests.post(url, headers=headers, files=files)
+        print(f"Status Code: {response.status_code}")
+        if response.status_code == 200:
+            result = response.json()
+            print("\n--- RESULTS ---")
+            print(f"Filename: {result.get('filename')}")
+            print(f"Status: {result.get('status')}")
+            # Optional sections may be null in the JSON, so fall back to {} explicitly.
+            extraction = result.get('extraction') or {}
+            print(f"Extraction Success: {extraction.get('success')}")
+            text = extraction.get('raw_text') or ''
+            print(f"Full Text Length: {len(text)}")
+            print(f"Snippet: {text[:200]}...")
+            summary = (result.get('summary') or {}).get('summary') or ''
+            if summary:
+                print(f"Summary Snippet: {summary[:200]}...")
+            entities = (result.get('entities') or {}).get('total_entities', 0)
+            print(f"Total Entities Found: {entities}")
+            print("\n[SUCCESS] Synchronous endpoint working correctly.")
+        else:
+            print(f"Error Response: {response.text}")
+    except Exception as e:
+        print(f"Request failed: {e}")
+if __name__ == "__main__":
+    # Test with the existing sample document
+    sample_doc = "test_document.docx"
+    test_sync_extract(sample_doc)
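
Note on reading the response: `extraction`, `summary`, and `entities` are optional sections, and an optional field serialized as JSON `null` defeats `dict.get(key, {})`, because the key exists and its value is `None`. A small lookup helper, sketched below under the hypothetical name `dig`, handles missing keys and `null` values in the same way.

```python
from typing import Any


def dig(data: Any, *keys: str, default: Any = None) -> Any:
    """Walk nested dicts, returning default on a missing key or a None value."""
    current = data
    for key in keys:
        if not isinstance(current, dict):
            return default
        current = current.get(key)
    return default if current is None else current


if __name__ == "__main__":
    response_json = {"status": "completed", "extraction": None}
    print(repr(dig(response_json, "extraction", "raw_text", default="")))
```

Falsy but valid values such as `0` or an empty string are returned unchanged; only `None` triggers the default.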