metadata
title: Text Extraction Api
emoji: 🚀
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false

Alldocex: Intelligent Document Processing System


Alldocex is a document intelligence platform that extracts, analyzes, and summarizes content from PDF, DOCX, and image files, as well as web pages. It uses Gemini 1.5 Flash as the primary extraction engine and falls back to local OCR when the AI service is unavailable.

🚀 Key Features

  • Multi-Format Extraction: Supports PDF, DOCX, and high-resolution images (PNG, JPG, TIFF, etc.).
  • Gemini AI-Powered Extraction: Integrates Gemini 1.5 Flash for high-precision, layout-aware OCR and structured data extraction.
  • Structured AI Analysis:
    • Generates clean, structured output combining high-level key points and explicitly extracted details (names, phone numbers, contact info).
    • Extractive Summarization: Condenses long documents into bulleted top highlights.
    • Named Entity Recognition (NER) & Sentiment Analysis: Detailed semantic NLP via spaCy and VADER.
  • Robust Fallback Mechanisms: Deep scan OCR recovery using EasyOCR and Tesseract locally when AI processing fails or hits quota limits.
  • Document Typography: Uses Marked.js to render the Markdown output in the browser, giving consistently aligned, human-readable formatting.
  • Web URL Summarization: Paste any web link to instantly extract and analyze its core content.
  • Downloadable & Exportable Results: Export raw structured summaries and text as clean .txt files.
  • Corporate UI: A premium Blue & White dashboard with smooth user flows and dynamic interactions.
  • Cloud Ready: Specifically tailored and tested for automated deployment to Hugging Face Spaces.
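
The fallback behaviour described above (Gemini first, then local OCR engines) can be pictured as an ordered chain of extractors. The following is a minimal sketch of that idea only, not the project's actual code: `extract_with_fallback` and the two stand-in engines are hypothetical names, and no real Gemini, EasyOCR, or Tesseract calls are made.

```python
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical extractor signature: takes raw file bytes, returns extracted text.
Extractor = Callable[[bytes], str]


def extract_with_fallback(
    data: bytes, extractors: Iterable[Tuple[str, Extractor]]
) -> Tuple[Optional[str], str]:
    """Try each engine in order and return (engine_name, text) from the first
    one that yields non-empty text, or (None, "") if every engine fails."""
    for name, extractor in extractors:
        try:
            text = extractor(data)
        except Exception:
            # Any failure (quota error, missing binary, ...) moves on to the
            # next engine in the chain.
            continue
        if text.strip():
            return name, text
    return None, ""


# Stand-in engines for illustration only; they do not call any real service.
def fake_gemini(data: bytes) -> str:
    raise RuntimeError("quota exceeded")


def fake_easyocr(data: bytes) -> str:
    return "Invoice #123"


engine, text = extract_with_fallback(
    b"...", [("gemini", fake_gemini), ("easyocr", fake_easyocr)]
)
print(engine, text)
```

Returning the engine name alongside the text lets a caller record which path produced the result, which is useful when diagnosing quota problems.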

🛠️ Technology Stack

📦 Installation & Setup

1. Clone the repository

git clone <your-repo-url>
cd <repo-folder>

2. Environment Variables

Create a .env file in the root directory and add your Google Gemini API key plus the deployment API access key:

GEMINI_API_KEY=your_gemini_api_key_here
API_ACCESS_KEY=your_deployment_api_key_here
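
A quick way to catch a missing key before starting the server is to check the environment up front. This is a small sketch under the assumption that both variables are exposed as ordinary environment variables at runtime; `missing_settings` is a hypothetical helper, not part of the project.

```python
import os
from typing import List, Mapping

# The two variable names come from the .env example above.
REQUIRED_VARS = ("GEMINI_API_KEY", "API_ACCESS_KEY")


def missing_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example with an explicit mapping instead of the real environment.
print(missing_settings({"GEMINI_API_KEY": "abc", "API_ACCESS_KEY": " "}))
```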

The deployed API expects a valid key in one of these headers:

  • x-api-key: your_deployment_api_key_here
  • Authorization: Bearer your_deployment_api_key_here
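
To illustrate how a server might accept either header, here is a minimal sketch. It is an assumption about the approach, not the project's implementation: `extract_api_key` and `is_authorized` are hypothetical helpers, and giving `x-api-key` precedence when both headers are present is an arbitrary choice made for the example.

```python
import hmac
from typing import Mapping, Optional


def extract_api_key(headers: Mapping[str, str]) -> Optional[str]:
    """Return the key from x-api-key or Authorization: Bearer, else None."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    key = lowered.get("x-api-key")
    if key:
        return key.strip()
    scheme, _, token = lowered.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and token.strip():
        return token.strip()
    return None


def is_authorized(headers: Mapping[str, str], expected_key: str) -> bool:
    """Compare the presented key to the expected one in constant time."""
    presented = extract_api_key(headers)
    if presented is None:
        return False
    return hmac.compare_digest(presented.encode(), expected_key.encode())


print(is_authorized({"Authorization": "Bearer secret"}, "secret"))
```

`hmac.compare_digest` is used instead of `==` so that the comparison time does not reveal how many leading characters of a guessed key were correct.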

3. Install dependencies

pip install -r requirements.txt

4. Install NLP model & OS Dependencies (if missing)

python -m spacy download en_core_web_sm
# Note: The Tesseract OCR binary is a system package, not a pip dependency; install it through your operating system's package manager.

πŸƒ Getting Started

  1. Start the backend server:
    python main.py
    
  2. Open your browser and navigate to the indicated localhost address (e.g., http://localhost:7860).

API Endpoints

The deployment exposes these authenticated API endpoints:

  • POST /api/upload

    • Upload a document file and start processing.
    • Content type: multipart/form-data
    • Header: x-api-key or Authorization: Bearer <key>
  • POST /api/extract/url

    • Send a JSON payload with a URL to extract content.
    • Example body: { "url": "https://example.com/article" }
  • GET /api/status/{task_id}

    • Poll task status and receive extracted text, summary, entities, and sentiment.
  • GET /api/download/{task_id}

    • Download extracted text as a .txt file.
  • GET /api/health

    • Check service health and dependency availability.

Example curl calls

Upload a file:

curl -X POST "http://localhost:7860/api/upload" \
  -H "x-api-key: your_deployment_api_key_here" \
  -F "file=@/path/to/document.pdf"

Extract from a URL:

curl -X POST "http://localhost:7860/api/extract/url" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your_deployment_api_key_here" \
  -d '{"url": "https://example.com/article"}'
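
For callers who prefer Python to curl, the same request can be assembled with the standard library. This is a sketch mirroring the curl call above; `build_extract_url_request` is a hypothetical helper, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_extract_url_request(
    base_url: str, api_key: str, target_url: str
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /api/extract/url."""
    body = json.dumps({"url": target_url}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/extract/url",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


req = build_extract_url_request(
    "http://localhost:7860",
    "your_deployment_api_key_here",
    "https://example.com/article",
)
print(req.get_method(), req.full_url)
# To actually send it against a running server: urllib.request.urlopen(req)
```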

Check status:

curl -H "x-api-key: your_deployment_api_key_here" "http://localhost:7860/api/status/<task_id>"

Download text:

curl -H "x-api-key: your_deployment_api_key_here" "http://localhost:7860/api/download/<task_id>" -o output.txt
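
Because processing is asynchronous, a client typically polls the status endpoint until the task finishes. The sketch below shows one way to structure that loop. The response schema is not documented here, so the `status` field and its `"processing"` / `"completed"` values are assumptions, and `poll_task` is a hypothetical helper; the HTTP call is passed in as a function so the loop itself stays independent of any client library.

```python
import time
from typing import Any, Callable, Dict


def poll_task(
    fetch_status: Callable[[], Dict[str, Any]],
    is_finished: Callable[[Dict[str, Any]], bool],
    interval_s: float = 2.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status until is_finished(payload) is true, then return the
    payload. Raise TimeoutError after max_attempts unsuccessful polls."""
    for attempt in range(max_attempts):
        payload = fetch_status()
        if is_finished(payload):
            return payload
        if attempt < max_attempts - 1:
            sleep(interval_s)
    raise TimeoutError(f"task not finished after {max_attempts} polls")


# Simulated responses standing in for GET /api/status/<task_id>.
# The "status" field and its values are assumed, not taken from the API.
responses = iter(
    [{"status": "processing"}, {"status": "completed", "text": "hello"}]
)
result = poll_task(
    fetch_status=lambda: next(responses),
    is_finished=lambda p: p.get("status") == "completed",
    interval_s=0,
    sleep=lambda seconds: None,
)
print(result["text"])
```

In real use, `fetch_status` would issue the authenticated GET request and decode the JSON body, and `is_finished` would be adjusted to whatever the service actually returns.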

📘 Usage

  1. Direct Upload: Drag and drop your PDFs or images into the dashboard.
  2. Format Selection: Click on specific badges (PDF, PNG, JPG) to open a filtered file picker.
  3. URL Entry: Paste a web link to summarize online articles instantly.
  4. Download: Once processing is complete, use the Download button to save the extracted text.

🤖 AI Tools Used

  • Gemini 1.5 Flash: Primary AI model for high-precision OCR and structured data extraction.
  • spaCy (en_core_web_sm): Used for Named Entity Recognition (NER).
  • VADER: Sentiment analysis tool integrated with spaCy.
  • Sumy: Library for extractive summarization of documents.
  • EasyOCR: Fallback OCR engine for image processing.
  • Tesseract: Additional OCR engine for text recovery.
  • PyMuPDF: PDF parsing and layout analysis.