title: Text Extraction Api
emoji: π
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
Alldocex: Intelligent Document Processing System
Alldocex is a document intelligence platform that extracts, analyzes, and summarizes content from PDFs, DOCX files, images, and web pages, using Gemini 1.5 Flash with local OCR fallbacks.
Key Features
- Multi-Format Extraction: Supports PDF, DOCX, and high-resolution images (PNG, JPG, TIFF, etc.).
- Gemini AI-Powered Extraction: Integrates Gemini 1.5 Flash for high-precision, layout-aware OCR and structured data extraction.
- Structured AI Analysis: Generates clean, structured output combining high-level key points and explicitly extracted details (names, phone numbers, contact info).
- Extractive Summarization: Condenses long documents into bulleted top highlights.
- Named Entity Recognition (NER) & Sentiment Analysis: Detailed semantic NLP via spaCy and VADER.
- Robust Fallback Mechanisms: Deep scan OCR recovery using EasyOCR and Tesseract locally when AI processing fails or hits quota limits.
- Markdown Rendering: Uses Marked.js to render results as Markdown in the browser, giving consistent text alignment and human-readable formatting.
- Web URL Summarization: Paste any web link to instantly extract and analyze its core content.
- Downloadable & Exportable Results: Export raw structured summaries and text as clean .txt files.
- Corporate UI: A premium Blue & White dashboard with smooth user flows and dynamic interactions.
- Cloud Ready: Specifically tailored and tested for automated deployment to Hugging Face Spaces.
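The fallback behaviour listed under Robust Fallback Mechanisms can be pictured as an ordered chain of extractors. The sketch below is an assumption about the general shape, not Alldocex's actual code: `extract_with_fallback` and the `fake_*` functions are hypothetical names, and the real Gemini, EasyOCR, and Tesseract calls are replaced with plain callables so the example runs without those dependencies.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical type: an extractor takes raw file bytes and returns text.
Extractor = Callable[[bytes], str]


def extract_with_fallback(
    data: bytes, extractors: Iterable[Tuple[str, Extractor]]
) -> Tuple[str, str]:
    """Try each (name, extractor) pair in order.

    Returns (name, text) from the first extractor that succeeds.
    An extractor counts as failed if it raises or returns only whitespace.
    """
    errors = []
    for name, extractor in extractors:
        try:
            text = extractor(data)
        except Exception as exc:  # e.g. quota exceeded, engine not installed
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty result")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))


# Stand-ins for the real engines (assumptions for illustration only).
def fake_gemini(data: bytes) -> str:
    raise RuntimeError("quota exceeded")


def fake_easyocr(data: bytes) -> str:
    return ""  # pretend the engine found nothing


def fake_tesseract(data: bytes) -> str:
    return "Invoice 123"


engine, text = extract_with_fallback(
    b"...",
    [("gemini", fake_gemini), ("easyocr", fake_easyocr), ("tesseract", fake_tesseract)],
)
print(engine, text)
```

Keeping the chain as data (a list of named callables) makes the order easy to change and records which engine produced the final text.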
Technology Stack
- Backend: FastAPI (Async Python)
- AI Engine: Google Gemini API (Gemini 1.5 Flash)
- OCR & Layout Recovery: PyMuPDF, EasyOCR, & Tesseract
- NLP Processing: spaCy & Sumy
- Frontend: Vanilla HTML5, CSS3, ES6 JavaScript, and Marked.js for rendering.
Installation & Setup
1. Clone the repository
git clone <your-repo-url>
cd <repo-folder>
2. Environment Variables
Create a .env file in the root directory and add your Google Gemini API key plus the deployment API access key:
GEMINI_API_KEY=your_gemini_api_key_here
API_ACCESS_KEY=your_deployment_api_key_here
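As a minimal sketch of how a backend might check these two variables at startup: the function name `load_settings` is hypothetical, and the code assumes the `.env` values have already been loaded into the process environment (how Alldocex does this is not stated here).

```python
import os
from typing import Mapping

# The two variable names come from the .env example above.
REQUIRED_KEYS = ("GEMINI_API_KEY", "API_ACCESS_KEY")


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Return the required settings, or raise if any is missing or empty."""
    missing = [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {key: env[key].strip() for key in REQUIRED_KEYS}
```

Failing at startup with the names of the missing variables is usually easier to debug than a rejected request later on.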
The deployed API expects a valid key in one of these headers:
x-api-key: your_deployment_api_key_here
Authorization: Bearer your_deployment_api_key_here
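The check for either header form can be sketched in plain Python as follows. This is an illustration under assumptions, not the project's implementation: `extract_api_key` and `is_authorized` are hypothetical helpers, and giving `x-api-key` precedence when both headers are present is a choice made for this example.

```python
import hmac
from typing import Mapping, Optional


def extract_api_key(headers: Mapping[str, str]) -> Optional[str]:
    """Return the key from `x-api-key` or `Authorization: Bearer <key>`, else None."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    if lowered.get("x-api-key"):
        return lowered["x-api-key"]
    scheme, _, token = lowered.get("authorization", "").partition(" ")
    if scheme.lower() == "bearer" and token.strip():
        return token.strip()
    return None


def is_authorized(headers: Mapping[str, str], expected_key: str) -> bool:
    """Compare in constant time so response timing does not leak key contents."""
    supplied = extract_api_key(headers)
    if supplied is None:
        return False
    return hmac.compare_digest(supplied.encode(), expected_key.encode())
```

For example, `is_authorized({"Authorization": "Bearer abc"}, "abc")` returns `True`, while a `Basic` scheme or a missing header returns `False`.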
3. Install dependencies
pip install -r requirements.txt
4. Install NLP model & OS Dependencies (if missing)
python -m spacy download en_core_web_sm
# Note: Tesseract OCR must be installed on your system's OS layer.
Getting Started
- Start the backend server:
python main.py
- Open your browser and navigate to the indicated localhost address (e.g., http://localhost:7860).
API Endpoints
The deployment exposes these authenticated API endpoints:
- POST /api/upload: Upload a document file and start processing.
  - Content type: multipart/form-data
  - Header: x-api-key or Authorization: Bearer <key>
- POST /api/extract/url: Send a JSON payload with a URL to extract content.
  - Example body: { "url": "https://example.com/article" }
- GET /api/status/{task_id}: Poll task status and receive extracted text, summary, entities, and sentiment.
- GET /api/download/{task_id}: Download extracted text as a .txt file.
- GET /api/health: Check service health and dependency availability.
Example curl calls
Upload a file:
curl -X POST "http://localhost:7860/api/upload" \
-H "x-api-key: your_deployment_api_key_here" \
-F "file=@/path/to/document.pdf"
Extract from a URL:
curl -X POST "http://localhost:7860/api/extract/url" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your_deployment_api_key_here" \
-d '{"url": "https://example.com/article"}'
Check status:
curl -H "x-api-key: your_deployment_api_key_here" "http://localhost:7860/api/status/<task_id>"
Download text:
curl -H "x-api-key: your_deployment_api_key_here" "http://localhost:7860/api/download/<task_id>" -o output.txt
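Because processing is asynchronous, a client typically polls the status endpoint until the task finishes. The following Python sketch shows one way to do that with the standard library. The response schema is not documented here, so the `"status"` field and its `"completed"` and `"failed"` values are assumptions to adjust against the real payload; `fetch_status` and `wait_for_task` are hypothetical helper names.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:7860"  # same host as the curl examples


def fetch_status(task_id: str, api_key: str) -> dict:
    """GET /api/status/{task_id} and decode the JSON body."""
    request = urllib.request.Request(
        f"{BASE_URL}/api/status/{task_id}",
        headers={"x-api-key": api_key},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_task(
    task_id: str,
    api_key: str,
    fetch: Callable[[str, str], dict] = fetch_status,
    interval: float = 2.0,
    max_attempts: int = 30,
) -> dict:
    """Poll until the task reports a terminal state or attempts run out.

    ASSUMPTION: the payload has a "status" field that eventually becomes
    "completed" or "failed". Check the real response and adjust.
    """
    for attempt in range(max_attempts):
        payload = fetch(task_id, api_key)
        if payload.get("status") in {"completed", "failed"}:
            return payload
        if attempt < max_attempts - 1:
            time.sleep(interval)  # wait before the next poll
    raise TimeoutError(f"task {task_id} not finished after {max_attempts} polls")
```

The `fetch` parameter is injectable so the polling logic can be exercised without a running server.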
Usage
- Direct Upload: Drag and drop your PDFs or images into the dashboard.
- Format Selection: Click on specific badges (PDF, PNG, JPG) to open a filtered file picker.
- URL Entry: Paste a web link to summarize online articles instantly.
- Download: Once processing is complete, use the Download button to save the extracted text.
AI Tools Used
- Gemini 1.5 Flash: Primary AI model for high-precision OCR and structured data extraction.
- spaCy (en_core_web_sm): Used for Named Entity Recognition (NER).
- VADER: Rule-based sentiment analysis tool, used alongside spaCy.
- Sumy: Library for extractive summarization of documents.
- EasyOCR: Fallback OCR engine for image processing.
- Tesseract: Additional OCR engine for text recovery.
- PyMuPDF: PDF parsing and layout analysis.
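To illustrate what extractive summarization means in the list above, here is a simplified word-frequency sketch using only the standard library. It is a stand-in for explanation and is not Sumy's algorithm or API; the `summarize` function and the small `STOPWORDS` set are hypothetical.

```python
import re
from collections import Counter
from typing import List

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "for", "on", "with", "as", "by", "this", "that"}


def _tokens(text: str) -> List[str]:
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text: str, max_sentences: int = 3) -> List[str]:
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequency = Counter(_tokens(text))

    def score(sentence: str) -> float:
        tokens = _tokens(sentence)
        # Average frequency, so long sentences are not favoured automatically.
        return sum(frequency[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore document order
    return [sentences[i] for i in keep]
```

Extractive methods select existing sentences instead of generating new text, which is why the output can be shown directly as bulleted highlights.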