---
title: Text Extraction Api
emoji: "π"
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
---
# Alldocex: Intelligent Document Processing System


**Alldocex** is a document intelligence platform that extracts, analyzes, and summarizes content from PDF, DOCX, and image files, using Gemini 1.5 Flash with local OCR engines as a fallback.
## Key Features
* **Multi-Format Extraction**: Supports PDF, DOCX, and high-resolution images (PNG, JPG, TIFF, etc.).
* **Gemini AI-Powered Extraction**: Integrates **Gemini 1.5 Flash** for high-precision, layout-aware OCR and structured data extraction.
* **Structured AI Analysis**: Generates clean, structured output combining high-level key points and explicitly extracted details (names, phone numbers, contact info).
* **Extractive Summarization**: Condenses long documents into bulleted top highlights.
* **Named Entity Recognition (NER)** & **Sentiment Analysis**: Detailed semantic NLP via **spaCy** and **VADER**.
* **Robust Fallback Mechanisms**: Deep scan OCR recovery using **EasyOCR** and **Tesseract** locally when AI processing fails or hits quota limits.
* **Readable Document Typography**: Uses **Marked.js** to render the Markdown output in the browser, giving consistent text alignment and human-readable formatting.
* **Web URL Summarization**: Paste any web link to instantly extract and analyze its core content.
* **Downloadable & Exportable Results**: Export raw structured summaries and text as clean `.txt` files.
* **Corporate UI**: A premium Blue & White dashboard with smooth user flows and dynamic interactions.
* **Cloud Ready**: Specifically tailored and tested for automated deployment to **Hugging Face Spaces**.
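The fallback behavior listed under Robust Fallback Mechanisms can be pictured as an ordered chain of extractors. The following is a minimal sketch of that idea, not the project's actual code: `extract_with_fallback`, `fake_gemini`, and `fake_easyocr` are hypothetical names, and the stand-in extractors only simulate a quota failure and a successful local recovery.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical type: an extractor takes raw file bytes and returns text.
Extractor = Callable[[bytes], str]


def extract_with_fallback(
    data: bytes, extractors: Iterable[Tuple[str, Extractor]]
) -> Tuple[str, str]:
    """Try each (name, extractor) pair in order.

    Returns (engine_name, text) from the first extractor that yields
    non-empty text; raises RuntimeError if every extractor fails.
    """
    errors = []
    for name, extractor in extractors:
        try:
            text = extractor(data)
        except Exception as exc:  # e.g. quota exceeded, engine not installed
            errors.append(f"{name}: {exc}")
            continue
        if text and text.strip():
            return name, text
        errors.append(f"{name}: empty result")
    raise RuntimeError("all extractors failed: " + "; ".join(errors))


# Stand-ins for illustration only; real extractors would call
# Gemini, EasyOCR, or Tesseract.
def fake_gemini(data: bytes) -> str:
    raise RuntimeError("quota exceeded")


def fake_easyocr(data: bytes) -> str:
    return "recovered text"
```

Calling `extract_with_fallback(b"...", [("gemini", fake_gemini), ("easyocr", fake_easyocr)])` skips the failing first engine and returns the second engine's name and text.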
## Technology Stack
* **Backend**: [FastAPI](https://fastapi.tiangolo.com/) (Async Python)
* **AI Engine**: [Google Gemini API](https://aistudio.google.com/) (Gemini 1.5 Flash)
* **OCR & Layout Recovery**: [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/), [EasyOCR](https://github.com/JaidedAI/EasyOCR), & [Tesseract](https://github.com/tesseract-ocr/tesseract)
* **NLP Processing**: [spaCy](https://spacy.io/) & [Sumy](https://github.com/miso-belica/sumy)
* **Frontend**: Vanilla HTML5, CSS3, ES6 JavaScript, and [Marked.js](https://marked.js.org/) for rendering.
## Installation & Setup
### 1. Clone the repository
```bash
git clone <your-repo-url>
cd <repo-folder>
```
### 2. Environment Variables
Create a `.env` file in the root directory and add your Google Gemini API key plus the deployment API access key:
```env
GEMINI_API_KEY=your_gemini_api_key_here
API_ACCESS_KEY=your_deployment_api_key_here
```
The deployed API expects a valid key in one of these headers:
- `x-api-key: your_deployment_api_key_here`
- `Authorization: Bearer your_deployment_api_key_here`
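For readers who want to see how a server might accept either header, here is a minimal, framework-free sketch. It is an assumption about the behavior, not the project's implementation: `extract_api_key` and `is_authorized` are hypothetical helpers, and giving `x-api-key` priority when both headers are present is an arbitrary choice.

```python
import hmac
from typing import Mapping, Optional


def extract_api_key(headers: Mapping[str, str]) -> Optional[str]:
    """Return the key from `x-api-key` or `Authorization: Bearer <key>`, else None."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    key = lowered.get("x-api-key")
    if key and key.strip():
        return key.strip()
    auth = lowered.get("authorization", "")
    scheme, _, token = auth.partition(" ")
    if scheme.lower() == "bearer" and token.strip():
        return token.strip()
    return None


def is_authorized(headers: Mapping[str, str], expected_key: str) -> bool:
    """Compare the supplied key with the expected one in constant time."""
    supplied = extract_api_key(headers)
    if supplied is None or not expected_key:
        return False
    return hmac.compare_digest(supplied.encode(), expected_key.encode())
```

`hmac.compare_digest` is used instead of `==` so that the comparison time does not leak how many leading characters of the key matched.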
### 3. Install dependencies
```bash
pip install -r requirements.txt
```
### 4. Install NLP model & OS Dependencies (if missing)
```bash
python -m spacy download en_core_web_sm
# Note: Tesseract OCR must be installed on your system's OS layer.
```
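Since Tesseract is a system-level dependency, a quick check can confirm whether it is visible before starting the server. This is a small sketch using only the standard library; `find_tesseract` is a hypothetical helper name, and it assumes the executable is called `tesseract` and is expected on `PATH`.

```python
import shutil
from typing import Optional


def find_tesseract() -> Optional[str]:
    """Return the path to the `tesseract` executable on PATH, or None if absent."""
    return shutil.which("tesseract")


if __name__ == "__main__":
    path = find_tesseract()
    print(f"tesseract found at {path}" if path else "tesseract not found on PATH")
```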
## Getting Started
1. Start the backend server:
```bash
python main.py
```
2. Open your browser and navigate to the indicated localhost address (e.g., `http://localhost:7860`).
## API Endpoints
The deployment exposes these authenticated API endpoints:
- `POST /api/upload`
  - Upload a document file and start processing.
  - Content type: `multipart/form-data`
  - Header: `x-api-key` or `Authorization: Bearer <key>`
- `POST /api/extract/url`
  - Send a JSON payload with a URL to extract content.
  - Example body: `{ "url": "https://example.com/article" }`
- `GET /api/status/{task_id}`
  - Poll task status and receive extracted text, summary, entities, and sentiment.
- `GET /api/download/{task_id}`
  - Download extracted text as a `.txt` file.
- `GET /api/health`
  - Check service health and dependency availability.
### Example curl calls
Upload a file:
```bash
curl -X POST "http://localhost:7860/api/upload" \
-H "x-api-key: your_deployment_api_key_here" \
-F "file=@/path/to/document.pdf"
```
Extract from a URL:
```bash
curl -X POST "http://localhost:7860/api/extract/url" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer your_deployment_api_key_here" \
-d '{"url": "https://example.com/article"}'
```
Check status:
```bash
curl -H "x-api-key: your_deployment_api_key_here" "http://localhost:7860/api/status/<task_id>"
```
Download text:
```bash
curl -H "x-api-key: your_deployment_api_key_here" "http://localhost:7860/api/download/<task_id>" -o output.txt
```
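The status endpoint is meant to be polled, so a client usually wraps it in a loop. The sketch below shows one way to do that in Python with only the standard library. Everything about the response shape is an assumption: the README does not document the JSON schema, so the `"status"` field and its `"pending"` / `"processing"` values are placeholders to adjust, and `fetch_status` and `wait_for_result` are hypothetical helper names.

```python
import json
import time
import urllib.request
from typing import Callable, Dict

BASE_URL = "http://localhost:7860"  # local address used in the curl examples


def fetch_status(task_id: str, api_key: str) -> Dict:
    """GET /api/status/{task_id} and decode the JSON body."""
    request = urllib.request.Request(
        f"{BASE_URL}/api/status/{task_id}",
        headers={"x-api-key": api_key},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_for_result(
    task_id: str,
    fetch: Callable[[str], Dict],
    interval: float = 2.0,
    max_attempts: int = 30,
) -> Dict:
    """Poll until the task leaves the in-progress states or attempts run out.

    ASSUMPTION: the payload has a "status" field whose in-progress values
    are "pending" or "processing"; adjust to the real schema.
    """
    for attempt in range(max_attempts):
        payload = fetch(task_id)
        if payload.get("status") not in ("pending", "processing"):
            return payload
        if attempt < max_attempts - 1:
            time.sleep(interval)
    raise TimeoutError(f"task {task_id} did not finish after {max_attempts} polls")
```

A typical call would be `wait_for_result(task_id, lambda t: fetch_status(t, "your_deployment_api_key_here"))`. Passing the fetch function in as a parameter keeps the polling logic testable without a running server.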
## Usage
1. **Direct Upload**: Drag and drop your PDFs or images into the dashboard.
2. **Format Selection**: Click on specific badges (PDF, PNG, JPG) to open a filtered file picker.
3. **URL Entry**: Paste a web link to summarize online articles instantly.
4. **Download**: Once processing is complete, use the **Download** button to save the extracted text.
## AI Tools Used
- **Gemini 1.5 Flash**: Primary AI model for high-precision OCR and structured data extraction.
- **spaCy (en_core_web_sm)**: Used for Named Entity Recognition (NER).
- **VADER**: Lexicon- and rule-based sentiment analysis, used alongside spaCy.
- **Sumy**: Library for extractive summarization of documents.
- **EasyOCR**: Fallback OCR engine for image processing.
- **Tesseract**: Additional OCR engine for text recovery.
- **PyMuPDF**: PDF parsing and layout analysis.