krishnachoudhary-hclguvi committed on
Commit 38365d2 · unverified · 1 Parent(s): 02b5142

Sync GitHub commit 5677d7c production and endpoint fixes

Files changed (5)
  1. README.md +93 -43
  2. config.py +10 -0
  3. main.py +156 -24
  4. models/schemas.py +5 -0
  5. test_sync_api.py +57 -0
README.md CHANGED
@@ -1,85 +1,135 @@
1
- ---
2
- title: Text Extraction Api
3
- emoji: 🏃
4
- colorFrom: blue
5
- colorTo: indigo
6
- sdk: docker
7
- pinned: false
8
- ---
9
-
10
- ---
11
- title: Text Extraction Api
12
- emoji: ??
13
- colorFrom: blue
14
- colorTo: white
15
- sdk: docker
16
- pinned: false
17
- ---
18
- # Alldocex — Intelligent Document Processing System
19
 
20
  ![Version](https://img.shields.io/badge/version-1.1.0-blue)
21
  ![License](https://img.shields.io/badge/license-MIT-green)
22
 
23
  **Alldocex** is a high-performance, professional-grade document intelligence platform that extracts, analyzes, and summarizes content from various document formats using state-of-the-art AI.
24
 
25
- ## 🚀 Key Features
26
 
27
  * **Multi-Format Extraction**: Supports PDF, DOCX, and high-resolution images (PNG, JPG, TIFF, etc.).
28
- * **Layout-Aware PDF Engine**: Uses advanced 'layout' mode to preserve columns, tables, and physical text positioning.
29
- * **Intelligent OCR**: Powered by **EasyOCR** (Deep Learning based) for superior accuracy in scanned documents.
 
 
 
 
 
30
  * **Web URL Summarization**: Paste any web link to instantly extract and analyze its core content.
31
- * **AI Analysis Suite**:
32
- * **Extractive Summarization**: Condenses long documents into key highlights.
33
- * **Named Entity Recognition (NER)**: Detects People, Organizations, Dates, and more via **spaCy**.
34
- * **Sentiment Analysis**: Analyzes emotional tone using the **VADER** algorithm.
35
- * **Downloadable Results**: Export extracted text as clean `.txt` files.
36
- * **Corporate UI**: A professional Blue & White dashboard with smooth animations and intuitive navigation.
37
 
38
- ## 🛠️ Technology Stack
39
 
40
  * **Backend**: [FastAPI](https://fastapi.tiangolo.com/) (Async Python)
41
- * **PDF Processing**: [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/) (Layout Mode)
42
- * **OCR**: [EasyOCR](https://github.com/JaidedAI/EasyOCR) & [Tesseract](https://github.com/tesseract-ocr/tesseract)
43
- * **NLP**: [spaCy](https://spacy.io/) & [Sumy](https://github.com/miso-belica/sumy)
44
- * **Frontend**: Vanilla HTML5, CSS3 (Modern UI), and JavaScript (ES6+)
45
 
46
- ## 📦 Installation
47
 
48
  ### 1. Clone the repository
49
  ```bash
50
  git clone <your-repo-url>
51
- cd guvi-extraction
52
  ```
53
 
54
- ### 2. Install dependencies
55
  ```bash
56
  pip install -r requirements.txt
57
  ```
58
 
59
- ### 3. Install NLP model
60
  ```bash
61
  python -m spacy download en_core_web_sm
 
62
  ```
63
 
64
- ## 🏃 Getting Started
65
 
66
  1. Start the backend server:
67
  ```bash
68
  python main.py
69
  ```
70
- 2. Open your browser and navigate to:
71
- `http://localhost:7860
72
- `
73
 
74
- ## 📘 Usage
75
 
76
  1. **Direct Upload**: Drag and drop your PDFs or images into the dashboard.
77
  2. **Format Selection**: Click on specific badges (PDF, PNG, JPG) to open a filtered file picker.
78
  3. **URL Entry**: Paste a web link to summarize online articles instantly.
79
  4. **Download**: Once processing is complete, use the **Download** button to save the extracted text.
80
 
 
81
 
82
- ---
83
-
 
 
 
 
 
84
 
85
 
 
1
+ # Alldocex — Intelligent Document Processing System
2
 
3
  ![Version](https://img.shields.io/badge/version-1.1.0-blue)
4
  ![License](https://img.shields.io/badge/license-MIT-green)
5
 
6
  **Alldocex** is a high-performance, professional-grade document intelligence platform that extracts, analyzes, and summarizes content from various document formats using state-of-the-art AI.
7
 
8
+ ## 🚀 Key Features
9
 
10
  * **Multi-Format Extraction**: Supports PDF, DOCX, and high-resolution images (PNG, JPG, TIFF, etc.).
11
+ * **Gemini AI-Powered Extraction**: Integrates **Gemini Flash** (default model `gemini-2.5-flash`, per `config.py`) for high-precision, layout-aware OCR and structured data extraction.
12
+ * **Structured AI Analysis**:
13
+ * Generates clean, structured output combining high-level key points and explicitly extracted details (names, phone numbers, contact info).
14
+ * **Extractive Summarization**: Condenses long documents into bulleted top highlights.
15
+ * **Named Entity Recognition (NER)** & **Sentiment Analysis**: Detailed semantic NLP via **spaCy** and **VADER**.
16
+ * **Robust Fallback Mechanisms**: Deep scan OCR recovery using **EasyOCR** and **Tesseract** locally when AI processing fails or hits quota limits.
17
+ * **Readable Document Typography**: Uses **Marked.js** to render Markdown output with consistent alignment and human-readable formatting.
18
  * **Web URL Summarization**: Paste any web link to instantly extract and analyze its core content.
19
+ * **Downloadable & Exportable Results**: Export raw structured summaries and text as clean `.txt` files.
20
+ * **Corporate UI**: A premium Blue & White dashboard with smooth user flows and dynamic interactions.
21
+ * **Cloud Ready**: Specifically tailored and tested for automated deployment to **Hugging Face Spaces**.
 
 
 
22
 
23
+ ## 🛠️ Technology Stack
24
 
25
  * **Backend**: [FastAPI](https://fastapi.tiangolo.com/) (Async Python)
26
+ * **AI Engine**: [Google Gemini API](https://aistudio.google.com/) (Gemini Flash; `config.py` defaults to `gemini-2.5-flash`, overridable via `GEMINI_MODEL`)
27
+ * **OCR & Layout Recovery**: [PyMuPDF](https://pymupdf.readthedocs.io/en/latest/), [EasyOCR](https://github.com/JaidedAI/EasyOCR), & [Tesseract](https://github.com/tesseract-ocr/tesseract)
28
+ * **NLP Processing**: [spaCy](https://spacy.io/) & [Sumy](https://github.com/miso-belica/sumy)
29
+ * **Frontend**: Vanilla HTML5, CSS3, ES6 JavaScript, and [Marked.js](https://marked.js.org/) for rendering.
30
 
31
+ ## 📦 Installation & Setup
32
 
33
  ### 1. Clone the repository
34
  ```bash
35
  git clone <your-repo-url>
36
+ cd <repo-folder>
37
  ```
38
 
39
+ ### 2. Environment Variables
40
+ Create a `.env` file in the root directory and add your Google Gemini API key plus the deployment API access key:
41
+ ```env
42
+ GEMINI_API_KEY=your_gemini_api_key_here
43
+ API_ACCESS_KEY=your_deployment_api_key_here
44
+ ```
45
+
46
+ The deployed API expects a valid key in one of these headers:
47
+ - `x-api-key: your_deployment_api_key_here`
48
+ - `Authorization: Bearer your_deployment_api_key_here`
49
+
50
+ ### 3. Install dependencies
51
  ```bash
52
  pip install -r requirements.txt
53
  ```
54
 
55
+ ### 4. Install NLP model & OS Dependencies (if missing)
56
  ```bash
57
  python -m spacy download en_core_web_sm
58
+ # Note: Tesseract OCR must be installed on your system's OS layer.
59
  ```
60
 
61
+ ## 🏃 Getting Started
62
 
63
  1. Start the backend server:
64
  ```bash
65
  python main.py
66
  ```
67
+ 2. Open your browser and navigate to the indicated localhost address (e.g., `http://localhost:7860`).
68
+
69
+ ## API Endpoints
70
+
71
+ The deployment exposes these authenticated API endpoints:
72
+
73
+ - `POST /api/upload`
74
+ - Upload a document file and start processing.
75
+ - Content type: `multipart/form-data`
76
+ - Header: `x-api-key` or `Authorization: Bearer <key>`
77
+
78
+ - `POST /api/extract/url`
79
+ - Send a JSON payload with a URL to extract content.
80
+ - Example body: `{ "url": "https://example.com/article" }`
81
+
82
+ - `GET /api/status/{task_id}`
83
+ - Poll task status and receive extracted text, summary, entities, and sentiment.
84
+
85
+ - `GET /api/download/{task_id}`
86
+ - Download extracted text as a `.txt` file.
87
+
88
+ - `GET /api/health`
89
+ - Check service health and dependency availability.
90
+
91
+ ### Example curl calls
92
+
93
+ Upload a file:
94
+ ```bash
95
+ curl -X POST "http://localhost:7860/api/upload" \
96
+ -H "x-api-key: your_deployment_api_key_here" \
97
+ -F "file=@/path/to/document.pdf"
98
+ ```
99
+
100
+ Extract from a URL:
101
+ ```bash
102
+ curl -X POST "http://localhost:7860/api/extract/url" \
103
+ -H "Content-Type: application/json" \
104
+ -H "Authorization: Bearer your_deployment_api_key_here" \
105
+ -d '{"url": "https://example.com/article"}'
106
+ ```
107
+
108
+ Check status:
109
+ ```bash
110
+ curl -H "x-api-key: your_deployment_api_key_here" "http://localhost:7860/api/status/<task_id>"
111
+ ```
112
+
113
+ Download text:
114
+ ```bash
115
+ curl -H "x-api-key: your_deployment_api_key_here" "http://localhost:7860/api/download/<task_id>" -o output.txt
116
+ ```
117
 
118
+ ## 📘 Usage
119
 
120
  1. **Direct Upload**: Drag and drop your PDFs or images into the dashboard.
121
  2. **Format Selection**: Click on specific badges (PDF, PNG, JPG) to open a filtered file picker.
122
  3. **URL Entry**: Paste a web link to summarize online articles instantly.
123
  4. **Download**: Once processing is complete, use the **Download** button to save the extracted text.
124
 
125
+ ## 🤖 AI Tools Used
126
 
127
+ - **Gemini Flash** (default `gemini-2.5-flash` in `config.py`): Primary AI model for high-precision OCR and structured data extraction.
128
+ - **spaCy (en_core_web_sm)**: Used for Named Entity Recognition (NER).
129
+ - **VADER**: Lexicon- and rule-based sentiment analysis tool, used alongside spaCy.
130
+ - **Sumy**: Library for extractive summarization of documents.
131
+ - **EasyOCR**: Fallback OCR engine for image processing.
132
+ - **Tesseract**: Additional OCR engine for text recovery.
133
+ - **PyMuPDF**: PDF parsing and layout analysis.
134
 
135
 
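The README's endpoint list describes an upload-then-poll flow with two accepted header forms. A minimal client-side sketch of that flow follows. It assumes the status payload carries a `status` field whose in-progress values are `pending` and `processing`; the actual enum strings are not shown in this diff. `auth_headers` and `poll_until_done` are hypothetical helper names, and the HTTP call is injected as a callable so the sketch does not depend on a particular client library.

```python
import time
from typing import Callable, Dict, Tuple


def auth_headers(api_key: str, scheme: str = "x-api-key") -> Dict[str, str]:
    """Build one of the two header forms listed in the README."""
    if scheme == "x-api-key":
        return {"x-api-key": api_key}
    if scheme == "bearer":
        return {"Authorization": f"Bearer {api_key}"}
    raise ValueError(f"unknown scheme: {scheme}")


def poll_until_done(
    fetch_status: Callable[[], dict],
    in_progress: Tuple[str, ...] = ("pending", "processing"),  # assumed values
    interval_s: float = 1.0,
    max_attempts: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() until the task leaves the in-progress states."""
    for _ in range(max_attempts):
        result = fetch_status()
        if str(result.get("status", "")).lower() not in in_progress:
            return result
        sleep(interval_s)
    raise TimeoutError("task did not finish within the polling budget")
```

In practice `fetch_status` could wrap a `GET /api/status/{task_id}` request that sends `auth_headers(key)`; bounding the number of attempts keeps a stuck task from blocking the caller indefinitely.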
config.py CHANGED
@@ -81,6 +81,16 @@ SENTIMENT_THRESHOLDS = {
81
  GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
82
  GEMINI_MODEL_NAME = os.getenv("GEMINI_MODEL", "gemini-2.5-flash")
83
 
84
  # Flag to check if Gemini is configured
85
  def is_gemini_available():
86
  return bool(GEMINI_API_KEY)
 
81
  GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
82
  GEMINI_MODEL_NAME = os.getenv("GEMINI_MODEL", "gemini-2.5-flash")
83
 
84
+ # API access key for external clients
85
+ API_ACCESS_KEY = (
86
+ os.getenv("API_ACCESS_KEY") or
87
+ os.getenv("VALID_API_KEY") or
88
+ os.getenv("API_KEY")
89
+ )
90
+
91
+ def is_api_key_valid(key: str) -> bool:
92
+ return bool(API_ACCESS_KEY and key and key.strip() == API_ACCESS_KEY)
93
+
94
  # Flag to check if Gemini is configured
95
  def is_gemini_available():
96
  return bool(GEMINI_API_KEY)
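`is_api_key_valid` compares the supplied key with `==`. One possible hardening, which is not part of this commit, is a constant-time comparison using the standard library's `hmac.compare_digest`. The sketch below is only an illustration: `keys_match` is a hypothetical name, and the expected key is passed in explicitly rather than read from `API_ACCESS_KEY`.

```python
import hmac
from typing import Optional


def keys_match(candidate: Optional[str], expected: Optional[str]) -> bool:
    """Return True only if both keys are set and equal after trimming the candidate."""
    if not candidate or not expected:
        return False
    # Encode to bytes so non-ASCII input is compared instead of raising TypeError.
    return hmac.compare_digest(
        candidate.strip().encode("utf-8"),
        expected.encode("utf-8"),
    )
```

As in the committed function, a missing server-side key makes every request fail validation rather than pass.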
main.py CHANGED
@@ -6,12 +6,65 @@ import os
6
  import uuid
7
  import time
8
  import asyncio
9
- from typing import Dict
10
- from fastapi import FastAPI, UploadFile, File, HTTPException
11
  from fastapi.staticfiles import StaticFiles
12
  from fastapi.responses import FileResponse, JSONResponse
13
  from fastapi.middleware.cors import CORSMiddleware
14
 
 
15
  from config import UPLOAD_DIR, STATIC_DIR, MAX_FILE_SIZE_BYTES, ALLOWED_EXTENSIONS
16
  from models.schemas import (
17
  UploadResponse, ProcessingResult, TaskStatus,
@@ -67,15 +120,29 @@ def _get_file_type(filename: str) -> str:
67
  return "unknown"
68
 
69
 
70
- def _process_document(file_path: str, file_type: str, task_id: str):
71
  """
72
- Process a document: extract text, then run all analyzers.
73
- This runs in a thread pool to avoid blocking the event loop.
74
  """
75
- start_time = time.time()
76
- task = tasks[task_id]
77
- task.status = TaskStatus.PROCESSING
78
-
79
  try:
80
  # Step 1: Extract text based on file type
81
  if file_type == "pdf":
@@ -97,19 +164,21 @@ def _process_document(file_path: str, file_type: str, task_id: str):
97
  task.error_message = extraction.error_message or "No text could be extracted."
98
  task.processing_time_ms = (time.time() - start_time) * 1000
99
  return
 
100
  raw_text = extraction.raw_text
101
 
102
  # Intelligent Formatting Pass via Gemini
103
- formatted_text = clean_format_text(raw_text)
104
-
105
- if formatted_text == raw_text:
106
- # Fallback cleanup for broken line breaks if Gemini was unavailable
107
- import re
108
- formatted_text = re.sub(r'(?<!\n)\n(?!\n)', ' ', formatted_text)
109
- formatted_text = re.sub(r'[ \t]+', ' ', formatted_text)
110
-
111
- extraction.raw_text = formatted_text.strip()
112
- raw_text = extraction.raw_text
 
113
 
114
  # Step 2: Summarization
115
  try:
@@ -137,10 +206,22 @@ def _process_document(file_path: str, file_type: str, task_id: str):
137
  task.error_message = str(e)
138
  task.processing_time_ms = (time.time() - start_time) * 1000
139
 
140
  finally:
141
  # Clean up uploaded file
142
  try:
143
- if os.path.exists(file_path):
144
  os.remove(file_path)
145
  except Exception:
146
  pass
@@ -148,7 +229,7 @@ def _process_document(file_path: str, file_type: str, task_id: str):
148
 
149
  # --- API Endpoints ---
150
 
151
- @app.post("/api/upload", response_model=ProcessingResult)
152
  async def upload_and_process(file: UploadFile = File(...)):
153
  """
154
  Upload a document and start processing.
@@ -204,7 +285,58 @@ async def upload_and_process(file: UploadFile = File(...)):
204
  return task
205
 
206
 
207
- @app.post("/api/extract/url", response_model=ProcessingResult)
208
  async def extract_from_url(data: Dict[str, str]):
209
  """
210
  Extract content from a web URL and process it.
@@ -236,7 +368,7 @@ async def extract_from_url(data: Dict[str, str]):
236
  return task
237
 
238
 
239
- @app.get("/api/status/{task_id}")
240
  async def get_task_status(task_id: str):
241
  """Get the processing status and results for a task."""
242
  if task_id not in tasks:
@@ -244,7 +376,7 @@ async def get_task_status(task_id: str):
244
  return tasks[task_id]
245
 
246
 
247
- @app.get("/api/download/{task_id}")
248
  async def download_results(task_id: str):
249
  """Download the extracted text as a .txt file."""
250
  if task_id not in tasks:
 
6
  import uuid
7
  import time
8
  import asyncio
9
+ from typing import Dict, Optional
10
+ from fastapi import FastAPI, UploadFile, File, HTTPException, Depends, Header
11
  from fastapi.staticfiles import StaticFiles
12
  from fastapi.responses import FileResponse, JSONResponse
13
  from fastapi.middleware.cors import CORSMiddleware
14
+ import ssl
15
+
16
+ # --- CRITICAL: Setup NLP models BEFORE importing analyzers/extractors ---
17
+ def _setup_nlp_models():
18
+ """Download NLTK and spaCy models on startup."""
19
+ print("=" * 60)
20
+ print("Initializing NLP models (this may take a few minutes)...")
21
+ print("=" * 60)
22
+
23
+ # Fix SSL for NLTK downloads
24
+ try:
25
+ if hasattr(ssl, '_create_unverified_context'):
26
+ ssl._create_default_https_context = ssl._create_unverified_context
27
+ except Exception:
28
+ pass
29
+
30
+ # Download NLTK data
31
+ try:
32
+ import nltk
33
+ print("[1/3] NLTK resources...", end=" ", flush=True)
34
+ nltk.download('wordnet', quiet=True)
35
+ nltk.download('punkt', quiet=True)
36
+ nltk.download('omw-1.4', quiet=True)
37
+ nltk.download('averaged_perceptron_tagger', quiet=True)
38
+ print("✓")
39
+ except Exception as e:
40
+ print(f"⚠ ({e})")
41
+
42
+ # Download spaCy model
43
+ try:
44
+ import spacy
45
+ print("[2/3] spaCy en_core_web_sm...", end=" ", flush=True)
46
+ try:
47
+ spacy.load('en_core_web_sm')
48
+ print("✓")
49
+ except OSError:
50
+ print("downloading...", end=" ", flush=True)
51
+ import subprocess
52
+ subprocess.run([sys.executable, "-m", "spacy", "download", "en_core_web_sm"], capture_output=True)
53
+ print("✓")
54
+ except Exception as e:
55
+ print(f"⚠ ({e})")
56
+
57
+ print("[3/3] App initialization...", end=" ", flush=True)
58
+ print("✓")
59
+ print("=" * 60)
60
+ print("NLP setup complete! App is ready.")
61
+ print("=" * 60 + "\n")
62
+
63
+ # Setup models IMMEDIATELY
64
+ import sys
65
+ _setup_nlp_models()
66
 
67
+ import config
68
  from config import UPLOAD_DIR, STATIC_DIR, MAX_FILE_SIZE_BYTES, ALLOWED_EXTENSIONS
69
  from models.schemas import (
70
  UploadResponse, ProcessingResult, TaskStatus,
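In `_setup_nlp_models`, the check mark is printed after `subprocess.run` whether or not the spaCy download succeeded, because the return code is never inspected. A small sketch of checking the exit status instead is shown here; `run_step` is a hypothetical helper, not something defined in this commit.

```python
import subprocess
import sys
from typing import Sequence


def run_step(cmd: Sequence[str]) -> bool:
    """Run a command quietly and report whether it exited with status 0."""
    completed = subprocess.run(list(cmd), capture_output=True, text=True)
    return completed.returncode == 0


if __name__ == "__main__":
    # Stand-in command; the diff's real command would be
    # [sys.executable, "-m", "spacy", "download", "en_core_web_sm"].
    ok = run_step([sys.executable, "-c", "print('model step')"])
    print("✓" if ok else "⚠ (step failed)")
```

Reporting the failure at startup makes a missing model visible before the first request hits the analyzers.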
 
120
  return "unknown"
121
 
122
 
123
+ async def get_api_key(
124
+ x_api_key: Optional[str] = Header(None, alias="x-api-key"),
125
+ authorization: Optional[str] = Header(None, alias="Authorization"),
126
+ ) -> str:
127
+ """Validate incoming API key from header or bearer auth."""
128
+ token = x_api_key
129
+ if authorization:
130
+ bearer_prefix = "Bearer "
131
+ if authorization.startswith(bearer_prefix):
132
+ token = authorization[len(bearer_prefix) :].strip()
133
+ else:
134
+ token = authorization.strip()
135
+
136
+ if not token or not config.is_api_key_valid(token):
137
+ raise HTTPException(status_code=401, detail="Unauthorized. Invalid API key.")
138
+
139
+ return token
140
+
141
+
142
+ def _perform_extraction_and_analysis(task: ProcessingResult, file_path: str, file_type: str, start_time: float):
143
  """
144
+ Common logic for document processing: extraction, summarization, NER, and sentiment.
 
145
  """
 
 
 
 
146
  try:
147
  # Step 1: Extract text based on file type
148
  if file_type == "pdf":
 
164
  task.error_message = extraction.error_message or "No text could be extracted."
165
  task.processing_time_ms = (time.time() - start_time) * 1000
166
  return
167
+
168
  raw_text = extraction.raw_text
169
 
170
  # Intelligent Formatting Pass via Gemini
171
+ try:
172
+ formatted_text = clean_format_text(raw_text)
173
+ if formatted_text == raw_text:
174
+ # Fallback cleanup for broken line breaks if Gemini was unavailable
175
+ import re
176
+ formatted_text = re.sub(r'(?<!\n)\n(?!\n)', ' ', formatted_text)
177
+ formatted_text = re.sub(r'[ \t]+', ' ', formatted_text)
178
+ extraction.raw_text = formatted_text.strip()
179
+ raw_text = extraction.raw_text
180
+ except Exception as e:
181
+ print(f"Text cleanup error: {e}")
182
 
183
  # Step 2: Summarization
184
  try:
 
206
  task.error_message = str(e)
207
  task.processing_time_ms = (time.time() - start_time) * 1000
208
 
209
+
210
+ def _process_document(file_path: str, file_type: str, task_id: str):
211
+ """
212
+ Process a document: extract text, then run all analyzers.
213
+ This runs in a thread pool to avoid blocking the event loop.
214
+ """
215
+ start_time = time.time()
216
+ task = tasks[task_id]
217
+ task.status = TaskStatus.PROCESSING
218
+
219
+ try:
220
+ _perform_extraction_and_analysis(task, file_path, file_type, start_time)
221
  finally:
222
  # Clean up uploaded file
223
  try:
224
+ if os.path.exists(file_path) and file_type != "url":
225
  os.remove(file_path)
226
  except Exception:
227
  pass
 
229
 
230
  # --- API Endpoints ---
231
 
232
+ @app.post("/api/upload", response_model=ProcessingResult, dependencies=[Depends(get_api_key)])
233
  async def upload_and_process(file: UploadFile = File(...)):
234
  """
235
  Upload a document and start processing.
 
285
  return task
286
 
287
 
288
+ @app.post("/api/v1/extract", response_model=ProcessingResult, dependencies=[Depends(get_api_key)])
289
+ async def synchronous_extract(file: UploadFile = File(...)):
290
+ """
291
+ Synchronous extraction endpoint for API testers and bots.
292
+ Directly returns the extraction results.
293
+ """
294
+ # 1. Validation
295
+ filename = file.filename or "unknown"
296
+ ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
297
+ if ext not in ALLOWED_EXTENSIONS:
298
+ raise HTTPException(status_code=400, detail=f"Unsupported file type: .{ext}")
299
+
300
+ content = await file.read()
301
+ if len(content) > MAX_FILE_SIZE_BYTES:
302
+ raise HTTPException(status_code=400, detail="File too large.")
303
+ if len(content) == 0:
304
+ raise HTTPException(status_code=400, detail="Empty file.")
305
+
306
+ # 2. Save temporary file
307
+ file_id = f"sync_{str(uuid.uuid4())[:8]}"
308
+ file_path = os.path.join(UPLOAD_DIR, f"{file_id}_{filename}")
309
+ with open(file_path, "wb") as f:
310
+ f.write(content)
311
+
312
+ # 3. Process
313
+ file_type = _get_file_type(filename)
314
+ start_time = time.time()
315
+
316
+ # Create the result object
317
+ task = ProcessingResult.create_pending(file_id=file_id, filename=filename, file_type=file_type)
318
+
319
+ # Run the blocking pipeline in a worker thread instead of on the event loop,
320
+ # so other requests stay responsive while this one is processed.
321
+ # The handler still awaits completion, so the response carries the full result.
322
+ await asyncio.get_running_loop().run_in_executor(
323
+ None, _perform_extraction_and_analysis, task, file_path, file_type, start_time
324
+ )
325
+
326
+ # 4. Cleanup
327
+ try:
328
+ if os.path.exists(file_path):
329
+ os.remove(file_path)
330
+ except Exception:
331
+ pass
332
+
333
+ if task.status == TaskStatus.ERROR:
334
+ raise HTTPException(status_code=500, detail=task.error_message or "Processing failed.")
335
+
336
+ return task
337
+
338
+
339
+ @app.post("/api/extract/url", response_model=ProcessingResult, dependencies=[Depends(get_api_key)])
340
  async def extract_from_url(data: Dict[str, str]):
341
  """
342
  Extract content from a web URL and process it.
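The synchronous endpoint above awaits a blocking function through the default thread pool. The standalone sketch below illustrates that pattern under simplified assumptions: `blocking_work` stands in for `_perform_extraction_and_analysis`, and the ticker task exists only to show that the event loop keeps running while the worker thread is busy.

```python
import asyncio
import time
from typing import Tuple


def blocking_work(duration_s: float) -> str:
    """Stand-in for a CPU- or IO-bound pipeline that must not run on the loop."""
    time.sleep(duration_s)
    return "done"


async def handle_request(duration_s: float) -> str:
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor.
    return await loop.run_in_executor(None, blocking_work, duration_s)


async def demo() -> Tuple[str, int]:
    ticks = 0

    async def ticker() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0.005)
            ticks += 1

    background = asyncio.create_task(ticker())
    result = await handle_request(0.1)
    background.cancel()
    try:
        await background
    except asyncio.CancelledError:
        pass
    return result, ticks


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

If `blocking_work` were called directly inside the coroutine, the ticker would not advance until the call returned.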
 
368
  return task
369
 
370
 
371
+ @app.get("/api/status/{task_id}", dependencies=[Depends(get_api_key)])
372
  async def get_task_status(task_id: str):
373
  """Get the processing status and results for a task."""
374
  if task_id not in tasks:
 
376
  return tasks[task_id]
377
 
378
 
379
+ @app.get("/api/download/{task_id}", dependencies=[Depends(get_api_key)])
380
  async def download_results(task_id: str):
381
  """Download the extracted text as a .txt file."""
382
  if task_id not in tasks:
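The header handling in `get_api_key` can be read as a pure function, which makes its precedence rules easier to see. The sketch below mirrors the logic in the diff as closely as possible; `resolve_token` is a hypothetical name, and validation against the configured key is left out.

```python
from typing import Optional


def resolve_token(x_api_key: Optional[str], authorization: Optional[str]) -> Optional[str]:
    """Pick the API token from the two headers, as get_api_key does."""
    token = x_api_key
    if authorization:
        prefix = "Bearer "
        if authorization.startswith(prefix):
            token = authorization[len(prefix):].strip()
        else:
            # A non-Bearer Authorization value is treated as the raw key.
            token = authorization.strip()
    return token or None
```

Two consequences follow from this logic: when both headers are present, `Authorization` wins over `x-api-key`, and the prefix check is case-sensitive, so `bearer abc` is taken as the literal key `bearer abc`.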
models/schemas.py CHANGED
@@ -96,6 +96,7 @@ class SentimentResult(BaseModel):
96
  class ProcessingResult(BaseModel):
97
  file_id: str
98
  filename: str
 
99
  file_type: str
100
  status: TaskStatus
101
  extraction: Optional[ExtractionResult] = None
@@ -106,11 +107,15 @@ class ProcessingResult(BaseModel):
106
  error_message: Optional[str] = None
107
  timestamp: float = 0
108
 
 
 
 
109
  @staticmethod
110
  def create_pending(file_id: str, filename: str, file_type: str) -> "ProcessingResult":
111
  return ProcessingResult(
112
  file_id=file_id,
113
  filename=filename,
 
114
  file_type=file_type,
115
  status=TaskStatus.PENDING,
116
  timestamp=time.time(),
 
96
  class ProcessingResult(BaseModel):
97
  file_id: str
98
  filename: str
99
+ fileName: Optional[str] = None # CamelCase for external testers
100
  file_type: str
101
  status: TaskStatus
102
  extraction: Optional[ExtractionResult] = None
 
107
  error_message: Optional[str] = None
108
  timestamp: float = 0
109
 
110
+ class Config:
111
+ allow_population_by_field_name = True
112
+
113
  @staticmethod
114
  def create_pending(file_id: str, filename: str, file_type: str) -> "ProcessingResult":
115
  return ProcessingResult(
116
  file_id=file_id,
117
  filename=filename,
118
+ fileName=filename,
119
  file_type=file_type,
120
  status=TaskStatus.PENDING,
121
  timestamp=time.time(),
test_sync_api.py ADDED
@@ -0,0 +1,57 @@
+ import requests
+ import json
+ import sys
+ import os
+
+ BASE_URL = "http://127.0.0.1:7860"
+ API_KEY = "alldocex-test-key-2024"
+
+ def test_sync_extract(file_path):
+     print(f"Testing synchronous extraction for: {file_path}")
+
+     if not os.path.exists(file_path):
+         print(f"Error: File not found: {file_path}")
+         return
+
+     url = f"{BASE_URL}/api/v1/extract"
+     headers = {
+         "x-api-key": API_KEY
+     }
+
+     files = {
+         "file": (os.path.basename(file_path), open(file_path, "rb"), "application/octet-stream")
+     }
+
+     try:
+         response = requests.post(url, headers=headers, files=files)
+         print(f"Status Code: {response.status_code}")
+
+         if response.status_code == 200:
+             result = response.json()
+             print("\n--- RESULTS ---")
+             print(f"Filename: {result.get('filename')}")
+             print(f"Status: {result.get('status')}")
+             print(f"Extraction Success: {(result.get('extraction') or {}).get('success')}")
+
+             text = (result.get('extraction') or {}).get('raw_text') or ''  # sections may be null
+             print(f"Full Text Length: {len(text)}")
+             print(f"Snippet: {text[:200]}...")
+
+             summary = (result.get('summary') or {}).get('summary') or ''
+             if summary:
+                 print(f"Summary Snippet: {summary[:200]}...")
+
+             entities = (result.get('entities') or {}).get('total_entities', 0)
+             print(f"Total Entities Found: {entities}")
+
+             print("\n[SUCCESS] Synchronous endpoint working correctly.")
+         else:
+             print(f"Error Response: {response.text}")
+
+     except Exception as e:
+         print(f"Request failed: {e}")
+
+ if __name__ == "__main__":
+     # Test with the existing sample document
+     sample_doc = "test_document.docx"
+     test_sync_extract(sample_doc)
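The test script opens the upload file inline and never closes it. A hedged sketch of the same multipart upload with the handle managed by a `with` block follows. `post_file` is a hypothetical helper; the `post` argument is expected to behave like `requests.post`, and the 120-second timeout is an arbitrary assumption.

```python
import os
from typing import Any, Callable


def post_file(post: Callable[..., Any], url: str, api_key: str, file_path: str) -> Any:
    """Send one file as multipart form data and close the handle afterwards."""
    with open(file_path, "rb") as handle:
        files = {
            "file": (os.path.basename(file_path), handle, "application/octet-stream")
        }
        # The request must complete inside the with-block, while the file is open.
        return post(url, headers={"x-api-key": api_key}, files=files, timeout=120)
```

With the `requests` library this could be called as `post_file(requests.post, f"{BASE_URL}/api/v1/extract", API_KEY, "test_document.docx")`. Passing the callable in also lets the helper be exercised without a running server.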