---
title: PDF Layout Extractor
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app.py
pinned: false
---
# PDF Layout Extraction Tool
A tool that extracts figures, tables, annotated layouts, and Markdown text from scientific PDFs, using DocLayout-YOLO for layout detection.
## Features
- **REST API endpoints** for programmatic access
- **Real-time progress tracking** with visual progress bar
- **Multiple processing modes:** Images only, Markdown only, or Both
- **Background processing** - upload files and track progress via API
- **Modern web UI** with dark/light theme
- **GPU/CPU support** - automatically detects available hardware
## API Endpoints
**Base URL:** `https://saifisvibin-volaris-pdf-tool.hf.space`
- `POST /api/predict` - **Recommended**: Synchronous PDF extraction (returns the complete results in the response)
- `POST /api/upload` - Upload PDFs for async processing (returns `task_id`)
- `GET /api/progress/<task_id>` - Get processing progress (0-100%)
- `GET /api/pdf-list` - List all processed PDFs
- `GET /api/pdf-details/<pdf_stem>` - Get details for a processed PDF
- `GET /api/device-info` - Get GPU/CPU device information
- `GET /api/docs` - Interactive API documentation
- `GET /output/<path>` - Download processed files (PDFs, images, markdown)
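The routes above can be turned into full URLs with a small helper. The sketch below is one way to do that with only the standard library. `ROUTES` and `api_url` are hypothetical names, not part of the service, and the placeholder values are URL-quoted on the assumption that task IDs, PDF stems, or output paths may contain characters that need escaping.

```python
from urllib.parse import quote

BASE_URL = "https://saifisvibin-volaris-pdf-tool.hf.space"

# Route templates copied from the endpoint list; the dictionary keys are illustrative names.
ROUTES = {
    "predict": "/api/predict",
    "upload": "/api/upload",
    "progress": "/api/progress/{task_id}",
    "pdf_list": "/api/pdf-list",
    "pdf_details": "/api/pdf-details/{pdf_stem}",
    "device_info": "/api/device-info",
    "docs": "/api/docs",
    "output": "/output/{path}",
}


def api_url(name: str, **params: str) -> str:
    """Build a full URL for a named route, URL-quoting each placeholder value."""
    # Slashes are left unescaped so nested /output/<path> values stay intact.
    quoted = {key: quote(value, safe="/") for key, value in params.items()}
    return BASE_URL + ROUTES[name].format(**quoted)


print(api_url("progress", task_id="abc123"))
print(api_url("pdf_details", pdf_stem="my paper"))
```

Keeping the routes in one table means a change to the base URL or a path only needs to be made in one place.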
## Example API Usage
### Simple Synchronous Extraction (Recommended)
```python
import requests

# Upload a PDF and get the results in the same response
with open('document.pdf', 'rb') as f:
    response = requests.post(
        'https://saifisvibin-volaris-pdf-tool.hf.space/api/predict',
        files={'file': f},
    )
response.raise_for_status()  # stop early on an HTTP error

result = response.json()
print(f"Text: {result['text']}")
print(f"Figures: {len(result['figures'])}")
print(f"Tables: {len(result['tables'])}")
```
### Async Processing with Progress
```python
import requests
import time

BASE_URL = 'https://saifisvibin-volaris-pdf-tool.hf.space'

# Upload a PDF (async)
data = {'extraction_mode': 'both'}  # or 'images' or 'markdown'
with open('document.pdf', 'rb') as f:
    response = requests.post(f'{BASE_URL}/api/upload', files={'files[]': f}, data=data)
response.raise_for_status()  # stop early on an HTTP error
task_id = response.json()['task_id']

# Poll for progress until the task completes
while True:
    progress = requests.get(f'{BASE_URL}/api/progress/{task_id}').json()
    print(f"Progress: {progress['progress']}% - {progress['message']}")
    if progress['status'] == 'completed':
        break
    time.sleep(0.5)

# Get results from the final progress response
results = progress['results']
```
## Processing Modes
1. **Images Only** - Extracts figures and tables with layout detection
2. **Markdown Only** - Extracts text content as markdown
3. **Both** - Extracts both images and markdown content
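The async upload example passes the mode as an `extraction_mode` form field with the value `'images'`, `'markdown'`, or `'both'`. Below is a minimal sketch of validating that value before uploading. `EXTRACTION_MODES` and `upload_form_data` are hypothetical names, and the pairing of each listed mode with a form value is inferred from the names rather than confirmed by the API.

```python
# Form values taken from the upload example; the label-to-value pairing is an assumption.
EXTRACTION_MODES = {
    "Images Only": "images",
    "Markdown Only": "markdown",
    "Both": "both",
}


def upload_form_data(mode: str = "both") -> dict:
    """Return the form data for /api/upload, rejecting unknown modes early."""
    if mode not in EXTRACTION_MODES.values():
        allowed = ", ".join(sorted(EXTRACTION_MODES.values()))
        raise ValueError(f"Unknown extraction mode {mode!r}; expected one of: {allowed}")
    return {"extraction_mode": mode}


print(upload_form_data("images"))
```

Checking the value on the client side avoids uploading a large PDF only to find out that the mode was misspelled.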
## Output Structure
Processing a PDF creates an output directory containing:
- `*_content_list.json` - Metadata for extracted figures/tables
- `*_layout.pdf` - Annotated PDF with layout boxes
- `*.md` - Markdown export (if enabled)
- `figures/` - Extracted figure images
- `tables/` - Extracted table images
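To illustrate the layout above, the sketch below groups the files of one output directory by the listed patterns. `summarize_output` is a hypothetical helper. It assumes the directory is available locally (for example after downloading it), and it makes no assumption about the image formats inside `figures/` and `tables/`.

```python
from pathlib import Path


def summarize_output(output_dir: str) -> dict:
    """Group the files of one processed PDF by the documented naming patterns."""
    root = Path(output_dir)

    def names(pattern: str) -> list:
        # Sorted file names (not full paths) matching a glob pattern under root.
        return sorted(p.name for p in root.glob(pattern) if p.is_file())

    return {
        "content_list": names("*_content_list.json"),
        "layout_pdf": names("*_layout.pdf"),
        "markdown": names("*.md"),  # empty if the Markdown export was not enabled
        "figures": names("figures/*"),
        "tables": names("tables/*"),
    }
```

Calling `summarize_output` with the path to one PDF's output directory returns a dictionary of file-name lists, so `len(summary["figures"])` gives the number of extracted figure images.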
## Built With
- [DocLayout-YOLO](https://github.com/juliozhao/DocLayout-YOLO) - Layout detection
- [PyMuPDF](https://pymupdf.readthedocs.io/) - PDF processing
- [Flask](https://flask.palletsprojects.com/) - Web framework