---
title: PDF Layout Extractor
emoji: 📄
colorFrom: blue
colorTo: purple
sdk: docker
app_file: app.py
pinned: false
---

# PDF Layout Extraction Tool

A tool for extracting figures, tables, annotated layouts, and markdown text from scientific PDFs using DocLayout-YOLO.

## Features

- **REST API endpoints** for programmatic access
- **Real-time progress tracking** with visual progress bar
- **Multiple processing modes:** Images only, Markdown only, or Both
- **Background processing** - upload files and track progress via API
- **Modern web UI** with dark/light theme
- **GPU/CPU support** - automatically detects available hardware

## API Endpoints

**Base URL:** `https://saifisvibin-volaris-pdf-tool.hf.space`

- `POST /api/predict` - **Recommended**: Synchronous PDF extraction (the request blocks until processing finishes and returns the complete results in the response)
- `POST /api/upload` - Upload PDFs for async processing (returns `task_id`)
- `GET /api/progress/<task_id>` - Get processing progress (0-100%)
- `GET /api/pdf-list` - List all processed PDFs
- `GET /api/pdf-details/<pdf_stem>` - Get details for a processed PDF
- `GET /api/device-info` - Get GPU/CPU device information
- `GET /api/docs` - Interactive API documentation
- `GET /output/<path>` - Download processed files (PDFs, images, markdown)
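
Several of the endpoints above take a path parameter (`<task_id>`, `<pdf_stem>`). A minimal sketch of building those URLs safely is shown here; the `endpoint` helper and the sample values `abc123` and `my paper` are hypothetical and not part of the tool itself.

```python
from urllib.parse import quote

BASE_URL = "https://saifisvibin-volaris-pdf-tool.hf.space"


def endpoint(path_template: str, **params: str) -> str:
    """Fill a template such as '/api/progress/{task_id}' and prepend the base URL.

    Hypothetical helper for illustration; it only builds strings and sends no request.
    """
    # Percent-encode each value so spaces or slashes cannot break the URL path.
    encoded = {name: quote(str(value), safe="") for name, value in params.items()}
    return BASE_URL + path_template.format(**encoded)


print(endpoint("/api/progress/{task_id}", task_id="abc123"))
print(endpoint("/api/pdf-details/{pdf_stem}", pdf_stem="my paper"))
```

For `GET /output/<path>` the path may legitimately contain slashes, so a variant of this helper would need to leave `/` unescaped (for example `quote(value, safe="/")`).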

## Example API Usage

### Simple Synchronous Extraction (Recommended)

```python
import requests

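# Upload the PDF; the request blocks until extraction finishes
with open('document.pdf', 'rb') as f:  # close the file handle once the upload is done
    response = requests.post('https://saifisvibin-volaris-pdf-tool.hf.space/api/predict', files={'file': f})
response.raise_for_status()  # surface HTTP errors before parsing the body
result = response.json()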

print(f"Text: {result['text']}")
print(f"Figures: {len(result['figures'])}")
print(f"Tables: {len(result['tables'])}")
```

### Async Processing with Progress

```python
import requests
import time

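# Upload a PDF (async); the context manager closes the file after the request
data = {'extraction_mode': 'both'}  # or 'images' or 'markdown'
with open('document.pdf', 'rb') as f:
    response = requests.post('https://saifisvibin-volaris-pdf-tool.hf.space/api/upload', files={'files[]': f}, data=data)
response.raise_for_status()  # surface HTTP errors before reading task_id
task_id = response.json()['task_id']

# Poll for progress; the 10-minute limit is an arbitrary guard against a stalled task
deadline = time.monotonic() + 600
while True:
    progress = requests.get(f'https://saifisvibin-volaris-pdf-tool.hf.space/api/progress/{task_id}').json()
    print(f"Progress: {progress['progress']}% - {progress['message']}")
    if progress['status'] == 'completed':
        break
    if time.monotonic() > deadline:
        raise TimeoutError(f"Task {task_id} did not complete within 10 minutes")
    time.sleep(0.5)

# Get results from the final progress response
results = progress['results']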
```

## Processing Modes

1. **Images Only** - Extracts figures and tables with layout detection
2. **Markdown Only** - Extracts text content as markdown
3. **Both** - Extracts both images and markdown content
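
When calling `POST /api/upload`, the mode is passed as the `extraction_mode` form field, with the values `'images'`, `'markdown'`, or `'both'` used in the async example. A small sketch of mapping the mode names above to that field follows; the `EXTRACTION_MODES` table and the `upload_form_data` function are hypothetical names introduced only for illustration.

```python
# Mode labels from this README mapped to the `extraction_mode` form values.
EXTRACTION_MODES = {
    "Images Only": "images",
    "Markdown Only": "markdown",
    "Both": "both",
}


def upload_form_data(mode_label: str) -> dict:
    """Return the form data for /api/upload given a mode label (hypothetical helper)."""
    try:
        return {"extraction_mode": EXTRACTION_MODES[mode_label]}
    except KeyError:
        valid = ", ".join(EXTRACTION_MODES)
        raise ValueError(f"Unknown mode {mode_label!r}; expected one of: {valid}") from None


print(upload_form_data("Markdown Only"))
```

Validating the label on the client side gives a clear error before any file is uploaded.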

## Output Structure

Each processed PDF creates a directory with:
- `*_content_list.json` - Metadata for extracted figures/tables
- `*_layout.pdf` - Annotated PDF with layout boxes
- `*.md` - Markdown export (if enabled)
- `figures/` - Extracted figure images
- `tables/` - Extracted table images
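
As a rough sketch, a downloaded or locally generated output directory with this layout could be inspected as follows. The `summarize_output` function and the example directory name are hypothetical, and the sketch assumes the directory is available on the local filesystem.

```python
from pathlib import Path


def _file_names(directory: Path, pattern: str) -> list:
    """Sorted names of files matching `pattern`; empty if the directory is absent."""
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.glob(pattern) if p.is_file())


def summarize_output(pdf_dir) -> dict:
    """Group the files of one processed PDF by the categories listed above."""
    pdf_dir = Path(pdf_dir)
    return {
        "content_list": _file_names(pdf_dir, "*_content_list.json"),
        "layout_pdf": _file_names(pdf_dir, "*_layout.pdf"),
        "markdown": _file_names(pdf_dir, "*.md"),  # empty if markdown export was off
        "figures": _file_names(pdf_dir / "figures", "*"),
        "tables": _file_names(pdf_dir / "tables", "*"),
    }


# Example call with a hypothetical directory name:
# print(summarize_output("output/document"))
```

Missing pieces, such as the markdown file in Images Only mode, simply show up as empty lists.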

## Built With

- [DocLayout-YOLO](https://github.com/juliozhao/DocLayout-YOLO) - Layout detection
- [PyMuPDF](https://pymupdf.readthedocs.io/) - PDF processing
- [Flask](https://flask.palletsprojects.com/) - Web framework