| title: PDF Layout Extractor | |
| emoji: 📄 | |
| colorFrom: blue | |
| colorTo: purple | |
| sdk: docker | |
| app_file: app.py | |
| pinned: false | |
| # PDF Layout Extraction Tool | |
| A tool for extracting figures, tables, annotated layouts, and Markdown text from scientific PDFs, with layout detection by DocLayout-YOLO. | |
| ## Features | |
| - **REST API endpoints** for programmatic access | |
| - **Real-time progress tracking** with visual progress bar | |
| - **Multiple processing modes:** Images only, Markdown only, or Both | |
| - **Background processing** - upload files and track progress via API | |
| - **Modern web UI** with dark/light theme | |
| - **GPU/CPU support** - automatically detects available hardware | |
| ## API Endpoints | |
| **Base URL:** `https://saifisvibin-volaris-pdf-tool.hf.space` | |
| - `POST /api/predict` - **Recommended**: Synchronous PDF extraction (the response contains the complete results, so no polling is needed) | |
| - `POST /api/upload` - Upload PDFs for async processing (returns `task_id`) | |
| - `GET /api/progress/<task_id>` - Get processing progress (0-100%) | |
| - `GET /api/pdf-list` - List all processed PDFs | |
| - `GET /api/pdf-details/<pdf_stem>` - Get details for a processed PDF | |
| - `GET /api/device-info` - Get GPU/CPU device information | |
| - `GET /api/docs` - Interactive API documentation | |
| - `GET /output/<path>` - Download processed files (PDFs, images, markdown) | |
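The path templates listed above can be turned into concrete URLs with a small helper. This is a minimal sketch, not part of the service: `endpoint_url` is a hypothetical name, the example file path is made up, and percent-encoding of path segments is an assumption the endpoint list doesn't confirm. The helper only builds strings; it sends no request.

```python
from urllib.parse import quote

BASE_URL = "https://saifisvibin-volaris-pdf-tool.hf.space"


def endpoint_url(template: str, **params: str) -> str:
    """Fill `<name>` placeholders in a path template and prepend BASE_URL."""
    path = template
    for name, value in params.items():
        placeholder = f"<{name}>"
        if placeholder not in path:
            raise KeyError(f"{template!r} has no placeholder {placeholder}")
        # Assumption: segments are percent-encoded; "/" is kept so that
        # nested paths under /output/ stay intact.
        path = path.replace(placeholder, quote(value, safe="/"))
    if "<" in path:
        raise ValueError(f"unfilled placeholder in {path!r}")
    return BASE_URL + path


progress_url = endpoint_url("/api/progress/<task_id>", task_id="abc123")
# Hypothetical output path, for illustration only
figure_url = endpoint_url("/output/<path>", path="my paper/figures/fig 1.png")
print(progress_url)
print(figure_url)
```

Raising on leftover or unknown placeholders catches typos in a template before a request is made.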
| ## Example API Usage | |
| ### Simple Synchronous Extraction (Recommended) | |
| ```python | |
| import requests | |
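| # Upload the PDF; the response arrives once extraction has finished | |
| with open('document.pdf', 'rb') as pdf:  # closes the file handle after the upload | |
|     response = requests.post('https://saifisvibin-volaris-pdf-tool.hf.space/api/predict', files={'file': pdf}) | |
| response.raise_for_status()  # fail early on HTTP errors | |
| result = response.json() | |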
| print(f"Text: {result['text']}") | |
| print(f"Figures: {len(result['figures'])}") | |
| print(f"Tables: {len(result['tables'])}") | |
| ``` | |
| ### Async Processing with Progress | |
| ```python | |
| import requests | |
| import time | |
| # Upload a PDF (async) | |
| data = {'extraction_mode': 'both'}  # or 'images' or 'markdown' | |
| with open('document.pdf', 'rb') as pdf:  # closes the file handle after the upload | |
|     response = requests.post('https://saifisvibin-volaris-pdf-tool.hf.space/api/upload', files={'files[]': pdf}, data=data) | |
| task_id = response.json()['task_id'] | |
| # Poll for progress | |
| while True: | |
|     progress = requests.get(f'https://saifisvibin-volaris-pdf-tool.hf.space/api/progress/{task_id}').json() | |
|     print(f"Progress: {progress['progress']}% - {progress['message']}") | |
|     if progress['status'] == 'completed': | |
|         break | |
|     time.sleep(0.5) | |
| # Get results | |
| results = progress['results'] | |
| ``` | |
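The polling loop above never exits if a task does not reach `completed`. A bounded variant might look like the sketch below. It relies only on the `status` field shown above; `wait_for_task`, the injected `fetch` callable, the timeout values, and the `processing` status in the canned responses are all hypothetical. Injecting `fetch` lets the loop be tried without network access.

```python
import time
from typing import Callable


def wait_for_task(fetch: Callable[[], dict],
                  timeout: float = 300.0,
                  interval: float = 0.5) -> dict:
    """Call fetch() until it reports status 'completed' or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        progress = fetch()
        if progress.get("status") == "completed":
            return progress
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task not completed after {timeout} seconds")
        time.sleep(interval)


# Offline demonstration with canned responses; the 'processing' status
# is a made-up placeholder, not a documented value.
_responses = iter([
    {"status": "processing", "progress": 40},
    {"status": "completed", "progress": 100, "results": []},
])
final = wait_for_task(lambda: next(_responses), timeout=5.0, interval=0.01)
print(final["progress"])
```

Against the live service, `fetch` could be `lambda: requests.get(progress_url).json()`, where `progress_url` is the `/api/progress/<task_id>` URL for the uploaded task.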
| ## Processing Modes | |
| 1. **Images Only** - Extracts figures and tables with layout detection | |
| 2. **Markdown Only** - Extracts text content as markdown | |
| 3. **Both** - Extracts both images and markdown content | |
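The three modes appear to correspond to the `extraction_mode` form values used in the async example (`'images'`, `'markdown'`, `'both'`). The sketch below assumes that one-to-one mapping; `EXTRACTION_MODES` and `upload_form_data` are hypothetical names. It validates a mode before building the form data for `POST /api/upload`.

```python
# Assumed mapping from the mode names above to the form values
# seen in the async upload example.
EXTRACTION_MODES = {
    "Images Only": "images",
    "Markdown Only": "markdown",
    "Both": "both",
}


def upload_form_data(mode: str = "both") -> dict:
    """Return the form data for an upload, rejecting unknown modes."""
    allowed = set(EXTRACTION_MODES.values())
    if mode not in allowed:
        raise ValueError(f"unknown extraction mode {mode!r}; expected one of {sorted(allowed)}")
    return {"extraction_mode": mode}


form = upload_form_data(EXTRACTION_MODES["Markdown Only"])
print(form)
```

Checking the value on the client side avoids uploading a large PDF only to have the request rejected or processed in an unintended mode.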
| ## Output Structure | |
| Each processed PDF gets its own output directory containing: | |
| - `*_content_list.json` - Metadata for extracted figures/tables | |
| - `*_layout.pdf` - Annotated PDF with layout boxes | |
| - `*.md` - Markdown export (if enabled) | |
| - `figures/` - Extracted figure images | |
| - `tables/` - Extracted table images | |
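A local copy of such a directory can be inspected with `pathlib`. The sketch below is illustrative: `summarize_output` is a hypothetical helper, and the `paper` file names in the demonstration are made up (it assumes the `*` prefix is the PDF's stem). It does not parse the JSON, since the schema of `*_content_list.json` is not described here.

```python
import tempfile
from pathlib import Path


def _count_files(directory: Path) -> int:
    """Count regular files directly inside directory (0 if it is missing)."""
    if not directory.is_dir():
        return 0
    return sum(1 for p in directory.iterdir() if p.is_file())


def summarize_output(pdf_dir: Path) -> dict:
    """Report which of the documented outputs exist in one PDF's directory."""
    content_lists = sorted(pdf_dir.glob("*_content_list.json"))
    layouts = sorted(pdf_dir.glob("*_layout.pdf"))
    return {
        "content_list": content_lists[0].name if content_lists else None,
        "layout_pdf": layouts[0].name if layouts else None,
        "markdown": [p.name for p in sorted(pdf_dir.glob("*.md"))],
        "figures": _count_files(pdf_dir / "figures"),
        "tables": _count_files(pdf_dir / "tables"),
    }


# Demonstration on a throwaway directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "paper"
    (root / "figures").mkdir(parents=True)
    (root / "tables").mkdir()
    for name in ("paper_content_list.json", "paper_layout.pdf",
                 "paper.md", "figures/fig1.png"):
        (root / name).touch()
    summary = summarize_output(root)

print(summary)
```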
| ## Built With | |
| - [DocLayout-YOLO](https://github.com/juliozhao/DocLayout-YOLO) - Layout detection | |
| - [PyMuPDF](https://pymupdf.readthedocs.io/) - PDF processing | |
| - [Flask](https://flask.palletsprojects.com/) - Web framework | |