saifisvibinn

Fix API docs: ensure HTTPS, add /api/predict endpoint to README

2226eb2 5 months ago

	---
	title: PDF Layout Extractor
	emoji: 📄
	colorFrom: blue
	colorTo: purple
	sdk: docker
	app_file: app.py
	pinned: false
	---

	# PDF Layout Extraction Tool

	A tool for extracting figures, tables, annotated layouts, and Markdown text from scientific PDFs using DocLayout-YOLO.

	## Features

	- REST API endpoints for programmatic access
	- Real-time progress tracking with visual progress bar
	- Multiple processing modes: Images only, Markdown only, or Both
	- Background processing - upload files and track progress via API
	- Modern web UI with dark/light theme
	- GPU/CPU support - automatically detects available hardware

	## API Endpoints

	Base URL: `https://saifisvibin-volaris-pdf-tool.hf.space`

	- `POST /api/predict` - Recommended: Synchronous PDF extraction (the response contains the complete results once processing finishes)
	- `POST /api/upload` - Upload PDFs for async processing (returns `task_id`)
	- `GET /api/progress/<task_id>` - Get processing progress (0-100%)
	- `GET /api/pdf-list` - List all processed PDFs
	- `GET /api/pdf-details/<pdf_stem>` - Get details for a processed PDF
	- `GET /api/device-info` - Get GPU/CPU device information
	- `GET /api/docs` - Interactive API documentation
	- `GET /output/<path>` - Download processed files (PDFs, images, markdown)
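
The endpoint list above can be turned into a few URL helpers. This is a minimal sketch, not part of the tool: the function names are hypothetical, and the example path passed to `output_url` is an assumed layout rather than a documented one. Path parameters are percent-encoded so that stems containing spaces still produce valid URLs.

```python
from urllib.parse import quote

BASE_URL = "https://saifisvibin-volaris-pdf-tool.hf.space"


def endpoint(path: str, base_url: str = BASE_URL) -> str:
    """Join the base URL and an endpoint path with exactly one slash."""
    return f"{base_url.rstrip('/')}/{path.lstrip('/')}"


def progress_url(task_id: str) -> str:
    """URL for GET /api/progress/<task_id>."""
    return endpoint(f"/api/progress/{quote(task_id, safe='')}")


def pdf_details_url(pdf_stem: str) -> str:
    """URL for GET /api/pdf-details/<pdf_stem>."""
    return endpoint(f"/api/pdf-details/{quote(pdf_stem, safe='')}")


def output_url(relative_path: str) -> str:
    """URL for GET /output/<path>; slashes in the path are kept as separators."""
    return endpoint(f"/output/{quote(relative_path.lstrip('/'))}")
```

For example, `pdf_details_url("my paper")` ends in `/api/pdf-details/my%20paper`.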

	## Example API Usage

	### Simple Synchronous Extraction (Recommended)

	```python
	import requests

	# Upload the PDF; the response arrives once extraction has finished
	with open('document.pdf', 'rb') as pdf:
	    response = requests.post('https://saifisvibin-volaris-pdf-tool.hf.space/api/predict', files={'file': pdf})
	response.raise_for_status()  # stop here on an HTTP error status
	result = response.json()

	print(f"Text: {result['text']}")
	print(f"Figures: {len(result['figures'])}")
	print(f"Tables: {len(result['tables'])}")
	```
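
The example above indexes `result['text']`, `result['figures']`, and `result['tables']` directly. A slightly more defensive sketch is shown below; `summarize_result` is a hypothetical helper, and treating a missing key as empty is an assumption about how a caller might want to handle partial responses, not documented API behavior.

```python
def summarize_result(result: dict) -> str:
    """Summarize a /api/predict response without failing on missing keys."""
    # Missing or null fields are treated as empty (an assumption, see above).
    text = result.get("text") or ""
    figures = result.get("figures") or []
    tables = result.get("tables") or []
    return f"{len(text)} characters of text, {len(figures)} figures, {len(tables)} tables"
```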

	### Async Processing with Progress

	```python
	import requests
	import time

	# Upload a PDF (async)
	data = {'extraction_mode': 'both'}  # or 'images' or 'markdown'
	with open('document.pdf', 'rb') as pdf:
	    response = requests.post('https://saifisvibin-volaris-pdf-tool.hf.space/api/upload', files={'files[]': pdf}, data=data)
	response.raise_for_status()  # stop here on an HTTP error status
	task_id = response.json()['task_id']

	# Poll for progress
	while True:
	    progress = requests.get(f'https://saifisvibin-volaris-pdf-tool.hf.space/api/progress/{task_id}').json()
	    print(f"Progress: {progress['progress']}% - {progress['message']}")
	    if progress['status'] == 'completed':
	        break
	    time.sleep(0.5)

	# Get results
	results = progress['results']
	```
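
The polling loop above runs until the task reports `completed`, so a task that never completes would block the caller indefinitely. One way to bound the wait is sketched below. `wait_for_completion` is a hypothetical helper, and the 300-second default is an arbitrary assumption. The progress request is passed in as a callable so the loop can be exercised without a network connection.

```python
import time
from typing import Callable


def wait_for_completion(
    fetch_progress: Callable[[], dict],
    timeout: float = 300.0,   # assumed upper bound, adjust to your documents
    interval: float = 0.5,    # same polling interval as the example above
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll until the task status is 'completed' or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        progress = fetch_progress()
        if progress.get("status") == "completed":
            return progress
        if clock() >= deadline:
            raise TimeoutError(f"task did not complete within {timeout} seconds")
        sleep(interval)
```

With `requests`, the callable could be `lambda: requests.get(progress_endpoint).json()`, where `progress_endpoint` is the `/api/progress/<task_id>` URL for the uploaded task.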

	## Processing Modes

	1. Images Only - Extracts figures and tables with layout detection
	2. Markdown Only - Extracts text content as markdown
	3. Both - Extracts both images and markdown content
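
The upload example passes the mode as the `extraction_mode` form field with the values `'images'`, `'markdown'`, or `'both'`. The sketch below validates the value on the client before uploading. The mapping from the mode names in this list to those values is an assumption based on the matching names, and `upload_form_data` is a hypothetical helper.

```python
# Assumed correspondence between the listed modes and the form values.
EXTRACTION_MODES = {
    "Images Only": "images",
    "Markdown Only": "markdown",
    "Both": "both",
}


def upload_form_data(mode: str = "both") -> dict:
    """Build the form data for POST /api/upload, rejecting unknown modes."""
    allowed = sorted(EXTRACTION_MODES.values())
    if mode not in allowed:
        raise ValueError(f"extraction_mode must be one of {allowed}, got {mode!r}")
    return {"extraction_mode": mode}
```

The returned dictionary can be passed as `data=` to `requests.post`.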

	## Output Structure

	Processing a PDF creates an output directory containing:
	- `*_content_list.json` - Metadata for extracted figures/tables
	- `*_layout.pdf` - Annotated PDF with layout boxes
	- `*.md` - Markdown export (if enabled)
	- `figures/` - Extracted figure images
	- `tables/` - Extracted table images
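
Given a local copy of one such directory, the layout above can be inspected with `pathlib`. This is a sketch under assumptions: `summarize_output_dir` is a hypothetical helper, and it relies only on the file patterns listed above, not on any knowledge of the file contents.

```python
from pathlib import Path


def summarize_output_dir(output_dir) -> dict:
    """List the files of one processed PDF, grouped by the patterns above."""
    root = Path(output_dir)

    def files_in(subdir: str) -> list:
        # A subdirectory may be absent, e.g. when a mode skips image extraction.
        directory = root / subdir
        if not directory.is_dir():
            return []
        return sorted(p.name for p in directory.iterdir() if p.is_file())

    return {
        "content_lists": sorted(p.name for p in root.glob("*_content_list.json")),
        "layout_pdfs": sorted(p.name for p in root.glob("*_layout.pdf")),
        "markdown": sorted(p.name for p in root.glob("*.md")),
        "figures": files_in("figures"),
        "tables": files_in("tables"),
    }
```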

	## Built With

	- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO) - Layout detection
	- [PyMuPDF](https://pymupdf.readthedocs.io/) - PDF processing
	- [Flask](https://flask.palletsprojects.com/) - Web framework