title: MinerUapi
emoji: 📄
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false

MinerU PDF Converter

This Space converts PDF files to Markdown and structured JSON using the MinerU PDF extraction tool.

Features

  • Web interface for uploading and converting PDF files
  • RESTful API for programmatic access
  • Health monitoring endpoint
  • High-quality PDF extraction with support for tables, formulas, and complex layouts
  • Output in both Markdown and structured JSON formats
  • Comprehensive error handling and fallback mechanisms

API Usage

The service exposes two API endpoints for programmatic access:

1. PDF Conversion Endpoint

POST /api/convert

Request:

  • Content-Type: multipart/form-data
  • Body: form field 'file' containing the PDF file

Response:

{
  "success": true,
  "message": "PDF conversion successful",
  "job_id": "uuid",
  "base_filename": "filename",
  "file_info": {
    "original_filename": "document.pdf",
    "size_bytes": 42950,
    "content_type": "application/pdf"
  },
  "markdown": "# Converted markdown content...",
  "json": { 
    "title": "Document Title",
    "sections": [...]
  },
  "log": "Processing log...",
  "files": {
    "markdown_path": "document.md",
    "json_path": "document.json"
  }
}
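
As a rough illustration of the request and response shapes above, here is a minimal standard-library sketch. It is not the bundled api_client.py. The helper names (build_multipart, convert_pdf, save_outputs) and the 300-second timeout are assumptions made for this example; the endpoint path, the 'file' form field, and the response keys are taken from the description above.

```python
from __future__ import annotations

import json
import urllib.request
import uuid
from pathlib import Path


def build_multipart(field: str, filename: str, data: bytes,
                    content_type: str = "application/pdf") -> tuple[bytes, str]:
    """Encode one file as multipart/form-data; return (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def convert_pdf(api_url: str, pdf_path: str, timeout: float = 300.0) -> dict:
    """POST a PDF to /api/convert and return the decoded JSON response."""
    path = Path(pdf_path)
    body, header = build_multipart("file", path.name, path.read_bytes())
    request = urllib.request.Request(
        api_url.rstrip("/") + "/api/convert",
        data=body,
        headers={"Content-Type": header},
        method="POST",
    )
    # The timeout is an assumed value; large PDFs may need longer.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def save_outputs(result: dict, out_dir: str) -> list[Path]:
    """Write the "markdown" and "json" fields using the names listed under "files"."""
    if not result.get("success"):
        raise RuntimeError(result.get("message", "conversion failed"))
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Keep only the final path component so output stays inside out_dir.
    md_path = target / Path(result["files"]["markdown_path"]).name
    json_path = target / Path(result["files"]["json_path"]).name
    md_path.write_text(result["markdown"], encoding="utf-8")
    json_path.write_text(json.dumps(result["json"], indent=2), encoding="utf-8")
    return [md_path, json_path]
```

A call such as save_outputs(convert_pdf("https://marcosremar2-mineruapi.hf.space", "document.pdf"), "out") would then write both converted files into the out directory, provided the Space is reachable.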

2. Health Check Endpoint

GET /health

Response:

{
  "status": "healthy",
  "version": "1.1.0",
  "environment": {
    "python_version": "3.10.12",
    "platform": "Linux-6.1.58+-x86_64-with-glibc2.35",
    "processor": "x86_64"
  },
  "configuration": {
    "upload_folder_exists": true,
    "output_folder_exists": true,
    "magic_pdf_installed": true
  }
}
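
The health payload above can be checked before submitting a job. The sketch below is one possible way to do that with the standard library; the function names and the 10-second timeout are assumptions for illustration. Treating every false configuration flag as a problem is also an assumption: a false magic_pdf_installed may only mean that the service will use its fallback converter.

```python
import json
import urllib.request


def fetch_health(api_url: str, timeout: float = 10.0) -> dict:
    """GET /health and return the decoded JSON body."""
    url = api_url.rstrip("/") + "/health"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def readiness_problems(health: dict) -> list:
    """Return human-readable problems; an empty list means the service looks ready."""
    problems = []
    if health.get("status") != "healthy":
        problems.append(f"status is {health.get('status')!r}")
    # Report every configuration flag that is not exactly True.
    for flag, value in health.get("configuration", {}).items():
        if value is not True:
            problems.append(f"{flag} is {value!r}")
    return problems
```

For example, readiness_problems(fetch_health("https://marcosremar2-mineruapi.hf.space")) returns an empty list when the status is "healthy" and all configuration flags are true.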

Client Example

A Python client script (api_client.py) is included in this repository for easy integration:

# Example usage
python api_client.py path/to/your/document.pdf --api-url https://marcosremar2-mineruapi.hf.space

The client includes features such as:

  • Automatic health check to verify API status
  • Retry logic for failed requests
  • Progress tracking
  • Comprehensive error handling
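
The retry behaviour listed above is not documented in detail here, so the following is only a sketch of how such logic is commonly written, not the code in api_client.py. The function name with_retries, the exponential backoff schedule, and the default of three attempts are all assumptions.

```python
import time


def with_retries(call, attempts=3, base_delay=1.0, retry_on=(OSError,), sleep=time.sleep):
    """Run call(); on a retryable error wait base_delay * 2**n seconds and try again."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** attempt)
```

The default retry_on=(OSError,) covers connection failures and timeouts raised by urllib, since urllib.error.URLError is a subclass of OSError. Passing sleep as a parameter keeps the helper easy to exercise without real delays.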

You can also use curl:

curl -X POST -F "file=@path/to/your/document.pdf" https://marcosremar2-mineruapi.hf.space/api/convert

And check health with:

curl https://marcosremar2-mineruapi.hf.space/health

Web Interface

The Space also provides a web interface where you can:

  • Upload PDF files for conversion
  • View the generated Markdown and JSON
  • Download the converted files
  • View processing logs

Implementation Details

This service uses:

  • MinerU for high-quality PDF extraction
  • PyMuPDF as a fallback conversion method
  • Flask web server for the interface and API
  • Docker container for deployment on Hugging Face Spaces

Error Handling

The service includes robust error handling:

  • Automatic fallback to local PDF conversion with PyMuPDF if MinerU is unavailable
  • Detailed error messages and logs
  • API responses include comprehensive details for debugging
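
The fallback behaviour described above can be pictured with the sketch below. It is an illustration under stated assumptions, not the service's actual code: convert_with_fallback is a hypothetical helper, the two converters are passed in as plain callables rather than real MinerU or PyMuPDF calls, and the message and log wording is invented for the example. Only the "success", "message", "markdown", and "log" keys come from the API response shown earlier.

```python
from typing import Callable


def convert_with_fallback(
    pdf_path: str,
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
) -> dict:
    """Try the primary converter, then the fallback; always return a response dict."""
    log = []
    for name, converter in (("primary", primary), ("fallback", fallback)):
        try:
            markdown = converter(pdf_path)
        except Exception as exc:  # deliberately broad: any failure moves to the next option
            log.append(f"{name} converter failed: {exc}")
            continue
        log.append(f"{name} converter succeeded")
        return {
            "success": True,
            "message": f"PDF conversion successful ({name})",
            "markdown": markdown,
            "log": "\n".join(log),
        }
    # Both converters raised: report the failure along with the accumulated log.
    return {
        "success": False,
        "message": "All converters failed",
        "log": "\n".join(log),
    }
```

Because the log records every attempt, a caller can tell from the response alone whether the result came from the primary path or the fallback.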

Learn More

For more information about MinerU, visit the MinerU repository.