---
title: MinerUapi
emoji: 📄
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
---

# MinerU PDF Converter

This Space converts PDF files to Markdown and structured JSON using the MinerU PDF extraction tool.

## Features

- Web interface for uploading and converting PDF files
- RESTful API for programmatic access
- Health monitoring endpoint
- High-quality PDF extraction with support for tables, formulas, and complex layouts
- Output in both Markdown and structured JSON formats
- Comprehensive error handling and fallback mechanisms

## API Usage

The service exposes several API endpoints for programmatic access:

### 1. PDF Conversion Endpoint

```
POST /api/convert
```

**Request:**
- Content-Type: multipart/form-data
- Body: form field `file` containing the PDF file

**Response:**
```json
{
  "success": true,
  "message": "PDF conversion successful",
  "job_id": "uuid",
  "base_filename": "filename",
  "file_info": {
    "original_filename": "document.pdf",
    "size_bytes": 42950,
    "content_type": "application/pdf"
  },
  "markdown": "# Converted markdown content...",
  "json": { 
    "title": "Document Title",
    "sections": [...]
  },
  "log": "Processing log...",
  "files": {
    "markdown_path": "document.md",
    "json_path": "document.json"
  }
}
```
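
As a rough illustration of the request described above, the sketch below builds the `multipart/form-data` body using only the Python standard library and posts it to `/api/convert`. The helper names (`encode_pdf_upload`, `convert_pdf`), the `base_url` parameter, and the timeout value are assumptions made for this example and are not part of the repository. Only the `file` field name and the endpoint path come from the documentation above.

```python
import json
import os
import urllib.request
import uuid
from typing import Optional, Tuple


def encode_pdf_upload(
    filename: str, data: bytes, boundary: Optional[str] = None
) -> Tuple[bytes, str]:
    """Return (body, content_type) for a single-file multipart upload.

    The form field is named "file", as the endpoint expects.
    """
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n"
        "\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def convert_pdf(base_url: str, pdf_path: str, timeout: float = 300.0) -> dict:
    """POST a PDF to <base_url>/api/convert and return the decoded JSON reply.

    The timeout is an illustrative guess; conversion time depends on the PDF.
    """
    with open(pdf_path, "rb") as handle:
        data = handle.read()
    body, content_type = encode_pdf_upload(os.path.basename(pdf_path), data)
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/convert",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A caller would then read `result["markdown"]` and `result["json"]` from the returned dictionary, after checking that `result["success"]` is true.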

### 2. Health Check Endpoint

```
GET /health
```

**Response:**
```json
{
  "status": "healthy",
  "version": "1.1.0",
  "environment": {
    "python_version": "3.10.12",
    "platform": "Linux-6.1.58+-x86_64-with-glibc2.35",
    "processor": "x86_64"
  },
  "configuration": {
    "upload_folder_exists": true,
    "output_folder_exists": true,
    "magic_pdf_installed": true
  }
}
```
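
One possible way to interpret this payload on the client side is sketched below. The function name `health_problems` and the decision to treat every configuration flag as required are assumptions for illustration. Because the service can fall back to PyMuPDF, a false `magic_pdf_installed` flag may only indicate degraded output rather than an outage.

```python
# Flags shown in the sample health response above.
REQUIRED_FLAGS = (
    "upload_folder_exists",
    "output_folder_exists",
    "magic_pdf_installed",
)


def health_problems(health: dict) -> list:
    """Return a list of human-readable problems found in a /health payload.

    An empty list means the status is "healthy" and every flag is true.
    """
    problems = []
    if health.get("status") != "healthy":
        problems.append(f"status is {health.get('status')!r}")
    configuration = health.get("configuration") or {}
    for flag in REQUIRED_FLAGS:
        if configuration.get(flag) is not True:
            problems.append(f"{flag} is not true")
    return problems
```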

### Client Example

A Python client script (`api_client.py`) is included in this repository for easy integration:

```bash
# Example usage
python api_client.py path/to/your/document.pdf --api-url https://marcosremar2-mineruapi.hf.space
```

The client includes features such as:
- Automatic health check to verify API status
- Retry logic for failed requests
- Progress tracking
- Comprehensive error handling
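
The retry behaviour listed above can be pictured with a small sketch like the one below. It is not the code in `api_client.py`: the function name `call_with_retries`, the exponential backoff schedule, and the choice to retry only on `OSError` (which covers `urllib.error.URLError` and connection errors) are all assumptions made for illustration.

```python
import time


def call_with_retries(operation, attempts=3, base_delay=1.0,
                      retry_on=(OSError,), sleep=time.sleep):
    """Call operation() up to `attempts` times, doubling the delay each retry.

    The last failure is re-raised unchanged so the caller sees the real error.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the helper easy to test, since a test can record the delays instead of actually waiting.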

You can also use curl:

```bash
curl -X POST -F "file=@path/to/your/document.pdf" https://marcosremar2-mineruapi.hf.space/api/convert
```

And check health with:

```bash
curl https://marcosremar2-mineruapi.hf.space/health
```

## Web Interface

The Space also provides a web interface where you can:
- Upload PDF files for conversion
- View the generated Markdown and JSON
- Download the converted files
- View processing logs

## Implementation Details

This service uses:
- MinerU for high-quality PDF extraction
- PyMuPDF as a fallback conversion method
- Flask web server for the interface and API
- Docker container for deployment on Hugging Face Spaces

## Error Handling

The service includes robust error handling:
- Automatic fallback to local PyMuPDF-based conversion if MinerU is unavailable
- Detailed error messages and logs
- API responses include comprehensive details for debugging

## Learn More

For more information about MinerU, visit [the MinerU repository](https://github.com/opendatalab/MinerU).