---
title: MinerU OCR Service
emoji: 📄
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
---

# MinerU OCR & Document Extraction Service

Production-quality OCR and document extraction API powered by **MinerU** (`magic-pdf` package), running on Hugging Face Docker Spaces — free CPU tier.

---

## Root Cause Analysis — Previous Build Failures

| Cause | Detail | Fix Applied |
|---|---|---|
| `libreoffice` in apt | ~1.5 GB installed; caused disk/OOM during build | **Removed** — DOCX/PPTX/XLSX dropped |
| X11 libs (`libsm6`, `libxext6`, `libxrender-dev`) | Only needed for OpenCV GUI windows (cv2.imshow); headless server never uses a display | **Removed** |
| `libmagic1` | C dep for python-magic, which was never imported | **Removed** |
| `python-magic` in pip | Listed in requirements but never used in code | **Removed** |
| `wget` / `curl` in apt | Runtime testing tools, not needed in container | **Removed** |
| Single pip layer | Any failure re-downloads all packages including 2 GB magic-pdf | **Split into 2 cached layers** |
| MFR models downloaded | Formula recognition (unimernet, ~1-2 GB) downloaded even though disabled | **Excluded via ignore_patterns** |
| `ModuleNotFoundError: magic_pdf.pipe.UNIPipe` | API removed in magic-pdf >= 1.0 | **Replaced** with current API |

---

## Supported File Types

| Category | Extensions |
|---|---|
| PDF | `.pdf` (searchable, scanned, mixed) |
| Images | `.jpg` `.jpeg` `.png` `.webp` `.bmp` `.tiff` `.tif` `.gif` `.heic` `.heif` `.avif` |

> DOCX / PPTX / XLSX were removed because they required LibreOffice, which caused build failures on free-tier hardware. Add them back once the base deployment is stable.

---

## API Endpoints

### `GET /health`
```json
{ "status": "healthy" }
```

### `GET /status`
```json
{
  "status": "healthy",
  "provider": "mineru",
  "version": "1.3.12",
  "modelsLoaded": true,
  "uptimeSeconds": 3742,
  "memoryUsedMB": 5200,
  "memoryTotalMB": 16384,
  "activeRequests": 0,
  "cacheEntries": 3
}
```

### `POST /extract`
```json
{
  "success": true,
  "filename": "invoice.pdf",
  "docType": "invoice",
  "pageCount": 2,
  "confidence": 0.95,
  "markdown": "# Invoice\n\n...",
  "metadata": {
    "parseMethod": "txt",
    "backend": "pipeline",
    "cached": false,
    "processingTimeMs": 4200,
    "docTypeClassification": "invoice",
    "imageCount": 0,
    "tableCount": 1,
    "formulaCount": 0
  }
}
```

### `POST /batch`
```json
{
  "success": true,
  "processed": 3,
  "results": [...]
}
```

---

## Deployment

1. Create a new Hugging Face Space → **Docker** SDK
2. Upload all files from this directory to the Space repo root
3. Space builds automatically (~20–30 min for first build; subsequent code-only rebuilds are faster due to Docker layer cache)
4. Test with curl:

```bash
curl https://<your-space>.hf.space/health
curl -X POST https://<your-space>.hf.space/extract -F "file=@doc.pdf"
curl -X POST https://<your-space>.hf.space/batch \
  -F "files=@a.pdf" -F "files=@b.jpg" -F "files=@c.png"
```

---

## Resource Budget (free CPU Basic)

| Resource | Budget | Usage |
|---|---|---|
| Disk | 50 GB | ~8–10 GB (packages ~3 GB + models ~4–5 GB) |
| RAM | 16 GB | ~6–8 GB at load; peaks ~10 GB during extraction |
| vCPU | 2 | Single uvicorn worker; sequential batch processing |

---

## Known Limitations

1. **No DOCX/PPTX/XLSX** — dropped to make the build reliable. Re-add with LibreOffice on paid hardware.
2. **Formula OCR disabled** — MFR models excluded to save ~1-2 GB. Set `"enable": true` in `~/magic-pdf.json` and re-download MFR models to enable.
3. **First-request latency** — models are lazy-loaded into RAM on the first request (~30–60 s).
4. **30 MB file size limit** — enforced on both `/extract` and `/batch`.
5. **In-process cache** — SHA256 cache lives in RAM, cleared on container restart.
6. **HEIC/HEIF** — requires `pillow-heif` (included); may fail on unusual encoder variants.