--- title: MinerU OCR Service emoji: 📄 colorFrom: blue colorTo: green sdk: docker app_port: 7860 pinned: false --- # MinerU OCR & Document Extraction Service Production-quality OCR and document extraction API powered by **MinerU** (`magic-pdf` package), running on Hugging Face Docker Spaces — free CPU tier. --- ## Root Cause Analysis — Previous Build Failures | Cause | Detail | Fix Applied | |---|---|---| | `libreoffice` in apt | ~1.5 GB installed; caused disk/OOM during build | **Removed** — DOCX/PPTX/XLSX dropped | | X11 libs (`libsm6`, `libxext6`, `libxrender-dev`) | Only needed for OpenCV GUI windows (cv2.imshow); headless server never uses a display | **Removed** | | `libmagic1` | C dep for python-magic, which was never imported | **Removed** | | `python-magic` in pip | Listed in requirements but never used in code | **Removed** | | `wget` / `curl` in apt | Runtime testing tools, not needed in container | **Removed** | | Single pip layer | Any failure re-downloads all packages including 2 GB magic-pdf | **Split into 2 cached layers** | | MFR models downloaded | Formula recognition (unimernet, ~1-2 GB) downloaded even though disabled | **Excluded via ignore_patterns** | | `ModuleNotFoundError: magic_pdf.pipe.UNIPipe` | API removed in magic-pdf >= 1.0 | **Replaced** with current API | --- ## Supported File Types | Category | Extensions | |---|---| | PDF | `.pdf` (searchable, scanned, mixed) | | Images | `.jpg` `.jpeg` `.png` `.webp` `.bmp` `.tiff` `.tif` `.gif` `.heic` `.heif` `.avif` | > DOCX / PPTX / XLSX were removed because they required LibreOffice, which caused build failures on free-tier hardware. Add them back once the base deployment is stable. --- ## API Endpoints ### `GET /health` ```json { "status": "healthy" } ``` ### `GET /status` ```json { "status": "healthy", "provider": "mineru", "version": "1.3.12", "modelsLoaded": true, "uptimeSeconds": 3742, "memoryUsedMB": 5200, "memoryTotalMB": 16384, "activeRequests": 0, "cacheEntries": 3 } ``` ### `POST /extract` ```json { "success": true, "filename": "invoice.pdf", "docType": "invoice", "pageCount": 2, "confidence": 0.95, "markdown": "# Invoice\n\n...", "metadata": { "parseMethod": "txt", "backend": "pipeline", "cached": false, "processingTimeMs": 4200, "docTypeClassification": "invoice", "imageCount": 0, "tableCount": 1, "formulaCount": 0 } } ``` ### `POST /batch` ```json { "success": true, "processed": 3, "results": [...] } ``` --- ## Deployment 1. Create a new Hugging Face Space → **Docker** SDK 2. Upload all files from this directory to the Space repo root 3. Space builds automatically (~20–30 min for first build; subsequent code-only rebuilds are faster due to Docker layer cache) 4. Test with curl: ```bash curl https://.hf.space/health curl -X POST https://.hf.space/extract -F "file=@doc.pdf" curl -X POST https://.hf.space/batch \ -F "files=@a.pdf" -F "files=@b.jpg" -F "files=@c.png" ``` --- ## Resource Budget (free CPU Basic) | Resource | Budget | Usage | |---|---|---| | Disk | 50 GB | ~8–10 GB (packages ~3 GB + models ~4–5 GB) | | RAM | 16 GB | ~6–8 GB at load; peaks ~10 GB during extraction | | vCPU | 2 | Single uvicorn worker; sequential batch processing | --- ## Known Limitations 1. **No DOCX/PPTX/XLSX** — dropped to make the build reliable. Re-add with LibreOffice on paid hardware. 2. **Formula OCR disabled** — MFR models excluded to save ~1-2 GB. Set `"enable": true` in `~/magic-pdf.json` and re-download MFR models to enable. 3. **First-request latency** — models are lazy-loaded into RAM on the first request (~30–60 s). 4. **30 MB file size limit** — enforced on both `/extract` and `/batch`. 5. **In-process cache** — SHA256 cache lives in RAM, cleared on container restart. 6. **HEIC/HEIF** — requires `pillow-heif` (included); may fail on unusual encoder variants.