Spaces:
Running
Running
| title: MinerU OCR Service | |
| emoji: π | |
| colorFrom: blue | |
| colorTo: green | |
| sdk: docker | |
| app_port: 7860 | |
| pinned: false | |
| # MinerU OCR & Document Extraction Service | |
| Production-quality OCR and document extraction API powered by **MinerU** (`magic-pdf` package), running on Hugging Face Docker Spaces β free CPU tier. | |
| --- | |
| ## Root Cause Analysis β Previous Build Failures | |
| | Cause | Detail | Fix Applied | | |
| |---|---|---| | |
| | `libreoffice` in apt | ~1.5 GB installed; caused disk/OOM during build | **Removed** β DOCX/PPTX/XLSX dropped | | |
| | X11 libs (`libsm6`, `libxext6`, `libxrender-dev`) | Only needed for OpenCV GUI windows (cv2.imshow); headless server never uses a display | **Removed** | | |
| | `libmagic1` | C dep for python-magic, which was never imported | **Removed** | | |
| | `python-magic` in pip | Listed in requirements but never used in code | **Removed** | | |
| | `wget` / `curl` in apt | Runtime testing tools, not needed in container | **Removed** | | |
| | Single pip layer | Any failure re-downloads all packages including 2 GB magic-pdf | **Split into 2 cached layers** | | |
| | MFR models downloaded | Formula recognition (unimernet, ~1-2 GB) downloaded even though disabled | **Excluded via ignore_patterns** | | |
| | `ModuleNotFoundError: magic_pdf.pipe.UNIPipe` | API removed in magic-pdf >= 1.0 | **Replaced** with current API | | |
| --- | |
| ## Supported File Types | |
| | Category | Extensions | | |
| |---|---| | |
| | PDF | `.pdf` (searchable, scanned, mixed) | | |
| | Images | `.jpg` `.jpeg` `.png` `.webp` `.bmp` `.tiff` `.tif` `.gif` `.heic` `.heif` `.avif` | | |
| > DOCX / PPTX / XLSX were removed because they required LibreOffice, which caused build failures on free-tier hardware. Add them back once the base deployment is stable. | |
| --- | |
| ## API Endpoints | |
| ### `GET /health` | |
| ```json | |
| { "status": "healthy" } | |
| ``` | |
| ### `GET /status` | |
| ```json | |
| { | |
| "status": "healthy", | |
| "provider": "mineru", | |
| "version": "1.3.12", | |
| "modelsLoaded": true, | |
| "uptimeSeconds": 3742, | |
| "memoryUsedMB": 5200, | |
| "memoryTotalMB": 16384, | |
| "activeRequests": 0, | |
| "cacheEntries": 3 | |
| } | |
| ``` | |
| ### `POST /extract` | |
| ```json | |
| { | |
| "success": true, | |
| "filename": "invoice.pdf", | |
| "docType": "invoice", | |
| "pageCount": 2, | |
| "confidence": 0.95, | |
| "markdown": "# Invoice\n\n...", | |
| "metadata": { | |
| "parseMethod": "txt", | |
| "backend": "pipeline", | |
| "cached": false, | |
| "processingTimeMs": 4200, | |
| "docTypeClassification": "invoice", | |
| "imageCount": 0, | |
| "tableCount": 1, | |
| "formulaCount": 0 | |
| } | |
| } | |
| ``` | |
| ### `POST /batch` | |
| ```json | |
| { | |
| "success": true, | |
| "processed": 3, | |
| "results": [...] | |
| } | |
| ``` | |
| --- | |
| ## Deployment | |
| 1. Create a new Hugging Face Space β **Docker** SDK | |
| 2. Upload all files from this directory to the Space repo root | |
| 3. Space builds automatically (~20β30 min for first build; subsequent code-only rebuilds are faster due to Docker layer cache) | |
| 4. Test with curl: | |
| ```bash | |
| curl https://<your-space>.hf.space/health | |
| curl -X POST https://<your-space>.hf.space/extract -F "file=@doc.pdf" | |
| curl -X POST https://<your-space>.hf.space/batch \ | |
| -F "files=@a.pdf" -F "files=@b.jpg" -F "files=@c.png" | |
| ``` | |
| --- | |
| ## Resource Budget (free CPU Basic) | |
| | Resource | Budget | Usage | | |
| |---|---|---| | |
| | Disk | 50 GB | ~8β10 GB (packages ~3 GB + models ~4β5 GB) | | |
| | RAM | 16 GB | ~6β8 GB at load; peaks ~10 GB during extraction | | |
| | vCPU | 2 | Single uvicorn worker; sequential batch processing | | |
| --- | |
| ## Known Limitations | |
| 1. **No DOCX/PPTX/XLSX** β dropped to make the build reliable. Re-add with LibreOffice on paid hardware. | |
| 2. **Formula OCR disabled** β MFR models excluded to save ~1-2 GB. Set `"enable": true` in `~/magic-pdf.json` and re-download MFR models to enable. | |
| 3. **First-request latency** β models are lazy-loaded into RAM on the first request (~30β60 s). | |
| 4. **30 MB file size limit** β enforced on both `/extract` and `/batch`. | |
| 5. **In-process cache** β SHA256 cache lives in RAM, cleared on container restart. | |
| 6. **HEIC/HEIF** β requires `pillow-heif` (included); may fail on unusual encoder variants. | |