metadata
title: MinerU OCR Service
emoji: π
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
MinerU OCR & Document Extraction Service
Production-quality OCR and document extraction API powered by MinerU (magic-pdf package), running on Hugging Face Docker Spaces (free CPU tier).
Root Cause Analysis: Previous Build Failures
| Cause | Detail | Fix Applied |
|---|---|---|
| `libreoffice` in apt | ~1.5 GB installed; caused disk/OOM during build | Removed; DOCX/PPTX/XLSX dropped |
| X11 libs (`libsm6`, `libxext6`, `libxrender-dev`) | Only needed for OpenCV GUI windows (`cv2.imshow`); headless server never uses a display | Removed |
| `libmagic1` | C dep for `python-magic`, which was never imported | Removed |
| `python-magic` in pip | Listed in requirements but never used in code | Removed |
| `wget` / `curl` in apt | Runtime testing tools, not needed in container | Removed |
| Single pip layer | Any failure re-downloads all packages including 2 GB magic-pdf | Split into 2 cached layers |
| MFR models downloaded | Formula recognition (unimernet, ~1-2 GB) downloaded even though disabled | Excluded via `ignore_patterns` |
| `ModuleNotFoundError: magic_pdf.pipe.UNIPipe` | API removed in magic-pdf >= 1.0 | Replaced with current API |
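The MFR exclusion above relies on glob-style ignore patterns applied at download time. As a rough illustration of how such patterns filter a file listing, the sketch below uses Python's standard `fnmatch` module. It assumes `ignore_patterns` refers to the parameter of that name in `huggingface_hub.snapshot_download`; the file paths and the pattern are hypothetical and do not reflect the actual model repository layout.

```python
from fnmatch import fnmatch


def filter_files(paths, ignore_patterns):
    """Return the paths that match none of the glob-style ignore patterns."""
    return [
        p for p in paths
        if not any(fnmatch(p, pattern) for pattern in ignore_patterns)
    ]


# Hypothetical repository listing; real model file names may differ.
listing = [
    "models/Layout/model.pt",
    "models/MFR/unimernet/weights.bin",
    "models/OCR/det.onnx",
]

# Hypothetical pattern that drops everything under the MFR directory.
kept = filter_files(listing, ignore_patterns=["models/MFR/*"])
print(kept)
```

Only the files that survive the filter would be downloaded, which is where the 1-2 GB saving would come from.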
Supported File Types
| Category | Extensions |
|---|---|
| PDF | .pdf (searchable, scanned, mixed) |
| Images | .jpg .jpeg .png .webp .bmp .tiff .tif .gif .heic .heif .avif |
DOCX / PPTX / XLSX were removed because they required LibreOffice, which caused build failures on free-tier hardware. Add them back once the base deployment is stable.
API Endpoints
GET /health
{ "status": "healthy" }
GET /status
{
  "status": "healthy",
  "provider": "mineru",
  "version": "1.3.12",
  "modelsLoaded": true,
  "uptimeSeconds": 3742,
  "memoryUsedMB": 5200,
  "memoryTotalMB": 16384,
  "activeRequests": 0,
  "cacheEntries": 3
}
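A client might use the `/status` payload to decide whether to send work. The following is a minimal sketch of such a check, using only the fields shown in the sample response; the `is_ready` helper and the `min_free_mb` threshold are illustrative assumptions, not part of the service.

```python
import json


def is_ready(status, min_free_mb=2048):
    """Decide whether the service looks ready to accept work.

    min_free_mb is an arbitrary illustrative threshold, not a documented value.
    """
    free_mb = status["memoryTotalMB"] - status["memoryUsedMB"]
    return (
        status["status"] == "healthy"
        and status["modelsLoaded"]
        and free_mb >= min_free_mb
    )


# Subset of the sample /status response shown above.
sample = json.loads("""
{
  "status": "healthy",
  "modelsLoaded": true,
  "memoryUsedMB": 5200,
  "memoryTotalMB": 16384,
  "activeRequests": 0
}
""")

print(is_ready(sample))
```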
POST /extract
{
  "success": true,
  "filename": "invoice.pdf",
  "docType": "invoice",
  "pageCount": 2,
  "confidence": 0.95,
  "markdown": "# Invoice\n\n...",
  "metadata": {
    "parseMethod": "txt",
    "backend": "pipeline",
    "cached": false,
    "processingTimeMs": 4200,
    "docTypeClassification": "invoice",
    "imageCount": 0,
    "tableCount": 1,
    "formulaCount": 0
  }
}
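To show how the `/extract` response fields fit together, here is a small sketch that turns a response into a one-line summary. The `summarize` helper is hypothetical. It also assumes that a failed request carries `"success": false`, which this README does not document.

```python
import json


def summarize(result):
    """Build a one-line summary from an /extract response dict."""
    if not result.get("success"):
        # Assumed failure shape; the error payload is not documented here.
        return f"{result.get('filename', '?')}: extraction failed"
    meta = result["metadata"]
    seconds = meta["processingTimeMs"] / 1000
    return (
        f"{result['filename']}: {result['docType']}, "
        f"{result['pageCount']} page(s), "
        f"confidence {result['confidence']:.2f}, "
        f"{meta['tableCount']} table(s), {seconds:.1f} s"
    )


# Subset of the sample /extract response shown above.
response = json.loads("""
{
  "success": true,
  "filename": "invoice.pdf",
  "docType": "invoice",
  "pageCount": 2,
  "confidence": 0.95,
  "metadata": {"cached": false, "processingTimeMs": 4200, "tableCount": 1}
}
""")

print(summarize(response))
# invoice.pdf: invoice, 2 page(s), confidence 0.95, 1 table(s), 4.2 s
```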
POST /batch
{
  "success": true,
  "processed": 3,
  "results": [...]
}
Deployment
- Create a new Hugging Face Space with the Docker SDK
- Upload all files from this directory to the Space repo root
- The Space builds automatically (~20–30 min for the first build; subsequent code-only rebuilds are faster thanks to the Docker layer cache)
- Test with curl:
curl https://<your-space>.hf.space/health
curl -X POST https://<your-space>.hf.space/extract -F "file=@doc.pdf"
curl -X POST https://<your-space>.hf.space/batch \
-F "files=@a.pdf" -F "files=@b.jpg" -F "files=@c.png"
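For clients that cannot shell out to curl, the single-file upload can be approximated with the Python standard library. The sketch below builds the multipart request by hand and does not send it. The `build_upload_request` helper and the `my-space` host name are placeholders, and the `file` field name is taken from the curl command above.

```python
import mimetypes
import urllib.request
import uuid


def build_upload_request(url, field, filename, data):
    """Build a multipart/form-data POST request for one file (not sent here)."""
    boundary = uuid.uuid4().hex
    content_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="{field}"; '
        f'filename="{filename}"\r\n'.encode(),
        f"Content-Type: {content_type}\r\n\r\n".encode(),
        data,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


req = build_upload_request(
    "https://my-space.hf.space/extract",  # placeholder host
    field="file",
    filename="doc.pdf",
    data=b"%PDF-1.4 example bytes",
)
print(req.get_method(), len(req.data))
# To send it: urllib.request.urlopen(req, timeout=120)
```

A generous timeout is likely sensible on the first call, since the models are loaded lazily.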
Resource Budget (free CPU Basic)
| Resource | Budget | Usage |
|---|---|---|
| Disk | 50 GB | ~8–10 GB (packages ~3 GB + models ~4–5 GB) |
| RAM | 16 GB | ~6–8 GB at load; peaks ~10 GB during extraction |
| vCPU | 2 | Single uvicorn worker; sequential batch processing |
Known Limitations
- No DOCX/PPTX/XLSX: dropped to make the build reliable. Re-add with LibreOffice on paid hardware.
- Formula OCR disabled: MFR models excluded to save 1-2 GB. Set `"enable": true` in `/magic-pdf.json` and re-download MFR models to enable.
- First-request latency: models are lazy-loaded into RAM on the first request (~30–60 s).
- 30 MB file size limit: enforced on both `/extract` and `/batch`.
- In-process cache: SHA256 cache lives in RAM, cleared on container restart.
- HEIC/HEIF: requires `pillow-heif` (included); may fail on unusual encoder variants.
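The size limit and the in-process cache can be pictured with a short sketch. This is not the service's actual code: `extract_cached`, `MAX_BYTES`, and the stand-in `fake_extract` are hypothetical names, and treating 30 MB as 30 × 1024 × 1024 bytes is an assumption.

```python
import hashlib

MAX_BYTES = 30 * 1024 * 1024  # 30 MB, assuming binary megabytes

_cache = {}  # SHA256 hex digest -> extraction result; lost on restart


def extract_cached(data, extract_fn):
    """Reject oversized uploads, then reuse results for identical file bytes."""
    if len(data) > MAX_BYTES:
        raise ValueError("file exceeds the 30 MB limit")
    key = hashlib.sha256(data).hexdigest()
    if key in _cache:
        return {**_cache[key], "cached": True}
    result = extract_fn(data)
    _cache[key] = result
    return {**result, "cached": False}


calls = []


def fake_extract(data):
    """Stand-in for the real extraction step; records each invocation."""
    calls.append(len(data))
    return {"markdown": "# stub"}


first = extract_cached(b"same bytes", fake_extract)
second = extract_cached(b"same bytes", fake_extract)
print(first["cached"], second["cached"], len(calls))
# False True 1
```

Because the dictionary lives in the worker process, a container restart empties it and the next request for the same file pays the full extraction cost again.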