node-2 / README.md
sfdghsdvxfbgn's picture
Upload 7 files
d42d358 verified
|
Raw
History Blame Contribute Delete
3.97 kB
---
title: MinerU OCR Service
emoji: πŸ“„
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
---
# MinerU OCR & Document Extraction Service
Production-quality OCR and document extraction API powered by **MinerU** (`magic-pdf` package), running on Hugging Face Docker Spaces β€” free CPU tier.
---
## Root Cause Analysis β€” Previous Build Failures
| Cause | Detail | Fix Applied |
|---|---|---|
| `libreoffice` in apt | ~1.5 GB installed; caused disk/OOM during build | **Removed** β€” DOCX/PPTX/XLSX dropped |
| X11 libs (`libsm6`, `libxext6`, `libxrender-dev`) | Only needed for OpenCV GUI windows (cv2.imshow); headless server never uses a display | **Removed** |
| `libmagic1` | C dep for python-magic, which was never imported | **Removed** |
| `python-magic` in pip | Listed in requirements but never used in code | **Removed** |
| `wget` / `curl` in apt | Runtime testing tools, not needed in container | **Removed** |
| Single pip layer | Any failure re-downloads all packages including 2 GB magic-pdf | **Split into 2 cached layers** |
| MFR models downloaded | Formula recognition (unimernet, ~1-2 GB) downloaded even though disabled | **Excluded via ignore_patterns** |
| `ModuleNotFoundError: magic_pdf.pipe.UNIPipe` | API removed in magic-pdf >= 1.0 | **Replaced** with current API |
---
## Supported File Types
| Category | Extensions |
|---|---|
| PDF | `.pdf` (searchable, scanned, mixed) |
| Images | `.jpg` `.jpeg` `.png` `.webp` `.bmp` `.tiff` `.tif` `.gif` `.heic` `.heif` `.avif` |
> DOCX / PPTX / XLSX were removed because they required LibreOffice, which caused build failures on free-tier hardware. Add them back once the base deployment is stable.
---
## API Endpoints
### `GET /health`
```json
{ "status": "healthy" }
```
### `GET /status`
```json
{
"status": "healthy",
"provider": "mineru",
"version": "1.3.12",
"modelsLoaded": true,
"uptimeSeconds": 3742,
"memoryUsedMB": 5200,
"memoryTotalMB": 16384,
"activeRequests": 0,
"cacheEntries": 3
}
```
### `POST /extract`
```json
{
"success": true,
"filename": "invoice.pdf",
"docType": "invoice",
"pageCount": 2,
"confidence": 0.95,
"markdown": "# Invoice\n\n...",
"metadata": {
"parseMethod": "txt",
"backend": "pipeline",
"cached": false,
"processingTimeMs": 4200,
"docTypeClassification": "invoice",
"imageCount": 0,
"tableCount": 1,
"formulaCount": 0
}
}
```
### `POST /batch`
```json
{
"success": true,
"processed": 3,
"results": [...]
}
```
---
## Deployment
1. Create a new Hugging Face Space β†’ **Docker** SDK
2. Upload all files from this directory to the Space repo root
3. Space builds automatically (~20–30 min for first build; subsequent code-only rebuilds are faster due to Docker layer cache)
4. Test with curl:
```bash
curl https://<your-space>.hf.space/health
curl -X POST https://<your-space>.hf.space/extract -F "file=@doc.pdf"
curl -X POST https://<your-space>.hf.space/batch \
-F "files=@a.pdf" -F "files=@b.jpg" -F "files=@c.png"
```
---
## Resource Budget (free CPU Basic)
| Resource | Budget | Usage |
|---|---|---|
| Disk | 50 GB | ~8–10 GB (packages ~3 GB + models ~4–5 GB) |
| RAM | 16 GB | ~6–8 GB at load; peaks ~10 GB during extraction |
| vCPU | 2 | Single uvicorn worker; sequential batch processing |
---
## Known Limitations
1. **No DOCX/PPTX/XLSX** β€” dropped to make the build reliable. Re-add with LibreOffice on paid hardware.
2. **Formula OCR disabled** β€” MFR models excluded to save ~1-2 GB. Set `"enable": true` in `~/magic-pdf.json` and re-download MFR models to enable.
3. **First-request latency** β€” models are lazy-loaded into RAM on the first request (~30–60 s).
4. **30 MB file size limit** β€” enforced on both `/extract` and `/batch`.
5. **In-process cache** β€” SHA256 cache lives in RAM, cleared on container restart.
6. **HEIC/HEIF** β€” requires `pillow-heif` (included); may fail on unusual encoder variants.