metadata
title: MinerU OCR Service
emoji: 📄
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false

MinerU OCR & Document Extraction Service

Production-quality OCR and document extraction API powered by MinerU (magic-pdf package), running on the free CPU tier of Hugging Face Docker Spaces.


Root Cause Analysis: Previous Build Failures

| Cause | Detail | Fix Applied |
| --- | --- | --- |
| libreoffice in apt | ~1.5 GB installed; caused disk/OOM failures during build | Removed; DOCX/PPTX/XLSX support dropped |
| X11 libs (libsm6, libxext6, libxrender-dev) | Only needed for OpenCV GUI windows (cv2.imshow); a headless server never uses a display | Removed |
| libmagic1 | C dependency of python-magic, which was never imported | Removed |
| python-magic in pip | Listed in requirements but never used in code | Removed |
| wget / curl in apt | Runtime testing tools, not needed in the container | Removed |
| Single pip layer | Any failure re-downloaded all packages, including the 2 GB magic-pdf | Split into 2 cached layers |
| MFR models downloaded | Formula recognition models (unimernet, ~1–2 GB) were downloaded even though the feature is disabled | Excluded via ignore_patterns |
| ModuleNotFoundError: magic_pdf.pipe.UNIPipe | API removed in magic-pdf >= 1.0 | Replaced with current API |
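The ignore_patterns fix can be sketched as follows. This is a minimal illustration, not the Space's actual download script: the repo id, the `models/MFR/*` folder layout, and the `/opt/models` target directory are assumptions. `snapshot_download` and its `ignore_patterns` parameter are part of `huggingface_hub`; the `is_skipped` helper only approximates its glob matching so the patterns can be checked offline.

```python
from fnmatch import fnmatch

# Assumed values for illustration; adjust to the real model repo layout.
MODEL_REPO = "opendatalab/PDF-Extract-Kit-1.0"
IGNORE_PATTERNS = ["models/MFR/*"]  # formula-recognition weights


def is_skipped(path: str, patterns=IGNORE_PATTERNS) -> bool:
    """Return True if a repo file path matches any ignore pattern."""
    return any(fnmatch(path, pattern) for pattern in patterns)


def download_models(local_dir: str = "/opt/models") -> str:
    """Download the model snapshot without the formula-recognition weights."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=MODEL_REPO,
        ignore_patterns=IGNORE_PATTERNS,
        local_dir=local_dir,
    )
```

Checking a few representative paths with `is_skipped` before a build is a cheap way to confirm that a pattern excludes only the intended folder.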

Supported File Types

| Category | Extensions |
| --- | --- |
| PDF | .pdf (searchable, scanned, mixed) |
| Images | .jpg .jpeg .png .webp .bmp .tiff .tif .gif .heic .heif .avif |

DOCX / PPTX / XLSX were removed because they required LibreOffice, which caused build failures on free-tier hardware. Add them back once the base deployment is stable.
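One way to enforce the supported types is an extension allow-list checked before any processing. The sketch below is hypothetical (the `classify` function and set names are not taken from the service code); it only mirrors the extensions listed above.

```python
from pathlib import Path
from typing import Optional

PDF_EXTENSIONS = {".pdf"}
IMAGE_EXTENSIONS = {
    ".jpg", ".jpeg", ".png", ".webp", ".bmp", ".tiff",
    ".tif", ".gif", ".heic", ".heif", ".avif",
}


def classify(filename: str) -> Optional[str]:
    """Return "pdf", "image", or None for an unsupported extension."""
    suffix = Path(filename).suffix.lower()  # case-insensitive match
    if suffix in PDF_EXTENSIONS:
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    return None
```

With this check, a DOCX upload is rejected up front instead of failing deep inside the extraction pipeline.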


API Endpoints

GET /health

{ "status": "healthy" }

GET /status

{
  "status": "healthy",
  "provider": "mineru",
  "version": "1.3.12",
  "modelsLoaded": true,
  "uptimeSeconds": 3742,
  "memoryUsedMB": 5200,
  "memoryTotalMB": 16384,
  "activeRequests": 0,
  "cacheEntries": 3
}
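A client might poll these endpoints before sending work. The following is a minimal sketch using only the standard library; the function names are illustrative, and it assumes the `/status` fields shown above. Because models load lazily on the first request, `modelsLoaded` is treated as a "warm" signal rather than a readiness requirement.

```python
import json
import time
import urllib.request


def fetch_status(base_url: str) -> dict:
    """GET /status and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/status", timeout=10) as resp:
        return json.load(resp)


def is_healthy(status: dict) -> bool:
    return status.get("status") == "healthy"


def is_warm(status: dict) -> bool:
    """True once the models are already loaded into RAM."""
    return bool(status.get("modelsLoaded"))


def memory_headroom_mb(status: dict) -> int:
    return status["memoryTotalMB"] - status["memoryUsedMB"]


def wait_until_healthy(base_url, attempts=10, delay_s=5.0, fetch=fetch_status):
    """Poll /status until it reports healthy or attempts run out."""
    for _ in range(attempts):
        try:
            if is_healthy(fetch(base_url)):
                return True
        except OSError:
            pass  # Space still building or waking up; retry
        time.sleep(delay_s)
    return False
```

The `fetch` parameter exists only so the polling loop can be exercised without a network connection.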

POST /extract

{
  "success": true,
  "filename": "invoice.pdf",
  "docType": "invoice",
  "pageCount": 2,
  "confidence": 0.95,
  "markdown": "# Invoice\n\n...",
  "metadata": {
    "parseMethod": "txt",
    "backend": "pipeline",
    "cached": false,
    "processingTimeMs": 4200,
    "docTypeClassification": "invoice",
    "imageCount": 0,
    "tableCount": 1,
    "formulaCount": 0
  }
}
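A Python client for this endpoint could look like the sketch below. It assumes the third-party `requests` package, the `file` form field used in the curl examples, and that a failed response carries `"success": false`; the error shape is not documented here, so treat that part as a guess.

```python
def summarize_extraction(payload: dict) -> str:
    """Build a one-line summary from an /extract response body."""
    if not payload.get("success"):
        raise ValueError("extraction failed")
    meta = payload.get("metadata", {})
    return (
        f"{payload['filename']}: {payload['pageCount']} page(s), "
        f"type={payload['docType']}, confidence={payload['confidence']:.2f}, "
        f"tables={meta.get('tableCount', 0)}, cached={meta.get('cached', False)}"
    )


def extract(path: str, base_url: str) -> dict:
    """Upload one file to POST /extract and return the decoded JSON."""
    import requests  # third-party; imported lazily

    with open(path, "rb") as fh:
        # Generous timeout: the first request also loads the models.
        resp = requests.post(f"{base_url}/extract", files={"file": fh}, timeout=300)
    resp.raise_for_status()
    return resp.json()
```

For example, `print(summarize_extraction(extract("doc.pdf", "https://<your-space>.hf.space")))` would print a compact summary, and the `markdown` field of the returned dict holds the extracted text.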

POST /batch

{
  "success": true,
  "processed": 3,
  "results": [...]
}

Deployment

  1. Create a new Hugging Face Space → Docker SDK
  2. Upload all files from this directory to the Space repo root
  3. Space builds automatically (~20–30 min for first build; subsequent code-only rebuilds are faster due to Docker layer cache)
  4. Test with curl:
curl https://<your-space>.hf.space/health
curl -X POST https://<your-space>.hf.space/extract -F "file=@doc.pdf"
curl -X POST https://<your-space>.hf.space/batch \
  -F "files=@a.pdf" -F "files=@b.jpg" -F "files=@c.png"

Resource Budget (free CPU Basic)

| Resource | Budget | Usage |
| --- | --- | --- |
| Disk | 50 GB | ~8–10 GB (packages ~3 GB + models ~4–5 GB) |
| RAM | 16 GB | ~6–8 GB at load; peaks ~10 GB during extraction |
| vCPU | 2 | Single uvicorn worker; sequential batch processing |
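Sequential batch processing keeps peak RAM at one document's working set rather than multiplying it by the batch size. A minimal sketch of the idea, assuming Python 3.9+ and a hypothetical blocking `run_extraction` function standing in for the real MinerU call:

```python
import asyncio


def run_extraction(name: str) -> dict:
    """Placeholder for the blocking, CPU-heavy extraction (hypothetical)."""
    return {"filename": name, "success": True}


async def process_batch(names: list[str]) -> dict:
    """Process files strictly one after another, preserving input order."""
    results = []
    for name in names:
        # Await each call before starting the next, so only one document
        # is in memory at a time; the thread keeps the event loop free.
        result = await asyncio.to_thread(run_extraction, name)
        results.append(result)
    return {
        "success": all(r["success"] for r in results),
        "processed": len(results),
        "results": results,
    }
```

Running the blocking work in a thread lets cheap endpoints such as /health keep answering while a long extraction is in progress, without adding a second worker process.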

Known Limitations

  1. No DOCX/PPTX/XLSX: dropped to make the build reliable. Re-add with LibreOffice on paid hardware.
  2. Formula OCR disabled: MFR models are excluded to save 1–2 GB. To enable it, set "enable": true in `/magic-pdf.json` and re-download the MFR models.
  3. First-request latency: models are lazy-loaded into RAM on the first request (~30–60 s).
  4. 30 MB file size limit: enforced on both /extract and /batch.
  5. In-process cache: the SHA256 cache lives in RAM and is cleared on container restart.
  6. HEIC/HEIF: requires pillow-heif (included); may fail on unusual encoder variants.
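Limitations 4 and 5 can be illustrated together. The sketch below is not the service's implementation: `extract_cached` and `extract_fn` are hypothetical names, and it assumes "30 MB" means 30 MiB. It keys results by the SHA256 digest of the uploaded bytes in a plain dict, which is why the cache disappears when the container restarts.

```python
import hashlib

MAX_BYTES = 30 * 1024 * 1024  # assumption: "30 MB" interpreted as 30 MiB

# Process-local cache: lost whenever the container restarts.
_cache: dict[str, dict] = {}


def extract_cached(data: bytes, extract_fn) -> dict:
    """Reject oversized uploads, then reuse results for identical bytes."""
    if len(data) > MAX_BYTES:
        raise ValueError("file exceeds the 30 MB limit")
    key = hashlib.sha256(data).hexdigest()
    if key in _cache:
        return {**_cache[key], "cached": True}
    result = extract_fn(data)  # the expensive extraction step
    _cache[key] = result
    return {**result, "cached": False}
```

Hashing the content rather than the filename means a renamed copy of the same document still hits the cache.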