
title: MinerU OCR Service
emoji: 📄
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false

MinerU OCR & Document Extraction Service

Production-quality OCR and document extraction API powered by MinerU (the `magic-pdf` package), running on the free CPU tier of Hugging Face Docker Spaces.

Root Cause Analysis — Previous Build Failures

Cause	Detail	Fix Applied
`libreoffice` in apt	~1.5 GB installed; caused disk/OOM during build	Removed — DOCX/PPTX/XLSX dropped
X11 libs (`libsm6`, `libxext6`, `libxrender-dev`)	Only needed for OpenCV GUI windows (cv2.imshow); headless server never uses a display	Removed
`libmagic1`	C dep for python-magic, which was never imported	Removed
`python-magic` in pip	Listed in requirements but never used in code	Removed
`wget` / `curl` in apt	Runtime testing tools, not needed in container	Removed
Single pip layer	Any failure re-downloads all packages including 2 GB magic-pdf	Split into 2 cached layers
MFR models downloaded	Formula recognition (unimernet, ~1-2 GB) downloaded even though disabled	Excluded via ignore_patterns
`ModuleNotFoundError: magic_pdf.pipe.UNIPipe`	API removed in magic-pdf >= 1.0	Replaced with current API
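The `ignore_patterns` fix in the table refers to glob-style filters applied when the model files are downloaded. The following is a minimal, standard-library sketch of how such a filter behaves. The file paths and the `models/MFR/*` pattern are hypothetical, chosen only to illustrate skipping formula-recognition weights, and `files_to_download` is an illustrative helper, not part of this service. In `huggingface_hub`, a list like this can be passed as the `ignore_patterns` argument of `snapshot_download`.

```python
from fnmatch import fnmatch

# Hypothetical glob pattern for the formula-recognition (MFR) weights.
IGNORE_PATTERNS = ["models/MFR/*"]


def files_to_download(repo_files, ignore_patterns=IGNORE_PATTERNS):
    """Return the files that do not match any ignore pattern."""
    return [
        path
        for path in repo_files
        if not any(fnmatch(path, pattern) for pattern in ignore_patterns)
    ]


# Hypothetical repository listing, for illustration only.
example_files = [
    "models/Layout/model.pt",
    "models/MFR/unimernet/weights.bin",
    "models/OCR/det.onnx",
]
kept = files_to_download(example_files)
print(kept)
```

Note that `fnmatch` lets `*` match across `/`, so a single pattern covers nested directories.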

Supported File Types

Category	Extensions
PDF	`.pdf` (searchable, scanned, mixed)
Images	`.jpg` `.jpeg` `.png` `.webp` `.bmp` `.tiff` `.tif` `.gif` `.heic` `.heif` `.avif`

DOCX / PPTX / XLSX were removed because they required LibreOffice, which caused build failures on free-tier hardware. Add them back once the base deployment is stable.
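A pre-upload check against the supported extensions might look like the sketch below. The README does not show how the service validates uploads, so `is_supported` and the case-insensitive comparison are assumptions made for illustration.

```python
from pathlib import Path

# Extensions copied from the "Supported File Types" table.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".jpg", ".jpeg", ".png", ".webp", ".bmp",
    ".tiff", ".tif", ".gif", ".heic", ".heif", ".avif",
}


def is_supported(filename: str) -> bool:
    """Check the file's final extension, ignoring case (an assumption)."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS
```

With this helper, `is_supported("scan.PDF")` is true and `is_supported("report.docx")` is false, which matches the removal of Office formats described above.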

API Endpoints

`GET /health`

{ "status": "healthy" }

`GET /status`

{
  "status": "healthy",
  "provider": "mineru",
  "version": "1.3.12",
  "modelsLoaded": true,
  "uptimeSeconds": 3742,
  "memoryUsedMB": 5200,
  "memoryTotalMB": 16384,
  "activeRequests": 0,
  "cacheEntries": 3
}
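As a rough illustration, the `/status` fields shown above could feed a small readiness and memory probe. The field names come from the sample response; `summarize_status` and the keys it returns are hypothetical.

```python
import json


def summarize_status(payload: str) -> dict:
    """Reduce a /status response body to a few monitoring signals."""
    status = json.loads(payload)
    used, total = status["memoryUsedMB"], status["memoryTotalMB"]
    return {
        # Ready only when the service is healthy and models are in RAM.
        "ready": status["status"] == "healthy" and status["modelsLoaded"],
        "idle": status["activeRequests"] == 0,
        "memoryUsedPct": round(100 * used / total, 1),
    }


# The sample response from this README, reused as test input.
sample = json.dumps({
    "status": "healthy", "provider": "mineru", "version": "1.3.12",
    "modelsLoaded": True, "uptimeSeconds": 3742, "memoryUsedMB": 5200,
    "memoryTotalMB": 16384, "activeRequests": 0, "cacheEntries": 3,
})
summary = summarize_status(sample)
print(summary)
```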

`POST /extract`

{
  "success": true,
  "filename": "invoice.pdf",
  "docType": "invoice",
  "pageCount": 2,
  "confidence": 0.95,
  "markdown": "# Invoice\n\n...",
  "metadata": {
    "parseMethod": "txt",
    "backend": "pipeline",
    "cached": false,
    "processingTimeMs": 4200,
    "docTypeClassification": "invoice",
    "imageCount": 0,
    "tableCount": 1,
    "formulaCount": 0
  }
}
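A client for `/extract` can be sketched with the standard library alone. The form field name `file` follows the curl example in the Deployment section. The helper names, the generic `application/octet-stream` part type, and the 300-second timeout are assumptions, not values taken from the service.

```python
import json
import os
import urllib.request
import uuid
from typing import Optional, Tuple


def encode_multipart(field: str, filename: str, data: bytes,
                     boundary: Optional[str] = None) -> Tuple[bytes, str]:
    """Build a multipart/form-data body holding a single file part."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract(base_url: str, path: str, timeout: float = 300.0) -> dict:
    """POST one local file to /extract and return the decoded JSON."""
    with open(path, "rb") as fh:
        body, content_type = encode_multipart(
            "file", os.path.basename(path), fh.read()
        )
    request = urllib.request.Request(
        f"{base_url}/extract",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call such as `extract("https://<your-space>.hf.space", "doc.pdf")["markdown"]` would then return the extracted Markdown, assuming the response has the shape shown above.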

`POST /batch`

{
  "success": true,
  "processed": 3,
  "results": [...]
}

Deployment

1. Create a new Hugging Face Space → Docker SDK
2. Upload all files from this directory to the Space repo root
3. Space builds automatically (~20–30 min for first build; subsequent code-only rebuilds are faster due to Docker layer cache)
4. Test with curl:

curl https://<your-space>.hf.space/health
curl -X POST https://<your-space>.hf.space/extract -F "file=@doc.pdf"
curl -X POST https://<your-space>.hf.space/batch \
  -F "files=@a.pdf" -F "files=@b.jpg" -F "files=@c.png"
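Because the first build can take a while, a script may want to poll `/health` before sending documents. The sketch below is one way to do that. The retry count, delay, and function names are assumptions, and `fetch` and `sleep` are injectable only so the loop can be exercised without a network.

```python
import json
import time
import urllib.request


def fetch_health(base_url: str, timeout: float = 10.0) -> dict:
    """GET /health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as response:
        return json.load(response)


def wait_until_healthy(base_url: str, attempts: int = 30, delay: float = 10.0,
                       fetch=fetch_health, sleep=time.sleep) -> int:
    """Poll until /health reports "healthy"; return the attempt number."""
    for attempt in range(1, attempts + 1):
        try:
            if fetch(base_url).get("status") == "healthy":
                return attempt
        except (OSError, ValueError):
            # OSError covers urllib's URLError/HTTPError and socket timeouts;
            # ValueError covers a non-JSON body while the Space is starting.
            pass
        if attempt < attempts:
            sleep(delay)
    raise TimeoutError(f"{base_url} not healthy after {attempts} attempts")
```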

Resource Budget (free CPU Basic)

Resource	Budget	Usage
Disk	50 GB	~8–10 GB (packages ~3 GB + models ~4–5 GB)
RAM	16 GB	~6–8 GB at load; peaks ~10 GB during extraction
vCPU	2	Single uvicorn worker; sequential batch processing

Known Limitations

No DOCX/PPTX/XLSX: dropped to make the build reliable. Re-add with LibreOffice on paid hardware.
Formula OCR disabled: MFR models are excluded to save ~1-2 GB. To enable it, set `"enable": true` in `~/magic-pdf.json` and re-download the MFR models.
First-request latency: models are lazy-loaded into RAM on the first request (~30–60 s).
30 MB file size limit: enforced on both `/extract` and `/batch`.
In-process cache: the SHA256 cache lives in RAM and is cleared on container restart.
HEIC/HEIF: requires `pillow-heif` (included); may fail on unusual encoder variants.
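The size limit and the in-process SHA256 cache can be pictured with the sketch below. It is not the service's code: `ExtractionCache`, `get_or_compute`, and reading "30 MB" as 30 MiB are all assumptions made for illustration.

```python
import hashlib

# Assumes "30 MB" means 30 * 1024 * 1024 bytes; the README does not say.
MAX_FILE_BYTES = 30 * 1024 * 1024


class ExtractionCache:
    """In-memory result cache keyed by the SHA-256 digest of the file bytes."""

    def __init__(self):
        self._entries = {}

    def get_or_compute(self, data: bytes, compute):
        """Return (result, cached). Oversized inputs are rejected first."""
        if len(data) > MAX_FILE_BYTES:
            raise ValueError("file exceeds the 30 MB limit")
        key = hashlib.sha256(data).hexdigest()
        if key in self._entries:
            return self._entries[key], True
        result = compute(data)
        self._entries[key] = result
        return result, False

    def __len__(self):
        return len(self._entries)
```

Since the dictionary lives in the process, a container restart starts with an empty cache, as noted in the list above. Identical bytes uploaded under a different filename would still hit the same entry in this sketch.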