Spaces:

sfdghsdvxfbgn
/

node-2

Running

App Files Files Community

node-2 / README.md

sfdghsdvxfbgn

Upload 7 files

d42d358 verified 18 days ago

preview code

Raw

History Blame Contribute Delete

3.97 kB

	---
	title: MinerU OCR Service
	emoji: 📄
	colorFrom: blue
	colorTo: green
	sdk: docker
	app_port: 7860
	pinned: false
	---

	# MinerU OCR & Document Extraction Service

	Production-quality OCR and document extraction API powered by MinerU (`magic-pdf` package), running on Hugging Face Docker Spaces — free CPU tier.

	---

	## Root Cause Analysis — Previous Build Failures

	\| Cause \| Detail \| Fix Applied \|
	\|---\|---\|---\|
	\| `libreoffice` in apt \| ~1.5 GB installed; caused disk/OOM during build \| Removed — DOCX/PPTX/XLSX dropped \|
	\| X11 libs (`libsm6`, `libxext6`, `libxrender-dev`) \| Only needed for OpenCV GUI windows (cv2.imshow); headless server never uses a display \| Removed \|
	\| `libmagic1` \| C dep for python-magic, which was never imported \| Removed \|
	\| `python-magic` in pip \| Listed in requirements but never used in code \| Removed \|
	\| `wget` / `curl` in apt \| Runtime testing tools, not needed in container \| Removed \|
	\| Single pip layer \| Any failure re-downloads all packages including 2 GB magic-pdf \| Split into 2 cached layers \|
	\| MFR models downloaded \| Formula recognition (unimernet, ~1-2 GB) downloaded even though disabled \| Excluded via ignore_patterns \|
	\| `ModuleNotFoundError: magic_pdf.pipe.UNIPipe` \| API removed in magic-pdf >= 1.0 \| Replaced with current API \|

	---

	## Supported File Types

	\| Category \| Extensions \|
	\|---\|---\|
	\| PDF \| `.pdf` (searchable, scanned, mixed) \|
	\| Images \| `.jpg` `.jpeg` `.png` `.webp` `.bmp` `.tiff` `.tif` `.gif` `.heic` `.heif` `.avif` \|

	> DOCX / PPTX / XLSX were removed because they required LibreOffice, which caused build failures on free-tier hardware. Add them back once the base deployment is stable.

	---

	## API Endpoints

	### `GET /health`
	```json
	{ "status": "healthy" }
	```

	### `GET /status`
	```json
	{
	"status": "healthy",
	"provider": "mineru",
	"version": "1.3.12",
	"modelsLoaded": true,
	"uptimeSeconds": 3742,
	"memoryUsedMB": 5200,
	"memoryTotalMB": 16384,
	"activeRequests": 0,
	"cacheEntries": 3
	}
	```

	### `POST /extract`
	```json
	{
	"success": true,
	"filename": "invoice.pdf",
	"docType": "invoice",
	"pageCount": 2,
	"confidence": 0.95,
	"markdown": "# Invoice\n\n...",
	"metadata": {
	"parseMethod": "txt",
	"backend": "pipeline",
	"cached": false,
	"processingTimeMs": 4200,
	"docTypeClassification": "invoice",
	"imageCount": 0,
	"tableCount": 1,
	"formulaCount": 0
	}
	}
	```

	### `POST /batch`
	```json
	{
	"success": true,
	"processed": 3,
	"results": [...]
	}
	```

	---

	## Deployment

	1. Create a new Hugging Face Space → Docker SDK
	2. Upload all files from this directory to the Space repo root
	3. Space builds automatically (~20–30 min for first build; subsequent code-only rebuilds are faster due to Docker layer cache)
	4. Test with curl:

	```bash
	curl https://<your-space>.hf.space/health
	curl -X POST https://<your-space>.hf.space/extract -F "file=@doc.pdf"
	curl -X POST https://<your-space>.hf.space/batch \
	-F "files=@a.pdf" -F "files=@b.jpg" -F "files=@c.png"
	```

	---

	## Resource Budget (free CPU Basic)

	\| Resource \| Budget \| Usage \|
	\|---\|---\|---\|
	\| Disk \| 50 GB \| ~8–10 GB (packages ~3 GB + models ~4–5 GB) \|
	\| RAM \| 16 GB \| ~6–8 GB at load; peaks ~10 GB during extraction \|
	\| vCPU \| 2 \| Single uvicorn worker; sequential batch processing \|

	---

	## Known Limitations

	1. No DOCX/PPTX/XLSX — dropped to make the build reliable. Re-add with LibreOffice on paid hardware.
	2. Formula OCR disabled — MFR models excluded to save ~1-2 GB. Set `"enable": true` in `~/magic-pdf.json` and re-download MFR models to enable.
	3. First-request latency — models are lazy-loaded into RAM on the first request (~30–60 s).
	4. 30 MB file size limit — enforced on both `/extract` and `/batch`.
	5. In-process cache — SHA256 cache lives in RAM, cleared on container restart.
	6. HEIC/HEIF — requires `pillow-heif` (included); may fail on unusual encoder variants.