Spaces:

pradyten
/

pdf-extractor

Running

App Files Files Community

pdf-extractor / README.md

github-actions[bot]

Sync from GitHub

229a366 20 days ago

preview code

raw

history blame contribute delete

1.69 kB

	---
	title: Pdf Extractor
	emoji: 🚀
	colorFrom: red
	colorTo: red
	sdk: docker
	app_port: 8501
	tags:
	- streamlit
	pinned: false
	short_description: pdf_extractor
	---

	# Pdf Extractor

	This repository contains a PDF-to-JSON extractor and (optionally) a Hugging Face
	Space UI.

	## What's here
	- `extractor.py` converts PDFs to images, selects a JSON template, and calls
	OpenAI to extract structured data.
	- `templates/` holds JSON schemas used for extraction.
	- `src/streamlit_app.py` is the Space UI entrypoint (when present).

	## Quick start (local)
	1. `python -m pip install -r requirements.txt`
	2. `python extractor.py`
	3. Provide a PDF filename that matches a keyword in `TEMPLATE_REGISTRY`
	(example: `resume.pdf`).

	## Streamlit UI
	- Run locally with `streamlit run src/streamlit_app.py`.
	- Upload a PDF, preview it, then click Extract to render JSON output.
	- Set `OPENAI_API_KEY` in your environment before running.
	- Space uses `streamlit==1.29.0` for consistent upload behavior.

	## Samples and Supported Documents
	- Sample PDFs are hosted in the Hugging Face dataset `pradyten/pdf-extractor-samples`.
	- The Streamlit UI lists dataset PDFs under "Use sample." Override with `SAMPLE_DATASET_REPO`.
	- Supported document types (based on templates):
	- USCIS Form I-129 H-1B Petition
	- Form I-94 Arrival/Departure Record
	- Form I-20 Certificate of Eligibility
	- Passport
	- US Visa
	- Academic Transcript
	- Diploma
	- Employment Letter
	- Resume/CV
	- Corporate Tax Returns
	- Marriage Certificate
	- Proof of In-Country Status

	## Hugging Face Space
	This Space is configured to run a Streamlit app on port 8501. Set
	`OPENAI_API_KEY` in Space secrets to enable extraction.