Spaces:

pradyten
/

pdf-extractor

Running

App Files Files Community

pdf-extractor / README.md

github-actions[bot]

Sync from GitHub

229a366 18 days ago

preview code

raw

history blame contribute delete

1.69 kB

metadata

title: Pdf Extractor
emoji: 🚀
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
  - streamlit
pinned: false
short_description: pdf_extractor

Pdf Extractor

This repository contains a PDF-to-JSON extractor and (optionally) a Hugging Face Space UI.

What's here

extractor.py converts PDFs to images, selects a JSON template, and calls OpenAI to extract structured data.
templates/ holds JSON schemas used for extraction.
src/streamlit_app.py is the Space UI entrypoint (when present).

Quick start (local)

python -m pip install -r requirements.txt
python extractor.py
Provide a PDF filename that matches a keyword in TEMPLATE_REGISTRY (example: resume.pdf).

Streamlit UI

Run locally with streamlit run src/streamlit_app.py.
Upload a PDF, preview it, then click Extract to render JSON output.
Set OPENAI_API_KEY in your environment before running.
Space uses streamlit==1.29.0 for consistent upload behavior.

Samples and Supported Documents

Sample PDFs are hosted in the Hugging Face dataset pradyten/pdf-extractor-samples.
The Streamlit UI lists dataset PDFs under "Use sample." Override with SAMPLE_DATASET_REPO.
Supported document types (based on templates):
- USCIS Form I-129 H-1B Petition
- Form I-94 Arrival/Departure Record
- Form I-20 Certificate of Eligibility
- Passport
- US Visa
- Academic Transcript
- Diploma
- Employment Letter
- Resume/CV
- Corporate Tax Returns
- Marriage Certificate
- Proof of In-Country Status

Hugging Face Space

This Space is configured to run a Streamlit app on port 8501. Set OPENAI_API_KEY in Space secrets to enable extraction.