Spaces:
Running
Running
metadata
title: Pdf Extractor
emoji: 🚀
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
- streamlit
pinned: false
short_description: pdf_extractor
Pdf Extractor
This repository contains a PDF-to-JSON extractor and (optionally) a Hugging Face Space UI.
What's here
extractor.pyconverts PDFs to images, selects a JSON template, and calls OpenAI to extract structured data.templates/holds JSON schemas used for extraction.src/streamlit_app.pyis the Space UI entrypoint (when present).
Quick start (local)
python -m pip install -r requirements.txtpython extractor.py- Provide a PDF filename that matches a keyword in
TEMPLATE_REGISTRY(example:resume.pdf).
Streamlit UI
- Run locally with
streamlit run src/streamlit_app.py. - Upload a PDF, preview it, then click Extract to render JSON output.
- Set
OPENAI_API_KEYin your environment before running. - Space uses
streamlit==1.29.0for consistent upload behavior.
Samples and Supported Documents
- Sample PDFs are hosted in the Hugging Face dataset
pradyten/pdf-extractor-samples. - The Streamlit UI lists dataset PDFs under "Use sample." Override with
SAMPLE_DATASET_REPO. - Supported document types (based on templates):
- USCIS Form I-129 H-1B Petition
- Form I-94 Arrival/Departure Record
- Form I-20 Certificate of Eligibility
- Passport
- US Visa
- Academic Transcript
- Diploma
- Employment Letter
- Resume/CV
- Corporate Tax Returns
- Marriage Certificate
- Proof of In-Country Status
Hugging Face Space
This Space is configured to run a Streamlit app on port 8501. Set
OPENAI_API_KEY in Space secrets to enable extraction.