Spaces:
Running
Running
| title: Pdf Extractor | |
| emoji: ๐ | |
| colorFrom: red | |
| colorTo: red | |
| sdk: docker | |
| app_port: 8501 | |
| tags: | |
| - streamlit | |
| pinned: false | |
| short_description: pdf_extractor | |
| # Pdf Extractor | |
| This repository contains a PDF-to-JSON extractor and (optionally) a Hugging Face | |
| Space UI. | |
| ## What's here | |
| - `extractor.py` converts PDFs to images, selects a JSON template, and calls | |
| OpenAI to extract structured data. | |
| - `templates/` holds JSON schemas used for extraction. | |
| - `src/streamlit_app.py` is the Space UI entrypoint (when present). | |
| ## Quick start (local) | |
| 1. `python -m pip install -r requirements.txt` | |
| 2. `python extractor.py` | |
| 3. Provide a PDF filename that matches a keyword in `TEMPLATE_REGISTRY` | |
| (example: `resume.pdf`). | |
| ## Streamlit UI | |
| - Run locally with `streamlit run src/streamlit_app.py`. | |
| - Upload a PDF, preview it, then click Extract to render JSON output. | |
| - Set `OPENAI_API_KEY` in your environment before running. | |
| - Space uses `streamlit==1.29.0` for consistent upload behavior. | |
| ## Samples and Supported Documents | |
| - Sample PDFs are hosted in the Hugging Face dataset `pradyten/pdf-extractor-samples`. | |
| - The Streamlit UI lists dataset PDFs under "Use sample." Override with `SAMPLE_DATASET_REPO`. | |
| - Supported document types (based on templates): | |
| - USCIS Form I-129 H-1B Petition | |
| - Form I-94 Arrival/Departure Record | |
| - Form I-20 Certificate of Eligibility | |
| - Passport | |
| - US Visa | |
| - Academic Transcript | |
| - Diploma | |
| - Employment Letter | |
| - Resume/CV | |
| - Corporate Tax Returns | |
| - Marriage Certificate | |
| - Proof of In-Country Status | |
| ## Hugging Face Space | |
| This Space is configured to run a Streamlit app on port 8501. Set | |
| `OPENAI_API_KEY` in Space secrets to enable extraction. | |