---
title: Pdf Extractor
emoji: 🚀
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
- streamlit
pinned: false
short_description: pdf_extractor
---

# Pdf Extractor

This repository contains a PDF-to-JSON extractor and (optionally) a Hugging Face
Space UI.

## What's here
- `extractor.py` converts PDFs to images, selects a JSON template, and calls
  OpenAI to extract structured data.
- `templates/` holds JSON schemas used for extraction.
- `src/streamlit_app.py` is the Space UI entrypoint (when present).

## Quick start (local)
1. `python -m pip install -r requirements.txt`
2. `python extractor.py`
3. Provide a PDF filename that matches a keyword in `TEMPLATE_REGISTRY`
   (example: `resume.pdf`).

## Streamlit UI
- Run locally with `streamlit run src/streamlit_app.py`.
- Upload a PDF, preview it, then click Extract to render JSON output.
- Set `OPENAI_API_KEY` in your environment before running.
- Space uses `streamlit==1.29.0` for consistent upload behavior.

## Samples and Supported Documents
- Sample PDFs are hosted in the Hugging Face dataset `pradyten/pdf-extractor-samples`.
- The Streamlit UI lists dataset PDFs under "Use sample." Override with `SAMPLE_DATASET_REPO`.
- Supported document types (based on templates):
  - USCIS Form I-129 H-1B Petition
  - Form I-94 Arrival/Departure Record
  - Form I-20 Certificate of Eligibility
  - Passport
  - US Visa
  - Academic Transcript
  - Diploma
  - Employment Letter
  - Resume/CV
  - Corporate Tax Returns
  - Marriage Certificate
  - Proof of In-Country Status

## Hugging Face Space
This Space is configured to run a Streamlit app on port 8501. Set
`OPENAI_API_KEY` in Space secrets to enable extraction.