metadata
title: Pdf Extractor
emoji: 🚀
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
  - streamlit
pinned: false
short_description: pdf_extractor

Pdf Extractor

This repository contains a PDF-to-JSON extractor and (optionally) a Hugging Face Space UI.

What's here

  • extractor.py converts PDFs to images, selects a JSON template, and calls the OpenAI API to extract structured data.
  • templates/ holds JSON schemas used for extraction.
  • src/streamlit_app.py is the Space UI entrypoint (when present).

Quick start (local)

  1. python -m pip install -r requirements.txt
  2. python extractor.py
  3. Provide a PDF filename that matches a keyword in TEMPLATE_REGISTRY (example: resume.pdf).
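
Step 3 relies on matching the PDF filename against keywords in TEMPLATE_REGISTRY. The snippet below is a minimal sketch of how such a lookup could work, not the code in extractor.py. The registry contents, the template paths, and the select_template helper are all assumptions made for illustration; the real registry may use different keys and a different matching rule.

```python
from pathlib import Path
from typing import Optional

# Hypothetical registry: keyword -> template file under templates/.
# The real TEMPLATE_REGISTRY in extractor.py may use different keys and paths.
TEMPLATE_REGISTRY = {
    "resume": "templates/resume.json",
    "passport": "templates/passport.json",
    "i-94": "templates/i94.json",
}


def select_template(pdf_path: str) -> Optional[str]:
    """Return the template whose keyword appears in the PDF filename, if any."""
    stem = Path(pdf_path).stem.lower()  # filename without directory or extension
    for keyword, template in TEMPLATE_REGISTRY.items():
        if keyword in stem:
            return template
    return None  # no keyword matched; the caller decides how to handle this


if __name__ == "__main__":
    print(select_template("resume.pdf"))
    print(select_template("scan_0001.pdf"))
```

Under this assumed rule, a file named scan_0001.pdf matches nothing, so renaming the file to include a keyword would be the simplest way to get a template selected.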

Streamlit UI

  • Run locally with streamlit run src/streamlit_app.py.
  • Upload a PDF, preview it, then click Extract to render JSON output.
  • Set OPENAI_API_KEY in your environment before running.
  • The Space pins streamlit==1.29.0 for consistent upload behavior.
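
Because extraction depends on OPENAI_API_KEY, it can help to fail early with a clear message when the key is missing. The following is a sketch of such a check, assuming the key is read from the process environment; require_api_key is a hypothetical helper and is not part of this repository.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return OPENAI_API_KEY from the environment or raise a clear error.

    Hypothetical helper: `env` defaults to os.environ and exists only so the
    check can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before starting the app."
        )
    return key
```

Calling a check like this at startup surfaces a missing key before a PDF is uploaded, rather than at the first extraction request.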

Samples and Supported Documents

  • Sample PDFs are hosted in the Hugging Face dataset pradyten/pdf-extractor-samples.
  • The Streamlit UI lists the dataset's PDFs under "Use sample." Set SAMPLE_DATASET_REPO to use a different dataset.
  • Supported document types (based on templates):
    • USCIS Form I-129 H-1B Petition
    • Form I-94 Arrival/Departure Record
    • Form I-20 Certificate of Eligibility
    • Passport
    • US Visa
    • Academic Transcript
    • Diploma
    • Employment Letter
    • Resume/CV
    • Corporate Tax Returns
    • Marriage Certificate
    • Proof of In-Country Status
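
The SAMPLE_DATASET_REPO override described above can be sketched as a lookup with a default. This example assumes the override is an environment variable and that the UI shows only files ending in .pdf; both are assumptions, and sample_repo and pdf_samples are hypothetical helpers rather than functions from this repository. Fetching the actual file list from the Hub is left out so the sketch stays self-contained.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Default dataset named in this README.
DEFAULT_SAMPLE_REPO = "pradyten/pdf-extractor-samples"


def sample_repo(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the dataset repo ID, honoring SAMPLE_DATASET_REPO when set."""
    env = os.environ if env is None else env
    return env.get("SAMPLE_DATASET_REPO", "").strip() or DEFAULT_SAMPLE_REPO


def pdf_samples(filenames: Iterable[str]) -> List[str]:
    """Keep only PDF files from a repo file listing, sorted for stable display."""
    return sorted(name for name in filenames if name.lower().endswith(".pdf"))


if __name__ == "__main__":
    print(sample_repo())
    print(pdf_samples(["README.md", "resume.pdf", "Passport.PDF"]))
```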

Hugging Face Space

This Space is configured to run a Streamlit app on port 8501. Set OPENAI_API_KEY in Space secrets to enable extraction.