Spaces:
Running
Running
File size: 3,363 Bytes
8e52fc5 | 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 | ---
title: Pdf Extractor
emoji: π
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
- streamlit
pinned: false
short_description: pdf_extractor
---
# PDF-to-JSON Extractor with AI
Intelligent PDF document parser that extracts structured JSON data using OpenAI's GPT models and computer vision.
## π Table of Contents
- [Overview](#overview)
- [Features](#features)
- [Technology Stack](#technology-stack)
- [Installation](#installation)
- [Usage](#usage)
- [Configuration](#configuration)
- [Author](#author)
## π― Overview
This application converts PDF documents into structured JSON format using:
- **OpenAI GPT-4 Vision**: For intelligent content extraction
- **Template-based extraction**: Customizable JSON schemas for different document types
- **Streamlit UI**: Interactive web interface for easy PDF processing
- **Docker support**: Containerized deployment for production environments
Perfect for automating data extraction from resumes, invoices, forms, and other structured documents.
## β¨ Features
- **AI-Powered Extraction**: Uses GPT-4 Vision to understand document structure
- **Template System**: Pre-configured JSON templates for common document types
- **Batch Processing**: Handle multiple PDFs efficiently
- **Image Preview**: Visual confirmation of PDF pages before extraction
- **Format Validation**: Ensures extracted JSON matches defined schema
- **Hugging Face Spaces**: Ready for cloud deployment
## π Technology Stack
- **Python 3.9+** - Primary programming language
- **OpenAI API** - GPT-4 Vision for intelligent extraction
- **pypdfium2** - PDF rendering and image conversion
- **Streamlit** - Interactive web UI framework
- **Pillow (PIL)** - Image processing
- **Pandas** - Data manipulation
## π Installation
### Prerequisites
- Python 3.9 or higher
- OpenAI API key ([Get one here](https://platform.openai.com/api-keys))
### Setup
1. Clone the repository:
\`\`\`bash
git clone https://github.com/pradyten/pdf-extractor.git
cd pdf-extractor
\`\`\`
2. Install dependencies:
\`\`\`bash
pip install -r requirements.txt
\`\`\`
3. Configure OpenAI API key:
\`\`\`bash
export OPENAI_API_KEY='your-api-key-here'
\`\`\`
## π» Usage
### Command Line
\`\`\`bash
python extractor.py path/to/document.pdf
\`\`\`
### Streamlit Web UI
\`\`\`bash
streamlit run src/streamlit_app.py
\`\`\`
### Docker
\`\`\`bash
docker build -t pdf-extractor .
docker run -p 8501:8501 -e OPENAI_API_KEY='your-key' pdf-extractor
\`\`\`
## βοΈ Configuration
Define custom templates in \`extractor.py\` for different document types (resumes, invoices, forms).
## π Use Cases
- **HR & Recruitment**: Batch process resume PDFs
- **Accounting**: Extract invoice data
- **Data Entry**: Automate form digitization
- **Document Management**: Convert scanned documents to searchable JSON
## π Security & Privacy
- Never commit API keys - use environment variables
- PDFs are processed in-memory, not stored
- Review OpenAI's data usage policies for compliance
## π¨βπ» Author
**Pradyumn Tendulkar**
Data Science Graduate Student | ML Engineer
- GitHub: [@pradyten](https://github.com/pradyten)
- LinkedIn: [Pradyumn Tendulkar](https://www.linkedin.com/in/p-tendulkar/)
- Email: pktendulkar@wpi.edu
---
β If you found this project helpful, please consider giving it a star!
π **License:** MIT
|