Spaces:

pradyten
/

pdf-extractor

Running

File size: 3,363 Bytes

8e52fc5

---
title: Pdf Extractor
emoji: 🚀
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
- streamlit
pinned: false
short_description: pdf_extractor
---

# PDF-to-JSON Extractor with AI

Intelligent PDF document parser that extracts structured JSON data using OpenAI's GPT models and computer vision.

## 📋 Table of Contents
- [Overview](#overview)
- [Features](#features)
- [Technology Stack](#technology-stack)
- [Installation](#installation)
- [Usage](#usage)
- [Configuration](#configuration)
- [Author](#author)

## 🎯 Overview

This application converts PDF documents into structured JSON format using:
- **OpenAI GPT-4 Vision**: For intelligent content extraction
- **Template-based extraction**: Customizable JSON schemas for different document types
- **Streamlit UI**: Interactive web interface for easy PDF processing
- **Docker support**: Containerized deployment for production environments

Perfect for automating data extraction from resumes, invoices, forms, and other structured documents.

## ✨ Features

- **AI-Powered Extraction**: Uses GPT-4 Vision to understand document structure
- **Template System**: Pre-configured JSON templates for common document types
- **Batch Processing**: Handle multiple PDFs efficiently
- **Image Preview**: Visual confirmation of PDF pages before extraction
- **Format Validation**: Ensures extracted JSON matches defined schema
- **Hugging Face Spaces**: Ready for cloud deployment

## 🛠 Technology Stack

- **Python 3.9+** - Primary programming language
- **OpenAI API** - GPT-4 Vision for intelligent extraction
- **pypdfium2** - PDF rendering and image conversion
- **Streamlit** - Interactive web UI framework
- **Pillow (PIL)** - Image processing
- **Pandas** - Data manipulation

## 🚀 Installation

### Prerequisites
- Python 3.9 or higher
- OpenAI API key ([Get one here](https://platform.openai.com/api-keys))

### Setup

1. Clone the repository:
\`\`\`bash
git clone https://github.com/pradyten/pdf-extractor.git
cd pdf-extractor
\`\`\`

2. Install dependencies:
\`\`\`bash
pip install -r requirements.txt
\`\`\`

3. Configure OpenAI API key:
\`\`\`bash
export OPENAI_API_KEY='your-api-key-here'
\`\`\`

## 💻 Usage

### Command Line
\`\`\`bash
python extractor.py path/to/document.pdf
\`\`\`

### Streamlit Web UI
\`\`\`bash
streamlit run src/streamlit_app.py
\`\`\`

### Docker
\`\`\`bash
docker build -t pdf-extractor .
docker run -p 8501:8501 -e OPENAI_API_KEY='your-key' pdf-extractor
\`\`\`

## ⚙️ Configuration

Define custom templates in \`extractor.py\` for different document types (resumes, invoices, forms).

## 🎓 Use Cases

- **HR & Recruitment**: Batch process resume PDFs
- **Accounting**: Extract invoice data
- **Data Entry**: Automate form digitization
- **Document Management**: Convert scanned documents to searchable JSON

## 🔒 Security & Privacy

- Never commit API keys - use environment variables
- PDFs are processed in-memory, not stored
- Review OpenAI's data usage policies for compliance

## 👨‍💻 Author

**Pradyumn Tendulkar**

Data Science Graduate Student | ML Engineer

- GitHub: [@pradyten](https://github.com/pradyten)
- LinkedIn: [Pradyumn Tendulkar](https://www.linkedin.com/in/p-tendulkar/)
- Email: pktendulkar@wpi.edu

---

⭐ If you found this project helpful, please consider giving it a star!

📝 **License:** MIT