File size: 3,363 Bytes
8e52fc5
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
---
title: Pdf Extractor
emoji: πŸš€
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
- streamlit
pinned: false
short_description: pdf_extractor
---

# PDF-to-JSON Extractor with AI

Intelligent PDF document parser that extracts structured JSON data using OpenAI's GPT models and computer vision.

## πŸ“‹ Table of Contents
- [Overview](#overview)
- [Features](#features)
- [Technology Stack](#technology-stack)
- [Installation](#installation)
- [Usage](#usage)
- [Configuration](#configuration)
- [Author](#author)

## 🎯 Overview

This application converts PDF documents into structured JSON format using:
- **OpenAI GPT-4 Vision**: For intelligent content extraction
- **Template-based extraction**: Customizable JSON schemas for different document types
- **Streamlit UI**: Interactive web interface for easy PDF processing
- **Docker support**: Containerized deployment for production environments

Perfect for automating data extraction from resumes, invoices, forms, and other structured documents.

## ✨ Features

- **AI-Powered Extraction**: Uses GPT-4 Vision to understand document structure
- **Template System**: Pre-configured JSON templates for common document types
- **Batch Processing**: Handle multiple PDFs efficiently
- **Image Preview**: Visual confirmation of PDF pages before extraction
- **Format Validation**: Ensures extracted JSON matches defined schema
- **Hugging Face Spaces**: Ready for cloud deployment

## πŸ›  Technology Stack

- **Python 3.9+** - Primary programming language
- **OpenAI API** - GPT-4 Vision for intelligent extraction
- **pypdfium2** - PDF rendering and image conversion
- **Streamlit** - Interactive web UI framework
- **Pillow (PIL)** - Image processing
- **Pandas** - Data manipulation

## πŸš€ Installation

### Prerequisites
- Python 3.9 or higher
- OpenAI API key ([Get one here](https://platform.openai.com/api-keys))

### Setup

1. Clone the repository:
\`\`\`bash
git clone https://github.com/pradyten/pdf-extractor.git
cd pdf-extractor
\`\`\`

2. Install dependencies:
\`\`\`bash
pip install -r requirements.txt
\`\`\`

3. Configure OpenAI API key:
\`\`\`bash
export OPENAI_API_KEY='your-api-key-here'
\`\`\`

## πŸ’» Usage

### Command Line
\`\`\`bash
python extractor.py path/to/document.pdf
\`\`\`

### Streamlit Web UI
\`\`\`bash
streamlit run src/streamlit_app.py
\`\`\`

### Docker
\`\`\`bash
docker build -t pdf-extractor .
docker run -p 8501:8501 -e OPENAI_API_KEY='your-key' pdf-extractor
\`\`\`

## βš™οΈ Configuration

Define custom templates in \`extractor.py\` for different document types (resumes, invoices, forms).

## πŸŽ“ Use Cases

- **HR & Recruitment**: Batch process resume PDFs
- **Accounting**: Extract invoice data
- **Data Entry**: Automate form digitization
- **Document Management**: Convert scanned documents to searchable JSON

## πŸ”’ Security & Privacy

- Never commit API keys - use environment variables
- PDFs are processed in-memory, not stored
- Review OpenAI's data usage policies for compliance

## πŸ‘¨β€πŸ’» Author

**Pradyumn Tendulkar**

Data Science Graduate Student | ML Engineer

- GitHub: [@pradyten](https://github.com/pradyten)
- LinkedIn: [Pradyumn Tendulkar](https://www.linkedin.com/in/p-tendulkar/)
- Email: pktendulkar@wpi.edu

---

⭐ If you found this project helpful, please consider giving it a star!

πŸ“ **License:** MIT