---
title: Document Classification
emoji: 🏷️
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.44.1
app_file: app.py
pinned: false
license: mit
short_description: Classify PDF documents into product categories using AI
---

Document Classification App

A Gradio-based web application that classifies PDF documents into product categories using AI-powered analysis.

Features

📄 PDF Upload: Upload any PDF document
🔍 Text Extraction: Automatically extract text from PDFs
🏷️ AI-Powered Classification: Classify documents into product categories
🎯 Multiple Methods: Choose between semantic similarity, keyword matching, or hybrid approach
📊 Confidence Scores: Get confidence scores for each classification
⚙️ Customizable Products: Define your own product categories and descriptions

How to Use

Upload PDF: Click the upload button and select your PDF file
Choose Method: Select your preferred classification method (hybrid, semantic, or keyword)
Define Products: Use the default product definitions or customize your own in JSON format
Classify: Click "Classify Document" to analyze the PDF
View Results: See the top 3 product matches with confidence scores

Setup

For Hugging Face Spaces (Production)

Set your OPENAI_API_KEY in the Space settings:
- Go to your Space settings
- Add OPENAI_API_KEY as a secret
- Enter your OpenAI API key

For Local Development

Clone this repository
Install dependencies: pip install -r requirements.txt
Provide your OpenAI API key in one of two ways. Either create a .env file in the project root:
```
OPENAI_API_KEY=your-api-key-here
```
Or set it as an environment variable: export OPENAI_API_KEY="your-api-key-here"
Run the app: python app.py
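
In both setups the key reaches the app through the process environment. A minimal sketch of that lookup, assuming a hypothetical helper named `get_api_key` (not taken from `app.py`) that fails early with a readable message when the key is missing:

```python
import os


def get_api_key(env=None):
    """Return the OpenAI API key from the environment, or raise if unset.

    `get_api_key` is a hypothetical helper for illustration; the real app
    may read the key differently.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Add it as a Space secret, "
            "put it in a .env file, or export it in your shell."
        )
    return key
```

Python does not read a `.env` file on its own. Something has to load it first, for example `load_dotenv()` from the python-dotenv package. Whether this app does so is an assumption.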

Technical Details

Framework: Gradio for the web interface
Embeddings: OpenAI embeddings for semantic similarity
Vector Store: LangChain InMemoryVectorStore for efficient similarity search
Classification Methods: Semantic similarity, keyword matching, and hybrid approach
Text Processing: PyPDF for PDF text extraction
Architecture: Simple modular design with clean separation of concerns
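
How the three classification methods relate can be shown with a small sketch. Everything in it is an assumption for illustration: a bag-of-words vector stands in for the OpenAI embeddings, the keyword score is plain word overlap, and the 0.7/0.3 weighting, the product fields, and the names `classify`, `semantic_score`, and `keyword_score` are hypothetical rather than taken from the app.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def semantic_score(doc, description):
    """Cosine similarity of word-count vectors (stand-in for embeddings)."""
    a, b = Counter(tokenize(doc)), Counter(tokenize(description))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def keyword_score(doc, keywords):
    """Fraction of the product's single-word keywords found in the document."""
    words = set(tokenize(doc))
    hits = sum(1 for k in keywords if k.lower() in words)
    return hits / len(keywords) if keywords else 0.0


def classify(doc, products, method="hybrid", weight=0.7, top_k=3):
    """Rank products by score; `weight` is the semantic share in hybrid mode."""
    results = []
    for name, spec in products.items():
        sem = semantic_score(doc, spec["description"])
        kw = keyword_score(doc, spec["keywords"])
        if method == "semantic":
            score = sem
        elif method == "keyword":
            score = kw
        elif method == "hybrid":
            score = weight * sem + (1 - weight) * kw
        else:
            raise ValueError(f"unknown method: {method}")
        results.append((name, round(score, 3)))
    # Highest score first, truncated to the top_k matches.
    return sorted(results, key=lambda r: r[1], reverse=True)[:top_k]


# Hypothetical product definitions, used only to exercise the sketch.
products = {
    "Invoice": {
        "description": "A bill requesting payment for goods or services",
        "keywords": ["invoice", "payment", "due"],
    },
    "Receipt": {
        "description": "Proof that a payment was received",
        "keywords": ["receipt", "paid"],
    },
    "Flight Ticket": {
        "description": "An airline booking for a passenger flight",
        "keywords": ["flight", "boarding", "passenger"],
    },
}

ranking = classify("Invoice 1042: payment due in 30 days", products)
```

With this sample text, `ranking[0][0]` is `"Invoice"`. The scores in this sketch are relative rankings, not calibrated probabilities.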

Project Structure

pdf2product/
├── app.py                # Main Gradio application
├── requirements.txt      # Python dependencies
├── .env                  # Environment variables (create this)
└── pdf_qa/               # Core Q&A package
    ├── __init__.py       # Package initialization
    ├── pdf_processor.py  # PDF text extraction and chunking
    └── qa_engine.py      # Question answering engine

Architecture

The design follows three principles:

Separation of Concerns: UI logic (Gradio) is separate from business logic
Modularity: Two main components - PDF processing and Q&A
Simplicity: Minimal, focused modules that do one thing well

Example Product Categories

The app includes several example product configurations:

Invoice-Focused: Invoice, Receipt, Quote/Estimate
Travel-Focused: Flight Ticket, Hotel Reservation, Travel Insurance
Employment-Focused: CV/Resume, Job Offer, Employment Contract

Users can also define their own custom product categories in JSON format.
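
The exact JSON schema is defined by the app and is not documented here. As an illustration only, assuming a hypothetical layout that maps each product name to a `description` string and a `keywords` list, the invoice-focused set could be written and checked like this:

```python
import json

# Hypothetical schema: {"<name>": {"description": "<text>", "keywords": [...]}}.
# The field names are assumptions; check the app's default JSON for the real ones.
PRODUCTS_JSON = """
{
  "Invoice": {
    "description": "A bill requesting payment for goods or services",
    "keywords": ["invoice", "payment", "due"]
  },
  "Receipt": {
    "description": "Proof that a payment was made",
    "keywords": ["receipt", "paid"]
  },
  "Quote/Estimate": {
    "description": "A proposed price for future work",
    "keywords": ["quote", "estimate"]
  }
}
"""


def parse_products(raw):
    """Parse product definitions and reject malformed entries."""
    data = json.loads(raw)
    if not isinstance(data, dict) or not data:
        raise ValueError("expected a non-empty JSON object of products")
    for name, spec in data.items():
        if not isinstance(spec, dict):
            raise ValueError(f"product {name!r} must be a JSON object")
        description = spec.get("description")
        if not isinstance(description, str) or not description.strip():
            raise ValueError(f"product {name!r} needs a non-empty description")
        keywords = spec.get("keywords", [])
        if not isinstance(keywords, list) or not all(
            isinstance(k, str) for k in keywords
        ):
            raise ValueError(f"product {name!r} keywords must be a list of strings")
    return data


products = parse_products(PRODUCTS_JSON)
```

Validating the JSON before classification means a typo in a custom definition surfaces as a clear error instead of a confusing result.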

Limitations

Currently supports one PDF at a time
Requires OpenAI API key
Best results with text-based PDFs (not scanned images)
Processing time depends on document size
Classification accuracy depends on document content quality
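
The text-based PDF limitation can be checked before classifying. The sketch below extracts text with pypdf's `PdfReader` and `extract_text()`, then flags documents whose pages yield almost no text, which usually indicates scanned images. The helper names and the 20-character threshold are assumptions for illustration, not the app's actual code.

```python
def extract_page_texts(pdf_path):
    """Return the extracted text of each page using pypdf."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    # Treat a missing result as empty text so callers can always use len().
    return [(page.extract_text() or "") for page in reader.pages]


def looks_scanned(page_texts, min_chars_per_page=20):
    """Heuristic: True if the pages contain too little text on average."""
    if not page_texts:
        return True
    total = sum(len(t.strip()) for t in page_texts)
    return total / len(page_texts) < min_chars_per_page
```

If `looks_scanned(extract_page_texts("document.pdf"))` returns true, the app could warn the user instead of returning a low-quality classification. The file name here is a placeholder.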

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference