metadata
title: Document Classification
emoji: 🏷️
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.44.1
app_file: app.py
pinned: false
license: mit
short_description: Classify PDF documents into product categories using AI

Document Classification App

A Gradio-based web application that classifies PDF documents into product categories using OpenAI embeddings, keyword matching, or a hybrid of the two.

Features

  • 📄 PDF Upload: Upload any PDF document
  • 🔍 Text Extraction: Automatically extract text from PDFs
  • 🏷️ AI-Powered Classification: Classify documents into product categories
  • 🎯 Multiple Methods: Choose between semantic similarity, keyword matching, or hybrid approach
  • 📊 Confidence Scores: Get confidence scores for each classification
  • ⚙️ Customizable Products: Define your own product categories and descriptions

How to Use

  1. Upload PDF: Click the upload button and select your PDF file
  2. Choose Method: Select your preferred classification method (hybrid, semantic, or keyword)
  3. Define Products: Use the default product definitions or customize your own in JSON format
  4. Classify: Click "Classify Document" to analyze the PDF
  5. View Results: See the top 3 product matches with confidence scores

Setup

For Hugging Face Spaces (Production)

  1. Set your OPENAI_API_KEY in the Space settings:
    • Go to your Space settings
    • Add OPENAI_API_KEY as a secret
    • Enter your OpenAI API key

For Local Development

  1. Clone this repository
  2. Install dependencies: pip install -r requirements.txt
  3. Optionally create a .env file in the project root with your API key:
    OPENAI_API_KEY=your-api-key-here
    
    Or set it as an environment variable: export OPENAI_API_KEY="your-api-key"
  4. Run the app: python app.py
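A minimal sketch of a startup check for the API key, assuming the app reads it from the process environment. The helper name `require_api_key` is hypothetical and not part of this repository. The sketch does not parse `.env` files itself; loading them would need a separate loader, and which one the app uses is not stated here.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return OPENAI_API_KEY from the environment, or raise a clear error.

    Hypothetical helper for illustration; `env` defaults to os.environ and
    can be replaced with a plain dict in tests.
    """
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Add it to .env or export it in your shell."
        )
    return key
```

Calling `require_api_key()` before launching the Gradio interface would surface a missing key immediately instead of at the first classification request.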

Technical Details

  • Framework: Gradio for the web interface
  • Embeddings: OpenAI embeddings for semantic similarity
  • Vector Store: LangChain InMemoryVectorStore for efficient similarity search
  • Classification Methods: Semantic similarity, keyword matching, and hybrid approach
  • Text Processing: PyPDF for PDF text extraction
  • Architecture: Simple modular design with clean separation of concerns
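The three classification methods can be illustrated with a small, self-contained sketch. This is not the app's actual code: a bag-of-words cosine similarity stands in for OpenAI embeddings and the vector store, and the product schema (`description`, `keywords`), the function names, and the `weight` parameter are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity of two token-count vectors (stand-in for embeddings)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def keyword_score(doc_tokens, keywords):
    """Fraction of a product's keywords that occur in the document."""
    if not keywords:
        return 0.0
    present = set(doc_tokens)
    return sum(1 for k in keywords if k.lower() in present) / len(keywords)


def classify(text, products, method="hybrid", weight=0.5, top_k=3):
    """Return the top_k (product, score) pairs, best match first.

    `weight` is the share given to the semantic score in hybrid mode
    (an assumed parameter, not taken from the app).
    """
    doc_tokens = tokenize(text)
    doc_vec = Counter(doc_tokens)
    results = []
    for name, spec in products.items():
        semantic = cosine_similarity(doc_vec, Counter(tokenize(spec["description"])))
        keyword = keyword_score(doc_tokens, spec.get("keywords", []))
        if method == "semantic":
            score = semantic
        elif method == "keyword":
            score = keyword
        elif method == "hybrid":
            score = weight * semantic + (1 - weight) * keyword
        else:
            raise ValueError(f"Unknown method: {method!r}")
        results.append((name, score))
    results.sort(key=lambda item: item[1], reverse=True)
    return results[:top_k]
```

Both component scores lie between 0 and 1, so the hybrid score does too and can be read as a rough confidence value. In the real app the semantic score would come from embedding similarity rather than token overlap.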

Project Structure

pdf2product/
├── app.py                 # Main Gradio application
├── requirements.txt       # Python dependencies
├── .env                   # Environment variables (create this)
└── pdf_qa/               # Core Q&A package
    ├── __init__.py       # Package initialization
    ├── pdf_processor.py  # PDF text extraction and chunking
    └── qa_engine.py      # Question answering engine

Architecture

Simple and clean architecture:

  • Separation of Concerns: UI logic (Gradio) is separate from business logic
  • Modularity: Two main components: PDF processing and Q&A
  • Simplicity: Minimal, focused modules that do one thing well

Example Product Categories

The app includes several example product configurations:

  • Invoice-Focused: Invoice, Receipt, Quote/Estimate
  • Travel-Focused: Flight Ticket, Hotel Reservation, Travel Insurance
  • Employment-Focused: CV/Resume, Job Offer, Employment Contract

Users can also define their own custom product categories in JSON format.
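The exact JSON schema the app expects is not documented here, so the following is only a sketch under an assumed layout: an object keyed by product name, where each value holds a `description` string and an optional `keywords` list. The function name `parse_products` is hypothetical.

```python
import json

# Assumed layout for custom product definitions (not the app's confirmed schema).
EXAMPLE_PRODUCTS_JSON = """
{
  "Flight Ticket": {
    "description": "Airline booking confirmation with passenger and flight details",
    "keywords": ["flight", "boarding", "airline"]
  },
  "Hotel Reservation": {
    "description": "Confirmation of a hotel room booking with check-in dates",
    "keywords": ["hotel", "reservation", "room"]
  },
  "Travel Insurance": {
    "description": "Insurance policy covering trip cancellation or medical costs abroad"
  }
}
"""


def parse_products(raw):
    """Parse and validate product definitions from a JSON string."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict) or not data:
        raise ValueError("Expected a non-empty JSON object keyed by product name")
    for name, spec in data.items():
        if not isinstance(spec, dict) or not isinstance(spec.get("description"), str):
            raise ValueError(f"Product {name!r} needs a string 'description'")
        keywords = spec.setdefault("keywords", [])  # keywords are optional
        if not isinstance(keywords, list) or not all(isinstance(k, str) for k in keywords):
            raise ValueError(f"Product {name!r}: 'keywords' must be a list of strings")
    return data
```

Validating the JSON before classification lets the interface report a malformed definition to the user instead of failing partway through a request.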

Limitations

  • Currently supports one PDF at a time
  • Requires OpenAI API key
  • Best results with text-based PDFs (not scanned images)
  • Processing time depends on document size
  • Classification accuracy depends on document content quality
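Because scanned PDFs yield little or no extractable text, a simple pre-check can warn the user before classification. The sketch below assumes the `pypdf` package is installed; the function names and the 20-character threshold are illustrative choices, not values taken from the app.

```python
def extract_page_texts(pdf_path):
    """Return the extracted text of each page (empty string if none)."""
    from pypdf import PdfReader  # imported lazily; requires the pypdf package

    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]


def looks_scanned(page_texts, min_chars_per_page=20):
    """Heuristic: treat the PDF as scanned if pages average very little text.

    The threshold is an arbitrary assumption and may need tuning.
    """
    if not page_texts:
        return True
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page
```

A typical use would be `looks_scanned(extract_page_texts("document.pdf"))`, where the path is a placeholder, followed by a message suggesting OCR if the check returns true.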

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference