metadata
title: Document Classification
emoji: 🏷️
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 5.44.1
app_file: app.py
pinned: false
license: mit
short_description: Classify PDF documents into product categories using AI
Document Classification App
A Gradio-based web application that classifies PDF documents into product categories using AI-powered analysis.
Features
- PDF Upload: Upload any PDF document
- Text Extraction: Automatically extract text from PDFs
- AI-Powered Classification: Classify documents into product categories
- Multiple Methods: Choose between semantic similarity, keyword matching, or hybrid approach
- Confidence Scores: Get confidence scores for each classification
- Customizable Products: Define your own product categories and descriptions
How to Use
- Upload PDF: Click the upload button and select your PDF file
- Choose Method: Select your preferred classification method (hybrid, semantic, or keyword)
- Define Products: Use the default product definitions or customize your own in JSON format
- Classify: Click "Classify Document" to analyze the PDF
- View Results: See the top 3 product matches with confidence scores
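The last two steps can be pictured as scoring every product and keeping the three best. The following is a minimal sketch, assuming the hybrid method is a weighted average of a semantic score and a keyword score; the function name, the 0.7 weight, and the sample scores are hypothetical and are not taken from the app's code.

```python
from typing import Dict, List, Tuple


def hybrid_top_matches(
    semantic: Dict[str, float],
    keyword: Dict[str, float],
    semantic_weight: float = 0.7,  # assumed weighting, not taken from the app
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Blend two per-product score maps and return the top_k products."""
    products = set(semantic) | set(keyword)
    blended = {
        name: semantic_weight * semantic.get(name, 0.0)
        + (1.0 - semantic_weight) * keyword.get(name, 0.0)
        for name in products
    }
    # Sort by score (highest first), then by name so ties come out in a stable order.
    return sorted(blended.items(), key=lambda item: (-item[1], item[0]))[:top_k]


if __name__ == "__main__":
    # Illustrative scores only; real values would come from the chosen method.
    semantic_scores = {"Invoice": 0.9, "Receipt": 0.6, "Quote/Estimate": 0.3}
    keyword_scores = {"Invoice": 0.5, "Receipt": 1.0}
    for name, score in hybrid_top_matches(semantic_scores, keyword_scores):
        print(f"{name}: {score:.2f}")
```

A product missing from one of the two maps is treated as scoring 0.0 for that method, which keeps the blend well defined when keyword matching finds nothing.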
Setup
For Hugging Face Spaces (Production)
- Set your OPENAI_API_KEY in the Space settings:
  - Go to your Space settings
  - Add OPENAI_API_KEY as a secret
  - Enter your OpenAI API key
For Local Development
- Clone this repository
- Install dependencies:
    pip install -r requirements.txt
- Optionally create a .env file in the project root with your API key:
    OPENAI_API_KEY=your-api-key-here
  Or set it as an environment variable:
    export OPENAI_API_KEY="your-api-key"
- Run the app:
    python app.py
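Both setups end with the key available in the OPENAI_API_KEY environment variable. The sketch below shows one way an app could read it and fail early with a clear message; the helper name get_openai_api_key is hypothetical, and how app.py actually reads the key (or loads .env) is not shown in this README.

```python
import os
from typing import Mapping, Optional


def get_openai_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from the environment, or raise if it is missing."""
    # Accepting a mapping makes the helper easy to test without touching os.environ.
    source = os.environ if env is None else env
    key = source.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set. Add it as a Space secret, "
            "put it in a .env file, or export it in your shell."
        )
    return key
```

Checking for the key at startup surfaces a configuration problem before the first classification request rather than in the middle of one.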
Technical Details
- Framework: Gradio for the web interface
- Embeddings: OpenAI embeddings for semantic similarity
- Vector Store: LangChain InMemoryVectorStore for efficient similarity search
- Classification Methods: Semantic similarity, keyword matching, and hybrid approach
- Text Processing: PyPDF for PDF text extraction
- Architecture: Simple modular design with clean separation of concerns
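To illustrate what semantic similarity means here, the sketch below ranks products by cosine similarity between a document vector and one vector per product. It is a stand-in, not the app's code: the real vectors would come from OpenAI embeddings and the search from LangChain's InMemoryVectorStore, whereas this example uses hand-written toy vectors and plain Python. The function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    doc_vec: Sequence[float],
    product_vecs: Dict[str, Sequence[float]],
) -> List[Tuple[str, float]]:
    """Return (product, similarity) pairs, most similar first."""
    scores = [
        (name, cosine_similarity(doc_vec, vec))
        for name, vec in product_vecs.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for real embeddings.
    document = [0.9, 0.1, 0.0]
    products = {
        "Invoice": [1.0, 0.0, 0.0],
        "Flight Ticket": [0.0, 1.0, 0.0],
        "CV/Resume": [0.0, 0.0, 1.0],
    }
    for name, score in rank_by_similarity(document, products):
        print(f"{name}: {score:.3f}")
```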
Project Structure
pdf2product/
├── app.py                  # Main Gradio application
├── requirements.txt        # Python dependencies
├── .env                    # Environment variables (create this)
└── pdf_qa/                 # Core Q&A package
    ├── __init__.py         # Package initialization
    ├── pdf_processor.py    # PDF text extraction and chunking
    └── qa_engine.py        # Question answering engine
Architecture
Simple and clean architecture:
- Separation of Concerns: UI logic (Gradio) is separate from business logic
- Modularity: Two main components - PDF processing and Q&A
- Simplicity: Minimal, focused modules that do one thing well
Example Product Categories
The app includes several example product configurations:
- Invoice-Focused: Invoice, Receipt, Quote/Estimate
- Travel-Focused: Flight Ticket, Hotel Reservation, Travel Insurance
- Employment-Focused: CV/Resume, Job Offer, Employment Contract
Users can also define their own custom product categories in JSON format.
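The exact JSON schema the app expects is not documented in this README. As a sketch only, the example below assumes the simplest plausible shape, a JSON object mapping each product name to a text description, and validates it before use. The function name parse_products and the schema itself are assumptions.

```python
import json
from typing import Dict


def parse_products(raw: str) -> Dict[str, str]:
    """Parse and validate product definitions (assumed schema: name -> description)."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict) or not data:
        raise ValueError("Expected a non-empty JSON object of product definitions.")
    products: Dict[str, str] = {}
    for name, description in data.items():
        if not name.strip():
            raise ValueError("Product names must not be empty.")
        if not isinstance(description, str) or not description.strip():
            raise ValueError(f"Product {name!r} needs a non-empty text description.")
        products[name.strip()] = description.strip()
    return products


if __name__ == "__main__":
    example = """
    {
      "Invoice": "A bill listing goods or services and the amount due",
      "Receipt": "Proof that a payment was made",
      "Quote/Estimate": "A proposed price for goods or services"
    }
    """
    print(parse_products(example))
```

Rejecting malformed input up front gives the user a readable error instead of a failure deep inside the classification step.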
Limitations
- Currently supports one PDF at a time
- Requires an OpenAI API key
- Best results with text-based PDFs (not scanned images)
- Processing time depends on document size
- Classification accuracy depends on document content quality
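The scanned-PDF limitation can be detected before classification with a simple heuristic: if text extraction returns almost nothing per page, the file is probably made of images. The sketch below assumes the per-page strings have already been extracted (with pypdf, for example, by calling extract_text() on each page of a PdfReader). The function name and the 20-character threshold are hypothetical choices, not part of the app.

```python
from typing import Optional, Sequence


def looks_scanned(
    page_texts: Sequence[Optional[str]],
    min_chars_per_page: int = 20,  # assumed threshold; tune for your documents
) -> bool:
    """Guess whether a PDF is image-only from its extracted per-page text."""
    if not page_texts:
        # No pages (or nothing extracted at all): treat as having no usable text.
        return True
    # Count non-whitespace-padded characters; None is treated as an empty page.
    total_chars = sum(len((text or "").strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page
```

When the check returns True, the app could warn the user that results may be poor rather than classifying an effectively empty document.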
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference