---
title: DocClassify
emoji: 📄
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
---
# Document Classifier
A web application that uses BERT-tiny to classify PDF documents by type. Upload a PDF file and get instant classification results.
## Features
- 📄 PDF file upload and processing
- 🤖 BERT-tiny model for document classification
- 🎯 Classifies 20+ document types including:
  - Invoice, Receipt, Contract, Resume
  - Letter, Report, Memo, Email
  - Form, Certificate, License, Passport
  - Medical records, Bank statements, Tax documents
  - Legal documents, Academic papers, and more
- 💾 Model is downloaded and cached locally on first use
- 🎨 Modern, user-friendly interface
## How It Works
1. The app uses the `prajjwal1/bert-tiny` model from Hugging Face
2. On first run, the model is automatically downloaded to the `models/` directory
3. PDF text is extracted using PyPDF2
4. Document embeddings are computed using BERT-tiny
5. Similarity scores are calculated against pre-computed document type embeddings
6. The document is classified with confidence scores
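
Steps 5 and 6 can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure and a temperature-scaled softmax to turn similarities into confidence scores; neither detail is confirmed by this README, and the names `cosine`, `rank_labels`, and `temperature` are hypothetical. Toy 3-dimensional vectors stand in for the 128-dimensional BERT-tiny embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_labels(doc_vec, label_vecs, top_k=5, temperature=0.1):
    """Return the top_k (label, confidence) pairs, best match first."""
    sims = {label: cosine(doc_vec, vec) for label, vec in label_vecs.items()}
    # Softmax over similarities (an assumption, not the app's confirmed method).
    # Subtracting the peak keeps the exponentials numerically stable.
    peak = max(sims.values())
    exps = {label: math.exp((s - peak) / temperature) for label, s in sims.items()}
    total = sum(exps.values())
    ranked = sorted(exps, key=exps.get, reverse=True)
    return [(label, exps[label] / total) for label in ranked[:top_k]]

# Hypothetical pre-computed document type embeddings (toy values).
label_vecs = {
    "Invoice": [1.0, 0.0, 0.0],
    "Resume": [0.0, 1.0, 0.0],
    "Contract": [0.0, 0.0, 1.0],
}
results = rank_labels([0.9, 0.1, 0.2], label_vecs, top_k=2)
for label, confidence in results:
    print(f"{label}: {confidence:.3f}")
```

In this toy example the document vector lies closest to the `Invoice` embedding, so `Invoice` is ranked first and `Contract` second. A lower `temperature` makes the confidence distribution sharper; a higher one flattens it.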
## Setup
### Local Development
1. **Backend Setup:**
```bash
cd backend
pip install -r requirements.txt
```
2. **Frontend Setup:**
```bash
cd frontend
npm install
```
3. **Run Backend:**
```bash
cd backend
uvicorn app.main:app --reload --port 8000
```
4. **Run Frontend:**
```bash
cd frontend
npm run dev
```
5. Open `http://localhost:5173` in your browser
### Docker Deployment
```bash
docker build -t docclassify .
docker run -p 7860:7860 docclassify
```
## Usage
1. Click "Select PDF File" to choose a PDF document
2. Click "Classify Document" to process the file
3. View the classification result with confidence scores
4. Review the top 5 document type predictions
## Model Information
- **Model:** `prajjwal1/bert-tiny`
- **Size:** ~4.4M parameters
- **Architecture:** BERT (L=2, H=128)
- **Source:** [Hugging Face Model Card](https://huggingface.co/prajjwal1/bert-tiny)
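
As a rough sanity check on the size figure, the parameter count can be derived from L=2 and H=128. The sketch below assumes the standard BERT encoder layout with a 30,522-token vocabulary, 512 positions, 2 token types, and a feed-forward width of 4H; these defaults are assumptions rather than values stated in this README, and `bert_param_count` is a hypothetical helper.

```python
def bert_param_count(layers, hidden, vocab=30522, max_pos=512, type_vocab=2, ffn=None):
    """Count weights and biases in a standard BERT encoder with a pooler."""
    ffn = ffn if ffn is not None else 4 * hidden  # assumed feed-forward width
    layer_norm = 2 * hidden  # scale and bias
    # Word, position, and token-type embeddings plus one LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections plus one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (H -> ffn -> H) plus one LayerNorm.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler

params = bert_param_count(layers=2, hidden=128)
print(f"{params:,} parameters")              # 4,385,920 parameters
print(f"{params * 4 / 1e6:.1f} MB as float32")  # 17.5 MB as float32
```

Under these assumptions the count rounds to 4.4M, and most of it sits in the word embedding table rather than in the two transformer layers. The float32 size is an estimate; the actual checkpoint file may differ if it stores extra heads or metadata.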
## Technical Stack
- **Backend:** FastAPI, PyTorch, Transformers, PyPDF2
- **Frontend:** React, Vite
- **Model:** BERT-tiny (prajjwal1/bert-tiny)
## Notes
- The model will be automatically downloaded on first use (~17MB)
- Classification works best with text-based PDFs
- Image-based (scanned) PDFs may fail to classify if they contain no extractable text
- Processing time depends on document size and system resources
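
One way to detect image-only PDFs before classification is to check how much text was actually extracted. This is a minimal sketch, not the app's confirmed behavior: `has_extractable_text` and the `min_chars` threshold are hypothetical, and the input is assumed to be the list of per-page strings produced by the PDF extraction step.

```python
def has_extractable_text(page_texts, min_chars=20):
    """Return True if the pages hold at least min_chars non-whitespace characters."""
    # Some extractors return None for empty pages, so fall back to "".
    total = sum(len("".join((text or "").split())) for text in page_texts)
    return total >= min_chars

# A text-based PDF yields real content; a scanned one yields empty pages.
text_pdf_pages = ["Invoice #1001\nTotal due: 250.00 USD", "Thank you for your business."]
scanned_pdf_pages = ["", "  \n", None]

print(has_extractable_text(text_pdf_pages))     # True
print(has_extractable_text(scanned_pdf_pages))  # False
```

A check like this lets the backend return a clear error for scanned documents instead of classifying an empty string.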