---
title: DocClassify
emoji: π
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
---
# Document Classifier
A web application that uses BERT-tiny to classify PDF documents by type. Upload a PDF file and get instant classification results.
## Features
- PDF file upload and processing
- BERT-tiny model for document classification
- Classifies 20+ document types, including:
  - Invoice, Receipt, Contract, Resume
  - Letter, Report, Memo, Email
  - Form, Certificate, License, Passport
  - Medical records, Bank statements, Tax documents
  - Legal documents, Academic papers, and more
- Model is downloaded and cached locally on first use
- Modern, user-friendly interface
## How It Works
1. The app uses the `prajjwal1/bert-tiny` model from Hugging Face
2. On first run, the model is automatically downloaded to the `models/` directory
3. PDF text is extracted using PyPDF2
4. Document embeddings are computed using BERT-tiny
5. Similarity scores are calculated against pre-computed document type embeddings
6. The best-matching document type is reported as the classification, along with confidence scores
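Steps 5 and 6 can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure and a softmax over the similarities as the confidence score; neither is confirmed by this README. The function names and the three-dimensional toy vectors are hypothetical stand-ins for real 128-dimensional BERT-tiny embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_document_types(doc_embedding, type_embeddings, top_k=5):
    """Return up to top_k (label, confidence) pairs, best match first."""
    labels = list(type_embeddings)
    if not labels:
        return []
    sims = [cosine_similarity(doc_embedding, type_embeddings[label]) for label in labels]
    # Assumption: confidences are a softmax over the similarity scores.
    highest = max(sims)
    exps = [math.exp(s - highest) for s in sims]
    total = sum(exps)
    scored = [(label, e / total) for label, e in zip(labels, exps)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy stand-ins for the pre-computed document type embeddings (hypothetical values).
TYPE_EMBEDDINGS = {
    "Invoice": [1.0, 0.0, 0.0],
    "Resume": [0.0, 1.0, 0.0],
    "Contract": [0.0, 0.0, 1.0],
}

ranking = rank_document_types([0.9, 0.1, 0.0], TYPE_EMBEDDINGS, top_k=2)
for label, confidence in ranking:
    print(f"{label}: {confidence:.3f}")
```

Because cosine similarities all lie in a narrow range, a plain softmax yields fairly flat confidences; a real implementation might sharpen them with a temperature or report the raw similarities instead.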
## Setup
### Local Development
1. **Backend Setup:**
```bash
cd backend
pip install -r requirements.txt
```
2. **Frontend Setup:**
```bash
cd frontend
npm install
```
3. **Run Backend:**
```bash
cd backend
uvicorn app.main:app --reload --port 8000
```
4. **Run Frontend:**
```bash
cd frontend
npm run dev
```
5. Open `http://localhost:5173` in your browser
### Docker Deployment
```bash
docker build -t docclassify .
docker run -p 7860:7860 docclassify
```
## Usage
1. Click "Select PDF File" to choose a PDF document
2. Click "Classify Document" to process the file
3. View the classification result with confidence scores
4. Review the top 5 document type predictions
## Model Information
- **Model:** `prajjwal1/bert-tiny`
- **Size:** ~4.4M parameters
- **Architecture:** BERT (L=2, H=128)
- **Source:** [Hugging Face Model Card](https://huggingface.co/prajjwal1/bert-tiny)
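The parameter count can be sanity-checked with a little arithmetic. The sketch below assumes the standard BERT layout: a 30,522-token vocabulary, 512 positions, 2 token types, a feed-forward width of 4 × H, and a pooler layer. Only L=2 and H=128 come from this README; the other values and the function name `bert_parameter_count` are assumptions for illustration.

```python
def bert_parameter_count(num_layers, hidden, vocab_size=30522,
                         max_positions=512, type_vocab=2, intermediate=None):
    """Estimate the parameter count of a standard BERT encoder with pooler."""
    if intermediate is None:
        intermediate = 4 * hidden  # assumption: usual 4x feed-forward width
    layer_norm = 2 * hidden  # one weight and one bias per hidden unit

    # Token, position, and token-type embeddings plus their LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value, and output projections (weights + biases) plus LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (up- and down-projection) plus LayerNorm.
    feed_forward = (hidden * intermediate + intermediate) \
        + (intermediate * hidden + hidden) + layer_norm
    # Dense layer applied to the [CLS] token.
    pooler = hidden * hidden + hidden

    return embeddings + num_layers * (attention + feed_forward) + pooler


params = bert_parameter_count(num_layers=2, hidden=128)
size_mb = params * 4 / 1e6  # 4 bytes per float32 weight

print(params)             # 4385920
print(round(size_mb, 1))  # 17.5
```

Under these assumptions the estimate is about 4.4M parameters, and at 4 bytes per weight that is roughly 17.5 MB, in line with the ~17MB download size. Most of the parameters sit in the token embedding table rather than in the two transformer layers.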
## Technical Stack
- **Backend:** FastAPI, PyTorch, Transformers, PyPDF2
- **Frontend:** React, Vite
- **Model:** BERT-tiny (prajjwal1/bert-tiny)
## Notes
- The model (~17MB) is downloaded automatically on first use
- Classification works best with text-based PDFs
- Image-based (scanned) PDFs cannot be classified unless they contain extractable text, because PyPDF2 does not perform OCR
- Processing time depends on document size and system resources
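The point about image-based PDFs can be checked before classification. The sketch below is one possible guard, not the app's actual code: it extracts text with PyPDF2's `PdfReader` and `extract_text()`, then rejects documents with too little text. The helper names and the `min_chars=20` threshold are hypothetical.

```python
def extract_pdf_text(path):
    """Concatenate the extractable text of every page in a PDF."""
    # Imported lazily so has_usable_text() works without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # Pages without a text layer (e.g. scans) yield little or no text.
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def has_usable_text(text, min_chars=20):
    """True if text has at least min_chars non-whitespace characters.

    min_chars is an arbitrary illustrative threshold, not a tuned value.
    """
    return sum(1 for ch in text if not ch.isspace()) >= min_chars


print(has_usable_text("  \n\t"))                                # False
print(has_usable_text("Invoice number 12345 due on receipt"))   # True
```

A backend could call `has_usable_text(extract_pdf_text(path))` and return a clear error for scanned documents instead of classifying an empty string.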