---
title: DocClassify
emoji: 📄
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
---

# Document Classifier

A web application that uses BERT-tiny to classify PDF documents by type. Upload a PDF file and get instant classification results.

## Features

- 📄 PDF file upload and processing
- 🤖 BERT-tiny model for document classification
- 🎯 Classifies 20+ document types including:
  - Invoice, Receipt, Contract, Resume
  - Letter, Report, Memo, Email
  - Form, Certificate, License, Passport
  - Medical records, Bank statements, Tax documents
  - Legal documents, Academic papers, and more
- 💾 Model is downloaded and cached locally on first use
- 🎨 Modern, user-friendly interface

## How It Works

1. The app uses the `prajjwal1/bert-tiny` model from Hugging Face
2. On first run, the model is automatically downloaded to the `models/` directory
3. PDF text is extracted using PyPDF2
4. Document embeddings are computed using BERT-tiny
5. Similarity scores are calculated against pre-computed document type embeddings
6. The document is classified with confidence scores
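
Steps 5 and 6 can be sketched in plain Python as below. This is a minimal illustration and not the app's actual code: it assumes cosine similarity as the similarity measure, uses toy 3-dimensional vectors in place of real BERT-tiny embeddings, and the names `cosine_similarity`, `rank_document_types`, and `TYPE_EMBEDDINGS` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_document_types(doc_embedding, type_embeddings, top_k=5):
    """Score the document against every type embedding and return the top_k (label, score) pairs."""
    scores = {
        label: cosine_similarity(doc_embedding, embedding)
        for label, embedding in type_embeddings.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Toy stand-ins for the pre-computed document type embeddings (assumption:
# the real ones would be BERT-tiny vectors, not 3-dimensional unit vectors).
TYPE_EMBEDDINGS = {
    "invoice": [1.0, 0.0, 0.0],
    "resume": [0.0, 1.0, 0.0],
    "contract": [0.0, 0.0, 1.0],
}

ranking = rank_document_types([0.9, 0.1, 0.0], TYPE_EMBEDDINGS, top_k=2)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

With the toy vectors, `invoice` ranks first because the document vector points almost entirely along its axis. How the app turns raw similarities into the displayed confidence scores is not specified here, so the sketch reports the similarities unchanged.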

## Setup

### Local Development

1. **Backend Setup:**
   ```bash
   cd backend
   pip install -r requirements.txt
   ```

2. **Frontend Setup:**
   ```bash
   cd frontend
   npm install
   ```

3. **Run Backend:**
   ```bash
   cd backend
   uvicorn app.main:app --reload --port 8000
   ```

4. **Run Frontend:**
   ```bash
   cd frontend
   npm run dev
   ```

5. Open `http://localhost:5173` in your browser

### Docker Deployment

```bash
docker build -t docclassify .
docker run -p 7860:7860 docclassify
```

## Usage

1. Click "Select PDF File" to choose a PDF document
2. Click "Classify Document" to process the file
3. View the classification result with confidence scores
4. See the top 5 document type predictions

## Model Information

- **Model:** `prajjwal1/bert-tiny`
- **Size:** ~4.4M parameters
- **Architecture:** BERT (L=2, H=128)
- **Source:** [Hugging Face Model Card](https://huggingface.co/prajjwal1/bert-tiny)
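
The ~4.4M figure can be checked with a short back-of-the-envelope count. The sketch below assumes the standard BERT layout (embeddings, encoder layers, pooler) with a 30,522-token vocabulary, 512 position embeddings, and a feed-forward width of 4×H; these three values are assumptions not stated in this README, and `bert_param_count` is a hypothetical helper.

```python
def bert_param_count(layers, hidden, vocab=30522, max_positions=512,
                     type_vocab=2, intermediate=None):
    """Count weights and biases of a standard BERT encoder with a pooler."""
    if intermediate is None:
        intermediate = 4 * hidden  # assumption: usual BERT feed-forward width

    layer_norm = 2 * hidden  # scale and bias

    # Word, position, and token-type embeddings plus their LayerNorm.
    embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm

    # Query, key, value, and attention output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers of the feed-forward block, each with a bias.
    feed_forward = (hidden * intermediate + intermediate) \
        + (intermediate * hidden + hidden) + layer_norm

    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_param_count(layers=2, hidden=128)
print(f"{total:,} parameters")                    # 4,385,920 parameters
print(f"{total * 4 / 1e6:.1f} MB as float32")     # 17.5 MB as float32
```

Under these assumptions most of the parameters sit in the word embedding table (30,522 × 128), and the float32 size is in line with the ~17MB download mentioned in the notes.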

## Technical Stack

- **Backend:** FastAPI, PyTorch, Transformers, PyPDF2
- **Frontend:** React, Vite
- **Model:** BERT-tiny (prajjwal1/bert-tiny)

## Notes

- The model is downloaded automatically on first use (~17 MB)
- Classification works best with text-based PDFs
- Image-based PDFs may not work if they don't contain extractable text
- Processing time depends on document size and system resources
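
One way to detect image-only PDFs before classification is to check how much text the extraction step returns. The sketch below is illustrative rather than the app's actual code: it assumes PyPDF2 2.x or later (`PdfReader` and `page.extract_text()`), and the helper names and the 20-character threshold are hypothetical.

```python
def extract_page_texts(path):
    """Return the extracted text of every page in the PDF at `path`."""
    # Imported lazily so the pure helper below works without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # extract_text() can come back empty for scanned pages.
    return [page.extract_text() or "" for page in reader.pages]


def has_extractable_text(page_texts, min_chars=20):
    """True if the pages contain at least `min_chars` non-whitespace-padded characters."""
    total = sum(len(text.strip()) for text in page_texts)
    return total >= min_chars


# Example with already-extracted page texts (no PDF file needed):
print(has_extractable_text(["", "   \n"]))                        # False
print(has_extractable_text(["Invoice #123, total due 45.00 USD"]))  # True
```

A caller could run `has_extractable_text(extract_page_texts("document.pdf"))` and return a clear error to the user instead of classifying an empty string.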