---
title: DocClassify
emoji: 📄
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
---

# Document Classifier

A web application that uses BERT-tiny to classify PDF documents by type. Upload a PDF file and get instant classification results.

## Features

- 📄 PDF file upload and processing
- 🤖 BERT-tiny model for document classification
- 🎯 Classifies 20+ document types including:
  - Invoice, Receipt, Contract, Resume
  - Letter, Report, Memo, Email
  - Form, Certificate, License, Passport
  - Medical records, Bank statements, Tax documents
  - Legal documents, Academic papers, and more
- 💾 Model is downloaded and cached locally on first use
- 🎨 Modern, user-friendly interface

## How It Works

1. The app uses the `prajjwal1/bert-tiny` model from Hugging Face
2. On first run, the model is automatically downloaded to the `models/` directory
3. PDF text is extracted using PyPDF2
4. Document embeddings are computed using BERT-tiny
5. Similarity scores are calculated against pre-computed document type embeddings
6. The document is assigned the best-matching type, and the top candidate types are returned with confidence scores
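
Steps 5 and 6 can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure and uses toy 3-dimensional vectors in place of real BERT-tiny embeddings; the function names `cosine_similarity` and `rank_document_types` are hypothetical and are not taken from this project's backend.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as "no match"
    return dot / (norm_a * norm_b)


def rank_document_types(doc_embedding, type_embeddings, top_k=5):
    """Return the top_k (label, score) pairs, best match first.

    type_embeddings maps a label such as "Invoice" to its pre-computed
    embedding (assumed here to be a plain list of floats).
    """
    scores = {
        label: cosine_similarity(doc_embedding, embedding)
        for label, embedding in type_embeddings.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Toy data: real embeddings would have 128 dimensions (H=128).
type_embeddings = {
    "Invoice": [1.0, 0.0, 0.0],
    "Resume": [0.0, 1.0, 0.0],
    "Report": [0.7, 0.7, 0.0],
}
doc_embedding = [0.9, 0.1, 0.0]

for label, score in rank_document_types(doc_embedding, type_embeddings):
    print(f"{label}: {score:.3f}")
```

How the app turns raw similarities into the confidence scores shown in the interface is not specified here; the sketch simply reports the similarity values.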

## Setup

### Local Development

1. **Backend Setup:**
   ```bash
   cd backend
   pip install -r requirements.txt
   ```

2. **Frontend Setup:**
   ```bash
   cd frontend
   npm install
   ```

3. **Run Backend:**
   ```bash
   cd backend
   uvicorn app.main:app --reload --port 8000
   ```

4. **Run Frontend:**
   ```bash
   cd frontend
   npm run dev
   ```

5. Open `http://localhost:5173` (the Vite dev server's default port) in your browser

### Docker Deployment

```bash
docker build -t docclassify .
docker run -p 7860:7860 docclassify
```

Once the container is running, the app should be reachable at `http://localhost:7860`.

## Usage

1. Click "Select PDF File" to choose a PDF document
2. Click "Classify Document" to process the file
3. View the classification result with confidence scores
4. See top 5 document type predictions

## Model Information

- **Model:** `prajjwal1/bert-tiny`
- **Size:** ~4.4M parameters
- **Architecture:** BERT (L=2, H=128)
- **Source:** [Hugging Face Model Card](https://huggingface.co/prajjwal1/bert-tiny)
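
The parameter count can be cross-checked with a short calculation. This is a rough sketch that assumes the standard BERT layout for this checkpoint (vocabulary of 30,522 tokens, 512 positions, 2 segment types, feed-forward width of 4×H = 512, plus a pooler layer); those configuration values are assumptions not stated in this README, and `bert_param_count` is a hypothetical helper.

```python
def bert_param_count(vocab=30522, hidden=128, layers=2,
                     intermediate=512, max_pos=512, type_vocab=2):
    """Count weights and biases in a BERT encoder (no task head)."""
    layer_norm = 2 * hidden  # scale + bias

    # Token, position, and segment embedding tables, then one LayerNorm.
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm

    # Query, key, value, and output projections, then one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Two dense layers (H -> intermediate -> H), then one LayerNorm.
    feed_forward = (hidden * intermediate + intermediate) \
        + (intermediate * hidden + hidden) + layer_norm

    # Dense layer applied to the [CLS] token.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


total = bert_param_count()
print(f"parameters: {total:,}")
print(f"float32 size: {total * 4 / 1e6:.1f} MB")
```

Under these assumptions the total lands at roughly 4.4M parameters, and at 4 bytes per float32 weight that is in line with the ~17MB download mentioned in the Notes. Most of the parameters sit in the token embedding table rather than in the two transformer layers.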

## Technical Stack

- **Backend:** FastAPI, PyTorch, Transformers, PyPDF2
- **Frontend:** React, Vite
- **Model:** BERT-tiny (prajjwal1/bert-tiny)

## Notes

- The model will be automatically downloaded on first use (~17MB)
- Classification works best with text-based PDFs
- Image-based (scanned) PDFs that have no extractable text layer may fail to classify
- Processing time depends on document size and system resources
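
To tell in advance whether a PDF is likely to classify well, one option is to check how much text can be extracted from it. The sketch below assumes PyPDF2 2.0 or later (where `PdfReader` is available); the helper names and the 20-character threshold are hypothetical choices for illustration, not part of this project.

```python
def extract_pdf_text(path):
    """Concatenate the extractable text of every page in a PDF."""
    # Imported lazily so the text check below works without PyPDF2 installed.
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # extract_text() can yield an empty result for image-only pages.
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def has_extractable_text(text, min_chars=20):
    """True if the text has at least min_chars non-whitespace characters.

    The threshold is an arbitrary illustrative value.
    """
    return len("".join(text.split())) >= min_chars


# Example usage with an assumed local file name:
# text = extract_pdf_text("sample.pdf")
# if not has_extractable_text(text):
#     print("No text layer found; this PDF may need OCR first.")
```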