Seth0330/DocClassify

Seth

Update

f6e574f 25 days ago

	---
	title: DocClassify
	emoji: 📄
	colorFrom: yellow
	colorTo: blue
	sdk: docker
	pinned: false
	---

	# Document Classifier

	A web application that uses BERT-tiny to classify PDF documents by type. Upload a PDF file and get instant classification results.

	## Features

	- 📄 PDF file upload and processing
	- 🤖 BERT-tiny model for document classification
	- 🎯 Classifies 20+ document types including:
	  - Invoice, Receipt, Contract, Resume
	  - Letter, Report, Memo, Email
	  - Form, Certificate, License, Passport
	  - Medical records, Bank statements, Tax documents
	  - Legal documents, Academic papers, and more
	- 💾 Model is downloaded and cached locally on first use
	- 🎨 Modern, user-friendly interface

	## How It Works

	1. The app uses the `prajjwal1/bert-tiny` model from Hugging Face
	2. On first run, the model is automatically downloaded to the `models/` directory
	3. PDF text is extracted using PyPDF2
	4. Document embeddings are computed using BERT-tiny
	5. Similarity scores are calculated against pre-computed document type embeddings
	6. The document is classified with confidence scores
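
Steps 5 and 6 above can be sketched as follows. This is a minimal illustration, not the app's actual code: it assumes cosine similarity as the similarity measure (the README does not name one), uses toy 3-dimensional vectors in place of real 128-dimensional BERT-tiny embeddings, and the names `cosine_similarity`, `rank_document_types`, and `TYPE_EMBEDDINGS` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


def rank_document_types(doc_embedding, type_embeddings, top_k=5):
    """Score the document against every type and return the top_k matches."""
    scores = {
        label: cosine_similarity(doc_embedding, embedding)
        for label, embedding in type_embeddings.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


# Toy stand-ins for the pre-computed document type embeddings (assumption).
TYPE_EMBEDDINGS = {
    "Invoice": [1.0, 0.0, 0.0],
    "Resume": [0.0, 1.0, 0.0],
    "Contract": [0.0, 0.0, 1.0],
}

# Toy stand-in for the embedding of an uploaded PDF's text (assumption).
doc_embedding = [0.9, 0.1, 0.0]

ranking = rank_document_types(doc_embedding, TYPE_EMBEDDINGS, top_k=2)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

Here the raw similarity doubles as the score shown to the user. How the app converts similarities into its reported confidence scores is not specified in this README.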

	## Setup

	### Local Development

	1. Backend Setup (each step below starts from the repository root):
	```bash
	cd backend
	pip install -r requirements.txt
	```

	2. Frontend Setup:
	```bash
	cd frontend
	npm install
	```

	3. Run Backend:
	```bash
	cd backend
	uvicorn app.main:app --reload --port 8000
	```

	4. Run Frontend (in a second terminal, since the backend keeps running in the first):
	```bash
	cd frontend
	npm run dev
	```

	5. Open `http://localhost:5173` in your browser

	### Docker Deployment

	```bash
	docker build -t docclassify .
	docker run -p 7860:7860 docclassify
	```

	Once the container is running, the app should be reachable at `http://localhost:7860`, the port published by the `-p 7860:7860` flag.

	## Usage

	1. Click "Select PDF File" to choose a PDF document
	2. Click "Classify Document" to process the file
	3. View the classification result with confidence scores
	4. Review the top 5 document type predictions

	## Model Information

	- Model: `prajjwal1/bert-tiny`
	- Size: ~4.4M parameters
	- Architecture: BERT with 2 transformer layers (L=2) and a hidden size of 128 (H=128)
	- Source: [Hugging Face Model Card](https://huggingface.co/prajjwal1/bert-tiny)
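
The ~4.4M parameter figure can be roughly reproduced from the architecture. The sketch below is a back-of-the-envelope count: L=2 and H=128 come from this README, while the vocabulary size (30,522), maximum positions (512), token type count (2), and feed-forward width (4 × H) are assumed standard BERT values that the README does not state. The helper names are hypothetical.

```python
# From the README: 2 transformer layers, hidden size 128.
LAYERS = 2
HIDDEN = 128

# Assumed standard BERT configuration values (not stated in the README).
VOCAB_SIZE = 30522
MAX_POSITIONS = 512
TYPE_VOCAB = 2
INTERMEDIATE = 4 * HIDDEN


def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Scale and shift vectors of a LayerNorm."""
    return 2 * n


def bert_parameter_count():
    embeddings = (
        VOCAB_SIZE * HIDDEN        # word embeddings
        + MAX_POSITIONS * HIDDEN   # position embeddings
        + TYPE_VOCAB * HIDDEN      # token type embeddings
        + layer_norm(HIDDEN)
    )
    per_layer = (
        4 * linear(HIDDEN, HIDDEN)      # query, key, value, attention output
        + layer_norm(HIDDEN)
        + linear(HIDDEN, INTERMEDIATE)  # feed-forward expansion
        + linear(INTERMEDIATE, HIDDEN)  # feed-forward projection
        + layer_norm(HIDDEN)
    )
    pooler = linear(HIDDEN, HIDDEN)
    return embeddings + LAYERS * per_layer + pooler


total = bert_parameter_count()
size_mb = total * 4 / 1e6  # 4 bytes per float32 parameter

print(f"{total:,} parameters, about {size_mb:.1f} MB as float32")
```

Under these assumptions the count lands close to 4.4M parameters, and at 4 bytes per weight that is in line with the download size of about 17MB. The actual downloaded files may differ somewhat, for example if the checkpoint includes additional heads or tokenizer files.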

	## Technical Stack

	- Backend: FastAPI, PyTorch, Transformers, PyPDF2
	- Frontend: React, Vite
	- Model: BERT-tiny (prajjwal1/bert-tiny)

	## Notes

	- The model will be automatically downloaded on first use (~17MB)
	- Classification works best with text-based PDFs
	- Image-based (scanned) PDFs may not classify correctly when they contain no extractable text
	- Processing time depends on document size and system resources
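
One way to guard against image-based PDFs is to check how much text was actually extracted before classifying. The sketch below is illustrative only: it assumes the per-page text has already been extracted (for example with PyPDF2), and both the function name `has_extractable_text` and the 50-character threshold are hypothetical choices, not part of the app.

```python
def has_extractable_text(page_texts, min_chars=50):
    """Return True if the pages contain at least min_chars non-whitespace characters.

    page_texts is a list with one entry per page; an entry may be an empty
    string or None when nothing could be extracted from that page.
    """
    total = 0
    for text in page_texts:
        if not text:
            continue  # skip pages with no extracted text
        total += sum(1 for ch in text if not ch.isspace())
    return total >= min_chars


# A scanned PDF typically yields empty or whitespace-only page texts.
scanned_pages = ["", None, "  \n"]
text_pages = ["Invoice number 123, due on receipt. " * 5]

print(has_extractable_text(scanned_pages))
print(has_extractable_text(text_pages))
```

A check like this lets the backend return a clear error message for scanned documents instead of a low-confidence classification.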