---
title: DocClassify
emoji: 📄
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
---

# Document Classifier

A web application that uses BERT-tiny to classify PDF documents by type. Upload a PDF file and get instant classification results.

## Features

- 📄 PDF file upload and processing
- 🤖 BERT-tiny model for document classification
- 🎯 Classifies 20+ document types including:
  - Invoice, Receipt, Contract, Resume
  - Letter, Report, Memo, Email
  - Form, Certificate, License, Passport
  - Medical records, Bank statements, Tax documents
  - Legal documents, Academic papers, and more
- 💾 Model is downloaded and cached locally on first use
- 🎨 Modern, user-friendly interface

## How It Works

1. The app uses the `prajjwal1/bert-tiny` model from Hugging Face
2. On first run, the model is automatically downloaded to the `models/` directory
3. PDF text is extracted using PyPDF2
4. Document embeddings are computed using BERT-tiny
5. Similarity scores are calculated against pre-computed document type embeddings
6. The document is classified with confidence scores

## Setup

### Local Development

1. **Backend Setup:**
   ```bash
   cd backend
   pip install -r requirements.txt
   ```

2. **Frontend Setup:**
   ```bash
   cd frontend
   npm install
   ```

3. **Run Backend:**
   ```bash
   cd backend
   uvicorn app.main:app --reload --port 8000
   ```

4. **Run Frontend:**
   ```bash
   cd frontend
   npm run dev
   ```

5. Open `http://localhost:5173` in your browser

### Docker Deployment

```bash
docker build -t docclassify .
docker run -p 7860:7860 docclassify
```

## Usage

1. Click "Select PDF File" to choose a PDF document
2. Click "Classify Document" to process the file
3. View the classification result with confidence scores
4. See top 5 document type predictions

## Model Information

- **Model:** `prajjwal1/bert-tiny`
- **Size:** ~4.4M parameters
- **Architecture:** BERT (L=2, H=128)
- **Source:** [Hugging Face Model Card](https://huggingface.co/prajjwal1/bert-tiny)

## Technical Stack

- **Backend:** FastAPI, PyTorch, Transformers, PyPDF2
- **Frontend:** React, Vite
- **Model:** BERT-tiny (prajjwal1/bert-tiny)

## Notes

- The model will be automatically downloaded on first use (~17MB)
- Classification works best with text-based PDFs
- Image-based PDFs may not work if they don't contain extractable text
- Processing time depends on document size and system resources
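
## Example: Similarity Scoring Sketch

The scoring step described under "How It Works" (comparing a document embedding against pre-computed document type embeddings and reporting the top 5 with confidence scores) can be illustrated with a minimal, dependency-free sketch. This is not the app's actual code: the use of cosine similarity, the softmax with a `temperature` parameter, the function names, and the toy 3-dimensional vectors are all assumptions made for illustration. In the app, the vectors would come from BERT-tiny, whose hidden size is 128.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_document_types(doc_embedding, type_embeddings, top_k=5, temperature=0.1):
    """Return up to top_k (label, confidence) pairs, highest confidence first.

    Assumption: confidences are a softmax over cosine similarities. The
    temperature value is a hypothetical knob, not something the app documents.
    """
    labels = list(type_embeddings)
    sims = [cosine_similarity(doc_embedding, type_embeddings[label]) for label in labels]

    # Numerically stable softmax: subtract the maximum before exponentiating.
    scaled = [s / temperature for s in sims]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    confidences = [e / total for e in exps]

    ranked = sorted(zip(labels, confidences), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Toy stand-ins for pre-computed document type embeddings (hypothetical values).
TYPE_EMBEDDINGS = {
    "Invoice": [1.0, 0.0, 0.0],
    "Contract": [0.0, 1.0, 0.0],
    "Resume": [0.0, 0.0, 1.0],
}

if __name__ == "__main__":
    doc_embedding = [0.9, 0.1, 0.0]  # stand-in for a BERT-tiny document embedding
    for label, confidence in rank_document_types(doc_embedding, TYPE_EMBEDDINGS):
        print(f"{label}: {confidence:.3f}")
```

With these toy vectors the document lies closest to the "Invoice" direction, so that label ranks first. A lower `temperature` sharpens the confidence distribution, while a higher one flattens it.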