---
title: DocClassify
emoji: 📄
colorFrom: yellow
colorTo: blue
sdk: docker
pinned: false
---
# Document Classifier

A web application that uses BERT-tiny to classify PDF documents by type. Upload a PDF file and get the predicted document type with confidence scores.
## Features

- 📄 PDF file upload and processing
- 🤖 BERT-tiny model for document classification
- 🎯 Classifies 20+ document types, including:
  - Invoice, Receipt, Contract, Resume
  - Letter, Report, Memo, Email
  - Form, Certificate, License, Passport
  - Medical records, Bank statements, Tax documents
  - Legal documents, Academic papers, and more
- 💾 Model is downloaded and cached locally on first use
- 🎨 Modern, user-friendly interface
## How It Works

1. The app uses the `prajjwal1/bert-tiny` model from Hugging Face
2. On first run, the model is automatically downloaded to the `models/` directory
3. Text is extracted from the uploaded PDF using PyPDF2
4. A document embedding is computed from the extracted text using BERT-tiny
5. Similarity scores are calculated between the document embedding and pre-computed embeddings for each document type
6. The best-matching type is returned together with confidence scores
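
Steps 5 and 6 can be sketched roughly as follows. This is a minimal illustration, not the app's actual code: it assumes cosine similarity as the similarity measure and uses the raw similarity as the score, and the names `cosine_similarity`, `classify`, and `LABEL_EMBEDDINGS`, as well as the toy 3-dimensional vectors, are hypothetical stand-ins for the real 128-dimensional BERT-tiny embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(doc_embedding, label_embeddings, top_k=5):
    """Score the document against every label and return the top_k (label, score) pairs."""
    scores = {
        label: cosine_similarity(doc_embedding, emb)
        for label, emb in label_embeddings.items()
    }
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Hypothetical pre-computed label embeddings (toy values, not real model output).
LABEL_EMBEDDINGS = {
    "Invoice": [0.9, 0.1, 0.0],
    "Resume": [0.1, 0.9, 0.1],
    "Contract": [0.0, 0.2, 0.9],
}

doc = [0.8, 0.2, 0.1]  # stand-in for the embedding of an uploaded PDF
top = classify(doc, LABEL_EMBEDDINGS, top_k=2)
for label, score in top:
    print(f"{label}: {score:.3f}")
```

Because the label embeddings are computed once and reused, each request only needs one forward pass for the document plus one cheap comparison per document type.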
## Setup

### Local Development

1. **Backend Setup:**
   ```bash
   cd backend
   pip install -r requirements.txt
   ```
2. **Frontend Setup:**
   ```bash
   cd frontend
   npm install
   ```
3. **Run Backend:**
   ```bash
   cd backend
   uvicorn app.main:app --reload --port 8000
   ```
4. **Run Frontend:**
   ```bash
   cd frontend
   npm run dev
   ```
5. Open `http://localhost:5173` in your browser
### Docker Deployment

```bash
docker build -t docclassify .
docker run -p 7860:7860 docclassify
```
## Usage

1. Click "Select PDF File" to choose a PDF document
2. Click "Classify Document" to process the file
3. View the classification result with its confidence score
4. Review the top 5 document type predictions
## Model Information

- **Model:** `prajjwal1/bert-tiny`
- **Size:** ~4.4M parameters
- **Architecture:** BERT (L=2, H=128)
- **Source:** [Hugging Face Model Card](https://huggingface.co/prajjwal1/bert-tiny)
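
The ~4.4M figure can be sanity-checked from the architecture with a little arithmetic. The sketch below assumes the usual BERT layout for this checkpoint (a 30,522-token vocabulary, 512 positions, 2 segment types, a feed-forward width of 4×H, and a pooler layer); those values are assumptions taken from the standard BERT configuration, not from this README.

```python
# Assumed configuration values (standard BERT layout, not stated in this README).
VOCAB_SIZE = 30522
MAX_POSITIONS = 512
TYPE_VOCAB = 2
HIDDEN = 128               # H=128
LAYERS = 2                 # L=2
INTERMEDIATE = 4 * HIDDEN  # feed-forward width


def linear(n_in, n_out):
    """Parameters in a dense layer: weight matrix plus bias."""
    return n_in * n_out + n_out


def layer_norm(n):
    """Parameters in a LayerNorm: scale plus shift."""
    return 2 * n


# Word, position, and segment embedding tables, followed by one LayerNorm.
embeddings = (VOCAB_SIZE + MAX_POSITIONS + TYPE_VOCAB) * HIDDEN + layer_norm(HIDDEN)

# One encoder layer: Q, K, V, and output projections, then the feed-forward block.
per_layer = (
    4 * linear(HIDDEN, HIDDEN)
    + layer_norm(HIDDEN)
    + linear(HIDDEN, INTERMEDIATE)
    + linear(INTERMEDIATE, HIDDEN)
    + layer_norm(HIDDEN)
)

pooler = linear(HIDDEN, HIDDEN)
total = embeddings + LAYERS * per_layer + pooler

print(f"{total:,} parameters")                 # 4,385,920 parameters
print(f"{total * 4 / 1e6:.1f} MB at float32")  # 17.5 MB at float32
```

Under these assumptions most of the parameters sit in the word-embedding table, and the float32 size is in line with the download size mentioned in the Notes.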
## Technical Stack

- **Backend:** FastAPI, PyTorch, Transformers, PyPDF2
- **Frontend:** React, Vite
- **Model:** BERT-tiny (`prajjwal1/bert-tiny`)
## Notes

- The model (~17MB) is downloaded automatically on first use
- Classification works best with text-based PDFs
- Image-based (scanned) PDFs may not work if they contain no extractable text
- Processing time depends on document size and system resources
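
One way to guard against image-only PDFs is to check how much text extraction actually produced before classifying. The helper below is a hypothetical sketch, not part of the app: `has_extractable_text` and the `min_chars` threshold are illustrative names and values. It takes the per-page strings that a PDF library returns (with recent PyPDF2 versions, something like `[page.extract_text() for page in PdfReader(path).pages]`).

```python
def has_extractable_text(page_texts, min_chars=20):
    """Return True if the extracted pages contain at least min_chars non-whitespace-trimmed characters.

    page_texts: iterable of strings (or None) as returned by a PDF text extractor.
    min_chars: hypothetical threshold; tune it for your documents.
    """
    total = sum(len((text or "").strip()) for text in page_texts)
    return total >= min_chars


# A text-based PDF yields real strings; a scanned PDF typically yields empty ones.
print(has_extractable_text(["Invoice #123 total due: $40.00"]))
print(has_extractable_text(["", None, "  \n"]))
```

When the check fails, the app could return a clear error asking for a text-based PDF instead of a low-confidence prediction.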