# Automated Document Text Extraction Using Small Language Model (SLM)
[![Python](https://img.shields.io/badge/python-v3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-v2.0+-red.svg)](https://pytorch.org/)
[![Transformers](https://img.shields.io/badge/Transformers-v4.30+-yellow.svg)](https://huggingface.co/transformers/)
[![FastAPI](https://img.shields.io/badge/FastAPI-v0.100+-green.svg)](https://fastapi.tiangolo.com/)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
> **Intelligent document processing system that extracts structured information from invoices, forms, and scanned documents using fine-tuned DistilBERT and transfer learning.**
## Quick Start
### 1. Installation
```bash
# Clone the repository
git clone https://github.com/sanjanb/small-language-model.git
cd small-language-model
# Install dependencies
pip install -r requirements.txt
# Install Tesseract OCR (Windows)
# Download from: https://github.com/UB-Mannheim/tesseract/wiki
# Add to PATH or set TESSERACT_PATH environment variable
# Install Tesseract OCR (Ubuntu/Debian)
sudo apt install tesseract-ocr
# Install Tesseract OCR (macOS)
brew install tesseract
```
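Before running the demo, it can help to confirm that the Tesseract binary is actually reachable. A minimal check, assuming only that `tesseract` should be on your `PATH` (or pointed to via `TESSERACT_PATH`):

```python
import shutil

def tesseract_available() -> bool:
    """Return True if the tesseract binary is discoverable on PATH."""
    return shutil.which("tesseract") is not None

if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found")
    else:
        print("Tesseract not on PATH; set TESSERACT_PATH or adjust PATH")
```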
### 2. Quick Demo
```bash
# Run the interactive demo
python demo.py
# Option 1: Complete demo with training and inference
# Option 2: Train model only
# Option 3: Test specific text
```
### 3. Web Interface
```bash
# Start the web API server
python api/app.py
# Open your browser to http://localhost:8000
# Upload documents or enter text for extraction
```
## Project Overview
This system combines **OCR technology**, **text preprocessing**, and a **fine-tuned DistilBERT model** to automatically extract structured information from documents. It uses transfer learning to adapt a pretrained transformer for document-specific Named Entity Recognition (NER).
### Key Capabilities
- **Multi-format Support**: PDF, DOCX, PNG, JPG, TIFF, BMP
- **Dual OCR Engine**: Tesseract + EasyOCR for more robust text recognition
- **Smart Entity Extraction**: Names, dates, amounts, addresses, phones, emails
- **Transfer Learning**: Fine-tuned DistilBERT for document-specific tasks
- **Web API**: RESTful endpoints with interactive interface
- **Validated Output**: Regex validation cross-checks ML predictions
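The last point (regex validation of model predictions) can be sketched as a post-filter: a predicted entity is kept only if its text also matches a pattern for its label. The patterns below are illustrative assumptions, not the project's actual ones:

```python
import re

# Hypothetical post-filter patterns; the project's real patterns live in src/.
VALIDATORS = {
    "EMAIL": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "AMOUNT": re.compile(r"^\$?\d+(,\d{3})*(\.\d{2})?$"),
    "PHONE": re.compile(r"^\+?\d[\d\s().-]{6,}\d$"),
}

def validate_entity(label: str, text: str) -> bool:
    """Accept a predicted entity only if it matches the regex for its label.

    Labels without a pattern (e.g. NAME) pass through unchecked.
    """
    pattern = VALIDATORS.get(label)
    return True if pattern is None else bool(pattern.match(text))
```

This keeps the model free to generalize while cheap rules catch obvious OCR noise in highly structured fields.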
## System Architecture
```mermaid
graph TD
A[Document Input] --> B[OCR Processing]
B --> C[Text Cleaning]
C --> D[Tokenization]
D --> E[DistilBERT NER Model]
E --> F[Entity Extraction]
F --> G[Post-processing]
G --> H[Structured JSON Output]
I[Training Data] --> J[Auto-labeling]
J --> K[Model Training]
K --> E
```
## Project Structure
```
small-language-model/
├── src/                     # Core source code
│   ├── data_preparation.py  # OCR & dataset creation
│   ├── model.py             # DistilBERT NER model
│   ├── training_pipeline.py # Training orchestration
│   └── inference.py         # Document processing
├── api/                     # Web API service
│   └── app.py               # FastAPI application
├── config/                  # Configuration files
│   └── settings.py          # Project settings
├── data/                    # Data directories
│   ├── raw/                 # Input documents
│   └── processed/           # Processed datasets
├── models/                  # Trained models
├── results/                 # Training results
│   ├── plots/               # Training visualizations
│   └── metrics/             # Evaluation metrics
├── tests/                   # Unit tests
├── demo.py                  # Interactive demo
├── requirements.txt         # Dependencies
└── README.md                # This file
```
## Usage Examples
### Python API
```python
from src.inference import DocumentInference
# Load trained model
inference = DocumentInference("models/document_ner_model")
# Process a document
result = inference.process_document("path/to/invoice.pdf")
print(result['structured_data'])
# Output: {'Name': 'John Doe', 'Date': '01/15/2025', 'Amount': '$1,500.00'}
# Process text directly
result = inference.process_text_directly(
"Invoice sent to Alice Smith on 03/20/2025 Amount: $2,300.50"
)
print(result['structured_data'])
```
### REST API
```bash
# Upload and process a file
curl -X POST "http://localhost:8000/extract-from-file" \
-H "accept: application/json" \
-H "Content-Type: multipart/form-data" \
-F "file=@invoice.pdf"
# Process text directly
curl -X POST "http://localhost:8000/extract-from-text" \
-H "Content-Type: application/json" \
-d '{"text": "Invoice INV-001 for John Doe $1000"}'
```
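From Python, the response body can be consumed as plain JSON. The snippet below assumes the response carries the same `structured_data` mapping shown in the Python API example above; the exact keys depend on your trained model:

```python
import json

# Example response body (shape assumed from the Python API example above).
raw = '{"structured_data": {"Name": "John Doe", "Amount": "$1000", "Invoice_No": "INV-001"}}'

def structured_data(response_text: str) -> dict:
    """Pull the structured_data mapping out of an extraction response."""
    return json.loads(response_text).get("structured_data", {})

fields = structured_data(raw)
print(fields["Name"])  # -> John Doe
```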
### Web Interface
![Document Text Extraction Web Interface](assets/Screenshot%202025-09-27%20184723.png)
1. Go to `http://localhost:8000`
2. Choose "Upload File" or "Enter Text" tab
3. Upload document or paste text
4. Click "Extract Information"
5. View structured results
## Configuration
### Model Configuration
```python
from src.model import ModelConfig
config = ModelConfig(
model_name="distilbert-base-uncased",
max_length=512,
batch_size=16,
learning_rate=2e-5,
num_epochs=3,
entity_labels=['O', 'B-NAME', 'I-NAME', 'B-DATE', 'I-DATE', ...]
)
```
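The `entity_labels` list follows the BIO tagging scheme (`B-` opens an entity span, `I-` continues it, `O` is outside any entity). Hugging Face token-classification models expect this list expressed as two mappings; the full tag set below is an illustrative assumption, since the config above truncates it:

```python
# Assumed BIO tag set for illustration (the config snippet truncates the list).
entity_labels = [
    "O",
    "B-NAME", "I-NAME",
    "B-DATE", "I-DATE",
    "B-AMOUNT", "I-AMOUNT",
]

# Hugging Face token-classification models take these two mappings
# (id2label / label2id) in their config.
id2label = dict(enumerate(entity_labels))
label2id = {label: i for i, label in id2label.items()}
```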
### Environment Variables
```bash
# Optional: Custom Tesseract path
export TESSERACT_PATH="/usr/bin/tesseract"
# Optional: CUDA for GPU acceleration
export CUDA_VISIBLE_DEVICES=0
```
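Inside the code, the `TESSERACT_PATH` override can be read with a fallback. The default path here is an assumption (a common Linux location), and the `pytesseract` hookup is shown commented out since it is only needed when that library is used:

```python
import os

# Fall back to a common default when TESSERACT_PATH is unset
# (the default path is an assumption; adjust for your system).
tesseract_cmd = os.environ.get("TESSERACT_PATH", "/usr/bin/tesseract")

# If using pytesseract, point it at the binary like this:
# import pytesseract
# pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
```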
## Testing
```bash
# Run all tests
python -m pytest tests/
# Run specific test module
python tests/test_extraction.py
# Test with coverage
python -m pytest tests/ --cov=src --cov-report=html
```
## Performance Metrics
| Entity Type | Precision | Recall | F1-Score |
| ----------- | --------- | ------ | -------- |
| NAME | 0.95 | 0.92 | 0.93 |
| DATE | 0.98 | 0.96 | 0.97 |
| AMOUNT | 0.93 | 0.91 | 0.92 |
| INVOICE_NO | 0.89 | 0.87 | 0.88 |
| EMAIL | 0.97 | 0.94 | 0.95 |
| PHONE | 0.91 | 0.89 | 0.90 |
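F1 is the harmonic mean of precision and recall, so each row's last column follows from the first two:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# DATE row from the table above: P=0.98, R=0.96
print(round(f1(0.98, 0.96), 2))  # -> 0.97
```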
## Supported Entity Types
- **NAME**: Person names (John Doe, Dr. Smith)
- **DATE**: Dates in various formats (01/15/2025, March 15, 2025)
- **AMOUNT**: Monetary amounts ($1,500.00, 1000 USD)
- **INVOICE_NO**: Invoice numbers (INV-1001, BL-2045)
- **ADDRESS**: Street addresses
- **PHONE**: Phone numbers (555-123-4567, +1-555-123-4567)
- **EMAIL**: Email addresses (user@domain.com)
## Training Your Own Model
### 1. Prepare Your Data
```bash
# Place your documents in data/raw/
mkdir -p data/raw
cp your_invoices/*.pdf data/raw/
```
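A quick sanity check that your documents landed in the right place, using the supported extensions from the capabilities list above (the exact extension set the pipeline accepts may differ):

```python
from pathlib import Path
from typing import List

# Extensions from the "Multi-format Support" list; the pipeline's actual
# accepted set may differ slightly.
SUPPORTED = {".pdf", ".docx", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}

def list_documents(raw_dir: str = "data/raw") -> List[Path]:
    """List files in the raw-data directory with a supported extension."""
    root = Path(raw_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir() if p.suffix.lower() in SUPPORTED)

print(f"{len(list_documents())} documents ready for training")
```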
### 2. Run Training Pipeline
```python
from src.training_pipeline import TrainingPipeline, create_custom_config
# Create custom configuration
config = create_custom_config()
config.num_epochs = 5
config.batch_size = 16
# Run training
pipeline = TrainingPipeline(config)
model_path = pipeline.run_complete_pipeline("data/raw")
```
### 3. Evaluate Results
Training automatically generates:
- Loss curves: `results/plots/training_history.png`
- Metrics: `results/metrics/evaluation_results.json`
- Model checkpoints: `models/document_ner_model/`
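The metrics file is plain JSON, so inspecting a run programmatically is a one-liner; the keys inside are whatever the training pipeline wrote, so this sketch makes no assumption about them:

```python
import json
from pathlib import Path

def load_metrics(path: str = "results/metrics/evaluation_results.json") -> dict:
    """Load the evaluation metrics JSON written by the training pipeline."""
    return json.loads(Path(path).read_text())
```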
## Deployment
### Docker Deployment
```dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
# Install Tesseract
RUN apt-get update && apt-get install -y tesseract-ocr
COPY . .
EXPOSE 8000
CMD ["python", "api/app.py"]
```
### Cloud Deployment
- **AWS**: Deploy using ECS or Lambda
- **Google Cloud**: Use Cloud Run or Compute Engine
- **Azure**: Deploy with Container Instances
## Contributing
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- [Hugging Face Transformers](https://huggingface.co/transformers/) for the DistilBERT model
- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for optical character recognition
- [EasyOCR](https://github.com/JaidedAI/EasyOCR) for additional OCR capabilities
- [FastAPI](https://fastapi.tiangolo.com/) for the web framework
## Support
- Email: your-email@domain.com
- Issues: [GitHub Issues](https://github.com/sanjanb/small-language-model/issues)
- Documentation: [Project Wiki](https://github.com/sanjanb/small-language-model/wiki)
---
**Star this repository if it helped you!**