# Automated Document Text Extraction Using Small Language Model (SLM)

[![Python](https://img.shields.io/badge/python-v3.8+-blue.svg)](https://www.python.org/downloads/) [![PyTorch](https://img.shields.io/badge/PyTorch-v2.0+-red.svg)](https://pytorch.org/) [![Transformers](https://img.shields.io/badge/Transformers-v4.30+-yellow.svg)](https://huggingface.co/transformers/) [![FastAPI](https://img.shields.io/badge/FastAPI-v0.100+-green.svg)](https://fastapi.tiangolo.com/) [![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

> **Intelligent document processing system that extracts structured information from invoices, forms, and scanned documents using fine-tuned DistilBERT and transfer learning.**

## Quick Start

### 1. Installation

```bash
# Clone the repository
git clone https://github.com/sanjanb/small-language-model.git
cd small-language-model

# Install dependencies
pip install -r requirements.txt

# Install Tesseract OCR (Windows)
# Download from: https://github.com/UB-Mannheim/tesseract/wiki
# Add to PATH or set TESSERACT_PATH environment variable

# Install Tesseract OCR (Ubuntu/Debian)
sudo apt install tesseract-ocr

# Install Tesseract OCR (macOS)
brew install tesseract
```

### 2. Quick Demo

```bash
# Run the interactive demo
python demo.py

# Option 1: Complete demo with training and inference
# Option 2: Train model only
# Option 3: Test specific text
```

### 3. Web Interface

```bash
# Start the web API server
python api/app.py

# Open your browser to http://localhost:8000
# Upload documents or enter text for extraction
```

## Project Overview

This system combines **OCR technology**, **text preprocessing**, and a **fine-tuned DistilBERT model** to automatically extract structured information from documents. It uses transfer learning to adapt a pretrained transformer for document-specific Named Entity Recognition (NER).
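The text-preprocessing step between OCR and tokenization can be illustrated with a minimal sketch. The `clean_ocr_text` helper below is hypothetical (it is not the project's actual `data_preparation.py` code): it rejoins words hyphenated across line breaks, strips form-feed page markers, and collapses whitespace runs that OCR engines commonly emit.

```python
import re

def clean_ocr_text(raw: str) -> str:
    """Normalize raw OCR output before tokenization (illustrative only)."""
    text = raw.replace("\x0c", " ")          # strip form-feed page breaks
    text = re.sub(r"-\n(\w)", r"\1", text)   # rejoin words hyphenated at line breaks
    text = re.sub(r"\s+", " ", text)         # collapse runs of whitespace/newlines
    return text.strip()

print(clean_ocr_text("Invoice sent to Alice\nSmith   on 03/20/2025"))
# Invoice sent to Alice Smith on 03/20/2025
```

Cleaning of this kind matters because token-level NER labels are assigned per whitespace-delimited token; stray line breaks inside a name would otherwise split one entity across tokens the model never saw joined during training.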
### Key Capabilities

- **Multi-format Support**: PDF, DOCX, PNG, JPG, TIFF, BMP
- **Dual OCR Engine**: Tesseract + EasyOCR for maximum accuracy
- **Smart Entity Extraction**: Names, dates, amounts, addresses, phones, emails
- **Transfer Learning**: Fine-tuned DistilBERT for document-specific tasks
- **Web API**: RESTful endpoints with interactive interface
- **High Accuracy**: Regex validation + ML predictions

## System Architecture

```mermaid
graph TD
    A[Document Input] --> B[OCR Processing]
    B --> C[Text Cleaning]
    C --> D[Tokenization]
    D --> E[DistilBERT NER Model]
    E --> F[Entity Extraction]
    F --> G[Post-processing]
    G --> H[Structured JSON Output]

    I[Training Data] --> J[Auto-labeling]
    J --> K[Model Training]
    K --> E
```

## Project Structure

```
small-language-model/
├── src/                     # Core source code
│   ├── data_preparation.py  # OCR & dataset creation
│   ├── model.py             # DistilBERT NER model
│   ├── training_pipeline.py # Training orchestration
│   └── inference.py         # Document processing
├── api/                     # Web API service
│   └── app.py               # FastAPI application
├── config/                  # Configuration files
│   └── settings.py          # Project settings
├── data/                    # Data directories
│   ├── raw/                 # Input documents
│   └── processed/           # Processed datasets
├── models/                  # Trained models
├── results/                 # Training results
│   ├── plots/               # Training visualizations
│   └── metrics/             # Evaluation metrics
├── tests/                   # Unit tests
├── demo.py                  # Interactive demo
├── requirements.txt         # Dependencies
└── README.md                # This file
```

## Usage Examples

### Python API

```python
from src.inference import DocumentInference

# Load trained model
inference = DocumentInference("models/document_ner_model")

# Process a document
result = inference.process_document("path/to/invoice.pdf")
print(result['structured_data'])
# Output: {'Name': 'John Doe', 'Date': '01/15/2025', 'Amount': '$1,500.00'}

# Process text directly
result = inference.process_text_directly(
    "Invoice sent to Alice Smith on 03/20/2025 Amount: $2,300.50"
)
print(result['structured_data'])
```

### REST API

```bash
# Upload and process a file
curl -X POST "http://localhost:8000/extract-from-file" \
  -H "accept: application/json" \
  -H "Content-Type: multipart/form-data" \
  -F "file=@invoice.pdf"

# Process text directly
curl -X POST "http://localhost:8000/extract-from-text" \
  -H "Content-Type: application/json" \
  -d '{"text": "Invoice INV-001 for John Doe $1000"}'
```

### Web Interface

![Document Text Extraction Web Interface](assets/Screenshot%202025-09-27%20184723.png)

1. Go to `http://localhost:8000`
2. Choose "Upload File" or "Enter Text" tab
3. Upload a document or paste text
4. Click "Extract Information"
5. View structured results

## Configuration

### Model Configuration

```python
from src.model import ModelConfig

config = ModelConfig(
    model_name="distilbert-base-uncased",
    max_length=512,
    batch_size=16,
    learning_rate=2e-5,
    num_epochs=3,
    entity_labels=['O', 'B-NAME', 'I-NAME', 'B-DATE', 'I-DATE', ...]
)
```

### Environment Variables

```bash
# Optional: Custom Tesseract path
export TESSERACT_PATH="/usr/bin/tesseract"

# Optional: CUDA for GPU acceleration
export CUDA_VISIBLE_DEVICES=0
```

## Testing

```bash
# Run all tests
python -m pytest tests/

# Run specific test module
python tests/test_extraction.py

# Test with coverage
python -m pytest tests/ --cov=src --cov-report=html
```

## Performance Metrics

| Entity Type | Precision | Recall | F1-Score |
| ----------- | --------- | ------ | -------- |
| NAME        | 0.95      | 0.92   | 0.94     |
| DATE        | 0.98      | 0.96   | 0.97     |
| AMOUNT      | 0.93      | 0.91   | 0.92     |
| INVOICE_NO  | 0.89      | 0.87   | 0.88     |
| EMAIL       | 0.97      | 0.94   | 0.95     |
| PHONE       | 0.91      | 0.89   | 0.90     |

## Supported Entity Types

- **NAME**: Person names (John Doe, Dr. Smith)
- **DATE**: Dates in various formats (01/15/2025, March 15, 2025)
- **AMOUNT**: Monetary amounts ($1,500.00, 1000 USD)
- **INVOICE_NO**: Invoice numbers (INV-1001, BL-2045)
- **ADDRESS**: Street addresses
- **PHONE**: Phone numbers (555-123-4567, +1-555-123-4567)
- **EMAIL**: Email addresses (user@domain.com)

## Training Your Own Model

### 1. Prepare Your Data

```bash
# Place your documents in data/raw/
mkdir -p data/raw
cp your_invoices/*.pdf data/raw/
```

### 2. Run Training Pipeline

```python
from src.training_pipeline import TrainingPipeline, create_custom_config

# Create custom configuration
config = create_custom_config()
config.num_epochs = 5
config.batch_size = 16

# Run training
pipeline = TrainingPipeline(config)
model_path = pipeline.run_complete_pipeline("data/raw")
```

### 3. Evaluate Results

Training automatically generates:

- Loss curves: `results/plots/training_history.png`
- Metrics: `results/metrics/evaluation_results.json`
- Model checkpoints: `models/document_ner_model/`

## Deployment

### Docker Deployment

```dockerfile
FROM python:3.9-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

# Install Tesseract
RUN apt-get update && apt-get install -y tesseract-ocr

COPY . .

EXPOSE 8000
CMD ["python", "api/app.py"]
```

### Cloud Deployment

- **AWS**: Deploy using ECS or Lambda
- **Google Cloud**: Use Cloud Run or Compute Engine
- **Azure**: Deploy with Container Instances

## Contributing

1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

## License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
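The post-processing step in the architecture diagram — turning the model's token-level BIO tags (the `B-NAME`/`I-NAME` scheme from `ModelConfig.entity_labels`) into the structured output shown in the usage examples — can be sketched as follows. The `decode_bio` helper is a generic illustration, not the project's actual `inference.py` code:

```python
def decode_bio(tokens, labels):
    """Group BIO-tagged tokens into (entity_type, text) spans (illustrative only)."""
    entities, current_type, current_tokens = [], None, []
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):                        # a new entity begins
            if current_type:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = label[2:], [token]
        elif label.startswith("I-") and current_type == label[2:]:
            current_tokens.append(token)                  # continue the open entity
        else:                                             # "O" or a mismatched tag closes it
            if current_type:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type:                                      # flush an entity open at end of input
        entities.append((current_type, " ".join(current_tokens)))
    return entities

tokens = ["Invoice", "sent", "to", "Alice", "Smith", "on", "03/20/2025"]
labels = ["O", "O", "O", "B-NAME", "I-NAME", "O", "B-DATE"]
print(decode_bio(tokens, labels))
# [('NAME', 'Alice Smith'), ('DATE', '03/20/2025')]
```

A decoding pass like this is also where regex validation can be layered on top of the model's predictions, e.g. rejecting a `DATE` span that does not match any known date pattern.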
## Acknowledgments

- [Hugging Face Transformers](https://huggingface.co/transformers/) for the DistilBERT model
- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for optical character recognition
- [EasyOCR](https://github.com/JaidedAI/EasyOCR) for additional OCR capabilities
- [FastAPI](https://fastapi.tiangolo.com/) for the web framework

## Support

- Email: your-email@domain.com
- Issues: [GitHub Issues](https://github.com/your-username/small-language-model/issues)
- Documentation: [Project Wiki](https://github.com/your-username/small-language-model/wiki)

---

**Star this repository if it helped you!**