File size: 8,960 Bytes

eb53bb5

# Automated Document Text Extraction Using Small Language Model (SLM)

[![Python](https://img.shields.io/badge/python-v3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-v2.0+-red.svg)](https://pytorch.org/)
[![Transformers](https://img.shields.io/badge/Transformers-v4.30+-yellow.svg)](https://huggingface.co/transformers/)
[![FastAPI](https://img.shields.io/badge/FastAPI-v0.100+-green.svg)](https://fastapi.tiangolo.com/)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)

> **Intelligent document processing system that extracts structured information from invoices, forms, and scanned documents using fine-tuned DistilBERT and transfer learning.**

## Quick Start

### 1. Installation

```bash

# Clone the repository

git clone https://github.com/sanjanb/small-language-model.git

cd small-language-model



# Install dependencies

pip install -r requirements.txt



# Install Tesseract OCR (Windows)

# Download from: https://github.com/UB-Mannheim/tesseract/wiki

# Add to PATH or set TESSERACT_PATH environment variable



# Install Tesseract OCR (Ubuntu/Debian)

sudo apt install tesseract-ocr



# Install Tesseract OCR (macOS)

brew install tesseract

```

### 2. Quick Demo

```bash

# Run the interactive demo

python demo.py



# Option 1: Complete demo with training and inference

# Option 2: Train model only

# Option 3: Test specific text

```

### 3. Web Interface

```bash

# Start the web API server

python api/app.py



# Open your browser to http://localhost:8000

# Upload documents or enter text for extraction

```

## Project Overview

This system combines **OCR technology**, **text preprocessing**, and a **fine-tuned DistilBERT model** to automatically extract structured information from documents. It uses transfer learning to adapt a pretrained transformer for document-specific Named Entity Recognition (NER).

### Key Capabilities

- **Multi-format Support**: PDF, DOCX, PNG, JPG, TIFF, BMP
- **Dual OCR Engine**: Tesseract + EasyOCR for maximum accuracy
- **Smart Entity Extraction**: Names, dates, amounts, addresses, phones, emails
- **Transfer Learning**: Fine-tuned DistilBERT for document-specific tasks
- **Web API**: RESTful endpoints with interactive interface
- **High Accuracy**: Regex validation + ML predictions

## System Architecture

```mermaid

graph TD

    A[Document Input] --> B[OCR Processing]

    B --> C[Text Cleaning]

    C --> D[Tokenization]

    D --> E[DistilBERT NER Model]

    E --> F[Entity Extraction]

    F --> G[Post-processing]

    G --> H[Structured JSON Output]



    I[Training Data] --> J[Auto-labeling]

    J --> K[Model Training]

    K --> E

```

## Project Structure

```

small-language-model/

├── src/                    # Core source code

│   ├── data_preparation.py  # OCR & dataset creation

│   ├── model.py             # DistilBERT NER model

│   ├── training_pipeline.py # Training orchestration

│   └── inference.py         # Document processing

├── api/                    # Web API service

│   └── app.py              # FastAPI application

├── config/                 # Configuration files

│   └── settings.py         # Project settings

├── data/                   # Data directories

│   ├── raw/                # Input documents

│   └── processed/          # Processed datasets

├── models/                 # Trained models

├── results/               # Training results

│   ├── plots/             # Training visualizations

│   └── metrics/           # Evaluation metrics

├── tests/                 # Unit tests

├── demo.py               # Interactive demo

├── requirements.txt      # Dependencies

└── README.md            # This file

```

## Usage Examples

### Python API

```python

from src.inference import DocumentInference



# Load trained model

inference = DocumentInference("models/document_ner_model")



# Process a document

result = inference.process_document("path/to/invoice.pdf")

print(result['structured_data'])

# Output: {'Name': 'John Doe', 'Date': '01/15/2025', 'Amount': '$1,500.00'}



# Process text directly

result = inference.process_text_directly(

    "Invoice sent to Alice Smith on 03/20/2025 Amount: $2,300.50"

)

print(result['structured_data'])

```

### REST API

```bash

# Upload and process a file

curl -X POST "http://localhost:8000/extract-from-file" \

     -H "accept: application/json" \

     -H "Content-Type: multipart/form-data" \

     -F "file=@invoice.pdf"



# Process text directly

curl -X POST "http://localhost:8000/extract-from-text" \

     -H "Content-Type: application/json" \

     -d '{"text": "Invoice INV-001 for John Doe $1000"}'

```

### Web Interface

![Document Text Extraction Web Interface](assets/Screenshot%202025-09-27%20184723.png)

1. Go to `http://localhost:8000`
2. Choose "Upload File" or "Enter Text" tab
3. Upload document or paste text
4. Click "Extract Information"
5. View structured results

## Configuration

### Model Configuration

```python

from src.model import ModelConfig



config = ModelConfig(

    model_name="distilbert-base-uncased",

    max_length=512,

    batch_size=16,

    learning_rate=2e-5,

    num_epochs=3,

    entity_labels=['O', 'B-NAME', 'I-NAME', 'B-DATE', 'I-DATE', ...]

)

```

### Environment Variables

```bash

# Optional: Custom Tesseract path

export TESSERACT_PATH="/usr/bin/tesseract"



# Optional: CUDA for GPU acceleration

export CUDA_VISIBLE_DEVICES=0

```

## Testing

```bash

# Run all tests

python -m pytest tests/



# Run specific test module

python tests/test_extraction.py



# Test with coverage

python -m pytest tests/ --cov=src --cov-report=html

```

## Performance Metrics

| Entity Type | Precision | Recall | F1-Score |
| ----------- | --------- | ------ | -------- |
| NAME        | 0.95      | 0.92   | 0.94     |
| DATE        | 0.98      | 0.96   | 0.97     |
| AMOUNT      | 0.93      | 0.91   | 0.92     |
| INVOICE_NO  | 0.89      | 0.87   | 0.88     |

| EMAIL       | 0.97      | 0.94   | 0.95     |

| PHONE       | 0.91      | 0.89   | 0.90     |



## Supported Entity Types



- **NAME**: Person names (John Doe, Dr. Smith)

- **DATE**: Dates in various formats (01/15/2025, March 15, 2025)

- **AMOUNT**: Monetary amounts ($1,500.00, 1000 USD)

- **INVOICE_NO**: Invoice numbers (INV-1001, BL-2045)

- **ADDRESS**: Street addresses

- **PHONE**: Phone numbers (555-123-4567, +1-555-123-4567)

- **EMAIL**: Email addresses (user@domain.com)



## Training Your Own Model



### 1. Prepare Your Data



```bash

# Place your documents in data/raw/

mkdir -p data/raw

cp your_invoices/*.pdf data/raw/

```



### 2. Run Training Pipeline



```python

from src.training_pipeline import TrainingPipeline, create_custom_config



# Create custom configuration

config = create_custom_config()

config.num_epochs = 5

config.batch_size = 16



# Run training

pipeline = TrainingPipeline(config)

model_path = pipeline.run_complete_pipeline("data/raw")

```



### 3. Evaluate Results



Training automatically generates:



- Loss curves: `results/plots/training_history.png`

- Metrics: `results/metrics/evaluation_results.json`

- Model checkpoints: `models/document_ner_model/`



## Deployment



### Docker Deployment



```dockerfile

FROM python:3.9-slim



WORKDIR /app

COPY requirements.txt .

RUN pip install -r requirements.txt



# Install Tesseract

RUN apt-get update && apt-get install -y tesseract-ocr



COPY . .

EXPOSE 8000



CMD ["python", "api/app.py"]

```



### Cloud Deployment



- **AWS**: Deploy using ECS or Lambda

- **Google Cloud**: Use Cloud Run or Compute Engine

- **Azure**: Deploy with Container Instances



## Contributing



1. Fork the repository

2. Create your feature branch (`git checkout -b feature/AmazingFeature`)

3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)

4. Push to the branch (`git push origin feature/AmazingFeature`)

5. Open a Pull Request



## License



This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.



## Acknowledgments



- [Hugging Face Transformers](https://huggingface.co/transformers/) for the DistilBERT model

- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for optical character recognition

- [EasyOCR](https://github.com/JaidedAI/EasyOCR) for additional OCR capabilities

- [FastAPI](https://fastapi.tiangolo.com/) for the web framework



## Support



- Email: your-email@domain.com

- Issues: [GitHub Issues](https://github.com/your-username/small-language-model/issues)

- Documentation: [Project Wiki](https://github.com/your-username/small-language-model/wiki)



---



**Star this repository if it helped you!**