# Automated Document Text Extraction Using Small Language Model (SLM)
[![Python](https://img.shields.io/badge/python-v3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-v2.0+-red.svg)](https://pytorch.org/)
[![Transformers](https://img.shields.io/badge/Transformers-v4.30+-yellow.svg)](https://huggingface.co/transformers/)
[![FastAPI](https://img.shields.io/badge/FastAPI-v0.100+-green.svg)](https://fastapi.tiangolo.com/)
[![License](https://img.shields.io/badge/license-MIT-blue.svg)](LICENSE)
> **Intelligent document processing system that extracts structured information from invoices, forms, and scanned documents using fine-tuned DistilBERT and transfer learning.**
## Quick Start
### 1. Installation
```bash
# Clone the repository
git clone https://github.com/sanjanb/small-language-model.git
cd small-language-model
# Install dependencies
pip install -r requirements.txt
# Install Tesseract OCR (Windows)
# Download from: https://github.com/UB-Mannheim/tesseract/wiki
# Add to PATH or set TESSERACT_PATH environment variable
# Install Tesseract OCR (Ubuntu/Debian)
sudo apt install tesseract-ocr
# Install Tesseract OCR (macOS)
brew install tesseract
```
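Before running the demo, it can help to confirm that the Tesseract binary is actually reachable. A minimal check, assuming only that `tesseract` should be on your `PATH` (or pointed to via `TESSERACT_PATH`):

```python
import shutil

def tesseract_available() -> bool:
    """Return True if the tesseract binary is discoverable on PATH."""
    return shutil.which("tesseract") is not None

if __name__ == "__main__":
    if tesseract_available():
        print("Tesseract found")
    else:
        print("Tesseract not on PATH; set TESSERACT_PATH or adjust PATH")
```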
### 2. Quick Demo
```bash
# Run the interactive demo
python demo.py
# Option 1: Complete demo with training and inference
# Option 2: Train model only
# Option 3: Test specific text
```
### 3. Web Interface
```bash
# Start the web API server
python api/app.py
# Open your browser to http://localhost:8000
# Upload documents or enter text for extraction
```
## Project Overview
This system combines **OCR technology**, **text preprocessing**, and a **fine-tuned DistilBERT model** to automatically extract structured information from documents. It uses transfer learning to adapt a pretrained transformer for document-specific Named Entity Recognition (NER).
### Key Capabilities
- **Multi-format Support**: PDF, DOCX, PNG, JPG, TIFF, BMP
- **Dual OCR Engine**: Tesseract + EasyOCR for more robust text recognition
- **Smart Entity Extraction**: Names, dates, amounts, addresses, phones, emails
- **Transfer Learning**: Fine-tuned DistilBERT for document-specific tasks
- **Web API**: RESTful endpoints with interactive interface
- **Validated Output**: Regex validation cross-checks ML predictions
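The last point (regex validation of model predictions) can be sketched as a post-filter: a predicted entity is kept only if its text also matches a pattern for its label. The patterns below are illustrative assumptions, not the project's actual ones:

```python
import re

# Hypothetical post-filter patterns; the project's real patterns live in src/.
VALIDATORS = {
    "EMAIL": re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$"),
    "AMOUNT": re.compile(r"^\$?\d+(,\d{3})*(\.\d{2})?$"),
    "PHONE": re.compile(r"^\+?\d[\d\s().-]{6,}\d$"),
}

def validate_entity(label: str, text: str) -> bool:
    """Accept a predicted entity only if it matches the regex for its label.

    Labels without a pattern (e.g. NAME) pass through unchecked.
    """
    pattern = VALIDATORS.get(label)
    return True if pattern is None else bool(pattern.match(text))
```

This keeps the model free to generalize while cheap rules catch obvious OCR noise in highly structured fields.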
## System Architecture
```mermaid
graph TD
A[Document Input] --> B[OCR Processing]
B --> C[Text Cleaning]
C --> D[Tokenization]
D --> E[DistilBERT NER Model]
E --> F[Entity Extraction]
F --> G[Post-processing]
G --> H[Structured JSON Output]
I[Training Data] --> J[Auto-labeling]
J --> K[Model Training]
K --> E
```
## Project Structure
```
small-language-model/
├── src/                     # Core source code
│   ├── data_preparation.py  # OCR & dataset creation
│   ├── model.py             # DistilBERT NER model
│   ├── training_pipeline.py # Training orchestration
│   └── inference.py         # Document processing
├── api/                     # Web API service
│   └── app.py               # FastAPI application
├── config/                  # Configuration files
│   └── settings.py          # Project settings
├── data/                    # Data directories
│   ├── raw/                 # Input documents
│   └── processed/           # Processed datasets
├── models/                  # Trained models
├── results/                 # Training results
│   ├── plots/               # Training visualizations
│   └── metrics/             # Evaluation metrics
├── tests/                   # Unit tests
├── demo.py                  # Interactive demo
├── requirements.txt         # Dependencies
└── README.md                # This file
```
## Usage Examples
### Python API
```python
from src.inference import DocumentInference
# Load trained model
inference = DocumentInference("models/document_ner_model")
# Process a document
result = inference.process_document("path/to/invoice.pdf")
print(result['structured_data'])
# Output: {'Name': 'John Doe', 'Date': '01/15/2025', 'Amount': '$1,500.00'}
# Process text directly
result = inference.process_text_directly(
"Invoice sent to Alice Smith on 03/20/2025 Amount: $2,300.50"
)
print(result['structured_data'])
```
### REST API
```bash
# Upload and process a file
curl -X POST "http://localhost:8000/extract-from-file" \
-H "accept: application/json" \
-H "Content-Type: multipart/form-data" \
-F "file=@invoice.pdf"
# Process text directly
curl -X POST "http://localhost:8000/extract-from-text" \
-H "Content-Type: application/json" \
-d '{"text": "Invoice INV-001 for John Doe $1000"}'
```
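From Python, the response body can be consumed as plain JSON. The snippet below assumes the response carries the same `structured_data` mapping shown in the Python API example above; the exact keys depend on your trained model:

```python
import json

# Example response body (shape assumed from the Python API example above).
raw = '{"structured_data": {"Name": "John Doe", "Amount": "$1000", "Invoice_No": "INV-001"}}'

def structured_data(response_text: str) -> dict:
    """Pull the structured_data mapping out of an extraction response."""
    return json.loads(response_text).get("structured_data", {})

fields = structured_data(raw)
print(fields["Name"])  # -> John Doe
```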
### Web Interface
![Document Text Extraction Web Interface](assets/Screenshot%202025-09-27%20184723.png)
1. Go to `http://localhost:8000`
2. Choose "Upload File" or "Enter Text" tab
3. Upload document or paste text
4. Click "Extract Information"
5. View structured results
## Configuration
### Model Configuration
```python
from src.model import ModelConfig
config = ModelConfig(
model_name="distilbert-base-uncased",
max_length=512,
batch_size=16,
learning_rate=2e-5,
num_epochs=3,
entity_labels=['O', 'B-NAME', 'I-NAME', 'B-DATE', 'I-DATE', ...]
)
```
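The `entity_labels` list follows the BIO tagging scheme (`B-` opens an entity span, `I-` continues it, `O` is outside any entity). Hugging Face token-classification models expect this list expressed as two mappings; the full tag set below is an illustrative assumption, since the config above truncates it:

```python
# Assumed BIO tag set for illustration (the config snippet truncates the list).
entity_labels = [
    "O",
    "B-NAME", "I-NAME",
    "B-DATE", "I-DATE",
    "B-AMOUNT", "I-AMOUNT",
]

# Hugging Face token-classification models take these two mappings
# (id2label / label2id) in their config.
id2label = dict(enumerate(entity_labels))
label2id = {label: i for i, label in id2label.items()}
```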
### Environment Variables
```bash
# Optional: Custom Tesseract path
export TESSERACT_PATH="/usr/bin/tesseract"
# Optional: CUDA for GPU acceleration
export CUDA_VISIBLE_DEVICES=0
```
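Inside the code, the `TESSERACT_PATH` override can be read with a fallback. The default path here is an assumption (a common Linux location), and the `pytesseract` hookup is shown commented out since it is only needed when that library is used:

```python
import os

# Fall back to a common default when TESSERACT_PATH is unset
# (the default path is an assumption; adjust for your system).
tesseract_cmd = os.environ.get("TESSERACT_PATH", "/usr/bin/tesseract")

# If using pytesseract, point it at the binary like this:
# import pytesseract
# pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
```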
## Testing
```bash
# Run all tests
python -m pytest tests/
# Run specific test module
python tests/test_extraction.py
# Test with coverage
python -m pytest tests/ --cov=src --cov-report=html
```
## Performance Metrics
| Entity Type | Precision | Recall | F1-Score |
| ----------- | --------- | ------ | -------- |
| NAME | 0.95 | 0.92 | 0.93 |
| DATE | 0.98 | 0.96 | 0.97 |
| AMOUNT | 0.93 | 0.91 | 0.92 |
| INVOICE_NO | 0.89 | 0.87 | 0.88 |
| EMAIL | 0.97 | 0.94 | 0.95 |
| PHONE | 0.91 | 0.89 | 0.90 |
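F1 is the harmonic mean of precision and recall, so each row's last column follows from the first two:

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# DATE row from the table above: P=0.98, R=0.96
print(round(f1(0.98, 0.96), 2))  # -> 0.97
```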
## Supported Entity Types
- **NAME**: Person names (John Doe, Dr. Smith)
- **DATE**: Dates in various formats (01/15/2025, March 15, 2025)
- **AMOUNT**: Monetary amounts ($1,500.00, 1000 USD)
- **INVOICE_NO**: Invoice numbers (INV-1001, BL-2045)
- **ADDRESS**: Street addresses
- **PHONE**: Phone numbers (555-123-4567, +1-555-123-4567)
- **EMAIL**: Email addresses (user@domain.com)
## Training Your Own Model
### 1. Prepare Your Data
```bash
# Place your documents in data/raw/
mkdir -p data/raw
cp your_invoices/*.pdf data/raw/
```
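A quick sanity check that your documents landed in the right place, using the supported extensions from the capabilities list above (the exact extension set the pipeline accepts may differ):

```python
from pathlib import Path
from typing import List

# Extensions from the "Multi-format Support" list; the pipeline's actual
# accepted set may differ slightly.
SUPPORTED = {".pdf", ".docx", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}

def list_documents(raw_dir: str = "data/raw") -> List[Path]:
    """List files in the raw-data directory with a supported extension."""
    root = Path(raw_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir() if p.suffix.lower() in SUPPORTED)

print(f"{len(list_documents())} documents ready for training")
```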
### 2. Run Training Pipeline
```python
from src.training_pipeline import TrainingPipeline, create_custom_config
# Create custom configuration
config = create_custom_config()
config.num_epochs = 5
config.batch_size = 16
# Run training
pipeline = TrainingPipeline(config)
model_path = pipeline.run_complete_pipeline("data/raw")
```
### 3. Evaluate Results
Training automatically generates:
- Loss curves: `results/plots/training_history.png`
- Metrics: `results/metrics/evaluation_results.json`
- Model checkpoints: `models/document_ner_model/`
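The metrics file is plain JSON, so inspecting a run programmatically is a one-liner; the keys inside are whatever the training pipeline wrote, so this sketch makes no assumption about them:

```python
import json
from pathlib import Path

def load_metrics(path: str = "results/metrics/evaluation_results.json") -> dict:
    """Load the evaluation metrics JSON written by the training pipeline."""
    return json.loads(Path(path).read_text())
```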
## Deployment
### Docker Deployment
```dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
# Install Tesseract
RUN apt-get update && apt-get install -y tesseract-ocr
COPY . .
EXPOSE 8000
CMD ["python", "api/app.py"]
```
### Cloud Deployment
- **AWS**: Deploy using ECS or Lambda
- **Google Cloud**: Use Cloud Run or Compute Engine
- **Azure**: Deploy with Container Instances
## Contributing
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## Acknowledgments
- [Hugging Face Transformers](https://huggingface.co/transformers/) for the DistilBERT model
- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract) for optical character recognition
- [EasyOCR](https://github.com/JaidedAI/EasyOCR) for additional OCR capabilities
- [FastAPI](https://fastapi.tiangolo.com/) for the web framework
## Support
- Email: your-email@domain.com
- Issues: [GitHub Issues](https://github.com/sanjanb/small-language-model/issues)
- Documentation: [Project Wiki](https://github.com/sanjanb/small-language-model/wiki)
---
**Star this repository if it helped you!**