	---
	title: LCR OCR API
	emoji: 📄
	colorFrom: blue
	colorTo: green
	sdk: docker
	app_file: app.py
	pinned: false
	---

	# Local Civil Registry Document Digitization and Data Extraction

	## Using CRNN+CTC, Multinomial Naive Bayes, and Named Entity Recognition

	Thesis Project by:
	- Shane Mark C. Blanco
	- Princess A. Pasamonte
	- Irish Faith G. Ramirez

	Institution: Tarlac State University, College of Computer Studies

	---

	## 📋 Project Overview

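	This system automates the digitization and data extraction of Philippine Civil Registry documents. It combines CRNN+CTC for text recognition, Multinomial Naive Bayes for document classification, and spaCy Named Entity Recognition for entity extraction.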

	### Target Documents:
	- Form 1A - Birth Certificate
	- Form 2A - Death Certificate
	- Form 3A - Marriage Certificate
	- Form 90 - Application for Marriage License

	### Key Features:
	- ✅ OCR for printed and handwritten text
	- ✅ Automatic document classification
	- ✅ Named entity extraction (names, dates, places)
	- ✅ Auto-fill digital forms
	- ✅ MySQL database storage
	- ✅ Searchable digital archive
	- ✅ Data visualization dashboard

	---

	## 🏗️ System Architecture

	```
	Input: Scanned Civil Registry Form
	↓
	1. Image Preprocessing
	↓
	2. CRNN+CTC → Text Recognition
	↓
	3. Multinomial Naive Bayes → Document Classification
	↓
	4. spaCy NER → Entity Extraction
	↓
	5. Data Validation & Storage → MySQL Database
	↓
	Output: Digitized & Searchable Record
	```

	---

	## 🚀 Quick Start

	### Prerequisites

	- Python 3.8+
	- CUDA-capable GPU (recommended) or CPU
	- 8GB RAM minimum

	### Installation

	```bash
	# 1. Clone or download the project, then enter its directory
	cd civil_registry_ocr

	# 2. Create and activate a virtual environment
	python -m venv venv
	source venv/bin/activate    # Linux/macOS
	# venv\Scripts\activate     # Windows (run this instead of the line above)

	# 3. Install dependencies
	pip install -r requirements.txt

	# 4. Download the spaCy English model
	python -m spacy download en_core_web_sm
	```

	### Quick Test

	```python
	from inference import CivilRegistryOCR

	# Load model
	ocr = CivilRegistryOCR('checkpoints/best_model.pth')

	# Recognize text
	text = ocr.predict('test_images/sample_name.jpg')
	print(f"Recognized: {text}")
	```

	---

	## 📁 Project Files

	### Core Implementation Files:

	1. crnn_model.py - CRNN+CTC neural network architecture
	2. dataset.py - Data loading and preprocessing
	3. train.py - Model training script
	4. inference.py - Prediction and inference
	5. utils.py - Helper functions and metrics
	6. requirements.txt - Python dependencies
	7. IMPLEMENTATION_GUIDE.md - Detailed implementation guide

	### Additional Components (To be created):

	8. document_classifier.py - Multinomial Naive Bayes classifier
	9. ner_extractor.py - Named Entity Recognition
	10. web_app.py - Web application (Flask/FastAPI)
	11. database.py - MySQL integration

	---

	## 📊 Training the Model

	### 1. Prepare Your Data

	Organize images and labels:
	```
	data/
	  train/
	    form1a/
	      name_001.jpg
	      name_001.txt
	    form2a/
	      ...
	  val/
	    ...
	```
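
If the collected images start out as one unsplit pool, a reproducible shuffle-and-split is one way to fill the `train/` and `val/` folders (plus a held-out test set). The sketch below is only an illustration: the `split_dataset` helper, the 80/10/10 default, and the seed are assumptions, not part of the project code.

```python
import random


def split_dataset(items, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle reproducibly and split into train / val / test lists.

    Hypothetical helper; the fractions and seed are illustrative defaults.
    """
    if val_fraction + test_fraction >= 1:
        raise ValueError("val_fraction + test_fraction must be below 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed gives a repeatable split
    n_val = int(len(shuffled) * val_fraction)
    n_test = int(len(shuffled) * test_fraction)
    val = shuffled[:n_val]
    test = shuffled[n_val:n_val + n_test]
    train = shuffled[n_val + n_test:]
    return train, val, test
```

Splitting by file name before any augmentation keeps augmented copies of one scan from leaking across the train and validation sets.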

	### 2. Create Annotations

	```python
	from dataset import create_annotation_file

	create_annotation_file('data/train', 'data/train_annotations.json')
	create_annotation_file('data/val', 'data/val_annotations.json')
	```
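
The internals of `create_annotation_file` are not shown in this README. As a rough sketch of what such a step could look like, assuming each image sits next to a `.txt` file with the same stem and that the form type is the parent folder name, the hypothetical `build_annotations` helper below pairs them up and writes a JSON list. The field names are illustrative and may not match what `dataset.py` expects.

```python
import json
from pathlib import Path


def build_annotations(data_dir, output_path):
    """Pair each .jpg with its same-named .txt label and write a JSON list.

    Hypothetical helper: the real create_annotation_file may use a
    different layout or different field names.
    """
    root = Path(data_dir)
    records = []
    for image_path in sorted(root.rglob("*.jpg")):
        label_path = image_path.with_suffix(".txt")
        if not label_path.exists():
            continue  # skip images that have no transcription yet
        records.append({
            "image": image_path.relative_to(root).as_posix(),
            "text": label_path.read_text(encoding="utf-8").strip(),
            "form_type": image_path.parent.name,  # e.g. "form1a"
        })
    Path(output_path).write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return records
```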

	### 3. Train Model

	```bash
	python train.py
	```

	Monitor metrics:
	- Character Error Rate (CER)
	- Word Error Rate (WER)
	- Training/Validation Loss

	### 4. Evaluate

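	```python
	from inference import CivilRegistryOCR
	from utils import calculate_cer, calculate_wer

	ocr = CivilRegistryOCR('checkpoints/best_model.pth')

	# test_images: list of image paths; ground_truths: the matching label strings
	predictions = [ocr.predict(img) for img in test_images]
	cer = calculate_cer(predictions, ground_truths)
	print(f"CER: {cer:.2f}%")
	```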
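
For reference, CER and WER are usually defined as the edit distance between prediction and ground truth divided by the length of the ground truth, counted in characters or words respectively. The sketch below shows that computation in plain Python. The function names are illustrative, and the project's own `calculate_cer` and `calculate_wer` in `utils.py` may aggregate differently (for example, averaging per sample instead of over the whole corpus).

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer_percent(predictions, ground_truths):
    """Corpus-level CER: total character edits / total reference characters."""
    edits = sum(edit_distance(gt, pred)
                for pred, gt in zip(predictions, ground_truths))
    total = sum(len(gt) for gt in ground_truths)
    return 100.0 * edits / total if total else 0.0


def wer_percent(predictions, ground_truths):
    """Same computation over whitespace-separated words."""
    edits = sum(edit_distance(gt.split(), pred.split())
                for pred, gt in zip(predictions, ground_truths))
    total = sum(len(gt.split()) for gt in ground_truths)
    return 100.0 * edits / total if total else 0.0
```

Because the denominator is the reference length, both rates can exceed 100% when a prediction is much longer than the ground truth.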

	---

	## 🌐 Web Application

	### Start the Server

	```bash
	python web_app.py
	```

	### API Endpoints

	POST /api/ocr - Process document
	```bash
	curl -X POST -F "file=@birth_cert.jpg" http://localhost:8000/api/ocr
	```

	Response:
	```json
	{
	  "text": "Juan Dela Cruz\n01/15/1990\nTarlac City",
	  "form_type": "form1a",
	  "entities": {
	    "persons": ["Juan Dela Cruz"],
	    "dates": ["01/15/1990"],
	    "locations": ["Tarlac City"]
	  }
	}
	```
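
On the client side, the response above can be parsed into a small typed record. This is a minimal sketch using only the standard library. The `OcrResult` class and `parse_ocr_response` function are illustrative names, and the code assumes the response keeps the shape shown in the sample.

```python
import json
from dataclasses import dataclass, field
from typing import List


@dataclass
class OcrResult:
    text: str
    form_type: str
    persons: List[str] = field(default_factory=list)
    dates: List[str] = field(default_factory=list)
    locations: List[str] = field(default_factory=list)


def parse_ocr_response(body):
    """Turn the JSON body returned by POST /api/ocr into an OcrResult."""
    payload = json.loads(body)
    entities = payload.get("entities", {})  # tolerate a missing entities block
    return OcrResult(
        text=payload["text"],
        form_type=payload["form_type"],
        persons=entities.get("persons", []),
        dates=entities.get("dates", []),
        locations=entities.get("locations", []),
    )
```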

	---

	## 🎯 Expected Performance

	Based on thesis objectives:

	### CRNN+CTC Model:
	- Target CER: < 5%
	- Target Accuracy: > 95%
	- Handles both printed and handwritten text

	### Document Classifier (MNB):
	- Target Accuracy: > 90%
	- Fast classification (< 100ms)

	### NER (spaCy):
	- Target F1 Score: > 85%
	- Extracts: Names, Dates, Places

	---

	## 🧪 Testing

	### ISO 25010 Evaluation

	Usability Testing (metrics to measure):
	- Task completion rate
	- Average time per task
	- User satisfaction score (SUS)
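
The SUS item in the usability metrics refers to the System Usability Scale. Assuming the standard ten-item questionnaire with answers from 1 to 5 is used, one participant's score can be computed as sketched below. The `sus_score` name is illustrative.

```python
def sus_score(responses):
    """Score one participant's 10 SUS answers (each 1-5) on a 0-100 scale."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs exactly 10 responses, each from 1 to 5")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += response - 1  # odd items are positively worded
        else:
            total += 5 - response  # even items are negatively worded
    return total * 2.5
```

The overall figure for the evaluation is then the mean of the per-participant scores.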

	Reliability Testing (metrics to measure):
	- System uptime %
	- Error rate
	- Recovery time

	### Confusion Matrix

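	```python
	import matplotlib.pyplot as plt
	import seaborn as sns
	from sklearn.metrics import confusion_matrix

	# true_labels and predicted_labels hold one form type per test document
	cm = confusion_matrix(true_labels, predicted_labels)
	sns.heatmap(cm, annot=True, fmt="d")  # fmt="d" prints integer counts
	plt.show()
	```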

	---

	## 💾 Database Schema

	### Birth Certificates Table
	```sql
	CREATE TABLE birth_certificates (
	    id INT PRIMARY KEY AUTO_INCREMENT,
	    child_name VARCHAR(255),
	    date_of_birth DATE,
	    place_of_birth VARCHAR(255),
	    sex CHAR(1),
	    father_name VARCHAR(255),
	    mother_name VARCHAR(255),
	    raw_text TEXT,
	    form_image LONGBLOB,
	    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
	);
	```
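
To store an extracted record in this table, the values need to be passed as query parameters and the date converted to a form MySQL's `DATE` type accepts. The sketch below prepares both. The `fields` dict and the `birth_record_params` helper are hypothetical, and the date is assumed to arrive as MM/DD/YYYY, as in the sample API response.

```python
from datetime import datetime

# Parameterized statement; form_image is left out to keep the sketch short.
INSERT_BIRTH_SQL = (
    "INSERT INTO birth_certificates "
    "(child_name, date_of_birth, place_of_birth, sex, "
    "father_name, mother_name, raw_text) "
    "VALUES (%s, %s, %s, %s, %s, %s, %s)"
)


def birth_record_params(fields, raw_text):
    """Build the parameter tuple for INSERT_BIRTH_SQL from extracted fields.

    `fields` is a hypothetical dict of already-extracted values.
    """
    dob = datetime.strptime(fields["date_of_birth"], "%m/%d/%Y").date()
    return (
        fields["child_name"],
        dob.isoformat(),              # 'YYYY-MM-DD', accepted by MySQL DATE
        fields.get("place_of_birth"),
        fields.get("sex"),
        fields.get("father_name"),
        fields.get("mother_name"),
        raw_text,
    )
```

With a DB-API driver that uses `%s` placeholders, the pair would be passed as `cursor.execute(INSERT_BIRTH_SQL, params)`. Passing values this way, rather than formatting them into the SQL string, avoids SQL injection from OCR output.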

	---

	## 📈 System Requirements

	### Minimum:
	- CPU: Intel i5 or equivalent
	- RAM: 8GB
	- Storage: 10GB
	- OS: Windows 10, Ubuntu 18.04, macOS 10.14

	### Recommended:
	- CPU: Intel i7 or equivalent
	- GPU: NVIDIA GTX 1060 or better
	- RAM: 16GB
	- Storage: 50GB SSD

	---

	## 🔒 Data Privacy & Security

	In line with the Philippine Data Privacy Act of 2012 (RA 10173), the system is designed to provide:

	- ✅ Encrypted data transmission
	- ✅ Access control and authentication
	- ✅ Audit logging
	- ✅ Regular security updates
	- ✅ Data retention policies

	---

	## 📚 Key Algorithms

	### 1. CRNN+CTC
	- Purpose: Text recognition from images
	- Strengths: Handles variable-length sequences, no character segmentation needed
	- Reference: Shi et al. (2016)
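
To illustrate why no character segmentation is needed: the network emits one class index per time step, and CTC decoding collapses repeated indices and then drops the blank symbol. Below is a minimal sketch of greedy (best-path) decoding. The function name is illustrative, and reserving class 0 for the blank is an assumption about how the model is set up.

```python
def ctc_greedy_decode(frame_indices, alphabet, blank=0):
    """Collapse repeated indices, then drop blanks (best-path CTC decoding).

    frame_indices: per-time-step argmax class indices from the network.
    alphabet: string where alphabet[i - 1] is the character for class i
              (class 0 is reserved for the blank in this sketch).
    """
    chars = []
    previous = blank
    for index in frame_indices:
        # Emit a character only when the class changes and is not the blank.
        if index != previous and index != blank:
            chars.append(alphabet[index - 1])
        previous = index
    return "".join(chars)
```

A blank between two identical indices is what lets doubled letters survive the collapse: `[1, 0, 1]` decodes to two characters, while `[1, 1]` decodes to one.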

	### 2. Multinomial Naive Bayes
	- Purpose: Document classification
	- Strengths: Fast, efficient, works well with text data
	- Reference: McCallum & Nigam (1998)
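
The idea behind the classifier can be shown in a few lines: each form type gets a log prior plus the sum of smoothed log word likelihoods, and the highest score wins. The `TinyMultinomialNB` class below is a teaching sketch in plain Python, not the project's `document_classifier.py`. A real implementation would more likely use a library such as scikit-learn.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Multinomial Naive Bayes over word counts with Laplace smoothing."""

    def fit(self, documents, labels, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(documents, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        # Words never seen in training carry no information, so drop them.
        words = [w for w in text.lower().split() if w in self.vocabulary]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + self.alpha * len(self.vocabulary)
            score = math.log(doc_count / total_docs)  # log prior
            for word in words:
                score += math.log((counts[word] + self.alpha) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Trained on the OCR text of labelled forms, `predict` would return a form type such as `"form1a"` for a new page of recognized text.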

	### 3. Named Entity Recognition
	- Purpose: Extract entities (names, dates, places)
	- Strengths: Pre-trained, accurate, easy to use
	- Reference: spaCy (Honnibal & Montani, 2017)
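
One way to turn NER output into the `persons` / `dates` / `locations` groups used by the API is to map spaCy's entity labels onto those keys. The sketch below works on plain `(text, label)` pairs so it can run without a model loaded. The mapping and the `group_entities` name are assumptions, not the project's `ner_extractor.py`.

```python
# spaCy entity labels mapped to the keys used in the API response.
LABEL_TO_KEY = {
    "PERSON": "persons",
    "DATE": "dates",
    "GPE": "locations",  # countries, cities, states
    "LOC": "locations",  # other named locations
}


def group_entities(labelled_spans):
    """Group (text, label) pairs into persons / dates / locations lists."""
    grouped = {"persons": [], "dates": [], "locations": []}
    for text, label in labelled_spans:
        key = LABEL_TO_KEY.get(label)
        if key and text not in grouped[key]:  # ignore other labels and repeats
            grouped[key].append(text)
    return grouped
```

With spaCy installed, the pairs would come from `[(ent.text, ent.label_) for ent in nlp(text).ents]` after `nlp = spacy.load("en_core_web_sm")`.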

	---

	## 🛠️ Troubleshooting

	### Low Accuracy?
	1. Increase training data (target: 10,000+ samples)
	2. Use data augmentation
	3. Train longer (100+ epochs)
	4. Clean your dataset

	### Out of Memory?
	1. Reduce batch size
	2. Use smaller image dimensions
	3. Use gradient accumulation
	4. Enable mixed precision

	### Slow Inference?
	1. Use GPU if available
	2. Batch process images
	3. Optimize model (ONNX)
	4. Cache frequent results
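
As an illustration of the caching suggestion, results can be keyed by a hash of the image bytes so that re-uploading the same scan skips the model entirely. The `PredictionCache` class below is a hypothetical sketch. It assumes a `predict_fn` that accepts raw image bytes, which differs from the path-based `ocr.predict` shown in the Quick Test.

```python
import hashlib
from collections import OrderedDict


class PredictionCache:
    """Small LRU cache keyed by the SHA-256 of the image bytes."""

    def __init__(self, predict_fn, max_entries=256):
        self.predict_fn = predict_fn  # hypothetical: takes bytes, returns text
        self.max_entries = max_entries
        self._store = OrderedDict()

    def predict(self, image_bytes):
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        result = self.predict_fn(image_bytes)
        self._store[key] = result
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict the least recently used
        return result
```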

	---

	## 📖 Documentation

	- IMPLEMENTATION_GUIDE.md - Complete step-by-step guide
	- API_DOCUMENTATION.md - API reference (to be created)
	- USER_MANUAL.md - End-user guide (to be created)

	---

	## 🎓 Academic References

	### Key Papers:

	1. CRNN
	Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE TPAMI.

	2. CTC Loss
	Graves, A., et al. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. ICML.

	3. Naive Bayes
	McCallum, A., & Nigam, K. (1998). A comparison of event models for naive bayes text classification. AAAI Workshop.

	4. spaCy
	Honnibal, M., & Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.

	---

	## 👥 Contributors

	Researchers:
	- Shane Mark C. Blanco
	- Princess A. Pasamonte
	- Irish Faith G. Ramirez

	Advisers:
	- Mr. Rengel V. Corpuz (Technical Adviser)
	- Mr. Joselito T. Tan (Subject Teacher)

	Institution:
	Tarlac State University
	College of Computer Studies
	Bachelor of Science in Computer Science

	---

	## 📞 Support

	For questions regarding this implementation:

	1. Review IMPLEMENTATION_GUIDE.md
	2. Check code documentation
	3. Consult with thesis advisers

	---

	## 📄 License

	This project is for academic purposes as part of a thesis requirement.

	---

	## ✅ Implementation Checklist

	### Phase 1: Setup ✓
	- [x] Install dependencies
	- [x] Set up project structure
	- [x] Prepare development environment

	### Phase 2: Data Preparation
	- [ ] Collect civil registry form images
	- [ ] Create annotations
	- [ ] Split into train/val/test sets

	### Phase 3: Model Development
	- [ ] Train CRNN+CTC model
	- [ ] Train document classifier
	- [ ] Integrate NER system

	### Phase 4: Web Application
	- [ ] Develop Flask/FastAPI backend
	- [ ] Create frontend interface
	- [ ] Implement database integration

	### Phase 5: Testing
	- [ ] Accuracy testing
	- [ ] Black-box testing
	- [ ] ISO 25010 evaluation
	- [ ] User acceptance testing

	### Phase 6: Deployment
	- [ ] Optimize for production
	- [ ] Set up server
	- [ ] Deploy application
	- [ ] Monitor performance

	---

	## 🎯 Success Metrics

	Target metrics for thesis evaluation:

	| Metric | Target | Status |
	|--------|--------|--------|
	| OCR Accuracy | > 95% | Pending |
	| CER | < 5% | Pending |
	| Classifier Accuracy | > 90% | Pending |
	| NER F1 Score | > 85% | Pending |
	| Response Time | < 2s | Pending |
	| System Uptime | > 99% | Pending |

	---

	Good luck with your thesis defense! 🎓✨

	For detailed implementation instructions, see IMPLEMENTATION_GUIDE.md