ocr / README.md
hanz245's picture
fix emoji in hf config
fba3a76
---
title: LCR OCR API
emoji: πŸ“„
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
pinned: false
---
# Local Civil Registry Document Digitization and Data Extraction
## Using CRNN+CTC, Multinomial Naive Bayes, and Named Entity Recognition
**Thesis Project by:**
- Shane Mark C. Blanco
- Princess A. Pasamonte
- Irish Faith G. Ramirez
**Institution:** Tarlac State University, College of Computer Studies
---
## πŸ“‹ Project Overview
This system automates the digitization and data extraction of Philippine Civil Registry documents using advanced machine learning algorithms:
### Target Documents:
- **Form 1A** - Birth Certificate
- **Form 2A** - Death Certificate
- **Form 3A** - Marriage Certificate
- **Form 90** - Application of Marriage License
### Key Features:
βœ… OCR for printed and handwritten text
βœ… Automatic document classification
βœ… Named entity extraction (names, dates, places)
βœ… Auto-fill digital forms
βœ… MySQL database storage
βœ… Searchable digital archive
βœ… Data visualization dashboard
---
## πŸ—οΈ System Architecture
```
Input: Scanned Civil Registry Form
↓
1. Image Preprocessing
↓
2. CRNN+CTC β†’ Text Recognition
↓
3. Multinomial Naive Bayes β†’ Document Classification
↓
4. spaCy NER β†’ Entity Extraction
↓
5. Data Validation & Storage β†’ MySQL Database
↓
Output: Digitized & Searchable Record
```
---
## πŸš€ Quick Start
### Prerequisites
- Python 3.8+
- CUDA-capable GPU (recommended) or CPU
- 8GB RAM minimum
### Installation
```bash
# 1. Clone or download the project
cd civil_registry_ocr
# 2. Create virtual environment
python -m venv venv
source venv/bin/activate # Linux/Mac
venv\Scripts\activate # Windows
# 3. Install dependencies
pip install -r requirements.txt
# 4. Download spaCy model
python -m spacy download en_core_web_sm
```
### Quick Test
```python
from inference import CivilRegistryOCR
# Load model
ocr = CivilRegistryOCR('checkpoints/best_model.pth')
# Recognize text
text = ocr.predict('test_images/sample_name.jpg')
print(f"Recognized: {text}")
```
---
## πŸ“ Project Files
### Core Implementation Files:
1. **crnn_model.py** - CRNN+CTC neural network architecture
2. **dataset.py** - Data loading and preprocessing
3. **train.py** - Model training script
4. **inference.py** - Prediction and inference
5. **utils.py** - Helper functions and metrics
6. **requirements.txt** - Python dependencies
7. **IMPLEMENTATION_GUIDE.md** - Detailed implementation guide
### Additional Components (To be created):
8. **document_classifier.py** - Multinomial Naive Bayes classifier
9. **ner_extractor.py** - Named Entity Recognition
10. **web_app.py** - Web application (Flask/FastAPI)
11. **database.py** - MySQL integration
---
## πŸ“Š Training the Model
### 1. Prepare Your Data
Organize images and labels:
```
data/
train/
form1a/
name_001.jpg
name_001.txt
form2a/
...
val/
...
```
### 2. Create Annotations
```python
from dataset import create_annotation_file
create_annotation_file('data/train', 'data/train_annotations.json')
create_annotation_file('data/val', 'data/val_annotations.json')
```
### 3. Train Model
```bash
python train.py
```
Monitor metrics:
- Character Error Rate (CER)
- Word Error Rate (WER)
- Training/Validation Loss
### 4. Evaluate
```python
from utils import calculate_cer, calculate_wer
predictions = [ocr.predict(img) for img in test_images]
cer = calculate_cer(predictions, ground_truths)
print(f"CER: {cer:.2f}%")
```
---
## 🌐 Web Application
### Start the Server
```bash
python web_app.py
```
### API Endpoints
**POST /api/ocr** - Process document
```bash
curl -X POST -F "file=@birth_cert.jpg" http://localhost:8000/api/ocr
```
**Response:**
```json
{
"text": "Juan Dela Cruz\n01/15/1990\nTarlac City",
"form_type": "form1a",
"entities": {
"persons": ["Juan Dela Cruz"],
"dates": ["01/15/1990"],
"locations": ["Tarlac City"]
}
}
```
---
## 🎯 Expected Performance
Based on thesis objectives:
### CRNN+CTC Model:
- **Target CER:** < 5%
- **Target Accuracy:** > 95%
- Handles both printed and handwritten text
### Document Classifier (MNB):
- **Target Accuracy:** > 90%
- Fast classification (< 100ms)
### NER (spaCy):
- **F1 Score:** > 85%
- Extracts: Names, Dates, Places
---
## πŸ§ͺ Testing
### ISO 25010 Evaluation
**Usability Testing:**
```python
# Metrics to measure:
- Task completion rate
- Average time per task
- User satisfaction score (SUS)
```
**Reliability Testing:**
```python
# Metrics to measure:
- System uptime %
- Error rate
- Recovery time
```
### Confusion Matrix
```python
from sklearn.metrics import confusion_matrix
import seaborn as sns
cm = confusion_matrix(true_labels, predicted_labels)
sns.heatmap(cm, annot=True)
```
---
## πŸ’Ύ Database Schema
### Birth Certificates Table
```sql
CREATE TABLE birth_certificates (
id INT PRIMARY KEY AUTO_INCREMENT,
child_name VARCHAR(255),
date_of_birth DATE,
place_of_birth VARCHAR(255),
sex CHAR(1),
father_name VARCHAR(255),
mother_name VARCHAR(255),
raw_text TEXT,
form_image LONGBLOB,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```
---
## πŸ“ˆ System Requirements
### Minimum:
- CPU: Intel i5 or equivalent
- RAM: 8GB
- Storage: 10GB
- OS: Windows 10, Ubuntu 18.04, macOS 10.14
### Recommended:
- CPU: Intel i7 or equivalent
- GPU: NVIDIA GTX 1060 or better
- RAM: 16GB
- Storage: 50GB SSD
---
## πŸ”’ Data Privacy & Security
Following Philippine Data Privacy Act (RA 10173):
- βœ… Encrypted data transmission
- βœ… Access control and authentication
- βœ… Audit logging
- βœ… Regular security updates
- βœ… Data retention policies
---
## πŸ“š Key Algorithms
### 1. CRNN+CTC
**Purpose:** Text recognition from images
**Strengths:** Handles variable-length sequences, no character segmentation needed
**Reference:** Shi et al. (2016)
### 2. Multinomial Naive Bayes
**Purpose:** Document classification
**Strengths:** Fast, efficient, works well with text data
**Reference:** McCallum & Nigam (1998)
### 3. Named Entity Recognition
**Purpose:** Extract entities (names, dates, places)
**Strengths:** Pre-trained, accurate, easy to use
**Reference:** spaCy (Honnibal & Montani, 2017)
---
## πŸ› οΈ Troubleshooting
### Low Accuracy?
1. Increase training data (target: 10,000+ samples)
2. Use data augmentation
3. Train longer (100+ epochs)
4. Clean your dataset
### Out of Memory?
1. Reduce batch size
2. Use smaller image dimensions
3. Use gradient accumulation
4. Enable mixed precision
### Slow Inference?
1. Use GPU if available
2. Batch process images
3. Optimize model (ONNX)
4. Cache frequent results
---
## πŸ“– Documentation
- **IMPLEMENTATION_GUIDE.md** - Complete step-by-step guide
- **API_DOCUMENTATION.md** - API reference (to be created)
- **USER_MANUAL.md** - End-user guide (to be created)
---
## πŸŽ“ Academic References
### Key Papers:
1. **CRNN**
Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. *IEEE TPAMI*.
2. **CTC Loss**
Graves, A., et al. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. *ICML*.
3. **Naive Bayes**
McCallum, A., & Nigam, K. (1998). A comparison of event models for naive bayes text classification. *AAAI Workshop*.
4. **spaCy**
Honnibal, M., & Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.
---
## πŸ‘₯ Contributors
**Researchers:**
- Shane Mark C. Blanco
- Princess A. Pasamonte
- Irish Faith G. Ramirez
**Advisers:**
- Mr. Rengel V. Corpuz (Technical Adviser)
- Mr. Joselito T. Tan (Subject Teacher)
**Institution:**
Tarlac State University
College of Computer Studies
Bachelor of Science in Computer Science
---
## πŸ“ž Support
For questions regarding this implementation:
1. Review IMPLEMENTATION_GUIDE.md
2. Check code documentation
3. Consult with thesis advisers
---
## πŸ“„ License
This project is for academic purposes as part of a thesis requirement.
---
## βœ… Implementation Checklist
### Phase 1: Setup βœ“
- [x] Install dependencies
- [x] Set up project structure
- [x] Prepare development environment
### Phase 2: Data Preparation
- [ ] Collect civil registry form images
- [ ] Create annotations
- [ ] Split into train/val/test sets
### Phase 3: Model Development
- [ ] Train CRNN+CTC model
- [ ] Train document classifier
- [ ] Integrate NER system
### Phase 4: Web Application
- [ ] Develop Flask/FastAPI backend
- [ ] Create frontend interface
- [ ] Implement database integration
### Phase 5: Testing
- [ ] Accuracy testing
- [ ] Black-box testing
- [ ] ISO 25010 evaluation
- [ ] User acceptance testing
### Phase 6: Deployment
- [ ] Optimize for production
- [ ] Set up server
- [ ] Deploy application
- [ ] Monitor performance
---
## 🎯 Success Metrics
Target metrics for thesis evaluation:
| Metric | Target | Status |
|--------|--------|--------|
| OCR Accuracy | > 95% | Pending |
| CER | < 5% | Pending |
| Classifier Accuracy | > 90% | Pending |
| NER F1 Score | > 85% | Pending |
| Response Time | < 2s | Pending |
| System Uptime | > 99% | Pending |
---
**Good luck with your thesis defense! πŸŽ“βœ¨**
For detailed implementation instructions, see **IMPLEMENTATION_GUIDE.md**