title: LCR OCR API
emoji: π
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
pinned: false
Local Civil Registry Document Digitization and Data Extraction
Using CRNN+CTC, Multinomial Naive Bayes, and Named Entity Recognition
Thesis Project by:
- Shane Mark C. Blanco
- Princess A. Pasamonte
- Irish Faith G. Ramirez
Institution: Tarlac State University, College of Computer Studies
Project Overview
This system digitizes Philippine Civil Registry documents and extracts their data using CRNN+CTC text recognition, Multinomial Naive Bayes document classification, and spaCy named entity recognition.
Target Documents:
- Form 1A - Birth Certificate
- Form 2A - Death Certificate
- Form 3A - Marriage Certificate
- Form 90 - Application for Marriage License
Key Features:
- OCR for printed and handwritten text
- Automatic document classification
- Named entity extraction (names, dates, places)
- Auto-fill digital forms
- MySQL database storage
- Searchable digital archive
- Data visualization dashboard
System Architecture
Input: Scanned Civil Registry Form
↓
1. Image Preprocessing
↓
2. CRNN+CTC → Text Recognition
↓
3. Multinomial Naive Bayes → Document Classification
↓
4. spaCy NER → Entity Extraction
↓
5. Data Validation & Storage → MySQL Database
↓
Output: Digitized & Searchable Record
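The stage order above can be sketched as a single function that chains the steps. This is only an illustration: `run_pipeline`, `PipelineResult`, and the stand-in lambdas are hypothetical names, not part of the project files, and step 5 (validation and storage) is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PipelineResult:
    text: str
    form_type: str
    entities: Dict[str, List[str]] = field(default_factory=dict)


def run_pipeline(
    image_path: str,
    preprocess: Callable[[str], object],
    recognize: Callable[[object], str],
    classify: Callable[[str], str],
    extract: Callable[[str], Dict[str, List[str]]],
) -> PipelineResult:
    """Chain the stages in the order shown in the architecture diagram."""
    image = preprocess(image_path)   # 1. image preprocessing
    text = recognize(image)          # 2. CRNN+CTC text recognition
    form_type = classify(text)       # 3. Multinomial Naive Bayes classification
    entities = extract(text)         # 4. NER entity extraction
    return PipelineResult(text=text, form_type=form_type, entities=entities)


# Stand-in stages (hypothetical) so the sketch runs without trained models.
result = run_pipeline(
    "birth_cert.jpg",
    preprocess=lambda path: path,
    recognize=lambda image: "Juan Dela Cruz\n01/15/1990\nTarlac City",
    classify=lambda text: "form1a",
    extract=lambda text: {"persons": ["Juan Dela Cruz"]},
)
print(result.form_type)
```

Passing each stage as a callable keeps the stages independently testable; the real models would be swapped in for the lambdas.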
Quick Start
Prerequisites
- Python 3.8+
- CUDA-capable GPU (recommended) or CPU
- 8GB RAM minimum
Installation
# 1. Clone or download the project
cd civil_registry_ocr
# 2. Create virtual environment
python -m venv venv
source venv/bin/activate    # Linux/macOS
# venv\Scripts\activate     # Windows: run this instead of the line above
# 3. Install dependencies
pip install -r requirements.txt
# 4. Download spaCy model
python -m spacy download en_core_web_sm
Quick Test
from inference import CivilRegistryOCR
# Load model
ocr = CivilRegistryOCR('checkpoints/best_model.pth')
# Recognize text
text = ocr.predict('test_images/sample_name.jpg')
print(f"Recognized: {text}")
Project Files
Core Implementation Files:
- crnn_model.py - CRNN+CTC neural network architecture
- dataset.py - Data loading and preprocessing
- train.py - Model training script
- inference.py - Prediction and inference
- utils.py - Helper functions and metrics
- requirements.txt - Python dependencies
- IMPLEMENTATION_GUIDE.md - Detailed implementation guide
Additional Components (To be created):
- document_classifier.py - Multinomial Naive Bayes classifier
- ner_extractor.py - Named Entity Recognition
- web_app.py - Web application (Flask/FastAPI)
- database.py - MySQL integration
Training the Model
1. Prepare Your Data
Organize images and labels:
data/
    train/
        form1a/
            name_001.jpg
            name_001.txt
        form2a/
            ...
    val/
        ...
2. Create Annotations
from dataset import create_annotation_file
create_annotation_file('data/train', 'data/train_annotations.json')
create_annotation_file('data/val', 'data/val_annotations.json')
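For readers who want to see what an annotation step of this kind might do, here is a minimal sketch that pairs each image with its same-named `.txt` label. The function name `build_annotations` and the JSON record layout (`image`, `form_type`, `text`) are assumptions for illustration; the actual `create_annotation_file` in `dataset.py` may use a different format.

```python
import json
from pathlib import Path


def build_annotations(split_dir: str, output_json: str) -> int:
    """Pair each .jpg under split_dir with its same-named .txt label file."""
    root = Path(split_dir)
    records = []
    for image_path in sorted(root.rglob("*.jpg")):
        label_path = image_path.with_suffix(".txt")
        if not label_path.exists():
            continue  # skip images that have no transcription yet
        records.append({
            "image": image_path.relative_to(root).as_posix(),
            "form_type": image_path.parent.name,  # folder name, e.g. "form1a"
            "text": label_path.read_text(encoding="utf-8").strip(),
        })
    Path(output_json).write_text(json.dumps(records, indent=2), encoding="utf-8")
    return len(records)  # number of image/label pairs written
```

Returning the pair count makes it easy to spot a split where many images are still missing labels.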
3. Train Model
python train.py
Monitor metrics:
- Character Error Rate (CER)
- Word Error Rate (WER)
- Training/Validation Loss
4. Evaluate
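from inference import CivilRegistryOCR
from utils import calculate_cer, calculate_wer

# test_images: list of image paths; ground_truths: the matching label strings
ocr = CivilRegistryOCR('checkpoints/best_model.pth')
predictions = [ocr.predict(img) for img in test_images]
cer = calculate_cer(predictions, ground_truths)
print(f"CER: {cer:.2f}%")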
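CER and WER are both edit-distance ratios: the number of insertions, deletions, and substitutions needed to turn the prediction into the reference, divided by the reference length. The sketch below shows one common corpus-level definition; `edit_distance` and `error_rate` are illustrative names, and the project's own `calculate_cer` and `calculate_wer` may normalize differently.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (each edit costs 1)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def error_rate(predictions: List[str], references: List[str], unit: str = "char") -> float:
    """Corpus-level error rate in percent: total edits / total reference length."""
    total_edits = 0
    total_length = 0
    for pred, ref in zip(predictions, references):
        ref_units = list(ref) if unit == "char" else ref.split()
        pred_units = list(pred) if unit == "char" else pred.split()
        total_edits += edit_distance(ref_units, pred_units)
        total_length += len(ref_units)
    return 100.0 * total_edits / total_length if total_length else 0.0


cer = error_rate(["Juan Dela Crux"], ["Juan Dela Cruz"], unit="char")
wer = error_rate(["Juan Dela Crux"], ["Juan Dela Cruz"], unit="word")
print(f"CER: {cer:.2f}%  WER: {wer:.2f}%")
```

One wrong character out of 14 gives a CER near 7%, while the same mistake spoils one word out of three, which is why WER is usually much higher than CER on short fields such as names.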
Web Application
Start the Server
python web_app.py
API Endpoints
POST /api/ocr - Process document
curl -X POST -F "file=@birth_cert.jpg" http://localhost:8000/api/ocr
Response:
{
  "text": "Juan Dela Cruz\n01/15/1990\nTarlac City",
  "form_type": "form1a",
  "entities": {
    "persons": ["Juan Dela Cruz"],
    "dates": ["01/15/1990"],
    "locations": ["Tarlac City"]
  }
}
Expected Performance
Based on thesis objectives:
CRNN+CTC Model:
- Target CER: < 5%
- Target Accuracy: > 95%
- Handles both printed and handwritten text
Document Classifier (MNB):
- Target Accuracy: > 90%
- Fast classification (< 100ms)
NER (spaCy):
- F1 Score: > 85%
- Extracts: Names, Dates, Places
Testing
ISO 25010 Evaluation
Usability Testing:
Metrics to measure:
- Task completion rate
- Average time per task
- User satisfaction score (SUS)
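The System Usability Scale (SUS) score comes from a ten-item questionnaire rated 1 to 5. A minimal sketch of the standard scoring rule follows; the function name `sus_score` is illustrative and not part of the project files.

```python
from typing import List


def sus_score(responses: List[int]) -> float:
    """Score one respondent's ten SUS answers (each 1-5) on a 0-100 scale."""
    if len(responses) != 10:
        raise ValueError("SUS needs exactly 10 responses")
    if any(rating < 1 or rating > 5 for rating in responses):
        raise ValueError("each response must be between 1 and 5")
    total = 0
    for index, rating in enumerate(responses):
        if index % 2 == 0:
            total += rating - 1   # items 1, 3, 5, 7, 9 are positively worded
        else:
            total += 5 - rating   # items 2, 4, 6, 8, 10 are negatively worded
    return total * 2.5            # scale the 0-40 sum to 0-100


print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 1]))
```

The per-respondent scores are then averaged across all participants to report the overall SUS result.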
Reliability Testing:
Metrics to measure:
- System uptime %
- Error rate
- Recovery time
Confusion Matrix
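import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# true_labels / predicted_labels: form types such as "form1a", one per test document
cm = confusion_matrix(true_labels, predicted_labels)
sns.heatmap(cm, annot=True, fmt="d")  # fmt="d" prints integer counts
plt.show()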
Database Schema
Birth Certificates Table
CREATE TABLE birth_certificates (
    id INT PRIMARY KEY AUTO_INCREMENT,
    child_name VARCHAR(255),
    date_of_birth DATE,
    place_of_birth VARCHAR(255),
    sex CHAR(1),
    father_name VARCHAR(255),
    mother_name VARCHAR(255),
    raw_text TEXT,
    form_image LONGBLOB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
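As a rough illustration of how an OCR response could be turned into a row for this table, the sketch below converts the extracted date to the ISO format that a MySQL `DATE` column accepts. Several things here are assumptions: the helper name `to_birth_row`, the rule that the first extracted person is the child, and the `MM/DD/YYYY` date format taken from the sample response. `INSERT_SQL` uses named `%(name)s` placeholders of the kind accepted by drivers such as PyMySQL.

```python
from datetime import datetime
from typing import Any, Dict, Optional

# Parameterized statement; values are passed separately to avoid SQL injection.
INSERT_SQL = (
    "INSERT INTO birth_certificates "
    "(child_name, date_of_birth, place_of_birth, raw_text) "
    "VALUES (%(child_name)s, %(date_of_birth)s, %(place_of_birth)s, %(raw_text)s)"
)


def to_birth_row(response: Dict[str, Any]) -> Dict[str, Optional[str]]:
    """Map an /api/ocr style response onto birth_certificates columns."""
    entities = response.get("entities", {})

    def first(key: str) -> Optional[str]:
        values = entities.get(key) or []
        return values[0] if values else None

    iso_date = None
    raw_date = first("dates")
    if raw_date:
        try:
            # Assumed input format: MM/DD/YYYY, as in the sample response.
            iso_date = datetime.strptime(raw_date, "%m/%d/%Y").date().isoformat()
        except ValueError:
            iso_date = None  # leave unparseable dates empty for manual review
    return {
        "child_name": first("persons"),  # assumption: first person is the child
        "date_of_birth": iso_date,
        "place_of_birth": first("locations"),
        "raw_text": response.get("text"),
    }


sample = {
    "text": "Juan Dela Cruz\n01/15/1990\nTarlac City",
    "form_type": "form1a",
    "entities": {
        "persons": ["Juan Dela Cruz"],
        "dates": ["01/15/1990"],
        "locations": ["Tarlac City"],
    },
}
row = to_birth_row(sample)
print(row["date_of_birth"])
```

With a DB-API cursor, the row would then be stored with `cursor.execute(INSERT_SQL, row)`.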
System Requirements
Minimum:
- CPU: Intel i5 or equivalent
- RAM: 8GB
- Storage: 10GB
- OS: Windows 10, Ubuntu 18.04, macOS 10.14
Recommended:
- CPU: Intel i7 or equivalent
- GPU: NVIDIA GTX 1060 or better
- RAM: 16GB
- Storage: 50GB SSD
Data Privacy & Security
Following Philippine Data Privacy Act (RA 10173):
- Encrypted data transmission
- Access control and authentication
- Audit logging
- Regular security updates
- Data retention policies
Key Algorithms
1. CRNN+CTC
Purpose: Text recognition from images
Strengths: Handles variable-length sequences, no character segmentation needed
Reference: Shi et al. (2016)
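CTC avoids character segmentation because the network emits one class per image column, including a special blank, and the decoder merges repeated classes before dropping blanks. A minimal greedy (best-path) decoding sketch follows; the blank index of 0 and the function name `ctc_greedy_decode` are assumptions, and the project's `inference.py` may decode differently.

```python
from itertools import groupby
from typing import Sequence

BLANK = 0  # assumed class index of the CTC blank symbol


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], alphabet: str) -> str:
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best_path = [
        max(range(len(scores)), key=lambda k: scores[k]) for scores in frame_scores
    ]
    collapsed = [index for index, _ in groupby(best_path)]  # merge repeated classes
    # Class i (for i >= 1) maps to alphabet[i - 1] because index 0 is the blank.
    return "".join(alphabet[index - 1] for index in collapsed if index != BLANK)


# Eight frames over three classes: blank, "a", "n".
frames = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],    # n
    [0.1, 0.2, 0.7],    # n (repeat, merged)
    [0.8, 0.1, 0.1],    # blank separates the double "n"
    [0.1, 0.1, 0.8],    # n
    [0.1, 0.8, 0.1],    # a
]
print(ctc_greedy_decode(frames, alphabet="an"))
```

The blank between the two runs of "n" is what lets the decoder output a genuine double letter instead of merging them into one.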
2. Multinomial Naive Bayes
Purpose: Document classification
Strengths: Fast, efficient, works well with text data
Reference: McCallum & Nigam (1998)
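To make the classification step concrete, here is a small pure-Python sketch of Multinomial Naive Bayes over word counts with add-one smoothing. The class name `TinyMultinomialNB` and the toy training phrases are invented for illustration; in practice a library implementation such as scikit-learn's `MultinomialNB` would normally be used, trained on real OCR output.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List


class TinyMultinomialNB:
    """Multinomial Naive Bayes over word counts with Laplace (add-one) smoothing."""

    def fit(self, texts: List[str], labels: List[str]) -> "TinyMultinomialNB":
        self.class_docs = Counter(labels)  # documents per class, for the prior
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        self.total_docs = len(labels)
        return self

    def predict(self, text: str) -> str:
        # Words never seen in training carry no evidence and are ignored.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / self.total_docs)  # log prior
            score += sum(math.log((counts[t] + 1) / denom) for t in tokens)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (invented), one phrase per form type.
classifier = TinyMultinomialNB().fit(
    [
        "certificate of live birth name of child date of birth",
        "certificate of death cause of death date of death",
        "certificate of marriage husband wife date of marriage",
        "application for marriage license applicant residence",
    ],
    ["form1a", "form2a", "form3a", "form90"],
)
print(classifier.predict("child date of birth mother father"))
```

Working in log space avoids numeric underflow when many word probabilities are multiplied together.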
3. Named Entity Recognition
Purpose: Extract entities (names, dates, places)
Strengths: Pre-trained, accurate, easy to use
Reference: spaCy (Honnibal & Montani, 2017)
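spaCy exposes recognized entities as spans with a text and a label (`ent.text`, `ent.label_` on `doc.ents`). The sketch below shows one possible way to group such pairs into the `persons`, `dates`, and `locations` lists used in the API response; the mapping table and the function name `group_entities` are assumptions, and the input pairs are hard-coded so the example runs without a downloaded model.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed mapping from spaCy entity labels to the response fields.
LABEL_TO_FIELD = {
    "PERSON": "persons",
    "DATE": "dates",
    "GPE": "locations",
    "LOC": "locations",
}


def group_entities(entities: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group (text, label) pairs by field, keeping first-seen order without duplicates."""
    grouped: Dict[str, List[str]] = {"persons": [], "dates": [], "locations": []}
    for text, label in entities:
        key = LABEL_TO_FIELD.get(label)
        if key and text not in grouped[key]:
            grouped[key].append(text)
    return grouped


# With spaCy these pairs would come from:
#   [(ent.text, ent.label_) for ent in nlp(text).ents]
pairs = [
    ("Juan Dela Cruz", "PERSON"),
    ("01/15/1990", "DATE"),
    ("Tarlac City", "GPE"),
    ("Tarlac City", "GPE"),
    ("Tarlac State University", "ORG"),
]
print(group_entities(pairs))
```

Labels outside the mapping, such as `ORG`, are dropped here; whether that is desirable depends on which form fields need to be filled.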
Troubleshooting
Low Accuracy?
- Increase training data (target: 10,000+ samples)
- Use data augmentation
- Train longer (100+ epochs)
- Clean your dataset
Out of Memory?
- Reduce batch size
- Use smaller image dimensions
- Use gradient accumulation
- Enable mixed precision
Slow Inference?
- Use GPU if available
- Batch process images
- Optimize model (ONNX)
- Cache frequent results
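One way to cache frequent results is to key each prediction by a hash of the image bytes, so that re-uploading the same scan skips the model entirely. The wrapper below is a hypothetical sketch: `CachedRecognizer` is not part of the project, and the recognizer passed in stands in for the real OCR call.

```python
import hashlib
from typing import Callable, Dict


class CachedRecognizer:
    """Wrap a recognizer so identical image bytes are only processed once."""

    def __init__(self, recognize: Callable[[bytes], str]) -> None:
        self._recognize = recognize
        self._cache: Dict[str, str] = {}
        self.misses = 0  # how many times the underlying model actually ran

    def predict(self, image_bytes: bytes) -> str:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._recognize(image_bytes)
        return self._cache[key]


# Stand-in recognizer (hypothetical) that just reports the payload size.
cached = CachedRecognizer(lambda data: f"{len(data)} bytes")
cached.predict(b"scan-a")
cached.predict(b"scan-a")  # served from the cache
cached.predict(b"scan-b")
print(cached.misses)
```

The dictionary here grows without bound, so a long-running server would want a size limit or an eviction policy on top of this.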
Documentation
- IMPLEMENTATION_GUIDE.md - Complete step-by-step guide
- API_DOCUMENTATION.md - API reference (to be created)
- USER_MANUAL.md - End-user guide (to be created)
Academic References
Key Papers:
CRNN
Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE TPAMI.
CTC Loss
Graves, A., et al. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. ICML.
Naive Bayes
McCallum, A., & Nigam, K. (1998). A comparison of event models for naive bayes text classification. AAAI Workshop.
spaCy
Honnibal, M., & Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.
Contributors
Researchers:
- Shane Mark C. Blanco
- Princess A. Pasamonte
- Irish Faith G. Ramirez
Advisers:
- Mr. Rengel V. Corpuz (Technical Adviser)
- Mr. Joselito T. Tan (Subject Teacher)
Institution:
Tarlac State University
College of Computer Studies
Bachelor of Science in Computer Science
Support
For questions regarding this implementation:
- Review IMPLEMENTATION_GUIDE.md
- Check code documentation
- Consult with thesis advisers
License
This project is for academic purposes as part of a thesis requirement.
Implementation Checklist
Phase 1: Setup (completed)
- Install dependencies
- Set up project structure
- Prepare development environment
Phase 2: Data Preparation
- Collect civil registry form images
- Create annotations
- Split into train/val/test sets
Phase 3: Model Development
- Train CRNN+CTC model
- Train document classifier
- Integrate NER system
Phase 4: Web Application
- Develop Flask/FastAPI backend
- Create frontend interface
- Implement database integration
Phase 5: Testing
- Accuracy testing
- Black-box testing
- ISO 25010 evaluation
- User acceptance testing
Phase 6: Deployment
- Optimize for production
- Set up server
- Deploy application
- Monitor performance
Success Metrics
Target metrics for thesis evaluation:
| Metric | Target | Status |
|---|---|---|
| OCR Accuracy | > 95% | Pending |
| CER | < 5% | Pending |
| Classifier Accuracy | > 90% | Pending |
| NER F1 Score | > 85% | Pending |
| Response Time | < 2s | Pending |
| System Uptime | > 99% | Pending |
Good luck with your thesis defense!
For detailed implementation instructions, see IMPLEMENTATION_GUIDE.md