| --- |
| title: LCR OCR API |
| emoji: π |
| colorFrom: blue |
| colorTo: green |
| sdk: docker |
| app_file: app.py |
| pinned: false |
| --- |
| |
| # Local Civil Registry Document Digitization and Data Extraction |
|
|
| ## Using CRNN+CTC, Multinomial Naive Bayes, and Named Entity Recognition |
|
|
| **Thesis Project by:** |
| - Shane Mark C. Blanco |
| - Princess A. Pasamonte |
| - Irish Faith G. Ramirez |
|
|
| **Institution:** Tarlac State University, College of Computer Studies |
|
|
| --- |
|
|
| ## π Project Overview |
|
|
| This system automates the digitization and data extraction of Philippine Civil Registry documents using advanced machine learning algorithms: |
|
|
| ### Target Documents: |
| - **Form 1A** - Birth Certificate |
| - **Form 2A** - Death Certificate |
| - **Form 3A** - Marriage Certificate |
| - **Form 90** - Application of Marriage License |
|
|
| ### Key Features: |
| β
OCR for printed and handwritten text |
| β
Automatic document classification |
| β
Named entity extraction (names, dates, places) |
| β
Auto-fill digital forms |
| β
MySQL database storage |
| β
Searchable digital archive |
| β
Data visualization dashboard |
|
|
| --- |
|
|
| ## ποΈ System Architecture |
|
|
| ``` |
| Input: Scanned Civil Registry Form |
| β |
| 1. Image Preprocessing |
| β |
| 2. CRNN+CTC β Text Recognition |
| β |
| 3. Multinomial Naive Bayes β Document Classification |
| β |
| 4. spaCy NER β Entity Extraction |
| β |
| 5. Data Validation & Storage β MySQL Database |
| β |
| Output: Digitized & Searchable Record |
| ``` |
|
|
| --- |
|
|
| ## π Quick Start |
|
|
| ### Prerequisites |
|
|
| - Python 3.8+ |
| - CUDA-capable GPU (recommended) or CPU |
| - 8GB RAM minimum |
|
|
| ### Installation |
|
|
| ```bash |
| # 1. Clone or download the project |
| cd civil_registry_ocr |
| |
| # 2. Create virtual environment |
| python -m venv venv |
| source venv/bin/activate # Linux/Mac |
| venv\Scripts\activate # Windows |
| |
| # 3. Install dependencies |
| pip install -r requirements.txt |
| |
| # 4. Download spaCy model |
| python -m spacy download en_core_web_sm |
| ``` |
|
|
| ### Quick Test |
|
|
| ```python |
| from inference import CivilRegistryOCR |
| |
| # Load model |
| ocr = CivilRegistryOCR('checkpoints/best_model.pth') |
| |
| # Recognize text |
| text = ocr.predict('test_images/sample_name.jpg') |
| print(f"Recognized: {text}") |
| ``` |
|
|
| --- |
|
|
| ## π Project Files |
|
|
| ### Core Implementation Files: |
|
|
| 1. **crnn_model.py** - CRNN+CTC neural network architecture |
| 2. **dataset.py** - Data loading and preprocessing |
| 3. **train.py** - Model training script |
| 4. **inference.py** - Prediction and inference |
| 5. **utils.py** - Helper functions and metrics |
| 6. **requirements.txt** - Python dependencies |
| 7. **IMPLEMENTATION_GUIDE.md** - Detailed implementation guide |
|
|
| ### Additional Components (To be created): |
|
|
| 8. **document_classifier.py** - Multinomial Naive Bayes classifier |
| 9. **ner_extractor.py** - Named Entity Recognition |
| 10. **web_app.py** - Web application (Flask/FastAPI) |
| 11. **database.py** - MySQL integration |
| |
| --- |
| |
| ## π Training the Model |
| |
| ### 1. Prepare Your Data |
| |
| Organize images and labels: |
| ``` |
| data/ |
| train/ |
| form1a/ |
| name_001.jpg |
| name_001.txt |
| form2a/ |
| ... |
| val/ |
| ... |
| ``` |
| |
| ### 2. Create Annotations |
| |
| ```python |
| from dataset import create_annotation_file |
| |
| create_annotation_file('data/train', 'data/train_annotations.json') |
| create_annotation_file('data/val', 'data/val_annotations.json') |
| ``` |
| |
| ### 3. Train Model |
| |
| ```bash |
| python train.py |
| ``` |
| |
| Monitor metrics: |
| - Character Error Rate (CER) |
| - Word Error Rate (WER) |
| - Training/Validation Loss |
| |
| ### 4. Evaluate |
| |
| ```python |
| from utils import calculate_cer, calculate_wer |
| |
| predictions = [ocr.predict(img) for img in test_images] |
| cer = calculate_cer(predictions, ground_truths) |
| print(f"CER: {cer:.2f}%") |
| ``` |
| |
| --- |
| |
| ## π Web Application |
| |
| ### Start the Server |
| |
| ```bash |
| python web_app.py |
| ``` |
| |
| ### API Endpoints |
| |
| **POST /api/ocr** - Process document |
| ```bash |
| curl -X POST -F "file=@birth_cert.jpg" http://localhost:8000/api/ocr |
| ``` |
| |
| **Response:** |
| ```json |
| { |
| "text": "Juan Dela Cruz\n01/15/1990\nTarlac City", |
| "form_type": "form1a", |
| "entities": { |
| "persons": ["Juan Dela Cruz"], |
| "dates": ["01/15/1990"], |
| "locations": ["Tarlac City"] |
| } |
| } |
| ``` |
|
|
| --- |
|
|
| ## π― Expected Performance |
|
|
| Based on thesis objectives: |
|
|
| ### CRNN+CTC Model: |
| - **Target CER:** < 5% |
| - **Target Accuracy:** > 95% |
| - Handles both printed and handwritten text |
|
|
| ### Document Classifier (MNB): |
| - **Target Accuracy:** > 90% |
| - Fast classification (< 100ms) |
|
|
| ### NER (spaCy): |
| - **F1 Score:** > 85% |
| - Extracts: Names, Dates, Places |
|
|
| --- |
|
|
| ## π§ͺ Testing |
|
|
| ### ISO 25010 Evaluation |
|
|
| **Usability Testing:** |
| ```python |
| # Metrics to measure: |
| - Task completion rate |
| - Average time per task |
| - User satisfaction score (SUS) |
| ``` |
|
|
| **Reliability Testing:** |
| ```python |
| # Metrics to measure: |
| - System uptime % |
| - Error rate |
| - Recovery time |
| ``` |
|
|
| ### Confusion Matrix |
|
|
| ```python |
| from sklearn.metrics import confusion_matrix |
| import seaborn as sns |
| |
| cm = confusion_matrix(true_labels, predicted_labels) |
| sns.heatmap(cm, annot=True) |
| ``` |
|
|
| --- |
|
|
| ## πΎ Database Schema |
|
|
| ### Birth Certificates Table |
| ```sql |
| CREATE TABLE birth_certificates ( |
| id INT PRIMARY KEY AUTO_INCREMENT, |
| child_name VARCHAR(255), |
| date_of_birth DATE, |
| place_of_birth VARCHAR(255), |
| sex CHAR(1), |
| father_name VARCHAR(255), |
| mother_name VARCHAR(255), |
| raw_text TEXT, |
| form_image LONGBLOB, |
| created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP |
| ); |
| ``` |
|
|
| --- |
|
|
| ## π System Requirements |
|
|
| ### Minimum: |
| - CPU: Intel i5 or equivalent |
| - RAM: 8GB |
| - Storage: 10GB |
| - OS: Windows 10, Ubuntu 18.04, macOS 10.14 |
|
|
| ### Recommended: |
| - CPU: Intel i7 or equivalent |
| - GPU: NVIDIA GTX 1060 or better |
| - RAM: 16GB |
| - Storage: 50GB SSD |
|
|
| --- |
|
|
| ## π Data Privacy & Security |
|
|
| Following Philippine Data Privacy Act (RA 10173): |
|
|
| - β
Encrypted data transmission |
| - β
Access control and authentication |
| - β
Audit logging |
| - β
Regular security updates |
| - β
Data retention policies |
|
|
| --- |
|
|
| ## π Key Algorithms |
|
|
| ### 1. CRNN+CTC |
| **Purpose:** Text recognition from images |
| **Strengths:** Handles variable-length sequences, no character segmentation needed |
| **Reference:** Shi et al. (2016) |
|
|
| ### 2. Multinomial Naive Bayes |
| **Purpose:** Document classification |
| **Strengths:** Fast, efficient, works well with text data |
| **Reference:** McCallum & Nigam (1998) |
|
|
| ### 3. Named Entity Recognition |
| **Purpose:** Extract entities (names, dates, places) |
| **Strengths:** Pre-trained, accurate, easy to use |
| **Reference:** spaCy (Honnibal & Montani, 2017) |
|
|
| --- |
|
|
| ## π οΈ Troubleshooting |
|
|
| ### Low Accuracy? |
| 1. Increase training data (target: 10,000+ samples) |
| 2. Use data augmentation |
| 3. Train longer (100+ epochs) |
| 4. Clean your dataset |
|
|
| ### Out of Memory? |
| 1. Reduce batch size |
| 2. Use smaller image dimensions |
| 3. Use gradient accumulation |
| 4. Enable mixed precision |
|
|
| ### Slow Inference? |
| 1. Use GPU if available |
| 2. Batch process images |
| 3. Optimize model (ONNX) |
| 4. Cache frequent results |
|
|
| --- |
|
|
| ## π Documentation |
|
|
| - **IMPLEMENTATION_GUIDE.md** - Complete step-by-step guide |
| - **API_DOCUMENTATION.md** - API reference (to be created) |
| - **USER_MANUAL.md** - End-user guide (to be created) |
| |
| --- |
| |
| ## π Academic References |
| |
| ### Key Papers: |
| |
| 1. **CRNN** |
| Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. *IEEE TPAMI*. |
| |
| 2. **CTC Loss** |
| Graves, A., et al. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. *ICML*. |
| |
| 3. **Naive Bayes** |
| McCallum, A., & Nigam, K. (1998). A comparison of event models for naive bayes text classification. *AAAI Workshop*. |
| |
| 4. **spaCy** |
| Honnibal, M., & Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. |
| |
| --- |
| |
| ## π₯ Contributors |
| |
| **Researchers:** |
| - Shane Mark C. Blanco |
| - Princess A. Pasamonte |
| - Irish Faith G. Ramirez |
| |
| **Advisers:** |
| - Mr. Rengel V. Corpuz (Technical Adviser) |
| - Mr. Joselito T. Tan (Subject Teacher) |
| |
| **Institution:** |
| Tarlac State University |
| College of Computer Studies |
| Bachelor of Science in Computer Science |
| |
| --- |
| |
| ## π Support |
| |
| For questions regarding this implementation: |
| |
| 1. Review IMPLEMENTATION_GUIDE.md |
| 2. Check code documentation |
| 3. Consult with thesis advisers |
| |
| --- |
| |
| ## π License |
| |
| This project is for academic purposes as part of a thesis requirement. |
| |
| --- |
| |
| ## β
Implementation Checklist |
| |
| ### Phase 1: Setup β |
| - [x] Install dependencies |
| - [x] Set up project structure |
| - [x] Prepare development environment |
| |
| ### Phase 2: Data Preparation |
| - [ ] Collect civil registry form images |
| - [ ] Create annotations |
| - [ ] Split into train/val/test sets |
| |
| ### Phase 3: Model Development |
| - [ ] Train CRNN+CTC model |
| - [ ] Train document classifier |
| - [ ] Integrate NER system |
| |
| ### Phase 4: Web Application |
| - [ ] Develop Flask/FastAPI backend |
| - [ ] Create frontend interface |
| - [ ] Implement database integration |
| |
| ### Phase 5: Testing |
| - [ ] Accuracy testing |
| - [ ] Black-box testing |
| - [ ] ISO 25010 evaluation |
| - [ ] User acceptance testing |
| |
| ### Phase 6: Deployment |
| - [ ] Optimize for production |
| - [ ] Set up server |
| - [ ] Deploy application |
| - [ ] Monitor performance |
| |
| --- |
| |
| ## π― Success Metrics |
| |
| Target metrics for thesis evaluation: |
| |
| | Metric | Target | Status | |
| |--------|--------|--------| |
| | OCR Accuracy | > 95% | Pending | |
| | CER | < 5% | Pending | |
| | Classifier Accuracy | > 90% | Pending | |
| | NER F1 Score | > 85% | Pending | |
| | Response Time | < 2s | Pending | |
| | System Uptime | > 99% | Pending | |
| |
| --- |
| |
| **Good luck with your thesis defense! πβ¨** |
|
|
| For detailed implementation instructions, see **IMPLEMENTATION_GUIDE.md** |
| |