---
title: LCR OCR API
emoji: 📄
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
pinned: false
---

# Local Civil Registry Document Digitization and Data Extraction

## Using CRNN+CTC, Multinomial Naive Bayes, and Named Entity Recognition

**Thesis Project by:**
- Shane Mark C. Blanco
- Princess A. Pasamonte
- Irish Faith G. Ramirez

**Institution:** Tarlac State University, College of Computer Studies

---

## 📋 Project Overview

This system automates the digitization and data extraction of Philippine Civil Registry documents using a CRNN+CTC text recognizer, a Multinomial Naive Bayes document classifier, and spaCy named entity recognition.

### Target Documents:
- **Form 1A** - Birth Certificate
- **Form 2A** - Death Certificate
- **Form 3A** - Marriage Certificate
- **Form 90** - Application for Marriage License

### Key Features:
- ✅ OCR for printed and handwritten text
- ✅ Automatic document classification
- ✅ Named entity extraction (names, dates, places)
- ✅ Auto-fill digital forms
- ✅ MySQL database storage
- ✅ Searchable digital archive
- ✅ Data visualization dashboard

---

## 🏗️ System Architecture

```
Input: Scanned Civil Registry Form
        ↓
1. Image Preprocessing
        ↓
2. CRNN+CTC → Text Recognition
        ↓
3. Multinomial Naive Bayes → Document Classification
        ↓
4. spaCy NER → Entity Extraction
        ↓
5. Data Validation & Storage → MySQL Database
        ↓
Output: Digitized & Searchable Record
```

---

## 🚀 Quick Start

### Prerequisites
- Python 3.8+
- CUDA-capable GPU (recommended) or CPU
- 8GB RAM minimum

### Installation

```bash
# 1. Clone or download the project
cd civil_registry_ocr

# 2. Create virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
venv\Scripts\activate     # Windows

# 3. Install dependencies
pip install -r requirements.txt

# 4. Download spaCy model
python -m spacy download en_core_web_sm
```

### Quick Test

```python
from inference import CivilRegistryOCR

# Load model
ocr = CivilRegistryOCR('checkpoints/best_model.pth')

# Recognize text
text = ocr.predict('test_images/sample_name.jpg')
print(f"Recognized: {text}")
```

---

## 📁 Project Files

### Core Implementation Files:
1. **crnn_model.py** - CRNN+CTC neural network architecture
2. **dataset.py** - Data loading and preprocessing
3. **train.py** - Model training script
4. **inference.py** - Prediction and inference
5. **utils.py** - Helper functions and metrics
6. **requirements.txt** - Python dependencies
7. **IMPLEMENTATION_GUIDE.md** - Detailed implementation guide

### Additional Components (To be created):
8. **document_classifier.py** - Multinomial Naive Bayes classifier
9. **ner_extractor.py** - Named Entity Recognition
10. **web_app.py** - Web application (Flask/FastAPI)
11. **database.py** - MySQL integration

---

## 📊 Training the Model

### 1. Prepare Your Data

Organize images and labels:

```
data/
  train/
    form1a/
      name_001.jpg
      name_001.txt
    form2a/
      ...
  val/
    ...
```

### 2. Create Annotations

```python
from dataset import create_annotation_file

create_annotation_file('data/train', 'data/train_annotations.json')
create_annotation_file('data/val', 'data/val_annotations.json')
```

### 3. Train Model

```bash
python train.py
```

Monitor metrics:
- Character Error Rate (CER)
- Word Error Rate (WER)
- Training/Validation Loss

### 4. Evaluate

```python
from utils import calculate_cer, calculate_wer

# `ocr` is the CivilRegistryOCR instance from the Quick Test;
# `test_images` and `ground_truths` are parallel lists you supply.
predictions = [ocr.predict(img) for img in test_images]
cer = calculate_cer(predictions, ground_truths)
print(f"CER: {cer:.2f}%")
```

---

## 🌐 Web Application

### Start the Server

```bash
python web_app.py
```

### API Endpoints

**POST /api/ocr** - Process document

```bash
curl -X POST -F "file=@birth_cert.jpg" http://localhost:8000/api/ocr
```

**Response:**

```json
{
  "text": "Juan Dela Cruz\n01/15/1990\nTarlac City",
  "form_type": "form1a",
  "entities": {
    "persons": ["Juan Dela Cruz"],
    "dates": ["01/15/1990"],
    "locations": ["Tarlac City"]
  }
}
```

---

## 🎯 Expected Performance

Based on thesis objectives:

### CRNN+CTC Model:
- **Target CER:** < 5%
- **Target Accuracy:** > 95%
- Handles both printed and handwritten text

### Document Classifier (MNB):
- **Target Accuracy:** > 90%
- Fast classification (< 100ms)

### NER (spaCy):
- **F1 Score:** > 85%
- Extracts: Names, Dates, Places

---

## 🧪 Testing

### ISO 25010 Evaluation

**Usability Testing:**

```python
# Metrics to measure:
# - Task completion rate
# - Average time per task
# - User satisfaction score (SUS)
```

**Reliability Testing:**

```python
# Metrics to measure:
# - System uptime %
# - Error rate
# - Recovery time
```

### Confusion Matrix

```python
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# true_labels and predicted_labels are parallel lists of form types
cm = confusion_matrix(true_labels, predicted_labels)
sns.heatmap(cm, annot=True, fmt="d")  # integer counts in each cell
plt.show()
```

---

## 💾 Database Schema

### Birth Certificates Table

```sql
CREATE TABLE birth_certificates (
    id INT PRIMARY KEY AUTO_INCREMENT,
    child_name VARCHAR(255),
    date_of_birth DATE,
    place_of_birth VARCHAR(255),
    sex CHAR(1),
    father_name VARCHAR(255),
    mother_name VARCHAR(255),
    raw_text TEXT,
    form_image LONGBLOB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```

---

## 📈 System Requirements

### Minimum:
- CPU: Intel i5 or equivalent
- RAM: 8GB
- Storage: 10GB
- OS: Windows 10, Ubuntu 18.04, macOS 10.14

### Recommended:
- CPU: Intel i7 or equivalent
- GPU: NVIDIA GTX 1060 or better
- RAM: 16GB
- Storage: 50GB SSD

---

## 🔒 Data Privacy & Security

In line with the Philippine Data Privacy Act of 2012 (RA 10173), the system is designed to provide:
- ✅ Encrypted data transmission
- ✅ Access control and authentication
- ✅ Audit logging
- ✅ Regular security updates
- ✅ Data retention policies

---

## 📚 Key Algorithms

### 1. CRNN+CTC
- **Purpose:** Text recognition from images
- **Strengths:** Handles variable-length sequences, no character segmentation needed
- **Reference:** Shi et al. (2016)

### 2. Multinomial Naive Bayes
- **Purpose:** Document classification
- **Strengths:** Fast, efficient, works well with text data
- **Reference:** McCallum & Nigam (1998)

### 3. Named Entity Recognition
- **Purpose:** Extract entities (names, dates, places)
- **Strengths:** Pre-trained, accurate, easy to use
- **Reference:** spaCy (Honnibal & Montani, 2017)

---

## 🛠️ Troubleshooting

### Low Accuracy?
1. Increase training data (target: 10,000+ samples)
2. Use data augmentation
3. Train longer (100+ epochs)
4. Clean your dataset

### Out of Memory?
1. Reduce batch size
2. Use smaller image dimensions
3. Use gradient accumulation
4. Enable mixed precision

### Slow Inference?
1. Use GPU if available
2. Batch process images
3. Optimize model (ONNX)
4. Cache frequent results

---

## 📖 Documentation

- **IMPLEMENTATION_GUIDE.md** - Complete step-by-step guide
- **API_DOCUMENTATION.md** - API reference (to be created)
- **USER_MANUAL.md** - End-user guide (to be created)

---

## 🎓 Academic References

### Key Papers:

1. **CRNN**
   Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. *IEEE TPAMI*.

2. **CTC Loss**
   Graves, A., et al. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. *ICML*.

3. **Naive Bayes**
   McCallum, A., & Nigam, K. (1998). A comparison of event models for naive bayes text classification. *AAAI Workshop*.

4. **spaCy**
   Honnibal, M., & Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.

---

## 👥 Contributors

**Researchers:**
- Shane Mark C. Blanco
- Princess A. Pasamonte
- Irish Faith G. Ramirez

**Advisers:**
- Mr. Rengel V. Corpuz (Technical Adviser)
- Mr. Joselito T. Tan (Subject Teacher)

**Institution:**
Tarlac State University
College of Computer Studies
Bachelor of Science in Computer Science

---

## 📞 Support

For questions regarding this implementation:
1. Review IMPLEMENTATION_GUIDE.md
2. Check code documentation
3. Consult with thesis advisers

---

## 📄 License

This project is for academic purposes as part of a thesis requirement.

---

## ✅ Implementation Checklist

### Phase 1: Setup ✓
- [x] Install dependencies
- [x] Set up project structure
- [x] Prepare development environment

### Phase 2: Data Preparation
- [ ] Collect civil registry form images
- [ ] Create annotations
- [ ] Split into train/val/test sets

### Phase 3: Model Development
- [ ] Train CRNN+CTC model
- [ ] Train document classifier
- [ ] Integrate NER system

### Phase 4: Web Application
- [ ] Develop Flask/FastAPI backend
- [ ] Create frontend interface
- [ ] Implement database integration

### Phase 5: Testing
- [ ] Accuracy testing
- [ ] Black-box testing
- [ ] ISO 25010 evaluation
- [ ] User acceptance testing

### Phase 6: Deployment
- [ ] Optimize for production
- [ ] Set up server
- [ ] Deploy application
- [ ] Monitor performance

---

## 🎯 Success Metrics

Target metrics for thesis evaluation:

| Metric | Target | Status |
|--------|--------|--------|
| OCR Accuracy | > 95% | Pending |
| CER | < 5% | Pending |
| Classifier Accuracy | > 90% | Pending |
| NER F1 Score | > 85% | Pending |
| Response Time | < 2s | Pending |
| System Uptime | > 99% | Pending |

---

**Good luck with your thesis defense! 🎓✨**

For detailed implementation instructions, see **IMPLEMENTATION_GUIDE.md**
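
---

The CER and WER figures listed under Success Metrics are conventionally computed from a Levenshtein edit distance between each prediction and its ground truth. The sketch below is a minimal pure-Python illustration of that calculation, not the project's own `utils.calculate_cer` / `calculate_wer`; the names `edit_distance`, `cer_percent`, and `wer_percent` are hypothetical, and pooling errors over all samples (rather than averaging per sample) is an assumption.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref to hyp[:j]
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from ref
                curr[j - 1] + 1,     # insert h into ref
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer_percent(predictions: List[str], ground_truths: List[str]) -> float:
    """Character error rate in percent, pooled over all samples (assumption)."""
    errors = sum(edit_distance(gt, pred)
                 for pred, gt in zip(predictions, ground_truths))
    total = sum(len(gt) for gt in ground_truths)
    return 100.0 * errors / total if total else 0.0


def wer_percent(predictions: List[str], ground_truths: List[str]) -> float:
    """Word error rate in percent, using whitespace tokenization (assumption)."""
    errors = sum(edit_distance(gt.split(), pred.split())
                 for pred, gt in zip(predictions, ground_truths))
    total = sum(len(gt.split()) for gt in ground_truths)
    return 100.0 * errors / total if total else 0.0


if __name__ == "__main__":
    preds = ["Juan Dela Crux", "Tarlac City"]
    truths = ["Juan Dela Cruz", "Tarlac City"]
    print(f"CER: {cer_percent(preds, truths):.2f}%")  # one wrong character out of 25
    print(f"WER: {wer_percent(preds, truths):.2f}%")  # one wrong word out of 5
```

A single misrecognized character counts as a whole word error, so WER is typically much higher than CER on the same output. That is worth keeping in mind when comparing results against the < 5% CER target.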