title: LCR OCR API
emoji: 📄
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
pinned: false

Local Civil Registry Document Digitization and Data Extraction

Using CRNN+CTC, Multinomial Naive Bayes, and Named Entity Recognition

Thesis Project by:

  • Shane Mark C. Blanco
  • Princess A. Pasamonte
  • Irish Faith G. Ramirez

Institution: Tarlac State University, College of Computer Studies


📋 Project Overview

This system automates the digitization and data extraction of Philippine Civil Registry documents. It combines CRNN+CTC text recognition, Multinomial Naive Bayes document classification, and spaCy-based Named Entity Recognition.

Target Documents:

  • Form 1A - Birth Certificate
  • Form 2A - Death Certificate
  • Form 3A - Marriage Certificate
  • Form 90 - Application for Marriage License

Key Features:

✅ OCR for printed and handwritten text
✅ Automatic document classification
✅ Named entity extraction (names, dates, places)
✅ Auto-fill digital forms
✅ MySQL database storage
✅ Searchable digital archive
✅ Data visualization dashboard


🏗️ System Architecture

Input: Scanned Civil Registry Form
    ↓
1. Image Preprocessing
    ↓
2. CRNN+CTC → Text Recognition
    ↓
3. Multinomial Naive Bayes → Document Classification
    ↓
4. spaCy NER → Entity Extraction
    ↓
5. Data Validation & Storage → MySQL Database
    ↓
Output: Digitized & Searchable Record
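
The flow above can be read as a chain of stages, each consuming the output of the previous one. The sketch below only illustrates that chaining: `run_pipeline` and the stage callables passed to it are hypothetical names, not the project's actual API, preprocessing (step 1) is assumed to happen inside the recognition stage, and the lambdas are stand-ins for the trained models.

```python
from typing import Callable, Dict, List


def run_pipeline(
    image_path: str,
    recognize: Callable[[str], str],
    classify: Callable[[str], str],
    extract: Callable[[str], Dict[str, List[str]]],
) -> dict:
    """Chain recognition, classification and entity extraction (hypothetical helper)."""
    text = recognize(image_path)   # steps 1-2: preprocessing + CRNN+CTC recognition
    form_type = classify(text)     # step 3: document classification
    entities = extract(text)       # step 4: entity extraction
    return {"text": text, "form_type": form_type, "entities": entities}


# Stand-in stages so the sketch runs without trained models.
record = run_pipeline(
    "birth_cert.jpg",
    recognize=lambda path: "Juan Dela Cruz",
    classify=lambda text: "form1a",
    extract=lambda text: {"persons": [text]},
)
print(record["form_type"])
```

Passing the stages in as callables keeps each model replaceable and makes the flow easy to test before the real components exist.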

🚀 Quick Start

Prerequisites

  • Python 3.8+
  • CUDA-capable GPU (recommended) or CPU
  • 8GB RAM minimum

Installation

# 1. Clone or download the project
cd civil_registry_ocr

# 2. Create and activate a virtual environment (run only the activate line for your OS)
python -m venv venv
source venv/bin/activate  # Linux/Mac
venv\Scripts\activate     # Windows

# 3. Install dependencies
pip install -r requirements.txt

# 4. Download spaCy model
python -m spacy download en_core_web_sm

Quick Test

from inference import CivilRegistryOCR

# Load model
ocr = CivilRegistryOCR('checkpoints/best_model.pth')

# Recognize text
text = ocr.predict('test_images/sample_name.jpg')
print(f"Recognized: {text}")

📁 Project Files

Core Implementation Files:

  1. crnn_model.py - CRNN+CTC neural network architecture
  2. dataset.py - Data loading and preprocessing
  3. train.py - Model training script
  4. inference.py - Prediction and inference
  5. utils.py - Helper functions and metrics
  6. requirements.txt - Python dependencies
  7. IMPLEMENTATION_GUIDE.md - Detailed implementation guide

Additional Components (To be created):

  1. document_classifier.py - Multinomial Naive Bayes classifier
  2. ner_extractor.py - Named Entity Recognition
  3. web_app.py - Web application (Flask/FastAPI)
  4. database.py - MySQL integration

📊 Training the Model

1. Prepare Your Data

Organize images and labels:

data/
  train/
    form1a/
      name_001.jpg
      name_001.txt
    form2a/
      ...
  val/
    ...

2. Create Annotations

from dataset import create_annotation_file

create_annotation_file('data/train', 'data/train_annotations.json')
create_annotation_file('data/val', 'data/val_annotations.json')
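
The JSON layout written by `create_annotation_file` is not shown here. As a rough idea of what an annotation step over the directory layout above might do, the sketch below pairs each `.jpg` with the `.txt` file beside it. `build_annotations` and the record keys (`image`, `form_type`, `text`) are assumptions made for illustration, not the contents of `dataset.py`.

```python
import json
import tempfile
from pathlib import Path


def build_annotations(root: str) -> list:
    """Pair every .jpg under root with the .txt label beside it (hypothetical helper)."""
    records = []
    for image in sorted(Path(root).rglob("*.jpg")):
        label_file = image.with_suffix(".txt")
        if not label_file.exists():
            continue  # skip images that have no transcription yet
        records.append({
            "image": image.relative_to(root).as_posix(),
            "form_type": image.parent.name,  # folder name, e.g. "form1a"
            "text": label_file.read_text(encoding="utf-8").strip(),
        })
    return records


# Tiny self-contained demo using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    form_dir = Path(tmp) / "form1a"
    form_dir.mkdir()
    (form_dir / "name_001.jpg").write_bytes(b"")  # empty placeholder image
    (form_dir / "name_001.txt").write_text("Juan Dela Cruz\n", encoding="utf-8")
    annotations = build_annotations(tmp)
    print(json.dumps(annotations, indent=2))
```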

3. Train Model

python train.py

Monitor metrics:

  • Character Error Rate (CER)
  • Word Error Rate (WER)
  • Training/Validation Loss

4. Evaluate

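from inference import CivilRegistryOCR
from utils import calculate_cer, calculate_wer

ocr = CivilRegistryOCR('checkpoints/best_model.pth')

# test_images: list of image paths; ground_truths: the matching label strings
predictions = [ocr.predict(img) for img in test_images]
cer = calculate_cer(predictions, ground_truths)
print(f"CER: {cer:.2f}%")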
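
CER and WER are conventionally defined as the edit distance between prediction and reference divided by the reference length, counted in characters or in words. The sketch below shows that definition in plain Python. `edit_distance` and `error_rate` are illustrative names, and whether `calculate_cer` in `utils.py` follows exactly this corpus-level formula is an assumption.

```python
def edit_distance(a, b) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rate(predictions, references, unit="char") -> float:
    """Corpus-level CER (unit='char') or WER (unit='word') as a percentage."""
    split = (lambda s: s.split()) if unit == "word" else list
    errors = sum(edit_distance(split(p), split(r))
                 for p, r in zip(predictions, references))
    total = sum(len(split(r)) for r in references)
    return 100.0 * errors / total if total else 0.0


print(error_rate(["Juan Dela Crus"], ["Juan Dela Cruz"]))          # CER
print(error_rate(["Juan Dela Crus"], ["Juan Dela Cruz"], "word"))  # WER
```

One wrong character in a 14-character reference gives a CER of about 7.1%, while the same mistake counts as one wrong word out of three for WER.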

🌐 Web Application

Start the Server

python web_app.py

API Endpoints

POST /api/ocr - Process document

curl -X POST -F "file=@birth_cert.jpg" http://localhost:8000/api/ocr

Response:

{
  "text": "Juan Dela Cruz\n01/15/1990\nTarlac City",
  "form_type": "form1a",
  "entities": {
    "persons": ["Juan Dela Cruz"],
    "dates": ["01/15/1990"],
    "locations": ["Tarlac City"]
  }
}
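
Before a response like this is stored, the extracted date has to be converted to the `YYYY-MM-DD` form that a MySQL `DATE` column expects. The sketch below parses the sample payload and normalizes the date. `to_iso_date` is a hypothetical helper, and the `MM/DD/YYYY` input format is assumed from the sample only.

```python
import json
from datetime import datetime

# Sample payload copied from the response shown above.
payload = json.loads(r'''
{
  "text": "Juan Dela Cruz\n01/15/1990\nTarlac City",
  "form_type": "form1a",
  "entities": {
    "persons": ["Juan Dela Cruz"],
    "dates": ["01/15/1990"],
    "locations": ["Tarlac City"]
  }
}
''')


def to_iso_date(value: str) -> str:
    """Convert MM/DD/YYYY (format assumed from the sample) to YYYY-MM-DD."""
    return datetime.strptime(value, "%m/%d/%Y").date().isoformat()


birth_date = to_iso_date(payload["entities"]["dates"][0])
print(payload["form_type"], birth_date)
```

`datetime.strptime` raises `ValueError` on a malformed date, which gives the validation step a natural place to reject or flag a record.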

🎯 Expected Performance

Based on thesis objectives:

CRNN+CTC Model:

  • Target CER: < 5%
  • Target Accuracy: > 95%
  • Handles both printed and handwritten text

Document Classifier (MNB):

  • Target Accuracy: > 90%
  • Fast classification (< 100ms)

NER (spaCy):

  • F1 Score: > 85%
  • Extracts: Names, Dates, Places

🧪 Testing

ISO 25010 Evaluation

Usability Testing:

Metrics to measure:

  • Task completion rate
  • Average time per task
  • User satisfaction score (SUS)
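
For the SUS metric, the standard scoring takes ten responses on a 1 to 5 scale: odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5 to give a 0 to 100 score. A minimal sketch of that calculation follows; `sus_score` is an illustrative name and the sample responses are made up.

```python
def sus_score(responses) -> float:
    """Score one System Usability Scale questionnaire (ten answers, each 1-5)."""
    if len(responses) != 10 or any(not 1 <= r <= 5 for r in responses):
        raise ValueError("SUS needs ten responses, each between 1 and 5")
    total = 0
    for index, response in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded.
        total += (response - 1) if index % 2 == 1 else (5 - response)
    return total * 2.5


print(sus_score([4, 2, 5, 1, 4, 2, 5, 1, 4, 2]))
```

Averaging the per-respondent scores gives the overall SUS figure for the usability evaluation.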

Reliability Testing:

Metrics to measure:

  • System uptime %
  • Error rate
  • Recovery time

Confusion Matrix

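import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# true_labels / predicted_labels: form types such as "form1a", one per test document
labels = sorted(set(true_labels))
cm = confusion_matrix(true_labels, predicted_labels, labels=labels)
sns.heatmap(cm, annot=True, fmt="d", xticklabels=labels, yticklabels=labels)
plt.show()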

💾 Database Schema

Birth Certificates Table

CREATE TABLE birth_certificates (
    id INT PRIMARY KEY AUTO_INCREMENT,
    child_name VARCHAR(255),
    date_of_birth DATE,
    place_of_birth VARCHAR(255),
    sex CHAR(1),
    father_name VARCHAR(255),
    mother_name VARCHAR(255),
    raw_text TEXT,
    form_image LONGBLOB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
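
Rows for this table should be written with a parameterized query so that OCR output is never spliced into the SQL string. The sketch below builds such a statement for the columns defined above. `build_insert` is a hypothetical helper, and the `%s` placeholder style is the one used by the common Python MySQL drivers (mysql-connector-python, PyMySQL); which driver `database.py` will use is an assumption.

```python
COLUMNS = (
    "child_name", "date_of_birth", "place_of_birth",
    "sex", "father_name", "mother_name", "raw_text",
)


def build_insert(record: dict):
    """Return (sql, params) for a parameterized INSERT (hypothetical helper)."""
    placeholders = ", ".join(["%s"] * len(COLUMNS))
    sql = (
        f"INSERT INTO birth_certificates ({', '.join(COLUMNS)}) "
        f"VALUES ({placeholders})"
    )
    # Fields missing from the record are passed as None, which is stored as NULL.
    params = tuple(record.get(column) for column in COLUMNS)
    return sql, params


sql, params = build_insert({
    "child_name": "Juan Dela Cruz",
    "date_of_birth": "1990-01-15",
    "place_of_birth": "Tarlac City",
    "sex": "M",
    "raw_text": "Juan Dela Cruz\n01/15/1990\nTarlac City",
})
print(sql)
print(params)
```

With a DB-API cursor the pair would be passed as `cursor.execute(sql, params)`, leaving quoting and escaping to the driver.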

📈 System Requirements

Minimum:

  • CPU: Intel i5 or equivalent
  • RAM: 8GB
  • Storage: 10GB
  • OS: Windows 10, Ubuntu 18.04, macOS 10.14

Recommended:

  • CPU: Intel i7 or equivalent
  • GPU: NVIDIA GTX 1060 or better
  • RAM: 16GB
  • Storage: 50GB SSD

🔒 Data Privacy & Security

Following the Philippine Data Privacy Act of 2012 (RA 10173):

  • ✅ Encrypted data transmission
  • ✅ Access control and authentication
  • ✅ Audit logging
  • ✅ Regular security updates
  • ✅ Data retention policies

📚 Key Algorithms

1. CRNN+CTC

Purpose: Text recognition from images
Strengths: Handles variable-length sequences, no character segmentation needed
Reference: Shi et al. (2016)
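
To illustrate why no character segmentation is needed: the network emits one label per image column (time step), and CTC decoding merges consecutive repeats and then drops the blank symbol. The sketch below shows greedy (best-path) decoding on hand-written frame labels. The blank index `0`, the toy alphabet and the function name are assumptions for illustration, not values taken from `crnn_model.py`.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_ids, alphabet) -> str:
    """Collapse repeated frame labels, then remove blanks (best-path decoding)."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return "".join(alphabet[i] for i in collapsed if i != BLANK)


alphabet = ["-", "a", "n", "J", "u"]      # index 0 stands for the blank
frames = [3, 3, 0, 4, 4, 4, 0, 1, 2, 2]   # per-time-step argmax of the network output
print(ctc_greedy_decode(frames, alphabet))
```

The blank is what allows doubled letters: `[1, 0, 1]` decodes to "aa", whereas `[1, 1]` collapses to a single "a".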

2. Multinomial Naive Bayes

Purpose: Document classification
Strengths: Fast, efficient, works well with text data
Reference: McCallum & Nigam (1998)
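
The idea behind the classifier can be shown in a few lines: each form type gets a score equal to its log prior plus the sum of the log probabilities of the words in the recognized text, with Laplace smoothing for words a class has not seen. The sketch below is a from-scratch toy on invented training phrases, not the project's `document_classifier.py`. A real implementation would more likely use scikit-learn's `MultinomialNB`.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Toy multinomial Naive Bayes over whitespace tokens (illustration only)."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha  # Laplace smoothing constant

    def fit(self, documents, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for text, label in zip(documents, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text: str) -> str:
        # Words never seen in training carry no information and are ignored.
        tokens = [t for t in text.lower().split() if t in self.vocabulary]
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + self.alpha * len(self.vocabulary)
            score = math.log(doc_count / total_docs)  # log prior
            for token in tokens:
                score += math.log((counts[token] + self.alpha) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training phrases, one per form type.
train_texts = [
    "certificate of live birth name of child date of birth",
    "certificate of death cause of death date of death",
    "certificate of marriage husband wife date of marriage",
]
train_labels = ["form1a", "form2a", "form3a"]
model = TinyMultinomialNB().fit(train_texts, train_labels)
print(model.predict("date of birth of the child"))
```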

3. Named Entity Recognition

Purpose: Extract entities (names, dates, places)
Strengths: Pre-trained, accurate, easy to use
Reference: spaCy (Honnibal & Montani, 2017)
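
With spaCy, recognized entities are available as `(ent.text, ent.label_)` pairs from `nlp(text).ents`. The sketch below shows one way such pairs could be grouped into the `persons` / `dates` / `locations` structure used in the API response. The label-to-key mapping and `group_entities` are assumptions for illustration, and the spans are written by hand so the example runs without a downloaded model. How well a general-purpose English model labels civil registry text is not something this sketch demonstrates.

```python
from collections import defaultdict

# Assumed mapping from spaCy entity labels to the keys used in the API response.
LABEL_TO_KEY = {
    "PERSON": "persons",
    "DATE": "dates",
    "GPE": "locations",
    "LOC": "locations",
}


def group_entities(labelled_spans) -> dict:
    """Group (text, label) pairs by response key, dropping duplicates and other labels."""
    grouped = defaultdict(list)
    for text, label in labelled_spans:
        key = LABEL_TO_KEY.get(label)
        if key and text not in grouped[key]:
            grouped[key].append(text)
    return dict(grouped)


# Hand-written spans; with spaCy they would come from
# [(ent.text, ent.label_) for ent in nlp(text).ents].
spans = [
    ("Juan Dela Cruz", "PERSON"),
    ("01/15/1990", "DATE"),
    ("Tarlac City", "GPE"),
    ("Tarlac City", "GPE"),   # duplicate mention
    ("Form 1A", "PRODUCT"),   # label outside the mapping, ignored
]
print(group_entities(spans))
```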


🛠️ Troubleshooting

Low Accuracy?

  1. Increase training data (target: 10,000+ samples)
  2. Use data augmentation
  3. Train longer (100+ epochs)
  4. Clean your dataset

Out of Memory?

  1. Reduce batch size
  2. Use smaller image dimensions
  3. Use gradient accumulation
  4. Enable mixed precision

Slow Inference?

  1. Use GPU if available
  2. Batch process images
  3. Optimize model (ONNX)
  4. Cache frequent results
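
As an illustration of the caching suggestion, results can be keyed by a hash of the image bytes so that re-uploading the same scan skips the model entirely. The sketch below is a minimal in-memory version. `CachedRecognizer` is a hypothetical wrapper, and the lambda stands in for the real OCR call.

```python
import hashlib


class CachedRecognizer:
    """Wrap a recognition function with a content-addressed in-memory cache."""

    def __init__(self, recognize):
        self._recognize = recognize
        self._cache = {}
        self.misses = 0  # number of times the wrapped function actually ran

    def predict_bytes(self, image_bytes: bytes) -> str:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._recognize(image_bytes)
        return self._cache[key]


# Stand-in for the real OCR call.
recognizer = CachedRecognizer(lambda data: f"{len(data)} bytes recognized")
first = recognizer.predict_bytes(b"scan-1")
second = recognizer.predict_bytes(b"scan-1")  # served from the cache
print(recognizer.misses)
```

An unbounded dictionary grows with every distinct scan, so a production version would need a size limit or an external cache.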

📖 Documentation

  • IMPLEMENTATION_GUIDE.md - Complete step-by-step guide
  • API_DOCUMENTATION.md - API reference (to be created)
  • USER_MANUAL.md - End-user guide (to be created)

🎓 Academic References

Key Papers:

  1. CRNN
    Shi, B., Bai, X., & Yao, C. (2016). An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE TPAMI.

  2. CTC Loss
    Graves, A., et al. (2006). Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. ICML.

  3. Naive Bayes
    McCallum, A., & Nigam, K. (1998). A comparison of event models for naive Bayes text classification. AAAI-98 Workshop on Learning for Text Categorization.

  4. spaCy
    Honnibal, M., & Montani, I. (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing.


👥 Contributors

Researchers:

  • Shane Mark C. Blanco
  • Princess A. Pasamonte
  • Irish Faith G. Ramirez

Advisers:

  • Mr. Rengel V. Corpuz (Technical Adviser)
  • Mr. Joselito T. Tan (Subject Teacher)

Institution:
Tarlac State University
College of Computer Studies
Bachelor of Science in Computer Science


📞 Support

For questions regarding this implementation:

  1. Review IMPLEMENTATION_GUIDE.md
  2. Check code documentation
  3. Consult with thesis advisers

📄 License

This project is for academic purposes as part of a thesis requirement.


✅ Implementation Checklist

Phase 1: Setup ✓

  • Install dependencies
  • Set up project structure
  • Prepare development environment

Phase 2: Data Preparation

  • Collect civil registry form images
  • Create annotations
  • Split into train/val/test sets

Phase 3: Model Development

  • Train CRNN+CTC model
  • Train document classifier
  • Integrate NER system

Phase 4: Web Application

  • Develop Flask/FastAPI backend
  • Create frontend interface
  • Implement database integration

Phase 5: Testing

  • Accuracy testing
  • Black-box testing
  • ISO 25010 evaluation
  • User acceptance testing

Phase 6: Deployment

  • Optimize for production
  • Set up server
  • Deploy application
  • Monitor performance

🎯 Success Metrics

Target metrics for thesis evaluation:

| Metric | Target | Status |
|---|---|---|
| OCR Accuracy | > 95% | Pending |
| CER | < 5% | Pending |
| Classifier Accuracy | > 90% | Pending |
| NER F1 Score | > 85% | Pending |
| Response Time | < 2s | Pending |
| System Uptime | > 99% | Pending |

Good luck with your thesis defense! 🎓✨

For detailed implementation instructions, see IMPLEMENTATION_GUIDE.md