Add writing_pattern_classifier package for live demo
- writing_pattern_classifier/README.md +163 -0
- writing_pattern_classifier/__init__.py +0 -0
- writing_pattern_classifier/__pycache__/__init__.cpython-312.pyc +0 -0
- writing_pattern_classifier/artifacts/pattern_taxonomy.md +79 -0
- writing_pattern_classifier/notebooks/01_sentence_level_dyslexic_pattern_inference.ipynb +0 -0
- writing_pattern_classifier/notebooks/02_essay_level_dyslexic_pattern_profiling.ipynb +0 -0
- writing_pattern_classifier/notebooks/OLD_02_essay_level_dyslexic_pattern_profiling.ipynb +0 -0
- writing_pattern_classifier/src/__init__.py +0 -0
- writing_pattern_classifier/src/__pycache__/__init__.cpython-312.pyc +0 -0
- writing_pattern_classifier/src/__pycache__/essay_profile.cpython-312.pyc +0 -0
- writing_pattern_classifier/src/__pycache__/feature_extraction.cpython-312.pyc +0 -0
- writing_pattern_classifier/src/__pycache__/pattern_rules.cpython-312.pyc +0 -0
- writing_pattern_classifier/src/__pycache__/pipeline.cpython-312.pyc +0 -0
- writing_pattern_classifier/src/essay_profile.py +86 -0
- writing_pattern_classifier/src/feature_extraction.py +75 -0
- writing_pattern_classifier/src/pattern_rules.py +46 -0
- writing_pattern_classifier/src/pipeline.py +66 -0
writing_pattern_classifier/README.md
ADDED
@@ -0,0 +1,163 @@
# Dyslexic Writing-Pattern Classifier (Sinhala)

This module implements an **interpretable, rule-based dyslexic writing-pattern classifier** for Sinhala text.

Unlike traditional machine-learning classifiers, this component focuses on **pattern inference and explainability** rather than predictive accuracy.
It is designed to analyze _how_ dyslexic writing manifests, not merely _whether_ dyslexia is present.

---

## Purpose

- Identify **dominant dyslexic writing patterns** in Sinhala text
- Provide **explainable, linguistically grounded analysis**
- Support educational and research-oriented dyslexia-aware systems

This module is executed **only after** an essay has been identified as dyslexic by the Binary Dyslexia Detector.

---

## Core Design Principle

> Dyslexia is expressed through **consistent patterns of surface-level writing errors**, not isolated mistakes.

Therefore, this classifier infers patterns using **rule-based dominance of error signals**, rather than supervised learning.

---

## Writing Patterns Identified

The system currently identifies the following dyslexic writing patterns:

- **Orthographic Instability**
  Frequent character omissions, additions, or diacritic loss

- **Phonetic Confusion**
  Character substitutions reflecting phonetic similarity

- **Mixed Dyslexic Pattern**
  Co-occurrence of multiple dominant error types

- **No Dominant Pattern**
  Absence of consistent dyslexic error behavior

- **Word Boundary Confusion** (when applicable)
  Spacing and word-segmentation errors

These patterns are derived from dyslexia-related literature and adapted for Sinhala writing.

---

## Processing Pipeline

### 1. Sentence-Level Analysis

For each sentence:

- Clean and dyslexic versions are compared
- Surface error features are extracted:
  - Character addition
  - Character omission
  - Character substitution
  - Diacritic loss
  - Spacing issues
- A **rule-based inference engine** assigns a sentence-level writing pattern

### 2. Essay-Level Aggregation

Because the dataset does not provide explicit essay boundaries:

- Essays are approximated using **fixed-size sentence windows** (pseudo-essays)
- Sentence-level patterns are aggregated per essay

### 3. Dominant Pattern Classification

For each essay:

- The most frequent pattern is selected as the **dominant pattern**
- A **confidence score** is computed as:

\[
\text{Confidence} = \frac{\text{Number of sentences supporting the dominant pattern}}{\text{Total number of sentences in the essay}}
\]

- Dominance strength is categorized as:
  - Strong Dominance
  - Moderate Dominance
  - Weak / Mixed
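The dominance computation can be sketched in a few lines of self-contained Python. `profile_essay` is a hypothetical helper name used only for illustration; the 0.6 / 0.4 thresholds mirror those used in `src/essay_profile.py`, and the strength labels follow the category names above:

```python
from collections import Counter

def profile_essay(sentence_patterns: list[str]) -> dict:
    """Compute dominant pattern, confidence, and dominance strength."""
    counts = Counter(sentence_patterns)
    dominant, max_count = counts.most_common(1)[0]
    confidence = max_count / len(sentence_patterns)
    if confidence >= 0.6:
        strength = "Strong Dominance"
    elif confidence >= 0.4:
        strength = "Moderate Dominance"
    else:
        strength = "Weak / Mixed"
    return {
        "dominant_pattern": dominant,
        "confidence": confidence,
        "dominance_strength": strength,
    }

patterns = ["Orthographic Instability"] * 3 + ["Phonetic Confusion"] * 2
# 3 of 5 sentences agree, so confidence is 0.6 -> "Strong Dominance"
print(profile_essay(patterns))
```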
---

## Outputs

For each essay, the classifier produces:

- Dominant dyslexic writing pattern
- Pattern dominance confidence
- Dominance strength label
- Sentence-level pattern breakdown (for explainability)

### Example Output

```json
{
  "dominant_pattern": "Orthographic Instability",
  "confidence": 0.6,
  "dominance_strength": "Strong Dominance"
}
```

---
## Evaluation Strategy

This component does not use supervised evaluation metrics such as accuracy or F1-score, because:

- Essay-level pattern labels are inferred, not manually annotated
- Reporting accuracy against those inferred labels would amount to label leakage

Instead, evaluation is performed using:

- Pattern distribution analysis
- Confidence distribution statistics
- Qualitative case studies with sentence-level evidence

This approach aligns with best practices in dyslexia-related linguistic analysis.
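As a sketch of what such distribution analysis might look like, the snippet below uses illustrative data (not real project results); the column names match the essay-level output fields:

```python
import pandas as pd

# Illustrative essay-level results, not real project data
essays = pd.DataFrame({
    "dominant_pattern": [
        "Orthographic Instability", "Phonetic Confusion",
        "Orthographic Instability", "Mixed Dyslexic Pattern",
    ],
    "confidence": [0.6, 0.8, 0.4, 0.5],
})

# Share of essays per dominant pattern
print(essays["dominant_pattern"].value_counts(normalize=True))

# Summary statistics of the dominance confidence scores
print(essays["confidence"].describe()[["mean", "min", "max"]])
```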
---

## Notebooks

```
notebooks/
├── 01_sentence_level_dyslexic_pattern_inference.ipynb
└── 02_essay_level_dyslexic_pattern_profiling.ipynb
```

These notebooks document the full development and validation process.

---

## Limitations

- Essay boundaries are approximated using fixed-size sentence windows
- The system does not perform clinical diagnosis
- Pattern definitions may evolve with expert validation

---

## Role in the Overall System

```
Binary Dyslexia Detector
          ↓
   Dyslexic Essay
          ↓
Writing-Pattern Classifier
          ↓
Pattern Profile + Confidence
```

---

## Disclaimer

This module is intended for research and educational purposes only and should not be used for clinical diagnosis.

Generated CSV artifacts are intentionally excluded from version control and can be reproduced by running the notebooks or the pipeline.
writing_pattern_classifier/__init__.py
ADDED
File without changes

writing_pattern_classifier/__pycache__/__init__.cpython-312.pyc
ADDED
Binary file (185 Bytes)

writing_pattern_classifier/artifacts/pattern_taxonomy.md
ADDED
@@ -0,0 +1,79 @@
# Sinhala Dyslexic Writing Pattern Taxonomy

This document defines the interpretable dyslexic writing-pattern taxonomy used in this project.

The taxonomy is derived from surface-level orthographic and phonetic deviations observed in Sinhala dyslexic writing.

---

## 1. Orthographic Instability

**Definition:**
Inconsistent or incorrect written forms of characters without strong phonetic substitution.

**Surface Signals:**

- Character omission
- Character addition
- Diacritic loss
- Inconsistent spelling

**Example:**

- Clean: රුපියල් දෙදාහක් තියෙනවා
- Dyslexic: රුපියල් දෙදාහක් තියනව

---

## 2. Phonetic Confusion

**Definition:**
Errors that reflect confusion between phonologically similar sounds.

**Surface Signals:**

- Character substitution
- Phonetically similar replacements

**Example:**

- Clean: ගණිත
- Dyslexic: ගනිත

---

## 3. Word Boundary Confusion

**Definition:**
Difficulty maintaining correct word segmentation.

**Surface Signals:**

- Word merges
- Extra spaces
- Missing spaces

---

## 4. Mixed Dyslexic Pattern

**Definition:**
Presence of multiple dyslexic patterns within the same sentence or essay.

**Criteria:**

- More than one dominant surface error type

---

## 5. No Dominant Pattern

**Definition:**
No consistent dyslexic pattern detected, or very low error density.

---

## Notes

- Patterns are assigned using rule-based dominance logic.
- This system prioritizes explainability over raw accuracy.
writing_pattern_classifier/notebooks/01_sentence_level_dyslexic_pattern_inference.ipynb
ADDED
The diff for this file is too large to render.

writing_pattern_classifier/notebooks/02_essay_level_dyslexic_pattern_profiling.ipynb
ADDED
The diff for this file is too large to render.

writing_pattern_classifier/notebooks/OLD_02_essay_level_dyslexic_pattern_profiling.ipynb
ADDED
The diff for this file is too large to render.

writing_pattern_classifier/src/__init__.py
ADDED
File without changes

writing_pattern_classifier/src/__pycache__/__init__.cpython-312.pyc
ADDED
Binary file (189 Bytes)

writing_pattern_classifier/src/__pycache__/essay_profile.cpython-312.pyc
ADDED
Binary file (2.84 kB)

writing_pattern_classifier/src/__pycache__/feature_extraction.cpython-312.pyc
ADDED
Binary file (3.34 kB)

writing_pattern_classifier/src/__pycache__/pattern_rules.cpython-312.pyc
ADDED
Binary file (1.38 kB)

writing_pattern_classifier/src/__pycache__/pipeline.cpython-312.pyc
ADDED
Binary file (2.44 kB)

writing_pattern_classifier/src/essay_profile.py
ADDED
@@ -0,0 +1,86 @@
"""
Essay-level dyslexic writing pattern profiling.

This module aggregates sentence-level dyslexic writing patterns
into dominance-based essay profiles.
"""

import pandas as pd


def assign_essay_ids(df: pd.DataFrame, essay_size: int = 5) -> pd.DataFrame:
    """
    Assign essay IDs to sentence-level data using fixed-size grouping.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing sentence-level patterns.
    essay_size : int
        Number of sentences per essay abstraction.

    Returns
    -------
    pd.DataFrame
        DataFrame with an added 'essay_id' column.
    """
    df = df.copy()
    df["essay_id"] = df.index // essay_size
    return df


def dominance_strength(confidence: float) -> str:
    """
    Categorize dominance strength based on the confidence score.
    """
    if confidence >= 0.6:
        return "Strong"
    elif confidence >= 0.4:
        return "Moderate"
    else:
        return "Weak / Mixed"


def profile_essays(df: pd.DataFrame) -> pd.DataFrame:
    """
    Aggregate sentence-level patterns into essay-level dominance profiles.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing 'essay_id' and 'writing_pattern'.

    Returns
    -------
    pd.DataFrame
        Essay-level pattern profiles with dominance and confidence.
    """
    # Count patterns per essay: one row per essay, one column per pattern
    pattern_counts = (
        df
        .groupby("essay_id")["writing_pattern"]
        .value_counts()
        .unstack(fill_value=0)
    )

    essay_summary = pattern_counts.copy()

    # Dominant pattern: the column with the highest count per essay
    essay_summary["dominant_pattern"] = essay_summary.idxmax(axis=1)

    # Compute dominance metrics
    pattern_columns = pattern_counts.columns
    essay_summary["max_count"] = essay_summary[pattern_columns].max(axis=1)
    essay_summary["total_sentences"] = essay_summary[pattern_columns].sum(axis=1)

    essay_summary["confidence"] = (
        essay_summary["max_count"] / essay_summary["total_sentences"]
    )

    # Dominance strength categorization
    essay_summary["dominance_strength"] = essay_summary["confidence"].apply(
        dominance_strength
    )

    return essay_summary.reset_index()
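The groupby / value_counts / unstack idiom at the core of `profile_essays` can be seen on a toy frame (illustrative data only):

```python
import pandas as pd

df = pd.DataFrame({
    "essay_id": [0, 0, 0, 1, 1],
    "writing_pattern": [
        "Phonetic Confusion", "Phonetic Confusion", "Orthographic Instability",
        "No Dominant Pattern", "No Dominant Pattern",
    ],
})

# One row per essay, one column per pattern, cells holding sentence counts
counts = (
    df.groupby("essay_id")["writing_pattern"]
    .value_counts()
    .unstack(fill_value=0)
)

# Essay 0 is dominated by "Phonetic Confusion", essay 1 by "No Dominant Pattern"
print(counts.idxmax(axis=1).tolist())
```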
writing_pattern_classifier/src/feature_extraction.py
ADDED
@@ -0,0 +1,75 @@
"""
Sentence-level surface feature extraction for Sinhala dyslexic writing analysis.

This module computes interpretable surface-level error signals
by comparing clean and dyslexic sentence pairs.
"""

import difflib

# Common Sinhala diacritic (vowel-sign) characters.
# Note: this set is not exhaustive; rarer signs (e.g. ෛ, ෞ, ෲ) are not included.
SINHALA_DIACRITICS = set([
    "ා", "ැ", "ෑ", "ි", "ී", "ු", "ූ", "ෘ", "ෙ", "ේ", "ො", "ෝ", "ං", "ඃ"
])


def char_level_diff(clean: str, dyslexic: str) -> dict:
    """
    Compute character-level edit operations between clean and dyslexic sentences.
    """
    matcher = difflib.SequenceMatcher(None, clean, dyslexic)

    additions = omissions = substitutions = 0

    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "insert":
            additions += (j2 - j1)
        elif tag == "delete":
            omissions += (i2 - i1)
        elif tag == "replace":
            substitutions += max(i2 - i1, j2 - j1)

    return {
        "char_addition": additions,
        "char_omission": omissions,
        "char_substitution": substitutions,
        "has_addition": additions > 0,
        "has_omission": omissions > 0,
        "has_substitution": substitutions > 0,
    }


def spacing_diff(clean: str, dyslexic: str) -> dict:
    """
    Detect word boundary (spacing) inconsistencies.
    """
    diff = abs(len(clean.split()) - len(dyslexic.split()))
    return {
        "word_count_diff": diff,
        "has_spacing_issue": diff > 0,
    }


def diacritic_loss(clean: str, dyslexic: str) -> dict:
    """
    Detect diacritic loss in dyslexic writing.
    """
    clean_count = sum(1 for c in clean if c in SINHALA_DIACRITICS)
    dys_count = sum(1 for c in dyslexic if c in SINHALA_DIACRITICS)

    return {
        "has_diacritic_loss": clean_count > dys_count
    }


def extract_surface_features(clean_sentence: str, dyslexic_sentence: str) -> dict:
    """
    Extract all sentence-level surface features.
    """
    features = {}

    features.update(char_level_diff(clean_sentence, dyslexic_sentence))
    features.update(spacing_diff(clean_sentence, dyslexic_sentence))
    features.update(diacritic_loss(clean_sentence, dyslexic_sentence))

    return features
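The difflib comparison at the heart of `char_level_diff` can be inspected directly, here on the ගණිත / ගනිත pair used as the Phonetic Confusion example in the taxonomy:

```python
import difflib

clean, dyslexic = "ගණිත", "ගනිත"  # ණ replaced by the phonetically similar න
matcher = difflib.SequenceMatcher(None, clean, dyslexic)

# Keep only the non-matching edit operations
ops = [
    (tag, clean[i1:i2], dyslexic[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]
print(ops)  # a single one-character "replace" operation
```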
writing_pattern_classifier/src/pattern_rules.py
ADDED
@@ -0,0 +1,46 @@
"""
Rule-based sentence-level dyslexic writing pattern inference.

This module implements dominance-aware, interpretable rules
for identifying dyslexic writing patterns from surface features.
"""


def infer_pattern(features: dict) -> str:
    """
    Infer the dominant dyslexic writing pattern for a sentence
    using surface-level error signals.

    Parameters
    ----------
    features : dict
        Dictionary containing extracted surface features.

    Returns
    -------
    str
        One of the predefined dyslexic writing pattern labels.
    """

    # Priority 1: Word boundary confusion
    if features.get("has_spacing_issue"):
        return "Word Boundary Confusion"

    has_sub = features.get("has_substitution", False)
    has_omit = features.get("has_omission", False)
    has_diacritic = features.get("has_diacritic_loss", False)

    # Priority 2: Mixed dyslexic pattern
    if has_sub and has_omit:
        return "Mixed Dyslexic Pattern"

    # Priority 3: Phonetic confusion
    if has_sub:
        return "Phonetic Confusion"

    # Priority 4: Orthographic instability
    if has_omit or has_diacritic:
        return "Orthographic Instability"

    # Fallback
    return "No Dominant Pattern"
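A condensed restatement of the cascade (same rules, compressed purely for illustration) makes the priority ordering easy to verify with a few feature dictionaries:

```python
def infer_pattern(features: dict) -> str:
    # Condensed restatement of the rule cascade, for illustration only
    if features.get("has_spacing_issue"):
        return "Word Boundary Confusion"
    has_sub = features.get("has_substitution", False)
    has_omit = features.get("has_omission", False)
    if has_sub and has_omit:
        return "Mixed Dyslexic Pattern"
    if has_sub:
        return "Phonetic Confusion"
    if has_omit or features.get("has_diacritic_loss", False):
        return "Orthographic Instability"
    return "No Dominant Pattern"

# Spacing issues outrank everything else, even when substitutions co-occur
print(infer_pattern({"has_spacing_issue": True, "has_substitution": True}))
print(infer_pattern({"has_substitution": True, "has_omission": True}))
print(infer_pattern({}))
```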
writing_pattern_classifier/src/pipeline.py
ADDED
@@ -0,0 +1,66 @@
"""
End-to-end pipeline for Sinhala dyslexic writing pattern analysis.

This module orchestrates sentence-level feature extraction,
pattern inference, and essay-level profiling.
"""

import pandas as pd

from .feature_extraction import extract_surface_features
from .pattern_rules import infer_pattern
from .essay_profile import assign_essay_ids, profile_essays


def run_pattern_analysis(
    df: pd.DataFrame,
    essay_size: int = 5
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    Run the complete dyslexic writing pattern analysis pipeline.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing:
        - 'clean_sentence'
        - 'dyslexic_sentence'
    essay_size : int
        Number of sentences per essay abstraction.

    Returns
    -------
    tuple (sentence_df, essay_df)
        sentence_df : pd.DataFrame
            Sentence-level features and inferred patterns.
        essay_df : pd.DataFrame
            Essay-level dominance profiles.
    """

    df = df.copy()

    # --- Sentence-level feature extraction ---
    surface_features = df.apply(
        lambda row: extract_surface_features(
            row["clean_sentence"],
            row["dyslexic_sentence"]
        ),
        axis=1
    )

    feature_df = pd.concat(
        [df.reset_index(drop=True), surface_features.apply(pd.Series)],
        axis=1
    )

    # --- Sentence-level pattern inference ---
    # A row Series supports dict-style .get(), so it can be passed to infer_pattern.
    feature_df["writing_pattern"] = feature_df.apply(infer_pattern, axis=1)

    # --- Essay-level profiling ---
    feature_df = assign_essay_ids(feature_df, essay_size=essay_size)
    essay_df = profile_essays(feature_df)

    return feature_df, essay_df
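For a quick smoke test without installing the package, the whole flow can be compressed into a standalone script. This is a simplified re-implementation of the modules above (not the package API itself), run on the two example pairs from the taxonomy:

```python
import difflib
import pandas as pd

DIACRITICS = set("ාැෑිීුූෘෙේොෝංඃ")

def surface_features(clean: str, dys: str) -> dict:
    """Minimal stand-in for extract_surface_features."""
    m = difflib.SequenceMatcher(None, clean, dys)
    add = omit = sub = 0
    for tag, i1, i2, j1, j2 in m.get_opcodes():
        if tag == "insert":
            add += j2 - j1
        elif tag == "delete":
            omit += i2 - i1
        elif tag == "replace":
            sub += max(i2 - i1, j2 - j1)
    return {
        "has_spacing_issue": len(clean.split()) != len(dys.split()),
        "has_substitution": sub > 0,
        "has_omission": omit > 0,
        "has_diacritic_loss": sum(c in DIACRITICS for c in clean)
                              > sum(c in DIACRITICS for c in dys),
    }

def infer(f: dict) -> str:
    """Minimal stand-in for the rule cascade."""
    if f["has_spacing_issue"]:
        return "Word Boundary Confusion"
    if f["has_substitution"] and f["has_omission"]:
        return "Mixed Dyslexic Pattern"
    if f["has_substitution"]:
        return "Phonetic Confusion"
    if f["has_omission"] or f["has_diacritic_loss"]:
        return "Orthographic Instability"
    return "No Dominant Pattern"

# (clean, dyslexic) pairs from the taxonomy examples
pairs = [("ගණිත", "ගනිත"), ("තියෙනවා", "තියනව")]
sentence_df = pd.DataFrame(
    {"writing_pattern": [infer(surface_features(c, d)) for c, d in pairs]}
)
print(sentence_df["writing_pattern"].tolist())
```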