tinykavi committed on
Commit
5548ff6
·
1 Parent(s): da6c237

Add writing_pattern_classifier package for live demo

writing_pattern_classifier/README.md ADDED
@@ -0,0 +1,163 @@
+ # Dyslexic Writing-Pattern Classifier (Sinhala)
+
+ This module implements an **interpretable, rule-based dyslexic writing-pattern classifier** for Sinhala text.
+
+ Unlike traditional machine-learning classifiers, this component focuses on **pattern inference and explainability** rather than predictive accuracy.
+ It is designed to analyze _how_ dyslexic writing manifests, not merely _whether_ dyslexia is present.
+
+ ---
+
+ ## Purpose
+
+ - Identify **dominant dyslexic writing patterns** in Sinhala text
+ - Provide **explainable, linguistically grounded analysis**
+ - Support educational and research-oriented dyslexia-aware systems
+
+ This module is executed **only after** an essay has been identified as dyslexic by the Binary Dyslexia Detector.
+
+ ---
+
+ ## Core Design Principle
+
+ > Dyslexia is expressed through **consistent patterns of surface-level writing errors**, not isolated mistakes.
+
+ Therefore, this classifier infers patterns using **rule-based dominance of error signals** rather than supervised learning.
+
+ ---
+
+ ## Writing Patterns Identified
+
+ The system currently identifies the following dyslexic writing patterns:
+
+ - **Orthographic Instability**
+   Frequent character omissions, additions, or diacritic loss
+
+ - **Phonetic Confusion**
+   Character substitutions reflecting phonetic similarity
+
+ - **Mixed Dyslexic Pattern**
+   Co-occurrence of multiple dominant error types
+
+ - **No Dominant Pattern**
+   Absence of consistent dyslexic error behavior
+
+ - **Word Boundary Confusion** (when applicable)
+   Spacing and word segmentation errors
+
+ These patterns are derived from dyslexia-related literature and adapted for Sinhala writing.
+
+ ---
+
+ ## Processing Pipeline
+
+ ### 1. Sentence-Level Analysis
+
+ For each sentence:
+
+ - Clean and dyslexic versions are compared
+ - Surface error features are extracted:
+   - Character addition
+   - Character omission
+   - Character substitution
+   - Diacritic loss
+   - Spacing issues
+ - A **rule-based inference engine** assigns a sentence-level writing pattern
+
+ ### 2. Essay-Level Aggregation
+
+ Because the dataset does not provide explicit essay boundaries:
+
+ - Essays are approximated using **fixed-size sentence windows** (pseudo-essays)
+ - Sentence-level patterns are aggregated per essay
+
+ ### 3. Dominant Pattern Classification
+
+ For each essay:
+
+ - The most frequent pattern is selected as the **dominant pattern**
+ - A **confidence score** is computed as:
+
+ \[
+ \text{Confidence} = \frac{\text{Number of sentences supporting dominant pattern}}
+ {\text{Total number of sentences in essay}}
+ \]
+
+ - Dominance strength is categorized from the confidence score (thresholds as in `src/essay_profile.py`):
+   - `Strong` (confidence ≥ 0.6)
+   - `Moderate` (confidence ≥ 0.4)
+   - `Weak / Mixed` (otherwise)
+
+ ---
+
+ ## Outputs
+
+ For each essay, the classifier produces:
+
+ - Dominant dyslexic writing pattern
+ - Pattern dominance confidence
+ - Dominance strength label
+ - Sentence-level pattern breakdown (for explainability)
+
+ ### Example Output
+
+ ```json
+ {
+   "dominant_pattern": "Orthographic Instability",
+   "confidence": 0.6,
+   "dominance_strength": "Strong"
+ }
+ ```
+
+ ---
+
+ ## Evaluation Strategy
+
+ This component does not use supervised evaluation metrics such as accuracy or F1-score.
+
+ Reason:
+
+ - Essay-level pattern labels are inferred, not manually annotated
+
+ - Reporting accuracy against labels produced by the same rules would be circular (label leakage)
+
+ Instead, evaluation is performed using:
+
+ - Pattern distribution analysis
+
+ - Confidence distribution statistics
+
+ - Qualitative case studies with sentence-level evidence
+
+ This approach aligns with best practices in dyslexia-related linguistic analysis.
+
+ ## Notebooks
+
+ notebooks/
+ ├── 01_sentence_level_dyslexic_pattern_inference.ipynb
+ └── 02_essay_level_dyslexic_pattern_profiling.ipynb
+
+ These notebooks document the full development and validation process.
+
+ ## Limitations
+
+ - Essay boundaries are approximated using fixed-size sentence windows
+
+ - The system does not perform clinical diagnosis
+
+ - Pattern definitions may evolve with expert validation
+
+ ## Role in the Overall System
+
+ (Binary Dyslexia Detector)
+
+ Dyslexic Essay
+
+ Writing-Pattern Classifier
+
+ Pattern Profile + Confidence
+
+ ## Disclaimer
+
+ This module is intended for research and educational purposes only and should not be used for clinical diagnosis.
+
+ Generated CSV artifacts are intentionally excluded from version control and can be reproduced by executing the notebooks or pipeline.
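The confidence score defined in the README's "Dominant Pattern Classification" step can be illustrated without pandas. The following is a minimal sketch, not the module's implementation: the helper name `dominant_pattern` is hypothetical, and ties are resolved here by first-seen order, which may differ from the pandas-based code in `src/essay_profile.py`.

```python
from collections import Counter


def dominant_pattern(sentence_patterns):
    """Return (pattern, confidence) for one essay's sentence-level labels.

    Hypothetical helper for illustration only; confidence is the share of
    sentences that carry the most frequent label.
    """
    if not sentence_patterns:
        raise ValueError("an essay needs at least one sentence")
    counts = Counter(sentence_patterns)
    pattern, support = counts.most_common(1)[0]
    return pattern, support / len(sentence_patterns)


# Five sentence-level labels forming one pseudo-essay.
labels = [
    "Orthographic Instability",
    "Orthographic Instability",
    "Phonetic Confusion",
    "Orthographic Instability",
    "No Dominant Pattern",
]
pattern, confidence = dominant_pattern(labels)
print(pattern, confidence)
```

Three of the five sentences support the dominant label, so the score is 3/5, the same value shown in the README's example output.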
writing_pattern_classifier/__init__.py ADDED
File without changes
writing_pattern_classifier/__pycache__/__init__.cpython-312.pyc ADDED
Binary file (185 Bytes). View file
 
writing_pattern_classifier/artifacts/pattern_taxonomy.md ADDED
@@ -0,0 +1,79 @@
+ # Sinhala Dyslexic Writing Pattern Taxonomy
+
+ This document defines the interpretable dyslexic writing-pattern taxonomy used in this project.
+
+ The taxonomy is derived from surface-level orthographic and phonetic deviations observed in Sinhala dyslexic writing.
+
+ ---
+
+ ## 1. Orthographic Instability
+
+ **Definition:**
+ Inconsistent or incorrect written forms of characters without strong phonetic substitution.
+
+ **Surface Signals:**
+
+ - Character omission
+ - Character addition
+ - Diacritic loss
+ - Inconsistent spelling
+
+ **Example:**
+
+ - Clean: රුපියල් දෙදාහක් තියෙනවා
+ - Dyslexic: රුපියල් දෙදාහක් තියනව
+
+ ---
+
+ ## 2. Phonetic Confusion
+
+ **Definition:**
+ Errors that reflect confusion between phonologically similar sounds.
+
+ **Surface Signals:**
+
+ - Character substitution
+ - Phonetically similar replacements
+
+ **Example:**
+
+ - Clean: ගණිත
+ - Dyslexic: ගනිත
+
+ ---
+
+ ## 3. Word Boundary Confusion
+
+ **Definition:**
+ Difficulty maintaining correct word segmentation.
+
+ **Surface Signals:**
+
+ - Word merges
+ - Extra spaces
+ - Missing spaces
+
+ ---
+
+ ## 4. Mixed Dyslexic Pattern
+
+ **Definition:**
+ Presence of multiple dyslexic patterns within the same sentence or essay.
+
+ **Criteria:**
+
+ - More than one dominant surface error type
+
+ ---
+
+ ## 5. No Dominant Pattern
+
+ **Definition:**
+ No consistent dyslexic pattern detected or very low error density.
+
+ ---
+
+ ## Notes
+
+ - Patterns are assigned using rule-based dominance logic.
+ - This system prioritizes explainability over raw accuracy.
writing_pattern_classifier/notebooks/01_sentence_level_dyslexic_pattern_inference.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
writing_pattern_classifier/notebooks/02_essay_level_dyslexic_pattern_profiling.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
writing_pattern_classifier/notebooks/OLD_02_essay_level_dyslexic_pattern_profiling.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
writing_pattern_classifier/src/__init__.py ADDED
File without changes
writing_pattern_classifier/src/__pycache__/__init__.cpython-312.pyc ADDED
Binary file (189 Bytes). View file
 
writing_pattern_classifier/src/__pycache__/essay_profile.cpython-312.pyc ADDED
Binary file (2.84 kB). View file
 
writing_pattern_classifier/src/__pycache__/feature_extraction.cpython-312.pyc ADDED
Binary file (3.34 kB). View file
 
writing_pattern_classifier/src/__pycache__/pattern_rules.cpython-312.pyc ADDED
Binary file (1.38 kB). View file
 
writing_pattern_classifier/src/__pycache__/pipeline.cpython-312.pyc ADDED
Binary file (2.44 kB). View file
 
writing_pattern_classifier/src/essay_profile.py ADDED
@@ -0,0 +1,87 @@
+ """
+ Essay-level dyslexic writing pattern profiling.
+
+ This module aggregates sentence-level dyslexic writing patterns
+ into dominance-based essay profiles.
+ """
+
+ import pandas as pd
+
+
+ def assign_essay_ids(df: pd.DataFrame, essay_size: int = 5) -> pd.DataFrame:
+     """
+     Assign essay IDs to sentence-level data using fixed-size grouping.
+
+     Parameters
+     ----------
+     df : pd.DataFrame
+         DataFrame containing sentence-level patterns.
+     essay_size : int
+         Number of sentences per essay abstraction.
+
+     Returns
+     -------
+     pd.DataFrame
+         DataFrame with an added 'essay_id' column.
+     """
+     df = df.copy()
+     # Group by row position rather than index label, so a non-default index still works.
+     df["essay_id"] = [position // essay_size for position in range(len(df))]
+     return df
+
+
+ def profile_essays(df: pd.DataFrame) -> pd.DataFrame:
+     """
+     Aggregate sentence-level patterns into essay-level dominance profiles.
+
+     Parameters
+     ----------
+     df : pd.DataFrame
+         DataFrame containing 'essay_id' and 'writing_pattern'.
+
+     Returns
+     -------
+     pd.DataFrame
+         Essay-level pattern profiles with dominance and confidence.
+     """
+
+     # Count patterns per essay
+     pattern_counts = (
+         df
+         .groupby("essay_id")["writing_pattern"]
+         .value_counts()
+         .unstack(fill_value=0)
+     )
+
+     essay_summary = pattern_counts.copy()
+
+     # Dominant pattern
+     essay_summary["dominant_pattern"] = essay_summary.idxmax(axis=1)
+
+     # Compute dominance metrics
+     pattern_columns = pattern_counts.columns
+     essay_summary["max_count"] = essay_summary[pattern_columns].max(axis=1)
+     essay_summary["total_sentences"] = essay_summary[pattern_columns].sum(axis=1)
+
+     essay_summary["confidence"] = (
+         essay_summary["max_count"] / essay_summary["total_sentences"]
+     )
+
+     # Dominance strength categorization
+     essay_summary["dominance_strength"] = essay_summary["confidence"].apply(
+         dominance_strength
+     )
+
+     return essay_summary.reset_index()
+
+
+ def dominance_strength(confidence: float) -> str:
+     """
+     Categorize dominance strength based on confidence score.
+     """
+     if confidence >= 0.6:
+         return "Strong"
+     elif confidence >= 0.4:
+         return "Moderate"
+     else:
+         return "Weak / Mixed"
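The fixed-size windowing used by `assign_essay_ids` can be looked at in isolation. The sketch below is pure Python with a hypothetical helper name (`window_ids`); it only shows how sentence positions map to pseudo-essay IDs and how large each window ends up when the sentence count is not a multiple of `essay_size`.

```python
from collections import Counter


def window_ids(n_sentences: int, essay_size: int = 5) -> list[int]:
    """Map each sentence position to a pseudo-essay ID (position // size).

    Hypothetical helper for illustration; not part of the committed module.
    """
    if essay_size <= 0:
        raise ValueError("essay_size must be positive")
    return [position // essay_size for position in range(n_sentences)]


# Twelve sentences with the default window of five.
ids = window_ids(12, essay_size=5)
sizes = Counter(ids)  # number of sentences that land in each pseudo-essay
print(ids)
print(dict(sizes))
```

The last pseudo-essay here holds only two sentences. With two sentences the confidence ratio can only be 0.5 or 1.0, so under the thresholds in `dominance_strength` such a window is never labeled "Weak / Mixed". It may be worth keeping this in mind when reading confidence distributions.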
writing_pattern_classifier/src/feature_extraction.py ADDED
@@ -0,0 +1,75 @@
+ """
+ Sentence-level surface feature extraction for Sinhala dyslexic writing analysis.
+
+ This module computes interpretable surface-level error signals
+ by comparing clean and dyslexic sentence pairs.
+ """
+
+ import difflib
+
+ # Sinhala diacritic characters
+ SINHALA_DIACRITICS = {
+     "ා", "ැ", "ෑ", "ි", "ී", "ු", "ූ", "ෘ", "ෙ", "ේ", "ො", "ෝ", "ං", "ඃ"
+ }
+
+
+ def char_level_diff(clean: str, dyslexic: str) -> dict:
+     """
+     Compute character-level edit operations between clean and dyslexic sentences.
+     """
+     matcher = difflib.SequenceMatcher(None, clean, dyslexic)
+
+     additions = omissions = substitutions = 0
+
+     for tag, i1, i2, j1, j2 in matcher.get_opcodes():
+         if tag == "insert":
+             additions += (j2 - j1)
+         elif tag == "delete":
+             omissions += (i2 - i1)
+         elif tag == "replace":
+             substitutions += max(i2 - i1, j2 - j1)
+
+     return {
+         "char_addition": additions,
+         "char_omission": omissions,
+         "char_substitution": substitutions,
+         "has_addition": additions > 0,
+         "has_omission": omissions > 0,
+         "has_substitution": substitutions > 0,
+     }
+
+
+ def spacing_diff(clean: str, dyslexic: str) -> dict:
+     """
+     Detect word boundary (spacing) inconsistencies.
+     """
+     diff = abs(len(clean.split()) - len(dyslexic.split()))
+     return {
+         "word_count_diff": diff,
+         "has_spacing_issue": diff > 0,
+     }
+
+
+ def diacritic_loss(clean: str, dyslexic: str) -> dict:
+     """
+     Detect diacritic loss in dyslexic writing.
+     """
+     clean_count = sum(1 for c in clean if c in SINHALA_DIACRITICS)
+     dys_count = sum(1 for c in dyslexic if c in SINHALA_DIACRITICS)
+
+     return {
+         "has_diacritic_loss": clean_count > dys_count
+     }
+
+
+ def extract_surface_features(clean_sentence: str, dyslexic_sentence: str) -> dict:
+     """
+     Extract all sentence-level surface features.
+     """
+     features = {}
+
+     features.update(char_level_diff(clean_sentence, dyslexic_sentence))
+     features.update(spacing_diff(clean_sentence, dyslexic_sentence))
+     features.update(diacritic_loss(clean_sentence, dyslexic_sentence))
+
+     return features
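To see what `char_level_diff` consumes, it can help to print the raw opcodes that `difflib.SequenceMatcher` produces for a word pair. The sketch below uses the "Phonetic Confusion" example from the taxonomy document (ගණිත written as ගනිත); the variable names are illustrative only.

```python
import difflib

# Word pair taken from the taxonomy's Phonetic Confusion example.
clean, dyslexic = "ගණිත", "ගනිත"

# Each opcode is (tag, i1, i2, j1, j2): clean[i1:i2] maps to dyslexic[j1:j2].
opcodes = difflib.SequenceMatcher(None, clean, dyslexic).get_opcodes()

for tag, i1, i2, j1, j2 in opcodes:
    print(tag, repr(clean[i1:i2]), "->", repr(dyslexic[j1:j2]))

tags = [tag for tag, *_ in opcodes]
```

The comparison runs over Unicode code points, so a Sinhala consonant and its vowel sign are counted as separate characters. In this pair only the consonant differs, which shows up as a single `replace` opcode between two `equal` runs.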
writing_pattern_classifier/src/pattern_rules.py ADDED
@@ -0,0 +1,46 @@
+ """
+ Rule-based sentence-level dyslexic writing pattern inference.
+
+ This module implements dominance-aware, interpretable rules
+ for identifying dyslexic writing patterns from surface features.
+ """
+
+
+ def infer_pattern(features: dict) -> str:
+     """
+     Infer the dominant dyslexic writing pattern for a sentence
+     using surface-level error signals.
+
+     Parameters
+     ----------
+     features : dict
+         Dictionary containing extracted surface features.
+
+     Returns
+     -------
+     str
+         One of the predefined dyslexic writing pattern labels.
+     """
+
+     # Priority 1: Word boundary confusion
+     if features.get("has_spacing_issue"):
+         return "Word Boundary Confusion"
+
+     has_sub = features.get("has_substitution", False)
+     has_omit = features.get("has_omission", False)
+     has_diacritic = features.get("has_diacritic_loss", False)
+
+     # Priority 2: Mixed dyslexic pattern
+     if has_sub and has_omit:
+         return "Mixed Dyslexic Pattern"
+
+     # Priority 3: Phonetic confusion
+     if has_sub:
+         return "Phonetic Confusion"
+
+     # Priority 4: Orthographic instability
+     if has_omit or has_diacritic:
+         return "Orthographic Instability"
+
+     # Fallback
+     return "No Dominant Pattern"
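The priority order in `infer_pattern` can also be written as an ordered rule table, which makes the precedence explicit and easy to audit. This is an alternative sketch under that assumption, not the committed implementation: `RULES` and `infer_pattern_from_table` are hypothetical names, while the feature keys and labels are the ones used in `pattern_rules.py`.

```python
# Ordered (label, predicate) pairs; the first predicate that holds wins.
RULES = [
    ("Word Boundary Confusion",
     lambda f: bool(f.get("has_spacing_issue", False))),
    ("Mixed Dyslexic Pattern",
     lambda f: bool(f.get("has_substitution", False) and f.get("has_omission", False))),
    ("Phonetic Confusion",
     lambda f: bool(f.get("has_substitution", False))),
    ("Orthographic Instability",
     lambda f: bool(f.get("has_omission", False) or f.get("has_diacritic_loss", False))),
]


def infer_pattern_from_table(features: dict) -> str:
    """Return the label of the first matching rule, else the fallback label."""
    for label, predicate in RULES:
        if predicate(features):
            return label
    return "No Dominant Pattern"


# A spacing issue takes precedence over any character-level signal.
print(infer_pattern_from_table({"has_spacing_issue": True, "has_substitution": True}))
# Additions alone are not consulted by any rule, so they reach the fallback.
print(infer_pattern_from_table({"has_addition": True}))
```

Two consequences of the ordering are visible here: a sentence with any word-count difference is labeled "Word Boundary Confusion" regardless of its other errors, and `has_addition` does not influence the label even though the feature extractor computes it.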
writing_pattern_classifier/src/pipeline.py ADDED
@@ -0,0 +1,68 @@
+ """
+ End-to-end pipeline for Sinhala dyslexic writing pattern analysis.
+
+ This module orchestrates sentence-level feature extraction,
+ pattern inference, and essay-level profiling.
+ """
+
+ import pandas as pd
+
+ from .feature_extraction import extract_surface_features
+ from .pattern_rules import infer_pattern
+ from .essay_profile import assign_essay_ids, profile_essays
+
+
+ def run_pattern_analysis(
+     df: pd.DataFrame,
+     essay_size: int = 5
+ ) -> tuple[pd.DataFrame, pd.DataFrame]:
+     """
+     Run the complete dyslexic writing pattern analysis pipeline.
+
+     Parameters
+     ----------
+     df : pd.DataFrame
+         Input DataFrame containing:
+         - 'clean_sentence'
+         - 'dyslexic_sentence'
+     essay_size : int
+         Number of sentences per essay abstraction.
+
+     Returns
+     -------
+     tuple (sentence_df, essay_df)
+         sentence_df : pd.DataFrame
+             Sentence-level features and inferred patterns.
+         essay_df : pd.DataFrame
+             Essay-level dominance profiles.
+     """
+
+     # Copy with a fresh 0..n-1 index so the extracted features align
+     # row-for-row with the input in the concat below.
+     df = df.reset_index(drop=True)
+
+     # --- Sentence-level feature extraction ---
+     surface_features = df.apply(
+         lambda row: extract_surface_features(
+             row["clean_sentence"],
+             row["dyslexic_sentence"]
+         ),
+         axis=1
+     )
+
+     feature_df = pd.concat(
+         [df, surface_features.apply(pd.Series)],
+         axis=1
+     )
+
+     # --- Sentence-level pattern inference ---
+     feature_df["writing_pattern"] = feature_df.apply(
+         lambda row: infer_pattern(row),
+         axis=1
+     )
+
+     # --- Essay-level profiling ---
+     feature_df = assign_essay_ids(feature_df, essay_size=essay_size)
+     essay_df = profile_essays(feature_df)
+
+     return feature_df, essay_df