---
language:
- en
tags:
- text2text-generation
- dyslexia
- grammar-correction
- style-preservation
- lora
- flan-t5
license: mit
base_model: google/flan-t5-small
datasets:
- jhu-clsp/jfleg
- bea2019st/wi_locness
pipeline_tag: translation
---

# Dyslexia Academic Writing Correction System

> **A style-preserving, grammar-correcting, academic-vocabulary-elevating AI system that corrects dyslexic writing while maintaining the author's personal voice, tone, and authorship signal — a corrector, not a rewriter.**

## Overview

This system takes text written by dyslexic students and corrects grammar, spelling, and fluency errors while:

1. **Preserving the author's unique writing style** via a 512-dimensional style fingerprint vector
2. **Elevating vocabulary to academic register** using Coxhead's Academic Word List (AWL) and BERT-based lexical substitution
3. **Resisting AI detection** through a frozen Human Pattern Classifier that penalises AI-typical writing during training
4. **Maintaining semantic meaning** with cosine-similarity-based semantic preservation loss

The core model is **Google Flan-T5-Small** fine-tuned with **LoRA** (Low-Rank Adaptation, r=16), trained on real learner error corpora (JFLEG, W&I+LOCNESS) augmented with synthetic dyslexia-simulated data.

---

## Latest Evaluation Results (v2)

| Metric | Score | Description |
|--------|-------|-------------|
| **GLEU** | **0.7506** | Grammar + fluency correction quality |
| **BERTScore F1** | **0.9733** | Semantic closeness to reference corrections |
| **1 − WER** | **0.8488** | Word-level accuracy (WER = 15.12%) |
| **Composite** | **0.8576** | `(GLEU + BERTScore F1 + (1−WER)) / 3` — gating score for Hub push |

> The model is only pushed to the Hub when the composite score strictly beats the saved baseline from the previous run, ensuring the Hub always holds the best-seen weights.
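
The gate itself is a few lines of arithmetic. A minimal sketch, assuming `baseline_score.json` stores the previous composite under a `composite` key (the actual key name may differ):

```python
import json
from pathlib import Path

def composite(gleu: float, bertscore_f1: float, wer: float) -> float:
    """Composite = (GLEU + BERTScore F1 + (1 - WER)) / 3."""
    return (gleu + bertscore_f1 + (1.0 - wer)) / 3.0

def should_push(new_score: float, baseline_path: str = "baseline_score.json") -> bool:
    """Push to the Hub only if the new composite strictly beats the saved baseline."""
    path = Path(baseline_path)
    baseline = json.loads(path.read_text())["composite"] if path.exists() else float("-inf")
    return new_score > baseline

score = composite(gleu=0.7506, bertscore_f1=0.9733, wer=0.1512)  # ≈ 0.8576
print(should_push(score))
```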

---

## What Changed in v2

The original model had a critical bug: `CorrectionTrainer.compute_loss()` only used cross-entropy loss. The multi-objective loss (`L_CE + λ_style + λ_semantic + λ_human`) was fully designed in `loss_functions.py` but was **never wired into the trainer**. v2 fixes this and upgrades several other parameters.

| Parameter | v1 (Original) | v2 (Upgraded) |
|-----------|--------------|---------------|
| LoRA rank | r=8, α=16 | **r=16, α=32** |
| Epochs | 5 | **10** |
| Effective batch size | 32 (4×8 accum) | **64 (2×32 accum)** |
| Learning rate | 3e-4 | **2e-4** (more stable over longer run) |
| Warmup ratio | 5% | **10%** |
| Label smoothing | none | **0.1** (reduces overconfidence) |
| Loss function | CE only *(bug)* | **CE + Style + Semantic** *(fixed)* |
| Human-pattern loss | designed, unused | omitted on CPU; falls back to CE+style+sem |
| Evaluation | GLEU only | **GLEU + BERTScore F1 + (1−WER) composite** |
| Eval/save strategy | every 100 steps | **per epoch** |
| Early stopping | none | **patience=3** |
| Hub gate | none | **composite must beat saved baseline** |
| Warm-start strategy | cold start | **merge r=8 adapter → apply fresh r=16 LoRA** |
| Data split | 90%/10% train/val | **88%/7%/5% train/val/test** |
| Dyslexia augmentation error rate | 15% | **20%** |

### Combined Loss (v2)

```
L = L_CE + 0.3·L_style + 0.5·L_semantic
```

The human-pattern loss (`λ₃·L_human`) is kept in the design but skipped on CPU (requires GPT-2 perplexity scoring). Style and semantic losses use a lightweight `StyleMLP` — no spaCy or external models required at training time.

| Term | Purpose | Weight |
|------|---------|--------|
| `L_CE` | Cross-entropy with label smoothing (0.1) | 1.0 |
| `L_style` | `1 − cos_sim(style(input), style(output))` — preserves writing fingerprint | 0.3 |
| `L_semantic` | `1 − cos_sim(input_emb, output_emb)` — preserves meaning | 0.5 |
| `L_human` | `1 − HumanPatternClassifier(output)` — anti-AI penalty | 0.4 *(GPU only)* |
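
For intuition, here is a simplified sketch of how the active terms combine inside `compute_loss`; the `style_mlp` call signature and the mean-pooled representations are assumptions for illustration, not the exact code in `loss_functions.py`:

```python
import torch
import torch.nn.functional as F

def combined_loss(ce_loss: torch.Tensor,
                  input_repr: torch.Tensor,
                  output_repr: torch.Tensor,
                  style_mlp: torch.nn.Module,
                  lambda_style: float = 0.3,
                  lambda_semantic: float = 0.5) -> torch.Tensor:
    """L = L_CE + 0.3*L_style + 0.5*L_semantic (CPU variant, no human-pattern term).

    input_repr / output_repr: mean-pooled token embeddings of the source text and
    the model output, shape (batch, hidden).
    """
    # Style loss: 1 - cosine similarity between the two style fingerprints
    style_in, style_out = style_mlp(input_repr), style_mlp(output_repr)
    l_style = (1.0 - F.cosine_similarity(style_in, style_out, dim=-1)).mean()

    # Semantic loss: 1 - cosine similarity between the pooled embeddings themselves
    l_semantic = (1.0 - F.cosine_similarity(input_repr, output_repr, dim=-1)).mean()

    return ce_loss + lambda_style * l_style + lambda_semantic * l_semantic
```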

### Warm-Start Merge Strategy

Rather than fine-tuning from scratch, v2 preserves the corrections learned by the original r=8 adapter:

1. Load existing LoRA adapter (r=8) from Hub
2. Merge adapter weights into base model (`merge_and_unload()`)
3. Apply a **fresh LoRA at r=16** on top of the merged base
4. Train with combined loss for 10 epochs

This doubles the adapter's representational capacity while retaining previously learned correction patterns.
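
In PEFT terms the warm start looks roughly like the sketch below; the adapter is assumed to live in this repo (`morpheuslord/rewrite`), and the exact code in `train_and_upgrade.py` may differ:

```python
from peft import LoraConfig, PeftModel, get_peft_model
from transformers import AutoModelForSeq2SeqLM

# Steps 1-2: load the existing r=8 adapter from the Hub and merge it into the base weights
base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
merged = PeftModel.from_pretrained(base, "morpheuslord/rewrite").merge_and_unload()

# Step 3: apply a fresh r=16 LoRA on top of the merged base
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "k", "v", "o", "wi_0", "wi_1", "wo"],
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(merged, config)

# Step 4: train `model` with the combined loss for 10 epochs
```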

---

## Features

| Feature | Description |
|---------|-------------|
| **Two-pass spell correction** | Dyslexia-aware phonetic pattern handling via LanguageTool |
| **Style fingerprinting** | 41 raw features → MLP → 512-dim L2-normalised style vector |
| **LoRA fine-tuning** | r=16, α=32, dropout=0.05 — targeting all attention + FFN projections |
| **Academic vocabulary elevation** | BERT fill-mask → AWL candidate filtering → semantic similarity gate |
| **Human pattern anti-AI loss** | Pre-trained frozen MLP classifier (17-dim features including GPT-2 perplexity) |
| **Combined training loss** | `L_CE + λ₁·L_style + λ₂·L_semantic (+ λ₃·L_human on GPU)` |
| **Sentence-chunked inference** | Long texts split into 128-token chunks matching training window |
| **FastAPI server** | RESTful `/correct` endpoint with CORS and rate limiting |
| **Multi-stage training** | Orchestrated via `train.sh` with checkpoint system (Skip/Redo/Continue) |
| **Synthetic data augmentation** | `DyslexiaSimulator` generates realistic errors from clean text (20% error rate) |
| **Composite score gating** | Hub push only if new model strictly beats saved baseline |

---

## Project Structure

```
Rewriter/
├── configs/
│   ├── training_config.yaml          # Full training hyperparameters
│   ├── training_config_fast.yaml     # Quick iteration config
│   ├── inference_config.yaml         # Inference & generation settings
│   ├── model_config.yaml             # Model architecture registry
│   └── awl_config.yaml               # Academic Word List settings
├── scripts/
│   ├── train.py                      # Main training script (Click CLI)
│   ├── evaluate.py                   # Test set evaluation (GLEU, ERRANT, BERTScore)
│   ├── run_inference.py              # Interactive CLI inference
│   ├── preprocess_data.py            # Raw datasets → unified JSONL
│   ├── pretrain_human_pattern_classifier.py  # Stage 3: anti-AI classifier
│   ├── download_datasets.sh          # BEA-2019 dataset downloader
│   └── download_kaggle_datasets.sh   # Kaggle human/AI data downloader
├── src/
│   ├── model/
│   │   ├── base_model.py             # Model loader (T5/BART/Llama + LoRA + quantization)
│   │   ├── style_conditioner.py      # Prefix tuning: style → virtual tokens
│   │   ├── generation_utils.py       # Beam search, sampling, batch generation
│   │   └── lora_adapter.py           # LoRA configuration helpers
│   ├── preprocessing/
│   │   ├── pipeline.py               # Full preprocessing orchestrator
│   │   ├── spell_corrector.py        # LanguageTool + dyslexia-aware correction
│   │   ├── dyslexia_simulator.py     # Synthetic error generation (Rello et al.)
│   │   ├── dependency_parser.py      # spaCy dependency tree analysis
│   │   ├── ner_tagger.py             # Named entity protection
│   │   └── sentence_segmenter.py     # Sentence boundary detection
│   ├── style/
│   │   ├── fingerprinter.py          # 41 features → 512-dim style vector
│   │   ├── style_vector.py           # Style vector dataclass
│   │   ├── formality_classifier.py   # Rule-based formality scoring
│   │   └── emotion_classifier.py     # Emotion detection
│   ├── training/
│   │   ├── dataset.py                # Pre-tokenized cached dataset with style vectors
│   │   ├── trainer.py                # CorrectionTrainer (HF Trainer + PEFT fixes)
│   │   ├── loss_functions.py         # V1 and V2 combined losses
│   │   ├── human_pattern_extractor.py  # 17-dim feature extraction + classifier
│   │   └── callbacks.py              # Evaluation logging callbacks
│   ├── vocabulary/
│   │   ├── lexical_substitution.py   # BERT fill-mask → AWL substitution pipeline
│   │   ├── awl_loader.py             # Coxhead Academic Word List loader
│   │   └── register_filter.py        # Contraction expansion + colloquial replacement
│   ├── inference/
│   │   ├── corrector.py              # End-to-end inference pipeline orchestrator
│   │   └── postprocessor.py          # Cleanup, entity restore, formatting
│   ├── evaluation/
│   │   ├── gleu_scorer.py            # GLEU + BERTScore computation
│   │   ├── errant_evaluator.py       # ERRANT P/R/F0.5 evaluation
│   │   ├── style_metrics.py          # Style similarity + AWL coverage
│   │   └── authorship_verifier.py    # AI detection resistance testing
│   └── api/
│       ├── main.py                   # FastAPI application
│       ├── schemas.py                # Pydantic request/response models
│       └── middleware.py             # Rate limiting + CORS
├── train_and_upgrade.py              # v2 upgrade pipeline (self-improving Hub push)
├── data/
│   ├── raw/                          # Original datasets (JFLEG, W&I+LOCNESS)
│   ├── processed/                    # Unified JSONL (train/val/test splits)
│   ├── cache/                        # Pre-tokenized dataset caches (.pt files)
│   └── awl/                          # Coxhead Academic Word List
├── train.sh                          # Multi-stage training orchestrator
├── start.sh                          # Inference launcher (CLI or API mode)
├── baseline_score.json               # Saved composite score — gate for Hub push
├── Dockerfile                        # Production container
├── docker-compose.yml                # Docker deployment
├── requirements.txt                  # Python dependencies
└── pyproject.toml                    # Project metadata
```

---

## Model Architecture

### PNG:



### Mermaid Diagram:
```mermaid
graph TB
    subgraph INFERENCE["🔮 Inference Pipeline"]
        direction TB
        INPUT["📝 Raw Dyslexic Text"]
        subgraph PREPROCESS["Pre-Processing"]
            SPELL["Spell Corrector<br/><i>dyslexia-aware phonetic</i>"]
            SENT_SEG["Sentence Segmenter"]
            DEP_PARSE["Dependency Parser"]
            NER["NER Tagger"]
        end
        subgraph STYLE["Style Analysis"]
            FINGER["Style Fingerprinter<br/><i>512-dim vector</i>"]
            EMOTION["Emotion Classifier"]
            FORMALITY["Formality Classifier"]
            STYLE_VEC["Style Vector Composer"]
        end
        subgraph GENERATION["Core Generation"]
            STYLE_COND["Style Conditioner<br/><i>prefix tuning</i>"]
            BASE_MODEL["Base LM<br/><i>Flan-T5-Small (warm-merged)</i>"]
            LORA["LoRA Adapter<br/><i>r=16</i>"]
            GEN_UTILS["Generation Utils<br/><i>beam search, sampling</i>"]
        end
        subgraph POSTPROCESS["Post-Processing"]
            POSTPROC["Post-Processor<br/><i>formatting, cleanup</i>"]
            VOCAB_SUB["Lexical Substitution<br/><i>BERT-based</i>"]
            AWL["AWL Loader<br/><i>Coxhead Academic Word List</i>"]
            REG_FILTER["Register Filter<br/><i>academic tone gate</i>"]
        end
        OUTPUT["✅ Corrected Academic Text"]
        INPUT --> SPELL --> SENT_SEG --> DEP_PARSE --> NER
        INPUT --> FINGER --> EMOTION --> FORMALITY --> STYLE_VEC
        NER --> STYLE_COND
        STYLE_VEC --> STYLE_COND
        STYLE_COND --> BASE_MODEL
        LORA -.->|"merged weights"| BASE_MODEL
        BASE_MODEL --> GEN_UTILS --> POSTPROC
        POSTPROC --> VOCAB_SUB
        AWL --> VOCAB_SUB
        VOCAB_SUB --> REG_FILTER --> OUTPUT
    end

    subgraph TRAINING["🏋️ Training Pipeline (v2)"]
        direction TB
        subgraph WARMSTART["Warm-Start Merge"]
            HUB_ADAPTER["Hub LoRA Adapter<br/><i>r=8 (existing)</i>"]
            MERGE["merge_and_unload()"]
            FRESH_LORA["Fresh LoRA r=16"]
        end
        subgraph DATA["Data Pipeline"]
            JFLEG["jhu-clsp/jfleg<br/><i>~5k pairs, 4 refs each</i>"]
            WILOCNESS["bea2019st/wi_locness<br/><i>~34k pairs</i>"]
            DYSLEXIA_AUG["DyslexiaSimulator<br/><i>20% error rate augmentation</i>"]
            SPLIT["88% train / 7% val / 5% test"]
        end
        subgraph LOSS["Combined Loss (v2 — now active)"]
            L_CE["L_CE + label_smoothing=0.1"]
            L_STYLE["0.3 · L_style"]
            L_SEM["0.5 · L_semantic"]
            L_HUMAN["0.4 · L_human<br/><i>(GPU only)</i>"]
        end
        subgraph EVAL["Composite Evaluation"]
            GLEU_E["GLEU"]
            BERT_E["BERTScore F1"]
            WER_E["1 − WER"]
            COMPOSITE["Composite = mean(3)"]
            GATE["Beat baseline?"]
            HUB_PUSH["Push to Hub ✅"]
        end
        HUB_ADAPTER --> MERGE --> FRESH_LORA
        JFLEG --> DYSLEXIA_AUG
        WILOCNESS --> DYSLEXIA_AUG
        DYSLEXIA_AUG --> SPLIT
        L_CE --> COMPOSITE
        L_STYLE --> COMPOSITE
        L_SEM --> COMPOSITE
        GLEU_E --> COMPOSITE
        BERT_E --> COMPOSITE
        WER_E --> COMPOSITE
        COMPOSITE --> GATE --> HUB_PUSH
    end
```

---

## Design Choices & Rationale

### Why Flan-T5-Small?

| Consideration | Decision |
|---------------|----------|
| **Hardware constraint** | RTX 3050 Laptop GPU (4GB VRAM) — rules out models > 500M params |
| **Architecture** | Encoder-decoder (seq2seq) is ideal for text-to-text correction tasks |
| **Instruction tuning** | Flan-T5 is pre-trained on 1,800+ instruction tasks — follows correction prompts naturally |
| **LoRA efficiency** | Trainable params scale with r: r=16 → ~2.56M (3.3%) — still fits in 4GB |

### Why LoRA over Full Fine-Tuning?

- **Memory**: Full fine-tuning of T5-Small requires ~2.5GB for gradients alone; LoRA r=16 needs ~400MB
- **Warm-start safety**: Merging r=8 weights preserves corrections before expanding capacity to r=16
- **Merging**: LoRA weights merge into the base model at inference time — zero latency overhead
- **Configuration**: `r=16, alpha=32, dropout=0.05`, targeting all attention + FFN projections (`q, k, v, o, wi_0, wi_1, wo`)

### Why a Combined Multi-Objective Loss?

The system uses (on CPU): `L = L_CE + 0.3·L_style + 0.5·L_semantic`

On GPU (with the human-pattern classifier available): `L = L_CE + 0.3·L_style + 0.5·L_semantic + 0.4·L_human`

| Term | Purpose | Weight |
|------|---------|--------|
| `L_CE` | Cross-entropy with label smoothing (0.1) | 1.0 |
| `L_style` | `1 − cos_sim(style(input), style(output))` — preserves writing fingerprint | 0.3 |
| `L_semantic` | `1 − cos_sim(input_emb, output_emb)` — preserves meaning | 0.5 |
| `L_human` | `1 − HumanPatternClassifier(output)` — penalises AI-like text patterns | 0.4 |

The style and semantic losses use a lightweight `StyleMLP` (token embedding mean-pool → linear projection) that adds no external dependencies at training time.

### Why a Human Pattern Classifier?

AI-generated text has detectable statistical signatures:

- **Lower GPT-2 perplexity** (AI text is more "predictable")
- **Lower burstiness** (AI has uniform sentence lengths; humans vary)
- **Higher AI marker density** (overuse of "delve", "leverage", "furthermore")
- **Lower n-gram novelty** (AI reuses phrases more)

The classifier is a 3-layer MLP (17→128→64→1) pre-trained on ~100k samples from two Kaggle datasets (Shanegerami AI_Human.csv + Starblasters8), then **frozen** during main training. Its output score (0 = AI, 1 = human) is used as a reward signal. It requires a GPU for GPT-2 perplexity scoring and falls back gracefully on CPU.
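
A minimal sketch of that classifier head (the layer sizes follow the description above; the activation choice is an assumption):

```python
import torch
import torch.nn as nn

class HumanPatternClassifier(nn.Module):
    """3-layer MLP: 17 stylometric features -> probability that the text is human-written."""

    def __init__(self, n_features: int = 17):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),  # 0 = AI-like, 1 = human-like
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)
```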

### Why Sentence-Chunked Inference?

The model was trained with `max_input_length=128` tokens. The task prefix alone consumes ~40 tokens, leaving ~86 tokens for actual text. Long inputs are:

1. Split into sentences using spaCy
2. Grouped into chunks that fit the 128-token budget
3. Corrected chunk by chunk, independently
4. Joined back together
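
A simplified version of that chunking step (the 40-token prefix budget follows the note above; the grouping heuristic is an illustration, not the exact code in `corrector.py`):

```python
import spacy
from transformers import AutoTokenizer

nlp = spacy.load("en_core_web_sm")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

def chunk_sentences(text: str, budget: int = 128, prefix_tokens: int = 40) -> list[str]:
    """Group sentences into chunks that fit the 128-token training window."""
    limit = budget - prefix_tokens
    chunks, current, current_len = [], [], 0
    for sent in nlp(text).sents:
        n = len(tokenizer(sent.text, add_special_tokens=False)["input_ids"])
        if current and current_len + n > limit:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent.text)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# Each chunk is then corrected independently and the outputs are rejoined.
```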

### Why Post-Generation Vocabulary Elevation?

Rather than relying solely on the model to produce academic vocabulary (which T5-Small lacks the capacity for), a separate BERT-based lexical substitution pipeline is applied:

1. POS-tag the output with spaCy
2. Identify non-AWL content words (nouns, verbs, adjectives, adverbs)
3. Mask each candidate → run BERT fill-mask → filter to AWL-only predictions
4. Accept substitution only if `semantic_similarity > 0.82` (measured with `all-mpnet-base-v2`)
5. Track used substitutions to prevent duplicate replacements
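
A condensed sketch of steps 3–4 (the model names follow the description above; the AWL lookup and scoring details are simplified):

```python
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
embedder = SentenceTransformer("all-mpnet-base-v2")

def elevate_word(sentence: str, word: str, awl: set[str], threshold: float = 0.82) -> str:
    """Try to replace `word` with an AWL term without drifting semantically."""
    masked = sentence.replace(word, fill_mask.tokenizer.mask_token, 1)
    for cand in fill_mask(masked, top_k=20):
        token = cand["token_str"].strip().lower()
        if token not in awl or token == word.lower():
            continue
        candidate = sentence.replace(word, token, 1)
        sim = util.cos_sim(embedder.encode(sentence), embedder.encode(candidate)).item()
        if sim > threshold:
            return candidate  # accept the first substitution that passes the semantic gate
    return sentence
```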

---

## Quick Start

### Prerequisites

- Python ≥ 3.10
- NVIDIA GPU with ≥ 4GB VRAM (or CPU, slower)
- ~10GB disk space for models and datasets

### Option A: Self-Improving Upgrade Pipeline (v2)

This pipeline loads the existing Hub adapter, upgrades it, evaluates, and only pushes if it improves.

```bash
git clone https://huggingface.co/morpheuslord/rewrite && cd rewrite
pip install -r requirements.txt

export HF_TOKEN="your-hf-token-with-write-access"
python train_and_upgrade.py
```

The pipeline handles all 10 steps automatically:
**Load adapter → Warm-start merge → Apply r=16 LoRA → Load data → Train → Evaluate → Gate → Save → Merge → Push**

### Option B: Manual Step-by-Step (original pipeline)

```bash
# 1. Install dependencies
pip install -r requirements.txt
python -m spacy download en_core_web_sm

# 2. Preprocess datasets (FCE, W&I+LOCNESS, JFLEG → unified JSONL)
python scripts/preprocess_data.py

# 3. Pre-train the human pattern classifier
python scripts/pretrain_human_pattern_classifier.py

# 4. Train the correction model
PYTHONPATH=. python scripts/train.py --config configs/training_config.yaml --use-v2-loss

# 5. Merge LoRA adapter into base model for inference
python -c "
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch
model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-small', torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(model, 'checkpoints/checkpoint-BEST')
model = model.merge_and_unload()
model.save_pretrained('checkpoints/best_model_merged')
AutoTokenizer.from_pretrained('google/flan-t5-small').save_pretrained('checkpoints/best_model_merged')
"

# 6. Run inference
PYTHONPATH=. python scripts/run_inference.py --text "The studnet recieved alot of informtion."

# 7. Or start the API server
PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000
```

---

## Training Pipeline

### v2 Upgrade Pipeline (`train_and_upgrade.py`) — 10 Steps

| Step | Action |
|------|--------|
| 1 | Load existing LoRA adapter (r=8) from Hub |
| 2 | Merge into base weights (`merge_and_unload`) — warm start |
| 3 | Apply fresh LoRA r=16 on merged base |
| 4 | Load JFLEG + W&I+LOCNESS; augment with DyslexiaSimulator (20% error rate) |
| 5 | Train with combined loss for 10 epochs, early stopping patience=3 |
| 6 | Evaluate on test set: GLEU + BERTScore F1 + (1−WER) |
| 7 | Compare composite score against `baseline_score.json` |
| 8 | If improved: save LoRA adapter |
| 9 | Merge adapter → save full model |
| 10 | Push adapter + merged model to Hub; update baseline |

### v1 Original Pipeline (`train.sh`) — 5 Stages

| Stage | Action |
|-------|--------|
| 1 | Setup & Dependencies |
| 2 | Data Preprocessing (FCE + W&I+LOCNESS + JFLEG → JSONL) |
| 3 | Human Pattern Classifier Pre-Training |
| 4 | Main Model Training (LoRA r=8, 5 epochs, CE only) |
| 5 | Evaluation (GLEU only) |

---

## Hyperparameter Reference

### v2 (`train_and_upgrade.py`)

```python
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05
TARGET_MODULES = ["q", "v", "k", "o", "wi_0", "wi_1", "wo"]

EPOCHS = 10
BATCH_SIZE = 2      # per device
GRAD_ACCUM = 32     # effective batch = 64
LR = 2e-4
WARMUP_RATIO = 0.10
LABEL_SMOOTHING = 0.1
MAX_INPUT_LEN = 128
MAX_TARGET_LEN = 128

LAMBDA_STYLE = 0.3
LAMBDA_SEMANTIC = 0.5
LAMBDA_HUMAN = 0.4  # GPU only
```

### v1 (`configs/training_config.yaml`)

```yaml
lora:
  r: 8
  lora_alpha: 16
  lora_dropout: 0.05
  target_modules: [q, v, k, o, wi_0, wi_1, wo]

training:
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 8   # effective batch = 32
  learning_rate: 3.0e-4
  lr_scheduler_type: cosine
  bf16: true

loss:
  lambda_style: 0.3
  lambda_semantic: 0.5
  lambda_human_pattern: 0.4
```

### `configs/inference_config.yaml`

```yaml
model:
  key: "flan-t5-small"
  checkpoint_path: "checkpoints/best_model_merged"
  use_lora: false

generation:
  num_beams: 5
  length_penalty: 1.2
  no_repeat_ngram_size: 3
  max_new_tokens: 128

vocabulary:
  semantic_threshold: 0.82
```

---

## Inference Pipeline (7 Steps)

```
Raw Text
   │
   ▼
1. Preprocessing ─────────────── LanguageTool spell correction + spaCy parsing
   │
   ▼
2. Style Fingerprinting ──────── Extract 41 features → MLP → 512-dim vector
   │
   ▼
3. Sentence-Chunked Generation ─ Split into 128-token chunks → Flan-T5 → rejoin
   │
   ▼
4. Post-Processing ───────────── Remove artifacts, replace em dashes, fix spacing
   │
   ▼
5. Vocabulary Elevation ──────── BERT fill-mask → AWL filtering → semantic gate (threshold 0.82)
   │
   ▼
6. Register Filtering ────────── Expand contractions, replace colloquialisms
   │
   ▼
7. Metrics ───────────────────── Style similarity, AWL coverage, readability scores
   │
   ▼
Corrected Text
```
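
Step 6 is the simplest to illustrate. A toy sketch (the word lists here are purely illustrative; the real mappings live in `src/vocabulary/register_filter.py`):

```python
# Toy register filter: expand contractions and swap colloquialisms for formal equivalents.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}
COLLOQUIAL = {"a lot of": "a great deal of", "kids": "children", "stuff": "material"}

def register_filter(text: str) -> str:
    for informal, formal in {**CONTRACTIONS, **COLLOQUIAL}.items():
        text = text.replace(informal, formal)
    return text

print(register_filter("The kids don't read a lot of stuff."))
# -> "The children do not read a great deal of material."
```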

---

## API Usage

```bash
# Start the server
PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000

# Correct text
curl -X POST http://localhost:8000/correct \
  -H "Content-Type: application/json" \
  -d '{"text": "The studnet recieved alot of informtion.", "style_alpha": 0.6}'

# Health check
curl http://localhost:8000/health
```

Interactive docs at `http://localhost:8000/docs`.
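
The same request from Python (the response fields follow the Pydantic models in `src/api/schemas.py`):

```python
import requests

resp = requests.post(
    "http://localhost:8000/correct",
    json={"text": "The studnet recieved alot of informtion.", "style_alpha": 0.6},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```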

---

## Hardware Requirements

| Tier | GPU | LoRA Config | Training Time |
|------|-----|-------------|---------------|
| **Tested (v1)** | RTX 3050 4GB | r=8, 5 epochs | ~45 min |
| **Tested (v2 CPU)** | None (HF Space CPU Basic) | r=16, 10 epochs | ~12–24 hours |
| Recommended | RTX 3090 24GB | r=16, 10 epochs + human-pattern loss | ~2–3 hours |
| Maximum | A100 80GB | Full pipeline with GPT-2 perplexity scoring | ~12 hours |

---

## Data Sources

| Dataset | Type | Size | Access |
|---------|------|------|--------|
| JFLEG (`jhu-clsp/jfleg`) | Fluency corrections (4 refs each) | ~5k pairs | HF Hub, no registration |
| W&I+LOCNESS (`bea2019st/wi_locness`) | Learner errors + corrections | ~34k pairs | HF Hub, no registration |
| FCE v2.1 | Learner errors + corrections | ~28k pairs | BEA-2019 (registration required) |
| Shanegerami AI_Human.csv | Human vs AI classification | ~50k samples | Kaggle |
| Starblasters8 data.parquet | Human vs AI classification | ~50k samples | Kaggle |
| Coxhead AWL | Academic Word List | 570 families / 549 headwords | Victoria University |

> Note: `train_and_upgrade.py` uses only JFLEG + W&I+LOCNESS (freely accessible via HF Hub). FCE and Kaggle datasets are used in the full manual pipeline only.

---

## Dyslexia Error Simulation

The `DyslexiaSimulator` generates synthetic training data based on research by Rello et al. (2013, 2017). v2 uses a 20% per-word error rate (up from 15%).

| Error Type | Frequency | Example |
|------------|-----------|---------|
| Phonetic substitution | 35% | "because" → "becaus" |
| Letter transposition | 18% | "the" → "teh" |
| Letter omission | 16% | "important" → "importnt" |
| Letter doubling | 12% | "letter" → "lettter" |
| Letter reversal (b/d, p/q) | 10% | "bad" → "dad" |
| Word boundary errors | 9% | "a lot" → "alot" |
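
As an illustration, each word is hit by one of the operators above with 20% probability. A toy sketch (the real confusion tables in `dyslexia_simulator.py` are research-derived and far richer):

```python
import random

def transpose(word: str) -> str:          # "the" -> "teh"
    if len(word) < 3:
        return word
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def omit(word: str) -> str:               # "important" -> "importnt"
    if len(word) < 4:
        return word
    i = random.randrange(1, len(word) - 1)
    return word[:i] + word[i + 1:]

def simulate(text: str, error_rate: float = 0.20) -> str:
    """Corrupt roughly `error_rate` of the words in `text`."""
    return " ".join(
        random.choice([transpose, omit])(w) if random.random() < error_rate else w
        for w in text.split()
    )
```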

---

## Style Fingerprint Vector

The 512-dimensional style vector captures 41 raw features:

| Group | Features | Count |
|-------|----------|-------|
| Sentence stats | mean, std, skew of sentence lengths | 3 |
| Word stats | mean, std of word lengths | 2 |
| Lexical | type-token ratio, lexical density | 2 |
| Syntactic | passive/active voice ratio, subordinate clause ratio, avg dependency tree depth | 4 |
| Discourse | 20 academic discourse markers (per 100 words) | 20 |
| Register | hedging frequency, formality score, nominalization ratio | 3 |
| Readability | Flesch reading ease, avg syllables per word | 2 |
| Pronouns | first-person ratio, third-person ratio | 2 |
| Other | question ratio, exclamation ratio, AWL coverage | 3 |

The raw features are projected through a 2-layer MLP (`41 → 256 → 512`) with LayerNorm and GELU activation, then L2-normalised.
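
A minimal sketch of that projection head (the exact layer ordering in `fingerprinter.py` may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleProjector(nn.Module):
    """Project 41 raw stylometric features into a 512-dim L2-normalised fingerprint."""

    def __init__(self, n_features: int = 41, dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_features, 256), nn.LayerNorm(256), nn.GELU(),
            nn.Linear(256, dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.mlp(features), p=2, dim=-1)
```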

---

## Known Limitations

1. **Model capacity**: Flan-T5-Small (77M params) has limited correction ability compared to larger models. Doubling the LoRA rank (r=8 → r=16) partially addresses this.
2. **Training window**: The 128-token max input means very long sentences may be split mid-clause.
3. **Vocabulary elevation**: BERT fill-mask can suggest semantically inappropriate AWL words; the 0.82 similarity threshold is a trade-off between coverage and accuracy.
4. **Already-correct text**: The model is trained on error→correction pairs; feeding it clean text produces unpredictable output.
5. **LanguageTool latency**: Spell correction takes ~15–20s on the first call due to JVM startup.
6. **Human-pattern loss on CPU**: The GPT-2 perplexity-based loss is skipped on CPU for performance. The full loss is only active on GPU.

<!-- 7. **Semantic drift in correction**: The pipeline can introduce meaning-level errors — dyslexic phonetic patterns misread by LanguageTool can produce plausible-but-wrong word substitutions. BERTScore F1 and WER (now primary evaluation signals in v2) help detect but don't eliminate this. A dedicated post-correction semantic faithfulness check remains a future improvement. -->