morpheuslord committed
Commit 6332b0b · verified · 1 parent: 07a0d00

Update README.md

Files changed (1): README.md (+599 −106)
README.md CHANGED
---
language:
- en
tags:
- text2text-generation
- dyslexia
- grammar-correction
- style-preservation
- lora
- flan-t5
license: mit
base_model: google/flan-t5-small
datasets:
- jhu-clsp/jfleg
- bea2019st/wi_locness
pipeline_tag: translation
---
18
 
19
+ # Dyslexia Academic Writing Correction System
20
 
21
> **A style-preserving, grammar-correcting, academic-vocabulary-elevating AI system that corrects dyslexic writing while maintaining the author's personal voice, tone, and authorship signal: a corrector, not a rewriter.**
22
 
23
+ ## Overview
24
 
25
+ This system takes text written by dyslexic students and corrects grammar, spelling, and fluency errors while:
26
 
27
+ 1. **Preserving the author's unique writing style** via a 512-dimensional style fingerprint vector
28
+ 2. **Elevating vocabulary to academic register** using Coxhead's Academic Word List (AWL) and BERT-based lexical substitution
29
+ 3. **Resisting AI detection** through a frozen Human Pattern Classifier that penalises AI-typical writing during training
30
+ 4. **Maintaining semantic meaning** with cosine-similarity-based semantic preservation loss
31
 
32
+ The core model is **Google Flan-T5-Small** fine-tuned with **LoRA** (Low-Rank Adaptation, r=16), trained on real learner error corpora (JFLEG, W&I+LOCNESS) augmented with synthetic dyslexia-simulated data.
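
For quick experimentation outside the packaged scripts, the adapter can be loaded on top of the base model with `peft`. A minimal sketch, assuming the adapter sits at the root of the `morpheuslord/rewrite` repo (as described in the v3 pipeline below) and that raw text is passed without a task prefix; the bundled `scripts/run_inference.py` remains the authoritative entry point, since it also applies spell correction, style conditioning, the faithfulness gate, and vocabulary elevation:

```python
# Minimal sketch: load the base model plus the LoRA adapter and correct one sentence.
import torch
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

BASE = "google/flan-t5-small"
ADAPTER = "morpheuslord/rewrite"  # adapter pushed to the repo root (see the v3 pipeline)

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSeq2SeqLM.from_pretrained(BASE)
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

text = "The studnet recieved alot of informtion."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        num_beams=5,               # generation settings mirror inference_config.yaml
        length_penalty=1.2,
        repetition_penalty=1.3,
        no_repeat_ngram_size=3,
        max_new_tokens=256,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```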
33
 
34
+ ---
35
 
36
+ ## Latest Evaluation Results (v3)
37
 
38
+ | Metric | Score | Description |
39
+ |--------|-------|-------------|
40
+ | **GLEU** | **0.7593** | Grammar + fluency correction quality |
41
+ | **BERTScore F1** | **0.9758** | Semantic closeness to reference corrections |
42
+ | **1 − WER** | **0.8552** | Word-level accuracy (WER = 14.48%) |
43
+ | **Composite** | **0.8634** | `(GLEU + BERTScore F1 + (1−WER)) / 3` — gating score for Hub push |
44
+ | **Faithfulness reverts** | **11** | Outputs whose cosine sim to input fell below 0.75 — reverted to source |
45
 
46
+ > The model is only pushed to the Hub when the composite score strictly beats the saved baseline from the previous run, ensuring the Hub always holds the best-seen weights.
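
A minimal sketch of that gating logic; the `baseline_score.json` filename comes from this README, but its schema (a single `composite` field) and the helper names are illustrative assumptions:

```python
# Sketch of the Hub-push gate: compute the composite score and push only if it
# strictly beats the stored baseline from the previous run.
import json
from pathlib import Path

def composite_score(gleu: float, bertscore_f1: float, wer: float) -> float:
    """Composite = (GLEU + BERTScore F1 + (1 - WER)) / 3."""
    return (gleu + bertscore_f1 + (1.0 - wer)) / 3.0

def should_push(new_score: float, baseline_path: str = "baseline_score.json") -> bool:
    path = Path(baseline_path)
    baseline = json.loads(path.read_text())["composite"] if path.exists() else float("-inf")
    return new_score > baseline  # strict improvement required

score = composite_score(gleu=0.7593, bertscore_f1=0.9758, wer=0.1448)
print(f"composite = {score:.4f}, push = {should_push(score)}")  # composite = 0.8634
```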
 
 
 
 
 
 
47
 
48
+ ### Score Progression
49
 
50
+ | Metric | v1 | v2 | v3 | Δ v2→v3 |
51
+ |--------|----|----|-----|---------|
52
+ | GLEU | — | 0.7506 | **0.7593** | +0.0087 |
53
+ | BERTScore F1 | — | 0.9733 | **0.9758** | +0.0025 |
54
+ | 1 − WER | — | 0.8488 | **0.8552** | +0.0064 |
55
+ | Composite | — | 0.8576 | **0.8634** | +0.0058 |
56
 
57
+ ---
 
 
58
 
59
+ ## What Changed in v3
60
 
61
+ v3 keeps the same base model and LoRA rank as v2 but improves every other stage of the pipeline: wider context window, better generation, a semantic faithfulness gate that prevents meaning-destroying corrections, and optional ERRANT F0.5 evaluation.
62
 
63
+ | Parameter | v2 | v3 |
64
+ |-----------|----|----|
65
+ | Context window | 128 tokens | **256 tokens** |
66
+ | Additional data | JFLEG + W&I only | **+ C4-200M-GEC (~100k pairs, falls back if unavailable)** |
67
+ | Beam search | `num_beams=2` | **`num_beams=5`, `length_penalty=1.2`, `repetition_penalty=1.3`, `no_repeat_ngram_size=3`** |
68
+ | Faithfulness gate | none | **cosine sim < 0.75 → revert output to source** |
69
+ | Human-pattern loss | skipped on CPU | **active on GPU** (loads classifier from Hub if present) |
70
+ | Evaluation cap | always 200 samples | **200 on CPU, full test set on GPU** |
71
+ | ERRANT F0.5 | not present | **optional metric** (install `errant` + `en_core_web_sm`) |
72
+ | Composite | mean(GLEU, BERTScore, 1-WER) | **mean(GLEU, BERTScore, 1-WER [, ERRANT F0.5 if available])** |
73
 
74
+ ### Semantic Faithfulness Gate (v3)
75
 
76
+ After generation, each output is checked against its source input using `all-MiniLM-L6-v2` sentence embeddings. If cosine similarity falls below **0.75**, the output is discarded and the original input is returned as fallback — preventing corrections that accidentally change meaning.
77
 
78
+ In the v3 evaluation run, **11 outputs** (of 228 test pairs evaluated) were reverted. Without the gate, those would have been incorrect predictions dragging all three metrics down.
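
A minimal sketch of the gate, assuming the `sentence-transformers` package; function and variable names are illustrative rather than the repo's actual API:

```python
# Sketch of the semantic faithfulness gate: revert any output whose embedding
# similarity to the source falls below 0.75.
from sentence_transformers import SentenceTransformer, util

_encoder = SentenceTransformer("all-MiniLM-L6-v2")
FAITHFULNESS_THRESHOLD = 0.75

def apply_faithfulness_gate(source: str, corrected: str) -> tuple[str, bool]:
    """Return (final_text, reverted). Falls back to the source if meaning drifted."""
    embeddings = _encoder.encode([source, corrected], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    if similarity < FAITHFULNESS_THRESHOLD:
        return source, True   # meaning drifted too far: keep the original text
    return corrected, False

text, reverted = apply_faithfulness_gate(
    "I writed the essay becaus it was due.",
    "I wrote the essay because it was due.",
)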
79
 
80
+ ### Combined Loss (v3 unchanged from v2 on CPU)
81
 
82
+ ```
83
+ L = L_CE + 0.3·L_style + 0.5·L_semantic (CPU)
84
+ L = L_CE + 0.3·L_style + 0.5·L_semantic + 0.4·L_human (GPU)
85
+ ```
86
 
87
+ | Term | Purpose | Weight |
88
+ |------|---------|--------|
89
+ | `L_CE` | Cross-entropy with label smoothing (0.1) | 1.0 |
90
+ | `L_style` | `1 − cos_sim(style(input), style(output))` | 0.3 |
91
+ | `L_semantic` | `1 − cos_sim(input_emb, output_emb)` | 0.5 |
92
+ | `L_human` | `1 − HumanPatternClassifier(output)` — anti-AI penalty | 0.4 *(GPU only)* |
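
A minimal PyTorch sketch of how these terms combine; tensor shapes and argument names are illustrative, and the repo's real implementation lives in `loss_functions.py`:

```python
# Sketch of the combined training loss. Style and sentence embeddings are assumed
# to be L2-normalised vectors from the style fingerprinter and a sentence encoder.
import torch
import torch.nn.functional as F

def combined_loss(
    ce_loss: torch.Tensor,          # cross-entropy (label smoothing 0.1) from the LM head
    style_src: torch.Tensor,        # (B, 512) style fingerprint of the input
    style_out: torch.Tensor,        # (B, 512) style fingerprint of the output
    sem_src: torch.Tensor,          # (B, D) sentence embedding of the input
    sem_out: torch.Tensor,          # (B, D) sentence embedding of the output
    human_score: torch.Tensor | None = None,  # (B,) classifier output, 1 = human-like
) -> torch.Tensor:
    style_loss = (1.0 - F.cosine_similarity(style_src, style_out, dim=-1)).mean()
    semantic_loss = (1.0 - F.cosine_similarity(sem_src, sem_out, dim=-1)).mean()
    loss = ce_loss + 0.3 * style_loss + 0.5 * semantic_loss
    if human_score is not None:               # GPU-only anti-AI term
        loss = loss + 0.4 * (1.0 - human_score).mean()
    return loss
```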
93
 
94
+ ---
95
 
96
+ ## What Changed in v2
97
+
98
+ The original model had a critical bug: `CorrectionTrainer.compute_loss()` only used cross-entropy loss. The multi-objective loss (`L_CE + λ_style + λ_semantic + λ_human`) was fully designed in `loss_functions.py` but was **never wired into the trainer**. v2 fixes this and upgrades several other parameters.
99
+
100
+ | Parameter | v1 (Original) | v2 (Upgraded) |
101
+ |-----------|--------------|---------------|
102
+ | LoRA rank | r=8, α=16 | **r=16, α=32** |
103
+ | Epochs | 5 | **10** |
104
+ | Effective batch size | 32 (4×8 accum) | **64 (2×32 accum)** |
105
+ | Learning rate | 3e-4 | **2e-4** |
106
+ | Warmup ratio | 5% | **10%** |
107
+ | Label smoothing | none | **0.1** |
108
+ | Loss function | CE only *(bug)* | **CE + Style + Semantic** *(fixed)* |
109
+ | Human-pattern loss | designed, unused | omitted on CPU; falls back to CE+style+sem |
110
+ | Evaluation | GLEU only | **GLEU + BERTScore F1 + (1−WER) composite** |
111
+ | Eval/save strategy | every 100 steps | **per epoch** |
112
+ | Early stopping | none | **patience=3** |
113
+ | Hub gate | none | **composite must beat saved baseline** |
114
+ | Warm-start strategy | cold start | **merge r=8 adapter → apply fresh r=16 LoRA** |
115
+ | Data split | 90%/10% train/val | **88%/7%/5% train/val/test** |
116
+ | Dyslexia augmentation error rate | 15% | **20%** |
117
 
118
+ ---
119
 
120
+ ## Features
121
+
122
+ | Feature | Description |
123
+ |---------|-------------|
124
+ | **Two-pass spell correction** | Dyslexia-aware phonetic pattern handling via LanguageTool |
125
+ | **Style fingerprinting** | 41 raw features → MLP → 512-dim L2-normalised style vector |
126
+ | **LoRA fine-tuning** | r=16, α=32, dropout=0.05 — targeting all attention + FFN projections |
127
+ | **Academic vocabulary elevation** | BERT fill-mask → AWL candidate filtering → semantic similarity gate |
128
+ | **Human pattern anti-AI loss** | Pre-trained frozen MLP classifier (17-dim features including GPT-2 perplexity) |
129
+ | **Combined training loss** | `L_CE + λ₁·L_style + λ₂·L_semantic (+ λ₃·L_human on GPU)` |
130
+ | **Semantic faithfulness gate** | Outputs with cosine sim < 0.75 to source are reverted — prevents meaning drift |
131
+ | **Sentence-chunked inference** | Long texts split into 256-token chunks matching training window |
132
+ | **FastAPI server** | RESTful `/correct` endpoint with CORS and rate limiting |
133
+ | **Multi-stage training** | Orchestrated via `train.sh` with checkpoint system (Skip/Redo/Continue) |
134
+ | **Synthetic data augmentation** | `DyslexiaSimulator` generates realistic errors from clean text (20% error rate) |
135
+ | **Composite score gating** | Hub push only if new model strictly beats saved baseline |
136
+
137
+ ---
138
+
139
+ ## Project Structure
140
+
141
+ ```
142
+ Rewriter/
143
+ ├── configs/
144
+ │ ├── training_config.yaml # Full training hyperparameters
145
+ │ ├── training_config_fast.yaml # Quick iteration config
146
+ │ ├── inference_config.yaml # Inference & generation settings
147
+ │ ├── model_config.yaml # Model architecture registry
148
+ │ └── awl_config.yaml # Academic Word List settings
149
+ ├── scripts/
150
+ │ ├── train.py # Main training script (Click CLI)
151
+ │ ├── evaluate.py # Test set evaluation (GLEU, ERRANT, BERTScore)
152
+ │ ├── run_inference.py # Interactive CLI inference
153
+ │ ├── preprocess_data.py # Raw datasets → unified JSONL
154
+ │ ├── pretrain_human_pattern_classifier.py # Stage 3: anti-AI classifier
155
+ │ ├── download_datasets.sh # BEA-2019 dataset downloader
156
+ │ └── download_kaggle_datasets.sh # Kaggle human/AI data downloader
157
+ ├── src/
158
+ │ ├── model/
159
+ │ │ ├── base_model.py # Model loader (T5/BART/Llama + LoRA + quantization)
160
+ │ │ ├── style_conditioner.py # Prefix tuning: style → virtual tokens
161
+ │ │ ├── generation_utils.py # Beam search, sampling, batch generation
162
+ │ │ └── lora_adapter.py # LoRA configuration helpers
163
+ │ ├── preprocessing/
164
+ │ │ ├── pipeline.py # Full preprocessing orchestrator
165
+ │ │ ├── spell_corrector.py # LanguageTool + dyslexia-aware correction
166
+ │ │ ├── dyslexia_simulator.py # Synthetic error generation (Rello et al.)
167
+ │ │ ├── dependency_parser.py # spaCy dependency tree analysis
168
+ │ │ ├── ner_tagger.py # Named entity protection
169
+ │ │ └── sentence_segmenter.py # Sentence boundary detection
170
+ │ ├── style/
171
+ │ │ ├── fingerprinter.py # 41 features → 512-dim style vector
172
+ │ │ ├── style_vector.py # Style vector dataclass
173
+ │ │ ├── formality_classifier.py # Rule-based formality scoring
174
+ │ │ └── emotion_classifier.py # Emotion detection
175
+ │ ├── training/
176
+ │ │ ├── dataset.py # Pre-tokenized cached dataset with style vectors
177
+ │ │ ├── trainer.py # CorrectionTrainer (HF Trainer + PEFT fixes)
178
+ │ │ ├── loss_functions.py # V1 and V2 combined losses
179
+ │ │ ├── human_pattern_extractor.py # 17-dim feature extraction + classifier
180
+ │ │ └── callbacks.py # Evaluation logging callbacks
181
+ │ ├── vocabulary/
182
+ │ │ ├── lexical_substitution.py # BERT fill-mask → AWL substitution pipeline
183
+ │ │ ├── awl_loader.py # Coxhead Academic Word List loader
184
+ │ │ └── register_filter.py # Contraction expansion + colloquial replacement
185
+ │ ├── inference/
186
+ │ │ ├── corrector.py # End-to-end inference pipeline orchestrator
187
+ │ │ └── postprocessor.py # Cleanup, entity restore, formatting
188
+ │ ├── evaluation/
189
+ │ │ ├── gleu_scorer.py # GLEU + BERTScore computation
190
+ │ │ ├── errant_evaluator.py # ERRANT P/R/F0.5 evaluation
191
+ │ │ ├── style_metrics.py # Style similarity + AWL coverage
192
+ │ │ └── authorship_verifier.py # AI detection resistance testing
193
+ │ └── api/
194
+ │ ├── main.py # FastAPI application
195
+ │ ├── schemas.py # Pydantic request/response models
196
+ │ └── middleware.py # Rate limiting + CORS
197
+ ├── train_and_upgrade.py # v3 upgrade pipeline (self-improving Hub push)
198
+ ├── data/
199
+ │ ├── raw/ # Original datasets (JFLEG, W&I+LOCNESS)
200
+ │ ├── processed/ # Unified JSONL (train/val/test splits)
201
+ │ ├── cache/ # Pre-tokenized dataset caches (.pt files)
202
+ │ └── awl/ # Coxhead Academic Word List
203
+ ├── train.sh # Multi-stage training orchestrator
204
+ ├── start.sh # Inference launcher (CLI or API mode)
205
+ ├── baseline_score.json # Saved composite score (0.8634) — gate for Hub push
206
+ ├── Dockerfile # Production container
207
+ ├── docker-compose.yml # Docker deployment
208
+ ├── requirements.txt # Python dependencies
209
+ └── pyproject.toml # Project metadata
210
+ ```
211
+
212
+ ---
213
 
214
+ ## Model Architecture
215
+
216
### Architecture Diagram (PNG)
![Architecture](arch.png)

### Architecture Diagram (Mermaid)
220
+ ```mermaid
221
+ graph TB
222
+ subgraph INFERENCE["🔮 Inference Pipeline"]
223
+ direction TB
224
+ INPUT["📝 Raw Dyslectic Text"]
225
+ subgraph PREPROCESS["Pre-Processing"]
226
+ SPELL["Spell Corrector<br/><i>dyslexia-aware phonetic</i>"]
227
+ SENT_SEG["Sentence Segmenter"]
228
+ DEP_PARSE["Dependency Parser"]
229
+ NER["NER Tagger"]
230
+ end
231
+ subgraph STYLE["Style Analysis"]
232
+ FINGER["Style Fingerprinter<br/><i>512-dim vector</i>"]
233
+ EMOTION["Emotion Classifier"]
234
+ FORMALITY["Formality Classifier"]
235
+ STYLE_VEC["Style Vector Composer"]
236
+ end
237
+ subgraph GENERATION["Core Generation"]
238
+ STYLE_COND["Style Conditioner<br/><i>prefix tuning</i>"]
239
+ BASE_MODEL["Base LM<br/><i>Flan-T5-Small (warm-merged)</i>"]
240
+ LORA["LoRA Adapter<br/><i>r=16</i>"]
241
+ GEN_UTILS["Generation Utils<br/><i>beam search, sampling</i>"]
242
+ end
243
+ subgraph POSTPROCESS["Post-Processing"]
244
+ FAITH["Faithfulness Gate<br/><i>cos sim &lt; 0.75 → revert</i>"]
245
+ POSTPROC["Post-Processor<br/><i>formatting, cleanup</i>"]
246
+ VOCAB_SUB["Lexical Substitution<br/><i>BERT-based</i>"]
247
+ AWL["AWL Loader<br/><i>Coxhead Academic Word List</i>"]
248
+ REG_FILTER["Register Filter<br/><i>academic tone gate</i>"]
249
+ end
250
+ OUTPUT["✅ Corrected Academic Text"]
251
+ INPUT --> SPELL --> SENT_SEG --> DEP_PARSE --> NER
252
+ INPUT --> FINGER --> EMOTION --> FORMALITY --> STYLE_VEC
253
+ NER --> STYLE_COND
254
+ STYLE_VEC --> STYLE_COND
255
+ STYLE_COND --> BASE_MODEL
256
+ LORA -.->|"merged weights"| BASE_MODEL
257
+ BASE_MODEL --> GEN_UTILS --> FAITH --> POSTPROC
258
+ POSTPROC --> VOCAB_SUB
259
+ AWL --> VOCAB_SUB
260
+ VOCAB_SUB --> REG_FILTER --> OUTPUT
261
+ end
262
+
263
+ subgraph TRAINING["🏋️ Training Pipeline (v3)"]
264
+ direction TB
265
+ subgraph WARMSTART["Warm-Start Merge"]
266
+ HUB_ADAPTER["Hub LoRA Adapter<br/><i>r=16 (v2)</i>"]
267
+ MERGE["merge_and_unload()"]
268
+ FRESH_LORA["Fresh LoRA r=16"]
269
+ end
270
+ subgraph DATA["Data Pipeline"]
271
+ JFLEG["jhu-clsp/jfleg<br/><i>~5k pairs, 4 refs each</i>"]
272
+ WILOCNESS["bea2019st/wi_locness<br/><i>~34k pairs</i>"]
273
+ C4GEC["C4-200M-GEC<br/><i>~100k pairs (optional)</i>"]
274
+ DYSLEXIA_AUG["DyslexiaSimulator<br/><i>20% error rate augmentation</i>"]
275
+ SPLIT["88% train / 7% val / 5% test"]
276
+ end
277
+ subgraph LOSS["Combined Loss (v3)"]
278
+ L_CE["L_CE + label_smoothing=0.1"]
279
+ L_STYLE["0.3 · L_style"]
280
+ L_SEM["0.5 · L_semantic"]
281
+ L_HUMAN["0.4 · L_human<br/><i>(GPU only)</i>"]
282
+ end
283
+ subgraph EVAL["Composite Evaluation"]
284
+ GLEU_E["GLEU"]
285
+ BERT_E["BERTScore F1"]
286
+ WER_E["1 − WER"]
287
+ ERRANT_E["ERRANT F0.5<br/><i>(optional)</i>"]
288
+ COMPOSITE["Composite = mean(3 or 4)"]
289
+ GATE["Beat baseline?"]
290
+ HUB_PUSH["Push to Hub ✅"]
291
+ end
292
+ HUB_ADAPTER --> MERGE --> FRESH_LORA
293
+ JFLEG --> DYSLEXIA_AUG
294
+ WILOCNESS --> DYSLEXIA_AUG
295
+ C4GEC --> DYSLEXIA_AUG
296
+ DYSLEXIA_AUG --> SPLIT
297
+ L_CE --> COMPOSITE
298
+ L_STYLE --> COMPOSITE
299
+ L_SEM --> COMPOSITE
300
+ GLEU_E --> COMPOSITE
301
+ BERT_E --> COMPOSITE
302
+ WER_E --> COMPOSITE
303
+ ERRANT_E -.->|"if installed"| COMPOSITE
304
+ COMPOSITE --> GATE --> HUB_PUSH
305
+ end
306
+ ```
307
 
308
+ ---
309
 
310
+ ## Design Choices & Rationale
311
 
312
+ ### Why Flan-T5-Small?
313
 
314
+ | Consideration | Decision |
315
+ |---------------|----------|
316
+ | **Hardware constraint** | RTX 3050 Laptop GPU (4GB VRAM) — rules out models > 500M params |
317
+ | **Architecture** | Encoder-decoder (seq2seq) is ideal for text-to-text correction tasks |
318
+ | **Instruction tuning** | Flan-T5 is pre-trained on 1,800+ instruction tasks — follows correction prompts naturally |
319
+ | **LoRA efficiency** | Trainable params scale with r: r=16 → ~2.56M (3.3%) — still fits in 4GB |
320
 
321
+ ### Why LoRA over Full Fine-Tuning?
322
 
323
+ - **Memory**: Full fine-tuning of T5-Small requires ~2.5GB for gradients alone; LoRA r=16 needs ~400MB
324
+ - **Warm-start safety**: Merging r=8 weights preserves corrections before expanding capacity to r=16
325
+ - **Merging**: LoRA weights merge into base model at inference time — zero latency overhead
326
+ - **Configuration**: `r=16, alpha=32, dropout=0.05`, targeting all attention + FFN projections (`q, k, v, o, wi_0, wi_1, wo`)
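
A minimal `peft` sketch of that configuration (the `SEQ_2_SEQ_LM` task type is an assumption about how the repo wires it up):

```python
# Sketch of the LoRA setup described above, using the peft library.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "k", "v", "o", "wi_0", "wi_1", "wo"],  # attention + FFN projections
)

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # roughly the ~2.5M / ~3% figure quoted above
```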
327
 
328
+ ### Why a Combined Multi-Objective Loss?
329
 
330
+ The system uses (on CPU): `L = L_CE + 0.3·L_style + 0.5·L_semantic`
331
 
332
+ On GPU (with human-pattern classifier available): `L = L_CE + 0.3·L_style + 0.5·L_semantic + 0.4·L_human`
333
 
334
+ | Term | Purpose | Weight |
335
+ |------|---------|--------|
336
+ | `L_CE` | Cross-entropy with label smoothing (0.1) | 1.0 |
337
+ | `L_style` | `1 − cos_sim(style(input), style(output))` — preserves writing fingerprint | 0.3 |
338
+ | `L_semantic` | `1 − cos_sim(input_emb, output_emb)` — preserves meaning | 0.5 |
339
+ | `L_human` | `1 − HumanPatternClassifier(output)` — penalises AI-like text patterns | 0.4 |
340
 
341
+ ### Why a Semantic Faithfulness Gate?
342
 
343
Even a well-trained correction model can occasionally produce outputs that drift semantically from the input, particularly when a dyslexic spelling is ambiguous (e.g. "ther" could be "there" or "their"). Rather than accepting every model output blindly, v3 computes cosine similarity between the source and the output using `all-MiniLM-L6-v2` sentence embeddings. Outputs below **0.75 similarity** are treated as unreliable, and the original input is returned unchanged. This is conservative by design: an awkward but faithful source sentence is better than a fluent but wrong correction.
344
 
345
+ ### Why a Human Pattern Classifier?
346
 
347
+ AI-generated text has detectable statistical signatures:
348
+ - **Lower GPT-2 perplexity** (AI text is more "predictable")
349
+ - **Lower burstiness** (AI has uniform sentence lengths; humans vary)
350
+ - **Higher AI marker density** (overuse of "delve", "leverage", "furthermore")
351
+ - **Lower n-gram novelty** (AI reuses phrases more)
352
 
353
+ The classifier is a 3-layer MLP (17→128→64→1) pre-trained on ~100k samples from two Kaggle datasets (Shanegerami AI_Human.csv + Starblasters8), then **frozen** during main training. Its output score (0=AI, 1=human) is used as a reward signal. Requires GPU for GPT-2 perplexity scoring; falls back gracefully on CPU.
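
A minimal sketch of a classifier with that shape; the dropout value is an assumption, and the 17-feature extraction and pre-training loop are not shown:

```python
# Sketch of the 17 -> 128 -> 64 -> 1 human-pattern classifier. It is trained once
# on human-vs-AI data, then frozen and used only as a reward signal.
import torch
import torch.nn as nn

class HumanPatternClassifier(nn.Module):
    def __init__(self, n_features: int = 17):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),   # 0 = AI-like, 1 = human-like
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)

clf = HumanPatternClassifier()
clf.requires_grad_(False)                 # frozen during main training
scores = clf(torch.randn(4, 17))          # one human-likeness score per output text
```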
354
 
355
+ ### Why Sentence-Chunked Inference?
356
 
357
+ The model is trained with `max_input_length=256` tokens. The task prefix alone consumes ~40 tokens, leaving ~216 tokens for actual text. Long inputs are:
358
 
359
+ 1. Split into sentences using spaCy
360
+ 2. Grouped into chunks that fit the 256-token budget
361
+ 3. Each chunk is corrected independently
362
+ 4. Results are joined back together
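
A minimal sketch of the chunking described above, assuming spaCy's `en_core_web_sm` for sentence splitting and the Flan-T5 tokenizer for length budgeting; `correct_chunk` is a placeholder for the model call:

```python
# Sketch of sentence-chunked inference: pack whole sentences into chunks that
# respect the 256-token training window, correct each chunk, then rejoin.
import spacy
from transformers import AutoTokenizer

nlp = spacy.load("en_core_web_sm")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
MAX_TOKENS = 256 - 40  # leave room for the ~40-token task prefix

def chunk_text(text: str) -> list[str]:
    chunks, current, current_len = [], [], 0
    for sent in nlp(text).sents:
        n_tokens = len(tokenizer.encode(sent.text, add_special_tokens=False))
        if current and current_len + n_tokens > MAX_TOKENS:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent.text)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks

def correct_long_text(text: str, correct_chunk) -> str:
    # correct_chunk(chunk) would call the model on a single chunk
    return " ".join(correct_chunk(chunk) for chunk in chunk_text(text))
```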
363
 
364
+ ### Why Post-Generation Vocabulary Elevation?
365
 
366
+ Rather than relying solely on the model to produce academic vocabulary (which T5-Small lacks the capacity for), a separate BERT-based lexical substitution pipeline is applied:
367
 
368
+ 1. POS-tag the output with spaCy
369
+ 2. Identify non-AWL content words (nouns, verbs, adjectives, adverbs)
370
+ 3. Mask each candidate → run BERT fill-mask → filter to AWL-only predictions
371
+ 4. Accept substitution only if `semantic_similarity > 0.82` (measured with `all-mpnet-base-v2`)
372
+ 5. Track used substitutions to prevent duplicate replacements
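
A minimal sketch of a single substitution step, assuming `bert-base-uncased` for fill-mask and `all-mpnet-base-v2` for the similarity gate; the tiny inline AWL set stands in for the real word list:

```python
# Sketch of AWL-constrained lexical substitution for one candidate word.
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
similarity_model = SentenceTransformer("all-mpnet-base-v2")
AWL = {"demonstrate", "significant", "establish", "obtain"}  # stand-in for the full list
SEMANTIC_THRESHOLD = 0.82

def elevate_word(sentence: str, word: str) -> str:
    masked = sentence.replace(word, fill_mask.tokenizer.mask_token, 1)
    for prediction in fill_mask(masked, top_k=20):
        candidate = prediction["token_str"].strip()
        if candidate.lower() not in AWL or candidate.lower() == word.lower():
            continue
        new_sentence = sentence.replace(word, candidate, 1)
        embeddings = similarity_model.encode([sentence, new_sentence], convert_to_tensor=True)
        if util.cos_sim(embeddings[0], embeddings[1]).item() > SEMANTIC_THRESHOLD:
            return new_sentence  # accept the first AWL candidate that preserves meaning
    return sentence              # no acceptable substitution found

print(elevate_word("The results show a big difference.", "show"))
```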
373
 
374
+ ---
375
 
376
+ ## Quick Start
377
 
378
+ ### Prerequisites
379
 
380
+ - Python 3.10
381
+ - NVIDIA GPU with ≥ 4GB VRAM (or CPU, slower)
382
+ - ~10GB disk space for models and datasets
383
 
384
+ ### Option A: Self-Improving Upgrade Pipeline (v3)
385
 
386
+ This pipeline loads the existing Hub adapter, upgrades it, evaluates, and only pushes if it improves.
387
 
388
+ ```bash
389
+ git clone https://huggingface.co/morpheuslord/rewrite && cd rewrite
390
+ pip install -r requirements.txt
391
 
392
+ export HF_TOKEN="your-hf-token-with-write-access"
393
+ python train_and_upgrade.py
394
+ ```
395
 
396
+ The pipeline handles all 10 steps automatically:
397
+ **Load adapter → Warm-start merge → Apply r=16 LoRA → Load data → Train → Evaluate → Gate → Save → Merge → Push**
398
 
399
+ ### Option B: Manual Step-by-Step (original pipeline)
400
 
401
+ ```bash
402
+ # 1. Install dependencies
403
+ pip install -r requirements.txt
404
+ python -m spacy download en_core_web_sm
405
 
406
+ # 2. Preprocess datasets (FCE, W&I+LOCNESS, JFLEG → unified JSONL)
407
+ python scripts/preprocess_data.py
408
 
409
+ # 3. Pre-train the human pattern classifier
410
+ python scripts/pretrain_human_pattern_classifier.py
411
 
412
+ # 4. Train the correction model
413
+ PYTHONPATH=. python scripts/train.py --config configs/training_config.yaml --use-v2-loss
414
 
415
+ # 5. Merge LoRA adapter into base model for inference
416
+ python -c "
417
+ from peft import PeftModel
418
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
419
+ import torch
420
+ model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-small', torch_dtype=torch.bfloat16)
421
+ model = PeftModel.from_pretrained(model, 'checkpoints/checkpoint-BEST')
422
+ model = model.merge_and_unload()
423
+ model.save_pretrained('checkpoints/best_model_merged')
424
+ AutoTokenizer.from_pretrained('google/flan-t5-small').save_pretrained('checkpoints/best_model_merged')
425
+ "
426
 
427
+ # 6. Run inference
428
+ PYTHONPATH=. python scripts/run_inference.py --text "The studnet recieved alot of informtion."
429
 
430
+ # 7. Or start the API server
431
+ PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000
432
+ ```
433
 
434
+ ---
435
 
436
+ ## Training Pipeline
437
+
438
+ ### v3 Upgrade Pipeline (`train_and_upgrade.py`) — 10 Steps
439
+
440
+ | Step | Action |
441
+ |------|--------|
442
+ | 1 | Load existing LoRA adapter (r=16, v2) from Hub |
443
+ | 2 | Merge into base weights (`merge_and_unload`) — warm start |
444
+ | 3 | Apply fresh LoRA r=16 on merged base |
445
+ | 4 | Load JFLEG + W&I+LOCNESS + C4-GEC (optional); augment with DyslexiaSimulator (20% error rate) |
446
+ | 5 | Train with combined loss for 10 epochs, early stopping patience=3 |
447
+ | 6 | Evaluate on test set: GLEU + BERTScore F1 + (1−WER) [+ ERRANT F0.5 if installed] |
448
+ | 7 | Apply semantic faithfulness gate — revert outputs with cosine sim < 0.75 to source |
449
+ | 8 | Compare composite score against `baseline_score.json` |
450
+ | 9 | If improved: merge adapter → save full model |
451
+ | 10 | Push adapter (repo root) + merged model (`merged/` subfolder) to Hub; update baseline |
452
+
453
+ ### v2 Upgrade Pipeline — 10 Steps
454
+
455
+ | Step | Action |
456
+ |------|--------|
457
+ | 1 | Load existing LoRA adapter (r=8) from Hub |
458
+ | 2 | Merge into base weights (`merge_and_unload`) — warm start |
459
+ | 3 | Apply fresh LoRA r=16 on merged base |
460
+ | 4 | Load JFLEG + W&I+LOCNESS; augment with DyslexiaSimulator (20% error rate) |
461
+ | 5 | Train with combined loss for 10 epochs, early stopping patience=3 |
462
+ | 6 | Evaluate on test set: GLEU + BERTScore F1 + (1−WER) |
463
+ | 7 | Compare composite score against `baseline_score.json` |
464
+ | 8 | If improved: save LoRA adapter |
465
+ | 9 | Merge adapter → save full model |
466
+ | 10 | Push adapter + merged model to Hub; update baseline |
467
+
468
+ ### v1 Original Pipeline (`train.sh`) — 5 Stages
469
+
470
+ | Stage | Action |
471
+ |-------|--------|
472
+ | 1 | Setup & Dependencies |
473
+ | 2 | Data Preprocessing (FCE + W&I+LOCNESS + JFLEG → JSONL) |
474
+ | 3 | Human Pattern Classifier Pre-Training |
475
+ | 4 | Main Model Training (LoRA r=8, 5 epochs, CE only) |
476
+ | 5 | Evaluation (GLEU only) |
477
 
478
+ ---
479
 
480
+ ## Hyperparameter Reference
481
+
482
+ ### v3 (`train_and_upgrade.py`)
483
+
484
+ ```python
485
+ LORA_R = 16
486
+ LORA_ALPHA = 32
487
+ LORA_DROPOUT = 0.05
488
+ TARGET_MODULES = ["q", "v", "k", "o", "wi_0", "wi_1", "wo"]
489
+
490
+ EPOCHS = 10
491
+ BATCH_SIZE = 2 # per device (CPU); 8 on GPU
492
+ GRAD_ACCUM = 32 # effective batch = 64
493
+ LR = 2e-4
494
+ WARMUP_RATIO = 0.10
495
+ LABEL_SMOOTHING = 0.1
496
+ MAX_INPUT_LEN = 256 # up from 128 in v2
497
+ MAX_TARGET_LEN = 256
498
+
499
+ LAMBDA_STYLE = 0.3
500
+ LAMBDA_SEMANTIC = 0.5
501
+ LAMBDA_HUMAN = 0.4 # GPU only
502
+
503
+ FAITHFULNESS_THRESHOLD = 0.75 # new in v3
504
+ ```
505
+
506
+ ### v2 (`train_and_upgrade.py`)
507
+
508
+ ```python
509
+ LORA_R = 16
510
+ LORA_ALPHA = 32
511
+ LORA_DROPOUT = 0.05
512
+ TARGET_MODULES = ["q", "v", "k", "o", "wi_0", "wi_1", "wo"]
513
+
514
+ EPOCHS = 10
515
+ BATCH_SIZE = 2
516
+ GRAD_ACCUM = 32 # effective batch = 64
517
+ LR = 2e-4
518
+ WARMUP_RATIO = 0.10
519
+ LABEL_SMOOTHING = 0.1
520
+ MAX_INPUT_LEN = 128
521
+ MAX_TARGET_LEN = 128
522
+
523
+ LAMBDA_STYLE = 0.3
524
+ LAMBDA_SEMANTIC = 0.5
525
+ LAMBDA_HUMAN = 0.4 # GPU only
526
+ ```
527
+
528
+ ### v1 (`configs/training_config.yaml`)
529
+
530
```yaml
lora:
  r: 8
  lora_alpha: 16
  lora_dropout: 0.05
  target_modules: [q, v, k, o, wi_0, wi_1, wo]

training:
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 8   # effective batch = 32
  learning_rate: 3.0e-4
  lr_scheduler_type: cosine
  bf16: true

loss:
  lambda_style: 0.3
  lambda_semantic: 0.5
  lambda_human_pattern: 0.4
```
549
+
550
+ ### `configs/inference_config.yaml`
551
+
552
```yaml
model:
  key: "flan-t5-small"
  checkpoint_path: "checkpoints/best_model_merged"
  use_lora: false

generation:
  num_beams: 5
  length_penalty: 1.2
  repetition_penalty: 1.3
  no_repeat_ngram_size: 3
  max_new_tokens: 256

vocabulary:
  semantic_threshold: 0.82

faithfulness:
  threshold: 0.75
```
571
 
572
+ ---
573
 
574
+ ## Inference Pipeline (8 Steps)
575
+
576
```
Raw Text
   │
   ▼
1. Preprocessing ─────── LanguageTool spell correction + spaCy parsing
   │
   ▼
2. Style Fingerprinting ─ Extract 41 features → MLP → 512-dim vector
   │
   ▼
3. Sentence-Chunked Generation ─ Split into 256-token chunks → Flan-T5 → rejoin
   │
   ▼
4. Faithfulness Gate ──── cosine_sim(source, output) < 0.75 → revert to source [NEW v3]
   │
   ▼
5. Post-Processing ───── Remove artifacts, replace em dashes, fix spacing
   │
   ▼
6. Vocabulary Elevation ─ BERT fill-mask → AWL filtering → semantic gate (threshold 0.82)
   │
   ▼
7. Register Filtering ── Expand contractions, replace colloquialisms
   │
   ▼
8. Metrics ──────────── Style similarity, AWL coverage, readability scores
   │
   ▼
Corrected Text
```
606
 
607
+ ---
608
 
609
+ ## API Usage
610
 
611
+ ```bash
612
+ # Start the server
613
+ PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000
614
 
615
+ # Correct text
616
+ curl -X POST http://localhost:8000/correct \
617
+ -H "Content-Type: application/json" \
618
+ -d '{"text": "The studnet recieved alot of informtion.", "style_alpha": 0.6}'
619
 
620
+ # Health check
621
+ curl http://localhost:8000/health
622
+ ```
623
 
624
+ Interactive docs at `http://localhost:8000/docs`.
625
 
626
+ ---
627
 
628
+ ## Hardware Requirements
629
 
630
+ | Tier | GPU | LoRA Config | Epochs | Training Time |
631
+ |------|-----|-------------|--------|---------------|
632
+ | **Tested (v1)** | RTX 3050 4GB | r=8 | 5 | ~45 min |
633
+ | **Tested (v2 CPU)** | None (HF Space CPU Basic) | r=16 | 10 | ~12–24 hours |
634
+ | **Tested (v3 CPU)** | None (HF Space CPU Basic) | r=16 | 10 | ~12–24 hours |
635
+ | Recommended | RTX 3090 24GB | r=16 + human-pattern loss | 10 | ~2–3h |
636
+ | Maximum | A100 80GB | Full pipeline with GPT-2 perplexity + ERRANT | 10 | ~12h |
637
 
638
+ ---
639
 
640
+ ## Data Sources
641
 
642
+ | Dataset | Type | Size | Access |
643
+ |---------|------|------|--------|
644
+ | JFLEG (`jhu-clsp/jfleg`) | Fluency corrections (4 refs each) | ~5k pairs | HF Hub, no registration |
645
+ | W&I+LOCNESS (`bea2019st/wi_locness`) | Learner errors + corrections | ~34k pairs | HF Hub, no registration |
646
+ | C4-200M-GEC (`cointegrated/c4_200m-gec-filtered`) | Synthetic GEC pairs | ~100k pairs (capped) | HF Hub, no registration — *falls back silently if unavailable* |
647
+ | FCE v2.1 | Learner errors + corrections | ~28k pairs | BEA-2019 (registration required) |
648
+ | Shanegerami AI_Human.csv | Human vs AI classification | ~50k samples | Kaggle |
649
+ | Starblasters8 data.parquet | Human vs AI classification | ~50k samples | Kaggle |
650
+ | Coxhead AWL | Academic Word List | 570 families / 549 headwords | Victoria University |
651
 
652
+ > Note: `train_and_upgrade.py` uses JFLEG + W&I+LOCNESS + C4-GEC (freely accessible via HF Hub). FCE and Kaggle datasets are used in the full manual pipeline only.
653
 
654
+ ---
655
 
656
+ ## Dyslexia Error Simulation
657
 
658
+ The `DyslexiaSimulator` generates synthetic training data based on research by Rello et al. (2013, 2017). v2+ uses a 20% per-word error rate (up from 15% in v1).
659
 
660
+ | Error Type | Frequency | Example |
661
+ |-----------|-----------|---------|
662
+ | Phonetic substitution | 35% | "because" → "becaus" |
663
+ | Letter transposition | 18% | "the" → "teh" |
664
+ | Letter omission | 16% | "important" → "importnt" |
665
+ | Letter doubling | 12% | "letter" → "lettter" |
666
+ | Letter reversal (b/d, p/q) | 10% | "bad" → "dad" |
667
+ | Word boundary errors | 9% | "a lot" → "alot" |
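
A minimal sketch of the augmentation idea; the phonetic table and helper below are truncated illustrations, not the simulator's actual rule set or frequencies:

```python
# Sketch of dyslexia-style error injection: corrupt ~20% of words using a couple of
# the error types listed above to create synthetic (noisy, clean) training pairs.
import random

PHONETIC = {"because": "becaus", "right": "rite", "does": "dose"}  # tiny illustrative table

def transpose(word: str) -> str:
    """Swap two adjacent interior letters, e.g. 'the' -> 'teh' for longer words."""
    if len(word) < 4:
        return word
    i = random.randrange(1, len(word) - 2)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def corrupt(text: str, error_rate: float = 0.20, seed: int = 0) -> str:
    random.seed(seed)
    words = []
    for word in text.split():
        if random.random() < error_rate:
            word = PHONETIC.get(word.lower(), transpose(word))
        words.append(word)
    return " ".join(words)

clean = "The essay was late because the student lost the right notes."
print(corrupt(clean))  # noisy input; `clean` is kept as the training target
```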
668
 
669
+ ---
670
 
671
+ ## Style Fingerprint Vector
672
 
673
+ The 512-dimensional style vector captures 41 raw features:
674
 
675
+ | Group | Features | Count |
676
+ |-------|----------|-------|
677
+ | Sentence stats | mean, std, skew of sentence lengths | 3 |
678
+ | Word stats | mean, std of word lengths | 2 |
679
+ | Lexical | type-token ratio, lexical density | 2 |
680
+ | Syntactic | passive/active voice ratio, subordinate clause ratio, avg dependency tree depth | 4 |
681
+ | Discourse | 20 academic discourse markers (per 100 words) | 20 |
682
+ | Register | hedging frequency, formality score, nominalization ratio | 3 |
683
+ | Readability | Flesch reading ease, avg syllables per word | 2 |
684
+ | Pronouns | first-person ratio, third-person ratio | 2 |
685
+ | Other | question ratio, exclamation ratio, AWL coverage | 3 |
686
 
687
+ Projected through a 2-layer MLP (`41 → 256 → 512`) with LayerNorm and GELU activation, then L2-normalised.
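
A minimal PyTorch sketch of that projection head; the extraction of the 41 raw feature statistics themselves is not shown:

```python
# Sketch of the style projection head: 41 raw features -> 256 -> 512, with
# LayerNorm + GELU, then L2 normalisation so cosine similarity is meaningful.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleProjector(nn.Module):
    def __init__(self, n_features: int = 41, hidden: int = 256, out_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.LayerNorm(hidden),
            nn.GELU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.mlp(features), p=2, dim=-1)  # unit-length style vector

style_vector = StyleProjector()(torch.randn(1, 41))  # shape (1, 512), L2 norm == 1
```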
688
 
689
+ ---
690
 
691
+ ## Known Limitations
 
692
 
693
+ 1. **Model capacity**: Flan-T5-Small (77M params) has limited correction ability compared to larger models. Doubling LoRA rank (r=8 → r=16) partially addresses this.
694
+ 2. **Training window**: 256-token max input (up from 128 in v1/v2) — very long paragraphs may still be split mid-clause.
695
+ 3. **Vocabulary elevation**: BERT fill-mask can suggest semantically inappropriate AWL words; the 0.82 similarity threshold is a trade-off between coverage and accuracy.
696
+ 4. **Already-correct text**: The model is trained on error→correction pairs; feeding it clean text produces unpredictable output.
697
+ 5. **LanguageTool latency**: Spell correction takes ~15–20s due to JVM startup on first call.
698
+ 6. **Human-pattern loss on CPU**: The GPT-2 perplexity-based loss is skipped on CPU for performance. Full loss is only active on GPU.
699
+ 7. **Faithfulness gate conservatism**: The 0.75 cosine similarity threshold occasionally reverts valid-but-heavily-corrected outputs. Outputs flagged as reverts are logged — monitor `num_fallback` in evaluation to tune the threshold.