morpheuslord committed on
Commit
cd777c7
·
verified ·
1 Parent(s): 62a2bda

Auto-upgrade v3: composite 0.8634 | GLEU 0.7593 | BERTScore 0.9758 | 1-WER 0.8552 | r=16, 256-token ctx, C4-GEC data, faithfulness gate

Browse files
Files changed (4)
  1. README.md +106 -522
  2. adapter_config.json +3 -3
  3. adapter_model.safetensors +1 -1
  4. tokenizer.json +1 -1
README.md CHANGED
@@ -1,622 +1,206 @@
1
  ---
2
- language:
3
- - en
4
  tags:
5
- - text2text-generation
6
- - dyslexia
7
- - grammar-correction
8
- - style-preservation
9
  - lora
10
- - flan-t5
11
- license: mit
12
- base_model: google/flan-t5-small
13
- datasets:
14
- - jhu-clsp/jfleg
15
- - bea2019st/wi_locness
16
- pipeline_tag: translation
17
  ---
18
 
19
- # Dyslexia Academic Writing Correction System
20
 
21
- > **A style-preserving, grammar-correcting, academic-vocabulary-elevating AI system that corrects dyslexic writing while maintaining the author's personal voice, tone, and authorship signal — not a rewriter, a corrector.**
22
 
23
- ## Overview
24
 
25
- This system takes text written by dyslexic students and corrects grammar, spelling, and fluency errors while:
26
 
27
- 1. **Preserving the author's unique writing style** via a 512-dimensional style fingerprint vector
28
- 2. **Elevating vocabulary to academic register** using Coxhead's Academic Word List (AWL) and BERT-based lexical substitution
29
- 3. **Resisting AI detection** through a frozen Human Pattern Classifier that penalises AI-typical writing during training
30
- 4. **Maintaining semantic meaning** with cosine-similarity-based semantic preservation loss
31
 
32
- The core model is **Google Flan-T5-Small** fine-tuned with **LoRA** (Low-Rank Adaptation, r=16), trained on real learner error corpora (JFLEG, W&I+LOCNESS) augmented with synthetic dyslexia-simulated data.
33
 
34
- ---
35
 
36
- ## Latest Evaluation Results (v2)
37
 
38
- | Metric | Score | Description |
39
- |--------|-------|-------------|
40
- | **GLEU** | **0.7506** | Grammar + fluency correction quality |
41
- | **BERTScore F1** | **0.9733** | Semantic closeness to reference corrections |
42
- | **1 − WER** | **0.8488** | Word-level accuracy (WER = 15.12%) |
43
- | **Composite** | **0.8576** | `(GLEU + BERTScore F1 + (1−WER)) / 3` — gating score for Hub push |
44
 
45
- > The model is only pushed to the Hub when the composite score strictly beats the saved baseline from the previous run, ensuring the Hub always holds the best-seen weights.
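For illustration, a minimal sketch of the composite-and-gate logic (assuming a `composite` key inside `baseline_score.json`; the actual code in `train_and_upgrade.py` may differ in detail):

```python
# Sketch only: composite score and Hub-push gate, not the repository's exact code.
import json
from pathlib import Path

def composite(gleu: float, bertscore_f1: float, wer: float) -> float:
    """Composite = (GLEU + BERTScore F1 + (1 - WER)) / 3."""
    return (gleu + bertscore_f1 + (1.0 - wer)) / 3.0

def beats_baseline(score: float, baseline_path: str = "baseline_score.json") -> bool:
    """Push to the Hub only if the new composite strictly beats the saved baseline."""
    path = Path(baseline_path)
    if not path.exists():
        return True  # no baseline yet: the first run always qualifies
    baseline = json.loads(path.read_text()).get("composite", 0.0)
    return score > baseline

# v2 numbers from the table above: composite ≈ 0.8576
new_score = composite(gleu=0.7506, bertscore_f1=0.9733, wer=0.1512)
print(f"composite={new_score:.4f} push={beats_baseline(new_score)}")
```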
46
 
47
- ---
48
 
49
- ## What Changed in v2
50
 
51
- The original model had a critical bug: `CorrectionTrainer.compute_loss()` only used cross-entropy loss. The multi-objective loss (`L_CE + λ_style + λ_semantic + λ_human`) was fully designed in `loss_functions.py` but was **never wired into the trainer**. v2 fixes this and upgrades several other parameters.
 
 
52
 
53
- | Parameter | v1 (Original) | v2 (Upgraded) |
54
- |-----------|--------------|---------------|
55
- | LoRA rank | r=8, α=16 | **r=16, α=32** |
56
- | Epochs | 5 | **10** |
57
- | Effective batch size | 32 (4×8 accum) | **64 (2×32 accum)** |
58
- | Learning rate | 3e-4 | **2e-4** (more stable over longer run) |
59
- | Warmup ratio | 5% | **10%** |
60
- | Label smoothing | none | **0.1** (reduces overconfidence) |
61
- | Loss function | CE only *(bug)* | **CE + Style + Semantic** *(fixed)* |
62
- | Human-pattern loss | designed, unused | omitted on CPU; falls back to CE+style+sem |
63
- | Evaluation | GLEU only | **GLEU + BERTScore F1 + (1−WER) composite** |
64
- | Eval/save strategy | every 100 steps | **per epoch** |
65
- | Early stopping | none | **patience=3** |
66
- | Hub gate | none | **composite must beat saved baseline** |
67
- | Warm-start strategy | cold start | **merge r=8 adapter → apply fresh r=16 LoRA** |
68
- | Data split | 90%/10% train/val | **88%/7%/5% train/val/test** |
69
- | Dyslexia augmentation error rate | 15% | **20%** |
70
 
71
- ### Combined Loss (v2)
72
 
73
- ```
74
- L = L_CE + 0.3·L_style + 0.5·L_semantic
75
- ```
76
 
77
- The human-pattern loss (`λ₃·L_human`) is kept in the design but skipped on CPU (requires GPT-2 perplexity scoring). Style and semantic losses use a lightweight `StyleMLP` — no spaCy or external models required at training time.
78
 
79
- | Term | Purpose | Weight |
80
- |------|---------|--------|
81
- | `L_CE` | Cross-entropy with label smoothing (0.1) | 1.0 |
82
- | `L_style` | `1 − cos_sim(style(input), style(output))` — preserves writing fingerprint | 0.3 |
83
- | `L_semantic` | `1 − cos_sim(input_emb, output_emb)` — preserves meaning | 0.5 |
84
- | `L_human` | `1 − HumanPatternClassifier(output)` — anti-AI penalty | 0.4 *(GPU only)* |
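A minimal sketch of how these terms combine (illustrative only; tensor names and the pooled style/semantic vectors are assumptions — the real implementation lives in `loss_functions.py` and the `StyleMLP`):

```python
# Sketch of the v2 combined loss: CE with label smoothing plus cosine-based
# style and semantic terms.
import torch
import torch.nn.functional as F

def combined_loss(lm_logits, labels, style_in, style_out, sem_in, sem_out,
                  lambda_style=0.3, lambda_semantic=0.5, label_smoothing=0.1):
    # L_CE: token-level cross-entropy, ignoring padded label positions (-100)
    l_ce = F.cross_entropy(
        lm_logits.view(-1, lm_logits.size(-1)), labels.view(-1),
        ignore_index=-100, label_smoothing=label_smoothing,
    )
    # L_style / L_semantic: 1 - cosine similarity between input and output vectors
    l_style = 1.0 - F.cosine_similarity(style_in, style_out, dim=-1).mean()
    l_semantic = 1.0 - F.cosine_similarity(sem_in, sem_out, dim=-1).mean()
    return l_ce + lambda_style * l_style + lambda_semantic * l_semantic
```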
85
 
86
- ### Warm-Start Merge Strategy
87
 
88
- Rather than fine-tuning from scratch, v2 preserves the corrections learned by the original r=8 adapter:
89
 
90
- 1. Load existing LoRA adapter (r=8) from Hub
91
- 2. Merge adapter weights into base model (`merge_and_unload()`)
92
- 3. Apply a **fresh LoRA at r=16** on top of the merged base
93
- 4. Train with combined loss for 10 epochs
94
 
95
- This doubles the adapter's representational capacity while retaining previously learned correction patterns.
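A sketch of those four steps with PEFT (the adapter repo id below is an assumption; the actual logic is in `train_and_upgrade.py`):

```python
# Warm-start merge sketch: fold the existing r=8 adapter into the base model,
# then attach a fresh r=16 LoRA for continued training.
from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel, LoraConfig, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
warm = PeftModel.from_pretrained(base, "morpheuslord/rewrite")  # existing r=8 adapter (assumed repo id)
merged = warm.merge_and_unload()                                 # steps 1-2

config = LoraConfig(                                             # step 3
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q", "k", "v", "o", "wi_0", "wi_1", "wo"],
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(merged, config)
model.print_trainable_parameters()                               # step 4: train with the combined loss
```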
96
 
97
- ---
98
 
99
- ## Features
100
-
101
- | Feature | Description |
102
- |---------|-------------|
103
- | **Two-pass spell correction** | Dyslexia-aware phonetic pattern handling via LanguageTool |
104
- | **Style fingerprinting** | 41 raw features → MLP → 512-dim L2-normalised style vector |
105
- | **LoRA fine-tuning** | r=16, α=32, dropout=0.05 — targeting all attention + FFN projections |
106
- | **Academic vocabulary elevation** | BERT fill-mask → AWL candidate filtering → semantic similarity gate |
107
- | **Human pattern anti-AI loss** | Pre-trained frozen MLP classifier (17-dim features including GPT-2 perplexity) |
108
- | **Combined training loss** | `L_CE + λ₁·L_style + λ₂·L_semantic (+ λ₃·L_human on GPU)` |
109
- | **Sentence-chunked inference** | Long texts split into 128-token chunks matching training window |
110
- | **FastAPI server** | RESTful `/correct` endpoint with CORS and rate limiting |
111
- | **Multi-stage training** | Orchestrated via `train.sh` with checkpoint system (Skip/Redo/Continue) |
112
- | **Synthetic data augmentation** | `DyslexiaSimulator` generates realistic errors from clean text (20% error rate) |
113
- | **Composite score gating** | Hub push only if new model strictly beats saved baseline |
114
 
115
- ---
116
 
117
- ## Project Structure
118
-
119
- ```
120
- Rewriter/
121
- ├── configs/
122
- │ ├── training_config.yaml # Full training hyperparameters
123
- │ ├── training_config_fast.yaml # Quick iteration config
124
- │ ├── inference_config.yaml # Inference & generation settings
125
- │ ├── model_config.yaml # Model architecture registry
126
- │ └── awl_config.yaml # Academic Word List settings
127
- ├── scripts/
128
- │ ├── train.py # Main training script (Click CLI)
129
- │ ├── evaluate.py # Test set evaluation (GLEU, ERRANT, BERTScore)
130
- │ ├── run_inference.py # Interactive CLI inference
131
- │ ├── preprocess_data.py # Raw datasets → unified JSONL
132
- │ ├── pretrain_human_pattern_classifier.py # Stage 3: anti-AI classifier
133
- │ ├── download_datasets.sh # BEA-2019 dataset downloader
134
- │ └── download_kaggle_datasets.sh # Kaggle human/AI data downloader
135
- ├── src/
136
- │ ├── model/
137
- │ │ ├── base_model.py # Model loader (T5/BART/Llama + LoRA + quantization)
138
- │ │ ├── style_conditioner.py # Prefix tuning: style → virtual tokens
139
- │ │ ├── generation_utils.py # Beam search, sampling, batch generation
140
- │ │ └── lora_adapter.py # LoRA configuration helpers
141
- │ ├── preprocessing/
142
- │ │ ├── pipeline.py # Full preprocessing orchestrator
143
- │ │ ├── spell_corrector.py # LanguageTool + dyslexia-aware correction
144
- │ │ ├── dyslexia_simulator.py # Synthetic error generation (Rello et al.)
145
- │ │ ├── dependency_parser.py # spaCy dependency tree analysis
146
- │ │ ├── ner_tagger.py # Named entity protection
147
- │ │ └── sentence_segmenter.py # Sentence boundary detection
148
- │ ├── style/
149
- │ │ ├── fingerprinter.py # 41 features → 512-dim style vector
150
- │ │ ├── style_vector.py # Style vector dataclass
151
- │ │ ├── formality_classifier.py # Rule-based formality scoring
152
- │ │ └── emotion_classifier.py # Emotion detection
153
- │ ├── training/
154
- │ │ ├── dataset.py # Pre-tokenized cached dataset with style vectors
155
- │ │ ├── trainer.py # CorrectionTrainer (HF Trainer + PEFT fixes)
156
- │ │ ├── loss_functions.py # V1 and V2 combined losses
157
- │ │ ├── human_pattern_extractor.py # 17-dim feature extraction + classifier
158
- │ │ └── callbacks.py # Evaluation logging callbacks
159
- │ ├── vocabulary/
160
- │ │ ├── lexical_substitution.py # BERT fill-mask → AWL substitution pipeline
161
- │ │ ├── awl_loader.py # Coxhead Academic Word List loader
162
- │ │ └── register_filter.py # Contraction expansion + colloquial replacement
163
- │ ├── inference/
164
- │ │ ├── corrector.py # End-to-end inference pipeline orchestrator
165
- │ │ └── postprocessor.py # Cleanup, entity restore, formatting
166
- │ ├── evaluation/
167
- │ │ ├── gleu_scorer.py # GLEU + BERTScore computation
168
- │ │ ├── errant_evaluator.py # ERRANT P/R/F0.5 evaluation
169
- │ │ ├── style_metrics.py # Style similarity + AWL coverage
170
- │ │ └── authorship_verifier.py # AI detection resistance testing
171
- │ └── api/
172
- │ ├── main.py # FastAPI application
173
- │ ├── schemas.py # Pydantic request/response models
174
- │ └── middleware.py # Rate limiting + CORS
175
- ├── train_and_upgrade.py # v2 upgrade pipeline (self-improving Hub push)
176
- ├── data/
177
- │ ├── raw/ # Original datasets (JFLEG, W&I+LOCNESS)
178
- │ ├── processed/ # Unified JSONL (train/val/test splits)
179
- │ ├── cache/ # Pre-tokenized dataset caches (.pt files)
180
- │ └── awl/ # Coxhead Academic Word List
181
- ├── train.sh # Multi-stage training orchestrator
182
- ├── start.sh # Inference launcher (CLI or API mode)
183
- ├── baseline_score.json # Saved composite score — gate for Hub push
184
- ├── Dockerfile # Production container
185
- ├── docker-compose.yml # Docker deployment
186
- ├── requirements.txt # Python dependencies
187
- └── pyproject.toml # Project metadata
188
- ```
189
 
190
- ---
191
 
192
- ## Model Architecture
193
-
194
- ### PNG:
195
- ![Architecture](arch.png)
196
-
197
- ### Mermaid Diagram:
198
- ```mermaid
199
- graph TB
200
- subgraph INFERENCE["🔮 Inference Pipeline"]
201
- direction TB
202
- INPUT["📝 Raw Dyslexic Text"]
203
- subgraph PREPROCESS["Pre-Processing"]
204
- SPELL["Spell Corrector<br/><i>dyslexia-aware phonetic</i>"]
205
- SENT_SEG["Sentence Segmenter"]
206
- DEP_PARSE["Dependency Parser"]
207
- NER["NER Tagger"]
208
- end
209
- subgraph STYLE["Style Analysis"]
210
- FINGER["Style Fingerprinter<br/><i>512-dim vector</i>"]
211
- EMOTION["Emotion Classifier"]
212
- FORMALITY["Formality Classifier"]
213
- STYLE_VEC["Style Vector Composer"]
214
- end
215
- subgraph GENERATION["Core Generation"]
216
- STYLE_COND["Style Conditioner<br/><i>prefix tuning</i>"]
217
- BASE_MODEL["Base LM<br/><i>Flan-T5-Small (warm-merged)</i>"]
218
- LORA["LoRA Adapter<br/><i>r=16</i>"]
219
- GEN_UTILS["Generation Utils<br/><i>beam search, sampling</i>"]
220
- end
221
- subgraph POSTPROCESS["Post-Processing"]
222
- POSTPROC["Post-Processor<br/><i>formatting, cleanup</i>"]
223
- VOCAB_SUB["Lexical Substitution<br/><i>BERT-based</i>"]
224
- AWL["AWL Loader<br/><i>Coxhead Academic Word List</i>"]
225
- REG_FILTER["Register Filter<br/><i>academic tone gate</i>"]
226
- end
227
- OUTPUT["✅ Corrected Academic Text"]
228
- INPUT --> SPELL --> SENT_SEG --> DEP_PARSE --> NER
229
- INPUT --> FINGER --> EMOTION --> FORMALITY --> STYLE_VEC
230
- NER --> STYLE_COND
231
- STYLE_VEC --> STYLE_COND
232
- STYLE_COND --> BASE_MODEL
233
- LORA -.->|"merged weights"| BASE_MODEL
234
- BASE_MODEL --> GEN_UTILS --> POSTPROC
235
- POSTPROC --> VOCAB_SUB
236
- AWL --> VOCAB_SUB
237
- VOCAB_SUB --> REG_FILTER --> OUTPUT
238
- end
239
-
240
- subgraph TRAINING["🏋️ Training Pipeline (v2)"]
241
- direction TB
242
- subgraph WARMSTART["Warm-Start Merge"]
243
- HUB_ADAPTER["Hub LoRA Adapter<br/><i>r=8 (existing)</i>"]
244
- MERGE["merge_and_unload()"]
245
- FRESH_LORA["Fresh LoRA r=16"]
246
- end
247
- subgraph DATA["Data Pipeline"]
248
- JFLEG["jhu-clsp/jfleg<br/><i>~5k pairs, 4 refs each</i>"]
249
- WILOCNESS["bea2019st/wi_locness<br/><i>~34k pairs</i>"]
250
- DYSLEXIA_AUG["DyslexiaSimulator<br/><i>20% error rate augmentation</i>"]
251
- SPLIT["88% train / 7% val / 5% test"]
252
- end
253
- subgraph LOSS["Combined Loss (v2 — now active)"]
254
- L_CE["L_CE + label_smoothing=0.1"]
255
- L_STYLE["0.3 · L_style"]
256
- L_SEM["0.5 · L_semantic"]
257
- L_HUMAN["0.4 · L_human<br/><i>(GPU only)</i>"]
258
- end
259
- subgraph EVAL["Composite Evaluation"]
260
- GLEU_E["GLEU"]
261
- BERT_E["BERTScore F1"]
262
- WER_E["1 − WER"]
263
- COMPOSITE["Composite = mean(3)"]
264
- GATE["Beat baseline?"]
265
- HUB_PUSH["Push to Hub ✅"]
266
- end
267
- HUB_ADAPTER --> MERGE --> FRESH_LORA
268
- JFLEG --> DYSLEXIA_AUG
269
- WILOCNESS --> DYSLEXIA_AUG
270
- DYSLEXIA_AUG --> SPLIT
271
- L_CE --> COMPOSITE
272
- L_STYLE --> COMPOSITE
273
- L_SEM --> COMPOSITE
274
- GLEU_E --> COMPOSITE
275
- BERT_E --> COMPOSITE
276
- WER_E --> COMPOSITE
277
- COMPOSITE --> GATE --> HUB_PUSH
278
- end
279
- ```
280
 
281
- ---
282
 
283
- ## Design Choices & Rationale
284
 
285
- ### Why Flan-T5-Small?
286
 
287
- | Consideration | Decision |
288
- |---------------|----------|
289
- | **Hardware constraint** | RTX 3050 Laptop GPU (4GB VRAM) — rules out models > 500M params |
290
- | **Architecture** | Encoder-decoder (seq2seq) is ideal for text-to-text correction tasks |
291
- | **Instruction tuning** | Flan-T5 is pre-trained on 1,800+ instruction tasks — follows correction prompts naturally |
292
- | **LoRA efficiency** | Trainable params scale with r: r=16 → ~2.56M (3.3%) — still fits in 4GB |
293
 
294
- ### Why LoRA over Full Fine-Tuning?
295
 
296
- - **Memory**: Full fine-tuning of T5-Small requires ~2.5GB for gradients alone; LoRA r=16 needs ~400MB
297
- - **Warm-start safety**: Merging r=8 weights preserves corrections before expanding capacity to r=16
298
- - **Merging**: LoRA weights merge into base model at inference time — zero latency overhead
299
- - **Configuration**: `r=16, alpha=32, dropout=0.05`, targeting all attention + FFN projections (`q, k, v, o, wi_0, wi_1, wo`)
300
 
301
- ### Why a Combined Multi-Objective Loss?
302
 
303
- The system uses (on CPU): `L = L_CE + 0.3·L_style + 0.5·L_semantic`
304
 
305
- On GPU (with human-pattern classifier available): `L = L_CE + 0.3·L_style + 0.5·L_semantic + 0.4·L_human`
306
 
307
- | Term | Purpose | Weight |
308
- |------|---------|--------|
309
- | `L_CE` | Cross-entropy with label smoothing (0.1) | 1.0 |
310
- | `L_style` | `1 − cos_sim(style(input), style(output))` — preserves writing fingerprint | 0.3 |
311
- | `L_semantic` | `1 − cos_sim(input_emb, output_emb)` — preserves meaning | 0.5 |
312
- | `L_human` | `1 − HumanPatternClassifier(output)` — penalises AI-like text patterns | 0.4 |
313
 
314
- The style and semantic losses use a lightweight `StyleMLP` (mean-pooled token embeddings followed by a linear projection) that adds no external dependencies at training time.
315
 
316
- ### Why a Human Pattern Classifier?
317
 
318
- AI-generated text has detectable statistical signatures:
319
- - **Lower GPT-2 perplexity** (AI text is more "predictable")
320
- - **Lower burstiness** (AI has uniform sentence lengths; humans vary)
321
- - **Higher AI marker density** (overuse of "delve", "leverage", "furthermore")
322
- - **Lower n-gram novelty** (AI reuses phrases more)
323
 
324
- The classifier is a 3-layer MLP (17→128→64→1) pre-trained on ~100k samples from two Kaggle datasets (Shanegerami AI_Human.csv + Starblasters8), then **frozen** during main training. Its output score (0=AI, 1=human) is used as a reward signal. It requires a GPU for GPT-2 perplexity scoring and falls back gracefully on CPU.
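As a shape-level sketch (layer sizes follow the description above; activation choices and training details are assumptions):

```python
# 17 stylometric features -> 128 -> 64 -> 1 human-likeness score (0 = AI, 1 = human).
import torch
import torch.nn as nn

class HumanPatternClassifier(nn.Module):
    def __init__(self, in_features: int = 17):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)

clf = HumanPatternClassifier().eval()
clf.requires_grad_(False)  # frozen during main training; output used only as a reward signal
```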
325
 
326
- ### Why Sentence-Chunked Inference?
327
 
328
- The model was trained with `max_input_length=128` tokens. The task prefix alone consumes ~40 tokens, leaving ~86 tokens for actual text. Long inputs are:
329
 
330
- 1. Split into sentences using spaCy
331
- 2. Grouped into chunks that fit the 128-token budget
332
- 3. Each chunk is corrected independently
333
- 4. Results are joined back together
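A minimal sketch of that budgeting step (assuming spaCy sentence splitting and the Flan-T5 tokenizer; the production logic lives in `src/inference/corrector.py`):

```python
# Group sentences into chunks that fit the 128-token training window after the task prefix.
import spacy
from transformers import AutoTokenizer

nlp = spacy.load("en_core_web_sm")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

def chunk_sentences(text: str, budget: int = 128, prefix_tokens: int = 40) -> list[str]:
    limit = budget - prefix_tokens
    chunks, current, current_len = [], [], 0
    for sent in nlp(text).sents:
        n = len(tokenizer.encode(sent.text, add_special_tokens=False))
        if current and current_len + n > limit:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent.text)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks  # each chunk is corrected independently, then the results are rejoined
```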
334
 
335
- ### Why Post-Generation Vocabulary Elevation?
336
 
337
- Rather than relying solely on the model to produce academic vocabulary (which T5-Small lacks the capacity for), a separate BERT-based lexical substitution pipeline is applied:
338
 
339
- 1. POS-tag the output with spaCy
340
- 2. Identify non-AWL content words (nouns, verbs, adjectives, adverbs)
341
- 3. Mask each candidate → run BERT fill-mask → filter to AWL-only predictions
342
- 4. Accept substitution only if `semantic_similarity > 0.82` (measured with `all-mpnet-base-v2`)
343
- 5. Track used substitutions to prevent duplicate replacements
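A simplified sketch of the substitution gate (the tiny AWL set and model choices below are placeholders; the full pipeline, including POS filtering and duplicate tracking, is in `src/vocabulary/lexical_substitution.py`):

```python
# BERT fill-mask candidates -> keep AWL words only -> accept if sentence similarity > 0.82.
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
embedder = SentenceTransformer("all-mpnet-base-v2")
awl = {"derive", "obtain", "significant", "analyse"}  # stand-in for the full Coxhead AWL

def elevate_word(sentence: str, word: str, threshold: float = 0.82) -> str:
    masked = sentence.replace(word, fill_mask.tokenizer.mask_token, 1)
    for cand in fill_mask(masked, top_k=20):
        token = cand["token_str"].strip()
        if token.lower() not in awl:
            continue
        candidate = sentence.replace(word, token, 1)
        sim = util.cos_sim(embedder.encode(sentence), embedder.encode(candidate)).item()
        if sim > threshold:
            return candidate  # semantically safe academic substitution
    return sentence  # no acceptable AWL substitute found
```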
344
 
345
- ---
346
 
347
- ## Quick Start
348
 
349
- ### Prerequisites
350
 
351
- - Python 3.10
352
- - NVIDIA GPU with ≥ 4GB VRAM (or CPU, slower)
353
- - ~10GB disk space for models and datasets
354
 
355
- ### Option A: Self-Improving Upgrade Pipeline (v2)
356
 
357
- This pipeline loads the existing Hub adapter, upgrades it, evaluates, and only pushes if it improves.
358
 
359
- ```bash
360
- git clone https://huggingface.co/morpheuslord/rewrite && cd rewrite
361
- pip install -r requirements.txt
362
 
363
- export HF_TOKEN="your-hf-token-with-write-access"
364
- python train_and_upgrade.py
365
- ```
366
 
367
- The pipeline handles all 10 steps automatically:
368
- **Load adapter → Warm-start merge → Apply r=16 LoRA → Load data → Train → Evaluate → Gate → Save → Merge → Push**
369
 
370
- ### Option B: Manual Step-by-Step (original pipeline)
371
 
372
- ```bash
373
- # 1. Install dependencies
374
- pip install -r requirements.txt
375
- python -m spacy download en_core_web_sm
376
 
377
- # 2. Preprocess datasets (FCE, W&I+LOCNESS, JFLEG → unified JSONL)
378
- python scripts/preprocess_data.py
379
 
380
- # 3. Pre-train the human pattern classifier
381
- python scripts/pretrain_human_pattern_classifier.py
382
 
383
- # 4. Train the correction model
384
- PYTHONPATH=. python scripts/train.py --config configs/training_config.yaml --use-v2-loss
385
 
386
- # 5. Merge LoRA adapter into base model for inference
387
- python -c "
388
- from peft import PeftModel
389
- from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
390
- import torch
391
- model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-small', torch_dtype=torch.bfloat16)
392
- model = PeftModel.from_pretrained(model, 'checkpoints/checkpoint-BEST')
393
- model = model.merge_and_unload()
394
- model.save_pretrained('checkpoints/best_model_merged')
395
- AutoTokenizer.from_pretrained('google/flan-t5-small').save_pretrained('checkpoints/best_model_merged')
396
- "
397
 
398
- # 6. Run inference
399
- PYTHONPATH=. python scripts/run_inference.py --text "The studnet recieved alot of informtion."
400
 
401
- # 7. Or start the API server
402
- PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000
403
- ```
404
 
405
- ---
406
 
407
- ## Training Pipeline
408
-
409
- ### v2 Upgrade Pipeline (`train_and_upgrade.py`) — 10 Steps
410
-
411
- | Step | Action |
412
- |------|--------|
413
- | 1 | Load existing LoRA adapter (r=8) from Hub |
414
- | 2 | Merge into base weights (`merge_and_unload`) — warm start |
415
- | 3 | Apply fresh LoRA r=16 on merged base |
416
- | 4 | Load JFLEG + W&I+LOCNESS; augment with DyslexiaSimulator (20% error rate) |
417
- | 5 | Train with combined loss for 10 epochs, early stopping patience=3 |
418
- | 6 | Evaluate on test set: GLEU + BERTScore F1 + (1−WER) |
419
- | 7 | Compare composite score against `baseline_score.json` |
420
- | 8 | If improved: save LoRA adapter |
421
- | 9 | Merge adapter → save full model |
422
- | 10 | Push adapter + merged model to Hub; update baseline |
423
-
424
- ### v1 Original Pipeline (`train.sh`) — 5 Stages
425
-
426
- | Stage | Action |
427
- |-------|--------|
428
- | 1 | Setup & Dependencies |
429
- | 2 | Data Preprocessing (FCE + W&I+LOCNESS + JFLEG → JSONL) |
430
- | 3 | Human Pattern Classifier Pre-Training |
431
- | 4 | Main Model Training (LoRA r=8, 5 epochs, CE only) |
432
- | 5 | Evaluation (GLEU only) |
433
 
434
- ---
435
 
436
- ## Hyperparameter Reference
437
-
438
- ### v2 (`train_and_upgrade.py`)
439
-
440
- ```python
441
- LORA_R = 16
442
- LORA_ALPHA = 32
443
- LORA_DROPOUT = 0.05
444
- TARGET_MODULES = ["q", "v", "k", "o", "wi_0", "wi_1", "wo"]
445
-
446
- EPOCHS = 10
447
- BATCH_SIZE = 2 # per device
448
- GRAD_ACCUM = 32 # effective batch = 64
449
- LR = 2e-4
450
- WARMUP_RATIO = 0.10
451
- LABEL_SMOOTHING = 0.1
452
- MAX_INPUT_LEN = 128
453
- MAX_TARGET_LEN = 128
454
-
455
- LAMBDA_STYLE = 0.3
456
- LAMBDA_SEMANTIC = 0.5
457
- LAMBDA_HUMAN = 0.4 # GPU only
458
- ```
459
-
460
- ### v1 (`configs/training_config.yaml`)
461
-
462
- ```yaml
463
- lora:
464
- r: 8
465
- lora_alpha: 16
466
- lora_dropout: 0.05
467
- target_modules: [q, v, k, o, wi_0, wi_1, wo]
468
-
469
- training:
470
- per_device_train_batch_size: 4
471
- gradient_accumulation_steps: 8 # effective batch = 32
472
- learning_rate: 3.0e-4
473
- lr_scheduler_type: cosine
474
- bf16: true
475
-
476
- loss:
477
- lambda_style: 0.3
478
- lambda_semantic: 0.5
479
- lambda_human_pattern: 0.4
480
- ```
481
-
482
- ### `configs/inference_config.yaml`
483
-
484
- ```yaml
485
- model:
486
- key: "flan-t5-small"
487
- checkpoint_path: "checkpoints/best_model_merged"
488
- use_lora: false
489
-
490
- generation:
491
- num_beams: 5
492
- length_penalty: 1.2
493
- no_repeat_ngram_size: 3
494
- max_new_tokens: 128
495
-
496
- vocabulary:
497
- semantic_threshold: 0.82
498
- ```
499
 
500
- ---
501
 
502
- ## Inference Pipeline (7 Steps)
503
-
504
- ```
505
- Raw Text
506
-
507
-
508
- 1. Preprocessing ─────── LanguageTool spell correction + spaCy parsing
509
-
510
-
511
- 2. Style Fingerprinting ─ Extract 41 features → MLP → 512-dim vector
512
-
513
-
514
- 3. Sentence-Chunked Generation ─ Split into 128-token chunks → Flan-T5 → rejoin
515
-
516
-
517
- 4. Post-Processing ───── Remove artifacts, replace em dashes, fix spacing
518
-
519
-
520
- 5. Vocabulary Elevation ─ BERT fill-mask → AWL filtering → semantic gate (threshold 0.82)
521
-
522
-
523
- 6. Register Filtering ── Expand contractions, replace colloquialisms
524
-
525
-
526
- 7. Metrics ──────────── Style similarity, AWL coverage, readability scores
527
-
528
-
529
- Corrected Text
530
- ```
531
 
532
- ---
533
 
534
- ## API Usage
535
 
536
- ```bash
537
- # Start the server
538
- PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000
539
 
540
- # Correct text
541
- curl -X POST http://localhost:8000/correct \
542
- -H "Content-Type: application/json" \
543
- -d '{"text": "The studnet recieved alot of informtion.", "style_alpha": 0.6}'
544
 
545
- # Health check
546
- curl http://localhost:8000/health
547
- ```
548
 
549
- Interactive docs at `http://localhost:8000/docs`.
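The same call from Python, for reference (`requests` is assumed to be installed; it is not necessarily listed in `requirements.txt`):

```python
import requests

resp = requests.post(
    "http://localhost:8000/correct",
    json={"text": "The studnet recieved alot of informtion.", "style_alpha": 0.6},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```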
550
 
551
- ---
552
 
553
- ## Hardware Requirements
554
 
555
- | Tier | GPU | LoRA Config | Training Time |
556
- |------|-----|-------------|---------------|
557
- | **Tested (v1)** | RTX 3050 4GB | r=8, 5 epochs | ~45 min |
558
- | **Tested (v2 CPU)** | None (HF Space CPU Basic) | r=16, 10 epochs | ~12–24 hours |
559
- | Recommended | RTX 3090 24GB | r=16, 10 epochs + human-pattern loss | ~2–3h |
560
- | Maximum | A100 80GB | Full pipeline with GPT-2 perplexity scoring | ~12h |
561
 
562
- ---
563
 
564
- ## Data Sources
565
 
566
- | Dataset | Type | Size | Access |
567
- |---------|------|------|--------|
568
- | JFLEG (`jhu-clsp/jfleg`) | Fluency corrections (4 refs each) | ~5k pairs | HF Hub, no registration |
569
- | W&I+LOCNESS (`bea2019st/wi_locness`) | Learner errors + corrections | ~34k pairs | HF Hub, no registration |
570
- | FCE v2.1 | Learner errors + corrections | ~28k pairs | BEA-2019 (registration required) |
571
- | Shanegerami AI_Human.csv | Human vs AI classification | ~50k samples | Kaggle |
572
- | Starblasters8 data.parquet | Human vs AI classification | ~50k samples | Kaggle |
573
- | Coxhead AWL | Academic Word List | 570 families / 549 headwords | Victoria University |
574
 
575
- > Note: `train_and_upgrade.py` uses only JFLEG + W&I+LOCNESS (freely accessible via HF Hub). FCE and Kaggle datasets are used in the full manual pipeline only.
576
 
577
- ---
578
 
579
- ## Dyslexia Error Simulation
580
 
581
- The `DyslexiaSimulator` generates synthetic training data based on research by Rello et al. (2013, 2017). v2 uses a 20% per-word error rate (up from 15%).
582
 
583
- | Error Type | Frequency | Example |
584
- |-----------|-----------|---------|
585
- | Phonetic substitution | 35% | "because" → "becaus" |
586
- | Letter transposition | 18% | "the" → "teh" |
587
- | Letter omission | 16% | "important" → "importnt" |
588
- | Letter doubling | 12% | "letter" → "lettter" |
589
- | Letter reversal (b/d, p/q) | 10% | "bad" → "dad" |
590
- | Word boundary errors | 9% | "a lot" → "alot" |
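A toy sketch of how errors could be sampled at the v2 rate (the real phonetic, reversal, and boundary patterns come from the Rello et al. tables inside `DyslexiaSimulator`; only a few mechanical error types are shown here):

```python
import random

ERROR_WEIGHTS = {"phonetic": 0.35, "transposition": 0.18, "omission": 0.16,
                 "doubling": 0.12, "reversal": 0.10, "boundary": 0.09}

def corrupt(word: str, rng: random.Random) -> str:
    kind = rng.choices(list(ERROR_WEIGHTS), weights=list(ERROR_WEIGHTS.values()))[0]
    if kind == "transposition" and len(word) > 3:
        i = rng.randrange(len(word) - 1)
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if kind == "omission" and len(word) > 3:
        i = rng.randrange(len(word))
        return word[:i] + word[i + 1:]
    if kind == "doubling":
        i = rng.randrange(len(word))
        return word[:i] + word[i] + word[i:]
    return word  # phonetic, reversal, and boundary errors need the full lookup tables

def simulate(text: str, error_rate: float = 0.20, seed: int = 0) -> str:
    rng = random.Random(seed)
    return " ".join(corrupt(w, rng) if rng.random() < error_rate else w
                    for w in text.split())

print(simulate("The important letter arrived because the teacher asked for it"))
```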
591
 
592
- ---
593
 
594
- ## Style Fingerprint Vector
595
 
596
- The 512-dimensional style vector captures 41 raw features:
597
 
598
- | Group | Features | Count |
599
- |-------|----------|-------|
600
- | Sentence stats | mean, std, skew of sentence lengths | 3 |
601
- | Word stats | mean, std of word lengths | 2 |
602
- | Lexical | type-token ratio, lexical density | 2 |
603
- | Syntactic | passive/active voice ratio, subordinate clause ratio, avg dependency tree depth | 4 |
604
- | Discourse | 20 academic discourse markers (per 100 words) | 20 |
605
- | Register | hedging frequency, formality score, nominalization ratio | 3 |
606
- | Readability | Flesch reading ease, avg syllables per word | 2 |
607
- | Pronouns | first-person ratio, third-person ratio | 2 |
608
- | Other | question ratio, exclamation ratio, AWL coverage | 3 |
609
 
610
- Projected through a 2-layer MLP (`41 → 256 → 512`) with LayerNorm and GELU activation, then L2-normalised.
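A shape-level sketch of that projection (the module name and exact layer ordering are assumptions; the real code is in `src/style/fingerprinter.py`):

```python
# 41 raw stylometric features -> 256 -> 512, LayerNorm + GELU, then L2-normalised.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleProjector(nn.Module):
    def __init__(self, in_features: int = 41, hidden: int = 256, out_features: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden), nn.LayerNorm(hidden), nn.GELU(),
            nn.Linear(hidden, out_features),
        )

    def forward(self, raw_features: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(raw_features), p=2, dim=-1)  # unit-length style vector

vec = StyleProjector()(torch.randn(1, 41))
print(vec.shape, vec.norm(dim=-1))  # torch.Size([1, 512]) tensor([1.])
```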
611
 
612
- ---
613
 
614
- ## Known Limitations
 
615
 
616
- 1. **Model capacity**: Flan-T5-Small (77M params) has limited correction ability compared to larger models. Doubling LoRA rank (r=8 → r=16) partially addresses this.
617
- 2. **Training window**: 128-token max input means very long sentences may be split mid-clause.
618
- 3. **Vocabulary elevation**: BERT fill-mask can suggest semantically inappropriate AWL words; the 0.82 similarity threshold is a trade-off between coverage and accuracy.
619
- 4. **Already-correct text**: The model is trained on error→correction pairs; feeding it clean text produces unpredictable output.
620
- 5. **LanguageTool latency**: Spell correction takes ~15–20s due to JVM startup on first call.
621
- 6. **Human-pattern loss on CPU**: The GPT-2 perplexity-based loss is skipped on CPU for performance. Full loss is only active on GPU.
622
- <!-- 7. **Semantic drift in correction**: The pipeline can introduce meaning-level errors — dyslexic phonetic patterns misread by LanguageTool can produce plausible-but-wrong word substitutions. BERTScore F1 and WER (now primary evaluation signals in v2) help detect but don't eliminate this. A dedicated post-correction semantic faithfulness check remains a future improvement. -->
 
1
  ---
2
+ base_model: google/flan-t5-small
3
+ library_name: peft
4
  tags:
5
+ - base_model:adapter:google/flan-t5-small
 
 
 
6
  - lora
7
+ - transformers
8
  ---
9
 
10
+ # Model Card for Model ID
11
 
12
+ <!-- Provide a quick summary of what the model is/does. -->
13
 
 
14
 
 
15
 
16
+ ## Model Details
 
 
 
17
 
18
+ ### Model Description
19
 
20
+ <!-- Provide a longer summary of what this model is. -->
21
 
 
22
 
 
 
 
 
 
 
23
 
24
+ - **Developed by:** [More Information Needed]
25
+ - **Funded by [optional]:** [More Information Needed]
26
+ - **Shared by [optional]:** [More Information Needed]
27
+ - **Model type:** [More Information Needed]
28
+ - **Language(s) (NLP):** [More Information Needed]
29
+ - **License:** [More Information Needed]
30
+ - **Finetuned from model [optional]:** [More Information Needed]
31
 
32
+ ### Model Sources [optional]
33
 
34
+ <!-- Provide the basic links for the model. -->
35
 
36
+ - **Repository:** [More Information Needed]
37
+ - **Paper [optional]:** [More Information Needed]
38
+ - **Demo [optional]:** [More Information Needed]
39
 
40
+ ## Uses
41
 
42
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
43
 
44
+ ### Direct Use
 
 
45
 
46
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
47
 
48
+ [More Information Needed]
 
 
 
 
 
49
 
50
+ ### Downstream Use [optional]
51
 
52
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
53
 
54
+ [More Information Needed]
 
 
 
55
 
56
+ ### Out-of-Scope Use
57
 
58
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
59
 
60
+ [More Information Needed]
61
 
62
+ ## Bias, Risks, and Limitations
63
 
64
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
65
 
66
+ [More Information Needed]
67
 
68
+ ### Recommendations
69
 
70
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
71
 
72
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
73
 
74
+ ## How to Get Started with the Model
75
 
76
+ Use the code below to get started with the model.
 
 
 
 
 
77
 
78
+ [More Information Needed]
79
 
80
+ ## Training Details
 
 
 
81
 
82
+ ### Training Data
83
 
84
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
85
 
86
+ [More Information Needed]
87
 
88
+ ### Training Procedure
 
 
 
 
 
89
 
90
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
91
 
92
+ #### Preprocessing [optional]
93
 
94
+ [More Information Needed]
 
 
 
 
95
 
 
96
 
97
+ #### Training Hyperparameters
98
 
99
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
100
 
101
+ #### Speeds, Sizes, Times [optional]
 
 
 
102
 
103
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
104
 
105
+ [More Information Needed]
106
 
107
+ ## Evaluation
 
 
 
 
108
 
109
+ <!-- This section describes the evaluation protocols and provides the results. -->
110
 
111
+ ### Testing Data, Factors & Metrics
112
 
113
+ #### Testing Data
114
 
115
+ <!-- This should link to a Dataset Card if possible. -->
 
 
116
 
117
+ [More Information Needed]
118
 
119
+ #### Factors
120
 
121
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
 
 
122
 
123
+ [More Information Needed]
 
 
124
 
125
+ #### Metrics
 
126
 
127
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
128
 
129
+ [More Information Needed]
 
 
 
130
 
131
+ ### Results
 
132
 
133
+ [More Information Needed]
 
134
 
135
+ #### Summary
 
136
137
 
 
 
138
 
139
+ ## Model Examination [optional]
 
 
140
 
141
+ <!-- Relevant interpretability work for the model goes here -->
142
 
143
+ [More Information Needed]
144
 
145
+ ## Environmental Impact
146
 
147
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
148
 
149
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
150
 
151
+ - **Hardware Type:** [More Information Needed]
152
+ - **Hours used:** [More Information Needed]
153
+ - **Cloud Provider:** [More Information Needed]
154
+ - **Compute Region:** [More Information Needed]
155
+ - **Carbon Emitted:** [More Information Needed]
156
 
157
+ ## Technical Specifications [optional]
158
 
159
+ ### Model Architecture and Objective
160
 
161
+ [More Information Needed]
 
 
162
 
163
+ ### Compute Infrastructure
 
 
 
164
 
165
+ [More Information Needed]
 
 
166
 
167
+ #### Hardware
168
 
169
+ [More Information Needed]
170
 
171
+ #### Software
172
 
173
+ [More Information Needed]
 
 
 
 
 
174
 
175
+ ## Citation [optional]
176
 
177
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
178
 
179
+ **BibTeX:**
180
 
181
+ [More Information Needed]
182
 
183
+ **APA:**
184
 
185
+ [More Information Needed]
186
 
187
+ ## Glossary [optional]
188
 
189
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
190
 
191
+ [More Information Needed]
192
 
193
+ ## More Information [optional]
194
 
195
+ [More Information Needed]
196
 
197
+ ## Model Card Authors [optional]
198
 
199
+ [More Information Needed]
200
 
201
+ ## Model Card Contact
202
 
203
+ [More Information Needed]
204
+ ### Framework versions
205
 
206
+ - PEFT 0.19.1
adapter_config.json CHANGED
@@ -30,12 +30,12 @@
30
  "rank_pattern": {},
31
  "revision": null,
32
  "target_modules": [
33
- "wo",
34
  "o",
35
- "q",
36
  "wi_0",
37
- "v",
38
  "k",
 
 
 
39
  "wi_1"
40
  ],
41
  "target_parameters": null,
 
30
  "rank_pattern": {},
31
  "revision": null,
32
  "target_modules": [
 
33
  "o",
 
34
  "wi_0",
 
35
  "k",
36
+ "q",
37
+ "v",
38
+ "wo",
39
  "wi_1"
40
  ],
41
  "target_parameters": null,
adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:739806c54db7ce3ca21af4278e4160f3ed7feff9f6e09ad03beae7b26aa457c4
3
  size 10264128
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:77d31009b5285f236ff82ba88fdcf0f9370c49456f83335a50b89d0130fff9e5
3
  size 10264128
tokenizer.json CHANGED
@@ -2,7 +2,7 @@
2
  "version": "1.0",
3
  "truncation": {
4
  "direction": "Right",
5
- "max_length": 128,
6
  "strategy": "LongestFirst",
7
  "stride": 0
8
  },
 
2
  "version": "1.0",
3
  "truncation": {
4
  "direction": "Right",
5
+ "max_length": 256,
6
  "strategy": "LongestFirst",
7
  "stride": 0
8
  },