morpheuslord committed on
Commit
c6ccba3
·
verified ·
1 Parent(s): 1ebf493

Auto-upgrade: composite 0.8576 | GLEU 0.7506 | BERTScore 0.9733 | 1-WER 0.8488 | r=16, 10 epochs, combined loss

README.md CHANGED
@@ -1,596 +1,206 @@
1
  ---
2
- language:
3
- - en
4
  tags:
5
- - text2text-generation
6
- - dyslexia
7
- - grammar-correction
8
- - style-preservation
9
  - lora
10
- - flan-t5
11
- license: mit
12
- base_model: google/flan-t5-small
13
- datasets:
14
- - cambridge/fce
15
- - wi_locness
16
- - jfleg
17
- pipeline_tag: translation
18
  ---
19
 
20
- # Dyslexia Academic Writing Correction System
21
 
22
- > **A style-preserving, grammar-correcting, academic vocabulary elevating AI system that corrects dyslectic writing while maintaining the author's personal voice, tone, and authorship signal — not a rewriter, a corrector.**
23
 
24
- ## Overview
25
 
26
- This system takes text written by dyslexic students and corrects grammar, spelling, and fluency errors while:
27
 
28
- 1. **Preserving the author's unique writing style** via a 512-dimensional style fingerprint vector
29
- 2. **Elevating vocabulary to academic register** using Coxhead's Academic Word List (AWL) and BERT-based lexical substitution
30
- 3. **Resisting AI detection** through a frozen Human Pattern Classifier that penalises AI-typical writing during training
31
- 4. **Maintaining semantic meaning** with cosine-similarity-based semantic preservation loss
32
 
33
- The core model is **Google Flan-T5-Small** fine-tuned with **LoRA** (Low-Rank Adaptation), trained on real learner error corpora (FCE, W&I+LOCNESS, JFLEG) augmented with synthetic dyslexia-simulated data.
34
 
35
- ---
36
 
37
- ## Features
38
-
39
- | Feature | Description |
40
- |---------|-------------|
41
- | **Two-pass spell correction** | Dyslexia-aware phonetic pattern handling via LanguageTool |
42
- | **Style fingerprinting** | 41 raw features → MLP → 512-dim L2-normalised style vector |
43
- | **LoRA fine-tuning** | 1.63% trainable params (1.28M / 78.2M total), rank=8 |
44
- | **Academic vocabulary elevation** | BERT fill-mask → AWL candidate filtering → semantic similarity gate |
45
- | **Human pattern anti-AI loss** | Pre-trained frozen MLP classifier (17-dim features including GPT-2 perplexity) |
46
- | **Combined training loss** | `L_CE + λ₁·L_style + λ₂·L_semantic + λ₃·L_human_pattern` |
47
- | **Sentence-chunked inference** | Long texts split into 128-token chunks matching training window |
48
- | **FastAPI server** | RESTful `/correct` endpoint with CORS and rate limiting |
49
- | **Multi-stage training** | Orchestrated via `train.sh` with checkpoint system (Skip/Redo/Continue) |
50
- | **Synthetic data augmentation** | `DyslexiaSimulator` generates realistic errors from clean text |
51
 
52
- ---
53
 
54
- ## Project Structure
55
-
56
- ```
57
- Rewriter/
58
- ├── configs/
59
- │ ├── training_config.yaml # Full training hyperparameters
60
- │ ├── training_config_fast.yaml # Quick iteration config
61
- │ ├── inference_config.yaml # Inference & generation settings
62
- │ ├── model_config.yaml # Model architecture registry
63
- │ └── awl_config.yaml # Academic Word List settings
64
- ├── scripts/
65
- │ ├── train.py # Main training script (Click CLI)
66
- │ ├── evaluate.py # Test set evaluation (GLEU, ERRANT, BERTScore)
67
- │ ├── run_inference.py # Interactive CLI inference
68
- │ ├── preprocess_data.py # Raw datasets → unified JSONL
69
- │ ├── pretrain_human_pattern_classifier.py # Stage 3: anti-AI classifier
70
- │ ├── download_datasets.sh # BEA-2019 dataset downloader
71
- │ └── download_kaggle_datasets.sh # Kaggle human/AI data downloader
72
- ├── src/
73
- │ ├── model/
74
- │ │ ├── base_model.py # Model loader (T5/BART/Llama + LoRA + quantization)
75
- │ │ ├── style_conditioner.py # Prefix tuning: style → virtual tokens
76
- │ │ ├── generation_utils.py # Beam search, sampling, batch generation
77
- │ │ └── lora_adapter.py # LoRA configuration helpers
78
- │ ├── preprocessing/
79
- │ │ ├── pipeline.py # Full preprocessing orchestrator
80
- │ │ ├── spell_corrector.py # LanguageTool + dyslexia-aware correction
81
- │ │ ├── dyslexia_simulator.py # Synthetic error generation (Rello et al.)
82
- │ │ ├── dependency_parser.py # spaCy dependency tree analysis
83
- │ │ ├── ner_tagger.py # Named entity protection
84
- │ │ └── sentence_segmenter.py # Sentence boundary detection
85
- │ ├── style/
86
- │ │ ├── fingerprinter.py # 41 features → 512-dim style vector
87
- │ │ ├── style_vector.py # Style vector dataclass
88
- │ │ ├── formality_classifier.py # Rule-based formality scoring
89
- │ │ └── emotion_classifier.py # Emotion detection
90
- │ ├── training/
91
- │ │ ├── dataset.py # Pre-tokenized cached dataset with style vectors
92
- │ │ ├── trainer.py # CorrectionTrainer (HF Trainer + PEFT fixes)
93
- │ │ ├── loss_functions.py # V1 and V2 combined losses
94
- │ │ ├── human_pattern_extractor.py # 17-dim feature extraction + classifier
95
- │ │ └── callbacks.py # Evaluation logging callbacks
96
- │ ├── vocabulary/
97
- │ │ ├── lexical_substitution.py # BERT fill-mask → AWL substitution pipeline
98
- │ │ ├── awl_loader.py # Coxhead Academic Word List loader
99
- │ │ └── register_filter.py # Contraction expansion + colloquial replacement
100
- │ ├── inference/
101
- │ │ ├── corrector.py # End-to-end inference pipeline orchestrator
102
- │ │ └── postprocessor.py # Cleanup, entity restore, formatting
103
- │ ├── evaluation/
104
- │ │ ├── gleu_scorer.py # GLEU + BERTScore computation
105
- │ │ ├── errant_evaluator.py # ERRANT P/R/F0.5 evaluation
106
- │ │ ├── style_metrics.py # Style similarity + AWL coverage
107
- │ │ └── authorship_verifier.py # AI detection resistance testing
108
- │ └── api/
109
- │ ├── main.py # FastAPI application
110
- │ ├── schemas.py # Pydantic request/response models
111
- │ └── middleware.py # Rate limiting + CORS
112
- ├── data/
113
- │ ├── raw/ # Original datasets (FCE, W&I+LOCNESS, JFLEG, Kaggle)
114
- │ ├── processed/ # Unified JSONL (train/val/test splits)
115
- │ ├── cache/ # Pre-tokenized dataset caches (.pt files)
116
- │ └── awl/ # Coxhead Academic Word List
117
- ├── train.sh # Multi-stage training orchestrator
118
- ├── start.sh # Inference launcher (CLI or API mode)
119
- ├── Dockerfile # Production container
120
- ├── docker-compose.yml # Docker deployment
121
- ├── requirements.txt # Python dependencies
122
- └── pyproject.toml # Project metadata
123
- ```
124
-
125
- ## Model Architecture
126
-
127
-
128
- ### PNG:
129
- ![Architecture](arch.png)
130
-
131
- ### Mermaid Diagram:
132
- ```mermaid
133
- graph TB
134
- %% ── Inference Pipeline (left-to-right flow) ──────────────────────
135
- subgraph INFERENCE["🔮 Inference Pipeline"]
136
- direction TB
137
- INPUT["📝 Raw Dyslectic Text"]
138
-
139
- subgraph PREPROCESS["Pre-Processing"]
140
- SPELL["Spell Corrector<br/><i>dyslexia-aware phonetic</i>"]
141
- SENT_SEG["Sentence Segmenter"]
142
- DEP_PARSE["Dependency Parser"]
143
- NER["NER Tagger"]
144
- end
145
-
146
- subgraph STYLE["Style Analysis"]
147
- FINGER["Style Fingerprinter<br/><i>512-dim vector</i>"]
148
- EMOTION["Emotion Classifier"]
149
- FORMALITY["Formality Classifier"]
150
- STYLE_VEC["Style Vector Composer"]
151
- end
152
-
153
- subgraph GENERATION["Core Generation"]
154
- STYLE_COND["Style Conditioner<br/><i>prefix tuning</i>"]
155
- BASE_MODEL["Base LM<br/><i>Flan-T5 / BART / Llama-3</i>"]
156
- LORA["LoRA Adapter"]
157
- GEN_UTILS["Generation Utils<br/><i>beam search, sampling</i>"]
158
- end
159
-
160
- subgraph POSTPROCESS["Post-Processing"]
161
- POSTPROC["Post-Processor<br/><i>formatting, cleanup</i>"]
162
- VOCAB_SUB["Lexical Substitution<br/><i>BERT-based</i>"]
163
- AWL["AWL Loader<br/><i>Coxhead Academic Word List</i>"]
164
- REG_FILTER["Register Filter<br/><i>academic tone gate</i>"]
165
- end
166
-
167
- OUTPUT["✅ Corrected Academic Text"]
168
-
169
- INPUT --> SPELL --> SENT_SEG --> DEP_PARSE --> NER
170
- INPUT --> FINGER --> EMOTION --> FORMALITY --> STYLE_VEC
171
- NER --> STYLE_COND
172
- STYLE_VEC --> STYLE_COND
173
- STYLE_COND --> BASE_MODEL
174
- LORA -.->|"merged weights"| BASE_MODEL
175
- BASE_MODEL --> GEN_UTILS --> POSTPROC
176
- POSTPROC --> VOCAB_SUB
177
- AWL --> VOCAB_SUB
178
- VOCAB_SUB --> REG_FILTER --> OUTPUT
179
- end
180
-
181
- %% ── Training Pipeline ────────────────────────────────────────────
182
- subgraph TRAINING["🏋️ Training Pipeline"]
183
- direction TB
184
-
185
- subgraph DATA["Data Pipeline"]
186
- RAW_DATA["Raw Datasets<br/><i>JFLEG, WI+LOCNESS, C4_200M,<br/>FCE, Lang-8, NUCLE</i>"]
187
- KAGGLE["Kaggle Datasets<br/><i>Shanegerami, Starblasters8</i>"]
188
- PREPROC_SCRIPT["preprocess_data.py"]
189
- TRAIN_JSONL["train.jsonl / val.jsonl / test.jsonl"]
190
- end
191
-
192
- subgraph HP_PRETRAIN["Human Pattern Pre-Training"]
193
- FEAT_EXTRACT["Feature Extractor<br/><i>17-dim: perplexity, burstiness,<br/>n-gram novelty, AI markers...</i>"]
194
- GPT2["GPT-2<br/><i>perplexity scorer</i>"]
195
- HP_CLASSIFIER["Human Pattern Classifier<br/><i>MLP: 17→128→64→1</i>"]
196
- HP_WEIGHTS["human_pattern_classifier.pt"]
197
- end
198
-
199
- subgraph MAIN_TRAIN["Main Model Training"]
200
- DATASET["WritingCorrectionDataset"]
201
- COMBINED_LOSS["Combined Loss Function"]
202
- L_CE["L_CE<br/><i>cross-entropy</i>"]
203
- L_STYLE["λ₁ · L_style<br/><i>style consistency</i>"]
204
- L_SEM["λ₂ · L_semantic<br/><i>meaning preservation</i>"]
205
- L_HUMAN["λ₃ · L_human_pattern<br/><i>anti-AI penalty</i>"]
206
- TRAINER["CorrectionTrainer"]
207
- CALLBACKS["Callbacks<br/><i>StyleMetrics,<br/>EarlyStoppingOnStyleDrift</i>"]
208
- end
209
-
210
- subgraph EVAL["Evaluation"]
211
- ERRANT["ERRANT Evaluator<br/><i>P / R / F₀.₅</i>"]
212
- GLEU["GLEU Scorer"]
213
- STYLE_MET["Style Metrics<br/><i>cosine similarity</i>"]
214
- AUTH_VER["Authorship Verifier<br/><i>AI detection resistance</i>"]
215
- end
216
-
217
- RAW_DATA --> PREPROC_SCRIPT --> TRAIN_JSONL
218
- KAGGLE --> FEAT_EXTRACT
219
- GPT2 --> FEAT_EXTRACT --> HP_CLASSIFIER --> HP_WEIGHTS
220
- TRAIN_JSONL --> DATASET --> TRAINER
221
- L_CE --> COMBINED_LOSS
222
- L_STYLE --> COMBINED_LOSS
223
- L_SEM --> COMBINED_LOSS
224
- HP_WEIGHTS -.->|"frozen"| L_HUMAN --> COMBINED_LOSS
225
- COMBINED_LOSS --> TRAINER
226
- CALLBACKS --> TRAINER
227
- TRAINER --> EVAL
228
- end
229
-
230
- %% ── API Layer ────────────────────────────────────────────────────
231
- subgraph API["🌐 FastAPI Server"]
232
- ENDPOINT["/correct endpoint"]
233
- SCHEMAS["Request / Response Schemas"]
234
- MIDDLEWARE["Rate Limiting & CORS"]
235
- CORRECTOR["Corrector<br/><i>orchestrates full pipeline</i>"]
236
- end
237
-
238
- ENDPOINT --> CORRECTOR --> INFERENCE
239
- TRAINER -->|"best_model/"| BASE_MODEL
240
-
241
- %% ── Styling ──────────────────────────────────────────────────────
242
- classDef pipeline fill:#1a1a2e,stroke:#16213e,color:#e94560,stroke-width:2px
243
- classDef module fill:#0f3460,stroke:#533483,color:#e2e2e2,stroke-width:1px
244
- classDef data fill:#1a1a2e,stroke:#e94560,color:#eee,stroke-width:1px
245
- classDef output fill:#533483,stroke:#e94560,color:#fff,stroke-width:2px
246
-
247
- class INPUT,RAW_DATA,KAGGLE,TRAIN_JSONL data
248
- class OUTPUT,HP_WEIGHTS output
249
- ```
250
 
251
- ---
252
 
253
- ## Design Choices & Rationale
254
 
255
- ### Why Flan-T5-Small?
 
 
256
 
257
- | Consideration | Decision |
258
- |---------------|----------|
259
- | **Hardware constraint** | RTX 3050 Laptop GPU (4GB VRAM) — rules out models > 500M params |
260
- | **Architecture** | Encoder-decoder (seq2seq) is ideal for text-to-text correction tasks |
261
- | **Instruction tuning** | Flan-T5 is pre-trained on 1,800+ instruction tasks — follows correction prompts naturally |
262
- | **LoRA efficiency** | Only 1.28M trainable params (1.63%) — fits in 4GB with batch_size=4 + bf16 |
263
 
264
- ### Why LoRA over Full Fine-Tuning?
265
 
266
- - **Memory**: Full fine-tuning of T5-Small requires ~2.5GB for gradients alone; LoRA needs ~200MB
267
- - **Speed**: LoRA converges in 5 epochs (~1,515 steps) on a single RTX 3050
268
- - **Merging**: LoRA weights merge into base model at inference time — zero latency overhead
269
- - **Configuration**: `r=8, alpha=16, dropout=0.05`, targeting all attention + FFN projections (`q, k, v, o, wi_0, wi_1, wo`)
270
 
271
- ### Why a Combined Multi-Objective Loss?
272
 
273
- The system uses a 4-term loss function: `L = L_CE + 0.3·L_style + 0.5·L_semantic + 0.4·L_human`
274
 
275
- | Term | Purpose | Weight |
276
- |------|---------|--------|
277
- | `L_CE` | Standard cross-entropy token prediction | 1.0 |
278
- | `L_style` | `1 - cos_sim(output_style, input_style)` — preserves writing fingerprint | 0.3 |
279
- | `L_semantic` | `1 - cos_sim(input_embedding, output_embedding)` — preserves meaning | 0.5 |
280
- | `L_human` | `1 - HumanPatternClassifier(output)` — penalises AI-like text patterns | 0.4 |
281
 
282
- **Why these weights?** Style and human-pattern losses are auxiliary signals; if they are weighted too high, they override grammar correction. The semantic loss is weighted highest (0.5) because meaning preservation is the hardest constraint to satisfy.
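The weighted sum above can be sketched in a few lines. This is a minimal illustration, assuming the style and sentence embeddings are plain lists of floats and that the classifier score lies in [0, 1]. The names `cosine_similarity` and `combined_loss` are hypothetical and are not taken from `loss_functions.py`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def combined_loss(l_ce, style_in, style_out, emb_in, emb_out, human_score,
                  lambda_style=0.3, lambda_semantic=0.5, lambda_human=0.4):
    """L = L_CE + l1 * L_style + l2 * L_semantic + l3 * L_human (weights from the table)."""
    l_style = 1.0 - cosine_similarity(style_out, style_in)    # style drift
    l_semantic = 1.0 - cosine_similarity(emb_in, emb_out)     # meaning drift
    l_human = 1.0 - human_score                               # 0 = AI-like, 1 = human-like
    return (l_ce
            + lambda_style * l_style
            + lambda_semantic * l_semantic
            + lambda_human * l_human)
```

When the output matches the input in style and meaning and the classifier scores it as fully human, the three auxiliary terms vanish and the total reduces to `L_CE`.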
283
 
284
- ### Why a Human Pattern Classifier?
285
 
286
- AI-generated text has detectable statistical signatures:
287
- - **Lower GPT-2 perplexity** (AI text is more "predictable")
288
- - **Lower burstiness** (AI has uniform sentence lengths; humans vary)
289
- - **Higher AI marker density** (overuse of "delve", "leverage", "furthermore")
290
- - **Lower n-gram novelty** (AI reuses phrases more)
291
 
292
- The classifier is a 3-layer MLP (17→128→64→1) pre-trained on ~100k samples from two Kaggle datasets (Shanegerami AI_Human.csv + Starblasters8), then **frozen** during main training. Its output score (0=AI, 1=human) is used as a reward signal.
293
 
294
- ### Why Sentence-Chunked Inference?
295
 
296
- The model was trained with `max_input_length=128` tokens. The task prefix alone consumes ~40 tokens, leaving ~86 tokens for actual text. Long inputs are:
297
 
298
- 1. Split into sentences using spaCy
299
- 2. Grouped into chunks that fit the 128-token budget
300
- 3. Each chunk is corrected independently
301
- 4. Results are joined back together
302
 
303
- This prevents the model from seeing out-of-distribution input lengths and avoids truncation artifacts.
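The grouping step above can be sketched as a greedy packer. This is an illustration under assumptions: sentences are already split (the README uses spaCy for that), `count_tokens` is a whitespace stand-in for the T5 tokenizer, and `chunk_sentences` is a hypothetical name. The default budget of 86 follows the "~86 tokens" figure quoted above.

```python
def count_tokens(text):
    """Stand-in token counter (whitespace split); a real pipeline would use the T5 tokenizer."""
    return len(text.split())


def chunk_sentences(sentences, budget=86, counter=count_tokens):
    """Greedily pack consecutive sentences into chunks of at most `budget` tokens.

    A single sentence longer than the budget becomes its own chunk and would
    still be truncated by the model.
    """
    chunks, current, used = [], [], 0
    for sentence in sentences:
        n = counter(sentence)
        # Flush the current chunk if adding this sentence would exceed the budget.
        if current and used + n > budget:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(sentence)
        used += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Each returned chunk would then be corrected independently and the outputs joined in order.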
304
 
305
- ### Why Post-Generation Vocabulary Elevation?
306
 
307
- Rather than relying solely on the model to produce academic vocabulary (which T5-Small lacks the capacity for), we apply a separate **BERT-based lexical substitution** pipeline:
308
 
309
- 1. POS-tag the output with spaCy
310
- 2. Identify non-AWL content words (nouns, verbs, adjectives, adverbs)
311
- 3. Mask each candidate → run BERT fill-mask → filter to AWL-only predictions
312
- 4. Accept substitution only if `semantic_similarity > 0.82` (measured with `all-mpnet-base-v2`)
313
- 5. Track used substitutions to prevent duplicate replacements
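Steps 3 to 5 above amount to a filter followed by a threshold gate. The sketch below assumes the fill-mask candidates are already available as a ranked list and that similarity is supplied as a callable. `select_substitution` and its parameters are hypothetical names, not the API of `lexical_substitution.py`.

```python
def select_substitution(original, candidates, awl_words, similarity, used,
                        threshold=0.82):
    """Return the first AWL candidate that passes the similarity gate, else the original.

    candidates: words proposed by a fill-mask model, best first (assumed input).
    awl_words:  set of lower-cased Academic Word List entries.
    similarity: callable(original, candidate) -> float (assumed interface).
    used:       set of substitutions already made; updated in place to avoid repeats.
    """
    for word in candidates:
        key = word.lower()
        # Skip the unchanged word, non-AWL words, and words already used.
        if key == original.lower() or key not in awl_words or key in used:
            continue
        if similarity(original, word) > threshold:
            used.add(key)
            return word
    return original
```

Passing the `used` set across calls is what prevents the same academic word from being substituted repeatedly in one text.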
314
 
315
- ---
316
 
317
- ## Quick Start
318
 
319
- ### Prerequisites
320
 
321
- - Python ≥ 3.10
322
- - NVIDIA GPU with ≥ 4GB VRAM (or CPU, slower)
323
- - ~10GB disk space for models and datasets
324
 
325
- ### Option A: Automated Training Pipeline
326
 
327
- ```bash
328
- # Clone and setup
329
- git clone https://huggingface.co/morpheuslord/rewriter && cd rewriter
330
- pip install -r requirements.txt
331
 
332
- # Set W&B key (optional, for experiment tracking)
333
- export WANDB_API_KEY="your-key-here"
334
 
335
- # Run the full 5-stage pipeline
336
- bash train.sh
337
- ```
338
 
339
- The orchestrator handles: **Setup → Preprocessing → Human Pattern Pre-training → Model Training → Evaluation**
340
 
341
- Each stage has a checkpoint system — if interrupted, re-run `train.sh` and select `[S]kip` for completed stages.
342
 
343
- ### Option B: Manual Step-by-Step
344
 
345
- ```bash
346
- # 1. Install dependencies
347
- pip install -r requirements.txt
348
- python -m spacy download en_core_web_sm
349
 
350
- # 2. Preprocess datasets (FCE, W&I+LOCNESS, JFLEG → unified JSONL)
351
- python scripts/preprocess_data.py
352
 
353
- # 3. Pre-train the human pattern classifier
354
- python scripts/pretrain_human_pattern_classifier.py
355
 
356
- # 4. Train the correction model
357
- PYTHONPATH=. python scripts/train.py --config configs/training_config.yaml --use-v2-loss
358
 
359
- # 5. Merge LoRA adapter into base model for inference
360
- python -c "
361
- from peft import PeftModel
362
- from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
363
- import torch
364
- model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-small', torch_dtype=torch.bfloat16)
365
- model = PeftModel.from_pretrained(model, 'checkpoints/checkpoint-BEST')
366
- model = model.merge_and_unload()
367
- model.save_pretrained('checkpoints/best_model_merged')
368
- AutoTokenizer.from_pretrained('google/flan-t5-small').save_pretrained('checkpoints/best_model_merged')
369
- "
370
 
371
- # 6. Run inference
372
- PYTHONPATH=. python scripts/run_inference.py --text "The studnet recieved alot of informtion."
373
 
374
- # 7. Or start the API server
375
- PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000
376
- ```
377
 
378
- ---
379
 
380
- ## Training Pipeline (5 Stages)
381
 
382
- ### Stage 1: Setup & Dependencies
383
- Installs Python packages, downloads spaCy models (`en_core_web_sm`), and NLTK tokenizers.
384
 
385
- ### Stage 2: Data Preprocessing
386
- Converts raw datasets into unified JSONL format:
387
 
388
- | Dataset | Source | Format | Pairs |
389
- |---------|--------|--------|-------|
390
- | **FCE v2.1** | BEA-2019 Shared Task | Character-level edits | ~28k |
391
- | **W&I+LOCNESS v2.1** | BEA-2019 Shared Task | Character-level edits | ~34k |
392
- | **JFLEG** | Johns Hopkins | 4 reference corrections per source | ~5k |
393
 
394
- Output schema: `{"input": "erroneous text", "target": "corrected text", "source": "fce|wi_locness|jfleg"}`
395
 
396
- Split: 90% train / 10% held out. Half of the held-out portion (capped at 500 examples) is used as the test set; the remainder is the validation set.
397
 
398
- ### Stage 3: Human Pattern Classifier Pre-Training
399
- Trains a binary MLP classifier on ~100k human vs AI text samples; the classifier is then frozen for main training. Uses 17 features:
400
 
401
- ```
402
- [perplexity, burstiness, sentence_starter_diversity,
403
- bigram_novelty, trigram_novelty, 4gram_novelty,
404
- ai_marker_density, overused_discourse_density,
405
- em_dash_rate, ellipsis_rate, comma_rate, semicolon_rate,
406
- word_count, sentence_count, mean_sent_length, std_sent_length, ttr]
407
- ```
408
 
409
- GPT-2 perplexity is computed in batched GPU forward passes. Text features are extracted in parallel via `ProcessPoolExecutor`.
410
 
411
- ### Stage 4: Main Model Training
412
- Fine-tunes Flan-T5-Small with LoRA using the V2 combined loss. Key hyperparameters:
413
 
414
- | Parameter | Value |
415
- |-----------|-------|
416
- | Effective batch size | 32 (4 × 8 gradient accumulation) |
417
- | Learning rate | 3e-4 (cosine schedule, 5% warmup) |
418
- | Precision | bf16 (Ampere+ GPUs) |
419
- | Max input tokens | 128 |
420
- | Max target tokens | 128 |
421
- | Epochs | 5 |
422
- | Eval/Save interval | Every 100 steps |
423
 
424
- ### Stage 5: Evaluation
425
- Runs on test set with metrics: GLEU, BERTScore F1, ERRANT F0.5, Style Similarity, AWL Coverage.
426
 
427
- ---
428
 
429
- ## Inference Pipeline (7 Steps)
430
-
431
- ```
432
- Raw Text
433
-   │
434
-   ▼
435
- 1. Preprocessing ─────── LanguageTool spell correction + spaCy parsing
436
-   │
437
-   ▼
438
- 2. Style Fingerprinting ─ Extract 41 features → MLP → 512-dim vector
439
-   │
440
-   ▼
441
- 3. Sentence-Chunked Generation ─ Split into 128-token chunks → Flan-T5 → rejoin
442
-   │
443
-   ▼
444
- 4. Post-Processing ───── Remove artifacts, replace em dashes, fix spacing
445
-   │
446
-   ▼
447
- 5. Vocabulary Elevation ─ BERT fill-mask → AWL filtering → semantic gate
448
-   │
449
-   ▼
450
- 6. Register Filtering ── Expand contractions, replace colloquialisms
451
-   │
452
-   ▼
453
- 7. Metrics ──────────── Style similarity, AWL coverage, readability scores
454
-   │
455
-   ▼
456
- Corrected Text
457
- ```
458
 
459
- ---
460
 
461
- ## Configuration Reference
462
-
463
- ### `configs/training_config.yaml`
464
-
465
- ```yaml
466
- model:
467
- key: "flan-t5-small" # flan-t5-xl | flan-t5-large | flan-t5-base | flan-t5-small
468
- quantize: false # 4-bit NF4 quantization (needs GPU)
469
- use_lora: true # Parameter-efficient fine-tuning
470
-
471
- lora:
472
- r: 8 # LoRA rank (higher = more capacity, more VRAM)
473
- lora_alpha: 16 # Scaling factor (usually 2×r)
474
- lora_dropout: 0.05 # Regularisation
475
- target_modules: [q, v, k, o, wi_0, wi_1, wo] # All attention + FFN layers
476
-
477
- training:
478
- per_device_train_batch_size: 4
479
- gradient_accumulation_steps: 8 # Effective batch = 32
480
- learning_rate: 3.0e-4
481
- lr_scheduler_type: cosine
482
- bf16: true # Use bfloat16 on Ampere+ GPUs
483
-
484
- loss:
485
- lambda_style: 0.3 # Style preservation weight
486
- lambda_semantic: 0.5 # Meaning preservation weight
487
- lambda_human_pattern: 0.4 # Anti-AI penalty weight
488
- ```
489
-
490
- ### `configs/inference_config.yaml`
491
-
492
- ```yaml
493
- model:
494
- key: "flan-t5-small"
495
- checkpoint_path: "checkpoints/best_model_merged"
496
- use_lora: false # Merged model — no adapter needed
497
-
498
- generation:
499
- num_beams: 5 # Beam search width
500
- length_penalty: 1.2 # > 1.0 rewards longer outputs
501
- no_repeat_ngram_size: 3 # Prevents repetition
502
- max_new_tokens: 128 # Must match training max_target_length
503
-
504
- vocabulary:
505
- semantic_threshold: 0.82 # Minimum cosine similarity for AWL substitution
506
- ```
507
 
508
- ---
509
 
510
- ## API Usage
511
 
512
- ```bash
513
- # Start the server
514
- PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000
515
 
516
- # Correct text
517
- curl -X POST http://localhost:8000/correct \
518
- -H "Content-Type: application/json" \
519
- -d '{"text": "The studnet recieved alot of informtion.", "style_alpha": 0.6}'
520
 
521
- # Health check
522
- curl http://localhost:8000/health
523
- ```
524
 
525
- Interactive docs available at `http://localhost:8000/docs`.
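The `curl` call above can also be issued from Python with only the standard library. This is a sketch: the URL and the request fields mirror the `curl` example, while the response schema is not documented here, so the parsed JSON is returned as-is. `build_correct_request` and `correct` are hypothetical helper names.

```python
import json
import urllib.request


def build_correct_request(text, style_alpha=0.6, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the /correct endpoint."""
    payload = json.dumps({"text": text, "style_alpha": style_alpha}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/correct",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def correct(text, **kwargs):
    """Send the request and decode the JSON reply (requires the server to be running)."""
    with urllib.request.urlopen(build_correct_request(text, **kwargs)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending makes the payload easy to inspect without a running server.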
 
 
 
 
526
 
527
- ---
528
 
529
- ## Hardware Requirements
530
 
531
- | Tier | GPU | Model | Training Time |
532
- |------|-----|-------|---------------|
533
- | **Tested** | RTX 3050 4GB | Flan-T5-Small + LoRA | ~45 min (5 epochs) |
534
- | Recommended | RTX 3090 24GB | Flan-T5-Base + LoRA | ~2h |
535
- | Maximum | A100 80GB | Flan-T5-XL + LoRA | ~12h |
536
 
537
- CPU inference is supported but significantly slower (~30s per correction vs ~2s on GPU).
538
 
539
- ---
540
 
541
- ## Data Sources
542
 
543
- | Dataset | Type | Size | Source |
544
- |---------|------|------|--------|
545
- | FCE v2.1 | Learner errors + corrections | ~28k pairs | Cambridge English |
546
- | W&I+LOCNESS v2.1 | Learner errors + corrections | ~34k pairs | BEA-2019 Shared Task |
547
- | JFLEG | Fluency corrections (4 refs) | ~5k pairs | Johns Hopkins |
548
- | Shanegerami AI_Human.csv | Human vs AI classification | ~50k samples | Kaggle |
549
- | Starblasters8 data.parquet | Human vs AI classification | ~50k samples | Kaggle |
550
- | Coxhead AWL | Academic Word List | 570 families / 549 headwords | Victoria University |
551
 
552
- ---
553
 
554
- ## Dyslexia Error Simulation
555
 
556
- The `DyslexiaSimulator` generates synthetic training data based on research by Rello et al. (2013, 2017):
557
 
558
- | Error Type | Frequency | Example |
559
- |-----------|-----------|---------|
560
- | Phonetic substitution | 35% | "because" → "becaus" |
561
- | Letter transposition | 18% | "the" → "teh" |
562
- | Letter omission | 16% | "important" → "importnt" |
563
- | Letter doubling | 12% | "letter" → "lettter" |
564
- | Letter reversal (b/d, p/q) | 10% | "bad" → "dad" |
565
- | Word boundary errors | 9% | "a lot" → "alot" |
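The frequency table above maps naturally onto weighted sampling. The sketch below is not the actual `DyslexiaSimulator`; the function names are hypothetical, and only three of the six error types are implemented, to show how the examples in the table could be produced.

```python
import random

# Error-type frequencies in percent, copied from the table above.
ERROR_FREQUENCIES = {
    "phonetic_substitution": 35,
    "letter_transposition": 18,
    "letter_omission": 16,
    "letter_doubling": 12,
    "letter_reversal": 10,
    "word_boundary": 9,
}


def sample_error_type(rng):
    """Draw one error type with probability proportional to its table frequency."""
    names = list(ERROR_FREQUENCIES)
    weights = [ERROR_FREQUENCIES[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def transpose(word, i):
    """Swap the letters at positions i and i + 1 (no-op if i is out of range)."""
    if not 0 <= i < len(word) - 1:
        return word
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def omit(word, i):
    """Drop the letter at position i (no-op if i is out of range)."""
    if not 0 <= i < len(word):
        return word
    return word[:i] + word[i + 1:]


def double(word, i):
    """Repeat the letter at position i (no-op if i is out of range)."""
    if not 0 <= i < len(word):
        return word
    return word[:i + 1] + word[i] + word[i + 1:]
```

For example, `transpose("the", 1)` reproduces the "teh" row of the table, and a seeded `random.Random` keeps generated corpora reproducible.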
566
 
567
- ---
568
 
569
- ## Style Fingerprint Vector
570
 
571
- The 512-dimensional style vector captures 41 raw features:
572
 
573
- | Group | Features | Count |
574
- |-------|----------|-------|
575
- | Sentence stats | mean, std, skew of sentence lengths | 3 |
576
- | Word stats | mean, std of word lengths | 2 |
577
- | Lexical | type-token ratio, lexical density | 2 |
578
- | Syntactic | passive/active voice ratio, subordinate clause ratio, avg dependency tree depth | 4 |
579
- | Discourse | 20 academic discourse markers (per 100 words) | 20 |
580
- | Register | hedging frequency, formality score, nominalization ratio | 3 |
581
- | Readability | Flesch reading ease, avg syllables per word | 2 |
582
- | Pronouns | first-person ratio, third-person ratio | 2 |
583
- | Other | question ratio, exclamation ratio, AWL coverage | 3 |
584
 
585
- These are projected through a 2-layer MLP (`41 → 256 → 512`) with LayerNorm and GELU activation, then L2-normalised.
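The projection described above can be illustrated in pure Python. The layer order (linear, LayerNorm, GELU, linear, L2 normalisation) is an assumption, since the sentence does not say where LayerNorm sits; the weights are random placeholders, and all function names are hypothetical rather than taken from `fingerprinter.py`.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def layer_norm(x, eps=1e-5):
    """Normalise to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gelu(x):
    """Exact GELU using the Gaussian error function."""
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]


def l2_normalise(x):
    """Scale the vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x] if norm > 0 else x


def init_layer(rng, n_in, n_out):
    """Random placeholder weights; a trained model would load these instead."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def style_vector(features, layer1, layer2):
    """Project 41 raw features to a unit-length 512-dim vector (41 -> 256 -> 512)."""
    hidden = gelu(layer_norm(linear(features, *layer1)))
    return l2_normalise(linear(hidden, *layer2))
```

Because the output has unit length, the cosine similarity between two style vectors reduces to their dot product.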
586
 
587
- ---
 
 
 
 
 
 
 
 
 
 
 
 
588
 
589
- ## Known Limitations
 
590
 
591
- 1. **Model capacity**: Flan-T5-Small (77M params) has limited correction ability compared to larger models
592
- 2. **Training window**: 128-token max input means very long sentences may be split mid-clause
593
- 3. **Vocabulary elevation**: BERT fill-mask can suggest semantically inappropriate AWL words; the similarity threshold (0.82) is a trade-off between coverage and accuracy
594
- 4. **Already-correct text**: The model is trained on error→correction pairs; feeding it clean text produces unpredictable output
595
- 5. **LanguageTool latency**: Spell correction takes ~15-20s due to JVM startup on first call
596
- 6. **Semantic drift in correction**: Qualitative evaluation reveals the pipeline can introduce meaning-level errors rather than purely correcting surface errors — e.g. dyslexic phonetic patterns misread by LanguageTool produce plausible-but-wrong word substitutions that corrupt the intended meaning. The Style Similarity metric (0.96) does not capture this failure mode, as it measures surface token overlap rather than semantic faithfulness. Future work should add **BERTScore F1** and **Word Error Rate (WER)** against ground-truth corrections as primary evaluation signals, and a dedicated post-correction **semantic faithfulness check** (cosine similarity between input and output sentence embeddings) to flag and reject meaning-drift before returning output.
 
1
  ---
2
+ base_model: google/flan-t5-small
3
+ library_name: peft
4
  tags:
5
+ - base_model:adapter:google/flan-t5-small
 
 
 
6
  - lora
7
+ - transformers
 
 
 
 
 
 
 
8
  ---
9
 
10
+ # Model Card for Model ID
11
 
12
+ <!-- Provide a quick summary of what the model is/does. -->
13
 
 
14
 
 
15
 
16
+ ## Model Details
 
 
 
17
 
18
+ ### Model Description
19
 
20
+ <!-- Provide a longer summary of what this model is. -->
21
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
22
 
 
23
 
24
+ - **Developed by:** [More Information Needed]
25
+ - **Funded by [optional]:** [More Information Needed]
26
+ - **Shared by [optional]:** [More Information Needed]
27
+ - **Model type:** [More Information Needed]
28
+ - **Language(s) (NLP):** [More Information Needed]
29
+ - **License:** [More Information Needed]
30
+ - **Finetuned from model [optional]:** [More Information Needed]
 
31
 
32
+ ### Model Sources [optional]
33
 
34
+ <!-- Provide the basic links for the model. -->
35
 
36
+ - **Repository:** [More Information Needed]
37
+ - **Paper [optional]:** [More Information Needed]
38
+ - **Demo [optional]:** [More Information Needed]
39
 
40
+ ## Uses
 
 
 
 
 
41
 
42
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
43
 
44
+ ### Direct Use
 
 
 
45
 
46
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
47
 
48
+ [More Information Needed]
49
 
50
+ ### Downstream Use [optional]
 
 
 
 
 
51
 
52
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
53
 
54
+ [More Information Needed]
55
 
56
+ ### Out-of-Scope Use
 
 
 
 
57
 
58
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
59
 
60
+ [More Information Needed]
61
 
62
+ ## Bias, Risks, and Limitations
63
 
64
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
 
 
 
65
 
66
+ [More Information Needed]
67
 
68
+ ### Recommendations
69
 
70
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
71
 
72
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
 
 
 
 
73
 
74
+ ## How to Get Started with the Model
75
 
76
+ Use the code below to get started with the model.
77
 
78
+ [More Information Needed]
79
 
80
+ ## Training Details
 
 
81
 
82
+ ### Training Data
83
 
84
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
 
 
 
85
 
86
+ [More Information Needed]
 
87
 
88
+ ### Training Procedure
 
 
89
 
90
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
91
 
92
+ #### Preprocessing [optional]
93
 
94
+ [More Information Needed]
95
 
 
 
 
 
96
 
97
+ #### Training Hyperparameters
 
98
 
99
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
 
100
 
101
+ #### Speeds, Sizes, Times [optional]
 
102
 
103
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
 
 
 
 
 
 
 
 
 
 
104
 
105
+ [More Information Needed]
 
106
 
107
+ ## Evaluation
 
 
108
 
109
+ <!-- This section describes the evaluation protocols and provides the results. -->
110
 
111
+ ### Testing Data, Factors & Metrics
112
 
113
+ #### Testing Data
 
114
 
115
+ <!-- This should link to a Dataset Card if possible. -->
 
116
 
117
+ [More Information Needed]
 
 
 
 
118
 
119
+ #### Factors
120
 
121
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
122
 
123
+ [More Information Needed]
 
124
 
125
+ #### Metrics
 
 
 
 
 
 
126
 
127
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
128
 
129
+ [More Information Needed]
 
130
 
131
+ ### Results
 
 
 
 
 
 
 
 
132
 
133
+ [More Information Needed]
 
134
 
135
+ #### Summary
136
 
137
 
 
138
 
139
+ ## Model Examination [optional]
 
140
 
141
+ <!-- Relevant interpretability work for the model goes here -->
142
 
143
+ [More Information Needed]
144
 
145
+ ## Environmental Impact
 
 
146
 
147
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
 
 
 
148
 
149
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
 
 
150
 
151
+ - **Hardware Type:** [More Information Needed]
152
+ - **Hours used:** [More Information Needed]
153
+ - **Cloud Provider:** [More Information Needed]
154
+ - **Compute Region:** [More Information Needed]
155
+ - **Carbon Emitted:** [More Information Needed]
156
 
157
+ ## Technical Specifications [optional]

+ ### Model Architecture and Objective

+ [More Information Needed]

+ ### Compute Infrastructure

+ [More Information Needed]

+ #### Hardware

+ [More Information Needed]

+ #### Software

+ [More Information Needed]

+ ## Citation [optional]

+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

+ **BibTeX:**

+ [More Information Needed]

+ **APA:**

+ [More Information Needed]

+ ## Glossary [optional]

+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+
+ [More Information Needed]
+
+ ## More Information [optional]
+
+ [More Information Needed]
+
+ ## Model Card Authors [optional]
+
+ [More Information Needed]
+
+ ## Model Card Contact

+ [More Information Needed]
+ ### Framework versions

+ - PEFT 0.19.1

adapter_config.json ADDED
@@ -0,0 +1,48 @@
+ {
+   "alora_invocation_tokens": null,
+   "alpha_pattern": {},
+   "arrow_config": null,
+   "auto_mapping": null,
+   "base_model_name_or_path": "google/flan-t5-small",
+   "bias": "none",
+   "corda_config": null,
+   "ensure_weight_tying": false,
+   "eva_config": null,
+   "exclude_modules": null,
+   "fan_in_fan_out": false,
+   "inference_mode": true,
+   "init_lora_weights": true,
+   "layer_replication": null,
+   "layers_pattern": null,
+   "layers_to_transform": null,
+   "loftq_config": {},
+   "lora_alpha": 32,
+   "lora_bias": false,
+   "lora_dropout": 0.05,
+   "lora_ga_config": null,
+   "megatron_config": null,
+   "megatron_core": "megatron.core",
+   "modules_to_save": null,
+   "peft_type": "LORA",
+   "peft_version": "0.19.1",
+   "qalora_group_size": 16,
+   "r": 16,
+   "rank_pattern": {},
+   "revision": null,
+   "target_modules": [
+     "wo",
+     "o",
+     "q",
+     "wi_0",
+     "v",
+     "k",
+     "wi_1"
+   ],
+   "target_parameters": null,
+   "task_type": "SEQ_2_SEQ_LM",
+   "trainable_token_indices": null,
+   "use_bdlora": null,
+   "use_dora": false,
+   "use_qalora": false,
+   "use_rslora": false
+ }
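
As a rough sanity check on the adapter config above, the number of LoRA parameters it implies can be estimated by hand. The sketch below is not part of the commit. It assumes the layer shapes of the public `google/flan-t5-small` configuration (`d_model=512`, `d_kv=64`, `num_heads=6`, `d_ff=1024`, 8 encoder and 8 decoder layers) and assumes every module listed in `target_modules` receives one rank-`r` pair of matrices with no bias terms.

```python
# Estimate LoRA parameter count from adapter_config.json values.
# ASSUMPTION: shapes come from the public google/flan-t5-small config,
# not from this commit.
R = 16           # "r" in adapter_config.json
LORA_ALPHA = 32  # "lora_alpha" in adapter_config.json

D_MODEL, D_KV, NUM_HEADS, D_FF = 512, 64, 6, 1024
ENC_LAYERS = DEC_LAYERS = 8
INNER = D_KV * NUM_HEADS  # attention projection width: 384

# (in_features, out_features) for each targeted linear module
SHAPES = {
    "q": (D_MODEL, INNER),
    "k": (D_MODEL, INNER),
    "v": (D_MODEL, INNER),
    "o": (INNER, D_MODEL),
    "wi_0": (D_MODEL, D_FF),
    "wi_1": (D_MODEL, D_FF),
    "wo": (D_FF, D_MODEL),
}


def lora_params(name, r=R):
    """LoRA adds A (r x in) and B (out x r) per wrapped linear layer."""
    fan_in, fan_out = SHAPES[name]
    return r * (fan_in + fan_out)


attention = sum(lora_params(m) for m in ("q", "k", "v", "o"))
feed_forward = sum(lora_params(m) for m in ("wi_0", "wi_1", "wo"))

# Encoder layers have self-attention only; decoder layers add cross-attention.
encoder = ENC_LAYERS * (attention + feed_forward)
decoder = DEC_LAYERS * (2 * attention + feed_forward)
total = encoder + decoder

# With use_rslora=false the update is scaled by lora_alpha / r.
scaling = LORA_ALPHA / R

print(total)      # 2555904
print(total * 4)  # 10223616 bytes if stored as float32
print(scaling)    # 2.0
```

Under these assumptions the float32 payload comes to about 10.2 MB, slightly below the 10264128-byte `adapter_model.safetensors` file added in this commit; the remainder would plausibly be the safetensors header, though that is not verified here.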
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:739806c54db7ce3ca21af4278e4160f3ed7feff9f6e09ad03beae7b26aa457c4
+ size 10264128
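
The three lines above are a Git LFS pointer rather than the weights themselves: each line is a key, a space, and a value. A minimal sketch of reading such a pointer follows; `parse_lfs_pointer` is a hypothetical helper written for illustration, not a function from Git LFS or any Hugging Face library.

```python
def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into its fields (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid value has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:739806c54db7ce3ca21af4278e4160f3ed7feff9f6e09ad03beae7b26aa457c4
size 10264128
"""

info = parse_lfs_pointer(POINTER)
print(info["hash_algo"], info["size"])  # sha256 10264128
```

After downloading the actual file, its SHA-256 digest and byte size could be compared against these fields to confirm the download is intact.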
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,114 @@
+ {
+   "backend": "tokenizers",
+   "eos_token": "</s>",
+   "extra_ids": 100,
+   "extra_special_tokens": [
+     "<extra_id_0>",
+     "<extra_id_1>",
+     "<extra_id_2>",
+     "<extra_id_3>",
+     "<extra_id_4>",
+     "<extra_id_5>",
+     "<extra_id_6>",
+     "<extra_id_7>",
+     "<extra_id_8>",
+     "<extra_id_9>",
+     "<extra_id_10>",
+     "<extra_id_11>",
+     "<extra_id_12>",
+     "<extra_id_13>",
+     "<extra_id_14>",
+     "<extra_id_15>",
+     "<extra_id_16>",
+     "<extra_id_17>",
+     "<extra_id_18>",
+     "<extra_id_19>",
+     "<extra_id_20>",
+     "<extra_id_21>",
+     "<extra_id_22>",
+     "<extra_id_23>",
+     "<extra_id_24>",
+     "<extra_id_25>",
+     "<extra_id_26>",
+     "<extra_id_27>",
+     "<extra_id_28>",
+     "<extra_id_29>",
+     "<extra_id_30>",
+     "<extra_id_31>",
+     "<extra_id_32>",
+     "<extra_id_33>",
+     "<extra_id_34>",
+     "<extra_id_35>",
+     "<extra_id_36>",
+     "<extra_id_37>",
+     "<extra_id_38>",
+     "<extra_id_39>",
+     "<extra_id_40>",
+     "<extra_id_41>",
+     "<extra_id_42>",
+     "<extra_id_43>",
+     "<extra_id_44>",
+     "<extra_id_45>",
+     "<extra_id_46>",
+     "<extra_id_47>",
+     "<extra_id_48>",
+     "<extra_id_49>",
+     "<extra_id_50>",
+     "<extra_id_51>",
+     "<extra_id_52>",
+     "<extra_id_53>",
+     "<extra_id_54>",
+     "<extra_id_55>",
+     "<extra_id_56>",
+     "<extra_id_57>",
+     "<extra_id_58>",
+     "<extra_id_59>",
+     "<extra_id_60>",
+     "<extra_id_61>",
+     "<extra_id_62>",
+     "<extra_id_63>",
+     "<extra_id_64>",
+     "<extra_id_65>",
+     "<extra_id_66>",
+     "<extra_id_67>",
+     "<extra_id_68>",
+     "<extra_id_69>",
+     "<extra_id_70>",
+     "<extra_id_71>",
+     "<extra_id_72>",
+     "<extra_id_73>",
+     "<extra_id_74>",
+     "<extra_id_75>",
+     "<extra_id_76>",
+     "<extra_id_77>",
+     "<extra_id_78>",
+     "<extra_id_79>",
+     "<extra_id_80>",
+     "<extra_id_81>",
+     "<extra_id_82>",
+     "<extra_id_83>",
+     "<extra_id_84>",
+     "<extra_id_85>",
+     "<extra_id_86>",
+     "<extra_id_87>",
+     "<extra_id_88>",
+     "<extra_id_89>",
+     "<extra_id_90>",
+     "<extra_id_91>",
+     "<extra_id_92>",
+     "<extra_id_93>",
+     "<extra_id_94>",
+     "<extra_id_95>",
+     "<extra_id_96>",
+     "<extra_id_97>",
+     "<extra_id_98>",
+     "<extra_id_99>"
+   ],
+   "is_local": false,
+   "local_files_only": false,
+   "model_max_length": 512,
+   "pad_token": "<pad>",
+   "sp_model_kwargs": {},
+   "tokenizer_class": "T5Tokenizer",
+   "unk_token": "<unk>"
+ }
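
The `extra_special_tokens` list above follows a fixed naming pattern tied to `extra_ids`. As a small illustration (not code from this commit), the same list can be regenerated from the count, which is a quick way to check that a hand-edited config has not dropped or duplicated a sentinel token:

```python
EXTRA_IDS = 100  # "extra_ids" in tokenizer_config.json

# T5-style sentinel tokens are named <extra_id_0> ... <extra_id_{N-1}>.
sentinels = [f"<extra_id_{i}>" for i in range(EXTRA_IDS)]

print(len(sentinels))  # 100
print(sentinels[0])    # <extra_id_0>
print(sentinels[-1])   # <extra_id_99>
```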