---
language:
- en
tags:
- text2text-generation
- dyslexia
- grammar-correction
- style-preservation
- lora
- flan-t5
license: mit
base_model: google/flan-t5-small
datasets:
- jhu-clsp/jfleg
- bea2019st/wi_locness
pipeline_tag: translation
---

# Dyslexia Academic Writing Correction System

> **A style-preserving, grammar-correcting, academic-vocabulary-elevating AI system that corrects dyslexic writing while maintaining the author's personal voice, tone, and authorship signal — a corrector, not a rewriter.**

## Overview

This system takes text written by dyslexic students and corrects grammar, spelling, and fluency errors while:

1. **Preserving the author's unique writing style** via a 512-dimensional style fingerprint vector
2. **Elevating vocabulary to academic register** using Coxhead's Academic Word List (AWL) and BERT-based lexical substitution
3. **Resisting AI detection** through a frozen Human Pattern Classifier that penalises AI-typical writing during training
4. **Maintaining semantic meaning** with cosine-similarity-based semantic preservation loss

The core model is **Google Flan-T5-Small** fine-tuned with **LoRA** (Low-Rank Adaptation, r=16), trained on real learner error corpora (JFLEG, W&I+LOCNESS) augmented with synthetic dyslexia-simulated data.

---

## Latest Evaluation Results (v2)

| Metric | Score | Description |
|--------|-------|-------------|
| **GLEU** | **0.7506** | Grammar + fluency correction quality |
| **BERTScore F1** | **0.9733** | Semantic closeness to reference corrections |
| **1 − WER** | **0.8488** | Word-level accuracy (WER = 15.12%) |
| **Composite** | **0.8576** | `(GLEU + BERTScore F1 + (1−WER)) / 3` — gating score for Hub push |

> The model is only pushed to the Hub when the composite score strictly beats the saved baseline from the previous run, ensuring the Hub always holds the best-seen weights.
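
The gate itself is a few lines of arithmetic. A minimal sketch, assuming `baseline_score.json` stores the previous composite under a `composite` key (the actual key name may differ):

```python
import json
from pathlib import Path

def composite(gleu: float, bertscore_f1: float, wer: float) -> float:
    """Composite = (GLEU + BERTScore F1 + (1 - WER)) / 3."""
    return (gleu + bertscore_f1 + (1.0 - wer)) / 3.0

def should_push(new_score: float, baseline_path: str = "baseline_score.json") -> bool:
    """Push to the Hub only if the new composite strictly beats the saved baseline."""
    path = Path(baseline_path)
    baseline = json.loads(path.read_text())["composite"] if path.exists() else float("-inf")
    return new_score > baseline

score = composite(gleu=0.7506, bertscore_f1=0.9733, wer=0.1512)  # ≈ 0.8576
print(should_push(score))
```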

---

## What Changed in v2

The original model had a critical bug: `CorrectionTrainer.compute_loss()` only used cross-entropy loss. The multi-objective loss (`L_CE + λ_style + λ_semantic + λ_human`) was fully designed in `loss_functions.py` but was **never wired into the trainer**. v2 fixes this and upgrades several other parameters.

| Parameter | v1 (Original) | v2 (Upgraded) |
|-----------|--------------|---------------|
| LoRA rank | r=8, α=16 | **r=16, α=32** |
| Epochs | 5 | **10** |
| Effective batch size | 32 (4×8 accum) | **64 (2×32 accum)** |
| Learning rate | 3e-4 | **2e-4** (more stable over longer run) |
| Warmup ratio | 5% | **10%** |
| Label smoothing | none | **0.1** (reduces overconfidence) |
| Loss function | CE only *(bug)* | **CE + Style + Semantic** *(fixed)* |
| Human-pattern loss | designed, unused | omitted on CPU; falls back to CE+style+sem |
| Evaluation | GLEU only | **GLEU + BERTScore F1 + (1−WER) composite** |
| Eval/save strategy | every 100 steps | **per epoch** |
| Early stopping | none | **patience=3** |
| Hub gate | none | **composite must beat saved baseline** |
| Warm-start strategy | cold start | **merge r=8 adapter → apply fresh r=16 LoRA** |
| Data split | 90%/10% train/val | **88%/7%/5% train/val/test** |
| Dyslexia augmentation error rate | 15% | **20%** |

### Combined Loss (v2)

```
L = L_CE + 0.3·L_style + 0.5·L_semantic
```

The human-pattern loss (`λ₃·L_human`) is kept in the design but skipped on CPU (requires GPT-2 perplexity scoring). Style and semantic losses use a lightweight `StyleMLP` — no spaCy or external models required at training time.

| Term | Purpose | Weight |
|------|---------|--------|
| `L_CE` | Cross-entropy with label smoothing (0.1) | 1.0 |
| `L_style` | `1 − cos_sim(style(input), style(output))` — preserves writing fingerprint | 0.3 |
| `L_semantic` | `1 − cos_sim(input_emb, output_emb)` — preserves meaning | 0.5 |
| `L_human` | `1 − HumanPatternClassifier(output)` — anti-AI penalty | 0.4 *(GPU only)* |
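
For intuition, here is a simplified sketch of how the active terms combine inside `compute_loss`; the `style_mlp` call signature and the mean-pooled representations are assumptions for illustration, not the exact code in `loss_functions.py`:

```python
import torch
import torch.nn.functional as F

def combined_loss(ce_loss: torch.Tensor,
                  input_repr: torch.Tensor,
                  output_repr: torch.Tensor,
                  style_mlp: torch.nn.Module,
                  lambda_style: float = 0.3,
                  lambda_semantic: float = 0.5) -> torch.Tensor:
    """L = L_CE + 0.3*L_style + 0.5*L_semantic (CPU variant, no human-pattern term).

    input_repr / output_repr: mean-pooled token embeddings of the source text and
    the model output, shape (batch, hidden).
    """
    # Style loss: 1 - cosine similarity between the two style fingerprints
    style_in, style_out = style_mlp(input_repr), style_mlp(output_repr)
    l_style = (1.0 - F.cosine_similarity(style_in, style_out, dim=-1)).mean()

    # Semantic loss: 1 - cosine similarity between the pooled embeddings themselves
    l_semantic = (1.0 - F.cosine_similarity(input_repr, output_repr, dim=-1)).mean()

    return ce_loss + lambda_style * l_style + lambda_semantic * l_semantic
```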

### Warm-Start Merge Strategy

Rather than fine-tuning from scratch, v2 preserves the corrections learned by the original r=8 adapter:

1. Load existing LoRA adapter (r=8) from Hub
2. Merge adapter weights into base model (`merge_and_unload()`)
3. Apply a **fresh LoRA at r=16** on top of the merged base
4. Train with combined loss for 10 epochs

This doubles the adapter's representational capacity while retaining previously learned correction patterns.
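
In PEFT terms the warm start looks roughly like the sketch below; the adapter is assumed to live in this repo (`morpheuslord/rewrite`), and the exact code in `train_and_upgrade.py` may differ:

```python
from peft import LoraConfig, PeftModel, get_peft_model
from transformers import AutoModelForSeq2SeqLM

# Steps 1-2: load the existing r=8 adapter from the Hub and merge it into the base weights
base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
merged = PeftModel.from_pretrained(base, "morpheuslord/rewrite").merge_and_unload()

# Step 3: apply a fresh r=16 LoRA on top of the merged base
config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "k", "v", "o", "wi_0", "wi_1", "wo"],
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(merged, config)

# Step 4: train `model` with the combined loss for 10 epochs
```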

---

## Features

| Feature | Description |
|---------|-------------|
| **Two-pass spell correction** | Dyslexia-aware phonetic pattern handling via LanguageTool |
| **Style fingerprinting** | 41 raw features → MLP → 512-dim L2-normalised style vector |
| **LoRA fine-tuning** | r=16, α=32, dropout=0.05 — targeting all attention + FFN projections |
| **Academic vocabulary elevation** | BERT fill-mask → AWL candidate filtering → semantic similarity gate |
| **Human pattern anti-AI loss** | Pre-trained frozen MLP classifier (17-dim features including GPT-2 perplexity) |
| **Combined training loss** | `L_CE + λ₁·L_style + λ₂·L_semantic (+ λ₃·L_human on GPU)` |
| **Sentence-chunked inference** | Long texts split into 128-token chunks matching training window |
| **FastAPI server** | RESTful `/correct` endpoint with CORS and rate limiting |
| **Multi-stage training** | Orchestrated via `train.sh` with checkpoint system (Skip/Redo/Continue) |
| **Synthetic data augmentation** | `DyslexiaSimulator` generates realistic errors from clean text (20% error rate) |
| **Composite score gating** | Hub push only if new model strictly beats saved baseline |

---

## Project Structure

```
Rewriter/
├── configs/
│   ├── training_config.yaml          # Full training hyperparameters
│   ├── training_config_fast.yaml     # Quick iteration config
│   ├── inference_config.yaml         # Inference & generation settings
│   ├── model_config.yaml             # Model architecture registry
│   └── awl_config.yaml               # Academic Word List settings
├── scripts/
│   ├── train.py                      # Main training script (Click CLI)
│   ├── evaluate.py                   # Test set evaluation (GLEU, ERRANT, BERTScore)
│   ├── run_inference.py              # Interactive CLI inference
│   ├── preprocess_data.py            # Raw datasets → unified JSONL
│   ├── pretrain_human_pattern_classifier.py  # Stage 3: anti-AI classifier
│   ├── download_datasets.sh          # BEA-2019 dataset downloader
│   └── download_kaggle_datasets.sh   # Kaggle human/AI data downloader
├── src/
│   ├── model/
│   │   ├── base_model.py             # Model loader (T5/BART/Llama + LoRA + quantization)
│   │   ├── style_conditioner.py      # Prefix tuning: style → virtual tokens
│   │   ├── generation_utils.py       # Beam search, sampling, batch generation
│   │   └── lora_adapter.py           # LoRA configuration helpers
│   ├── preprocessing/
│   │   ├── pipeline.py               # Full preprocessing orchestrator
│   │   ├── spell_corrector.py        # LanguageTool + dyslexia-aware correction
│   │   ├── dyslexia_simulator.py     # Synthetic error generation (Rello et al.)
│   │   ├── dependency_parser.py      # spaCy dependency tree analysis
│   │   ├── ner_tagger.py             # Named entity protection
│   │   └── sentence_segmenter.py     # Sentence boundary detection
│   ├── style/
│   │   ├── fingerprinter.py          # 41 features → 512-dim style vector
│   │   ├── style_vector.py           # Style vector dataclass
│   │   ├── formality_classifier.py   # Rule-based formality scoring
│   │   └── emotion_classifier.py     # Emotion detection
│   ├── training/
│   │   ├── dataset.py                # Pre-tokenized cached dataset with style vectors
│   │   ├── trainer.py                # CorrectionTrainer (HF Trainer + PEFT fixes)
│   │   ├── loss_functions.py         # V1 and V2 combined losses
│   │   ├── human_pattern_extractor.py  # 17-dim feature extraction + classifier
│   │   └── callbacks.py              # Evaluation logging callbacks
│   ├── vocabulary/
│   │   ├── lexical_substitution.py   # BERT fill-mask → AWL substitution pipeline
│   │   ├── awl_loader.py             # Coxhead Academic Word List loader
│   │   └── register_filter.py        # Contraction expansion + colloquial replacement
│   ├── inference/
│   │   ├── corrector.py              # End-to-end inference pipeline orchestrator
│   │   └── postprocessor.py          # Cleanup, entity restore, formatting
│   ├── evaluation/
│   │   ├── gleu_scorer.py            # GLEU + BERTScore computation
│   │   ├── errant_evaluator.py       # ERRANT P/R/F0.5 evaluation
│   │   ├── style_metrics.py          # Style similarity + AWL coverage
│   │   └── authorship_verifier.py    # AI detection resistance testing
│   └── api/
│       ├── main.py                   # FastAPI application
│       ├── schemas.py                # Pydantic request/response models
│       └── middleware.py             # Rate limiting + CORS
├── train_and_upgrade.py              # v2 upgrade pipeline (self-improving Hub push)
├── data/
│   ├── raw/                          # Original datasets (JFLEG, W&I+LOCNESS)
│   ├── processed/                    # Unified JSONL (train/val/test splits)
│   ├── cache/                        # Pre-tokenized dataset caches (.pt files)
│   └── awl/                          # Coxhead Academic Word List
├── train.sh                          # Multi-stage training orchestrator
├── start.sh                          # Inference launcher (CLI or API mode)
├── baseline_score.json               # Saved composite score — gate for Hub push
├── Dockerfile                        # Production container
├── docker-compose.yml                # Docker deployment
├── requirements.txt                  # Python dependencies
└── pyproject.toml                    # Project metadata
```

---

## Model Architecture

### PNG:



### Mermaid Diagram:
```mermaid
graph TB
    subgraph INFERENCE["🔮 Inference Pipeline"]
        direction TB
        INPUT["📝 Raw Dyslexic Text"]
        subgraph PREPROCESS["Pre-Processing"]
            SPELL["Spell Corrector<br/><i>dyslexia-aware phonetic</i>"]
            SENT_SEG["Sentence Segmenter"]
            DEP_PARSE["Dependency Parser"]
            NER["NER Tagger"]
        end
        subgraph STYLE["Style Analysis"]
            FINGER["Style Fingerprinter<br/><i>512-dim vector</i>"]
            EMOTION["Emotion Classifier"]
            FORMALITY["Formality Classifier"]
            STYLE_VEC["Style Vector Composer"]
        end
        subgraph GENERATION["Core Generation"]
            STYLE_COND["Style Conditioner<br/><i>prefix tuning</i>"]
            BASE_MODEL["Base LM<br/><i>Flan-T5-Small (warm-merged)</i>"]
            LORA["LoRA Adapter<br/><i>r=16</i>"]
            GEN_UTILS["Generation Utils<br/><i>beam search, sampling</i>"]
        end
        subgraph POSTPROCESS["Post-Processing"]
            POSTPROC["Post-Processor<br/><i>formatting, cleanup</i>"]
            VOCAB_SUB["Lexical Substitution<br/><i>BERT-based</i>"]
            AWL["AWL Loader<br/><i>Coxhead Academic Word List</i>"]
            REG_FILTER["Register Filter<br/><i>academic tone gate</i>"]
        end
        OUTPUT["✅ Corrected Academic Text"]
        INPUT --> SPELL --> SENT_SEG --> DEP_PARSE --> NER
        INPUT --> FINGER --> EMOTION --> FORMALITY --> STYLE_VEC
        NER --> STYLE_COND
        STYLE_VEC --> STYLE_COND
        STYLE_COND --> BASE_MODEL
        LORA -.->|"merged weights"| BASE_MODEL
        BASE_MODEL --> GEN_UTILS --> POSTPROC
        POSTPROC --> VOCAB_SUB
        AWL --> VOCAB_SUB
        VOCAB_SUB --> REG_FILTER --> OUTPUT
    end

    subgraph TRAINING["🏋️ Training Pipeline (v2)"]
        direction TB
        subgraph WARMSTART["Warm-Start Merge"]
            HUB_ADAPTER["Hub LoRA Adapter<br/><i>r=8 (existing)</i>"]
            MERGE["merge_and_unload()"]
            FRESH_LORA["Fresh LoRA r=16"]
        end
        subgraph DATA["Data Pipeline"]
            JFLEG["jhu-clsp/jfleg<br/><i>~5k pairs, 4 refs each</i>"]
            WILOCNESS["bea2019st/wi_locness<br/><i>~34k pairs</i>"]
            DYSLEXIA_AUG["DyslexiaSimulator<br/><i>20% error rate augmentation</i>"]
            SPLIT["88% train / 7% val / 5% test"]
        end
        subgraph LOSS["Combined Loss (v2 — now active)"]
            L_CE["L_CE + label_smoothing=0.1"]
            L_STYLE["0.3 · L_style"]
            L_SEM["0.5 · L_semantic"]
            L_HUMAN["0.4 · L_human<br/><i>(GPU only)</i>"]
        end
        subgraph EVAL["Composite Evaluation"]
            GLEU_E["GLEU"]
            BERT_E["BERTScore F1"]
            WER_E["1 − WER"]
            COMPOSITE["Composite = mean(3)"]
            GATE["Beat baseline?"]
            HUB_PUSH["Push to Hub ✅"]
        end
        HUB_ADAPTER --> MERGE --> FRESH_LORA
        JFLEG --> DYSLEXIA_AUG
        WILOCNESS --> DYSLEXIA_AUG
        DYSLEXIA_AUG --> SPLIT
        L_CE --> COMPOSITE
        L_STYLE --> COMPOSITE
        L_SEM --> COMPOSITE
        GLEU_E --> COMPOSITE
        BERT_E --> COMPOSITE
        WER_E --> COMPOSITE
        COMPOSITE --> GATE --> HUB_PUSH
    end
```

---

## Design Choices & Rationale

### Why Flan-T5-Small?

| Consideration | Decision |
|---------------|----------|
| **Hardware constraint** | RTX 3050 Laptop GPU (4GB VRAM) — rules out models > 500M params |
| **Architecture** | Encoder-decoder (seq2seq) is ideal for text-to-text correction tasks |
| **Instruction tuning** | Flan-T5 is pre-trained on 1,800+ instruction tasks — follows correction prompts naturally |
| **LoRA efficiency** | Trainable params scale with r: r=16 → ~2.56M (3.3%) — still fits in 4GB |

### Why LoRA over Full Fine-Tuning?

- **Memory**: Full fine-tuning of T5-Small requires ~2.5GB for gradients alone; LoRA r=16 needs ~400MB
- **Warm-start safety**: Merging r=8 weights preserves corrections before expanding capacity to r=16
- **Merging**: LoRA weights merge into the base model at inference time — zero latency overhead
- **Configuration**: `r=16, alpha=32, dropout=0.05`, targeting all attention + FFN projections (`q, k, v, o, wi_0, wi_1, wo`)

### Why a Combined Multi-Objective Loss?

The system uses (on CPU): `L = L_CE + 0.3·L_style + 0.5·L_semantic`

On GPU (with the human-pattern classifier available): `L = L_CE + 0.3·L_style + 0.5·L_semantic + 0.4·L_human`

| Term | Purpose | Weight |
|------|---------|--------|
| `L_CE` | Cross-entropy with label smoothing (0.1) | 1.0 |
| `L_style` | `1 − cos_sim(style(input), style(output))` — preserves writing fingerprint | 0.3 |
| `L_semantic` | `1 − cos_sim(input_emb, output_emb)` — preserves meaning | 0.5 |
| `L_human` | `1 − HumanPatternClassifier(output)` — penalises AI-like text patterns | 0.4 |

The style and semantic losses use a lightweight `StyleMLP` (token embedding mean-pool → linear projection) that adds no external dependencies at training time.

### Why a Human Pattern Classifier?

AI-generated text has detectable statistical signatures:

- **Lower GPT-2 perplexity** (AI text is more "predictable")
- **Lower burstiness** (AI has uniform sentence lengths; humans vary)
- **Higher AI marker density** (overuse of "delve", "leverage", "furthermore")
- **Lower n-gram novelty** (AI reuses phrases more)

The classifier is a 3-layer MLP (17→128→64→1) pre-trained on ~100k samples from two Kaggle datasets (Shanegerami AI_Human.csv + Starblasters8), then **frozen** during main training. Its output score (0 = AI, 1 = human) is used as a reward signal. It requires a GPU for GPT-2 perplexity scoring and falls back gracefully on CPU.
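
A minimal sketch of that classifier head (the layer sizes follow the description above; the activation choice is an assumption):

```python
import torch
import torch.nn as nn

class HumanPatternClassifier(nn.Module):
    """3-layer MLP: 17 stylometric features -> probability that the text is human-written."""

    def __init__(self, n_features: int = 17):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),  # 0 = AI-like, 1 = human-like
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)
```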

### Why Sentence-Chunked Inference?

The model was trained with `max_input_length=128` tokens. The task prefix alone consumes ~40 tokens, leaving ~86 tokens for actual text. Long inputs are:

1. Split into sentences using spaCy
2. Grouped into chunks that fit the 128-token budget
3. Corrected chunk by chunk, independently
4. Joined back together
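
A simplified version of that chunking step (the 40-token prefix budget follows the note above; the grouping heuristic is an illustration, not the exact code in `corrector.py`):

```python
import spacy
from transformers import AutoTokenizer

nlp = spacy.load("en_core_web_sm")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

def chunk_sentences(text: str, budget: int = 128, prefix_tokens: int = 40) -> list[str]:
    """Group sentences into chunks that fit the 128-token training window."""
    limit = budget - prefix_tokens
    chunks, current, current_len = [], [], 0
    for sent in nlp(text).sents:
        n = len(tokenizer(sent.text, add_special_tokens=False)["input_ids"])
        if current and current_len + n > limit:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent.text)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# Each chunk is then corrected independently and the outputs are rejoined.
```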

### Why Post-Generation Vocabulary Elevation?

Rather than relying solely on the model to produce academic vocabulary (which T5-Small lacks the capacity for), a separate BERT-based lexical substitution pipeline is applied:

1. POS-tag the output with spaCy
2. Identify non-AWL content words (nouns, verbs, adjectives, adverbs)
3. Mask each candidate → run BERT fill-mask → filter to AWL-only predictions
4. Accept substitution only if `semantic_similarity > 0.82` (measured with `all-mpnet-base-v2`)
5. Track used substitutions to prevent duplicate replacements
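
A condensed sketch of steps 3–4 (the model names follow the description above; the AWL lookup and scoring details are simplified):

```python
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
embedder = SentenceTransformer("all-mpnet-base-v2")

def elevate_word(sentence: str, word: str, awl: set[str], threshold: float = 0.82) -> str:
    """Try to replace `word` with an AWL term without drifting semantically."""
    masked = sentence.replace(word, fill_mask.tokenizer.mask_token, 1)
    for cand in fill_mask(masked, top_k=20):
        token = cand["token_str"].strip().lower()
        if token not in awl or token == word.lower():
            continue
        candidate = sentence.replace(word, token, 1)
        sim = util.cos_sim(embedder.encode(sentence), embedder.encode(candidate)).item()
        if sim > threshold:
            return candidate  # accept the first substitution that passes the semantic gate
    return sentence
```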

---

## Quick Start

### Prerequisites

- Python ≥ 3.10
- NVIDIA GPU with ≥ 4GB VRAM (or CPU, slower)
- ~10GB disk space for models and datasets

### Option A: Self-Improving Upgrade Pipeline (v2)

This pipeline loads the existing Hub adapter, upgrades it, evaluates, and only pushes if it improves.

```bash
git clone https://huggingface.co/morpheuslord/rewrite && cd rewrite
pip install -r requirements.txt

export HF_TOKEN="your-hf-token-with-write-access"
python train_and_upgrade.py
```

The pipeline handles all 10 steps automatically:
**Load adapter → Warm-start merge → Apply r=16 LoRA → Load data → Train → Evaluate → Gate → Save → Merge → Push**

### Option B: Manual Step-by-Step (original pipeline)

```bash
# 1. Install dependencies
pip install -r requirements.txt
python -m spacy download en_core_web_sm

# 2. Preprocess datasets (FCE, W&I+LOCNESS, JFLEG → unified JSONL)
python scripts/preprocess_data.py

# 3. Pre-train the human pattern classifier
python scripts/pretrain_human_pattern_classifier.py

# 4. Train the correction model
PYTHONPATH=. python scripts/train.py --config configs/training_config.yaml --use-v2-loss

# 5. Merge LoRA adapter into base model for inference
python -c "
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import torch
model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-small', torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(model, 'checkpoints/checkpoint-BEST')
model = model.merge_and_unload()
model.save_pretrained('checkpoints/best_model_merged')
AutoTokenizer.from_pretrained('google/flan-t5-small').save_pretrained('checkpoints/best_model_merged')
"

# 6. Run inference
PYTHONPATH=. python scripts/run_inference.py --text "The studnet recieved alot of informtion."

# 7. Or start the API server
PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000
```

---

## Training Pipeline

### v2 Upgrade Pipeline (`train_and_upgrade.py`) — 10 Steps

| Step | Action |
|------|--------|
| 1 | Load existing LoRA adapter (r=8) from Hub |
| 2 | Merge into base weights (`merge_and_unload`) — warm start |
| 3 | Apply fresh LoRA r=16 on merged base |
| 4 | Load JFLEG + W&I+LOCNESS; augment with DyslexiaSimulator (20% error rate) |
| 5 | Train with combined loss for 10 epochs, early stopping patience=3 |
| 6 | Evaluate on test set: GLEU + BERTScore F1 + (1−WER) |
| 7 | Compare composite score against `baseline_score.json` |
| 8 | If improved: save LoRA adapter |
| 9 | Merge adapter → save full model |
| 10 | Push adapter + merged model to Hub; update baseline |

### v1 Original Pipeline (`train.sh`) — 5 Stages

| Stage | Action |
|-------|--------|
| 1 | Setup & Dependencies |
| 2 | Data Preprocessing (FCE + W&I+LOCNESS + JFLEG → JSONL) |
| 3 | Human Pattern Classifier Pre-Training |
| 4 | Main Model Training (LoRA r=8, 5 epochs, CE only) |
| 5 | Evaluation (GLEU only) |

---

## Hyperparameter Reference

### v2 (`train_and_upgrade.py`)

```python
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05
TARGET_MODULES = ["q", "v", "k", "o", "wi_0", "wi_1", "wo"]

EPOCHS = 10
BATCH_SIZE = 2      # per device
GRAD_ACCUM = 32     # effective batch = 64
LR = 2e-4
WARMUP_RATIO = 0.10
LABEL_SMOOTHING = 0.1
MAX_INPUT_LEN = 128
MAX_TARGET_LEN = 128

LAMBDA_STYLE = 0.3
LAMBDA_SEMANTIC = 0.5
LAMBDA_HUMAN = 0.4  # GPU only
```

### v1 (`configs/training_config.yaml`)

```yaml
lora:
  r: 8
  lora_alpha: 16
  lora_dropout: 0.05
  target_modules: [q, v, k, o, wi_0, wi_1, wo]

training:
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 8   # effective batch = 32
  learning_rate: 3.0e-4
  lr_scheduler_type: cosine
  bf16: true

loss:
  lambda_style: 0.3
  lambda_semantic: 0.5
  lambda_human_pattern: 0.4
```

### `configs/inference_config.yaml`

```yaml
model:
  key: "flan-t5-small"
  checkpoint_path: "checkpoints/best_model_merged"
  use_lora: false

generation:
  num_beams: 5
  length_penalty: 1.2
  no_repeat_ngram_size: 3
  max_new_tokens: 128

vocabulary:
  semantic_threshold: 0.82
```

---

## Inference Pipeline (7 Steps)

```
Raw Text
   │
   ▼
1. Preprocessing ─────────────── LanguageTool spell correction + spaCy parsing
   │
   ▼
2. Style Fingerprinting ──────── Extract 41 features → MLP → 512-dim vector
   │
   ▼
3. Sentence-Chunked Generation ─ Split into 128-token chunks → Flan-T5 → rejoin
   │
   ▼
4. Post-Processing ───────────── Remove artifacts, replace em dashes, fix spacing
   │
   ▼
5. Vocabulary Elevation ──────── BERT fill-mask → AWL filtering → semantic gate (threshold 0.82)
   │
   ▼
6. Register Filtering ────────── Expand contractions, replace colloquialisms
   │
   ▼
7. Metrics ───────────────────── Style similarity, AWL coverage, readability scores
   │
   ▼
Corrected Text
```
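
Step 6 is the simplest to illustrate. A toy sketch (the word lists here are purely illustrative; the real mappings live in `src/vocabulary/register_filter.py`):

```python
# Toy register filter: expand contractions and swap colloquialisms for formal equivalents.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}
COLLOQUIAL = {"a lot of": "a great deal of", "kids": "children", "stuff": "material"}

def register_filter(text: str) -> str:
    for informal, formal in {**CONTRACTIONS, **COLLOQUIAL}.items():
        text = text.replace(informal, formal)
    return text

print(register_filter("The kids don't read a lot of stuff."))
# -> "The children do not read a great deal of material."
```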

---

## API Usage

```bash
# Start the server
PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000

# Correct text
curl -X POST http://localhost:8000/correct \
  -H "Content-Type: application/json" \
  -d '{"text": "The studnet recieved alot of informtion.", "style_alpha": 0.6}'

# Health check
curl http://localhost:8000/health
```

Interactive docs at `http://localhost:8000/docs`.
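
The same request from Python (the response fields follow the Pydantic models in `src/api/schemas.py`):

```python
import requests

resp = requests.post(
    "http://localhost:8000/correct",
    json={"text": "The studnet recieved alot of informtion.", "style_alpha": 0.6},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())
```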

---

## Hardware Requirements

| Tier | GPU | LoRA Config | Training Time |
|------|-----|-------------|---------------|
| **Tested (v1)** | RTX 3050 4GB | r=8, 5 epochs | ~45 min |
| **Tested (v2 CPU)** | None (HF Space CPU Basic) | r=16, 10 epochs | ~12–24 hours |
| Recommended | RTX 3090 24GB | r=16, 10 epochs + human-pattern loss | ~2–3 hours |
| Maximum | A100 80GB | Full pipeline with GPT-2 perplexity scoring | ~12 hours |

---

## Data Sources

| Dataset | Type | Size | Access |
|---------|------|------|--------|
| JFLEG (`jhu-clsp/jfleg`) | Fluency corrections (4 refs each) | ~5k pairs | HF Hub, no registration |
| W&I+LOCNESS (`bea2019st/wi_locness`) | Learner errors + corrections | ~34k pairs | HF Hub, no registration |
| FCE v2.1 | Learner errors + corrections | ~28k pairs | BEA-2019 (registration required) |
| Shanegerami AI_Human.csv | Human vs AI classification | ~50k samples | Kaggle |
| Starblasters8 data.parquet | Human vs AI classification | ~50k samples | Kaggle |
| Coxhead AWL | Academic Word List | 570 families / 549 headwords | Victoria University |

> Note: `train_and_upgrade.py` uses only JFLEG + W&I+LOCNESS (freely accessible via HF Hub). FCE and Kaggle datasets are used in the full manual pipeline only.

---

## Dyslexia Error Simulation

The `DyslexiaSimulator` generates synthetic training data based on research by Rello et al. (2013, 2017). v2 uses a 20% per-word error rate (up from 15%).

| Error Type | Frequency | Example |
|------------|-----------|---------|
| Phonetic substitution | 35% | "because" → "becaus" |
| Letter transposition | 18% | "the" → "teh" |
| Letter omission | 16% | "important" → "importnt" |
| Letter doubling | 12% | "letter" → "lettter" |
| Letter reversal (b/d, p/q) | 10% | "bad" → "dad" |
| Word boundary errors | 9% | "a lot" → "alot" |
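
As an illustration, each word is hit by one of the operators above with 20% probability. A toy sketch (the real confusion tables in `dyslexia_simulator.py` are research-derived and far richer):

```python
import random

def transpose(word: str) -> str:          # "the" -> "teh"
    if len(word) < 3:
        return word
    i = random.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def omit(word: str) -> str:               # "important" -> "importnt"
    if len(word) < 4:
        return word
    i = random.randrange(1, len(word) - 1)
    return word[:i] + word[i + 1:]

def simulate(text: str, error_rate: float = 0.20) -> str:
    """Corrupt roughly `error_rate` of the words in `text`."""
    return " ".join(
        random.choice([transpose, omit])(w) if random.random() < error_rate else w
        for w in text.split()
    )
```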

---

## Style Fingerprint Vector

The 512-dimensional style vector captures 41 raw features:

| Group | Features | Count |
|-------|----------|-------|
| Sentence stats | mean, std, skew of sentence lengths | 3 |
| Word stats | mean, std of word lengths | 2 |
| Lexical | type-token ratio, lexical density | 2 |
| Syntactic | passive/active voice ratio, subordinate clause ratio, avg dependency tree depth | 4 |
| Discourse | 20 academic discourse markers (per 100 words) | 20 |
| Register | hedging frequency, formality score, nominalization ratio | 3 |
| Readability | Flesch reading ease, avg syllables per word | 2 |
| Pronouns | first-person ratio, third-person ratio | 2 |
| Other | question ratio, exclamation ratio, AWL coverage | 3 |

The raw features are projected through a 2-layer MLP (`41 → 256 → 512`) with LayerNorm and GELU activation, then L2-normalised.
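
A minimal sketch of that projection head (the exact layer ordering in `fingerprinter.py` may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleProjector(nn.Module):
    """Project 41 raw stylometric features into a 512-dim L2-normalised fingerprint."""

    def __init__(self, n_features: int = 41, dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_features, 256), nn.LayerNorm(256), nn.GELU(),
            nn.Linear(256, dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.mlp(features), p=2, dim=-1)
```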

---

## Known Limitations

1. **Model capacity**: Flan-T5-Small (77M params) has limited correction ability compared to larger models. Doubling the LoRA rank (r=8 → r=16) partially addresses this.
2. **Training window**: The 128-token max input means very long sentences may be split mid-clause.
3. **Vocabulary elevation**: BERT fill-mask can suggest semantically inappropriate AWL words; the 0.82 similarity threshold is a trade-off between coverage and accuracy.
4. **Already-correct text**: The model is trained on error→correction pairs; feeding it clean text produces unpredictable output.
5. **LanguageTool latency**: Spell correction takes ~15–20s on the first call due to JVM startup.
6. **Human-pattern loss on CPU**: The GPT-2 perplexity-based loss is skipped on CPU for performance. The full loss is only active on GPU.

<!-- 7. **Semantic drift in correction**: The pipeline can introduce meaning-level errors — dyslexic phonetic patterns misread by LanguageTool can produce plausible-but-wrong word substitutions. BERTScore F1 and WER (now primary evaluation signals in v2) help detect but don't eliminate this. A dedicated post-correction semantic faithfulness check remains a future improvement. -->