morpheuslord committed
Commit 6332b0b · verified · 1 parent: 07a0d00

Update README.md

Files changed (1): README.md (+599 −106)
README.md CHANGED
---
language:
- en
tags:
- text2text-generation
- dyslexia
- grammar-correction
- style-preservation
- lora
- flan-t5
license: mit
base_model: google/flan-t5-small
datasets:
- jhu-clsp/jfleg
- bea2019st/wi_locness
pipeline_tag: translation
---
18
 
19
+ # Dyslexia Academic Writing Correction System
20
 
21
> **A style-preserving, grammar-correcting, academic-vocabulary-elevating AI system that corrects dyslexic writing while maintaining the author's personal voice, tone, and authorship signal: a corrector, not a rewriter.**
22
 
23
+ ## Overview
24
 
25
+ This system takes text written by dyslexic students and corrects grammar, spelling, and fluency errors while:
26
 
27
+ 1. **Preserving the author's unique writing style** via a 512-dimensional style fingerprint vector
28
+ 2. **Elevating vocabulary to academic register** using Coxhead's Academic Word List (AWL) and BERT-based lexical substitution
29
+ 3. **Resisting AI detection** through a frozen Human Pattern Classifier that penalises AI-typical writing during training
30
+ 4. **Maintaining semantic meaning** with cosine-similarity-based semantic preservation loss
31
 
32
+ The core model is **Google Flan-T5-Small** fine-tuned with **LoRA** (Low-Rank Adaptation, r=16), trained on real learner error corpora (JFLEG, W&I+LOCNESS) augmented with synthetic dyslexia-simulated data.
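
For quick experimentation outside the packaged scripts, the adapter can be loaded on top of the base model with `peft`. A minimal sketch, assuming the adapter sits at the root of the `morpheuslord/rewrite` repo (as described in the v3 pipeline below) and that raw text is passed without a task prefix; the bundled `scripts/run_inference.py` remains the authoritative entry point, since it also applies spell correction, style conditioning, the faithfulness gate, and vocabulary elevation:

```python
# Minimal sketch: load the base model plus the LoRA adapter and correct one sentence.
import torch
from peft import PeftModel
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

BASE = "google/flan-t5-small"
ADAPTER = "morpheuslord/rewrite"  # adapter pushed to the repo root (see the v3 pipeline)

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSeq2SeqLM.from_pretrained(BASE)
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

text = "The studnet recieved alot of informtion."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        num_beams=5,               # generation settings mirror inference_config.yaml
        length_penalty=1.2,
        repetition_penalty=1.3,
        no_repeat_ngram_size=3,
        max_new_tokens=256,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```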
33
 
34
+ ---
35
 
36
+ ## Latest Evaluation Results (v3)
37
 
38
+ | Metric | Score | Description |
39
+ |--------|-------|-------------|
40
+ | **GLEU** | **0.7593** | Grammar + fluency correction quality |
41
+ | **BERTScore F1** | **0.9758** | Semantic closeness to reference corrections |
42
+ | **1 − WER** | **0.8552** | Word-level accuracy (WER = 14.48%) |
43
+ | **Composite** | **0.8634** | `(GLEU + BERTScore F1 + (1−WER)) / 3` — gating score for Hub push |
44
+ | **Faithfulness reverts** | **11** | Outputs whose cosine sim to input fell below 0.75 — reverted to source |
45
 
46
+ > The model is only pushed to the Hub when the composite score strictly beats the saved baseline from the previous run, ensuring the Hub always holds the best-seen weights.
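
A minimal sketch of that gating logic; the `baseline_score.json` filename comes from this README, but its schema (a single `composite` field) and the helper names are illustrative assumptions:

```python
# Sketch of the Hub-push gate: compute the composite score and push only if it
# strictly beats the stored baseline from the previous run.
import json
from pathlib import Path

def composite_score(gleu: float, bertscore_f1: float, wer: float) -> float:
    """Composite = (GLEU + BERTScore F1 + (1 - WER)) / 3."""
    return (gleu + bertscore_f1 + (1.0 - wer)) / 3.0

def should_push(new_score: float, baseline_path: str = "baseline_score.json") -> bool:
    path = Path(baseline_path)
    baseline = json.loads(path.read_text())["composite"] if path.exists() else float("-inf")
    return new_score > baseline  # strict improvement required

score = composite_score(gleu=0.7593, bertscore_f1=0.9758, wer=0.1448)
print(f"composite = {score:.4f}, push = {should_push(score)}")  # composite = 0.8634
```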
 
 
 
 
 
 
47
 
48
+ ### Score Progression
49
 
50
+ | Metric | v1 | v2 | v3 | Δ v2→v3 |
51
+ |--------|----|----|-----|---------|
52
+ | GLEU | — | 0.7506 | **0.7593** | +0.0087 |
53
+ | BERTScore F1 | — | 0.9733 | **0.9758** | +0.0025 |
54
+ | 1 − WER | — | 0.8488 | **0.8552** | +0.0064 |
55
+ | Composite | — | 0.8576 | **0.8634** | +0.0058 |
56
 
57
+ ---
 
 
58
 
59
+ ## What Changed in v3
60
 
61
+ v3 keeps the same base model and LoRA rank as v2 but improves every other stage of the pipeline: wider context window, better generation, a semantic faithfulness gate that prevents meaning-destroying corrections, and optional ERRANT F0.5 evaluation.
62
 
63
+ | Parameter | v2 | v3 |
64
+ |-----------|----|----|
65
+ | Context window | 128 tokens | **256 tokens** |
66
+ | Additional data | JFLEG + W&I only | **+ C4-200M-GEC (~100k pairs, falls back if unavailable)** |
67
+ | Beam search | `num_beams=2` | **`num_beams=5`, `length_penalty=1.2`, `repetition_penalty=1.3`, `no_repeat_ngram_size=3`** |
68
+ | Faithfulness gate | none | **cosine sim < 0.75 → revert output to source** |
69
+ | Human-pattern loss | skipped on CPU | **active on GPU** (loads classifier from Hub if present) |
70
+ | Evaluation cap | always 200 samples | **200 on CPU, full test set on GPU** |
71
+ | ERRANT F0.5 | not present | **optional metric** (install `errant` + `en_core_web_sm`) |
72
+ | Composite | mean(GLEU, BERTScore, 1-WER) | **mean(GLEU, BERTScore, 1-WER [, ERRANT F0.5 if available])** |
73
 
74
+ ### Semantic Faithfulness Gate (v3)
75
 
76
+ After generation, each output is checked against its source input using `all-MiniLM-L6-v2` sentence embeddings. If cosine similarity falls below **0.75**, the output is discarded and the original input is returned as fallback — preventing corrections that accidentally change meaning.
77
 
78
+ In the v3 evaluation run, **11 outputs** (of 228 test pairs evaluated) were reverted. Without the gate, those would have been incorrect predictions dragging all three metrics down.
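
A minimal sketch of the gate, assuming the `sentence-transformers` package; function and variable names are illustrative rather than the repo's actual API:

```python
# Sketch of the semantic faithfulness gate: revert any output whose embedding
# similarity to the source falls below 0.75.
from sentence_transformers import SentenceTransformer, util

_encoder = SentenceTransformer("all-MiniLM-L6-v2")
FAITHFULNESS_THRESHOLD = 0.75

def apply_faithfulness_gate(source: str, corrected: str) -> tuple[str, bool]:
    """Return (final_text, reverted). Falls back to the source if meaning drifted."""
    embeddings = _encoder.encode([source, corrected], convert_to_tensor=True)
    similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
    if similarity < FAITHFULNESS_THRESHOLD:
        return source, True   # meaning drifted too far: keep the original text
    return corrected, False

text, reverted = apply_faithfulness_gate(
    "I writed the essay becaus it was due.",
    "I wrote the essay because it was due.",
)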
79
 
80
+ ### Combined Loss (v3 unchanged from v2 on CPU)
81
 
82
+ ```
83
+ L = L_CE + 0.3·L_style + 0.5·L_semantic (CPU)
84
+ L = L_CE + 0.3·L_style + 0.5·L_semantic + 0.4·L_human (GPU)
85
+ ```
86
 
87
+ | Term | Purpose | Weight |
88
+ |------|---------|--------|
89
+ | `L_CE` | Cross-entropy with label smoothing (0.1) | 1.0 |
90
+ | `L_style` | `1 − cos_sim(style(input), style(output))` | 0.3 |
91
+ | `L_semantic` | `1 − cos_sim(input_emb, output_emb)` | 0.5 |
92
+ | `L_human` | `1 − HumanPatternClassifier(output)` — anti-AI penalty | 0.4 *(GPU only)* |
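
A minimal PyTorch sketch of how these terms combine; tensor shapes and argument names are illustrative, and the repo's real implementation lives in `loss_functions.py`:

```python
# Sketch of the combined training loss. Style and sentence embeddings are assumed
# to be L2-normalised vectors from the style fingerprinter and a sentence encoder.
import torch
import torch.nn.functional as F

def combined_loss(
    ce_loss: torch.Tensor,          # cross-entropy (label smoothing 0.1) from the LM head
    style_src: torch.Tensor,        # (B, 512) style fingerprint of the input
    style_out: torch.Tensor,        # (B, 512) style fingerprint of the output
    sem_src: torch.Tensor,          # (B, D) sentence embedding of the input
    sem_out: torch.Tensor,          # (B, D) sentence embedding of the output
    human_score: torch.Tensor | None = None,  # (B,) classifier output, 1 = human-like
) -> torch.Tensor:
    style_loss = (1.0 - F.cosine_similarity(style_src, style_out, dim=-1)).mean()
    semantic_loss = (1.0 - F.cosine_similarity(sem_src, sem_out, dim=-1)).mean()
    loss = ce_loss + 0.3 * style_loss + 0.5 * semantic_loss
    if human_score is not None:               # GPU-only anti-AI term
        loss = loss + 0.4 * (1.0 - human_score).mean()
    return loss
```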
93
 
94
+ ---
95
 
96
+ ## What Changed in v2
97
+
98
+ The original model had a critical bug: `CorrectionTrainer.compute_loss()` only used cross-entropy loss. The multi-objective loss (`L_CE + λ_style + λ_semantic + λ_human`) was fully designed in `loss_functions.py` but was **never wired into the trainer**. v2 fixes this and upgrades several other parameters.
99
+
100
+ | Parameter | v1 (Original) | v2 (Upgraded) |
101
+ |-----------|--------------|---------------|
102
+ | LoRA rank | r=8, α=16 | **r=16, α=32** |
103
+ | Epochs | 5 | **10** |
104
+ | Effective batch size | 32 (4×8 accum) | **64 (2×32 accum)** |
105
+ | Learning rate | 3e-4 | **2e-4** |
106
+ | Warmup ratio | 5% | **10%** |
107
+ | Label smoothing | none | **0.1** |
108
+ | Loss function | CE only *(bug)* | **CE + Style + Semantic** *(fixed)* |
109
+ | Human-pattern loss | designed, unused | omitted on CPU; falls back to CE+style+sem |
110
+ | Evaluation | GLEU only | **GLEU + BERTScore F1 + (1−WER) composite** |
111
+ | Eval/save strategy | every 100 steps | **per epoch** |
112
+ | Early stopping | none | **patience=3** |
113
+ | Hub gate | none | **composite must beat saved baseline** |
114
+ | Warm-start strategy | cold start | **merge r=8 adapter → apply fresh r=16 LoRA** |
115
+ | Data split | 90%/10% train/val | **88%/7%/5% train/val/test** |
116
+ | Dyslexia augmentation error rate | 15% | **20%** |
117
 
118
+ ---
119
 
120
+ ## Features
121
+
122
+ | Feature | Description |
123
+ |---------|-------------|
124
+ | **Two-pass spell correction** | Dyslexia-aware phonetic pattern handling via LanguageTool |
125
+ | **Style fingerprinting** | 41 raw features → MLP → 512-dim L2-normalised style vector |
126
+ | **LoRA fine-tuning** | r=16, α=32, dropout=0.05 — targeting all attention + FFN projections |
127
+ | **Academic vocabulary elevation** | BERT fill-mask → AWL candidate filtering → semantic similarity gate |
128
+ | **Human pattern anti-AI loss** | Pre-trained frozen MLP classifier (17-dim features including GPT-2 perplexity) |
129
+ | **Combined training loss** | `L_CE + λ₁·L_style + λ₂·L_semantic (+ λ₃·L_human on GPU)` |
130
+ | **Semantic faithfulness gate** | Outputs with cosine sim < 0.75 to source are reverted — prevents meaning drift |
131
+ | **Sentence-chunked inference** | Long texts split into 256-token chunks matching training window |
132
+ | **FastAPI server** | RESTful `/correct` endpoint with CORS and rate limiting |
133
+ | **Multi-stage training** | Orchestrated via `train.sh` with checkpoint system (Skip/Redo/Continue) |
134
+ | **Synthetic data augmentation** | `DyslexiaSimulator` generates realistic errors from clean text (20% error rate) |
135
+ | **Composite score gating** | Hub push only if new model strictly beats saved baseline |
136
+
137
+ ---
138
+
139
+ ## Project Structure
140
+
141
+ ```
142
+ Rewriter/
143
+ ├── configs/
144
+ │ ├── training_config.yaml # Full training hyperparameters
145
+ │ ├── training_config_fast.yaml # Quick iteration config
146
+ │ ├── inference_config.yaml # Inference & generation settings
147
+ │ ├── model_config.yaml # Model architecture registry
148
+ │ └── awl_config.yaml # Academic Word List settings
149
+ ├── scripts/
150
+ │ ├── train.py # Main training script (Click CLI)
151
+ │ ├── evaluate.py # Test set evaluation (GLEU, ERRANT, BERTScore)
152
+ │ ├── run_inference.py # Interactive CLI inference
153
+ │ ├── preprocess_data.py # Raw datasets → unified JSONL
154
+ │ ├── pretrain_human_pattern_classifier.py # Stage 3: anti-AI classifier
155
+ │ ├── download_datasets.sh # BEA-2019 dataset downloader
156
+ │ └── download_kaggle_datasets.sh # Kaggle human/AI data downloader
157
+ ├── src/
158
+ │ ├── model/
159
+ │ │ ├── base_model.py # Model loader (T5/BART/Llama + LoRA + quantization)
160
+ │ │ ├── style_conditioner.py # Prefix tuning: style → virtual tokens
161
+ │ │ ├── generation_utils.py # Beam search, sampling, batch generation
162
+ │ │ └── lora_adapter.py # LoRA configuration helpers
163
+ │ ├── preprocessing/
164
+ │ │ ├── pipeline.py # Full preprocessing orchestrator
165
+ │ │ ├── spell_corrector.py # LanguageTool + dyslexia-aware correction
166
+ │ │ ├── dyslexia_simulator.py # Synthetic error generation (Rello et al.)
167
+ │ │ ├── dependency_parser.py # spaCy dependency tree analysis
168
+ │ │ ├── ner_tagger.py # Named entity protection
169
+ │ │ └── sentence_segmenter.py # Sentence boundary detection
170
+ │ ├── style/
171
+ │ │ ├── fingerprinter.py # 41 features → 512-dim style vector
172
+ │ │ ├── style_vector.py # Style vector dataclass
173
+ │ │ ├── formality_classifier.py # Rule-based formality scoring
174
+ │ │ └── emotion_classifier.py # Emotion detection
175
+ │ ├── training/
176
+ │ │ ├── dataset.py # Pre-tokenized cached dataset with style vectors
177
+ │ │ ├── trainer.py # CorrectionTrainer (HF Trainer + PEFT fixes)
178
+ │ │ ├── loss_functions.py # V1 and V2 combined losses
179
+ │ │ ├── human_pattern_extractor.py # 17-dim feature extraction + classifier
180
+ │ │ └── callbacks.py # Evaluation logging callbacks
181
+ │ ├── vocabulary/
182
+ │ │ ├── lexical_substitution.py # BERT fill-mask → AWL substitution pipeline
183
+ │ │ ├── awl_loader.py # Coxhead Academic Word List loader
184
+ │ │ └── register_filter.py # Contraction expansion + colloquial replacement
185
+ │ ├── inference/
186
+ │ │ ├── corrector.py # End-to-end inference pipeline orchestrator
187
+ │ │ └── postprocessor.py # Cleanup, entity restore, formatting
188
+ │ ├── evaluation/
189
+ │ │ ├── gleu_scorer.py # GLEU + BERTScore computation
190
+ │ │ ├── errant_evaluator.py # ERRANT P/R/F0.5 evaluation
191
+ │ │ ├── style_metrics.py # Style similarity + AWL coverage
192
+ │ │ └── authorship_verifier.py # AI detection resistance testing
193
+ │ └── api/
194
+ │ ├── main.py # FastAPI application
195
+ │ ├── schemas.py # Pydantic request/response models
196
+ │ └── middleware.py # Rate limiting + CORS
197
+ ├── train_and_upgrade.py # v3 upgrade pipeline (self-improving Hub push)
198
+ ├── data/
199
+ │ ├── raw/ # Original datasets (JFLEG, W&I+LOCNESS)
200
+ │ ├── processed/ # Unified JSONL (train/val/test splits)
201
+ │ ├── cache/ # Pre-tokenized dataset caches (.pt files)
202
+ │ └── awl/ # Coxhead Academic Word List
203
+ ├── train.sh # Multi-stage training orchestrator
204
+ ├── start.sh # Inference launcher (CLI or API mode)
205
+ ├── baseline_score.json # Saved composite score (0.8634) — gate for Hub push
206
+ ├── Dockerfile # Production container
207
+ ├── docker-compose.yml # Docker deployment
208
+ ├── requirements.txt # Python dependencies
209
+ └── pyproject.toml # Project metadata
210
+ ```
211
+
212
+ ---
213
 
214
+ ## Model Architecture
215
+
216
### Architecture Diagram (PNG)
![Architecture](arch.png)

### Architecture Diagram (Mermaid)
220
+ ```mermaid
221
+ graph TB
222
+ subgraph INFERENCE["🔮 Inference Pipeline"]
223
+ direction TB
224
+ INPUT["📝 Raw Dyslectic Text"]
225
+ subgraph PREPROCESS["Pre-Processing"]
226
+ SPELL["Spell Corrector<br/><i>dyslexia-aware phonetic</i>"]
227
+ SENT_SEG["Sentence Segmenter"]
228
+ DEP_PARSE["Dependency Parser"]
229
+ NER["NER Tagger"]
230
+ end
231
+ subgraph STYLE["Style Analysis"]
232
+ FINGER["Style Fingerprinter<br/><i>512-dim vector</i>"]
233
+ EMOTION["Emotion Classifier"]
234
+ FORMALITY["Formality Classifier"]
235
+ STYLE_VEC["Style Vector Composer"]
236
+ end
237
+ subgraph GENERATION["Core Generation"]
238
+ STYLE_COND["Style Conditioner<br/><i>prefix tuning</i>"]
239
+ BASE_MODEL["Base LM<br/><i>Flan-T5-Small (warm-merged)</i>"]
240
+ LORA["LoRA Adapter<br/><i>r=16</i>"]
241
+ GEN_UTILS["Generation Utils<br/><i>beam search, sampling</i>"]
242
+ end
243
+ subgraph POSTPROCESS["Post-Processing"]
244
+ FAITH["Faithfulness Gate<br/><i>cos sim &lt; 0.75 → revert</i>"]
245
+ POSTPROC["Post-Processor<br/><i>formatting, cleanup</i>"]
246
+ VOCAB_SUB["Lexical Substitution<br/><i>BERT-based</i>"]
247
+ AWL["AWL Loader<br/><i>Coxhead Academic Word List</i>"]
248
+ REG_FILTER["Register Filter<br/><i>academic tone gate</i>"]
249
+ end
250
+ OUTPUT["✅ Corrected Academic Text"]
251
+ INPUT --> SPELL --> SENT_SEG --> DEP_PARSE --> NER
252
+ INPUT --> FINGER --> EMOTION --> FORMALITY --> STYLE_VEC
253
+ NER --> STYLE_COND
254
+ STYLE_VEC --> STYLE_COND
255
+ STYLE_COND --> BASE_MODEL
256
+ LORA -.->|"merged weights"| BASE_MODEL
257
+ BASE_MODEL --> GEN_UTILS --> FAITH --> POSTPROC
258
+ POSTPROC --> VOCAB_SUB
259
+ AWL --> VOCAB_SUB
260
+ VOCAB_SUB --> REG_FILTER --> OUTPUT
261
+ end
262
+
263
+ subgraph TRAINING["🏋️ Training Pipeline (v3)"]
264
+ direction TB
265
+ subgraph WARMSTART["Warm-Start Merge"]
266
+ HUB_ADAPTER["Hub LoRA Adapter<br/><i>r=16 (v2)</i>"]
267
+ MERGE["merge_and_unload()"]
268
+ FRESH_LORA["Fresh LoRA r=16"]
269
+ end
270
+ subgraph DATA["Data Pipeline"]
271
+ JFLEG["jhu-clsp/jfleg<br/><i>~5k pairs, 4 refs each</i>"]
272
+ WILOCNESS["bea2019st/wi_locness<br/><i>~34k pairs</i>"]
273
+ C4GEC["C4-200M-GEC<br/><i>~100k pairs (optional)</i>"]
274
+ DYSLEXIA_AUG["DyslexiaSimulator<br/><i>20% error rate augmentation</i>"]
275
+ SPLIT["88% train / 7% val / 5% test"]
276
+ end
277
+ subgraph LOSS["Combined Loss (v3)"]
278
+ L_CE["L_CE + label_smoothing=0.1"]
279
+ L_STYLE["0.3 · L_style"]
280
+ L_SEM["0.5 · L_semantic"]
281
+ L_HUMAN["0.4 · L_human<br/><i>(GPU only)</i>"]
282
+ end
283
+ subgraph EVAL["Composite Evaluation"]
284
+ GLEU_E["GLEU"]
285
+ BERT_E["BERTScore F1"]
286
+ WER_E["1 − WER"]
287
+ ERRANT_E["ERRANT F0.5<br/><i>(optional)</i>"]
288
+ COMPOSITE["Composite = mean(3 or 4)"]
289
+ GATE["Beat baseline?"]
290
+ HUB_PUSH["Push to Hub ✅"]
291
+ end
292
+ HUB_ADAPTER --> MERGE --> FRESH_LORA
293
+ JFLEG --> DYSLEXIA_AUG
294
+ WILOCNESS --> DYSLEXIA_AUG
295
+ C4GEC --> DYSLEXIA_AUG
296
+ DYSLEXIA_AUG --> SPLIT
297
+ L_CE --> COMPOSITE
298
+ L_STYLE --> COMPOSITE
299
+ L_SEM --> COMPOSITE
300
+ GLEU_E --> COMPOSITE
301
+ BERT_E --> COMPOSITE
302
+ WER_E --> COMPOSITE
303
+ ERRANT_E -.->|"if installed"| COMPOSITE
304
+ COMPOSITE --> GATE --> HUB_PUSH
305
+ end
306
+ ```
307
 
308
+ ---
309
 
310
+ ## Design Choices & Rationale
311
 
312
+ ### Why Flan-T5-Small?
313
 
314
+ | Consideration | Decision |
315
+ |---------------|----------|
316
+ | **Hardware constraint** | RTX 3050 Laptop GPU (4GB VRAM) — rules out models > 500M params |
317
+ | **Architecture** | Encoder-decoder (seq2seq) is ideal for text-to-text correction tasks |
318
+ | **Instruction tuning** | Flan-T5 is pre-trained on 1,800+ instruction tasks — follows correction prompts naturally |
319
+ | **LoRA efficiency** | Trainable params scale with r: r=16 → ~2.56M (3.3%) — still fits in 4GB |
320
 
321
+ ### Why LoRA over Full Fine-Tuning?
322
 
323
+ - **Memory**: Full fine-tuning of T5-Small requires ~2.5GB for gradients alone; LoRA r=16 needs ~400MB
324
+ - **Warm-start safety**: Merging r=8 weights preserves corrections before expanding capacity to r=16
325
+ - **Merging**: LoRA weights merge into base model at inference time — zero latency overhead
326
+ - **Configuration**: `r=16, alpha=32, dropout=0.05`, targeting all attention + FFN projections (`q, k, v, o, wi_0, wi_1, wo`)
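
A minimal `peft` sketch of that configuration (the `SEQ_2_SEQ_LM` task type is an assumption about how the repo wires it up):

```python
# Sketch of the LoRA setup described above, using the peft library.
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q", "k", "v", "o", "wi_0", "wi_1", "wo"],  # attention + FFN projections
)

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # roughly the ~2.5M / ~3% figure quoted above
```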
327
 
328
+ ### Why a Combined Multi-Objective Loss?
329
 
330
+ The system uses (on CPU): `L = L_CE + 0.3·L_style + 0.5·L_semantic`
331
 
332
+ On GPU (with human-pattern classifier available): `L = L_CE + 0.3·L_style + 0.5·L_semantic + 0.4·L_human`
333
 
334
+ | Term | Purpose | Weight |
335
+ |------|---------|--------|
336
+ | `L_CE` | Cross-entropy with label smoothing (0.1) | 1.0 |
337
+ | `L_style` | `1 − cos_sim(style(input), style(output))` — preserves writing fingerprint | 0.3 |
338
+ | `L_semantic` | `1 − cos_sim(input_emb, output_emb)` — preserves meaning | 0.5 |
339
+ | `L_human` | `1 − HumanPatternClassifier(output)` — penalises AI-like text patterns | 0.4 |
340
 
341
+ ### Why a Semantic Faithfulness Gate?
342
 
343
Even a well-trained correction model can occasionally produce outputs that drift semantically from the input, particularly when a dyslexic spelling is ambiguous (e.g. "ther" could be "there" or "their"). Rather than accepting every model output blindly, v3 computes cosine similarity between the source and the output using `all-MiniLM-L6-v2` sentence embeddings. Outputs below **0.75 similarity** are treated as unreliable, and the original input is returned unchanged. This is conservative by design: an awkward but faithful source sentence is better than a fluent but wrong correction.
344
 
345
+ ### Why a Human Pattern Classifier?
346
 
347
+ AI-generated text has detectable statistical signatures:
348
+ - **Lower GPT-2 perplexity** (AI text is more "predictable")
349
+ - **Lower burstiness** (AI has uniform sentence lengths; humans vary)
350
+ - **Higher AI marker density** (overuse of "delve", "leverage", "furthermore")
351
+ - **Lower n-gram novelty** (AI reuses phrases more)
352
 
353
+ The classifier is a 3-layer MLP (17→128→64→1) pre-trained on ~100k samples from two Kaggle datasets (Shanegerami AI_Human.csv + Starblasters8), then **frozen** during main training. Its output score (0=AI, 1=human) is used as a reward signal. Requires GPU for GPT-2 perplexity scoring; falls back gracefully on CPU.
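
A minimal sketch of a classifier with that shape; the dropout value is an assumption, and the 17-feature extraction and pre-training loop are not shown:

```python
# Sketch of the 17 -> 128 -> 64 -> 1 human-pattern classifier. It is trained once
# on human-vs-AI data, then frozen and used only as a reward signal.
import torch
import torch.nn as nn

class HumanPatternClassifier(nn.Module):
    def __init__(self, n_features: int = 17):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),   # 0 = AI-like, 1 = human-like
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)

clf = HumanPatternClassifier()
clf.requires_grad_(False)                 # frozen during main training
scores = clf(torch.randn(4, 17))          # one human-likeness score per output text
```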
354
 
355
+ ### Why Sentence-Chunked Inference?
356
 
357
+ The model is trained with `max_input_length=256` tokens. The task prefix alone consumes ~40 tokens, leaving ~216 tokens for actual text. Long inputs are:
358
 
359
+ 1. Split into sentences using spaCy
360
+ 2. Grouped into chunks that fit the 256-token budget
361
+ 3. Each chunk is corrected independently
362
+ 4. Results are joined back together
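
A minimal sketch of the chunking described above, assuming spaCy's `en_core_web_sm` for sentence splitting and the Flan-T5 tokenizer for length budgeting; `correct_chunk` is a placeholder for the model call:

```python
# Sketch of sentence-chunked inference: pack whole sentences into chunks that
# respect the 256-token training window, correct each chunk, then rejoin.
import spacy
from transformers import AutoTokenizer

nlp = spacy.load("en_core_web_sm")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
MAX_TOKENS = 256 - 40  # leave room for the ~40-token task prefix

def chunk_text(text: str) -> list[str]:
    chunks, current, current_len = [], [], 0
    for sent in nlp(text).sents:
        n_tokens = len(tokenizer.encode(sent.text, add_special_tokens=False))
        if current and current_len + n_tokens > MAX_TOKENS:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent.text)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks

def correct_long_text(text: str, correct_chunk) -> str:
    # correct_chunk(chunk) would call the model on a single chunk
    return " ".join(correct_chunk(chunk) for chunk in chunk_text(text))
```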
363
 
364
+ ### Why Post-Generation Vocabulary Elevation?
365
 
366
+ Rather than relying solely on the model to produce academic vocabulary (which T5-Small lacks the capacity for), a separate BERT-based lexical substitution pipeline is applied:
367
 
368
+ 1. POS-tag the output with spaCy
369
+ 2. Identify non-AWL content words (nouns, verbs, adjectives, adverbs)
370
+ 3. Mask each candidate → run BERT fill-mask → filter to AWL-only predictions
371
+ 4. Accept substitution only if `semantic_similarity > 0.82` (measured with `all-mpnet-base-v2`)
372
+ 5. Track used substitutions to prevent duplicate replacements
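
A minimal sketch of a single substitution step, assuming `bert-base-uncased` for fill-mask and `all-mpnet-base-v2` for the similarity gate; the tiny inline AWL set stands in for the real word list:

```python
# Sketch of AWL-constrained lexical substitution for one candidate word.
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
similarity_model = SentenceTransformer("all-mpnet-base-v2")
AWL = {"demonstrate", "significant", "establish", "obtain"}  # stand-in for the full list
SEMANTIC_THRESHOLD = 0.82

def elevate_word(sentence: str, word: str) -> str:
    masked = sentence.replace(word, fill_mask.tokenizer.mask_token, 1)
    for prediction in fill_mask(masked, top_k=20):
        candidate = prediction["token_str"].strip()
        if candidate.lower() not in AWL or candidate.lower() == word.lower():
            continue
        new_sentence = sentence.replace(word, candidate, 1)
        embeddings = similarity_model.encode([sentence, new_sentence], convert_to_tensor=True)
        if util.cos_sim(embeddings[0], embeddings[1]).item() > SEMANTIC_THRESHOLD:
            return new_sentence  # accept the first AWL candidate that preserves meaning
    return sentence              # no acceptable substitution found

print(elevate_word("The results show a big difference.", "show"))
```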
373
 
374
+ ---
375
 
376
+ ## Quick Start
377
 
378
+ ### Prerequisites
379
 
380
+ - Python 3.10
381
+ - NVIDIA GPU with ≥ 4GB VRAM (or CPU, slower)
382
+ - ~10GB disk space for models and datasets
383
 
384
+ ### Option A: Self-Improving Upgrade Pipeline (v3)
385
 
386
+ This pipeline loads the existing Hub adapter, upgrades it, evaluates, and only pushes if it improves.
387
 
388
+ ```bash
389
+ git clone https://huggingface.co/morpheuslord/rewrite && cd rewrite
390
+ pip install -r requirements.txt
391
 
392
+ export HF_TOKEN="your-hf-token-with-write-access"
393
+ python train_and_upgrade.py
394
+ ```
395
 
396
+ The pipeline handles all 10 steps automatically:
397
+ **Load adapter → Warm-start merge → Apply r=16 LoRA → Load data → Train → Evaluate → Gate → Save → Merge → Push**
398
 
399
+ ### Option B: Manual Step-by-Step (original pipeline)
400
 
401
+ ```bash
402
+ # 1. Install dependencies
403
+ pip install -r requirements.txt
404
+ python -m spacy download en_core_web_sm
405
 
406
+ # 2. Preprocess datasets (FCE, W&I+LOCNESS, JFLEG → unified JSONL)
407
+ python scripts/preprocess_data.py
408
 
409
+ # 3. Pre-train the human pattern classifier
410
+ python scripts/pretrain_human_pattern_classifier.py
411
 
412
+ # 4. Train the correction model
413
+ PYTHONPATH=. python scripts/train.py --config configs/training_config.yaml --use-v2-loss
414
 
415
+ # 5. Merge LoRA adapter into base model for inference
416
+ python -c "
417
+ from peft import PeftModel
418
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
419
+ import torch
420
+ model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-small', torch_dtype=torch.bfloat16)
421
+ model = PeftModel.from_pretrained(model, 'checkpoints/checkpoint-BEST')
422
+ model = model.merge_and_unload()
423
+ model.save_pretrained('checkpoints/best_model_merged')
424
+ AutoTokenizer.from_pretrained('google/flan-t5-small').save_pretrained('checkpoints/best_model_merged')
425
+ "
426
 
427
+ # 6. Run inference
428
+ PYTHONPATH=. python scripts/run_inference.py --text "The studnet recieved alot of informtion."
429
 
430
+ # 7. Or start the API server
431
+ PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000
432
+ ```
433
 
434
+ ---
435
 
436
+ ## Training Pipeline
437
+
438
+ ### v3 Upgrade Pipeline (`train_and_upgrade.py`) — 10 Steps
439
+
440
+ | Step | Action |
441
+ |------|--------|
442
+ | 1 | Load existing LoRA adapter (r=16, v2) from Hub |
443
+ | 2 | Merge into base weights (`merge_and_unload`) — warm start |
444
+ | 3 | Apply fresh LoRA r=16 on merged base |
445
+ | 4 | Load JFLEG + W&I+LOCNESS + C4-GEC (optional); augment with DyslexiaSimulator (20% error rate) |
446
+ | 5 | Train with combined loss for 10 epochs, early stopping patience=3 |
447
+ | 6 | Evaluate on test set: GLEU + BERTScore F1 + (1−WER) [+ ERRANT F0.5 if installed] |
448
+ | 7 | Apply semantic faithfulness gate — revert outputs with cosine sim < 0.75 to source |
449
+ | 8 | Compare composite score against `baseline_score.json` |
450
+ | 9 | If improved: merge adapter → save full model |
451
+ | 10 | Push adapter (repo root) + merged model (`merged/` subfolder) to Hub; update baseline |
452
+
453
+ ### v2 Upgrade Pipeline — 10 Steps
454
+
455
+ | Step | Action |
456
+ |------|--------|
457
+ | 1 | Load existing LoRA adapter (r=8) from Hub |
458
+ | 2 | Merge into base weights (`merge_and_unload`) — warm start |
459
+ | 3 | Apply fresh LoRA r=16 on merged base |
460
+ | 4 | Load JFLEG + W&I+LOCNESS; augment with DyslexiaSimulator (20% error rate) |
461
+ | 5 | Train with combined loss for 10 epochs, early stopping patience=3 |
462
+ | 6 | Evaluate on test set: GLEU + BERTScore F1 + (1−WER) |
463
+ | 7 | Compare composite score against `baseline_score.json` |
464
+ | 8 | If improved: save LoRA adapter |
465
+ | 9 | Merge adapter → save full model |
466
+ | 10 | Push adapter + merged model to Hub; update baseline |
467
+
468
+ ### v1 Original Pipeline (`train.sh`) — 5 Stages
469
+
470
+ | Stage | Action |
471
+ |-------|--------|
472
+ | 1 | Setup & Dependencies |
473
+ | 2 | Data Preprocessing (FCE + W&I+LOCNESS + JFLEG → JSONL) |
474
+ | 3 | Human Pattern Classifier Pre-Training |
475
+ | 4 | Main Model Training (LoRA r=8, 5 epochs, CE only) |
476
+ | 5 | Evaluation (GLEU only) |
477
 
478
+ ---
479
 
480
+ ## Hyperparameter Reference
481
+
482
+ ### v3 (`train_and_upgrade.py`)
483
+
484
+ ```python
485
+ LORA_R = 16
486
+ LORA_ALPHA = 32
487
+ LORA_DROPOUT = 0.05
488
+ TARGET_MODULES = ["q", "v", "k", "o", "wi_0", "wi_1", "wo"]
489
+
490
+ EPOCHS = 10
491
+ BATCH_SIZE = 2 # per device (CPU); 8 on GPU
492
+ GRAD_ACCUM = 32 # effective batch = 64
493
+ LR = 2e-4
494
+ WARMUP_RATIO = 0.10
495
+ LABEL_SMOOTHING = 0.1
496
+ MAX_INPUT_LEN = 256 # up from 128 in v2
497
+ MAX_TARGET_LEN = 256
498
+
499
+ LAMBDA_STYLE = 0.3
500
+ LAMBDA_SEMANTIC = 0.5
501
+ LAMBDA_HUMAN = 0.4 # GPU only
502
+
503
+ FAITHFULNESS_THRESHOLD = 0.75 # new in v3
504
+ ```
505
+
506
+ ### v2 (`train_and_upgrade.py`)
507
+
508
+ ```python
509
+ LORA_R = 16
510
+ LORA_ALPHA = 32
511
+ LORA_DROPOUT = 0.05
512
+ TARGET_MODULES = ["q", "v", "k", "o", "wi_0", "wi_1", "wo"]
513
+
514
+ EPOCHS = 10
515
+ BATCH_SIZE = 2
516
+ GRAD_ACCUM = 32 # effective batch = 64
517
+ LR = 2e-4
518
+ WARMUP_RATIO = 0.10
519
+ LABEL_SMOOTHING = 0.1
520
+ MAX_INPUT_LEN = 128
521
+ MAX_TARGET_LEN = 128
522
+
523
+ LAMBDA_STYLE = 0.3
524
+ LAMBDA_SEMANTIC = 0.5
525
+ LAMBDA_HUMAN = 0.4 # GPU only
526
+ ```
527
+
528
+ ### v1 (`configs/training_config.yaml`)
529
+
530
```yaml
lora:
  r: 8
  lora_alpha: 16
  lora_dropout: 0.05
  target_modules: [q, v, k, o, wi_0, wi_1, wo]

training:
  per_device_train_batch_size: 4
  gradient_accumulation_steps: 8   # effective batch = 32
  learning_rate: 3.0e-4
  lr_scheduler_type: cosine
  bf16: true

loss:
  lambda_style: 0.3
  lambda_semantic: 0.5
  lambda_human_pattern: 0.4
```
549
+
550
+ ### `configs/inference_config.yaml`
551
+
552
```yaml
model:
  key: "flan-t5-small"
  checkpoint_path: "checkpoints/best_model_merged"
  use_lora: false

generation:
  num_beams: 5
  length_penalty: 1.2
  repetition_penalty: 1.3
  no_repeat_ngram_size: 3
  max_new_tokens: 256

vocabulary:
  semantic_threshold: 0.82

faithfulness:
  threshold: 0.75
```
571
 
572
+ ---
573
 
574
+ ## Inference Pipeline (8 Steps)
575
+
576
```
Raw Text
   │
   ▼
1. Preprocessing ─────── LanguageTool spell correction + spaCy parsing
   │
   ▼
2. Style Fingerprinting ─ Extract 41 features → MLP → 512-dim vector
   │
   ▼
3. Sentence-Chunked Generation ─ Split into 256-token chunks → Flan-T5 → rejoin
   │
   ▼
4. Faithfulness Gate ──── cosine_sim(source, output) < 0.75 → revert to source [NEW v3]
   │
   ▼
5. Post-Processing ───── Remove artifacts, replace em dashes, fix spacing
   │
   ▼
6. Vocabulary Elevation ─ BERT fill-mask → AWL filtering → semantic gate (threshold 0.82)
   │
   ▼
7. Register Filtering ── Expand contractions, replace colloquialisms
   │
   ▼
8. Metrics ──────────── Style similarity, AWL coverage, readability scores
   │
   ▼
Corrected Text
```
606
 
607
+ ---
608
 
609
+ ## API Usage
610
 
611
+ ```bash
612
+ # Start the server
613
+ PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000
614
 
615
+ # Correct text
616
+ curl -X POST http://localhost:8000/correct \
617
+ -H "Content-Type: application/json" \
618
+ -d '{"text": "The studnet recieved alot of informtion.", "style_alpha": 0.6}'
619
 
620
+ # Health check
621
+ curl http://localhost:8000/health
622
+ ```
623
 
624
+ Interactive docs at `http://localhost:8000/docs`.
625
 
626
+ ---
627
 
628
+ ## Hardware Requirements
629
 
630
+ | Tier | GPU | LoRA Config | Epochs | Training Time |
631
+ |------|-----|-------------|--------|---------------|
632
+ | **Tested (v1)** | RTX 3050 4GB | r=8 | 5 | ~45 min |
633
+ | **Tested (v2 CPU)** | None (HF Space CPU Basic) | r=16 | 10 | ~12–24 hours |
634
+ | **Tested (v3 CPU)** | None (HF Space CPU Basic) | r=16 | 10 | ~12–24 hours |
635
+ | Recommended | RTX 3090 24GB | r=16 + human-pattern loss | 10 | ~2–3h |
636
+ | Maximum | A100 80GB | Full pipeline with GPT-2 perplexity + ERRANT | 10 | ~12h |
637
 
638
+ ---
639
 
640
+ ## Data Sources
641
 
642
+ | Dataset | Type | Size | Access |
643
+ |---------|------|------|--------|
644
+ | JFLEG (`jhu-clsp/jfleg`) | Fluency corrections (4 refs each) | ~5k pairs | HF Hub, no registration |
645
+ | W&I+LOCNESS (`bea2019st/wi_locness`) | Learner errors + corrections | ~34k pairs | HF Hub, no registration |
646
+ | C4-200M-GEC (`cointegrated/c4_200m-gec-filtered`) | Synthetic GEC pairs | ~100k pairs (capped) | HF Hub, no registration — *falls back silently if unavailable* |
647
+ | FCE v2.1 | Learner errors + corrections | ~28k pairs | BEA-2019 (registration required) |
648
+ | Shanegerami AI_Human.csv | Human vs AI classification | ~50k samples | Kaggle |
649
+ | Starblasters8 data.parquet | Human vs AI classification | ~50k samples | Kaggle |
650
+ | Coxhead AWL | Academic Word List | 570 families / 549 headwords | Victoria University |
651
 
652
+ > Note: `train_and_upgrade.py` uses JFLEG + W&I+LOCNESS + C4-GEC (freely accessible via HF Hub). FCE and Kaggle datasets are used in the full manual pipeline only.
653
 
654
+ ---
655
 
656
+ ## Dyslexia Error Simulation
657
 
658
+ The `DyslexiaSimulator` generates synthetic training data based on research by Rello et al. (2013, 2017). v2+ uses a 20% per-word error rate (up from 15% in v1).
659
 
660
+ | Error Type | Frequency | Example |
661
+ |-----------|-----------|---------|
662
+ | Phonetic substitution | 35% | "because" → "becaus" |
663
+ | Letter transposition | 18% | "the" → "teh" |
664
+ | Letter omission | 16% | "important" → "importnt" |
665
+ | Letter doubling | 12% | "letter" → "lettter" |
666
+ | Letter reversal (b/d, p/q) | 10% | "bad" → "dad" |
667
+ | Word boundary errors | 9% | "a lot" → "alot" |
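
A minimal sketch of the augmentation idea; the phonetic table and helper below are truncated illustrations, not the simulator's actual rule set or frequencies:

```python
# Sketch of dyslexia-style error injection: corrupt ~20% of words using a couple of
# the error types listed above to create synthetic (noisy, clean) training pairs.
import random

PHONETIC = {"because": "becaus", "right": "rite", "does": "dose"}  # tiny illustrative table

def transpose(word: str) -> str:
    """Swap two adjacent interior letters, e.g. 'the' -> 'teh' for longer words."""
    if len(word) < 4:
        return word
    i = random.randrange(1, len(word) - 2)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def corrupt(text: str, error_rate: float = 0.20, seed: int = 0) -> str:
    random.seed(seed)
    words = []
    for word in text.split():
        if random.random() < error_rate:
            word = PHONETIC.get(word.lower(), transpose(word))
        words.append(word)
    return " ".join(words)

clean = "The essay was late because the student lost the right notes."
print(corrupt(clean))  # noisy input; `clean` is kept as the training target
```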
668
 
669
+ ---
670
 
671
+ ## Style Fingerprint Vector
672
 
673
+ The 512-dimensional style vector captures 41 raw features:
674
 
675
+ | Group | Features | Count |
676
+ |-------|----------|-------|
677
+ | Sentence stats | mean, std, skew of sentence lengths | 3 |
678
+ | Word stats | mean, std of word lengths | 2 |
679
+ | Lexical | type-token ratio, lexical density | 2 |
680
+ | Syntactic | passive/active voice ratio, subordinate clause ratio, avg dependency tree depth | 4 |
681
+ | Discourse | 20 academic discourse markers (per 100 words) | 20 |
682
+ | Register | hedging frequency, formality score, nominalization ratio | 3 |
683
+ | Readability | Flesch reading ease, avg syllables per word | 2 |
684
+ | Pronouns | first-person ratio, third-person ratio | 2 |
685
+ | Other | question ratio, exclamation ratio, AWL coverage | 3 |
686
 
687
+ Projected through a 2-layer MLP (`41 → 256 → 512`) with LayerNorm and GELU activation, then L2-normalised.
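
A minimal PyTorch sketch of that projection head; the extraction of the 41 raw feature statistics themselves is not shown:

```python
# Sketch of the style projection head: 41 raw features -> 256 -> 512, with
# LayerNorm + GELU, then L2 normalisation so cosine similarity is meaningful.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleProjector(nn.Module):
    def __init__(self, n_features: int = 41, hidden: int = 256, out_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_features, hidden),
            nn.LayerNorm(hidden),
            nn.GELU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.mlp(features), p=2, dim=-1)  # unit-length style vector

style_vector = StyleProjector()(torch.randn(1, 41))  # shape (1, 512), L2 norm == 1
```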
688
 
689
+ ---
690
 
691
+ ## Known Limitations
 
692
 
693
+ 1. **Model capacity**: Flan-T5-Small (77M params) has limited correction ability compared to larger models. Doubling LoRA rank (r=8 → r=16) partially addresses this.
694
+ 2. **Training window**: 256-token max input (up from 128 in v1/v2) — very long paragraphs may still be split mid-clause.
695
+ 3. **Vocabulary elevation**: BERT fill-mask can suggest semantically inappropriate AWL words; the 0.82 similarity threshold is a trade-off between coverage and accuracy.
696
+ 4. **Already-correct text**: The model is trained on error→correction pairs; feeding it clean text produces unpredictable output.
697
+ 5. **LanguageTool latency**: Spell correction takes ~15–20s due to JVM startup on first call.
698
+ 6. **Human-pattern loss on CPU**: The GPT-2 perplexity-based loss is skipped on CPU for performance. Full loss is only active on GPU.
699
+ 7. **Faithfulness gate conservatism**: The 0.75 cosine similarity threshold occasionally reverts valid-but-heavily-corrected outputs. Outputs flagged as reverts are logged — monitor `num_fallback` in evaluation to tune the threshold.