Auto-upgrade: composite 0.8576 | GLEU 0.7506 | BERTScore 0.9733 | 1-WER 0.8488 | r=16, 10 epochs, combined loss
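
The commit title does not say how the composite score is derived. The reported value is consistent with an unweighted mean of the three component metrics, and the sketch below checks that arithmetic. Treat the averaging rule as an assumption, since the scoring script is not part of this diff.

```python
# Component metrics copied from the commit title.
metrics = {"GLEU": 0.7506, "BERTScore": 0.9733, "1-WER": 0.8488}

# Assumption: the composite is the plain arithmetic mean of the components.
composite = sum(metrics.values()) / len(metrics)

print(f"{composite:.4f}")  # prints 0.8576
```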

Files changed (5)

README.md +112 -502
adapter_config.json +48 -0
adapter_model.safetensors +3 -0
tokenizer.json +0 -0
tokenizer_config.json +114 -0

README.md CHANGED

@@ -1,596 +1,206 @@
 ---
-language:
-- en
 tags:
-- text2text-generation
-- dyslexia
-- grammar-correction
-- style-preservation
 - lora
-- flan-t5
-license: mit
-base_model: google/flan-t5-small
-datasets:
-- cambridge/fce
-- wi_locness
-- jfleg
-pipeline_tag: translation
 ---
-# Dyslexia Academic Writing Correction System
-> **A style-preserving AI system that corrects grammar and elevates academic vocabulary in dyslexic writing while maintaining the author's personal voice, tone, and authorship signal: a corrector, not a rewriter.**
-## Overview
-This system takes text written by dyslexic students and corrects grammar, spelling, and fluency errors while:
-1. **Preserving the author's unique writing style** via a 512-dimensional style fingerprint vector
-2. **Elevating vocabulary to academic register** using Coxhead's Academic Word List (AWL) and BERT-based lexical substitution
-3. **Resisting AI detection** through a frozen Human Pattern Classifier that penalises AI-typical writing during training
-4. **Maintaining semantic meaning** with cosine-similarity-based semantic preservation loss
-The core model is **Google Flan-T5-Small** fine-tuned with **LoRA** (Low-Rank Adaptation), trained on real learner error corpora (FCE, W&I+LOCNESS, JFLEG) augmented with synthetic dyslexia-simulated data.
----
-## Features
-| Feature | Description |
-|---------|-------------|
-| **Two-pass spell correction** | Dyslexia-aware phonetic pattern handling via LanguageTool |
-| **Style fingerprinting** | 41 raw features → MLP → 512-dim L2-normalised style vector |
-| **LoRA fine-tuning** | 1.63% trainable params (1.28M / 78.2M total), rank=8 |
-| **Academic vocabulary elevation** | BERT fill-mask → AWL candidate filtering → semantic similarity gate |
-| **Human pattern anti-AI loss** | Pre-trained frozen MLP classifier (17-dim features including GPT-2 perplexity) |
-| **Combined training loss** | `L_CE + λ₁·L_style + λ₂·L_semantic + λ₃·L_human_pattern` |
-| **Sentence-chunked inference** | Long texts split into 128-token chunks matching training window |
-| **FastAPI server** | RESTful `/correct` endpoint with CORS and rate limiting |
-| **Multi-stage training** | Orchestrated via `train.sh` with checkpoint system (Skip/Redo/Continue) |
-| **Synthetic data augmentation** | `DyslexiaSimulator` generates realistic errors from clean text |
----
-## Project Structure
-```
-Rewriter/
-├── configs/
-│   ├── training_config.yaml        # Full training hyperparameters
-│   ├── training_config_fast.yaml   # Quick iteration config
-│   ├── inference_config.yaml       # Inference & generation settings
-│   ├── model_config.yaml           # Model architecture registry
-│   └── awl_config.yaml             # Academic Word List settings
-├── scripts/
-│   ├── train.py                    # Main training script (Click CLI)
-│   ├── evaluate.py                 # Test set evaluation (GLEU, ERRANT, BERTScore)
-│   ├── run_inference.py            # Interactive CLI inference
-│   ├── preprocess_data.py          # Raw datasets → unified JSONL
-│   ├── pretrain_human_pattern_classifier.py  # Stage 3: anti-AI classifier
-│   ├── download_datasets.sh        # BEA-2019 dataset downloader
-│   └── download_kaggle_datasets.sh # Kaggle human/AI data downloader
-├── src/
-│   ├── model/
-│   │   ├── base_model.py           # Model loader (T5/BART/Llama + LoRA + quantization)
-│   │   ├── style_conditioner.py    # Prefix tuning: style → virtual tokens
-│   │   ├── generation_utils.py     # Beam search, sampling, batch generation
-│   │   └── lora_adapter.py         # LoRA configuration helpers
-│   ├── preprocessing/
-│   │   ├── pipeline.py             # Full preprocessing orchestrator
-│   │   ├── spell_corrector.py      # LanguageTool + dyslexia-aware correction
-│   │   ├── dyslexia_simulator.py   # Synthetic error generation (Rello et al.)
-│   │   ├── dependency_parser.py    # spaCy dependency tree analysis
-│   │   ├── ner_tagger.py           # Named entity protection
-│   │   └── sentence_segmenter.py   # Sentence boundary detection
-│   ├── style/
-│   │   ├── fingerprinter.py        # 41 features → 512-dim style vector
-│   │   ├── style_vector.py         # Style vector dataclass
-│   │   ├── formality_classifier.py # Rule-based formality scoring
-│   │   └── emotion_classifier.py   # Emotion detection
-│   ├── training/
-│   │   ├── dataset.py              # Pre-tokenized cached dataset with style vectors
-│   │   ├── trainer.py              # CorrectionTrainer (HF Trainer + PEFT fixes)
-│   │   ├── loss_functions.py       # V1 and V2 combined losses
-│   │   ├── human_pattern_extractor.py  # 17-dim feature extraction + classifier
-│   │   └── callbacks.py            # Evaluation logging callbacks
-│   ├── vocabulary/
-│   │   ├── lexical_substitution.py # BERT fill-mask → AWL substitution pipeline
-│   │   ├── awl_loader.py           # Coxhead Academic Word List loader
-│   │   └── register_filter.py      # Contraction expansion + colloquial replacement
-│   ├── inference/
-│   │   ├── corrector.py            # End-to-end inference pipeline orchestrator
-│   │   └── postprocessor.py        # Cleanup, entity restore, formatting
-│   ├── evaluation/
-│   │   ├── gleu_scorer.py          # GLEU + BERTScore computation
-│   │   ├── errant_evaluator.py     # ERRANT P/R/F0.5 evaluation
-│   │   ├── style_metrics.py        # Style similarity + AWL coverage
-│   │   └── authorship_verifier.py  # AI detection resistance testing
-│   └── api/
-│       ├── main.py                 # FastAPI application
-│       ├── schemas.py              # Pydantic request/response models
-│       └── middleware.py           # Rate limiting + CORS
-├── data/
-│   ├── raw/                        # Original datasets (FCE, W&I+LOCNESS, JFLEG, Kaggle)
-│   ├── processed/                  # Unified JSONL (train/val/test splits)
-│   ├── cache/                      # Pre-tokenized dataset caches (.pt files)
-│   └── awl/                        # Coxhead Academic Word List
-├── train.sh                        # Multi-stage training orchestrator
-├── start.sh                        # Inference launcher (CLI or API mode)
-├── Dockerfile                      # Production container
-├── docker-compose.yml              # Docker deployment
-├── requirements.txt                # Python dependencies
-└── pyproject.toml                  # Project metadata
-```
-## Model Architecture
-### PNG:
-![Architecture](arch.png)
-### Mermaid Diagram:
-```mermaid
-graph TB
-    %% ── Inference Pipeline (left-to-right flow) ──────────────────────
-    subgraph INFERENCE["🔮 Inference Pipeline"]
-        direction TB
-        INPUT["📝 Raw Dyslexic Text"]
-        subgraph PREPROCESS["Pre-Processing"]
-            SPELL["Spell Corrector<br/><i>dyslexia-aware phonetic</i>"]
-            SENT_SEG["Sentence Segmenter"]
-            DEP_PARSE["Dependency Parser"]
-            NER["NER Tagger"]
-        end
-        subgraph STYLE["Style Analysis"]
-            FINGER["Style Fingerprinter<br/><i>512-dim vector</i>"]
-            EMOTION["Emotion Classifier"]
-            FORMALITY["Formality Classifier"]
-            STYLE_VEC["Style Vector Composer"]
-        end
-        subgraph GENERATION["Core Generation"]
-            STYLE_COND["Style Conditioner<br/><i>prefix tuning</i>"]
-            BASE_MODEL["Base LM<br/><i>Flan-T5 / BART / Llama-3</i>"]
-            LORA["LoRA Adapter"]
-            GEN_UTILS["Generation Utils<br/><i>beam search, sampling</i>"]
-        end
-        subgraph POSTPROCESS["Post-Processing"]
-            POSTPROC["Post-Processor<br/><i>formatting, cleanup</i>"]
-            VOCAB_SUB["Lexical Substitution<br/><i>BERT-based</i>"]
-            AWL["AWL Loader<br/><i>Coxhead Academic Word List</i>"]
-            REG_FILTER["Register Filter<br/><i>academic tone gate</i>"]
-        end
-        OUTPUT["✅ Corrected Academic Text"]
-        INPUT --> SPELL --> SENT_SEG --> DEP_PARSE --> NER
-        INPUT --> FINGER --> EMOTION --> FORMALITY --> STYLE_VEC
-        NER --> STYLE_COND
-        STYLE_VEC --> STYLE_COND
-        STYLE_COND --> BASE_MODEL
-        LORA -.->|"merged weights"| BASE_MODEL
-        BASE_MODEL --> GEN_UTILS --> POSTPROC
-        POSTPROC --> VOCAB_SUB
-        AWL --> VOCAB_SUB
-        VOCAB_SUB --> REG_FILTER --> OUTPUT
-    end
-    %% ── Training Pipeline ────────────────────────────────────────────
-    subgraph TRAINING["🏋️ Training Pipeline"]
-        direction TB
-        subgraph DATA["Data Pipeline"]
-            RAW_DATA["Raw Datasets<br/><i>JFLEG, WI+LOCNESS, C4_200M,<br/>FCE, Lang-8, NUCLE</i>"]
-            KAGGLE["Kaggle Datasets<br/><i>Shanegerami, Starblasters8</i>"]
-            PREPROC_SCRIPT["preprocess_data.py"]
-            TRAIN_JSONL["train.jsonl / val.jsonl / test.jsonl"]
-        end
-        subgraph HP_PRETRAIN["Human Pattern Pre-Training"]
-            FEAT_EXTRACT["Feature Extractor<br/><i>17-dim: perplexity, burstiness,<br/>n-gram novelty, AI markers...</i>"]
-            GPT2["GPT-2<br/><i>perplexity scorer</i>"]
-            HP_CLASSIFIER["Human Pattern Classifier<br/><i>MLP: 17→128→64→1</i>"]
-            HP_WEIGHTS["human_pattern_classifier.pt"]
-        end
-        subgraph MAIN_TRAIN["Main Model Training"]
-            DATASET["WritingCorrectionDataset"]
-            COMBINED_LOSS["Combined Loss Function"]
-            L_CE["L_CE<br/><i>cross-entropy</i>"]
-            L_STYLE["λ₁ · L_style<br/><i>style consistency</i>"]
-            L_SEM["λ₂ · L_semantic<br/><i>meaning preservation</i>"]
-            L_HUMAN["λ₃ · L_human_pattern<br/><i>anti-AI penalty</i>"]
-            TRAINER["CorrectionTrainer"]
-            CALLBACKS["Callbacks<br/><i>StyleMetrics,<br/>EarlyStoppingOnStyleDrift</i>"]
-        end
-        subgraph EVAL["Evaluation"]
-            ERRANT["ERRANT Evaluator<br/><i>P / R / F₀.₅</i>"]
-            GLEU["GLEU Scorer"]
-            STYLE_MET["Style Metrics<br/><i>cosine similarity</i>"]
-            AUTH_VER["Authorship Verifier<br/><i>AI detection resistance</i>"]
-        end
-        RAW_DATA --> PREPROC_SCRIPT --> TRAIN_JSONL
-        KAGGLE --> FEAT_EXTRACT
-        GPT2 --> FEAT_EXTRACT --> HP_CLASSIFIER --> HP_WEIGHTS
-        TRAIN_JSONL --> DATASET --> TRAINER
-        L_CE --> COMBINED_LOSS
-        L_STYLE --> COMBINED_LOSS
-        L_SEM --> COMBINED_LOSS
-        HP_WEIGHTS -.->|"frozen"| L_HUMAN --> COMBINED_LOSS
-        COMBINED_LOSS --> TRAINER
-        CALLBACKS --> TRAINER
-        TRAINER --> EVAL
-    end
-    %% ── API Layer ────────────────────────────────────────────────────
-    subgraph API["🌐 FastAPI Server"]
-        ENDPOINT["/correct endpoint"]
-        SCHEMAS["Request / Response Schemas"]
-        MIDDLEWARE["Rate Limiting & CORS"]
-        CORRECTOR["Corrector<br/><i>orchestrates full pipeline</i>"]
-    end
-    ENDPOINT --> CORRECTOR --> INFERENCE
-    TRAINER -->|"best_model/"| BASE_MODEL
-    %% ── Styling ──────────────────────────────────────────────────────
-    classDef pipeline fill:#1a1a2e,stroke:#16213e,color:#e94560,stroke-width:2px
-    classDef module fill:#0f3460,stroke:#533483,color:#e2e2e2,stroke-width:1px
-    classDef data fill:#1a1a2e,stroke:#e94560,color:#eee,stroke-width:1px
-    classDef output fill:#533483,stroke:#e94560,color:#fff,stroke-width:2px
-    class INPUT,RAW_DATA,KAGGLE,TRAIN_JSONL data
-    class OUTPUT,HP_WEIGHTS output
-```
----
-## Design Choices & Rationale
-### Why Flan-T5-Small?
-| Consideration | Decision |
-|---------------|----------|
-| **Hardware constraint** | RTX 3050 Laptop GPU (4GB VRAM) — rules out models > 500M params |
-| **Architecture** | Encoder-decoder (seq2seq) is ideal for text-to-text correction tasks |
-| **Instruction tuning** | Flan-T5 is instruction-tuned on 1,800+ tasks, so it follows correction prompts naturally |
-| **LoRA efficiency** | Only 1.28M trainable params (1.63%) — fits in 4GB with batch_size=4 + bf16 |
-### Why LoRA over Full Fine-Tuning?
-- **Memory**: Full fine-tuning of T5-Small requires ~2.5GB for gradients alone; LoRA needs ~200MB
-- **Speed**: LoRA converges in 5 epochs (~1,515 steps) on a single RTX 3050
-- **Merging**: LoRA weights merge into base model at inference time — zero latency overhead
-- **Configuration**: `r=8, alpha=16, dropout=0.05`, targeting all attention + FFN projections (`q, k, v, o, wi_0, wi_1, wo`)
-### Why a Combined Multi-Objective Loss?
-The system uses a 4-term loss function: `L = L_CE + 0.3·L_style + 0.5·L_semantic + 0.4·L_human`
-| Term | Purpose | Weight |
-|------|---------|--------|
-| `L_CE` | Standard cross-entropy token prediction | 1.0 |
-| `L_style` | `1 - cos_sim(output_style, input_style)` — preserves writing fingerprint | 0.3 |
-| `L_semantic` | `1 - cos_sim(input_embedding, output_embedding)` — preserves meaning | 0.5 |
-| `L_human` | `1 - HumanPatternClassifier(output)` — penalises AI-like text patterns | 0.4 |
-**Why these weights?** Style and human-pattern losses are auxiliary signals; set too high, they override grammar correction. The semantic loss carries the highest auxiliary weight (0.5) because meaning preservation is the hardest constraint to satisfy.
-### Why a Human Pattern Classifier?
-AI-generated text has detectable statistical signatures:
-- **Lower GPT-2 perplexity** (AI text is more "predictable")
-- **Lower burstiness** (AI has uniform sentence lengths; humans vary)
-- **Higher AI marker density** (overuse of "delve", "leverage", "furthermore")
-- **Lower n-gram novelty** (AI reuses phrases more)
-The classifier is a 3-layer MLP (17→128→64→1) pre-trained on ~100k samples from two Kaggle datasets (Shanegerami AI_Human.csv + Starblasters8), then **frozen** during main training. Its output score (0=AI, 1=human) is used as a reward signal.
-### Why Sentence-Chunked Inference?
-The model was trained with `max_input_length=128` tokens. The task prefix alone consumes ~40 tokens, leaving ~88 tokens for actual text. Long inputs are handled as follows:
-1. The text is split into sentences using spaCy
-2. Sentences are grouped into chunks that fit the 128-token budget
-3. Each chunk is corrected independently
-4. The results are joined back together
-This prevents the model from seeing out-of-distribution input lengths and avoids truncation artifacts.
-### Why Post-Generation Vocabulary Elevation?
-Rather than relying solely on the model to produce academic vocabulary (which T5-Small lacks the capacity for), we apply a separate **BERT-based lexical substitution** pipeline:
-1. POS-tag the output with spaCy
-2. Identify non-AWL content words (nouns, verbs, adjectives, adverbs)
-3. Mask each candidate → run BERT fill-mask → filter to AWL-only predictions
-4. Accept substitution only if `semantic_similarity > 0.82` (measured with `all-mpnet-base-v2`)
-5. Track used substitutions to prevent duplicate replacements
----
-## Quick Start
-### Prerequisites
-- Python ≥ 3.10
-- NVIDIA GPU with ≥ 4GB VRAM (or CPU, slower)
-- ~10GB disk space for models and datasets
-### Option A: Automated Training Pipeline
-```bash
-# Clone and setup
-git clone https://huggingface.co/morpheuslord/rewriter && cd rewriter
-pip install -r requirements.txt
-# Set W&B key (optional, for experiment tracking)
-export WANDB_API_KEY="your-key-here"
-# Run the full 5-stage pipeline
-bash train.sh
-```
-The orchestrator handles: **Setup → Preprocessing → Human Pattern Pre-training → Model Training → Evaluation**
-Each stage has a checkpoint system — if interrupted, re-run `train.sh` and select `[S]kip` for completed stages.
-### Option B: Manual Step-by-Step
-```bash
-# 1. Install dependencies
-pip install -r requirements.txt
-python -m spacy download en_core_web_sm
-# 2. Preprocess datasets (FCE, W&I+LOCNESS, JFLEG → unified JSONL)
-python scripts/preprocess_data.py
-# 3. Pre-train the human pattern classifier
-python scripts/pretrain_human_pattern_classifier.py
-# 4. Train the correction model
-PYTHONPATH=. python scripts/train.py --config configs/training_config.yaml --use-v2-loss
-# 5. Merge LoRA adapter into base model for inference
-python -c "
-from peft import PeftModel
-from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
-import torch
-model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-small', torch_dtype=torch.bfloat16)
-model = PeftModel.from_pretrained(model, 'checkpoints/checkpoint-BEST')
-model = model.merge_and_unload()
-model.save_pretrained('checkpoints/best_model_merged')
-AutoTokenizer.from_pretrained('google/flan-t5-small').save_pretrained('checkpoints/best_model_merged')
-"
-# 6. Run inference
-PYTHONPATH=. python scripts/run_inference.py --text "The studnet recieved alot of informtion."
-# 7. Or start the API server
-PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000
-```
----
-## Training Pipeline (5 Stages)
-### Stage 1: Setup & Dependencies
-Installs Python packages, downloads spaCy models (`en_core_web_sm`), and NLTK tokenizers.
-### Stage 2: Data Preprocessing
-Converts raw datasets into unified JSONL format:
-| Dataset | Source | Format | Pairs |
-|---------|--------|--------|-------|
-| **FCE v2.1** | BEA-2019 Shared Task | Character-level edits | ~28k |
-| **W&I+LOCNESS v2.1** | BEA-2019 Shared Task | Character-level edits | ~34k |
-| **JFLEG** | Johns Hopkins | 4 reference corrections per source | ~5k |
-Output schema: `{"input": "erroneous text", "target": "corrected text", "source": "fce|wi_locness|jfleg"}`
-Split: 90% train / 10% held out; half of the held-out data is used as the test set (capped at 500 examples) and the rest as validation.
-### Stage 3: Human Pattern Classifier Pre-Training
-Trains a binary MLP classifier on ~100k human vs AI text samples; the classifier is then frozen for main training. Uses 17 features:
-```
-[perplexity, burstiness, sentence_starter_diversity,
- bigram_novelty, trigram_novelty, 4gram_novelty,
- ai_marker_density, overused_discourse_density,
- em_dash_rate, ellipsis_rate, comma_rate, semicolon_rate,
- word_count, sentence_count, mean_sent_length, std_sent_length, ttr]
-```
-GPT-2 perplexity is computed in batched GPU forward passes. Text features are extracted in parallel via `ProcessPoolExecutor`.
-### Stage 4: Main Model Training
-Fine-tunes Flan-T5-Small with LoRA using the V2 combined loss. Key hyperparameters:
-| Parameter | Value |
-|-----------|-------|
-| Effective batch size | 32 (4 × 8 gradient accumulation) |
-| Learning rate | 3e-4 (cosine schedule, 5% warmup) |
-| Precision | bf16 (Ampere+ GPUs) |
-| Max input tokens | 128 |
-| Max target tokens | 128 |
-| Epochs | 5 |
-| Eval/Save interval | Every 100 steps |
-### Stage 5: Evaluation
-Runs on test set with metrics: GLEU, BERTScore F1, ERRANT F0.5, Style Similarity, AWL Coverage.
----
-## Inference Pipeline (7 Steps)
-```
-Raw Text
-  │
-  ▼
-1. Preprocessing ─────── LanguageTool spell correction + spaCy parsing
-  │
-  ▼
-2. Style Fingerprinting ─ Extract 41 features → MLP → 512-dim vector
-  │
-  ▼
-3. Sentence-Chunked Generation ─ Split into 128-token chunks → Flan-T5 → rejoin
-  │
-  ▼
-4. Post-Processing ───── Remove artifacts, replace em dashes, fix spacing
-  │
-  ▼
-5. Vocabulary Elevation ─ BERT fill-mask → AWL filtering → semantic gate
-  │
-  ▼
-6. Register Filtering ── Expand contractions, replace colloquialisms
-  │
-  ▼
-7. Metrics ──────────── Style similarity, AWL coverage, readability scores
-  │
-  ▼
-Corrected Text
-```
----
-## Configuration Reference
-### `configs/training_config.yaml`
-```yaml
-model:
-  key: "flan-t5-small"          # flan-t5-xl | flan-t5-large | flan-t5-base | flan-t5-small
-  quantize: false               # 4-bit NF4 quantization (needs GPU)
-  use_lora: true                # Parameter-efficient fine-tuning
-lora:
-  r: 8                          # LoRA rank (higher = more capacity, more VRAM)
-  lora_alpha: 16                # Scaling factor (usually 2×r)
-  lora_dropout: 0.05            # Regularisation
-  target_modules: [q, v, k, o, wi_0, wi_1, wo]  # All attention + FFN layers
-training:
-  per_device_train_batch_size: 4
-  gradient_accumulation_steps: 8  # Effective batch = 32
-  learning_rate: 3.0e-4
-  lr_scheduler_type: cosine
-  bf16: true                      # Use bfloat16 on Ampere+ GPUs
-loss:
-  lambda_style: 0.3              # Style preservation weight
-  lambda_semantic: 0.5           # Meaning preservation weight
-  lambda_human_pattern: 0.4      # Anti-AI penalty weight
-```
-### `configs/inference_config.yaml`
-```yaml
-model:
-  key: "flan-t5-small"
-  checkpoint_path: "checkpoints/best_model_merged"
-  use_lora: false                # Merged model — no adapter needed
-generation:
-  num_beams: 5                   # Beam search width
-  length_penalty: 1.2            # > 1.0 rewards longer outputs
-  no_repeat_ngram_size: 3        # Prevents repetition
-  max_new_tokens: 128            # Must match training max_target_length
-vocabulary:
-  semantic_threshold: 0.82       # Minimum cosine similarity for AWL substitution
-```
----
-## API Usage
-```bash
-# Start the server
-PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000
-# Correct text
-curl -X POST http://localhost:8000/correct \
-  -H "Content-Type: application/json" \
-  -d '{"text": "The studnet recieved alot of informtion.", "style_alpha": 0.6}'
-# Health check
-curl http://localhost:8000/health
-```
-Interactive docs available at `http://localhost:8000/docs`.
----
-## Hardware Requirements
-| Tier | GPU | Model | Training Time |
-|------|-----|-------|---------------|
-| **Tested** | RTX 3050 4GB | Flan-T5-Small + LoRA | ~45 min (5 epochs) |
-| Recommended | RTX 3090 24GB | Flan-T5-Base + LoRA | ~2h |
-| Maximum | A100 80GB | Flan-T5-XL + LoRA | ~12h |
-CPU inference is supported but significantly slower (~30s per correction vs ~2s on GPU).
----
-## Data Sources
-| Dataset | Type | Size | Source |
-|---------|------|------|--------|
-| FCE v2.1 | Learner errors + corrections | ~28k pairs | Cambridge English |
-| W&I+LOCNESS v2.1 | Learner errors + corrections | ~34k pairs | BEA-2019 Shared Task |
-| JFLEG | Fluency corrections (4 refs) | ~5k pairs | Johns Hopkins |
-| Shanegerami AI_Human.csv | Human vs AI classification | ~50k samples | Kaggle |
-| Starblasters8 data.parquet | Human vs AI classification | ~50k samples | Kaggle |
-| Coxhead AWL | Academic Word List | 570 families / 549 headwords | Victoria University |
----
-## Dyslexia Error Simulation
-The `DyslexiaSimulator` generates synthetic training data based on research by Rello et al. (2013, 2017):
-| Error Type | Frequency | Example |
-|-----------|-----------|---------|
-| Phonetic substitution | 35% | "because" → "becaus" |
-| Letter transposition | 18% | "the" → "teh" |
-| Letter omission | 16% | "important" → "importnt" |
-| Letter doubling | 12% | "letter" → "lettter" |
-| Letter reversal (b/d, p/q) | 10% | "bad" → "dad" |
-| Word boundary errors | 9% | "a lot" → "alot" |
----
-## Style Fingerprint Vector
-The 512-dimensional style vector captures 41 raw features:
-| Group | Features | Count |
-|-------|----------|-------|
-| Sentence stats | mean, std, skew of sentence lengths | 3 |
-| Word stats | mean, std of word lengths | 2 |
-| Lexical | type-token ratio, lexical density | 2 |
-| Syntactic | passive/active voice ratio, subordinate clause ratio, avg dependency tree depth | 4 |
-| Discourse | 20 academic discourse markers (per 100 words) | 20 |
-| Register | hedging frequency, formality score, nominalization ratio | 3 |
-| Readability | Flesch reading ease, avg syllables per word | 2 |
-| Pronouns | first-person ratio, third-person ratio | 2 |
-| Other | question ratio, exclamation ratio, AWL coverage | 3 |
-These are projected through a 2-layer MLP (`41 → 256 → 512`) with LayerNorm and GELU activation, then L2-normalised.
----
-## Known Limitations
-1. **Model capacity**: Flan-T5-Small (77M params) has limited correction ability compared to larger models
-2. **Training window**: 128-token max input means very long sentences may be split mid-clause
-3. **Vocabulary elevation**: BERT fill-mask can suggest semantically inappropriate AWL words; the similarity threshold (0.82) is a trade-off between coverage and accuracy
-4. **Already-correct text**: The model is trained on error→correction pairs; feeding it clean text produces unpredictable output
-5. **LanguageTool latency**: The first spell-correction call takes ~15-20s because of JVM startup
-6. **Semantic drift in correction**: Qualitative evaluation reveals that the pipeline can introduce meaning-level errors rather than purely correcting surface errors. For example, dyslexic phonetic patterns misread by LanguageTool produce plausible-but-wrong word substitutions that corrupt the intended meaning. The Style Similarity metric (0.96) does not capture this failure mode, as it compares surface-level stylistic features rather than semantic faithfulness. Future work should add **BERTScore F1** and **Word Error Rate (WER)** against ground-truth corrections as primary evaluation signals, and a dedicated post-correction **semantic faithfulness check** (cosine similarity between input and output sentence embeddings) to flag and reject meaning drift before returning output.

 ---
+base_model: google/flan-t5-small
+library_name: peft
 tags:
+- base_model:adapter:google/flan-t5-small
 - lora
+- transformers
 ---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]
+### Framework versions
+- PEFT 0.19.1
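
The README removed in this commit defined the combined training loss as `L = L_CE + 0.3·L_style + 0.5·L_semantic + 0.4·L_human`, with each auxiliary term written as one minus a similarity or classifier score. A minimal scalar sketch of that formula follows. The function name `combined_loss` and its argument names are hypothetical, and the actual training code (not shown in this diff) presumably operates on tensors. Whether the r=16 run in this commit kept the same weights is not stated.

```python
def combined_loss(l_ce, style_cos, semantic_cos, human_score,
                  lambda_style=0.3, lambda_semantic=0.5, lambda_human=0.4):
    """Scalar version of the four-term loss from the removed README."""
    l_style = 1.0 - style_cos        # style fingerprint drift
    l_semantic = 1.0 - semantic_cos  # meaning drift
    l_human = 1.0 - human_score      # classifier score: 0 = AI-like, 1 = human-like
    return (l_ce
            + lambda_style * l_style
            + lambda_semantic * l_semantic
            + lambda_human * l_human)


# With perfect auxiliary scores, only the cross-entropy term remains.
print(combined_loss(l_ce=1.2, style_cos=1.0, semantic_cos=1.0, human_score=1.0))
```

With all auxiliary scores at their worst (0.0) and zero cross-entropy, the loss equals the sum of the three weights, 1.2, which bounds how much the auxiliary terms can contribute relative to `L_CE`.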

adapter_config.json ADDED

	@@ -0,0 +1,48 @@

+{
+  "alora_invocation_tokens": null,
+  "alpha_pattern": {},
+  "arrow_config": null,
+  "auto_mapping": null,
+  "base_model_name_or_path": "google/flan-t5-small",
+  "bias": "none",
+  "corda_config": null,
+  "ensure_weight_tying": false,
+  "eva_config": null,
+  "exclude_modules": null,
+  "fan_in_fan_out": false,
+  "inference_mode": true,
+  "init_lora_weights": true,
+  "layer_replication": null,
+  "layers_pattern": null,
+  "layers_to_transform": null,
+  "loftq_config": {},
+  "lora_alpha": 32,
+  "lora_bias": false,
+  "lora_dropout": 0.05,
+  "lora_ga_config": null,
+  "megatron_config": null,
+  "megatron_core": "megatron.core",
+  "modules_to_save": null,
+  "peft_type": "LORA",
+  "peft_version": "0.19.1",
+  "qalora_group_size": 16,
+  "r": 16,
+  "rank_pattern": {},
+  "revision": null,
+  "target_modules": [
+    "wo",
+    "o",
+    "q",
+    "wi_0",
+    "v",
+    "k",
+    "wi_1"
+  ],
+  "target_parameters": null,
+  "task_type": "SEQ_2_SEQ_LM",
+  "trainable_token_indices": null,
+  "use_bdlora": null,
+  "use_dora": false,
+  "use_qalora": false,
+  "use_rslora": false
+}
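
The adapter config above sets `r=16` and `lora_alpha=32` over seven target modules. As a rough cross-check of the adapter size, the sketch below counts the LoRA parameters this implies. The layer dimensions (`d_model=512`, attention inner dimension 384, `d_ff=1024`, 8 encoder and 8 decoder blocks) are assumptions taken from the public `google/flan-t5-small` configuration and do not appear in this diff.

```python
# Assumed flan-t5-small dimensions (not part of this diff).
D_MODEL, D_INNER, D_FF = 512, 384, 1024
ENC_LAYERS, DEC_LAYERS = 8, 8

# Values from adapter_config.json.
R, LORA_ALPHA = 16, 32


def lora_params(in_features, out_features, r=R):
    # LoRA adds A (r x in) and B (out x r) per adapted linear layer.
    return r * (in_features + out_features)


attention = 3 * lora_params(D_MODEL, D_INNER) + lora_params(D_INNER, D_MODEL)  # q, k, v, o
ffn = 2 * lora_params(D_MODEL, D_FF) + lora_params(D_FF, D_MODEL)              # wi_0, wi_1, wo

encoder = ENC_LAYERS * (attention + ffn)      # self-attention + FFN per block
decoder = DEC_LAYERS * (2 * attention + ffn)  # self- and cross-attention + FFN per block

total = encoder + decoder
scaling = LORA_ALPHA / R  # standard LoRA scaling, since use_rslora is false

print(total, total * 4, scaling)  # prints 2555904 10223616 2.0
```

Under these assumptions the adapter holds about 2.56M parameters, or roughly 10.2 MB if stored as 32-bit floats. That is close to the 10,264,128-byte size recorded in the Git LFS pointer for `adapter_model.safetensors`, with the remainder plausibly being file metadata. The same formula with `r=8` gives about 1.28M, which matches the trainable-parameter figure in the removed README.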

adapter_model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:739806c54db7ce3ca21af4278e4160f3ed7feff9f6e09ad03beae7b26aa457c4
+size 10264128

tokenizer.json ADDED

The diff for this file is too large to render.

tokenizer_config.json ADDED

	@@ -0,0 +1,114 @@

+{
+  "backend": "tokenizers",
+  "eos_token": "</s>",
+  "extra_ids": 100,
+  "extra_special_tokens": [
+    "<extra_id_0>",
+    "<extra_id_1>",
+    "<extra_id_2>",
+    "<extra_id_3>",
+    "<extra_id_4>",
+    "<extra_id_5>",
+    "<extra_id_6>",
+    "<extra_id_7>",
+    "<extra_id_8>",
+    "<extra_id_9>",
+    "<extra_id_10>",
+    "<extra_id_11>",
+    "<extra_id_12>",
+    "<extra_id_13>",
+    "<extra_id_14>",
+    "<extra_id_15>",
+    "<extra_id_16>",
+    "<extra_id_17>",
+    "<extra_id_18>",
+    "<extra_id_19>",
+    "<extra_id_20>",
+    "<extra_id_21>",
+    "<extra_id_22>",
+    "<extra_id_23>",
+    "<extra_id_24>",
+    "<extra_id_25>",
+    "<extra_id_26>",
+    "<extra_id_27>",
+    "<extra_id_28>",
+    "<extra_id_29>",
+    "<extra_id_30>",
+    "<extra_id_31>",
+    "<extra_id_32>",
+    "<extra_id_33>",
+    "<extra_id_34>",
+    "<extra_id_35>",
+    "<extra_id_36>",
+    "<extra_id_37>",
+    "<extra_id_38>",
+    "<extra_id_39>",
+    "<extra_id_40>",
+    "<extra_id_41>",
+    "<extra_id_42>",
+    "<extra_id_43>",
+    "<extra_id_44>",
+    "<extra_id_45>",
+    "<extra_id_46>",
+    "<extra_id_47>",
+    "<extra_id_48>",
+    "<extra_id_49>",
+    "<extra_id_50>",
+    "<extra_id_51>",
+    "<extra_id_52>",
+    "<extra_id_53>",
+    "<extra_id_54>",
+    "<extra_id_55>",
+    "<extra_id_56>",
+    "<extra_id_57>",
+    "<extra_id_58>",
+    "<extra_id_59>",
+    "<extra_id_60>",
+    "<extra_id_61>",
+    "<extra_id_62>",
+    "<extra_id_63>",
+    "<extra_id_64>",
+    "<extra_id_65>",
+    "<extra_id_66>",
+    "<extra_id_67>",
+    "<extra_id_68>",
+    "<extra_id_69>",
+    "<extra_id_70>",
+    "<extra_id_71>",
+    "<extra_id_72>",
+    "<extra_id_73>",
+    "<extra_id_74>",
+    "<extra_id_75>",
+    "<extra_id_76>",
+    "<extra_id_77>",
+    "<extra_id_78>",
+    "<extra_id_79>",
+    "<extra_id_80>",
+    "<extra_id_81>",
+    "<extra_id_82>",
+    "<extra_id_83>",
+    "<extra_id_84>",
+    "<extra_id_85>",
+    "<extra_id_86>",
+    "<extra_id_87>",
+    "<extra_id_88>",
+    "<extra_id_89>",
+    "<extra_id_90>",
+    "<extra_id_91>",
+    "<extra_id_92>",
+    "<extra_id_93>",
+    "<extra_id_94>",
+    "<extra_id_95>",
+    "<extra_id_96>",
+    "<extra_id_97>",
+    "<extra_id_98>",
+    "<extra_id_99>"
+  ],
+  "is_local": false,
+  "local_files_only": false,
+  "model_max_length": 512,
+  "pad_token": "<pad>",
+  "sp_model_kwargs": {},
+  "tokenizer_class": "T5Tokenizer",
+  "unk_token": "<unk>"
+}
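
The 100 `extra_special_tokens` entries above follow a single pattern, so the list can be regenerated from the `extra_ids` count rather than maintained by hand. A small sketch, with variable names chosen only for illustration:

```python
# Rebuild the sentinel-token list from the "extra_ids" count in tokenizer_config.json.
extra_ids = 100
extra_special_tokens = [f"<extra_id_{i}>" for i in range(extra_ids)]

print(extra_special_tokens[0], extra_special_tokens[-1], len(extra_special_tokens))
```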