morpheuslord committed on
Commit
cd777c7
·
verified ·
1 Parent(s): 62a2bda

Auto-upgrade v3: composite 0.8634 | GLEU 0.7593 | BERTScore 0.9758 | 1-WER 0.8552 | r=16, 256-token ctx, C4-GEC data, faithfulness gate

Browse files
Files changed (4)
  1. README.md +106 -522
  2. adapter_config.json +3 -3
  3. adapter_model.safetensors +1 -1
  4. tokenizer.json +1 -1
README.md CHANGED
@@ -1,622 +1,206 @@
1
  ---
2
- language:
3
- - en
4
  tags:
5
- - text2text-generation
6
- - dyslexia
7
- - grammar-correction
8
- - style-preservation
9
  - lora
10
- - flan-t5
11
- license: mit
12
- base_model: google/flan-t5-small
13
- datasets:
14
- - jhu-clsp/jfleg
15
- - bea2019st/wi_locness
16
- pipeline_tag: translation
17
  ---
18
 
19
- # Dyslexia Academic Writing Correction System
20
 
21
- > **A style-preserving, grammar-correcting, academic-vocabulary-elevating AI system that corrects dyslexic writing while maintaining the author's personal voice, tone, and authorship signal — not a rewriter, a corrector.**
22
 
23
- ## Overview
24
 
25
- This system takes text written by dyslexic students and corrects grammar, spelling, and fluency errors while:
26
 
27
- 1. **Preserving the author's unique writing style** via a 512-dimensional style fingerprint vector
28
- 2. **Elevating vocabulary to academic register** using Coxhead's Academic Word List (AWL) and BERT-based lexical substitution
29
- 3. **Resisting AI detection** through a frozen Human Pattern Classifier that penalises AI-typical writing during training
30
- 4. **Maintaining semantic meaning** with cosine-similarity-based semantic preservation loss
31
 
32
- The core model is **Google Flan-T5-Small** fine-tuned with **LoRA** (Low-Rank Adaptation, r=16), trained on real learner error corpora (JFLEG, W&I+LOCNESS) augmented with synthetic dyslexia-simulated data.
33
 
34
- ---
35
 
36
- ## Latest Evaluation Results (v2)
37
 
38
- | Metric | Score | Description |
39
- |--------|-------|-------------|
40
- | **GLEU** | **0.7506** | Grammar + fluency correction quality |
41
- | **BERTScore F1** | **0.9733** | Semantic closeness to reference corrections |
42
- | **1 − WER** | **0.8488** | Word-level accuracy (WER = 15.12%) |
43
- | **Composite** | **0.8576** | `(GLEU + BERTScore F1 + (1−WER)) / 3` — gating score for Hub push |
44
 
45
- > The model is only pushed to the Hub when the composite score strictly beats the saved baseline from the previous run, ensuring the Hub always holds the best-seen weights.
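For illustration, a minimal sketch of the composite-and-gate logic (assuming a `composite` key inside `baseline_score.json`; the actual code in `train_and_upgrade.py` may differ in detail):

```python
# Sketch only: composite score and Hub-push gate, not the repository's exact code.
import json
from pathlib import Path

def composite(gleu: float, bertscore_f1: float, wer: float) -> float:
    """Composite = (GLEU + BERTScore F1 + (1 - WER)) / 3."""
    return (gleu + bertscore_f1 + (1.0 - wer)) / 3.0

def beats_baseline(score: float, baseline_path: str = "baseline_score.json") -> bool:
    """Push to the Hub only if the new composite strictly beats the saved baseline."""
    path = Path(baseline_path)
    if not path.exists():
        return True  # no baseline yet: the first run always qualifies
    baseline = json.loads(path.read_text()).get("composite", 0.0)
    return score > baseline

# v2 numbers from the table above: composite ≈ 0.8576
new_score = composite(gleu=0.7506, bertscore_f1=0.9733, wer=0.1512)
print(f"composite={new_score:.4f} push={beats_baseline(new_score)}")
```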
46
 
47
- ---
48
 
49
- ## What Changed in v2
50
 
51
- The original model had a critical bug: `CorrectionTrainer.compute_loss()` only used cross-entropy loss. The multi-objective loss (`L_CE + λ_style + λ_semantic + λ_human`) was fully designed in `loss_functions.py` but was **never wired into the trainer**. v2 fixes this and upgrades several other parameters.
 
 
52
 
53
- | Parameter | v1 (Original) | v2 (Upgraded) |
54
- |-----------|--------------|---------------|
55
- | LoRA rank | r=8, α=16 | **r=16, α=32** |
56
- | Epochs | 5 | **10** |
57
- | Effective batch size | 32 (4×8 accum) | **64 (2×32 accum)** |
58
- | Learning rate | 3e-4 | **2e-4** (more stable over longer run) |
59
- | Warmup ratio | 5% | **10%** |
60
- | Label smoothing | none | **0.1** (reduces overconfidence) |
61
- | Loss function | CE only *(bug)* | **CE + Style + Semantic** *(fixed)* |
62
- | Human-pattern loss | designed, unused | omitted on CPU; falls back to CE+style+sem |
63
- | Evaluation | GLEU only | **GLEU + BERTScore F1 + (1−WER) composite** |
64
- | Eval/save strategy | every 100 steps | **per epoch** |
65
- | Early stopping | none | **patience=3** |
66
- | Hub gate | none | **composite must beat saved baseline** |
67
- | Warm-start strategy | cold start | **merge r=8 adapter → apply fresh r=16 LoRA** |
68
- | Data split | 90%/10% train/val | **88%/7%/5% train/val/test** |
69
- | Dyslexia augmentation error rate | 15% | **20%** |
70
 
71
- ### Combined Loss (v2)
72
 
73
- ```
74
- L = L_CE + 0.3·L_style + 0.5·L_semantic
75
- ```
76
 
77
- The human-pattern loss (`λ₃·L_human`) is kept in the design but skipped on CPU (requires GPT-2 perplexity scoring). Style and semantic losses use a lightweight `StyleMLP` — no spaCy or external models required at training time.
78
 
79
- | Term | Purpose | Weight |
80
- |------|---------|--------|
81
- | `L_CE` | Cross-entropy with label smoothing (0.1) | 1.0 |
82
- | `L_style` | `1 − cos_sim(style(input), style(output))` — preserves writing fingerprint | 0.3 |
83
- | `L_semantic` | `1 − cos_sim(input_emb, output_emb)` — preserves meaning | 0.5 |
84
- | `L_human` | `1 − HumanPatternClassifier(output)` — anti-AI penalty | 0.4 *(GPU only)* |
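A minimal sketch of how these terms combine (illustrative only; tensor names and the pooled style/semantic vectors are assumptions — the real implementation lives in `loss_functions.py` and the `StyleMLP`):

```python
# Sketch of the v2 combined loss: CE with label smoothing plus cosine-based
# style and semantic terms.
import torch
import torch.nn.functional as F

def combined_loss(lm_logits, labels, style_in, style_out, sem_in, sem_out,
                  lambda_style=0.3, lambda_semantic=0.5, label_smoothing=0.1):
    # L_CE: token-level cross-entropy, ignoring padded label positions (-100)
    l_ce = F.cross_entropy(
        lm_logits.view(-1, lm_logits.size(-1)), labels.view(-1),
        ignore_index=-100, label_smoothing=label_smoothing,
    )
    # L_style / L_semantic: 1 - cosine similarity between input and output vectors
    l_style = 1.0 - F.cosine_similarity(style_in, style_out, dim=-1).mean()
    l_semantic = 1.0 - F.cosine_similarity(sem_in, sem_out, dim=-1).mean()
    return l_ce + lambda_style * l_style + lambda_semantic * l_semantic
```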
85
 
86
- ### Warm-Start Merge Strategy
87
 
88
- Rather than fine-tuning from scratch, v2 preserves the corrections learned by the original r=8 adapter:
89
 
90
- 1. Load existing LoRA adapter (r=8) from Hub
91
- 2. Merge adapter weights into base model (`merge_and_unload()`)
92
- 3. Apply a **fresh LoRA at r=16** on top of the merged base
93
- 4. Train with combined loss for 10 epochs
94
 
95
- This doubles the adapter's representational capacity while retaining previously learned correction patterns.
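A sketch of those four steps with PEFT (the adapter repo id below is an assumption; the actual logic is in `train_and_upgrade.py`):

```python
# Warm-start merge sketch: fold the existing r=8 adapter into the base model,
# then attach a fresh r=16 LoRA for continued training.
from transformers import AutoModelForSeq2SeqLM
from peft import PeftModel, LoraConfig, get_peft_model

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
warm = PeftModel.from_pretrained(base, "morpheuslord/rewrite")  # existing r=8 adapter (assumed repo id)
merged = warm.merge_and_unload()                                 # steps 1-2

config = LoraConfig(                                             # step 3
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q", "k", "v", "o", "wi_0", "wi_1", "wo"],
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(merged, config)
model.print_trainable_parameters()                               # step 4: train with the combined loss
```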
96
 
97
- ---
98
 
99
- ## Features
100
-
101
- | Feature | Description |
102
- |---------|-------------|
103
- | **Two-pass spell correction** | Dyslexia-aware phonetic pattern handling via LanguageTool |
104
- | **Style fingerprinting** | 41 raw features → MLP → 512-dim L2-normalised style vector |
105
- | **LoRA fine-tuning** | r=16, α=32, dropout=0.05 — targeting all attention + FFN projections |
106
- | **Academic vocabulary elevation** | BERT fill-mask → AWL candidate filtering → semantic similarity gate |
107
- | **Human pattern anti-AI loss** | Pre-trained frozen MLP classifier (17-dim features including GPT-2 perplexity) |
108
- | **Combined training loss** | `L_CE + λ₁·L_style + λ₂·L_semantic (+ λ₃·L_human on GPU)` |
109
- | **Sentence-chunked inference** | Long texts split into 128-token chunks matching training window |
110
- | **FastAPI server** | RESTful `/correct` endpoint with CORS and rate limiting |
111
- | **Multi-stage training** | Orchestrated via `train.sh` with checkpoint system (Skip/Redo/Continue) |
112
- | **Synthetic data augmentation** | `DyslexiaSimulator` generates realistic errors from clean text (20% error rate) |
113
- | **Composite score gating** | Hub push only if new model strictly beats saved baseline |
114
 
115
- ---
116
 
117
- ## Project Structure
118
-
119
- ```
120
- Rewriter/
121
- ├── configs/
122
- │ ├── training_config.yaml # Full training hyperparameters
123
- │ ├── training_config_fast.yaml # Quick iteration config
124
- │ ├── inference_config.yaml # Inference & generation settings
125
- │ ├── model_config.yaml # Model architecture registry
126
- │ └── awl_config.yaml # Academic Word List settings
127
- ├── scripts/
128
- │ ├── train.py # Main training script (Click CLI)
129
- │ ├── evaluate.py # Test set evaluation (GLEU, ERRANT, BERTScore)
130
- │ ├── run_inference.py # Interactive CLI inference
131
- │ ├── preprocess_data.py # Raw datasets → unified JSONL
132
- │ ├── pretrain_human_pattern_classifier.py # Stage 3: anti-AI classifier
133
- │ ├── download_datasets.sh # BEA-2019 dataset downloader
134
- │ └── download_kaggle_datasets.sh # Kaggle human/AI data downloader
135
- ├── src/
136
- │ ├── model/
137
- │ │ ├── base_model.py # Model loader (T5/BART/Llama + LoRA + quantization)
138
- │ │ ├── style_conditioner.py # Prefix tuning: style → virtual tokens
139
- │ │ ├── generation_utils.py # Beam search, sampling, batch generation
140
- │ │ └── lora_adapter.py # LoRA configuration helpers
141
- │ ├── preprocessing/
142
- │ │ ├── pipeline.py # Full preprocessing orchestrator
143
- │ │ ├── spell_corrector.py # LanguageTool + dyslexia-aware correction
144
- │ │ ├── dyslexia_simulator.py # Synthetic error generation (Rello et al.)
145
- │ │ ├── dependency_parser.py # spaCy dependency tree analysis
146
- │ │ ├── ner_tagger.py # Named entity protection
147
- │ │ └── sentence_segmenter.py # Sentence boundary detection
148
- │ ├── style/
149
- │ │ ├── fingerprinter.py # 41 features → 512-dim style vector
150
- │ │ ├── style_vector.py # Style vector dataclass
151
- │ │ ├── formality_classifier.py # Rule-based formality scoring
152
- │ │ └── emotion_classifier.py # Emotion detection
153
- │ ├── training/
154
- │ │ ├── dataset.py # Pre-tokenized cached dataset with style vectors
155
- │ │ ├── trainer.py # CorrectionTrainer (HF Trainer + PEFT fixes)
156
- │ │ ├── loss_functions.py # V1 and V2 combined losses
157
- │ │ ├── human_pattern_extractor.py # 17-dim feature extraction + classifier
158
- │ │ └── callbacks.py # Evaluation logging callbacks
159
- │ ├── vocabulary/
160
- │ │ ├── lexical_substitution.py # BERT fill-mask → AWL substitution pipeline
161
- │ │ ├── awl_loader.py # Coxhead Academic Word List loader
162
- │ │ └── register_filter.py # Contraction expansion + colloquial replacement
163
- │ ├── inference/
164
- │ │ ├── corrector.py # End-to-end inference pipeline orchestrator
165
- │ │ └── postprocessor.py # Cleanup, entity restore, formatting
166
- │ ├── evaluation/
167
- │ │ ├── gleu_scorer.py # GLEU + BERTScore computation
168
- │ │ ├── errant_evaluator.py # ERRANT P/R/F0.5 evaluation
169
- │ │ ├── style_metrics.py # Style similarity + AWL coverage
170
- │ │ └── authorship_verifier.py # AI detection resistance testing
171
- │ └── api/
172
- │ ├── main.py # FastAPI application
173
- │ ├── schemas.py # Pydantic request/response models
174
- │ └── middleware.py # Rate limiting + CORS
175
- ├── train_and_upgrade.py # v2 upgrade pipeline (self-improving Hub push)
176
- ├── data/
177
- │ ├── raw/ # Original datasets (JFLEG, W&I+LOCNESS)
178
- │ ├── processed/ # Unified JSONL (train/val/test splits)
179
- │ ├── cache/ # Pre-tokenized dataset caches (.pt files)
180
- │ └── awl/ # Coxhead Academic Word List
181
- ├── train.sh # Multi-stage training orchestrator
182
- ├── start.sh # Inference launcher (CLI or API mode)
183
- ├── baseline_score.json # Saved composite score — gate for Hub push
184
- ├── Dockerfile # Production container
185
- ├── docker-compose.yml # Docker deployment
186
- ├── requirements.txt # Python dependencies
187
- └── pyproject.toml # Project metadata
188
- ```
189
 
190
- ---
191
 
192
- ## Model Architecture
193
-
194
- ### PNG:
195
- ![Architecture](arch.png)
196
-
197
- ### Mermaid Diagram:
198
- ```mermaid
199
- graph TB
200
- subgraph INFERENCE["🔮 Inference Pipeline"]
201
- direction TB
202
- INPUT["📝 Raw Dyslexic Text"]
203
- subgraph PREPROCESS["Pre-Processing"]
204
- SPELL["Spell Corrector<br/><i>dyslexia-aware phonetic</i>"]
205
- SENT_SEG["Sentence Segmenter"]
206
- DEP_PARSE["Dependency Parser"]
207
- NER["NER Tagger"]
208
- end
209
- subgraph STYLE["Style Analysis"]
210
- FINGER["Style Fingerprinter<br/><i>512-dim vector</i>"]
211
- EMOTION["Emotion Classifier"]
212
- FORMALITY["Formality Classifier"]
213
- STYLE_VEC["Style Vector Composer"]
214
- end
215
- subgraph GENERATION["Core Generation"]
216
- STYLE_COND["Style Conditioner<br/><i>prefix tuning</i>"]
217
- BASE_MODEL["Base LM<br/><i>Flan-T5-Small (warm-merged)</i>"]
218
- LORA["LoRA Adapter<br/><i>r=16</i>"]
219
- GEN_UTILS["Generation Utils<br/><i>beam search, sampling</i>"]
220
- end
221
- subgraph POSTPROCESS["Post-Processing"]
222
- POSTPROC["Post-Processor<br/><i>formatting, cleanup</i>"]
223
- VOCAB_SUB["Lexical Substitution<br/><i>BERT-based</i>"]
224
- AWL["AWL Loader<br/><i>Coxhead Academic Word List</i>"]
225
- REG_FILTER["Register Filter<br/><i>academic tone gate</i>"]
226
- end
227
- OUTPUT["✅ Corrected Academic Text"]
228
- INPUT --> SPELL --> SENT_SEG --> DEP_PARSE --> NER
229
- INPUT --> FINGER --> EMOTION --> FORMALITY --> STYLE_VEC
230
- NER --> STYLE_COND
231
- STYLE_VEC --> STYLE_COND
232
- STYLE_COND --> BASE_MODEL
233
- LORA -.->|"merged weights"| BASE_MODEL
234
- BASE_MODEL --> GEN_UTILS --> POSTPROC
235
- POSTPROC --> VOCAB_SUB
236
- AWL --> VOCAB_SUB
237
- VOCAB_SUB --> REG_FILTER --> OUTPUT
238
- end
239
-
240
- subgraph TRAINING["🏋️ Training Pipeline (v2)"]
241
- direction TB
242
- subgraph WARMSTART["Warm-Start Merge"]
243
- HUB_ADAPTER["Hub LoRA Adapter<br/><i>r=8 (existing)</i>"]
244
- MERGE["merge_and_unload()"]
245
- FRESH_LORA["Fresh LoRA r=16"]
246
- end
247
- subgraph DATA["Data Pipeline"]
248
- JFLEG["jhu-clsp/jfleg<br/><i>~5k pairs, 4 refs each</i>"]
249
- WILOCNESS["bea2019st/wi_locness<br/><i>~34k pairs</i>"]
250
- DYSLEXIA_AUG["DyslexiaSimulator<br/><i>20% error rate augmentation</i>"]
251
- SPLIT["88% train / 7% val / 5% test"]
252
- end
253
- subgraph LOSS["Combined Loss (v2 — now active)"]
254
- L_CE["L_CE + label_smoothing=0.1"]
255
- L_STYLE["0.3 · L_style"]
256
- L_SEM["0.5 · L_semantic"]
257
- L_HUMAN["0.4 · L_human<br/><i>(GPU only)</i>"]
258
- end
259
- subgraph EVAL["Composite Evaluation"]
260
- GLEU_E["GLEU"]
261
- BERT_E["BERTScore F1"]
262
- WER_E["1 − WER"]
263
- COMPOSITE["Composite = mean(3)"]
264
- GATE["Beat baseline?"]
265
- HUB_PUSH["Push to Hub ✅"]
266
- end
267
- HUB_ADAPTER --> MERGE --> FRESH_LORA
268
- JFLEG --> DYSLEXIA_AUG
269
- WILOCNESS --> DYSLEXIA_AUG
270
- DYSLEXIA_AUG --> SPLIT
271
- L_CE --> COMPOSITE
272
- L_STYLE --> COMPOSITE
273
- L_SEM --> COMPOSITE
274
- GLEU_E --> COMPOSITE
275
- BERT_E --> COMPOSITE
276
- WER_E --> COMPOSITE
277
- COMPOSITE --> GATE --> HUB_PUSH
278
- end
279
- ```
280
 
281
- ---
282
 
283
- ## Design Choices & Rationale
284
 
285
- ### Why Flan-T5-Small?
286
 
287
- | Consideration | Decision |
288
- |---------------|----------|
289
- | **Hardware constraint** | RTX 3050 Laptop GPU (4GB VRAM) — rules out models > 500M params |
290
- | **Architecture** | Encoder-decoder (seq2seq) is ideal for text-to-text correction tasks |
291
- | **Instruction tuning** | Flan-T5 is pre-trained on 1,800+ instruction tasks — follows correction prompts naturally |
292
- | **LoRA efficiency** | Trainable params scale with r: r=16 → ~2.56M (3.3%) — still fits in 4GB |
293
 
294
- ### Why LoRA over Full Fine-Tuning?
295
 
296
- - **Memory**: Full fine-tuning of T5-Small requires ~2.5GB for gradients alone; LoRA r=16 needs ~400MB
297
- - **Warm-start safety**: Merging r=8 weights preserves corrections before expanding capacity to r=16
298
- - **Merging**: LoRA weights merge into base model at inference time — zero latency overhead
299
- - **Configuration**: `r=16, alpha=32, dropout=0.05`, targeting all attention + FFN projections (`q, k, v, o, wi_0, wi_1, wo`)
300
 
301
- ### Why a Combined Multi-Objective Loss?
302
 
303
- The system uses (on CPU): `L = L_CE + 0.3·L_style + 0.5·L_semantic`
304
 
305
- On GPU (with human-pattern classifier available): `L = L_CE + 0.3·L_style + 0.5·L_semantic + 0.4·L_human`
306
 
307
- | Term | Purpose | Weight |
308
- |------|---------|--------|
309
- | `L_CE` | Cross-entropy with label smoothing (0.1) | 1.0 |
310
- | `L_style` | `1 − cos_sim(style(input), style(output))` — preserves writing fingerprint | 0.3 |
311
- | `L_semantic` | `1 − cos_sim(input_emb, output_emb)` — preserves meaning | 0.5 |
312
- | `L_human` | `1 − HumanPatternClassifier(output)` — penalises AI-like text patterns | 0.4 |
313
 
314
- The style and semantic losses use a lightweight `StyleMLP` (mean-pooled token embeddings followed by a linear projection) that adds no external dependencies at training time.
315
 
316
- ### Why a Human Pattern Classifier?
317
 
318
- AI-generated text has detectable statistical signatures:
319
- - **Lower GPT-2 perplexity** (AI text is more "predictable")
320
- - **Lower burstiness** (AI has uniform sentence lengths; humans vary)
321
- - **Higher AI marker density** (overuse of "delve", "leverage", "furthermore")
322
- - **Lower n-gram novelty** (AI reuses phrases more)
323
 
324
- The classifier is a 3-layer MLP (17→128→64→1) pre-trained on ~100k samples from two Kaggle datasets (Shanegerami AI_Human.csv + Starblasters8), then **frozen** during main training. Its output score (0=AI, 1=human) is used as a reward signal. It requires a GPU for GPT-2 perplexity scoring and falls back gracefully on CPU.
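As a shape-level sketch (layer sizes follow the description above; activation choices and training details are assumptions):

```python
# 17 stylometric features -> 128 -> 64 -> 1 human-likeness score (0 = AI, 1 = human).
import torch
import torch.nn as nn

class HumanPatternClassifier(nn.Module):
    def __init__(self, in_features: int = 17):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.net(features).squeeze(-1)

clf = HumanPatternClassifier().eval()
clf.requires_grad_(False)  # frozen during main training; output used only as a reward signal
```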
325
 
326
- ### Why Sentence-Chunked Inference?
327
 
328
- The model was trained with `max_input_length=128` tokens. The task prefix alone consumes ~40 tokens, leaving ~86 tokens for actual text. Long inputs are:
329
 
330
- 1. Split into sentences using spaCy
331
- 2. Grouped into chunks that fit the 128-token budget
332
- 3. Each chunk is corrected independently
333
- 4. Results are joined back together
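A minimal sketch of that budgeting step (assuming spaCy sentence splitting and the Flan-T5 tokenizer; the production logic lives in `src/inference/corrector.py`):

```python
# Group sentences into chunks that fit the 128-token training window after the task prefix.
import spacy
from transformers import AutoTokenizer

nlp = spacy.load("en_core_web_sm")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

def chunk_sentences(text: str, budget: int = 128, prefix_tokens: int = 40) -> list[str]:
    limit = budget - prefix_tokens
    chunks, current, current_len = [], [], 0
    for sent in nlp(text).sents:
        n = len(tokenizer.encode(sent.text, add_special_tokens=False))
        if current and current_len + n > limit:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent.text)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks  # each chunk is corrected independently, then the results are rejoined
```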
334
 
335
- ### Why Post-Generation Vocabulary Elevation?
336
 
337
- Rather than relying solely on the model to produce academic vocabulary (which T5-Small lacks the capacity for), a separate BERT-based lexical substitution pipeline is applied:
338
 
339
- 1. POS-tag the output with spaCy
340
- 2. Identify non-AWL content words (nouns, verbs, adjectives, adverbs)
341
- 3. Mask each candidate → run BERT fill-mask → filter to AWL-only predictions
342
- 4. Accept substitution only if `semantic_similarity > 0.82` (measured with `all-mpnet-base-v2`)
343
- 5. Track used substitutions to prevent duplicate replacements
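A simplified sketch of the substitution gate (the tiny AWL set and model choices below are placeholders; the full pipeline, including POS filtering and duplicate tracking, is in `src/vocabulary/lexical_substitution.py`):

```python
# BERT fill-mask candidates -> keep AWL words only -> accept if sentence similarity > 0.82.
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
embedder = SentenceTransformer("all-mpnet-base-v2")
awl = {"derive", "obtain", "significant", "analyse"}  # stand-in for the full Coxhead AWL

def elevate_word(sentence: str, word: str, threshold: float = 0.82) -> str:
    masked = sentence.replace(word, fill_mask.tokenizer.mask_token, 1)
    for cand in fill_mask(masked, top_k=20):
        token = cand["token_str"].strip()
        if token.lower() not in awl:
            continue
        candidate = sentence.replace(word, token, 1)
        sim = util.cos_sim(embedder.encode(sentence), embedder.encode(candidate)).item()
        if sim > threshold:
            return candidate  # semantically safe academic substitution
    return sentence  # no acceptable AWL substitute found
```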
344
 
345
- ---
346
 
347
- ## Quick Start
348
 
349
- ### Prerequisites
350
 
351
- - Python 3.10
352
- - NVIDIA GPU with ≥ 4GB VRAM (or CPU, slower)
353
- - ~10GB disk space for models and datasets
354
 
355
- ### Option A: Self-Improving Upgrade Pipeline (v2)
356
 
357
- This pipeline loads the existing Hub adapter, upgrades it, evaluates, and only pushes if it improves.
358
 
359
- ```bash
360
- git clone https://huggingface.co/morpheuslord/rewrite && cd rewrite
361
- pip install -r requirements.txt
362
 
363
- export HF_TOKEN="your-hf-token-with-write-access"
364
- python train_and_upgrade.py
365
- ```
366
 
367
- The pipeline handles all 10 steps automatically:
368
- **Load adapter → Warm-start merge → Apply r=16 LoRA → Load data → Train → Evaluate → Gate → Save → Merge → Push**
369
 
370
- ### Option B: Manual Step-by-Step (original pipeline)
371
 
372
- ```bash
373
- # 1. Install dependencies
374
- pip install -r requirements.txt
375
- python -m spacy download en_core_web_sm
376
 
377
- # 2. Preprocess datasets (FCE, W&I+LOCNESS, JFLEG → unified JSONL)
378
- python scripts/preprocess_data.py
379
 
380
- # 3. Pre-train the human pattern classifier
381
- python scripts/pretrain_human_pattern_classifier.py
382
 
383
- # 4. Train the correction model
384
- PYTHONPATH=. python scripts/train.py --config configs/training_config.yaml --use-v2-loss
385
 
386
- # 5. Merge LoRA adapter into base model for inference
387
- python -c "
388
- from peft import PeftModel
389
- from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
390
- import torch
391
- model = AutoModelForSeq2SeqLM.from_pretrained('google/flan-t5-small', torch_dtype=torch.bfloat16)
392
- model = PeftModel.from_pretrained(model, 'checkpoints/checkpoint-BEST')
393
- model = model.merge_and_unload()
394
- model.save_pretrained('checkpoints/best_model_merged')
395
- AutoTokenizer.from_pretrained('google/flan-t5-small').save_pretrained('checkpoints/best_model_merged')
396
- "
397
 
398
- # 6. Run inference
399
- PYTHONPATH=. python scripts/run_inference.py --text "The studnet recieved alot of informtion."
400
 
401
- # 7. Or start the API server
402
- PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000
403
- ```
404
 
405
- ---
406
 
407
- ## Training Pipeline
408
-
409
- ### v2 Upgrade Pipeline (`train_and_upgrade.py`) — 10 Steps
410
-
411
- | Step | Action |
412
- |------|--------|
413
- | 1 | Load existing LoRA adapter (r=8) from Hub |
414
- | 2 | Merge into base weights (`merge_and_unload`) — warm start |
415
- | 3 | Apply fresh LoRA r=16 on merged base |
416
- | 4 | Load JFLEG + W&I+LOCNESS; augment with DyslexiaSimulator (20% error rate) |
417
- | 5 | Train with combined loss for 10 epochs, early stopping patience=3 |
418
- | 6 | Evaluate on test set: GLEU + BERTScore F1 + (1−WER) |
419
- | 7 | Compare composite score against `baseline_score.json` |
420
- | 8 | If improved: save LoRA adapter |
421
- | 9 | Merge adapter → save full model |
422
- | 10 | Push adapter + merged model to Hub; update baseline |
423
-
424
- ### v1 Original Pipeline (`train.sh`) — 5 Stages
425
-
426
- | Stage | Action |
427
- |-------|--------|
428
- | 1 | Setup & Dependencies |
429
- | 2 | Data Preprocessing (FCE + W&I+LOCNESS + JFLEG → JSONL) |
430
- | 3 | Human Pattern Classifier Pre-Training |
431
- | 4 | Main Model Training (LoRA r=8, 5 epochs, CE only) |
432
- | 5 | Evaluation (GLEU only) |
433
 
434
- ---
435
 
436
- ## Hyperparameter Reference
437
-
438
- ### v2 (`train_and_upgrade.py`)
439
-
440
- ```python
441
- LORA_R = 16
442
- LORA_ALPHA = 32
443
- LORA_DROPOUT = 0.05
444
- TARGET_MODULES = ["q", "v", "k", "o", "wi_0", "wi_1", "wo"]
445
-
446
- EPOCHS = 10
447
- BATCH_SIZE = 2 # per device
448
- GRAD_ACCUM = 32 # effective batch = 64
449
- LR = 2e-4
450
- WARMUP_RATIO = 0.10
451
- LABEL_SMOOTHING = 0.1
452
- MAX_INPUT_LEN = 128
453
- MAX_TARGET_LEN = 128
454
-
455
- LAMBDA_STYLE = 0.3
456
- LAMBDA_SEMANTIC = 0.5
457
- LAMBDA_HUMAN = 0.4 # GPU only
458
- ```
459
-
460
- ### v1 (`configs/training_config.yaml`)
461
-
462
- ```yaml
463
- lora:
464
- r: 8
465
- lora_alpha: 16
466
- lora_dropout: 0.05
467
- target_modules: [q, v, k, o, wi_0, wi_1, wo]
468
-
469
- training:
470
- per_device_train_batch_size: 4
471
- gradient_accumulation_steps: 8 # effective batch = 32
472
- learning_rate: 3.0e-4
473
- lr_scheduler_type: cosine
474
- bf16: true
475
-
476
- loss:
477
- lambda_style: 0.3
478
- lambda_semantic: 0.5
479
- lambda_human_pattern: 0.4
480
- ```
481
-
482
- ### `configs/inference_config.yaml`
483
-
484
- ```yaml
485
- model:
486
- key: "flan-t5-small"
487
- checkpoint_path: "checkpoints/best_model_merged"
488
- use_lora: false
489
-
490
- generation:
491
- num_beams: 5
492
- length_penalty: 1.2
493
- no_repeat_ngram_size: 3
494
- max_new_tokens: 128
495
-
496
- vocabulary:
497
- semantic_threshold: 0.82
498
- ```
499
 
500
- ---
501
 
502
- ## Inference Pipeline (7 Steps)
503
-
504
- ```
505
- Raw Text
506
-
507
-
508
- 1. Preprocessing ─────── LanguageTool spell correction + spaCy parsing
509
-
510
-
511
- 2. Style Fingerprinting ─ Extract 41 features → MLP → 512-dim vector
512
-
513
-
514
- 3. Sentence-Chunked Generation ─ Split into 128-token chunks → Flan-T5 → rejoin
515
-
516
-
517
- 4. Post-Processing ───── Remove artifacts, replace em dashes, fix spacing
518
-
519
-
520
- 5. Vocabulary Elevation ─ BERT fill-mask → AWL filtering → semantic gate (threshold 0.82)
521
-
522
-
523
- 6. Register Filtering ── Expand contractions, replace colloquialisms
524
-
525
-
526
- 7. Metrics ──────────── Style similarity, AWL coverage, readability scores
527
-
528
-
529
- Corrected Text
530
- ```
531
 
532
- ---
533
 
534
- ## API Usage
535
 
536
- ```bash
537
- # Start the server
538
- PYTHONPATH=. python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000
539
 
540
- # Correct text
541
- curl -X POST http://localhost:8000/correct \
542
- -H "Content-Type: application/json" \
543
- -d '{"text": "The studnet recieved alot of informtion.", "style_alpha": 0.6}'
544
 
545
- # Health check
546
- curl http://localhost:8000/health
547
- ```
548
 
549
- Interactive docs at `http://localhost:8000/docs`.
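The same call from Python, for reference (`requests` is assumed to be installed; it is not necessarily listed in `requirements.txt`):

```python
import requests

resp = requests.post(
    "http://localhost:8000/correct",
    json={"text": "The studnet recieved alot of informtion.", "style_alpha": 0.6},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```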
550
 
551
- ---
552
 
553
- ## Hardware Requirements
554
 
555
- | Tier | GPU | LoRA Config | Training Time |
556
- |------|-----|-------------|---------------|
557
- | **Tested (v1)** | RTX 3050 4GB | r=8, 5 epochs | ~45 min |
558
- | **Tested (v2 CPU)** | None (HF Space CPU Basic) | r=16, 10 epochs | ~12–24 hours |
559
- | Recommended | RTX 3090 24GB | r=16, 10 epochs + human-pattern loss | ~2–3h |
560
- | Maximum | A100 80GB | Full pipeline with GPT-2 perplexity scoring | ~12h |
561
 
562
- ---
563
 
564
- ## Data Sources
565
 
566
- | Dataset | Type | Size | Access |
567
- |---------|------|------|--------|
568
- | JFLEG (`jhu-clsp/jfleg`) | Fluency corrections (4 refs each) | ~5k pairs | HF Hub, no registration |
569
- | W&I+LOCNESS (`bea2019st/wi_locness`) | Learner errors + corrections | ~34k pairs | HF Hub, no registration |
570
- | FCE v2.1 | Learner errors + corrections | ~28k pairs | BEA-2019 (registration required) |
571
- | Shanegerami AI_Human.csv | Human vs AI classification | ~50k samples | Kaggle |
572
- | Starblasters8 data.parquet | Human vs AI classification | ~50k samples | Kaggle |
573
- | Coxhead AWL | Academic Word List | 570 families / 549 headwords | Victoria University |
574
 
575
- > Note: `train_and_upgrade.py` uses only JFLEG + W&I+LOCNESS (freely accessible via HF Hub). FCE and Kaggle datasets are used in the full manual pipeline only.
576
 
577
- ---
578
 
579
- ## Dyslexia Error Simulation
580
 
581
- The `DyslexiaSimulator` generates synthetic training data based on research by Rello et al. (2013, 2017). v2 uses a 20% per-word error rate (up from 15%).
582
 
583
- | Error Type | Frequency | Example |
584
- |-----------|-----------|---------|
585
- | Phonetic substitution | 35% | "because" → "becaus" |
586
- | Letter transposition | 18% | "the" → "teh" |
587
- | Letter omission | 16% | "important" → "importnt" |
588
- | Letter doubling | 12% | "letter" → "lettter" |
589
- | Letter reversal (b/d, p/q) | 10% | "bad" → "dad" |
590
- | Word boundary errors | 9% | "a lot" → "alot" |
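A toy sketch of how errors could be sampled at the v2 rate (the real phonetic, reversal, and boundary patterns come from the Rello et al. tables inside `DyslexiaSimulator`; only a few mechanical error types are shown here):

```python
import random

ERROR_WEIGHTS = {"phonetic": 0.35, "transposition": 0.18, "omission": 0.16,
                 "doubling": 0.12, "reversal": 0.10, "boundary": 0.09}

def corrupt(word: str, rng: random.Random) -> str:
    kind = rng.choices(list(ERROR_WEIGHTS), weights=list(ERROR_WEIGHTS.values()))[0]
    if kind == "transposition" and len(word) > 3:
        i = rng.randrange(len(word) - 1)
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if kind == "omission" and len(word) > 3:
        i = rng.randrange(len(word))
        return word[:i] + word[i + 1:]
    if kind == "doubling":
        i = rng.randrange(len(word))
        return word[:i] + word[i] + word[i:]
    return word  # phonetic, reversal, and boundary errors need the full lookup tables

def simulate(text: str, error_rate: float = 0.20, seed: int = 0) -> str:
    rng = random.Random(seed)
    return " ".join(corrupt(w, rng) if rng.random() < error_rate else w
                    for w in text.split())

print(simulate("The important letter arrived because the teacher asked for it"))
```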
591
 
592
- ---
593
 
594
- ## Style Fingerprint Vector
595
 
596
- The 512-dimensional style vector captures 41 raw features:
597
 
598
- | Group | Features | Count |
599
- |-------|----------|-------|
600
- | Sentence stats | mean, std, skew of sentence lengths | 3 |
601
- | Word stats | mean, std of word lengths | 2 |
602
- | Lexical | type-token ratio, lexical density | 2 |
603
- | Syntactic | passive/active voice ratio, subordinate clause ratio, avg dependency tree depth | 4 |
604
- | Discourse | 20 academic discourse markers (per 100 words) | 20 |
605
- | Register | hedging frequency, formality score, nominalization ratio | 3 |
606
- | Readability | Flesch reading ease, avg syllables per word | 2 |
607
- | Pronouns | first-person ratio, third-person ratio | 2 |
608
- | Other | question ratio, exclamation ratio, AWL coverage | 3 |
609
 
610
- Projected through a 2-layer MLP (`41 → 256 → 512`) with LayerNorm and GELU activation, then L2-normalised.
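A shape-level sketch of that projection (the module name and exact layer ordering are assumptions; the real code is in `src/style/fingerprinter.py`):

```python
# 41 raw stylometric features -> 256 -> 512, LayerNorm + GELU, then L2-normalised.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleProjector(nn.Module):
    def __init__(self, in_features: int = 41, hidden: int = 256, out_features: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, hidden), nn.LayerNorm(hidden), nn.GELU(),
            nn.Linear(hidden, out_features),
        )

    def forward(self, raw_features: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(raw_features), p=2, dim=-1)  # unit-length style vector

vec = StyleProjector()(torch.randn(1, 41))
print(vec.shape, vec.norm(dim=-1))  # torch.Size([1, 512]) tensor([1.])
```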
611
 
612
- ---
613
 
614
- ## Known Limitations
 
615
 
616
- 1. **Model capacity**: Flan-T5-Small (77M params) has limited correction ability compared to larger models. Doubling LoRA rank (r=8 → r=16) partially addresses this.
617
- 2. **Training window**: 128-token max input means very long sentences may be split mid-clause.
618
- 3. **Vocabulary elevation**: BERT fill-mask can suggest semantically inappropriate AWL words; the 0.82 similarity threshold is a trade-off between coverage and accuracy.
619
- 4. **Already-correct text**: The model is trained on error→correction pairs; feeding it clean text produces unpredictable output.
620
- 5. **LanguageTool latency**: Spell correction takes ~15–20s due to JVM startup on first call.
621
- 6. **Human-pattern loss on CPU**: The GPT-2 perplexity-based loss is skipped on CPU for performance. Full loss is only active on GPU.
622
- <!-- 7. **Semantic drift in correction**: The pipeline can introduce meaning-level errors — dyslexic phonetic patterns misread by LanguageTool can produce plausible-but-wrong word substitutions. BERTScore F1 and WER (now primary evaluation signals in v2) help detect but don't eliminate this. A dedicated post-correction semantic faithfulness check remains a future improvement. -->
 
1
  ---
2
+ base_model: google/flan-t5-small
3
+ library_name: peft
4
  tags:
5
+ - base_model:adapter:google/flan-t5-small
 
 
 
6
  - lora
7
+ - transformers
8
  ---
9
 
10
+ # Model Card for Model ID
11
 
12
+ <!-- Provide a quick summary of what the model is/does. -->
13
 
 
14
 
 
15
 
16
+ ## Model Details
 
 
 
17
 
18
+ ### Model Description
19
 
20
+ <!-- Provide a longer summary of what this model is. -->
21
 
 
22
 
 
 
 
 
 
 
23
 
24
+ - **Developed by:** [More Information Needed]
25
+ - **Funded by [optional]:** [More Information Needed]
26
+ - **Shared by [optional]:** [More Information Needed]
27
+ - **Model type:** [More Information Needed]
28
+ - **Language(s) (NLP):** [More Information Needed]
29
+ - **License:** [More Information Needed]
30
+ - **Finetuned from model [optional]:** [More Information Needed]
31
 
32
+ ### Model Sources [optional]
33
 
34
+ <!-- Provide the basic links for the model. -->
35
 
36
+ - **Repository:** [More Information Needed]
37
+ - **Paper [optional]:** [More Information Needed]
38
+ - **Demo [optional]:** [More Information Needed]
39
 
40
+ ## Uses
41
 
42
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
43
 
44
+ ### Direct Use
 
 
45
 
46
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
47
 
48
+ [More Information Needed]
 
 
 
 
 
49
 
50
+ ### Downstream Use [optional]
51
 
52
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
53
 
54
+ [More Information Needed]
 
 
 
55
 
56
+ ### Out-of-Scope Use
57
 
58
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
59
 
60
+ [More Information Needed]
61
 
62
+ ## Bias, Risks, and Limitations
63
 
64
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
65
 
66
+ [More Information Needed]
67
 
68
+ ### Recommendations
69
 
70
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
71
 
72
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
73
 
74
+ ## How to Get Started with the Model
75
 
76
+ Use the code below to get started with the model.
 
 
 
 
 
77
 
78
+ [More Information Needed]
79
 
80
+ ## Training Details
 
 
 
81
 
82
+ ### Training Data
83
 
84
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
85
 
86
+ [More Information Needed]
87
 
88
+ ### Training Procedure
 
 
 
 
 
89
 
90
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
91
 
92
+ #### Preprocessing [optional]
93
 
94
+ [More Information Needed]
 
 
 
 
95
 
 
96
 
97
+ #### Training Hyperparameters
98
 
99
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
100
 
101
+ #### Speeds, Sizes, Times [optional]
 
 
 
102
 
103
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
104
 
105
+ [More Information Needed]
106
 
107
+ ## Evaluation
 
 
 
 
108
 
109
+ <!-- This section describes the evaluation protocols and provides the results. -->
110
 
111
+ ### Testing Data, Factors & Metrics
112
 
113
+ #### Testing Data
114
 
115
+ <!-- This should link to a Dataset Card if possible. -->
 
 
116
 
117
+ [More Information Needed]
118
 
119
+ #### Factors
120
 
121
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
 
 
122
 
123
+ [More Information Needed]
 
 
124
 
125
+ #### Metrics
 
126
 
127
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
128
 
129
+ [More Information Needed]
 
 
 
130
 
131
+ ### Results
 
132
 
133
+ [More Information Needed]
 
134
 
135
+ #### Summary
 
136
137
 
 
 
138
 
139
+ ## Model Examination [optional]
 
 
140
 
141
+ <!-- Relevant interpretability work for the model goes here -->
142
 
143
+ [More Information Needed]
144
 
145
+ ## Environmental Impact
146
 
147
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
148
 
149
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
150
 
151
+ - **Hardware Type:** [More Information Needed]
152
+ - **Hours used:** [More Information Needed]
153
+ - **Cloud Provider:** [More Information Needed]
154
+ - **Compute Region:** [More Information Needed]
155
+ - **Carbon Emitted:** [More Information Needed]
156
 
157
+ ## Technical Specifications [optional]
158
 
159
+ ### Model Architecture and Objective
160
 
161
+ [More Information Needed]
 
 
162
 
163
+ ### Compute Infrastructure
 
 
 
164
 
165
+ [More Information Needed]
 
 
166
 
167
+ #### Hardware
168
 
169
+ [More Information Needed]
170
 
171
+ #### Software
172
 
173
+ [More Information Needed]
 
 
 
 
 
174
 
175
+ ## Citation [optional]
176
 
177
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
178
 
179
+ **BibTeX:**
180
 
181
+ [More Information Needed]
182
 
183
+ **APA:**
184
 
185
+ [More Information Needed]
186
 
187
+ ## Glossary [optional]
188
 
189
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
190
 
191
+ [More Information Needed]
192
 
193
+ ## More Information [optional]
194
 
195
+ [More Information Needed]
196
 
197
+ ## Model Card Authors [optional]
198
 
199
+ [More Information Needed]
200
 
201
+ ## Model Card Contact
202
 
203
+ [More Information Needed]
204
+ ### Framework versions
205
 
206
+ - PEFT 0.19.1
adapter_config.json CHANGED
@@ -30,12 +30,12 @@
30
  "rank_pattern": {},
31
  "revision": null,
32
  "target_modules": [
33
- "wo",
34
  "o",
35
- "q",
36
  "wi_0",
37
- "v",
38
  "k",
 
 
 
39
  "wi_1"
40
  ],
41
  "target_parameters": null,
 
30
  "rank_pattern": {},
31
  "revision": null,
32
  "target_modules": [
 
33
  "o",
 
34
  "wi_0",
 
35
  "k",
36
+ "q",
37
+ "v",
38
+ "wo",
39
  "wi_1"
40
  ],
41
  "target_parameters": null,
adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:739806c54db7ce3ca21af4278e4160f3ed7feff9f6e09ad03beae7b26aa457c4
3
  size 10264128
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:77d31009b5285f236ff82ba88fdcf0f9370c49456f83335a50b89d0130fff9e5
3
  size 10264128
tokenizer.json CHANGED
@@ -2,7 +2,7 @@
2
  "version": "1.0",
3
  "truncation": {
4
  "direction": "Right",
5
- "max_length": 128,
6
  "strategy": "LongestFirst",
7
  "stride": 0
8
  },
 
2
  "version": "1.0",
3
  "truncation": {
4
  "direction": "Right",
5
+ "max_length": 256,
6
  "strategy": "LongestFirst",
7
  "stride": 0
8
  },