morpheuslord committed on
Commit c54b88c · verified · 1 Parent(s): 9603c33

Add files using upload-large-folder tool

Files changed (1)
  1. README.md +121 -0
README.md CHANGED
@@ -122,6 +122,126 @@ Rewriter/
  └── pyproject.toml # Project metadata
  ```

+ ## Model Architecture
+ ```mermaid
+ graph TB
+     %% ── Inference Pipeline (left-to-right flow) ──────────────────────
+     subgraph INFERENCE["🔮 Inference Pipeline"]
+         direction TB
+         INPUT["📝 Raw Dyslectic Text"]
+
+         subgraph PREPROCESS["Pre-Processing"]
+             SPELL["Spell Corrector<br/><i>dyslexia-aware phonetic</i>"]
+             SENT_SEG["Sentence Segmenter"]
+             DEP_PARSE["Dependency Parser"]
+             NER["NER Tagger"]
+         end
+
+         subgraph STYLE["Style Analysis"]
+             FINGER["Style Fingerprinter<br/><i>512-dim vector</i>"]
+             EMOTION["Emotion Classifier"]
+             FORMALITY["Formality Classifier"]
+             STYLE_VEC["Style Vector Composer"]
+         end
+
+         subgraph GENERATION["Core Generation"]
+             STYLE_COND["Style Conditioner<br/><i>prefix tuning</i>"]
+             BASE_MODEL["Base LM<br/><i>Flan-T5 / BART / Llama-3</i>"]
+             LORA["LoRA Adapter"]
+             GEN_UTILS["Generation Utils<br/><i>beam search, sampling</i>"]
+         end
+
+         subgraph POSTPROCESS["Post-Processing"]
+             POSTPROC["Post-Processor<br/><i>formatting, cleanup</i>"]
+             VOCAB_SUB["Lexical Substitution<br/><i>BERT-based</i>"]
+             AWL["AWL Loader<br/><i>Coxhead Academic Word List</i>"]
+             REG_FILTER["Register Filter<br/><i>academic tone gate</i>"]
+         end
+
+         OUTPUT["✅ Corrected Academic Text"]
+
+         INPUT --> SPELL --> SENT_SEG --> DEP_PARSE --> NER
+         INPUT --> FINGER --> EMOTION --> FORMALITY --> STYLE_VEC
+         NER --> STYLE_COND
+         STYLE_VEC --> STYLE_COND
+         STYLE_COND --> BASE_MODEL
+         LORA -.->|"merged weights"| BASE_MODEL
+         BASE_MODEL --> GEN_UTILS --> POSTPROC
+         POSTPROC --> VOCAB_SUB
+         AWL --> VOCAB_SUB
+         VOCAB_SUB --> REG_FILTER --> OUTPUT
+     end
+
+     %% ── Training Pipeline ────────────────────────────────────────────
+     subgraph TRAINING["🏋️ Training Pipeline"]
+         direction TB
+
+         subgraph DATA["Data Pipeline"]
+             RAW_DATA["Raw Datasets<br/><i>JFLEG, WI+LOCNESS, C4_200M,<br/>FCE, Lang-8, NUCLE</i>"]
+             KAGGLE["Kaggle Datasets<br/><i>Shanegerami, Starblasters8</i>"]
+             PREPROC_SCRIPT["preprocess_data.py"]
+             TRAIN_JSONL["train.jsonl / val.jsonl / test.jsonl"]
+         end
+
+         subgraph HP_PRETRAIN["Human Pattern Pre-Training"]
+             FEAT_EXTRACT["Feature Extractor<br/><i>17-dim: perplexity, burstiness,<br/>n-gram novelty, AI markers...</i>"]
+             GPT2["GPT-2<br/><i>perplexity scorer</i>"]
+             HP_CLASSIFIER["Human Pattern Classifier<br/><i>MLP: 17→128→64→1</i>"]
+             HP_WEIGHTS["human_pattern_classifier.pt"]
+         end
+
+         subgraph MAIN_TRAIN["Main Model Training"]
+             DATASET["WritingCorrectionDataset"]
+             COMBINED_LOSS["Combined Loss Function"]
+             L_CE["L_CE<br/><i>cross-entropy</i>"]
+             L_STYLE["λ₁ · L_style<br/><i>style consistency</i>"]
+             L_SEM["λ₂ · L_semantic<br/><i>meaning preservation</i>"]
+             L_HUMAN["λ₃ · L_human_pattern<br/><i>anti-AI penalty</i>"]
+             TRAINER["CorrectionTrainer"]
+             CALLBACKS["Callbacks<br/><i>StyleMetrics,<br/>EarlyStoppingOnStyleDrift</i>"]
+         end
+
+         subgraph EVAL["Evaluation"]
+             ERRANT["ERRANT Evaluator<br/><i>P / R / F₀.₅</i>"]
+             GLEU["GLEU Scorer"]
+             STYLE_MET["Style Metrics<br/><i>cosine similarity</i>"]
+             AUTH_VER["Authorship Verifier<br/><i>AI detection resistance</i>"]
+         end
+
+         RAW_DATA --> PREPROC_SCRIPT --> TRAIN_JSONL
+         KAGGLE --> FEAT_EXTRACT
+         GPT2 --> FEAT_EXTRACT --> HP_CLASSIFIER --> HP_WEIGHTS
+         TRAIN_JSONL --> DATASET --> TRAINER
+         L_CE --> COMBINED_LOSS
+         L_STYLE --> COMBINED_LOSS
+         L_SEM --> COMBINED_LOSS
+         HP_WEIGHTS -.->|"frozen"| L_HUMAN --> COMBINED_LOSS
+         COMBINED_LOSS --> TRAINER
+         CALLBACKS --> TRAINER
+         TRAINER --> EVAL
+     end
+
+     %% ── API Layer ────────────────────────────────────────────────────
+     subgraph API["🌐 FastAPI Server"]
+         ENDPOINT["/correct endpoint"]
+         SCHEMAS["Request / Response Schemas"]
+         MIDDLEWARE["Rate Limiting & CORS"]
+         CORRECTOR["Corrector<br/><i>orchestrates full pipeline</i>"]
+     end
+
+     ENDPOINT --> CORRECTOR --> INFERENCE
+     TRAINER -->|"best_model/"| BASE_MODEL
+
+     %% ── Styling ──────────────────────────────────────────────────────
+     classDef pipeline fill:#1a1a2e,stroke:#16213e,color:#e94560,stroke-width:2px
+     classDef module fill:#0f3460,stroke:#533483,color:#e2e2e2,stroke-width:1px
+     classDef data fill:#1a1a2e,stroke:#e94560,color:#eee,stroke-width:1px
+     classDef output fill:#533483,stroke:#e94560,color:#fff,stroke-width:2px
+
+     class INPUT,RAW_DATA,KAGGLE,TRAIN_JSONL data
+     class OUTPUT,HP_WEIGHTS output
+ ```
+
  ---

  ## Design Choices & Rationale
@@ -467,3 +587,4 @@ These are projected through a 2-layer MLP (`41 → 256 → 512`) with LayerNorm
  3. **Vocabulary elevation**: BERT fill-mask can suggest semantically inappropriate AWL words; the similarity threshold (0.82) is a trade-off between coverage and accuracy
  4. **Already-correct text**: The model is trained on error→correction pairs; feeding it clean text produces unpredictable output
  5. **LanguageTool latency**: Spell correction takes ~15-20s due to JVM startup on first call
+ 6. **Semantic drift in correction**: Qualitative evaluation reveals that the pipeline can introduce meaning-level errors instead of only correcting surface errors. For example, dyslexic phonetic patterns misread by LanguageTool produce plausible-but-wrong word substitutions that corrupt the intended meaning. The Style Similarity metric (0.96) does not capture this failure mode, as it measures surface token overlap rather than semantic faithfulness. Future work should add **BERTScore F1** and **Word Error Rate (WER)** against ground-truth corrections as primary evaluation signals, and a dedicated post-correction **semantic faithfulness check** (cosine similarity between input and output sentence embeddings) to flag and reject meaning drift before returning output.
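
The future work named in limitation 6 could be prototyped roughly as follows. This is a minimal sketch and not code from the repository: `word_error_rate`, `cosine`, `is_faithful`, and the `0.9` threshold are hypothetical names and values chosen for illustration. The `bag_of_words` embedder is only a toy stand-in so the example runs without dependencies; it still measures token overlap, so a real check would pass a sentence-embedding model as `embed`.

```python
import math
from collections import Counter
from typing import Callable, Dict

Vector = Dict[str, float]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def bag_of_words(text: str) -> Vector:
    """Toy embedder (assumption): lowercase token counts, not real semantics."""
    return {tok: float(n) for tok, n in Counter(text.lower().split()).items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_faithful(
    source: str,
    corrected: str,
    embed: Callable[[str], Vector] = bag_of_words,
    threshold: float = 0.9,  # hypothetical cut-off, would need tuning
) -> bool:
    """Return True when the corrected text stays close to the source."""
    return cosine(embed(source), embed(corrected)) >= threshold
```

A caller would run `is_faithful(raw_text, corrected_text)` after post-processing and fall back to the input, or flag the response, when it returns `False`. `word_error_rate` is meant for offline evaluation against ground-truth corrections, not for the request path.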