deploy at 2026-01-05 20:05:15.533029
CLAUDE.md
ADDED
@@ -0,0 +1,87 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is an NLP-based readability feedback tool that helps writers improve their text by identifying sentence segments where related words are separated by long interruptions. The tool is based on the linguistic principle of Dependency Length Minimisation (DLM), which suggests that keeping syntactically related words close together reduces cognitive load and improves readability.

The application is built with FastHTML, uses spaCy for dependency parsing, and is deployed as a Docker container suitable for Hugging Face Spaces.

## Development Commands

### Setup
```bash
# Install dependencies (including the spaCy model)
pip install -r requirements.txt
```

### Running the Application
```bash
# Run the web app locally (starts on port 5001 by default)
python main.py
```

### Docker
```bash
# Build the Docker image
docker build -t readability-feedback .

# Run the container
docker run -p 7860:7860 readability-feedback
```

## Architecture

### Core NLP Pipeline (main.py:29-220)

The application uses a custom dependency-parsing approach that converts spaCy's output to the Surface Syntactic Universal Dependencies (SUD) framework:

1. **sudify(doc)** (main.py:29-157): Transforms spaCy dependency relations into SUD-style relations with four labels:
   - `subj`: subject relations (nsubj, nsubjpass, csubj, etc.)
   - `comp`: complement relations (dobj, ccomp, xcomp, aux, etc.)
   - `mod`: modifier relations (advmod, advcl, relcl, etc.)
   - `udep`: underspecified dependencies (acl, amod, nmod, etc.)

   The function performs several transformations:
   - Reverses auxiliary/mark/case dependencies to create more linguistically accurate heads
   - Handles semicolons/colons as clause separators (main.py:111-133)
   - Classifies prepositional phrases as modifiers or complements based on position
   - Identifies oblique clauses (ccomp that should be mod)

2. **flyover(token)** (main.py:160-176): For each token with a `subj` or `comp` relation, calculates the "flyover" span (the words between head and dependent) and its complexity (the number of heads in the span). This implements the revised dependency-length metric from Yadav et al. (2022).

3. **get_fluff(doc)** (main.py:179-219): Identifies all maximal flyover spans that should be highlighted in the output. Uses a filtering algorithm to:
   - Remove nested flyovers (keep only the outermost or most complex ones)
   - Collect interstices (non-flyover spans) to reconstruct the full sentence
   - Return a sorted list of (span, complexity) tuples for rendering
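The flyover computation in step 2 can be sketched in pure Python, independently of spaCy. The function names and the exact reading of "number of heads in the span" are illustrative assumptions, not the code in main.py:

```python
# Minimal sketch of the flyover metric, independent of spaCy.
# A sentence is represented as a list `heads`, where heads[i] is the
# index of token i's head (-1 for the root). Complexity is approximated
# here as the number of tokens inside the span that head another token
# inside the span -- an assumption about the metric's exact definition.

def flyover_span(dep_idx, head_idx):
    """Token indices strictly between the dependent and its head."""
    lo, hi = sorted((dep_idx, head_idx))
    return range(lo + 1, hi)

def flyover_complexity(dep_idx, head_idx, heads):
    """Count tokens inside the flyover that act as heads within it."""
    span = set(flyover_span(dep_idx, head_idx))
    return sum(1 for i in span if any(heads[j] == i for j in span))
```

A dependency spanning positions 1 to 4 flies over tokens 2 and 3; only the internal head structure of that gap contributes to the complexity score.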
### Web Application (main.py:222-410)

Built with FastHTML, a modern Python web framework:

- **index()** route (main.py:230-254): Main page with a textarea input and an output display area, using HTMX for dynamic updates
- **about()** route (main.py:257-365): Documentation page explaining the linguistic theory and the technical implementation
- **send(text)** route (main.py:368-406): POST endpoint that:
  - Splits the input into paragraphs
  - Processes each paragraph through the NLP pipeline
  - Returns HTML with highlighted spans, using opacity proportional to flyover complexity
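The paragraph splitting in send() uses the regex that appears in main.py: collapse runs of non-newline whitespace, then split on blank lines (CRLF pairs, as submitted by a browser form). Wrapped as a standalone helper (the function name is ours):

```python
import re

def split_paragraphs(text: str) -> list[str]:
    # [^\S\r\n] matches whitespace except CR/LF, so spaces and tabs
    # collapse to a single space while line breaks are preserved;
    # paragraphs are then separated by blank lines (CRLF CRLF).
    return re.sub(r"[^\S\r\n]+", " ", text).split("\r\n\r\n")
```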
### Key Implementation Details

- The app uses spaCy's `en_core_web_sm` model (loaded at main.py:6)
- Highlighting uses a background color with opacity scaled by dependency distance: `rgba(237, 201, 241, {a[1]/5})` in light mode
- The sudify function handles complex edge cases, such as:
  - Multiple subjects (main.py:96-103): only the closest subject is kept as `subj`; the others become `comp`
  - Parenthetical expressions: semicolons/colons are ignored if wrapped in parentheses (main.py:117-128)
  - Post-core conjunctions: conj dependents after the core clause become modifiers (main.py:134-150)
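The opacity rule above can be sketched as a small style helper. The `/5` scaling and the RGB triple come from main.py; the clamping to 1.0 and the function name are assumptions for illustration:

```python
# Light-mode highlight colour from main.py.
LIGHT_RGB = (237, 201, 241)

def highlight_style(complexity: int) -> str:
    """CSS for a flyover span; opacity grows with complexity.

    The cap at 1.0 is an assumption -- main.py may handle large
    complexities differently.
    """
    r, g, b = LIGHT_RGB
    alpha = min(complexity / 5, 1.0)
    return f"background-color: rgba({r}, {g}, {b}, {alpha})"
```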
## Notes on Modification

When modifying the dependency transformation logic (the sudify function), be aware that:
- The order of transformations matters: some rules depend on earlier transformations
- The function modifies the spaCy Doc object in place by reassigning `.head` and `.dep_` attributes
- Edge-case handling for punctuation and parentheses is critical for accurate parsing
- The three-pass structure (main.py:29-55, 56-156, 92-156) progressively refines the dependency tree

The highlighting algorithm (get_fluff) uses span-comparison logic that can be subtle to debug. The key insight is that overlapping flyovers are resolved by keeping the longer or more complex one.
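That overlap-resolution rule can be illustrated with a small pure-Python sketch. This is not the code in get_fluff; the function name and the exact tie-break (length first, then complexity) are assumptions:

```python
def resolve_overlaps(spans):
    """Keep only non-overlapping flyover spans.

    spans: list of ((start, end), complexity) with `end` exclusive.
    Longer spans win; complexity breaks ties between equal lengths.
    """
    ordered = sorted(
        spans,
        key=lambda s: (s[0][1] - s[0][0], s[1]),
        reverse=True,
    )
    kept = []
    for (start, end), cx in ordered:
        # Two half-open intervals overlap iff each starts before the
        # other ends.
        clashes = any(start < k_end and k_start < end
                      for (k_start, k_end), _ in kept)
        if not clashes:
            kept.append(((start, end), cx))
    return sorted(kept)
```

Here a span nested inside a longer one is discarded even if it is more complex, matching the "outermost wins" behaviour described above.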
main.py
CHANGED

```diff
@@ -229,14 +229,23 @@ app, rt = fast_app(pico=True)
 @app.get
 def index():
     page = Div(
-        Form(
+        Form(
+            hx_post=send,
+            hx_target="#output",
+            hx_swap="outerHTML show:none",
+        )(
             Div(
                 Span(
                     Button("Check"),
                     A("How this works", href="/about"),
                     style="margin-bottom: 1rem; display: flex; gap: 1rem; align-items: center",
                 ),
-            Textarea(
+                Textarea(
+                    name="text",
+                    id="input-text",
+                    style="height: calc(100vh - 11rem);",
+                    onscroll="document.getElementById('output').scrollTop = this.scrollTop + 1; document.getElementById('output').scrollLeft = this.scrollLeft;",
+                ),
             )
         ),
         Div(
@@ -247,7 +256,7 @@ def index():
             cls="overflow-auto",
             style="height: 4rem; text-wrap: balance; padding: 0rem 1rem",
         ),
-        Div(id="output", style="padding: 1rem
+        Div(id="output", style="padding: 1rem calc(1rem - 5px)"),
     ),
     cls="grid",
 )
@@ -370,7 +379,13 @@ def send(text: str):
     paragraphs = re.sub(r"[^\S\r\n]+", " ", text).split("\r\n\r\n")
     docs = [sudify(nlp(para)) for para in paragraphs]
     annot_paras = [get_fluff(doc) for doc in docs]
+
+    sync_script = Script(
+        "setTimeout(() => { const textarea = document.getElementById('input-text'); const output = document.getElementById('output'); if (textarea && output) { output.scrollTop = textarea.scrollTop + 1; output.scrollLeft = textarea.scrollLeft; } }, 100);"
+    )
+
     return Div(
+        sync_script,
         *[
             P(
                 *[
@@ -402,7 +417,8 @@ def send(text: str):
         ),
         id="output",
         cls="overflow-auto",
-        style="height: calc(100vh - 11rem); padding: 1rem; padding-bottom: calc(1rem - 5px)",
+        style="height: calc(100vh - 11rem); padding: 1rem; padding-bottom: calc(1rem - 5px);",
+        onscroll="document.getElementById('input-text').scrollTop = this.scrollTop - 1; document.getElementById('input-text').scrollLeft = this.scrollLeft;",
     )
```