skalyan91 committed
Commit 39f1c97 · verified · 1 Parent(s): 5b44626

deploy at 2026-01-05 20:05:15.533029

Files changed (2):
  1. CLAUDE.md +87 -0
  2. main.py +20 -4
CLAUDE.md ADDED
@@ -0,0 +1,87 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

This is an NLP-based readability feedback tool that helps writers improve their text by identifying sentence segments where related words are separated by long interruptions. The tool is based on the linguistic principle of Dependency Length Minimisation (DLM), which suggests that keeping syntactically related words close together reduces cognitive load and improves readability.

The application is built with FastHTML, uses spaCy for dependency parsing, and is deployed as a Docker container suitable for Hugging Face Spaces.

## Development Commands

### Setup

```bash
# Install dependencies (including spaCy model)
pip install -r requirements.txt
```

### Running the Application

```bash
# Run the web app locally (starts on port 5001 by default)
python main.py
```

### Docker

```bash
# Build the Docker image
docker build -t readability-feedback .

# Run the container
docker run -p 7860:7860 readability-feedback
```

## Architecture

### Core NLP Pipeline (main.py:29-220)

The application uses a custom dependency parsing approach that converts spaCy's output to the Surface Syntactic Universal Dependencies (SUD) framework:

1. **sudify(doc)** (main.py:29-157): Transforms spaCy dependency relations into SUD-style relations with four main categories:
   - `subj`: subject relations (nsubj, nsubjpass, csubj, etc.)
   - `comp`: complement relations (dobj, ccomp, xcomp, aux, etc.)
   - `mod`: modifier relations (advmod, advcl, relcl, etc.)
   - `udep`: underspecified dependencies (acl, amod, nmod, etc.)

   The function performs several transformations:
   - Reverses auxiliary/mark/case dependencies to create more linguistically accurate heads
   - Handles semicolons/colons as clause separators (main.py:111-133)
   - Classifies prepositional phrases as modifiers or complements based on position
   - Identifies oblique clauses (ccomp relations that should be mod)

2. **flyover(token)** (main.py:160-176): For each token with a `subj` or `comp` relation, calculates the "flyover" span (the words between head and dependent) and its complexity (the number of heads in the span). This implements the revised dependency length metric from Yadav et al. (2022).

3. **get_fluff(doc)** (main.py:179-219): Identifies all maximal flyover spans that should be highlighted in the output. Uses a filtering algorithm to:
   - Remove nested flyovers (keep only the outermost or most complex ones)
   - Collect interstices (non-flyover spans) to reconstruct the full sentence
   - Return a sorted list of (span, complexity) tuples for rendering
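The flyover metric can be illustrated on a toy dependency structure. The `Tok` class, index-based heads, and the example sentence below are hypothetical simplifications; the real implementation reads spaCy tokens:

```python
from dataclasses import dataclass

@dataclass
class Tok:
    i: int      # position in the sentence
    text: str
    head: int   # index of this token's head (head == i marks the root)
    dep: str    # SUD-style relation label

def flyover(toks, t):
    """Span of tokens strictly between t and its head, plus its
    complexity: the number of tokens in the span that head something."""
    lo, hi = sorted((t.i, t.head))
    span = [tok for tok in toks if lo < tok.i < hi]
    heads_in_doc = {tok.head for tok in toks if tok.head != tok.i}
    complexity = sum(1 for tok in span if tok.i in heads_in_doc)
    return span, complexity

# "The dog that I saw yesterday barked."
# 'dog' (1) is subj of 'barked' (6); the relative clause interrupts them.
toks = [
    Tok(0, "The", 1, "udep"),
    Tok(1, "dog", 6, "subj"),
    Tok(2, "that", 4, "comp"),
    Tok(3, "I", 4, "subj"),
    Tok(4, "saw", 1, "mod"),
    Tok(5, "yesterday", 4, "mod"),
    Tok(6, "barked", 6, "root"),
]
span, complexity = flyover(toks, toks[1])
print([t.text for t in span], complexity)  # → ['that', 'I', 'saw', 'yesterday'] 1
```

Here the flyover of the subject relation spans four words but contains only one intervening head ("saw"), so its complexity is 1.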
### Web Application (main.py:222-410)

Built with FastHTML, a modern Python web framework:

- **index()** route (main.py:230-254): Main page with textarea input and output display area, using HTMX for dynamic updates
- **about()** route (main.py:257-365): Documentation page explaining the linguistic theory and technical implementation
- **send(text)** route (main.py:368-406): POST endpoint that:
  - Splits input into paragraphs
  - Processes each paragraph through the NLP pipeline
  - Returns HTML with highlighted spans using opacity proportional to flyover complexity
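The opacity scaling in the last step can be sketched as a small helper. The clamping to 1.0 and the dark-mode color are assumptions for this sketch, not taken from main.py:

```python
def highlight_style(complexity, light_mode=True):
    """Build a background-color style whose opacity grows with flyover
    complexity, mirroring the rgba(237, 201, 241, complexity/5) scheme
    used for light mode. Clamping at 1.0 is an assumption."""
    alpha = min(complexity / 5, 1.0)
    rgb = (237, 201, 241) if light_mode else (75, 0, 90)  # dark-mode color is hypothetical
    return f"background-color: rgba({rgb[0]}, {rgb[1]}, {rgb[2]}, {alpha})"

print(highlight_style(2))  # → background-color: rgba(237, 201, 241, 0.4)
```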
### Key Implementation Details

- The app uses spaCy's `en_core_web_sm` model (loaded at main.py:6)
- Highlighting uses background color with opacity scaled by dependency distance: `rgba(237, 201, 241, {a[1]/5})` for light mode
- The sudify function handles complex edge cases like:
  - Multiple subjects (main.py:96-103): only the subject closest to the head is kept as `subj`; the others become `comp`
  - Parenthetical expressions: semicolons/colons are ignored if wrapped in parentheses (main.py:117-128)
  - Post-core conjunctions: conj dependents after the core clause become modifiers (main.py:134-150)
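The closest-subject rule can be sketched on plain (dependent, head, label) triples. This is a deliberate simplification; main.py applies the rule to spaCy tokens in place:

```python
def demote_extra_subjects(deps):
    """Given (dep_index, head_index, label) triples, keep only the
    subject closest to its head as 'subj' and relabel the rest 'comp'.
    Toy version of the multiple-subjects rule (main.py:96-103)."""
    by_head = {}
    for i, h, lab in deps:
        if lab == "subj":
            by_head.setdefault(h, []).append((abs(h - i), i))
    # For each head, the subject with the smallest linear distance wins
    closest = {h: min(cands)[1] for h, cands in by_head.items()}
    out = []
    for i, h, lab in deps:
        if lab == "subj" and closest[h] != i:
            lab = "comp"
        out.append((i, h, lab))
    return out

# Two subjects attached to head 3: token 2 is closer, token 0 is demoted.
print(demote_extra_subjects([(0, 3, "subj"), (2, 3, "subj"), (4, 3, "comp")]))
# → [(0, 3, 'comp'), (2, 3, 'subj'), (4, 3, 'comp')]
```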
## Notes on Modification

When modifying the dependency transformation logic (the sudify function), be aware that:

- The order of transformations matters: some rules depend on earlier transformations
- The function modifies the spaCy Doc object in place by reassigning `.head` and `.dep_` attributes
- Edge-case handling for punctuation and parentheses is critical for accurate parsing
- The three-pass structure (main.py:29-55, 56-156, 92-156) progressively refines the dependency tree

The highlighting algorithm (get_fluff) uses span-comparison logic that can be subtle to debug. The key insight is that overlapping flyovers are resolved by keeping the longer or more complex one.
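That resolution step can be sketched with half-open (start, end) index pairs, a simplified stand-in for the actual span objects that get_fluff works with:

```python
def filter_flyovers(flyovers):
    """Drop flyovers nested inside another flyover, preferring longer
    spans first and breaking ties by higher complexity.
    Toy version of the get_fluff filtering step; each element is
    ((start, end), complexity) with half-open index spans."""
    kept = []
    # Visit longer spans first (ties broken by higher complexity), so a
    # nested span is always checked against its potential container.
    for (s, e), c in sorted(flyovers, key=lambda f: (f[0][0] - f[0][1], -f[1])):
        if not any(ks <= s and e <= ke for (ks, ke), _ in kept):
            kept.append(((s, e), c))
    return sorted(kept)

# (3, 5) is nested inside (2, 9) and is dropped; (10, 12) is disjoint.
print(filter_flyovers([((2, 9), 3), ((3, 5), 1), ((10, 12), 2)]))
# → [((2, 9), 3), ((10, 12), 2)]
```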
main.py CHANGED
@@ -229,14 +229,23 @@ app, rt = fast_app(pico=True)
 @app.get
 def index():
     page = Div(
-        Form(hx_post=send, hx_target="#output", hx_swap="outerHTML")(
+        Form(
+            hx_post=send,
+            hx_target="#output",
+            hx_swap="outerHTML show:none",
+        )(
             Div(
                 Span(
                     Button("Check"),
                     A("How this works", href="/about"),
                     style="margin-bottom: 1rem; display: flex; gap: 1rem; align-items: center",
                 ),
-                Textarea(name="text", style="height: calc(100vh - 11rem)"),
+                Textarea(
+                    name="text",
+                    id="input-text",
+                    style="height: calc(100vh - 11rem);",
+                    onscroll="document.getElementById('output').scrollTop = this.scrollTop + 1; document.getElementById('output').scrollLeft = this.scrollLeft;",
+                ),
             )
         ),
         Div(
@@ -247,7 +256,7 @@ def index():
             cls="overflow-auto",
             style="height: 4rem; text-wrap: balance; padding: 0rem 1rem",
         ),
-        Div(id="output", style="padding: 1rem; padding-bottom: calc(1rem - 5px)"),
+        Div(id="output", style="padding: 1rem calc(1rem - 5px)"),
     ),
     cls="grid",
 )
@@ -370,7 +379,13 @@ def send(text: str):
     paragraphs = re.sub(r"[^\S\r\n]+", " ", text).split("\r\n\r\n")
     docs = [sudify(nlp(para)) for para in paragraphs]
     annot_paras = [get_fluff(doc) for doc in docs]
+
+    sync_script = Script(
+        "setTimeout(() => { const textarea = document.getElementById('input-text'); const output = document.getElementById('output'); if (textarea && output) { output.scrollTop = textarea.scrollTop + 1; output.scrollLeft = textarea.scrollLeft; } }, 100);"
+    )
+
     return Div(
+        sync_script,
         *[
             P(
                 *[
@@ -402,7 +417,8 @@ def send(text: str):
         ),
         id="output",
         cls="overflow-auto",
-        style="height: calc(100vh - 11rem); padding: 1rem; padding-bottom: calc(1rem - 5px)",
+        style="height: calc(100vh - 11rem); padding: 1rem; padding-bottom: calc(1rem - 5px);",
+        onscroll="document.getElementById('input-text').scrollTop = this.scrollTop - 1; document.getElementById('input-text').scrollLeft = this.scrollLeft;",
     )