Em4e/chunk-based-text-editor

Em4e committed on Jun 9, 2025

Commit c3faf12 (verified) · 1 parent: 65f1d08

Update app.py

Files changed (1)

app.py CHANGED

@@ -222,9 +222,22 @@ st.markdown(
     "Developed by [Emilija Gjorgjevska](https://www.linkedin.com/in/emilijagjorgjevska/). "
     "Inspired by Andrea Volpini's [work on content chunking](https://www.linkedin.com/pulse/understanding-chunking-google-ai-mode-practical-content-volpini-zseaf/)"
 )
 st.info(
+    """
+    **How Layout-Based Chunking is Implemented Here**
+    This app uses a two-step process to create chunks that follow the document's heading and section structure:
+    1.  **Structural Preservation (HTML → Markdown):**
+        The code first converts the webpage's HTML into Markdown. This is a critical step because it translates structural tags (`<h1>`, `<p>`, `<ul>`) into their Markdown equivalents (`#`, paragraph breaks, `*`). This preserves the document's original layout and hierarchy.
+    2.  **Layout-Aware Parsing (`MarkdownNodeParser`):**
+        Next, it uses the `MarkdownNodeParser` from the LlamaIndex library. This specialized tool is designed to read the structured Markdown and split it at its logical boundaries—primarily the headers (`#`, `##`, etc.).
+    The result is a set of context-aware chunks that respect the original document's sections, rather than being arbitrary splits.
+
+    **Note:** Some websites may block content scraping. This is an early version, so you might encounter bugs.
+    """,
     icon="ℹ️"
 )
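
The two steps described in the added help text can be sketched with the standard library alone. This is a minimal illustration, not the code in app.py and not LlamaIndex's `MarkdownNodeParser`: the names `TinyMarkdownConverter`, `html_to_markdown`, and `split_on_headers` are hypothetical, the converter handles only `<h1>` to `<h6>`, `<p>`, and `<li>`, and the splitter treats every line that begins with `#` markers as a section boundary (it does not account for `#` lines inside fenced code).

```python
import re
from html.parser import HTMLParser


class TinyMarkdownConverter(HTMLParser):
    """Hypothetical helper: converts a small HTML subset (h1-h6, p, li) to Markdown."""

    def __init__(self):
        super().__init__()
        self.lines = []     # finished Markdown lines
        self.prefix = None  # Markdown prefix of the element currently being read
        self.buffer = []    # text collected for the current element

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]", tag):
            self.prefix = "#" * int(tag[1]) + " "
        elif tag == "li":
            self.prefix = "* "
        elif tag == "p":
            self.prefix = ""
        else:
            return  # inline or unsupported tag: keep collecting text
        self.buffer = []

    def handle_data(self, data):
        if self.prefix is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        is_block = tag in ("p", "li") or re.fullmatch(r"h[1-6]", tag)
        if self.prefix is not None and is_block:
            text = " ".join("".join(self.buffer).split())  # collapse whitespace
            if text:
                self.lines.append(self.prefix + text)
            self.prefix = None


def html_to_markdown(html):
    """Step 1: keep the structural tags by rewriting them as Markdown."""
    converter = TinyMarkdownConverter()
    converter.feed(html)
    converter.close()
    return "\n\n".join(converter.lines)


def split_on_headers(markdown):
    """Step 2: start a new chunk at every Markdown header line."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


if __name__ == "__main__":
    page = "<h1>Title</h1><p>Intro text.</p><h2>Section</h2><ul><li>One</li><li>Two</li></ul>"
    for chunk in split_on_headers(html_to_markdown(page)):
        print(repr(chunk))
```

Because the split happens only at header lines, each chunk carries its own heading plus the paragraphs and list items under it, which is the property the help text attributes to layout-based chunking.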