Em4e committed
Commit c3faf12 (verified)
1 Parent(s): 65f1d08

Update app.py

Files changed (1)
  1. app.py +14 -1
app.py CHANGED
@@ -222,9 +222,22 @@ st.markdown(
  "Developed by [Emilija Gjorgjevska](https://www.linkedin.com/in/emilijagjorgjevska/). "
  "Inspired by Andrea Volpini's [work on content chunking](https://www.linkedin.com/pulse/understanding-chunking-google-ai-mode-practical-content-volpini-zseaf/)"
  )
-
  st.info(
+ """
+ **How Layout-Based Chunking is Implemented Here**
+
+ This app uses a sophisticated, two-step process to create meaningful chunks based on the document's visual and semantic structure:
+
+ 1. **Structural Preservation (HTML → Markdown):**
+ The code first converts the webpage's HTML into Markdown. This is a critical step because it translates structural tags (`<h1>`, `<p>`, `<ul>`) into their Markdown equivalents (`#`, paragraph breaks, `*`). This preserves the document's original layout and hierarchy.
+
+ 2. **Layout-Aware Parsing (`MarkdownNodeParser`):**
+ Next, it uses the `MarkdownNodeParser` from the LlamaIndex library. This specialized tool is designed to read the structured Markdown and split it at its logical boundaries—primarily the headers (`#`, `##`, etc.).
+
+ The result is a set of context-aware chunks that respect the original document's sections, rather than being arbitrary splits.
+
  "**Note:** Some websites may block content scraping. This is an early version, so you might encounter bugs.",
+ """,
  icon="ℹ️"
  )
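
The two-step process described in the added text (HTML to Markdown, then splitting at headers) can be sketched with the standard library alone. This is a rough illustration, not the code in app.py and not how LlamaIndex's `MarkdownNodeParser` is implemented: the names `_MiniMarkdown`, `html_to_markdown`, and `split_by_headers` are hypothetical, the converter only understands `<h1>` to `<h6>`, `<p>`, and `<li>`, and the splitter does not skip fenced code blocks, where a line starting with `#` would be misread as a header.

```python
import re
from html.parser import HTMLParser


class _MiniMarkdown(HTMLParser):
    """Hypothetical stand-in: converts only h1-h6, p, and li to Markdown."""

    # Markdown prefix for each supported block-level tag.
    BLOCKS = {**{f"h{i}": "#" * i + " " for i in range(1, 7)}, "p": "", "li": "* "}

    def __init__(self):
        super().__init__()
        self.lines = []      # finished Markdown blocks, in document order
        self._prefix = None  # prefix of the block being read, None if outside one
        self._buf = []       # text fragments collected for the current block

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCKS:
            self._prefix = self.BLOCKS[tag]
            self._buf = []

    def handle_data(self, data):
        if self._prefix is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in self.BLOCKS and self._prefix is not None:
            text = " ".join("".join(self._buf).split())  # collapse whitespace
            if text:
                self.lines.append(self._prefix + text)
            self._prefix = None


def html_to_markdown(html: str) -> str:
    """Step 1: keep the structural tags as Markdown markers."""
    parser = _MiniMarkdown()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.lines)


HEADER = re.compile(r"^#{1,6} ", re.MULTILINE)


def split_by_headers(markdown: str) -> list[str]:
    """Step 2: split so that every chunk begins at a header line."""
    starts = [m.start() for m in HEADER.finditer(markdown)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text that precedes the first header
    bounds = starts + [len(markdown)]
    chunks = [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:])]
    return [c for c in chunks if c]


if __name__ == "__main__":
    page = "<h1>Guide</h1><p>Intro text.</p><h2>Setup</h2><ul><li>Install</li><li>Run</li></ul>"
    for chunk in split_by_headers(html_to_markdown(page)):
        print(repr(chunk))
```

Each chunk carries its own header along with the body beneath it, which is the property the description calls context-aware: a section is never cut in the middle by a fixed character or token budget.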