Update app.py
app.py
CHANGED
|
@@ -222,9 +222,22 @@ st.markdown(
|
|
| 222 |
"Developed by [Emilija Gjorgjevska](https://www.linkedin.com/in/emilijagjorgjevska/). "
|
| 223 |
"Inspired by Andrea Volpini's [work on content chunking](https://www.linkedin.com/pulse/understanding-chunking-google-ai-mode-practical-content-volpini-zseaf/)"
|
| 224 |
)
|
| 225 |
-
|
| 226 |
st.info(
|
| 227 |
"**Note:** Some websites may block content scraping. This is an early version, so you might encounter bugs.",
|
|
|
|
| 228 |
icon="ℹ️"
|
| 229 |
)
|
| 230 |
|
|
|
|
| 222 |
"Developed by [Emilija Gjorgjevska](https://www.linkedin.com/in/emilijagjorgjevska/). "
|
| 223 |
"Inspired by Andrea Volpini's [work on content chunking](https://www.linkedin.com/pulse/understanding-chunking-google-ai-mode-practical-content-volpini-zseaf/)"
|
| 224 |
)
|
|
|
|
| 225 |
st.info(
|
| 226 |
+
"""
|
| 227 |
+
**How Layout-Based Chunking is Implemented Here**
|
| 228 |
+
|
| 229 |
+
This app uses a two-step process to create chunks that follow the document's heading and section structure:
|
| 230 |
+
|
| 231 |
+
1. **Structural Preservation (HTML → Markdown):**
|
| 232 |
+
The code first converts the webpage's HTML into Markdown, translating structural tags (`<h1>`, `<p>`, `<ul>`) into their Markdown equivalents (`#`, paragraph breaks, `*` list items). This keeps the document's heading hierarchy and block structure available for the next step.
|
| 233 |
+
|
| 234 |
+
2. **Layout-Aware Parsing (`MarkdownNodeParser`):**
|
| 235 |
+
Next, it uses the `MarkdownNodeParser` from the LlamaIndex library, which reads the structured Markdown and splits it at its logical boundaries, primarily the headers (`#`, `##`, etc.).
|
| 236 |
+
|
| 237 |
+
The result is a set of chunks that follow the original document's sections rather than arbitrary split points.
|
| 238 |
+
|
| 239 |
"**Note:** Some websites may block content scraping. This is an early version, so you might encounter bugs.",
|
| 240 |
+
""",
|
| 241 |
icon="ℹ️"
|
| 242 |
)
|
| 243 |
|
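The two steps described in the new `st.info` text can be illustrated with a small standard-library sketch. This is not the app's code: `MiniMarkdownConverter`, `html_to_markdown`, and `split_at_headers` are hypothetical stand-ins for the app's HTML-to-Markdown converter and for LlamaIndex's `MarkdownNodeParser`, and they handle only headings, paragraphs, and list items.

```python
import re
from html.parser import HTMLParser

HEADING_TAG = re.compile(r"h[1-6]")
HEADER_LINE = re.compile(r"^#{1,6} ")


class MiniMarkdownConverter(HTMLParser):
    """Hypothetical, heavily simplified HTML -> Markdown converter.

    Handles only <h1>-<h6>, <p>, and <li>; a real converter covers far more.
    """

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished Markdown blocks
        self.prefix = None  # Markdown prefix of the element being read
        self.text = []      # text fragments collected for that element

    def handle_starttag(self, tag, attrs):
        if HEADING_TAG.fullmatch(tag):
            self.prefix = "#" * int(tag[1]) + " "
        elif tag == "p":
            self.prefix = ""
        elif tag == "li":
            self.prefix = "* "
        else:
            return  # inline or unsupported tag: keep collecting text
        self.text = []

    def handle_data(self, data):
        if self.prefix is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        is_block = tag in ("p", "li") or HEADING_TAG.fullmatch(tag)
        if self.prefix is not None and is_block:
            content = " ".join("".join(self.text).split())
            if content:
                self.blocks.append(self.prefix + content)
            self.prefix = None


def html_to_markdown(html):
    """Step 1: keep the structure by mapping tags to Markdown syntax."""
    converter = MiniMarkdownConverter()
    converter.feed(html)
    converter.close()
    return "\n\n".join(converter.blocks)


def split_at_headers(markdown):
    """Step 2: start a new chunk at every Markdown header line.

    Simplification: fenced code blocks containing '#' lines are not handled.
    """
    chunks, current = [], []
    for line in markdown.splitlines():
        if HEADER_LINE.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


html = """
<h1>Guide</h1>
<p>Intro text.</p>
<h2>Setup</h2>
<ul><li>Install</li><li>Run</li></ul>
"""
chunks = split_at_headers(html_to_markdown(html))
for chunk in chunks:
    print(repr(chunk))
```

Each chunk begins at a header and carries the content beneath it, so the section boundaries of the original page survive into the chunks. The real `MarkdownNodeParser` also attaches metadata to each node, which this sketch leaves out.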