Update app.py
app.py
CHANGED
@@ -213,29 +213,28 @@ st.caption("A tool to fetch, chunk, and refine web content.")
 st.markdown(
     "Developed by [Emilija Gjorgjevska](https://www.linkedin.com/in/emilijagjorgjevska/). "
     "Inspired by Andrea Volpini's [work on content chunking](https://www.linkedin.com/pulse/understanding-chunking-google-ai-mode-practical-content-volpini-zseaf/)")
-st.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-)
-
+with st.expander("ℹ️ App Information & Chunking Details", expanded=False):
+    st.info(
+        """
+• **App version:** v0.0 (alpha) — this is the very first public release, so you may run into bugs or incomplete features.
+• **Server policy warning:** this app relies on automated requests (“bots”) under the hood.
+If the target server enforces a restrictive bot policy (e.g., rate-limits requests, blocks unknown user-agents or IP addresses), parts of the app **may not work** as expected.
+
+**What to do if you hit an issue:**
+1. Check the server’s logs or policy settings to ensure it allows automated clients.
+2. Keep an eye out for updates — v0.x → v1.0 is coming soon!
+
+---
+**How Layout-Based Chunking is Implemented Here**
+This app uses a sophisticated, two-step process to create meaningful chunks based on the document’s visual and semantic structure:
+1. **Structural Preservation (HTML → Markdown):**
+Converts the webpage’s HTML into Markdown, translating tags (`<h1>`, `<p>`, `<ul>`) into their Markdown equivalents (`#`, paragraph breaks, `*`) to preserve layout and hierarchy.
+2. **Layout-Aware Parsing (`MarkdownNodeParser`):**
+Uses LlamaIndex’s `MarkdownNodeParser` to read the structured Markdown and split it at logical boundaries (headers like `#`, `##`, etc.), yielding context-aware chunks that respect original sections.
+
+_Note: Some websites may block content scraping. This is an early version, so you might encounter bugs._
+        """
+    , icon="ℹ️")
 
 url_input = st.text_input("Enter a webpage URL to start", key="url_input")
 
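
The two-step process described in the added `st.info` text (HTML to Markdown, then splitting at headers) can be illustrated with a small standard-library sketch. This is not the app's code: the conversion and chunking calls are not part of this diff, and the splitter below is only a simplified stand-in for the behaviour the text attributes to LlamaIndex's `MarkdownNodeParser`. The names `SimpleHTMLToMarkdown`, `html_to_markdown`, and `split_on_headers` are hypothetical, and the converter handles only `<h1>`-`<h6>`, `<p>`, and `<li>`.

```python
import re
from html.parser import HTMLParser

# Block-level tags this sketch understands (an assumption, not the app's list).
BLOCK_TAGS = re.compile(r"h[1-6]|p|li")


class SimpleHTMLToMarkdown(HTMLParser):
    """Hypothetical minimal converter: headings, paragraphs, list items only."""

    def __init__(self):
        super().__init__()
        self.lines = []     # finished Markdown lines
        self.prefix = None  # Markdown prefix of the element being read
        self.buffer = []    # text collected inside the current element

    def handle_starttag(self, tag, attrs):
        if not BLOCK_TAGS.fullmatch(tag):
            return  # inline or container tags (<a>, <ul>, ...) are ignored
        if tag == "p":
            self.prefix = ""
        elif tag == "li":
            self.prefix = "* "
        else:
            self.prefix = "#" * int(tag[1]) + " "  # <h2> becomes "## "
        self.buffer = []

    def handle_data(self, data):
        if self.prefix is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if self.prefix is None or not BLOCK_TAGS.fullmatch(tag):
            return
        text = " ".join("".join(self.buffer).split())  # collapse whitespace
        if text:
            self.lines.append(self.prefix + text)
        self.prefix = None


def html_to_markdown(html):
    """Step 1: keep the page's structure by mapping tags to Markdown."""
    parser = SimpleHTMLToMarkdown()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.lines)


def split_on_headers(markdown):
    """Step 2: start a new chunk at every Markdown header line."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]
```

Calling `split_on_headers(html_to_markdown(page_html))` on a page with one `<h1>` and one `<h2>` yields two chunks, each beginning with its own header, which is the "chunks that respect original sections" behaviour the info text describes.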