Update app.py
app.py
CHANGED
|
@@ -5,7 +5,7 @@ import re
|
|
| 5 |
from llama_index.core.node_parser import MarkdownNodeParser
|
| 6 |
from llama_index.core.schema import Document, MetadataMode
|
| 7 |
import textstat
|
| 8 |
-
from markdownify import markdownify as md
|
| 9 |
|
| 10 |
# --- Core Logic Classes ---
|
| 11 |
|
|
@@ -215,16 +215,27 @@ st.markdown(
|
|
| 215 |
"Inspired by Andrea Volpini's [work on content chunking](https://www.linkedin.com/pulse/understanding-chunking-google-ai-mode-practical-content-volpini-zseaf/)")
|
| 216 |
st.info(
|
| 217 |
"""
|
| 218 |
**How Layout-Based Chunking is Implemented Here**
|
| 219 |
-
This app uses a sophisticated, two-step process to create meaningful chunks based on the document
|
| 220 |
-
1.
|
| 221 |
-
|
| 222 |
-
2.
|
| 223 |
-
|
| 224 |
-
|
| 225 |
-
|
| 226 |
""",
|
| 227 |
-
icon="ℹ️"
|
| 228 |
|
| 229 |
url_input = st.text_input("Enter a webpage URL to start", key="url_input")
|
| 230 |
|
|
|
|
| 5 |
from llama_index.core.node_parser import MarkdownNodeParser
|
| 6 |
from llama_index.core.schema import Document, MetadataMode
|
| 7 |
import textstat
|
| 8 |
+
from markdownify import markdownify as md
|
| 9 |
|
| 10 |
# --- Core Logic Classes ---
|
| 11 |
|
|
|
|
| 215 |
"Inspired by Andrea Volpini's [work on content chunking](https://www.linkedin.com/pulse/understanding-chunking-google-ai-mode-practical-content-volpini-zseaf/)")
|
| 216 |
st.info(
|
| 217 |
"""
|
| 218 |
+
• **App version:** v0.0 (alpha) — this is the very first public release, so you may run into bugs or incomplete features.
|
| 219 |
+
• **Server policy warning:** this app relies on automated requests (“bots”) under the hood.
|
| 220 |
+
If the target server enforces a restrictive bot policy (e.g., rate-limits requests, blocks unknown user-agents or IP addresses), parts of the app **may not work** as expected.
|
| 221 |
+
|
| 222 |
+
**What to do if you hit an issue:**
|
| 223 |
+
1. Check the server’s logs or policy settings to ensure it allows automated clients.
|
| 224 |
+
2. Keep an eye out for updates — v0.x → v1.0 is coming soon!
|
| 225 |
+
|
| 226 |
+
---
|
| 227 |
**How Layout-Based Chunking is Implemented Here**
|
| 228 |
+
This app uses a sophisticated, two-step process to create meaningful chunks based on the document’s visual and semantic structure:
|
| 229 |
+
1. **Structural Preservation (HTML → Markdown):**
|
| 230 |
+
Converts the webpage’s HTML into Markdown, translating tags (`<h1>`, `<p>`, `<ul>`) into their Markdown equivalents (`#`, paragraph breaks, `*`) to preserve layout and hierarchy.
|
| 231 |
+
2. **Layout-Aware Parsing (`MarkdownNodeParser`):**
|
| 232 |
+
Uses LlamaIndex’s `MarkdownNodeParser` to read the structured Markdown and split it at logical boundaries (headers like `#`, `##`, etc.), yielding context-aware chunks that respect original sections.
|
| 233 |
+
|
| 234 |
+
_Note: Some websites may block content scraping. This is an early version, so you might encounter bugs._
|
| 235 |
""",
|
| 236 |
+
icon="ℹ️"
|
| 237 |
+
)
|
| 238 |
+
|
| 239 |
|
| 240 |
url_input = st.text_input("Enter a webpage URL to start", key="url_input")
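The server policy warning in the `st.info` text suggests confirming that the target site allows automated clients. One way to do that ahead of a fetch is to read the site's `robots.txt`. The following is a minimal sketch using only the standard library; it is not part of this commit. The `USER_AGENT` value and the `is_allowed` helper are hypothetical names chosen for illustration, and the rules are parsed from an inline sample so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user-agent string; the app's real one is not shown in the diff.
USER_AGENT = "chunking-demo-bot"


def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no HTTP request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules for illustration only; a real check would download
# https://<host>/robots.txt first.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/blog/post"))
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/page"))
```

A `robots.txt` check only covers the published crawl rules. Rate limits and IP or user-agent blocking, which the warning also mentions, would still surface as failed requests at fetch time.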
|
| 241 |
|
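The two-step process described in the `st.info` text (HTML to Markdown, then splitting at headers) can be sketched with the standard library alone. This is a simplified stand-in, not the app's code: the app uses `markdownify` and LlamaIndex's `MarkdownNodeParser`, whereas `SimpleHtmlToMarkdown`, `html_to_markdown`, and `split_on_headers` below are hypothetical helpers that handle only headings, paragraphs, and list items.

```python
import re
from html.parser import HTMLParser

HEADING_TAG = re.compile(r"h[1-6]")


class SimpleHtmlToMarkdown(HTMLParser):
    """Tiny converter covering only <h1>-<h6>, <p>, and <li> (illustrative)."""

    def __init__(self):
        super().__init__()
        self.lines = []     # finished Markdown lines
        self.prefix = None  # Markdown prefix of the element being read
        self.buffer = []    # text collected inside that element

    def handle_starttag(self, tag, attrs):
        if HEADING_TAG.fullmatch(tag):
            self.prefix = "#" * int(tag[1]) + " "
        elif tag == "li":
            self.prefix = "* "
        elif tag == "p":
            self.prefix = ""
        else:
            return  # other tags (e.g. <ul>, <a>) do not start a new block
        self.buffer = []

    def handle_data(self, data):
        if self.prefix is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        is_block = tag in ("p", "li") or HEADING_TAG.fullmatch(tag)
        if self.prefix is not None and is_block:
            text = " ".join("".join(self.buffer).split())
            if text:
                self.lines.append(self.prefix + text)
            self.prefix = None


def html_to_markdown(html: str) -> str:
    """Step 1: keep the page's hierarchy by mapping tags to Markdown."""
    parser = SimpleHtmlToMarkdown()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.lines)


def split_on_headers(markdown: str) -> list[str]:
    """Step 2: start a new chunk at every Markdown header line."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


SAMPLE_HTML = (
    "<h1>Guide</h1><p>Intro text.</p>"
    "<h2>Setup</h2><ul><li>Install</li><li>Run</li></ul>"
)
chunks = split_on_headers(html_to_markdown(SAMPLE_HTML))
for number, chunk in enumerate(chunks, start=1):
    print(f"--- chunk {number} ---\n{chunk}")
```

Each chunk starts at a header and carries the body text beneath it, which is the property the description attributes to layout-based chunking. The sketch ignores cases a production parser has to handle, such as `#` characters inside fenced code blocks and nested markup.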