Em4e committed · Commit 223e4c3 (verified) · 1 Parent(s): 8a0ee2e

Update app.py

Files changed (1)
  1. app.py +20 -9
app.py CHANGED
@@ -5,7 +5,7 @@ import re
  from llama_index.core.node_parser import MarkdownNodeParser
  from llama_index.core.schema import Document, MetadataMode
  import textstat
- from markdownify import markdownify as md # <-- MODIFIED: Switched to markdownify

  # --- Core Logic Classes ---

@@ -215,16 +215,27 @@ st.markdown(
  "Inspired by Andrea Volpini's [work on content chunking](https://www.linkedin.com/pulse/understanding-chunking-google-ai-mode-practical-content-volpini-zseaf/)")
  st.info(
  """
  **How Layout-Based Chunking is Implemented Here**
- This app uses a sophisticated, two-step process to create meaningful chunks based on the document's visual and semantic structure:
- 1. **Structural Preservation (HTML → Markdown):**
- The code first converts the webpage's HTML into Markdown. This is a critical step because it translates structural tags (`<h1>`, `<p>`, `<ul>`) into their Markdown equivalents (`#`, paragraph breaks, `*`). This preserves the document's original layout and hierarchy.
- 2. **Layout-Aware Parsing (`MarkdownNodeParser`):**
- Next, it uses the `MarkdownNodeParser` from the LlamaIndex library. This specialized tool is designed to read the structured Markdown and split it at its logical boundaries—primarily the headers (`#`, `##`, etc.).
- The result is a set of context-aware chunks that respect the original document's sections, rather than being arbitrary splits.
- "**Note:** Some websites may block content scraping. This is an early version, so you might encounter bugs.",
  """,
- icon="ℹ️")

  url_input = st.text_input("Enter a webpage URL to start", key="url_input")
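
The two-step process described in the info text above (HTML to Markdown, then splitting at headers) can be illustrated with a small standard-library sketch. This is not the app's code: `TinyHtmlToMarkdown`, `split_on_headers`, and `chunk_html` are hypothetical names, and they are simplified stand-ins for `markdownify` and LlamaIndex's `MarkdownNodeParser`, which handle far more cases (nested markup, code blocks, metadata).

```python
import re
from html.parser import HTMLParser


class TinyHtmlToMarkdown(HTMLParser):
    """Hypothetical stand-in for markdownify: handles h1-h6, p, and li only."""

    def __init__(self):
        super().__init__()
        self.lines = []     # finished Markdown lines
        self.prefix = None  # Markdown prefix of the element being read, if any
        self.buffer = []    # text collected for the current element

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]", tag):
            self.prefix = "#" * int(tag[1]) + " "
        elif tag == "li":
            self.prefix = "* "
        elif tag == "p":
            self.prefix = ""
        else:
            return  # other tags (ul, a, span, ...) do not start a new line
        self.buffer = []

    def handle_data(self, data):
        if self.prefix is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        closes_block = tag in ("p", "li") or re.fullmatch(r"h[1-6]", tag)
        if self.prefix is not None and closes_block:
            text = " ".join("".join(self.buffer).split())
            if text:
                self.lines.append(self.prefix + text)
            self.prefix = None


def split_on_headers(markdown):
    """Simplified header-based splitting: start a new chunk at each header line."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


def chunk_html(html):
    """Step 1: HTML -> Markdown lines. Step 2: split the Markdown at headers."""
    parser = TinyHtmlToMarkdown()
    parser.feed(html)
    parser.close()
    return split_on_headers("\n".join(parser.lines))


sample = (
    "<h1>Guide</h1><p>Intro text.</p>"
    "<h2>Setup</h2><ul><li>Install</li><li>Run</li></ul>"
)
chunks = chunk_html(sample)
for chunk in chunks:
    print(chunk, end="\n---\n")
```

Each chunk keeps its header together with the content beneath it, which is the property the info text calls "context-aware": a section is never cut at an arbitrary character offset.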
 
 
  from llama_index.core.node_parser import MarkdownNodeParser
  from llama_index.core.schema import Document, MetadataMode
  import textstat
+ from markdownify import markdownify as md

  # --- Core Logic Classes ---

  "Inspired by Andrea Volpini's [work on content chunking](https://www.linkedin.com/pulse/understanding-chunking-google-ai-mode-practical-content-volpini-zseaf/)")
  st.info(
  """
+ • **App version:** v0.0 (alpha) — this is the very first public release, so you may run into bugs or incomplete features.
+ • **Server policy warning:** this app relies on automated requests (“bots”) under the hood.
+ If the target server enforces a restrictive bot policy (e.g., rate-limits requests, blocks unknown user-agents or IP addresses), parts of the app **may not work** as expected.
+
+ **What to do if you hit an issue:**
+ 1. Check the server’s logs or policy settings to ensure it allows automated clients.
+ 2. Keep an eye out for updates — v0.x → v1.0 is coming soon!
+
+ ---
  **How Layout-Based Chunking is Implemented Here**
+ This app uses a sophisticated, two-step process to create meaningful chunks based on the document's visual and semantic structure:
+ 1. **Structural Preservation (HTML → Markdown):**
+ Converts the webpage's HTML into Markdown, translating tags (`<h1>`, `<p>`, `<ul>`) into their Markdown equivalents (`#`, paragraph breaks, `*`) to preserve layout and hierarchy.
+ 2. **Layout-Aware Parsing (`MarkdownNodeParser`):**
+ Uses LlamaIndex’s `MarkdownNodeParser` to read the structured Markdown and split it at logical boundaries (headers like `#`, `##`, etc.), yielding context-aware chunks that respect original sections.
+
+ _Note: Some websites may block content scraping. This is an early version, so you might encounter bugs._
  """,
+ icon="ℹ️"
+ )
+

  url_input = st.text_input("Enter a webpage URL to start", key="url_input")
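
The server policy warning added in this commit could be backed by a pre-flight check. The sketch below is an assumption, not something the commit does: it uses Python's standard `urllib.robotparser` to test a URL against a robots.txt body. The helper name `is_allowed`, the user agent `chunker-demo`, and the sample rules are all hypothetical. Note that robots.txt covers only one kind of bot policy; rate limits and user-agent or IP blocks would instead show up as HTTP errors at request time.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used only for illustration.
sample_robots = """\
User-agent: *
Disallow: /private/
"""

blog_ok = is_allowed(sample_robots, "chunker-demo", "https://example.com/blog/post")
private_ok = is_allowed(sample_robots, "chunker-demo", "https://example.com/private/page")
print(blog_ok, private_ok)
```

In a real app the robots.txt text would come from the target site; here it is passed in as a string so the check can be tried offline.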