Spaces: Em4e/chunk-based-text-editor

Em4e committed on Jun 9, 2025

Commit 223e4c3 (verified) · Parent: 8a0ee2e

Update app.py

app.py +20 -9

@@ -5,7 +5,7 @@ import re
 from llama_index.core.node_parser import MarkdownNodeParser
 from llama_index.core.schema import Document, MetadataMode
 import textstat
-from markdownify import markdownify as md # <-- MODIFIED: Switched to markdownify
 # --- Core Logic Classes ---
@@ -215,16 +215,27 @@ st.markdown(
     "Inspired by Andrea Volpini's [work on content chunking](https://www.linkedin.com/pulse/understanding-chunking-google-ai-mode-practical-content-volpini-zseaf/)")
 st.info(
     """
     **How Layout-Based Chunking is Implemented Here**
-    This app uses a sophisticated, two-step process to create meaningful chunks based on the document's visual and semantic structure:
-    1.  **Structural Preservation (HTML → Markdown):**
-        The code first converts the webpage's HTML into Markdown. This is a critical step because it translates structural tags (`<h1>`, `<p>`, `<ul>`) into their Markdown equivalents (`#`, paragraph breaks, `*`). This preserves the document's original layout and hierarchy.
-    2.  **Layout-Aware Parsing (`MarkdownNodeParser`):**
-        Next, it uses the `MarkdownNodeParser` from the LlamaIndex library. This specialized tool is designed to read the structured Markdown and split it at its logical boundaries—primarily the headers (`#`, `##`, etc.).
-    The result is a set of context-aware chunks that respect the original document's sections, rather than being arbitrary splits.
-    "**Note:** Some websites may block content scraping. This is an early version, so you might encounter bugs.",
     """,
-    icon="ℹ️")
 url_input = st.text_input("Enter a webpage URL to start", key="url_input")

 from llama_index.core.node_parser import MarkdownNodeParser
 from llama_index.core.schema import Document, MetadataMode
 import textstat
+from markdownify import markdownify as md
 # --- Core Logic Classes ---
     "Inspired by Andrea Volpini's [work on content chunking](https://www.linkedin.com/pulse/understanding-chunking-google-ai-mode-practical-content-volpini-zseaf/)")
 st.info(
     """
+    • **App version:** v0.0 (alpha) — this is the very first public release, so you may run into bugs or incomplete features.
+    • **Server policy warning:** this app relies on automated requests (“bots”) under the hood.
+      If the target server enforces a restrictive bot policy (e.g., rate-limits requests, blocks unknown user-agents or IP addresses), parts of the app **may not work** as expected.
+    **What to do if you hit an issue:**
+    1. Check the server’s logs or policy settings to ensure it allows automated clients.
+    2. Keep an eye out for updates — v0.x → v1.0 is coming soon!
+    ---
     **How Layout-Based Chunking is Implemented Here**
+    This app uses a sophisticated, two-step process to create meaningful chunks based on the document’s visual and semantic structure:
+    1. **Structural Preservation (HTML → Markdown):**
+       Converts the webpage’s HTML into Markdown, translating tags (`<h1>`, `<p>`, `<ul>`) into their Markdown equivalents (`#`, paragraph breaks, `*`) to preserve layout and hierarchy.
+    2. **Layout-Aware Parsing (`MarkdownNodeParser`):**
+       Uses LlamaIndex’s `MarkdownNodeParser` to read the structured Markdown and split it at logical boundaries (headers like `#`, `##`, etc.), yielding context-aware chunks that respect original sections.
+    _Note: Some websites may block content scraping. This is an early version, so you might encounter bugs._
     """,
+    icon="ℹ️"
+)
 url_input = st.text_input("Enter a webpage URL to start", key="url_input")
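
The two-step process described in the `st.info` text (HTML → Markdown, then splitting at headers) can be illustrated with a small standard-library sketch. This is not the app's code: the app uses `markdownify` for the conversion and LlamaIndex's `MarkdownNodeParser` for the splitting. The names `TinyMarkdownConverter`, `html_to_markdown`, and `split_on_headers` are hypothetical stand-ins, and the sketch handles only headings, paragraphs, and list items. It also ignores fenced code blocks, so it should be read as a rough picture of the idea rather than an equivalent of either library.

```python
import re
from html.parser import HTMLParser

BLOCK_TAG = re.compile(r"h[1-6]|p|li")


class TinyMarkdownConverter(HTMLParser):
    """Hypothetical, minimal HTML -> Markdown converter (a stand-in for markdownify)."""

    def __init__(self):
        super().__init__()
        self.lines = []     # finished Markdown lines
        self.prefix = None  # Markdown prefix of the block being read, or None
        self.buffer = []    # text collected for the current block

    def handle_starttag(self, tag, attrs):
        if re.fullmatch(r"h[1-6]", tag):
            self.prefix = "#" * int(tag[1]) + " "  # <h2> becomes "## "
        elif tag == "li":
            self.prefix = "* "
        elif tag == "p":
            self.prefix = ""
        else:
            return  # inline or container tags: keep collecting text
        self.buffer = []

    def handle_data(self, data):
        if self.prefix is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if self.prefix is not None and BLOCK_TAG.fullmatch(tag):
            text = " ".join("".join(self.buffer).split())  # collapse whitespace
            if text:
                self.lines.append(self.prefix + text)
            self.prefix = None


def html_to_markdown(html):
    """Step 1: keep the structure by turning block tags into Markdown markers."""
    parser = TinyMarkdownConverter()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.lines)


def split_on_headers(markdown):
    """Step 2: start a new chunk at every Markdown header line."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


if __name__ == "__main__":
    sample = "<h1>Title</h1><p>Intro.</p><h2>Part</h2><ul><li>one</li><li>two</li></ul>"
    for chunk in split_on_headers(html_to_markdown(sample)):
        print(repr(chunk))
```

Each chunk begins with the header that introduces it, which is what lets the chunks follow the document's sections instead of falling at arbitrary character offsets.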