Em4e committed
Commit 7926d85 · verified · 1 Parent(s): 223e4c3

Update app.py

Files changed (1)
  1. app.py +22 -23
app.py CHANGED
@@ -213,29 +213,28 @@ st.caption("A tool to fetch, chunk, and refine web content.")
 st.markdown(
     "Developed by [Emilija Gjorgjevska](https://www.linkedin.com/in/emilijagjorgjevska/). "
     "Inspired by Andrea Volpini's [work on content chunking](https://www.linkedin.com/pulse/understanding-chunking-google-ai-mode-practical-content-volpini-zseaf/)")
-st.info(
-    """
-• **App version:** v0.0 (alpha) — this is the very first public release, so you may run into bugs or incomplete features.
-• **Server policy warning:** this app relies on automated requests (“bots”) under the hood.
-If the target server enforces a restrictive bot policy (e.g., rate-limits requests, blocks unknown user-agents or IP addresses), parts of the app **may not work** as expected.
-
-**What to do if you hit an issue:**
-1. Check the server’s logs or policy settings to ensure it allows automated clients.
-2. Keep an eye out for updates v0.x v1.0 is coming soon!
-
----
-**How Layout-Based Chunking is Implemented Here**
-This app uses a sophisticated, two-step process to create meaningful chunks based on the document’s visual and semantic structure:
-1. **Structural Preservation (HTML Markdown):**
-Converts the webpage’s HTML into Markdown, translating tags (`<h1>`, `<p>`, `<ul>`) into their Markdown equivalents (`#`, paragraph breaks, `*`) to preserve layout and hierarchy.
-2. **Layout-Aware Parsing (`MarkdownNodeParser`):**
-Uses LlamaIndex’s `MarkdownNodeParser` to read the structured Markdown and split it at logical boundaries (headers like `#`, `##`, etc.), yielding context-aware chunks that respect original sections.
-
-_Note: Some websites may block content scraping. This is an early version, so you might encounter bugs._
-    """,
-    icon="ℹ️"
-)
-
+with st.expander("ℹ️ App Information & Chunking Details", expanded=False):
+    st.info(
+        """
+• **App version:** v0.0 (alpha) — this is the very first public release, so you may run into bugs or incomplete features.
+**Server policy warning:** this app relies on automated requests (“bots”) under the hood.
+If the target server enforces a restrictive bot policy (e.g., rate-limits requests, blocks unknown user-agents or IP addresses), parts of the app **may not work** as expected.
+
+**What to do if you hit an issue:**
+1. Check the server’s logs or policy settings to ensure it allows automated clients.
+2. Keep an eye out for updates — v0.x → v1.0 is coming soon!
+
+---
+**How Layout-Based Chunking is Implemented Here**
+This app uses a sophisticated, two-step process to create meaningful chunks based on the document’s visual and semantic structure:
+1. **Structural Preservation (HTML Markdown):**
+Converts the webpage’s HTML into Markdown, translating tags (`<h1>`, `<p>`, `<ul>`) into their Markdown equivalents (`#`, paragraph breaks, `*`) to preserve layout and hierarchy.
+2. **Layout-Aware Parsing (`MarkdownNodeParser`):**
+Uses LlamaIndex’s `MarkdownNodeParser` to read the structured Markdown and split it at logical boundaries (headers like `#`, `##`, etc.), yielding context-aware chunks that respect original sections.
+
+_Note: Some websites may block content scraping. This is an early version, so you might encounter bugs._
+        """
+        , icon="ℹ️")

 url_input = st.text_input("Enter a webpage URL to start", key="url_input")

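For readers who want to see the two-step process described in the info text, here is a minimal, standard-library-only sketch. It is not the code in app.py and it does not call LlamaIndex's `MarkdownNodeParser`; it only approximates the idea of converting HTML to Markdown and then splitting at header lines. The names `MarkdownConverter`, `html_to_markdown`, and `split_on_headers` are hypothetical, and the converter only handles headings, paragraphs, and flat list items.

```python
import re
from html.parser import HTMLParser


class MarkdownConverter(HTMLParser):
    """Hypothetical, minimal HTML-to-Markdown converter (illustration only)."""

    HEADINGS = {f"h{i}": "#" * i for i in range(1, 7)}
    BLOCK_TAGS = ("p", "li")

    def __init__(self):
        super().__init__()
        self.lines = []     # finished Markdown blocks
        self.prefix = None  # Markdown prefix of the block being read, or None
        self.buffer = []    # text fragments of the current block

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.prefix = self.HEADINGS[tag] + " "
        elif tag == "li":
            self.prefix = "* "
        elif tag == "p":
            self.prefix = ""
        else:
            return  # inline or unsupported tag: keep collecting text
        self.buffer = []

    def handle_data(self, data):
        if self.prefix is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if self.prefix is None:
            return
        if tag not in self.HEADINGS and tag not in self.BLOCK_TAGS:
            return
        text = " ".join("".join(self.buffer).split())  # normalise whitespace
        if text:
            self.lines.append(self.prefix + text)
        self.prefix = None


def html_to_markdown(html):
    """Step 1: translate supported HTML tags into Markdown blocks."""
    converter = MarkdownConverter()
    converter.feed(html)
    converter.close()
    return "\n\n".join(converter.lines)


HEADER = re.compile(r"^#{1,6} ")


def split_on_headers(markdown):
    """Step 2: start a new chunk at every Markdown header line."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if HEADER.match(line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


sample_html = "<h1>Title</h1><p>Intro <b>text</b>.</p><h2>Part</h2><ul><li>one</li><li>two</li></ul>"
for chunk in split_on_headers(html_to_markdown(sample_html)):
    print(repr(chunk))
```

Each chunk keeps its header together with the content beneath it, which is the property the info text attributes to header-based splitting. A sketch this small will mis-split `#` comment lines inside fenced code blocks and drops the bullet when a `<p>` is nested in an `<li>`.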
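The server policy warning in the info text advises checking whether the target server allows automated clients. One way to inspect part of that policy is the site's robots.txt, sketched below with Python's standard `urllib.robotparser`. This is an assumption about how such a check could look, not something this commit adds to app.py; `is_fetch_allowed` and the user-agent string `content-chunker-bot` are hypothetical names. Rate limits and IP blocks are not visible in robots.txt, so a passing check does not guarantee a successful fetch.

```python
from urllib.robotparser import RobotFileParser


def is_fetch_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example robots.txt content (made up for illustration).
robots_txt = "User-agent: *\nDisallow: /private/\n"

print(is_fetch_allowed(robots_txt, "content-chunker-bot", "https://example.com/blog/post"))
print(is_fetch_allowed(robots_txt, "content-chunker-bot", "https://example.com/private/page"))
```

Passing the robots.txt text in as a string keeps the check testable offline; in a live app the text would first have to be downloaded from the target site.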