axelsirota committed on
Commit d49e945 · verified · 1 Parent(s): 828e609

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +10 -4
  2. app.py +243 -0
  3. requirements.txt +1 -0
README.md CHANGED
@@ -1,12 +1,18 @@
  ---
  title: Chunking Visualizer
- emoji: 🐢
- colorFrom: blue
+ emoji: ✂️
+ colorFrom: green
  colorTo: blue
  sdk: gradio
- sdk_version: 6.5.1
+ sdk_version: 5.9.1
  app_file: app.py
  pinned: false
+ license: mit
+ short_description: See how chunking strategies split your documents
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Chunking Visualizer
+
+ Visualize how different chunking strategies split your documents. Spot problems before they hit production.
+
+ Part of the **AI for Product Managers** course by Data Trainers LLC.
app.py ADDED
@@ -0,0 +1,243 @@
+ import gradio as gr
+ import re
+
+ SAMPLE_DOCS = {
+     "FAQ Document": """Q: What is your return policy?
+ A: You can return most items within 30 days of purchase for a full refund. Items must be in original condition with tags attached.
+
+ Q: How long does shipping take?
+ A: Standard shipping takes 5-7 business days. Express shipping takes 2-3 business days.
+
+ Q: Do you offer international shipping?
+ A: Yes, we ship to over 50 countries. International shipping typically takes 10-14 business days.
+
+ Q: How do I track my order?
+ A: Once your order ships, you'll receive an email with tracking information. You can also check order status in your account.
+
+ Q: What payment methods do you accept?
+ A: We accept Visa, Mastercard, American Express, PayPal, and Apple Pay.""",
+
+     "Product Documentation": """Smart Thermostat Pro - User Guide
+
+ Installation:
+ Turn off power at the circuit breaker before beginning installation. Remove your old thermostat and take a photo of the wiring. The Smart Thermostat Pro is compatible with most 24V heating and cooling systems.
+
+ Setup:
+ Download the SmartHome app and create an account. The thermostat will automatically enter pairing mode when powered on. Follow the in-app instructions to connect to your WiFi network.
+
+ Daily Use:
+ The touchscreen displays current temperature and humidity. Swipe left or right to adjust target temperature. Tap the calendar icon to view and edit your schedule.
+
+ Energy Saving Features:
+ Auto-Away detects when you leave and adjusts temperature to save energy. The monthly energy report shows your usage patterns and savings. Eco mode reduces heating/cooling by 2 degrees to save up to 15% on energy bills.
+
+ Troubleshooting:
+ If the display is blank, check that power is connected at the circuit breaker. If WiFi won't connect, ensure your network is 2.4GHz (5GHz is not supported). For heating/cooling issues, verify the system wires match the terminal labels.""",
+
+     "Policy Document": """Employee Remote Work Policy
+
+ 1. Eligibility
+ All full-time employees who have completed their probationary period are eligible for remote work. Certain roles requiring physical presence are exempt from this policy.
+
+ 2. Core Hours
+ Remote employees must be available from 10am to 3pm in their local timezone. This ensures overlap for team collaboration and meetings.
+
+ 3. Equipment
+ The company provides a laptop and external monitor. Employees are responsible for maintaining reliable internet connectivity with minimum 25 Mbps speed.
+
+ 4. Communication
+ Employees must respond to messages within 2 hours during core hours. All meetings should be attended with camera on unless otherwise specified.
+
+ 5. Performance
+ Remote work privileges are contingent on meeting performance expectations. Managers will review remote work arrangements quarterly.
+
+ 6. Expenses
+ Home office setup stipend: $500 one-time. Monthly internet reimbursement: up to $50. Coworking space usage requires pre-approval."""
+ }
+
+
+ def chunk_fixed_size(text, size, overlap_pct):
+     """Fixed-size chunking."""
+     overlap = min(int(size * overlap_pct / 100), size - 1)  # keep the step positive
+     chunks = []
+     start = 0
+     while start < len(text):
+         end = min(start + size, len(text))
+         chunks.append(text[start:end])
+         start = end - overlap if end < len(text) else end  # stop once the text is consumed
+     return chunks
+
+
+ def chunk_sentence(text, sentences_per_chunk):
+     """Sentence-based chunking."""
+     sentences = re.split(r'(?<=[.!?])\s+', text)
+     chunks = []
+     for i in range(0, len(sentences), sentences_per_chunk):
+         chunk = ' '.join(sentences[i:i + sentences_per_chunk])
+         if chunk.strip():
+             chunks.append(chunk.strip())
+     return chunks
+
+
+ def chunk_paragraph(text):
+     """Paragraph-based chunking."""
+     paragraphs = text.split('\n\n')
+     return [p.strip() for p in paragraphs if p.strip()]
+
+
+ def chunk_qa_pairs(text):
+     """Q&A pair chunking (for FAQ documents)."""
+     pattern = r'(Q:.*?A:.*?)(?=Q:|$)'
+     matches = re.findall(pattern, text, re.DOTALL)
+     return [m.strip() for m in matches if m.strip()]
+
+
+ def visualize_chunks(text, strategy, chunk_size, overlap_pct, sentences_per_chunk):
+     """Generate chunk visualization."""
+     if not text.strip():
+         return "Please provide text to chunk.", "", ""
+
+     # Apply chunking strategy
+     if strategy == "Fixed Size":
+         chunks = chunk_fixed_size(text, chunk_size, overlap_pct)
+     elif strategy == "Sentence-Based":
+         chunks = chunk_sentence(text, sentences_per_chunk)
+     elif strategy == "Paragraph-Based":
+         chunks = chunk_paragraph(text)
+     elif strategy == "Q&A Pairs":
+         chunks = chunk_qa_pairs(text)
+     else:
+         chunks = [text]
+
+     if not chunks:
+         return "No chunks generated. Try a different strategy.", "", ""
+
+     # Calculate stats
+     total_chars = sum(len(c) for c in chunks)
+     avg_size = total_chars / len(chunks)
+     min_size = min(len(c) for c in chunks)
+     max_size = max(len(c) for c in chunks)
+
+     # Check for problems
+     problems = []
+     split_sentences = 0
+     for i, chunk in enumerate(chunks):
+         if not chunk.rstrip().endswith(('.', '!', '?', '"')) and i < len(chunks) - 1:
+             split_sentences += 1
+
+     if split_sentences > 0:
+         problems.append(f"⚠️ {split_sentences} chunks end mid-sentence")
+     if min_size < 50:
+         problems.append(f"⚠️ Some chunks are very small ({min_size} chars)")
+     if max_size > 2000:
+         problems.append(f"⚠️ Some chunks are very large ({max_size} chars)")
+
+     # Stats display
+     stats = f"""### Chunking Statistics
+
+ | Metric | Value |
+ |--------|-------|
+ | Total Chunks | {len(chunks)} |
+ | Average Size | {avg_size:.0f} characters |
+ | Min Size | {min_size} characters |
+ | Max Size | {max_size} characters |
+ | Total Characters | {total_chars} |
+
+ """
+     if problems:
+         stats += "### ⚠️ Potential Issues\n" + "\n".join(problems)
+     else:
+         stats += "### ✅ No obvious issues detected"
+
+     # Chunk display with color coding
+     colors = ['#E6F7F5', '#D4F0EC', '#C2E9E3', '#B0E2DA', '#9EDBD1', '#8CD4C8', '#7ACDBF']
+     chunk_display = "### Chunk Preview\n\n"
+
+     for i, chunk in enumerate(chunks[:10]):  # Show first 10
+         color = colors[i % len(colors)]
+         ends_mid_sentence = not chunk.rstrip().endswith(('.', '!', '?', '"')) and i < len(chunks) - 1
+         border = "2px solid #dc2626" if ends_mid_sentence else "1px solid #40B8A6"
+         warning = " ⚠️ *ends mid-sentence*" if ends_mid_sentence else ""
+
+         preview = chunk[:200] + "..." if len(chunk) > 200 else chunk
+         chunk_display += f"**Chunk {i+1}** ({len(chunk)} chars){warning}\n```\n{preview}\n```\n\n"
+
+     if len(chunks) > 10:
+         chunk_display += f"*... and {len(chunks) - 10} more chunks*"
+
+     return stats, chunk_display, f"Strategy: {strategy} | Chunks: {len(chunks)}"
+
+
+ def load_sample(sample_name):
+     return SAMPLE_DOCS.get(sample_name, "")
+
+
+ # Build interface
+ with gr.Blocks(title="Chunking Visualizer", theme=gr.themes.Soft()) as demo:
+     gr.Markdown(
+         "# Chunking Visualizer\n\n"
+         "**PM Decision:** Your engineering team says they'll 'chunk the documents.' "
+         "This tool shows you exactly what that means and helps you spot potential problems "
+         "before they affect retrieval quality.\n\n"
+         "Try different strategies and see how they split your documents."
+     )
+
+     with gr.Row():
+         with gr.Column(scale=1):
+             sample_dropdown = gr.Dropdown(
+                 choices=list(SAMPLE_DOCS.keys()),
+                 label="Load Sample Document",
+                 value="FAQ Document"
+             )
+             text_input = gr.Textbox(
+                 label="Document Text",
+                 placeholder="Paste your document here...",
+                 lines=12,
+                 value=SAMPLE_DOCS["FAQ Document"]
+             )
+
+             strategy = gr.Radio(
+                 choices=["Fixed Size", "Sentence-Based", "Paragraph-Based", "Q&A Pairs"],
+                 label="Chunking Strategy",
+                 value="Fixed Size"
+             )
+
+             with gr.Row():
+                 chunk_size = gr.Slider(100, 1000, value=300, step=50, label="Chunk Size (chars)")
+                 overlap = gr.Slider(0, 50, value=10, step=5, label="Overlap (%)")
+
+             sentences = gr.Slider(1, 10, value=3, step=1, label="Sentences per Chunk")
+
+             visualize_btn = gr.Button("Visualize Chunks", variant="primary")
+
+         with gr.Column(scale=1):
+             summary_output = gr.Textbox(label="Summary", interactive=False)
+             stats_output = gr.Markdown(label="Statistics")
+             chunks_output = gr.Markdown(label="Chunks")
+
+     # Events
+     sample_dropdown.change(load_sample, sample_dropdown, text_input)
+
+     visualize_btn.click(
+         visualize_chunks,
+         inputs=[text_input, strategy, chunk_size, overlap, sentences],
+         outputs=[stats_output, chunks_output, summary_output]
+     )
+
+     # Auto-update on strategy change
+     strategy.change(
+         visualize_chunks,
+         inputs=[text_input, strategy, chunk_size, overlap, sentences],
+         outputs=[stats_output, chunks_output, summary_output]
+     )
+
+     gr.Markdown(
+         "---\n"
+         "**PM Takeaway:** The right chunking strategy depends on your document type. "
+         "FAQs work best with Q&A pair chunking. Product docs work with paragraph or sentence-based. "
+         "Always test with real queries to verify retrieval quality.\n\n"
+         "*AI for Product Managers*"
+     )
+
+ if __name__ == "__main__":
+     demo.launch()
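Review note on `chunk_fixed_size`: fixed-size chunking with overlap has to move forward on every step and stop once the final chunk reaches the end of the text. The following is a minimal standalone sketch of that idea, not the code in this commit. The names `fixed_size_chunks` and `stride` are hypothetical, and the sketch assumes the overlap should be clamped below the chunk size.

```python
def fixed_size_chunks(text: str, size: int, overlap_pct: float) -> list[str]:
    """Split text into windows of `size` characters that share `overlap_pct` percent."""
    if size <= 0:
        raise ValueError("size must be positive")
    # Assumption: clamp the overlap so the window always advances by at least 1 char.
    overlap = min(int(size * overlap_pct / 100), size - 1)
    stride = size - overlap
    chunks = []
    for start in range(0, len(text), stride):
        chunks.append(text[start:start + size])
        # Stop as soon as a window reaches the end of the text, so the tail
        # is not emitted again as ever-shorter overlapping fragments.
        if start + size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for number, chunk in enumerate(fixed_size_chunks("abcdefghij", 4, 50), start=1):
        print(number, repr(chunk))
```

Iterating with `range` and an explicit stride makes termination easy to check by inspection, which is harder with a `while` loop that recomputes `start` from `end - overlap`.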
requirements.txt ADDED
@@ -0,0 +1 @@
+ # No additional requirements - uses Gradio only
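Review note on the "ends mid-sentence" warning in `app.py`: the check is a heuristic that flags any chunk other than the last one whose trailing character is not `.`, `!`, `?`, or `"`. A minimal standalone sketch of that heuristic follows. The helper names `ends_mid_sentence` and `count_split_chunks` are hypothetical and do not appear in the commit.

```python
SENTENCE_ENDINGS = ('.', '!', '?', '"')


def ends_mid_sentence(chunk: str, is_last: bool) -> bool:
    """Return True when a non-final chunk does not end in sentence punctuation."""
    return not is_last and not chunk.rstrip().endswith(SENTENCE_ENDINGS)


def count_split_chunks(chunks: list[str]) -> int:
    """Count the chunks that the heuristic would flag."""
    last = len(chunks) - 1
    return sum(ends_mid_sentence(c, i == last) for i, c in enumerate(chunks))


if __name__ == "__main__":
    print(count_split_chunks(["Hello world.", "This is", "cut off"]))
```

Because the heuristic only looks at the final character, a standalone heading such as "Smart Thermostat Pro - User Guide" is also flagged when paragraph-based chunking isolates it, so the warning count is best read as a prompt to inspect the chunks, not as a precise measure.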