milindkamat0507 committed on
Commit 3454e5c · verified · 1 Parent(s): 5840b29

Upload app.py

Files changed (1)
  1. app.py +774 -0
app.py ADDED
@@ -0,0 +1,774 @@
+ """
+ app.py — Topic Modelling Agentic AI | Gradio UI
+ ═══════════════════════════════════════════════════
+ Version: 3.0.0 | April 2026
+ Stack:   Gradio 5.x + LangGraph + Mistral + BERTopic
+ Deploy:  HuggingFace Spaces (sdk: gradio)
+ Rules:   Zero gr.HTML(). All UI via native Gradio components.
+          See GRADIO_UI_GUIDELINES_v2.docx for full standards.
+
+ ARCHITECTURE — 20 Blocks in 5 Sections
+ ─────────────────────────────────────────
+ Section 1: Setup         (B1–B3)    Imports, agent, theme
+ Section 2: Helpers       (B4–B10)   Pure Python functions, no UI
+ Section 3: UI Layout     (B11–B17)  gr.Blocks with native components
+ Section 4: Event Wiring  (B18–B19)  Connect UI to functions
+ Section 5: Launch        (B20)      Start server
+
+ BLOCK COMMUNICATION MAP
+ ─────────────────────────
+ B6  (respond) ←→ B2   (agent)   : invokes agent for chat
+ B6  (respond) →  B4   (output)  : scans for download files
+ B7  (chart)   →  B17a (display) : loads Plotly JSON → gr.Plot
+ B8  (table)   →  B16  (review)  : builds rows → gr.Dataframe
+ B9  (papers)  ←  B16  (review)  : triggered by row click
+ B10 (submit)  →  B2   (agent)   : sends review edits to agent
+ B18 (wiring)  →  B5,B7,B8       : refreshes progress, charts, table
+ """
+ import os
+ import glob
+ import json
+
+ import plotly.io as pio
+ import gradio as gr
+ from langchain_mistralai import ChatMistralAI
+ from langgraph.prebuilt import create_react_agent
+ from langgraph.checkpoint.memory import MemorySaver
+ from agent import SYSTEM_PROMPT, get_local_tools
+
+ print(">>> app.py: imports complete")
+
+
+ # ╔═══════════════════════════════════════════════════════════════╗
+ # ║  SECTION 1 — SETUP                                            ║
+ # ║  One-time initialization: agent creation and visual theme.    ║
+ # ║  Nothing here renders UI — it prepares the backend brain      ║
+ # ║  and the visual identity for the entire application.          ║
+ # ╚═══════════════════════════════════════════════════════════════╝
+
+
+ # ── B2: Agent setup ─────────────────────────────────────────────
+ # PURPOSE:  Create the LangGraph ReAct agent that powers all chat.
+ #           Connects Mistral LLM to BERTopic tools with memory so
+ #           the agent remembers context across conversation turns.
+ # PRODUCES: `agent` — used by B6 (respond) and B10 (_submit_review)
+ # IMPORTS:  SYSTEM_PROMPT, get_local_tools from agent.py
+ # NOTE:     MemorySaver keeps conversation in RAM (resets on restart).
+ #           For persistent memory, swap to SQLite checkpointer.
+ # ────────────────────────────────────────────────────────────────
+ llm = ChatMistralAI(model="mistral-small-latest", temperature=0, timeout=300)
+ tools = get_local_tools()
+ agent = create_react_agent(
+     model=llm, tools=tools, prompt=SYSTEM_PROMPT, checkpointer=MemorySaver()
+ )
+ print(f">>> app.py: agent ready ({len(tools)} tools)")
+
+ _msg_count = 0            # Global message counter (shared across users)
+ _uploaded = {"path": ""}  # Last uploaded CSV path (shared session)
+ # ── end B2: Agent setup ─────────────────────────────────────────
+
+
+ # ── B3: Theme ───────────────────────────────────────────────────
+ # PURPOSE:  Define the visual identity of the entire application.
+ #           Replaces ALL custom CSS that was previously in HEADER_HTML:
+ #           - DM Sans font (was @import url in <style> block)
+ #           - Slate color palette (was hardcoded hex in inline styles)
+ #           - Soft rounded corners and spacing
+ # USED BY:  B20 (demo.launch) — Gradio 6 moved theme from gr.Blocks
+ #           to launch(). The theme object is created here but applied
+ #           in B20 via demo.launch(theme=theme).
+ # REPLACES: Old HEADER_HTML lines 33-38 (<style> block with CSS)
+ # ────────────────────────────────────────────────────────────────
+ theme = gr.themes.Soft(
+     primary_hue="slate",
+     font=gr.themes.GoogleFont("DM Sans"),
+     font_mono=gr.themes.GoogleFont("JetBrains Mono"),
+ )
+ # ── end B3: Theme ───────────────────────────────────────────────
+
+
+ # ╔═══════════════════════════════════════════════════════════════╗
+ # ║  SECTION 2 — HELPER FUNCTIONS                                 ║
+ # ║  Pure Python functions that process data and return clean     ║
+ # ║  values (strings, lists, figures). NONE of these functions    ║
+ # ║  return HTML strings. They feed data to UI components in      ║
+ # ║  Section 3 via event handlers in Section 4.                   ║
+ # ╚═══════════════════════════════════════════════════════════════╝
+
+
+ # ── B4: _latest_output() ────────────────────────────────────────
+ # PURPOSE:  Scan /tmp for all rq4_* output files generated by the
+ #           BERTopic agent pipeline (CSVs, JSONs, chart files).
+ #           Sorts them by pipeline phase order so the download
+ #           component shows files in logical sequence.
+ # RETURNS:  List[str] of filepaths sorted by phase, or None
+ # USED BY:  B6 (respond) — attaches to download component after
+ #           each agent response
+ #           B10 (_submit_review) — refreshes downloads after review
+ #           B19 (_auto_load_csv) — refreshes after initial upload
+ # ────────────────────────────────────────────────────────────────
+ def _latest_output():
+     """Scan /tmp for ALL rq4_* files, sorted by phase order.
+     Returns list of filepaths for gr.File download component."""
+     phase_order = {
+         "summaries": 1, "labels": 2, "themes": 3, "taxonomy": 4,
+         "emb": 0, "intertopic": 5, "bars": 6, "hierarchy": 7,
+         "heatmap": 8, "comparison": 9, "narrative": 10,
+     }
+     files = (
+         glob.glob("/tmp/rq4_*.csv")
+         + glob.glob("/tmp/rq4_*.json")
+         + glob.glob("/tmp/checkpoints/rq4_*.json")
+     )
+
+     def phase_score(path):
+         # Sum the ranks of every phase keyword that appears in the path
+         return sum(rank for key, rank in phase_order.items() if key in path)
+
+     return sorted(files, key=phase_score) or None
+ # ── end B4: _latest_output ──────────────────────────────────────
+
+
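Review note on B4: the phase ordering can be tried in isolation. The sketch below is a hypothetical standalone variant, not the committed function: it ranks each file by the lowest-ranked phase keyword found in its base name (the committed code sums the ranks of all keywords found in the full path), and it sends unknown files to the end. The names `PHASE_ORDER`, `phase_rank` and `sort_by_phase` are assumptions made for illustration.

```python
import os

# Phase ranks copied from the phase_order dict in _latest_output().
PHASE_ORDER = {
    "summaries": 1, "labels": 2, "themes": 3, "taxonomy": 4,
    "emb": 0, "intertopic": 5, "bars": 6, "hierarchy": 7,
    "heatmap": 8, "comparison": 9, "narrative": 10,
}


def phase_rank(path):
    """Return the lowest rank among phase keywords found in the file name.

    Files that match no keyword get a rank past the last phase.
    """
    name = os.path.basename(path)
    ranks = [rank for key, rank in PHASE_ORDER.items() if key in name]
    return min(ranks, default=len(PHASE_ORDER))


def sort_by_phase(paths):
    """Sort paths by phase rank, breaking ties alphabetically."""
    return sorted(paths, key=lambda p: (phase_rank(p), p))
```

Matching on the base name keeps directory names such as `checkpoints` out of the keyword search, which is a design choice of this sketch rather than something the commit does.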
+ # ── B5: _build_progress() ───────────────────────────────────────
+ # PURPOSE:  Check which Braun & Clarke phases are complete by
+ #           scanning for checkpoint files on disk. Returns a
+ #           human-readable emoji string showing pipeline status.
+ # RETURNS:  str like "✅ Load → ✅ Codes → ⬜ Themes → ⬜ Report"
+ #           (each phase is either ✅ done or ⬜ not done)
+ # USED BY:  B14 (phase_progress initial value)
+ #           B18 (respond_with_viz) — refreshes after each agent turn
+ #           B10 (_submit_review) — refreshes after review submission
+ #           B19 (_auto_load_csv) — refreshes after CSV upload
+ # REPLACES: Old _build_progress() which returned 24 lines of HTML
+ #           with inline-styled <span> elements and color codes.
+ #           Now returns pure text with emoji — gr.Markdown renders it.
+ # ────────────────────────────────────────────────────────────────
+ def _build_progress():
+     """Return emoji progress pipeline. NO HTML — just text + emoji.
+     Displayed in gr.Markdown component (B14)."""
+     checks = [
+         ("Load", bool(glob.glob("/tmp/checkpoints/rq4_*_summaries.json")
+                       or glob.glob("/tmp/checkpoints/rq4_*_emb.npy"))),
+         ("Codes", bool(glob.glob("/tmp/checkpoints/rq4_*_labels.json"))),
+         ("Themes", bool(glob.glob("/tmp/checkpoints/rq4_*_themes.json"))),
+         ("Review", bool(glob.glob("/tmp/checkpoints/rq4_*_themes.json"))),
+         ("Names", bool(glob.glob("/tmp/checkpoints/rq4_*_themes.json"))),
+         ("PAJAIS", bool(glob.glob("/tmp/checkpoints/rq4_*_taxonomy_map.json"))),
+         ("Report", bool(glob.glob("/tmp/rq4_comparison.csv")
+                         or glob.glob("/tmp/rq4_narrative.txt"))),
+     ]
+     return " → ".join(f"{'✅' if done else '⬜'} {name}" for name, done in checks)
+ # ── end B5: _build_progress ─────────────────────────────────────
+
+
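Review note on B5: the function marks each phase as either done or not done. If a separate marker for the first unfinished phase is wanted, one possible variant is sketched below. `render_progress` is a hypothetical helper that takes the same `(name, done)` pairs as the `checks` list; it is not part of the commit.

```python
def render_progress(checks):
    """Render (name, done) pairs as an emoji pipeline string.

    Completed phases get a check mark, the first incomplete phase gets
    an hourglass, and every later incomplete phase gets an empty box.
    """
    parts = []
    in_progress_marked = False
    for name, done in checks:
        if done:
            icon = "✅"
        elif not in_progress_marked:
            icon = "⏳"
            in_progress_marked = True
        else:
            icon = "⬜"
        parts.append(f"{icon} {name}")
    return " → ".join(parts)
```

Whether the first unfinished phase is really "in progress" depends on the pipeline; this sketch only infers it from ordering.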
+ # ── B6: respond() ───────────────────────────────────────────────
+ # PURPOSE:  Core chat handler. This is the brain of the app.
+ #           1. Stores uploaded CSV file path (if new upload)
+ #           2. Appends file location + phase context to user message
+ #              so the agent knows what data is available
+ #           3. Yields a "thinking..." bubble immediately (user sees
+ #              instant feedback while agent processes)
+ #           4. Invokes the LangGraph agent (Mistral decides which
+ #              BERTopic tools to call)
+ #           5. Replaces thinking bubble with actual agent response
+ #           6. Attaches latest output files to download component
+ # INPUTS:   message (str), chat_history (list[dict]), uploaded_file (str|None)
+ # YIELDS:   Tuple of (chat_history, empty_string, download_files)
+ #           — yields TWICE: first with progress bubble, then with final response
+ # TALKS TO: B2 (agent.invoke) — sends message, gets response
+ #           B4 (_latest_output) — gets download file list
+ # USED BY:  B18 (respond_with_viz wraps this)
+ #           B19 (_auto_load_csv wraps this)
+ # NOTE:     Uses single thread_id="session" so agent remembers
+ #           previous turns (loaded CSV path, current phase, etc.)
+ # ────────────────────────────────────────────────────────────────
+ def respond(message, chat_history, uploaded_file):
+     """Handle one chat turn with the LangGraph agent.
+     Yields twice: progress bubble → final response."""
+     global _msg_count
+     _msg_count += 1
+
+     # Store file path; keep the previous upload when no new file is given
+     _uploaded["path"] = uploaded_file or _uploaded.get("path", "")
+
+     # Tell agent where the CSV is (prevents hallucinated filepaths)
+     if _uploaded["path"]:
+         file_note = f"\n[CSV file at: {_uploaded['path']}]"
+     else:
+         file_note = "\n[No CSV uploaded yet — ask user to upload a file first]"
+
+     # Tell agent what phase we're in based on existing checkpoint files
+     if glob.glob("/tmp/checkpoints/rq4_*_labels.json"):
+         phase_context = "\n[Phase context: labels exist]"
+     elif glob.glob("/tmp/checkpoints/rq4_*_emb.npy"):
+         phase_context = "\n[Phase context: embeddings exist]"
+     else:
+         phase_context = "\n[Phase context: fresh start]"
+
+     text = ((message or "").strip() or "Analyze my Scopus CSV") + file_note + phase_context
+     print(f"\n{'='*60}\n>>> MSG #{_msg_count}: '{text[:120]}'\n{'='*60}")
+
+     # YIELD 1: Show "thinking" bubble immediately
+     chat_history = chat_history + [
+         {"role": "user", "content": (message or "").strip()},
+         {"role": "assistant", "content": "🔬 **Working...** _Agent is thinking..._"},
+     ]
+     yield chat_history, "", _latest_output()
+
+     # Invoke agent — Mistral brain decides which tools to call
+     result = agent.invoke(
+         {"messages": [("human", text)]},
+         config={"configurable": {"thread_id": "session"}},
+     )
+     response = result["messages"][-1].content
+     print(f">>> Response ({len(response)} chars)")
+
+     # YIELD 2: Replace thinking bubble with actual response
+     chat_history[-1] = {"role": "assistant", "content": response}
+     gr.Info(f"Agent responded ({len(response)} chars)")
+     yield chat_history, "", _latest_output()
+ # ── end B6: respond ─────────────────────────────────────────────
+
+
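Review note on B6: the two-yield pattern can be exercised without Gradio or the agent. The sketch below is a simplified stand-in, assuming a hypothetical `invoke` callable in place of `agent.invoke` and omitting the download-file output. It shows the placeholder bubble being yielded first and then replaced by the real reply.

```python
def respond_sketch(message, history, invoke):
    """Yield a placeholder turn, then the final turn.

    `invoke` is a hypothetical callable (str -> str) standing in for the
    agent call. Each yield is (history, textbox_value); the empty string
    clears the input box.
    """
    pending = history + [
        {"role": "user", "content": message},
        {"role": "assistant", "content": "Working..."},
    ]
    yield pending, ""

    reply = invoke(message)
    # Build a new list so the first yielded value is not mutated afterwards.
    final = pending[:-1] + [{"role": "assistant", "content": reply}]
    yield final, ""


# Usage: step through both yields with a trivial stand-in for the agent.
for history, _ in respond_sketch("hello", [], lambda text: text.upper()):
    print(history[-1]["content"])
```

Unlike the committed function, this sketch builds a fresh list for the second yield instead of overwriting the last element in place, so a caller that stores both yields sees two distinct states.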
+ # ── B7: _load_chart() ───────────────────────────────────────────
+ # PURPOSE:  Load a BERTopic visualization chart from a saved Plotly
+ #           JSON file on disk and return the figure object.
+ #           The gr.Plot component in B17a renders this directly —
+ #           no iframe, no HTML escaping, no srcdoc hack.
+ # INPUT:    chart_name (str) — filename like "rq4_intertopic.json"
+ # RETURNS:  plotly.graph_objects.Figure or None
+ # USED BY:  B17a (chart_selector.change event)
+ #           B18 (respond_with_viz) — auto-shows latest chart
+ # REPLACES: Old _load_chart() which used html.escape() + iframe
+ #           srcdoc to embed HTML files. That was 8 lines of hack.
+ # REQUIRES: BERTopic tools in tools.py must save charts as Plotly
+ #           JSON via pio.to_json(fig) instead of fig.write_html().
+ # ────────────────────────────────────────────────────────────────
+ def _load_chart(chart_name):
+     """Load Plotly chart from JSON file. Returns figure for gr.Plot.
+     No HTML, no iframe — just a native Plotly figure object."""
+     path = f"/tmp/{chart_name}"
+     # Guard: nothing selected yet, or the chart file does not exist
+     if not chart_name or not os.path.exists(path):
+         return None
+     with open(path) as fh:
+         return pio.from_json(fh.read())
+
+ def _get_chart_choices():
+     """Find all rq4_*.json chart files in /tmp."""
+     files = sorted(glob.glob("/tmp/rq4_*.json"))
+     return list(map(os.path.basename, files))
+ # ── end B7: _load_chart ─────────────────────────────────────────
+
+
+ # ── B8: _load_review_table() ────────────────────────────────────
+ # PURPOSE:  Load the latest BERTopic phase data (taxonomy, themes,
+ #           labels, or summaries — whichever is most recent) and
+ #           build a review table for the researcher to approve,
+ #           rename, or annotate topics.
+ # RETURNS:  List[List] with 8 columns matching the Dataframe schema:
+ #           [#, Label, Evidence, Sentences, Papers, Approve, Rename, Reasoning]
+ #           - Column 5 (Approve) is bool (True/False) → renders as checkbox
+ #           - Columns 0-4 are read-only (enforced by static_columns in B16)
+ #           - Columns 5-7 are editable by the researcher
+ # USED BY:  B16 (initial table value)
+ #           B10 (_submit_review) — reloads after agent processes review
+ #           B18 (respond_with_viz) — refreshes after each agent turn
+ # REPLACES: Old version which returned "yes"/"no" strings for Approve.
+ #           Now returns True/False so gr.Dataframe renders checkboxes.
+ # ────────────────────────────────────────────────────────────────
+ def _load_review_table():
+     """Build review table from latest checkpoint JSON.
+     Approve column is bool (renders as checkbox in gr.Dataframe).
+     Priority: taxonomy_map > themes > labels > summaries."""
+     taxonomy_files = sorted(glob.glob("/tmp/checkpoints/rq4_*_taxonomy_map.json"))
+     theme_files = sorted(glob.glob("/tmp/checkpoints/rq4_*_themes.json"))
+     label_files = sorted(glob.glob("/tmp/checkpoints/rq4_*_labels.json"))
+     summary_files = sorted(glob.glob("/tmp/checkpoints/rq4_*_summaries.json"))
+
+     # Pick most advanced checkpoint available
+     candidates = [taxonomy_files, theme_files, label_files, summary_files]
+     path = next((files[-1] for files in candidates if files), "")
+     is_taxonomy = bool(taxonomy_files) and path == taxonomy_files[-1]
+
+     data = []
+     if path and os.path.exists(path):
+         with open(path) as fh:
+             data = json.load(fh)
+
+     # For taxonomy: merge with themes to get sentence/paper counts
+     theme_lookup = {}
+     if is_taxonomy and theme_files:
+         with open(theme_files[-1]) as fh:
+             theme_lookup = {t.get("label", ""): t for t in json.load(fh)}
+
+     rows = []
+     for idx, topic in enumerate(data):
+         label = topic.get("label", topic.get("top_words", ""))
+         nearest = topic.get("nearest") or []
+         # Evidence: PAJAIS mapping for taxonomy, nearest sentence otherwise
+         if is_taxonomy:
+             evidence = f"→ {topic.get('pajais_match', '?')} | {topic.get('reasoning', '')}"[:120]
+         elif nearest:
+             evidence = nearest[0].get("sentence", "")[:120] + "..."
+         else:
+             evidence = ""
+         # Sentence/paper counts (from the matching theme when available)
+         counts = theme_lookup.get(topic.get("label", ""), topic)
+         rows.append([
+             idx,                                                         # #
+             label[:60],                                                  # Label
+             evidence,                                                    # Evidence
+             counts.get("sentence_count", topic.get("sentence_count", 0)),
+             counts.get("paper_count", topic.get("paper_count", 0)),
+             True,                                                        # Approve (bool → checkbox)
+             "",                                                          # Rename To
+             "",                                                          # Reasoning
+         ])
+     return rows or [[0, "No data yet", "", 0, 0, False, "", ""]]
+ # ── end B8: _load_review_table ──────────────────────────────────
+
+
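Review note on B8: the "most advanced checkpoint wins" rule can be checked against a scratch directory. The sketch below is a hypothetical standalone helper. `pick_latest_checkpoint` and its `directory` parameter are assumptions for illustration, while the stage suffixes and the `rq4_*_<stage>.json` pattern follow the globs used in `_load_review_table()`.

```python
import glob
import os

# Most advanced stage first, matching the priority stated in the docstring.
STAGE_PRIORITY = ["taxonomy_map", "themes", "labels", "summaries"]


def pick_latest_checkpoint(directory):
    """Return the newest file (by sorted name) of the most advanced stage.

    Returns an empty string when the directory holds no checkpoint.
    """
    for stage in STAGE_PRIORITY:
        pattern = os.path.join(directory, f"rq4_*_{stage}.json")
        matches = sorted(glob.glob(pattern))
        if matches:
            return matches[-1]
    return ""
```

Passing the directory as an argument, rather than hardcoding `/tmp/checkpoints`, makes the selection rule testable without touching the real pipeline output.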
+ # ── B9: _show_papers_by_select() ────────────────────────────────
+ # PURPOSE:  When the researcher clicks any row in the review table,
+ #           this function fires and shows the papers belonging to
+ #           that topic. Eliminates the old workflow of typing a
+ #           Topic # into a separate input and clicking "Show Papers".
+ # INPUT:    table_data (current table value) and a gr.SelectData
+ #           event — contains .index (row, col) and .value
+ # RETURNS:  str — formatted paper list for gr.Textbox (paper_list)
+ # TRIGGERED BY: review_table.select() event in B16
+ # REPLACES: Old _show_papers(topic_id) + topic_num (gr.Number) +
+ #           view_papers_btn (gr.Button) — all three components removed.
+ # NOTE:     Uses column 0 value (the # column) as topic_id, NOT the
+ #           row index, because filtering/sorting may reorder rows.
+ # ────────────────────────────────────────────────────────────────
+ def _show_papers_by_select(table_data, evt: gr.SelectData):
+     """Show papers for clicked row. Uses column 0 as topic_id.
+     Triggered by review_table.select() — no separate Topic # input needed."""
+     row_idx = evt.index[0]
+
+     # Get topic_id from column 0 of the clicked row (not row index)
+     if hasattr(table_data, "iloc"):
+         topic_id = int(table_data.iloc[row_idx, 0])
+     else:
+         topic_id = int(table_data[row_idx][0])
+
+     # Load paper data from checkpoint files
+     label_files = sorted(glob.glob("/tmp/checkpoints/rq4_*_labels.json"))
+     summary_files = sorted(glob.glob("/tmp/checkpoints/rq4_*_summaries.json"))
+     all_files = label_files or summary_files
+
+     lines = []
+     for f in all_files:
+         source = os.path.basename(f).split("_")[1]
+         with open(f) as fh:
+             data = json.load(fh)
+         for t in data:
+             if t.get("topic_id") != topic_id:
+                 continue
+             evidence = "\n".join(
+                 f" {i+1}. \"{n['sentence'][:200]}\"\n"
+                 f" Paper: {n.get('title', '')[:100]}"
+                 for i, n in enumerate(t.get("nearest", [])[:5])
+             )
+             titles = "\n".join(
+                 f" {i+1}. {title}"
+                 for i, title in enumerate(t.get("paper_titles", []))
+             )
+             lines.append(
+                 f"═══ {source.upper()} — Topic {topic_id}: "
+                 f"{t.get('label', t.get('top_words', '')[:50])} ═══\n"
+                 f"{t.get('sentence_count', 0)} sentences from {t.get('paper_count', 0)} papers\n"
+                 f"AI Reasoning: {t.get('reasoning', 'not yet labeled')}\n\n"
+                 f"── 5 NEAREST CENTROID SENTENCES (evidence) ──\n"
+                 + evidence
+                 + "\n\n── ALL PAPER TITLES ──\n"
+                 + titles
+             )
+     return "\n\n".join(lines) or f"Topic {topic_id} not found."
+ # ── end B9: _show_papers_by_select ──────────────────────────────
+
+
+ # ── B10: _submit_review() ───────────────────────────────────────
+ # PURPOSE:  When the researcher finishes editing the review table
+ #           (checking Approve boxes, typing Rename values, adding
+ #           Reasoning notes) and clicks "Submit Review", this
+ #           function converts those edits into a natural language
+ #           message and sends it to the agent for processing.
+ # INPUTS:   table_data (DataFrame from gr.Dataframe), chat_history (list)
+ # YIELDS:   Tuple of (chat, download, chart_choices, chart_fig,
+ #           review_rows, progress_str) — yields twice (progress → final)
+ # TALKS TO: B2 (agent.invoke) — sends review decisions
+ #           B4 (_latest_output) — refreshes downloads
+ #           B5 (_build_progress) — refreshes pipeline status
+ #           B7 (_get_chart_choices) — refreshes chart dropdown
+ #           B8 (_load_review_table) — reloads table with updated data
+ # NOTE:     Column 5 (Approve) is now bool. True = approve, False = reject.
+ # ────────────────────────────────────────────────────────────────
+ def _submit_review(table_data, chat_history):
+     """Convert review table edits into agent message.
+     Approve column is bool (checkbox), not string."""
+     rows = table_data.values.tolist()
+
+     def _decision(r):
+         rename = str(r[6]).strip()
+         reason = str(r[7]).strip()
+         actions = []
+         if rename:
+             actions.append(f"RENAME to '{rename}'")
+         elif r[5]:
+             actions.append(f"APPROVE '{r[1]}'")
+         if not r[5]:
+             actions.append("REJECT")
+         # Separate multiple actions so RENAME and REJECT do not run together
+         line = f"Topic {int(r[0])}: " + ", ".join(actions)
+         return line + (f" — reason: {reason}" if reason else "")
+
+     lines = [_decision(r) for r in rows]
+     review_msg = "Review decisions:\n" + "\n".join(lines)
+     print(f">>> Review submitted: {review_msg[:200]}")
+
+     # YIELD 1: Show processing bubble
+     chat_history = chat_history + [
+         {"role": "user", "content": review_msg},
+         {"role": "assistant", "content": "🔬 **Processing review decisions...**"},
+     ]
+     gr.Info("Review submitted to agent")
+     yield (chat_history, _latest_output(), gr.update(),
+            gr.update(), gr.update(), _build_progress())
+
+     # Invoke agent with review decisions
+     result = agent.invoke(
+         {"messages": [("human", review_msg)]},
+         config={"configurable": {"thread_id": "session"}},
+     )
+     response = result["messages"][-1].content
+
+     # YIELD 2: Final response + refreshed table/charts
+     chat_history[-1] = {"role": "assistant", "content": response}
+     gr.Info("Review processed — table updated")
+     yield (
+         chat_history,
+         _latest_output(),
+         gr.update(choices=_get_chart_choices()),
+         gr.update(),
+         gr.update(value=_load_review_table()),
+         _build_progress(),
+     )
+ # ── end B10: _submit_review ─────────────────────────────────────
+
+
+ # ╔═══════════════════════════════════════════════════════════════╗
+ # ║  SECTION 3 — UI LAYOUT                                        ║
+ # ║  All visual components defined here using ONLY native Gradio  ║
+ # ║  widgets. Zero gr.HTML() calls. Theming via B3.               ║
+ # ║  Layout: Header → Upload → Progress → Chat → Results tabs     ║
+ # ╚═══════════════════════════════════════════════════════════════╝
+
+ print(">>> Building UI...")
+
+
+ # ── B11: gr.Blocks container ────────────────────────────────────
+ # PURPOSE:  Root container for the entire application UI.
+ #           Enables full browser width via fill_width.
+ # CONTAINS: All UI blocks B12 through B17b
+ # CONFIG:   title — browser tab title (stays on Blocks in Gradio 6)
+ #           fill_width — removes side padding, uses full browser width
+ # NOTE:     In Gradio 6.0, theme/css/footer_links moved from
+ #           gr.Blocks() to demo.launch(). See B20 for those params.
+ # ────────────────────────────────────────────────────────────────
+ with gr.Blocks(
+     title="Topic Modelling — Agentic AI",
+     fill_width=True,
+ ) as demo:
+
+
+     # ── B12: Header ────────────────────────────────────────────
+     # PURPOSE:  Application title and subtitle. Single gr.Markdown
+     #           call replaces 15 lines of HEADER_HTML that included
+     #           a gradient background div, font imports, and inline CSS.
+     # REPLACES: Old HEADER_HTML constant (lines 32-47 of old app.py)
+     # ───────────────────────────────────────────────────────────
+     gr.Markdown(
+         "# 🔬 Topic Modelling — Agentic AI\n"
+         "*Mistral · Cosine Clustering · 384d · B&C Thematic Analysis*"
+     )
+     # ── end B12: Header ────────────────────────────────────────
+
+
+     # ── B13: Data input ────────────────────────────────────────
+     # PURPOSE:    CSV file upload area with inline instructions.
+     #             Researcher uploads their Scopus CSV export here.
+     #             On upload, B19 auto-triggers the first analysis.
+     # COMPONENTS: gr.File (upload) + gr.Markdown (instructions)
+     # EVENTS:     upload.change → B19 (_auto_load_csv)
+     # ───────────────────────────────────────────────────────────
+     gr.Markdown("**① Data input**")
+     with gr.Row():
+         upload = gr.File(label="📂 Upload Scopus CSV", file_types=[".csv"])
+         gr.Markdown("**Upload your CSV** then type `run abstract only` in chat below")
+     # ── end B13: Data input ────────────────────────────────────
+
+
+     # ── B14: Progress pipeline ─────────────────────────────────
+     # PURPOSE:    Visual indicator of which Braun & Clarke analysis
+     #             phases are complete. Updated after every agent action.
+     #             Now uses gr.Markdown with emoji text (was gr.HTML
+     #             with inline-styled colored <span> elements).
+     # COMPONENT:  gr.Markdown — displays emoji string from B5
+     # UPDATED BY: B18 (after chat), B10 (after review), B19 (after upload)
+     # REPLACES:   Old gr.HTML(value=_build_progress()) with 24 lines of HTML
+     # ───────────────────────────────────────────────────────────
+     phase_progress = gr.Markdown(value=_build_progress())
+     # ── end B14: Progress pipeline ─────────────────────────────
+
+
+     # ── B15: Chatbot + input ───────────────────────────────────
+     # PURPOSE:    Main conversation interface between researcher and
+     #             the LangGraph agent. The chatbot displays message
+     #             history with markdown rendering. The textbox + button
+     #             below it capture user input.
+     # COMPONENTS: gr.Chatbot (display), gr.Textbox (input), gr.Button (send)
+     # EVENTS:     msg.submit → B18, send.click → B18
+     # NOTE:       placeholder text guides the researcher on available commands.
+     #             height=300 keeps chat visible while showing results below.
+     # ───────────────────────────────────────────────────────────
+     gr.Markdown("**② Agent conversation** — follow the prompts below")
+     with gr.Group():
+         chatbot = gr.Chatbot(
+             height=300,
+             show_label=False,
+             placeholder="Upload your Scopus CSV above, then type: run abstract only",
+         )
+         with gr.Row():
+             msg = gr.Textbox(
+                 placeholder="run · approve · show topic 4 papers · group 0 1 5 · done",
+                 show_label=False, scale=9, lines=1, max_lines=1, container=False,
+             )
+             send = gr.Button("Send", variant="primary", scale=1, min_width=70)
+     # ── end B15: Chatbot + input ───────────────────────────────
+
+
+     # ── B16: Review table tab ──────────────────────────────────
+     # PURPOSE: Interactive topic review table where the researcher
+     #          approves, renames, or annotates BERTopic-discovered
+     #          topics. This is the core human-in-the-loop interface.
+     #
+     # KEY FEATURES (all native Gradio, no HTML):
+     # - static_columns=[0,1,2,3,4] — first 5 columns (#, Label,
+     #   Evidence, Sentences, Papers) are READ-ONLY. Prevents
+     #   accidental edits to agent-generated data.
+     # - datatype "bool" on column 5 — Approve renders as a native
+     #   CHECKBOX. Researcher clicks to toggle, no typing needed.
+     # - pinned_columns=2 — # and Label columns stay visible when
+     #   scrolling horizontally through wider columns.
+     # - show_search="filter" — built-in column filtering. Researcher
+     #   can filter by paper count, sentence count, etc.
+     # - .select() event — clicking any row auto-loads that topic's
+     #   papers in the textbox below. REPLACES the old workflow of
+     #   Topic # input + Show Papers button (both removed).
+     #
+     # COMPONENTS: gr.Dataframe, gr.Button (submit), gr.Textbox (papers)
+     # EVENTS:     review_table.select → B9 (_show_papers_by_select)
+     #             submit_review.click → B10 (_submit_review)
+     # DATA:       Loaded by B8 (_load_review_table)
+     # REPLACES:   Old gr.Dataframe (no static_columns, string Approve,
+     #             no search) + topic_num + view_papers_btn
+     # ───────────────────────────────────────────────────────────
+     gr.Markdown("**③ Results** — review table, charts, downloads")
+     with gr.Tabs():
+         with gr.Tab("📋 Review Table"):
+             gr.Markdown(
+                 "*Edit Approve / Rename To / Reasoning → click Submit. "
+                 "Click any row to see its papers below.*"
+             )
+             review_table = gr.Dataframe(
+                 headers=[
+                     "#", "Topic Label", "Top Evidence Sentence",
+                     "Sentences", "Papers", "Approve", "Rename To", "Your Reasoning",
+                 ],
+                 datatype=[
+                     "number", "str", "str", "number", "number",
+                     "bool", "str", "str",
+                 ],
+                 interactive=True,
+                 column_count=8,
+                 # NOTE: These features need Gradio >=5.23. Uncomment when available:
+                 # static_columns=[0, 1, 2, 3, 4],
+                 # pinned_columns=2,
+                 # show_search="filter",
+                 # show_row_numbers=True,
+                 # show_fullscreen_button=True,
+                 # show_copy_button=True,
+                 # column_widths=["60px","200px","250px","80px","70px","70px","150px","200px"],
+             )
+             submit_review = gr.Button("✅ Submit Review to Agent", variant="primary")
+
+             # Paper viewer — triggered by clicking any row (replaces Topic # + button)
+             gr.Markdown("---")
+             gr.Markdown("**📄 Papers in selected topic** *(click any row above)*")
+             paper_list = gr.Textbox(
+                 label="Papers in selected topic",
+                 lines=8, interactive=False,
+             )
+     # ── end B16: Review table tab ──────────────────────────────
+
+
+         # ── B17a: Charts tab ───────────────────────────────────
+         # PURPOSE:    Display BERTopic visualization charts (intertopic
+         #             distance map, bar chart, hierarchy, heatmap).
+         #             Charts are loaded as Plotly figure objects from
+         #             JSON files and rendered natively in gr.Plot.
+         # COMPONENTS: gr.Dropdown (selector), gr.Plot (display)
+         # EVENTS:     chart_selector.change → B7 (_load_chart)
+         # REPLACES:   Old iframe + srcdoc hack that used html.escape()
+         #             to embed HTML files. Now uses gr.Plot directly.
+         # ───────────────────────────────────────────────────────
+         with gr.Tab("📊 Charts"):
+             chart_selector = gr.Dropdown(
+                 choices=[], label="Select Chart", interactive=True,
+             )
+             chart_display = gr.Plot(label="BERTopic Visualization")
+         # ── end B17a: Charts tab ───────────────────────────────
+
+
+         # ── B17b: Download tab ─────────────────────────────────
+         # PURPOSE:    Multi-file download for all pipeline outputs.
+         #             Shows file descriptions by phase and a gr.File
+         #             component with all generated files.
+         # COMPONENTS: gr.Markdown (descriptions), gr.File (download)
+         # UPDATED BY: B18, B10, B19 — refreshed after each action
+         # ───────────────────────────────────────────────────────
+         with gr.Tab("📥 Download"):
+             gr.Markdown(
+                 "**Files by Phase (per run: abstract / title):**\n\n"
+                 "**Phase 2 — Discovery:** `summaries.json` · `emb.npy`\n\n"
+                 "**Phase 2 — Labeling:** `labels.json`\n\n"
+                 "**Phase 2 — Charts:** `intertopic.json` · `bars.json` · "
+                 "`hierarchy.json` · `heatmap.json`\n\n"
+                 "**Phase 3 — Themes:** `themes.json`\n\n"
+                 "**Phase 5.5 — Taxonomy:** `taxonomy_map.json`\n\n"
+                 "**Phase 6 — Report:** `comparison.csv` · `narrative.txt`"
+             )
+             download = gr.File(label="All output files", file_count="multiple")
+         # ── end B17b: Download tab ─────────────────────────────
+
+
+     # ╔══════════════════════════════════════════════════════════╗
+     # ║                 SECTION 4 — EVENT WIRING                 ║
+     # ║  Connect UI components to helper functions. This is      ║
+     # ║  where data flows are defined: which function runs when  ║
+     # ║  a button is clicked, a file is uploaded, or a row is    ║
+     # ║  selected. No HTML, no CSS — just Python event binding.  ║
+     # ╚══════════════════════════════════════════════════════════╝
+
+
+     # ── B18: respond_with_viz() + event bindings ───────────────
+     # PURPOSE: Wrapper around B6 (respond) that also refreshes
+     #          the chart dropdown, chart display, review table,
+     #          and progress pipeline after each agent response.
+     #          This is the main "after every chat turn, update
+     #          everything" orchestrator.
+     # CALLS: B6 (respond), B5 (_build_progress), B7 (_load_chart,
+     #        _get_chart_choices), B8 (_load_review_table)
+     # BINDINGS: msg.submit → respond_with_viz
+     #           send.click → respond_with_viz
+     #           chart_selector.change → _load_chart
+     #           review_table.select → _show_papers_by_select
+     #           submit_review.click → _submit_review
+     # OUTPUTS (respond_with_viz): chatbot, msg, download,
+     #          chart_selector, chart_display, review_table,
+     #          phase_progress (7 components updated)
+     # ───────────────────────────────────────────────────────────
+     chart_selector.change(_load_chart, [chart_selector], [chart_display])
+
+     review_table.select(
+         _show_papers_by_select, [review_table], [paper_list],
+     )
+
+     submit_review.click(
+         _submit_review, [review_table, chatbot],
+         [chatbot, download, chart_selector, chart_display,
+          review_table, phase_progress],
+     )
+
+     def respond_with_viz(message, chat_history, uploaded_file):
+         """Wrap respond() and update charts + table + progress after each turn."""
+         # Assumes respond() yields exactly twice: a progress update first,
+         # then the final reply. Each yield is a (history, text, files) tuple.
+         gen = respond(message, chat_history, uploaded_file)
+
+         # First yield (progress bubble)
+         hist, txt, dl = next(gen)
+         yield (hist, txt, dl, gr.update(choices=_get_chart_choices()),
+                gr.update(), gr.update(), _build_progress())
+
+         # Second yield (final response + populate table + charts)
+         hist, txt, dl = next(gen)
+         choices = _get_chart_choices()
+         first_chart = (choices and _load_chart(choices[-1])) or gr.update()
+         table_data = _load_review_table()
+         yield (
+             hist, txt, dl,
+             gr.update(choices=choices, value=(choices and choices[-1]) or None),
+             first_chart,
+             gr.update(value=table_data),
+             _build_progress(),
+         )
+
+     msg.submit(
+         respond_with_viz, [msg, chatbot, upload],
+         [chatbot, msg, download, chart_selector, chart_display,
+          review_table, phase_progress],
+     )
+     send.click(
+         respond_with_viz, [msg, chatbot, upload],
+         [chatbot, msg, download, chart_selector, chart_display,
+          review_table, phase_progress],
+     )
+     # ── end B18: respond_with_viz + event bindings ─────────────
+
+
+     # ── B19: _auto_load_csv() ──────────────────────────────────
+     # PURPOSE: Automatically triggers analysis when a CSV file is
+     #          uploaded. The researcher doesn't need to type anything;
+     #          uploading the file starts the pipeline.
+     #          Sends "Analyze my Scopus CSV" as the initial message.
+     # TRIGGERED BY: upload.change event
+     # CALLS: B6 (respond) with auto-message, B5 (_build_progress),
+     #        B7 (_load_chart, _get_chart_choices), B8 (_load_review_table)
+     # OUTPUTS: chatbot, download, chart_selector, chart_display,
+     #          review_table, phase_progress
+     # ───────────────────────────────────────────────────────────
+     def _auto_load_csv(uploaded_file, chat_history):
+         """Auto-trigger analysis when a CSV is uploaded; no typing needed."""
+         gen = respond("Analyze my Scopus CSV", chat_history, uploaded_file)
+
+         # First yield (progress)
+         hist, txt, dl = next(gen)
+         yield (hist, dl, gr.update(), gr.update(),
+                gr.update(), _build_progress())
+
+         # Second yield (final + populate everything)
+         hist, txt, dl = next(gen)
+         choices = _get_chart_choices()
+         first_chart = (choices and _load_chart(choices[-1])) or gr.update()
+         table_data = _load_review_table()
+         yield (
+             hist, dl,
+             gr.update(choices=choices, value=(choices and choices[-1]) or None),
+             first_chart,
+             gr.update(value=table_data),
+             _build_progress(),
+         )
+
+     upload.change(
+         _auto_load_csv, [upload, chatbot],
+         [chatbot, download, chart_selector, chart_display,
+          review_table, phase_progress],
+     )
+     # ── end B19: _auto_load_csv ────────────────────────────────
+
+
+ # ╔══════════════════════════════════════════════════════════════╗
+ # ║                      SECTION 5 — LAUNCH                      ║
+ # ║  Start the Gradio server. On HuggingFace Spaces this runs    ║
+ # ║  automatically. Locally, access at http://localhost:7860     ║
+ # ╚══════════════════════════════════════════════════════════════╝
+
+
+ # ── B20: Launch ────────────────────────────────────────────────
+ # PURPOSE: Start the web server. In Gradio 6.0, theme/css/footer
+ #          params moved here from gr.Blocks().
+ # CONFIG: theme — from B3 (Soft + DM Sans + slate)
+ #         footer_links=[] — hides footer natively (no CSS hack)
+ #         ssr_mode=False — for HuggingFace Spaces free tier compat
+ #         server_name="0.0.0.0" — accessible on network
+ # NOTE: On Spaces, port 7860 is auto-exposed to the internet.
+ #       Locally, open http://localhost:7860 in your browser.
+ # ────────────────────────────────────────────────────────────────
+ print(">>> Launching...")
+ demo.launch(
+     server_name="0.0.0.0",
+     server_port=7860,
+     ssr_mode=False,
+     theme=theme,  # Gradio 6: moved from gr.Blocks()
+     footer_links=[],  # Gradio 6: hides footer, replaces show_api
+ )
+ # ── end B20: Launch ────────────────────────────────────────────