chrissoria Claude Opus 4.5 committed on
Commit c44ee53 · 1 Parent(s): d3c6788

Initial summarizer app

- Streamlit app for text and PDF summarization
- Based on classifier app structure
- Supports free models and bring-your-own-key
- Generates methodology report PDFs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Files changed (5)
  1. README.md +45 -8
  2. app.py +679 -0
  3. example_data.csv +21 -0
  4. logo.png +0 -0
  5. requirements.txt +12 -0
README.md CHANGED
@@ -1,14 +1,51 @@
  ---
- title: Survey Summarizer
- emoji: 🏃
- colorFrom: indigo
- colorTo: blue
- sdk: gradio
- sdk_version: 6.3.0
+ title: CatLLM - Survey Response Summarizer
+ emoji: 🐱
+ colorFrom: yellow
+ colorTo: yellow
+ sdk: streamlit
+ sdk_version: "1.32.0"
  app_file: app.py
  pinned: false
  license: gpl-3.0
- short_description: Research-grade summarization of survey responses, PDFs, and
+ short_description: Summarize survey responses and PDFs using LLMs
  ---
  
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # CatLLM - Survey Response Summarizer
+
+ A web interface for the [catllm](https://github.com/chrissoria/cat-llm) Python package. Summarize survey responses and PDF documents using various LLM providers.
+
+ ## How to Use
+
+ 1. **Upload Your Data**: Upload a CSV, Excel file, or PDF documents
+ 2. **Select Column** (for text): Choose the column containing the text responses to summarize
+ 3. **Add Context**: Describe your data and optionally add focus/instructions
+ 4. **Choose a Model**: Select your preferred LLM (free models available!)
+ 5. **Click Summarize**: View and download results with generated summaries
+
+ ## Features
+
+ - **Text Summarization**: Summarize survey responses, feedback, or any text data
+ - **PDF Summarization**: Extract and summarize content from PDF documents
+ - **Customizable**: Add focus areas, max length limits, and custom instructions
+ - **Methodology Report**: Download a PDF report documenting your summarization process
+
+ ## Supported Models
+
+ | Provider | Models |
+ |----------|--------|
+ | **OpenAI** | gpt-4.1, gpt-4o, gpt-4o-mini |
+ | **Anthropic** | claude-sonnet-4.5, claude-opus-4, claude-3.5-haiku |
+ | **Google** | gemini-2.5-pro, gemini-2.5-flash |
+ | **Mistral** | mistral-large-latest |
+ | **Free Models** | Qwen3 235B, DeepSeek V3.1, Llama 3.3 70B |
+
+ ## Privacy
+
+ Your API key is **never stored**. It is only used for the current summarization request and is not logged or saved.
+
+ ## Related
+
+ - [CatLLM Survey Classifier](https://huggingface.co/spaces/CatLLM/survey-classifier) - Classify survey responses into categories
+ - [catllm on PyPI](https://pypi.org/project/cat-llm/)
+ - [GitHub Repository](https://github.com/chrissoria/cat-llm)
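The Supported Models table and the Privacy note in this README map onto two helpers in app.py, `get_model_source` and `get_api_key`, which pick a provider from the model name and then choose an API key. The following is a minimal, table-driven sketch of that idea, not the app's actual code. The names `PROVIDER_RULES`, `resolve_provider`, and `resolve_api_key` are hypothetical; the environment variable names are the ones read in app.py. As a simplification, a user-supplied key wins when present, whereas the app decides by tier.

```python
import os

# Hypothetical rule table: (marker in model name, provider, env var with the key).
# Order matters: the first matching marker wins, so router suffixes come first.
PROVIDER_RULES = [
    (":novita", "huggingface", "HF_API_KEY"),
    (":groq", "huggingface", "HF_API_KEY"),
    ("gpt", "openai", "OPENAI_API_KEY"),
    ("claude", "anthropic", "ANTHROPIC_API_KEY"),
    ("gemini", "google", "GOOGLE_API_KEY"),
    ("mistral", "mistral", "MISTRAL_API_KEY"),
    ("grok", "xai", "XAI_API_KEY"),
]
DEFAULT_RULE = ("huggingface", "HF_API_KEY")


def resolve_provider(model):
    """Return (provider, env_var) for a model name; fall back to HuggingFace."""
    name = model.lower()
    for marker, provider, env_var in PROVIDER_RULES:
        if marker in name:
            return provider, env_var
    return DEFAULT_RULE


def resolve_api_key(model, user_key="", env=None):
    """Return (provider, key). A non-blank user key takes precedence over the environment."""
    env = os.environ if env is None else env
    provider, env_var = resolve_provider(model)
    key = user_key.strip() or env.get(env_var, "")
    return provider, key


if __name__ == "__main__":
    for model in ["gpt-4o-mini", "meta-llama/Llama-3.3-70B-Instruct:groq", "grok-4-fast-non-reasoning"]:
        print(model, "->", resolve_provider(model)[0])
```

Keeping the rules in one ordered list means the provider and the key source cannot drift apart, which can happen when the two decisions live in separate `if`/`elif` chains.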
app.py ADDED
@@ -0,0 +1,679 @@
+ """
+ Streamlit app - CatLLM Survey Response Summarizer
+ Based on the classifier app but focused on text/PDF summarization
+ """
+
+ import streamlit as st
+ import pandas as pd
+ import tempfile
+ import os
+ import time
+ import sys
+ from datetime import datetime
+
+ # Import catllm
+ try:
+     import catllm
+     CATLLM_AVAILABLE = True
+ except ImportError as e:
+     print(f"Warning: Could not import catllm: {e}")
+     CATLLM_AVAILABLE = False
+
+ MAX_FILE_SIZE_MB = 100
+
+ def count_pdf_pages(pdf_path):
+     """Count the number of pages in a PDF file."""
+     try:
+         import fitz  # PyMuPDF
+         doc = fitz.open(pdf_path)
+         page_count = len(doc)
+         doc.close()
+         return page_count
+     except Exception:
+         return 1  # Default to 1 if can't read
+
+
+ # Free models - display name -> actual API model name
+ FREE_MODELS_MAP = {
+     "Qwen3 235B": "Qwen/Qwen3-VL-235B-A22B-Instruct:novita",
+     "DeepSeek V3.1": "deepseek-ai/DeepSeek-V3.1:novita",
+     "Llama 3.3 70B": "meta-llama/Llama-3.3-70B-Instruct:groq",
+     "Gemini 2.5 Flash": "gemini-2.5-flash",
+     "GPT-4o Mini": "gpt-4o-mini",
+     "Mistral Medium": "mistral-medium-2505",
+     "Claude 3 Haiku": "claude-3-haiku-20240307",
+     "Grok 4 Fast": "grok-4-fast-non-reasoning",
+ }
+ FREE_MODEL_DISPLAY_NAMES = list(FREE_MODELS_MAP.keys())
+
+ # Paid models (user provides their own API key)
+ PAID_MODEL_CHOICES = [
+     "gpt-4.1",
+     "gpt-4o",
+     "gpt-4o-mini",
+     "claude-sonnet-4-5-20250929",
+     "claude-opus-4-20250514",
+     "claude-3-5-haiku-20241022",
+     "gemini-2.5-pro",
+     "gemini-2.5-flash",
+     "mistral-large-latest",
+ ]
+
+ # Models routed through HuggingFace
+ HF_ROUTED_MODELS = [
+     "Qwen/Qwen3-VL-235B-A22B-Instruct:novita",
+     "deepseek-ai/DeepSeek-V3.1:novita",
+     "meta-llama/Llama-3.3-70B-Instruct:groq",
+ ]
+
+
+ def is_free_model(model, model_tier):
+     """Check if using free tier (Space pays for API)."""
+     return model_tier == "Free Models"
+
+
+ def get_model_source(model):
+     """Auto-detect model source."""
+     model_lower = model.lower()
+     if "gpt" in model_lower:
+         return "openai"
+     elif "claude" in model_lower:
+         return "anthropic"
+     elif "gemini" in model_lower:
+         return "google"
+     elif "mistral" in model_lower and ":novita" not in model_lower:
+         return "mistral"
+     elif any(x in model_lower for x in [":novita", ":groq", "qwen", "llama", "deepseek"]):
+         return "huggingface"
+     elif "sonar" in model_lower:
+         return "perplexity"
+     elif "grok" in model_lower:
+         return "xai"
+     return "huggingface"
+
+
+ def get_api_key(model, model_tier, api_key_input):
+     """Get the appropriate API key based on model and tier."""
+     if is_free_model(model, model_tier):
+         if model in HF_ROUTED_MODELS:
+             return os.environ.get("HF_API_KEY", ""), "HuggingFace"
+         elif "gpt" in model.lower():
+             return os.environ.get("OPENAI_API_KEY", ""), "OpenAI"
+         elif "gemini" in model.lower():
+             return os.environ.get("GOOGLE_API_KEY", ""), "Google"
+         elif "mistral" in model.lower():
+             return os.environ.get("MISTRAL_API_KEY", ""), "Mistral"
+         elif "claude" in model.lower():
+             return os.environ.get("ANTHROPIC_API_KEY", ""), "Anthropic"
+         elif "sonar" in model.lower():
+             return os.environ.get("PERPLEXITY_API_KEY", ""), "Perplexity"
+         elif "grok" in model.lower():
+             return os.environ.get("XAI_API_KEY", ""), "xAI"
+         else:
+             return os.environ.get("HF_API_KEY", ""), "HuggingFace"
+     else:
+         if api_key_input and api_key_input.strip():
+             return api_key_input.strip(), "User"
+         return "", "User"
+
+
+ def generate_summarize_code(input_type, description, model, model_source, focus=None, max_length=None, instructions=None, mode=None):
+     """Generate Python code for summarization."""
+     focus_param = f',\n    focus="{focus}"' if focus else ''
+     length_param = f',\n    max_length={max_length}' if max_length else ''
+     instructions_param = f',\n    instructions="{instructions}"' if instructions else ''
+
+     if input_type == "text":
+         return f'''import catllm
+ import pandas as pd
+
+ # Load your data
+ df = pd.read_csv("your_data.csv")
+
+ # Summarize the text column
+ result = catllm.summarize(
+     input_data=df["your_column"].tolist(),
+     api_key="YOUR_API_KEY",
+     description="{description}",
+     user_model="{model}",
+     model_source="{model_source}"{focus_param}{length_param}{instructions_param}
+ )
+
+ # View results
+ print(result)
+ result.to_csv("summarized_results.csv", index=False)
+ '''
+     else:  # pdf
+         mode_param = f',\n    mode="{mode}"' if mode else ''
+         return f'''import catllm
+
+ # Summarize PDF documents
+ result = catllm.summarize(
+     input_data="path/to/your/pdfs/",
+     api_key="YOUR_API_KEY",
+     description="{description}",
+     user_model="{model}",
+     model_source="{model_source}"{mode_param}{focus_param}{length_param}{instructions_param}
+ )
+
+ # View results
+ print(result)
+ result.to_csv("summarized_results.csv", index=False)
+ '''
+
+
+ def generate_methodology_report_pdf(model, column_name, num_rows, model_source, filename, success_rate,
+                                     result_df=None, processing_time=None,
+                                     catllm_version=None, python_version=None,
+                                     input_type="text", description=None, focus=None, max_length=None):
+     """Generate a PDF methodology report for summarization."""
+     from reportlab.lib.pagesizes import letter
+     from reportlab.lib import colors
+     from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
+     from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle, PageBreak
+
+     pdf_file = tempfile.NamedTemporaryFile(mode='wb', suffix='_methodology_report.pdf', delete=False)
+     doc = SimpleDocTemplate(pdf_file.name, pagesize=letter)
+     styles = getSampleStyleSheet()
+
+     title_style = ParagraphStyle('Title', parent=styles['Heading1'], fontSize=18, spaceAfter=20)
+     heading_style = ParagraphStyle('Heading', parent=styles['Heading2'], fontSize=14, spaceAfter=10, spaceBefore=15)
+     normal_style = styles['Normal']
+
+     story = []
+
+     report_title = "CatLLM Summarization Report"
+     story.append(Paragraph(report_title, title_style))
+     story.append(Paragraph(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}", normal_style))
+     story.append(Spacer(1, 15))
+
+     story.append(Paragraph("About This Report", heading_style))
+     about_text = """This methodology report documents the automated summarization process. \
+ CatLLM uses LLMs to generate concise summaries of text or PDF documents, providing \
+ consistent and reproducible results."""
+     story.append(Paragraph(about_text, normal_style))
+     story.append(Spacer(1, 15))
+
+     # Summary section
+     story.append(Paragraph("Summarization Summary", heading_style))
+     story.append(Spacer(1, 10))
+
+     summary_data = [
+         ["Source File", filename],
+         ["Source Column/Type", column_name],
+         ["Model Used", model],
+         ["Model Source", model_source],
+         ["Items Summarized", str(num_rows)],
+         ["Success Rate", f"{success_rate:.2f}%"],
+     ]
+     if focus:
+         summary_data.append(["Focus", focus])
+     if max_length:
+         summary_data.append(["Max Length", f"{max_length} words"])
+
+     summary_table = Table(summary_data, colWidths=[150, 300])
+     summary_table.setStyle(TableStyle([
+         ('BACKGROUND', (0, 0), (0, -1), colors.lightgrey),
+         ('GRID', (0, 0), (-1, -1), 1, colors.black),
+         ('PADDING', (0, 0), (-1, -1), 6),
+         ('FONTSIZE', (0, 0), (-1, -1), 9),
+     ]))
+     story.append(summary_table)
+     story.append(Spacer(1, 15))
+
+     if processing_time is not None:
+         story.append(Paragraph("Processing Time", heading_style))
+         rows_per_min = (num_rows / processing_time) * 60 if processing_time > 0 else 0
+         avg_time = processing_time / num_rows if num_rows > 0 else 0
+
+         time_data = [
+             ["Total Processing Time", f"{processing_time:.1f} seconds"],
+             ["Average Time per Item", f"{avg_time:.2f} seconds"],
+             ["Processing Rate", f"{rows_per_min:.1f} items/minute"],
+         ]
+         time_table = Table(time_data, colWidths=[180, 270])
+         time_table.setStyle(TableStyle([
+             ('BACKGROUND', (0, 0), (0, -1), colors.lightgrey),
+             ('GRID', (0, 0), (-1, -1), 1, colors.black),
+             ('PADDING', (0, 0), (-1, -1), 6),
+             ('FONTSIZE', (0, 0), (-1, -1), 9),
+         ]))
+         story.append(time_table)
+
+     story.append(Spacer(1, 15))
+     story.append(Paragraph("Version Information", heading_style))
+     version_data = [
+         ["CatLLM Version", catllm_version or "unknown"],
+         ["Python Version", python_version or "unknown"],
+         ["Timestamp", datetime.now().strftime('%Y-%m-%d %H:%M:%S')],
+     ]
+     version_table = Table(version_data, colWidths=[180, 270])
+     version_table.setStyle(TableStyle([
+         ('BACKGROUND', (0, 0), (0, -1), colors.lightgrey),
+         ('GRID', (0, 0), (-1, -1), 1, colors.black),
+         ('PADDING', (0, 0), (-1, -1), 6),
+         ('FONTSIZE', (0, 0), (-1, -1), 9),
+     ]))
+     story.append(version_table)
+
+     story.append(Spacer(1, 30))
+     story.append(Paragraph("Citation", heading_style))
+     story.append(Paragraph("If you use CatLLM in your research, please cite:", normal_style))
+     story.append(Spacer(1, 5))
+     story.append(Paragraph("Soria, C. (2025). CatLLM: A Python package for LLM-based text classification. DOI: 10.5281/zenodo.15532316", normal_style))
+
+     doc.build(story)
+     return pdf_file.name
+
+
+ # Page config
+ st.set_page_config(
+     page_title="CatLLM - Research Data Summarizer",
+     page_icon="🐱",
+     layout="wide"
+ )
+
+ # Initialize session state
+ if 'results' not in st.session_state:
+     st.session_state.results = None
+ if 'survey_data' not in st.session_state:
+     st.session_state.survey_data = None
+ if 'pdf_data' not in st.session_state:
+     st.session_state.pdf_data = None
+
+ # Logo and title
+ col_logo, col_title = st.columns([1, 6])
+ with col_logo:
+     st.image("logo.png", width=100)
+ with col_title:
+     st.title("CatLLM - Research Data Summarizer")
+     st.markdown("Generate concise summaries of survey responses and PDF documents using LLMs.")
+
+ # About section
+ with st.expander("About This App"):
+     st.markdown("""
+     **Privacy Notice:** Your data is sent to third-party LLM APIs for summarization. Do not upload sensitive, confidential, or personally identifiable information (PII).
+
+     ---
+
+     **CatLLM** is an open-source Python package for processing text and document data using Large Language Models.
+
+     ### What It Does
+     - **Summarize Text**: Generate concise summaries of survey responses or text data
+     - **Summarize PDFs**: Extract key information from PDF documents page-by-page
+     - **Focus Summaries**: Guide the model to focus on specific aspects of your data
+
+     ### Beta Test - We Want Your Feedback!
+     This app is currently in **beta** and **free to use** while CatLLM is under review for publication, made possible by **Bashir Ahmed's generous fellowship support**.
+
+     - Found a bug? Have a feature request? Please open an issue on [GitHub](https://github.com/chrissoria/cat-llm)
+     - Reach out directly: [chrissoria@berkeley.edu](mailto:chrissoria@berkeley.edu)
+
+     ### Links
+     - **PyPI**: [pip install cat-llm](https://pypi.org/project/cat-llm/)
+     - **GitHub**: [github.com/chrissoria/cat-llm](https://github.com/chrissoria/cat-llm)
+     - **Classifier App**: [CatLLM Survey Classifier](https://huggingface.co/spaces/CatLLM/survey-classifier)
+
+     ### Citation
+     If you use CatLLM in your research, please cite:
+     ```
+     Soria, C. (2025). CatLLM: A Python package for LLM-based text classification. DOI: 10.5281/zenodo.15532316
+     ```
+     """)
+
+ # Main layout
+ col_input, col_output = st.columns([1, 1])
+
+ with col_input:
+     # Input type selector
+     input_type_choice = st.radio(
+         "Input Type",
+         options=["Survey Responses", "PDF Documents"],
+         horizontal=True,
+         key="input_type_radio"
+     )
+
+     # Initialize variables
+     input_data = None
+     input_type_selected = "text"
+     description = ""
+     original_filename = "data"
+     pdf_mode = "Image (visual documents)"
+
+     if input_type_choice == "Survey Responses":
+         input_type_selected = "text"
+
+         uploaded_file = st.file_uploader(
+             "Upload Data (CSV or Excel)",
+             type=['csv', 'xlsx', 'xls'],
+             key="survey_file"
+         )
+
+         if st.button("Try Example Dataset", key="example_btn"):
+             st.session_state.example_loaded = True
+
+         columns = []
+         df = None
+         if uploaded_file is not None:
+             try:
+                 if uploaded_file.name.endswith('.csv'):
+                     df = pd.read_csv(uploaded_file)
+                 else:
+                     df = pd.read_excel(uploaded_file)
+                 columns = df.columns.tolist()
+                 st.success(f"Loaded {len(df):,} rows")
+             except Exception as e:
+                 st.error(f"Error loading file: {e}")
+         elif hasattr(st.session_state, 'example_loaded') and st.session_state.example_loaded:
+             try:
+                 df = pd.read_csv("example_data.csv")
+                 columns = df.columns.tolist()
+                 st.success(f"Loaded example dataset ({len(df)} rows)")
+             except Exception as e:
+                 st.error(f"Error loading example dataset: {e}")
+
+         selected_column = st.selectbox(
+             "Column to Summarize",
+             options=columns if columns else ["Upload a file first"],
+             disabled=not columns,
+             key="survey_column"
+         )
+
+         description = selected_column if columns else ""
+         original_filename = uploaded_file.name if uploaded_file else "example_data.csv"
+
+         if df is not None and columns and selected_column in columns:
+             input_data = df[selected_column].tolist()
+
+     else:  # PDF Documents
+         input_type_selected = "pdf"
+
+         pdf_files = st.file_uploader(
+             "Upload PDF Document(s)",
+             type=['pdf'],
+             accept_multiple_files=True,
+             key="pdf_files"
+         )
+
+         pdf_description = st.text_input(
+             "Document Description",
+             placeholder="e.g., 'research papers', 'interview transcripts'",
+             help="Helps the LLM understand context",
+             key="pdf_desc"
+         )
+
+         pdf_mode = st.radio(
+             "Processing Mode",
+             options=["Image (visual documents)", "Text (text-heavy)", "Both (comprehensive)"],
+             key="pdf_mode"
+         )
+
+         if pdf_files:
+             input_data = []
+             pdf_name_map = {}  # Map temp paths to original filenames
+             for f in pdf_files:
+                 with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp:
+                     tmp.write(f.read())
+                     input_data.append(tmp.name)
+                     pdf_name_map[tmp.name] = f.name.replace('.pdf', '')
+             st.session_state.pdf_name_map = pdf_name_map
+             description = pdf_description or "document"
+             original_filename = "pdf_files"
+             st.success(f"Uploaded {len(pdf_files)} PDF file(s)")
+
+     st.markdown("---")
+
+     # Summarization options
+     st.markdown("### Summarization Options")
+
+     focus = st.text_input(
+         "Focus (optional)",
+         placeholder="e.g., 'main arguments', 'emotional content', 'key findings'",
+         help="Guide the model to focus on specific aspects"
+     )
+
+     max_length = st.number_input(
+         "Maximum Summary Length (words, optional)",
+         min_value=0,
+         max_value=1000,
+         value=0,
+         help="Leave at 0 for no limit"
+     )
+     max_length = max_length if max_length > 0 else None
+
+     instructions = st.text_input(
+         "Additional Instructions (optional)",
+         placeholder="e.g., 'use bullet points', 'include quotes'",
+         help="Custom instructions for the summarization"
+     )
+
+     st.markdown("---")
+
+     # Model selection
+     st.markdown("### Model Selection")
+     model_tier = st.radio(
+         "Model Tier",
+         options=["Free Models", "Bring Your Own Key"],
+         key="model_tier"
+     )
+
+     if model_tier == "Free Models":
+         model_display = st.selectbox("Model", options=FREE_MODEL_DISPLAY_NAMES, key="model")
+         model = FREE_MODELS_MAP[model_display]
+         api_key = ""
+     else:
+         model = st.selectbox("Model", options=PAID_MODEL_CHOICES, key="model_paid")
+         api_key = st.text_input("API Key", type="password", key="api_key")
+
+     # Summarize button
+     if st.button("Summarize Data", type="primary", use_container_width=True):
+         if input_data is None:
+             st.error("Please upload data first")
+         else:
+             mode = None
+             if input_type_selected == "pdf":
+                 mode_mapping = {
+                     "Image (visual documents)": "image",
+                     "Text (text-heavy)": "text",
+                     "Both (comprehensive)": "both"
+                 }
+                 mode = mode_mapping.get(pdf_mode, "image")
+
+             actual_api_key, provider = get_api_key(model, model_tier, api_key)
+             if not actual_api_key:
+                 st.error(f"{provider} API key not configured")
+             else:
+                 model_source = get_model_source(model)
+                 items_list = input_data if isinstance(input_data, list) else [input_data]
+
+                 # Calculate estimated time
+                 num_items = len(items_list)
+                 if input_type_selected == "pdf":
+                     total_pages = sum(count_pdf_pages(p) for p in items_list)
+                     est_seconds = total_pages * 5
+                 else:
+                     est_seconds = max(10, num_items * 2)
+
+                 est_time_str = f"{est_seconds:.0f}s" if est_seconds < 60 else f"{est_seconds/60:.1f}m"
+
+                 # Progress UI
+                 progress_bar = st.progress(0)
+                 status_text = st.empty()
+                 start_time = time.time()
+
+                 def progress_callback(current_idx, total, label=None):
+                     progress = current_idx / total if total > 0 else 0
+                     progress_bar.progress(min(progress, 1.0))
+
+                     elapsed = time.time() - start_time
+                     if current_idx > 0:
+                         avg_time = elapsed / current_idx
+                         eta_seconds = avg_time * (total - current_idx)
+                         eta_str = f" | ETA: {eta_seconds:.0f}s" if eta_seconds < 60 else f" | ETA: {eta_seconds/60:.1f}m"
+                     else:
+                         eta_str = ""
+
+                     label_str = f" ({label})" if label else ""
+                     status_text.text(f"Processing item {current_idx+1} of {total}{label_str} ({progress*100:.0f}%){eta_str}")
+
+                 try:
+                     # Build kwargs for summarize
+                     summarize_kwargs = {
+                         "input_data": items_list,
+                         "api_key": actual_api_key,
+                         "description": description,
+                         "user_model": model,
+                         "model_source": model_source,
+                         "progress_callback": progress_callback,
+                     }
+                     if mode:
+                         summarize_kwargs["mode"] = mode
+                     if focus and focus.strip():
+                         summarize_kwargs["focus"] = focus.strip()
+                     if max_length:
+                         summarize_kwargs["max_length"] = max_length
+                     if instructions and instructions.strip():
+                         summarize_kwargs["instructions"] = instructions.strip()
+
+                     result_df = catllm.summarize(**summarize_kwargs)
+
+                     processing_time = time.time() - start_time
+                     total_items = len(result_df)
+                     progress_bar.progress(1.0)
+                     status_text.text(f"Completed {total_items} items in {processing_time:.1f}s")
+
+                     # Replace temp paths with original filenames for PDF input
+                     if input_type_selected == "pdf" and 'pdf_path' in result_df.columns:
+                         pdf_name_map = st.session_state.get('pdf_name_map', {})
+                         def replace_temp_path(val):
+                             if pd.isna(val):
+                                 return val
+                             val_str = str(val)
+                             for temp_path, orig_name in pdf_name_map.items():
+                                 if temp_path in val_str:
+                                     return val_str.replace(temp_path, orig_name + '.pdf')
+                             return val_str
+                         result_df['pdf_path'] = result_df['pdf_path'].apply(replace_temp_path)
+
+                     # Save CSV
+                     with tempfile.NamedTemporaryFile(mode='w', suffix='_summarized.csv', delete=False) as f:
+                         result_df.to_csv(f.name, index=False)
+                         csv_path = f.name
+
+                     # Calculate success rate
+                     if 'processing_status' in result_df.columns:
+                         success_count = (result_df['processing_status'] == 'success').sum()
+                         success_rate = (success_count / len(result_df)) * 100
+                     else:
+                         success_rate = 100.0
+
+                     # Get version info
+                     try:
+                         catllm_version = catllm.__version__
+                     except AttributeError:
+                         catllm_version = "unknown"
+                     python_version = sys.version.split()[0]
+
+                     # Generate methodology report
+                     pdf_path = generate_methodology_report_pdf(
+                         model=model,
+                         column_name=description,
+                         num_rows=total_items,
+                         model_source=model_source,
+                         filename=original_filename,
+                         success_rate=success_rate,
+                         result_df=result_df,
+                         processing_time=processing_time,
+                         catllm_version=catllm_version,
+                         python_version=python_version,
+                         input_type=input_type_selected,
+                         description=description,
+                         focus=focus if focus else None,
+                         max_length=max_length
+                     )
+
+                     # Generate code
+                     code = generate_summarize_code(
+                         input_type_selected, description, model, model_source,
+                         focus=focus if focus else None,
+                         max_length=max_length,
+                         instructions=instructions if instructions else None,
+                         mode=mode
+                     )
+
+                     st.session_state.results = {
+                         'df': result_df,
+                         'csv_path': csv_path,
+                         'pdf_path': pdf_path,
+                         'code': code,
+                         'status': f"Summarized {total_items} items in {processing_time:.1f}s",
+                     }
+                     st.success(f"Summarized {total_items} items in {processing_time:.1f}s")
+                     st.rerun()
+
+                 except Exception as e:
+                     st.error(f"Error: {str(e)}")
+
+ with col_output:
+     st.markdown("### Results")
+
+     if st.session_state.results:
+         results = st.session_state.results
+
+         # Placeholder for future chart
+         st.info("Summary visualization coming soon!")
+
+         # Results dataframe
+         display_df = results['df'].copy()
+         cols_to_hide = ['model_response', 'json', 'raw_response', 'raw_json']
+         display_df = display_df.drop(columns=[c for c in cols_to_hide if c in display_df.columns])
+         st.dataframe(display_df, use_container_width=True)
+
+         # Downloads
+         col_dl1, col_dl2 = st.columns(2)
+         with col_dl1:
+             with open(results['csv_path'], 'rb') as f:
+                 st.download_button(
+                     "Download Results (CSV)",
+                     data=f,
+                     file_name="summarized_results.csv",
+                     mime="text/csv"
+                 )
+         with col_dl2:
+             with open(results['pdf_path'], 'rb') as f:
+                 st.download_button(
+                     "Download Methodology Report (PDF)",
+                     data=f,
+                     file_name="methodology_report.pdf",
+                     mime="application/pdf"
+                 )
+
+         # Code
+         with st.expander("See the Code"):
+             st.code(results['code'], language='python')
+     else:
+         st.info("Upload data and click 'Summarize Data' to see results here.")
+
+ # Bottom buttons
+ col_reset, col_code = st.columns(2)
+ with col_reset:
+     if st.button("Reset", type="secondary", use_container_width=True):
+         st.session_state.results = None
+         if hasattr(st.session_state, 'example_loaded'):
+             del st.session_state.example_loaded
+         st.rerun()
+
+ with col_code:
+     if st.session_state.results:
+         if st.button("See in Code", use_container_width=True):
+             st.session_state.show_code_modal = True
+
+ # Code modal/dialog
+ if st.session_state.get('show_code_modal') and st.session_state.results:
+     st.markdown("---")
+     st.markdown("### Reproducibility Code")
+     st.markdown("Use this code to reproduce the summarization with the CatLLM Python package:")
+     st.code(st.session_state.results['code'], language='python')
+     if st.button("Close"):
+         st.session_state.show_code_modal = False
+         st.rerun()
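A note on `generate_summarize_code` in app.py: it places the description, focus, and instructions directly between double quotes in the generated snippet, so free text containing a quote character may yield reproducibility code that does not parse. The sketch below shows one possible way to harden this by formatting each value with `repr()`. It is an illustration under stated assumptions, not the app's implementation: `render_summarize_call` is a hypothetical helper, and the keyword names simply mirror those passed to `catllm.summarize` elsewhere in this diff.

```python
def render_summarize_call(description, model, model_source, **optional):
    """Build a catllm.summarize(...) snippet with every value escaped via repr().

    Hypothetical helper for illustration; keyword names mirror app.py.
    """
    args = {
        "description": description,
        "user_model": model,
        "model_source": model_source,
    }
    # Keep only optional arguments that were actually provided.
    args.update({k: v for k, v in optional.items() if v not in (None, "")})

    lines = [
        '    input_data=df["your_column"].tolist(),',
        '    api_key="YOUR_API_KEY",',
    ]
    # {value!r} emits a valid Python literal, including any needed escaping.
    lines += [f"    {name}={value!r}," for name, value in args.items()]
    body = "\n".join(lines)
    return f"result = catllm.summarize(\n{body}\n)\n"


if __name__ == "__main__":
    snippet = render_summarize_call(
        'reasons for moving ("why did you move?")',
        "gpt-4o-mini",
        "openai",
        focus="landlord's behaviour",
        max_length=50,
    )
    # compile() only checks syntax; nothing is imported or executed.
    compile(snippet, "<generated>", "exec")
    print(snippet)
```

Because `repr()` picks the quoting style and escapes as needed, the same helper handles strings, integers, and `mode` values without a separate template per parameter.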
example_data.csv ADDED
@@ -0,0 +1,21 @@
+ Response
+ The weather was just too hot where I lived
+ I could no longer afford the rent in my old neighborhood
+ My company offered me a promotion but it required relocating
+ I wanted to be closer to my aging parents
+ The schools in my area were not good enough for my kids
+ I got accepted into graduate school across the country
+ My apartment had a terrible mold problem that the landlord refused to fix
+ I went through a divorce and needed a fresh start
+ The crime rate in my neighborhood kept getting worse
+ I was tired of the long commute to work every day
+ I fell in love with someone who lived in another city
+ The cost of living was much lower in the new area
+ I needed a bigger house because we were expecting twins
+ My doctor recommended a drier climate for my health
+ I got laid off and found a new job in a different state
+ I wanted to live somewhere with better outdoor recreation
+ My lease ended and my landlord decided to sell the building
+ I retired and wanted to move somewhere warmer
+ The noise from the construction next door was unbearable
+ I always dreamed of living near the ocean
logo.png ADDED
requirements.txt ADDED
@@ -0,0 +1,12 @@
+ streamlit>=1.32.0
+ cat-llm[pdf]>=0.1.15
+ mistralai
+ pydantic==2.10.6
+ huggingface_hub<0.27.0
+ pandas
+ openpyxl
+ requests
+ regex
+ reportlab
+ matplotlib
+ Pillow