Commit c44ee53 · Parent: d3c6788

Initial summarizer app

- Streamlit app for text and PDF summarization
- Based on classifier app structure
- Supports free models and bring-your-own-key
- Generates methodology report PDFs

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Files changed:

- README.md (+45, -8)
- app.py (+679, -0)
- example_data.csv (+21, -0)
- logo.png (+0, -0)
- requirements.txt (+12, -0)
README.md CHANGED

```diff
@@ -1,14 +1,51 @@
 ---
-title: Survey Summarizer
-emoji:
-colorFrom:
-colorTo:
-sdk:
-sdk_version:
+title: CatLLM - Survey Response Summarizer
+emoji: 🐱
+colorFrom: yellow
+colorTo: yellow
+sdk: streamlit
+sdk_version: "1.32.0"
 app_file: app.py
 pinned: false
 license: gpl-3.0
-short_description:
+short_description: Summarize survey responses and PDFs using LLMs
 ---
 
-
+# CatLLM - Survey Response Summarizer
+
+A web interface for the [catllm](https://github.com/chrissoria/cat-llm) Python package. Summarize survey responses and PDF documents using various LLM providers.
+
+## How to Use
+
+1. **Upload Your Data**: Upload a CSV, Excel file, or PDF documents
+2. **Select Column** (for text): Choose the column containing the text responses to summarize
+3. **Add Context**: Describe your data and optionally add focus/instructions
+4. **Choose a Model**: Select your preferred LLM (free models available!)
+5. **Click Summarize**: View and download results with generated summaries
+
+## Features
+
+- **Text Summarization**: Summarize survey responses, feedback, or any text data
+- **PDF Summarization**: Extract and summarize content from PDF documents
+- **Customizable**: Add focus areas, max length limits, and custom instructions
+- **Methodology Report**: Download a PDF report documenting your summarization process
+
+## Supported Models
+
+| Provider | Models |
+|----------|--------|
+| **OpenAI** | gpt-4.1, gpt-4o, gpt-4o-mini |
+| **Anthropic** | claude-sonnet-4.5, claude-opus-4, claude-3.5-haiku |
+| **Google** | gemini-2.5-pro, gemini-2.5-flash |
+| **Mistral** | mistral-large-latest |
+| **Free Models** | Qwen3 235B, DeepSeek V3.1, Llama 3.3 70B |
+
+## Privacy
+
+Your API key is **never stored**. It is only used for the current summarization request and is not logged or saved.
+
+## Related
+
+- [CatLLM Survey Classifier](https://huggingface.co/spaces/CatLLM/survey-classifier) - Classify survey responses into categories
+- [catllm on PyPI](https://pypi.org/project/cat-llm/)
+- [GitHub Repository](https://github.com/chrissoria/cat-llm)
```
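Steps 1 and 2 of the README's workflow boil down to loading a table and turning one column into a list of texts for `catllm.summarize`. A minimal sketch (the column name and sample rows here are illustrative, not taken from the bundled example dataset):

```python
import io
import pandas as pd

# Stand-in for an uploaded CSV (step 1); in the app this comes from st.file_uploader.
csv_text = "response\nThe course was engaging\nToo much homework\n"
df = pd.read_csv(io.StringIO(csv_text))

# Step 2: the selected column becomes the list of texts handed to the summarizer.
texts = df["response"].tolist()
print(texts)
```

The same list is what the app ultimately passes as `input_data` in the code below.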
app.py ADDED

@@ -0,0 +1,679 @@

````python
"""
Streamlit app - CatLLM Survey Response Summarizer
Based on the classifier app but focused on text/PDF summarization
"""

import streamlit as st
import pandas as pd
import tempfile
import os
import time
import sys
from datetime import datetime

# Import catllm
try:
    import catllm
    CATLLM_AVAILABLE = True
except ImportError as e:
    print(f"Warning: Could not import catllm: {e}")
    CATLLM_AVAILABLE = False

MAX_FILE_SIZE_MB = 100

def count_pdf_pages(pdf_path):
    """Count the number of pages in a PDF file."""
    try:
        import fitz  # PyMuPDF
        doc = fitz.open(pdf_path)
        page_count = len(doc)
        doc.close()
        return page_count
    except Exception:
        return 1  # Default to 1 if can't read


# Free models - display name -> actual API model name
FREE_MODELS_MAP = {
    "Qwen3 235B": "Qwen/Qwen3-VL-235B-A22B-Instruct:novita",
    "DeepSeek V3.1": "deepseek-ai/DeepSeek-V3.1:novita",
    "Llama 3.3 70B": "meta-llama/Llama-3.3-70B-Instruct:groq",
    "Gemini 2.5 Flash": "gemini-2.5-flash",
    "GPT-4o Mini": "gpt-4o-mini",
    "Mistral Medium": "mistral-medium-2505",
    "Claude 3 Haiku": "claude-3-haiku-20240307",
    "Grok 4 Fast": "grok-4-fast-non-reasoning",
}
FREE_MODEL_DISPLAY_NAMES = list(FREE_MODELS_MAP.keys())

# Paid models (user provides their own API key)
PAID_MODEL_CHOICES = [
    "gpt-4.1",
    "gpt-4o",
    "gpt-4o-mini",
    "claude-sonnet-4-5-20250929",
    "claude-opus-4-20250514",
    "claude-3-5-haiku-20241022",
    "gemini-2.5-pro",
    "gemini-2.5-flash",
    "mistral-large-latest",
]

# Models routed through HuggingFace
HF_ROUTED_MODELS = [
    "Qwen/Qwen3-VL-235B-A22B-Instruct:novita",
    "deepseek-ai/DeepSeek-V3.1:novita",
    "meta-llama/Llama-3.3-70B-Instruct:groq",
]


def is_free_model(model, model_tier):
    """Check if using free tier (Space pays for API)."""
    return model_tier == "Free Models"


def get_model_source(model):
    """Auto-detect model source."""
    model_lower = model.lower()
    if "gpt" in model_lower:
        return "openai"
    elif "claude" in model_lower:
        return "anthropic"
    elif "gemini" in model_lower:
        return "google"
    elif "mistral" in model_lower and ":novita" not in model_lower:
        return "mistral"
    elif any(x in model_lower for x in [":novita", ":groq", "qwen", "llama", "deepseek"]):
        return "huggingface"
    elif "sonar" in model_lower:
        return "perplexity"
    elif "grok" in model_lower:
        return "xai"
    return "huggingface"


def get_api_key(model, model_tier, api_key_input):
    """Get the appropriate API key based on model and tier."""
    if is_free_model(model, model_tier):
        if model in HF_ROUTED_MODELS:
            return os.environ.get("HF_API_KEY", ""), "HuggingFace"
        elif "gpt" in model.lower():
            return os.environ.get("OPENAI_API_KEY", ""), "OpenAI"
        elif "gemini" in model.lower():
            return os.environ.get("GOOGLE_API_KEY", ""), "Google"
        elif "mistral" in model.lower():
            return os.environ.get("MISTRAL_API_KEY", ""), "Mistral"
        elif "claude" in model.lower():
            return os.environ.get("ANTHROPIC_API_KEY", ""), "Anthropic"
        elif "sonar" in model.lower():
            return os.environ.get("PERPLEXITY_API_KEY", ""), "Perplexity"
        elif "grok" in model.lower():
            return os.environ.get("XAI_API_KEY", ""), "xAI"
        else:
            return os.environ.get("HF_API_KEY", ""), "HuggingFace"
    else:
        if api_key_input and api_key_input.strip():
            return api_key_input.strip(), "User"
        return "", "User"


def generate_summarize_code(input_type, description, model, model_source, focus=None, max_length=None, instructions=None, mode=None):
    """Generate Python code for summarization."""
    focus_param = f',\n    focus="{focus}"' if focus else ''
    length_param = f',\n    max_length={max_length}' if max_length else ''
    instructions_param = f',\n    instructions="{instructions}"' if instructions else ''

    if input_type == "text":
        return f'''import catllm
import pandas as pd

# Load your data
df = pd.read_csv("your_data.csv")

# Summarize the text column
result = catllm.summarize(
    input_data=df["your_column"].tolist(),
    api_key="YOUR_API_KEY",
    description="{description}",
    user_model="{model}",
    model_source="{model_source}"{focus_param}{length_param}{instructions_param}
)

# View results
print(result)
result.to_csv("summarized_results.csv", index=False)
'''
    else:  # pdf
        mode_param = f',\n    mode="{mode}"' if mode else ''
        return f'''import catllm

# Summarize PDF documents
result = catllm.summarize(
    input_data="path/to/your/pdfs/",
    api_key="YOUR_API_KEY",
    description="{description}",
    user_model="{model}",
    model_source="{model_source}"{mode_param}{focus_param}{length_param}{instructions_param}
)

# View results
print(result)
result.to_csv("summarized_results.csv", index=False)
'''


def generate_methodology_report_pdf(model, column_name, num_rows, model_source, filename, success_rate,
                                    result_df=None, processing_time=None,
                                    catllm_version=None, python_version=None,
                                    input_type="text", description=None, focus=None, max_length=None):
    """Generate a PDF methodology report for summarization."""
    from reportlab.lib.pagesizes import letter
    from reportlab.lib import colors
    from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
    from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle, PageBreak

    pdf_file = tempfile.NamedTemporaryFile(mode='wb', suffix='_methodology_report.pdf', delete=False)
    doc = SimpleDocTemplate(pdf_file.name, pagesize=letter)
    styles = getSampleStyleSheet()

    title_style = ParagraphStyle('Title', parent=styles['Heading1'], fontSize=18, spaceAfter=20)
    heading_style = ParagraphStyle('Heading', parent=styles['Heading2'], fontSize=14, spaceAfter=10, spaceBefore=15)
    normal_style = styles['Normal']

    story = []

    report_title = "CatLLM Summarization Report"
    story.append(Paragraph(report_title, title_style))
    story.append(Paragraph(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}", normal_style))
    story.append(Spacer(1, 15))

    story.append(Paragraph("About This Report", heading_style))
    about_text = """This methodology report documents the automated summarization process. \
CatLLM uses LLMs to generate concise summaries of text or PDF documents, providing \
consistent and reproducible results."""
    story.append(Paragraph(about_text, normal_style))
    story.append(Spacer(1, 15))

    # Summary section
    story.append(Paragraph("Summarization Summary", heading_style))
    story.append(Spacer(1, 10))

    summary_data = [
        ["Source File", filename],
        ["Source Column/Type", column_name],
        ["Model Used", model],
        ["Model Source", model_source],
        ["Items Summarized", str(num_rows)],
        ["Success Rate", f"{success_rate:.2f}%"],
    ]
    if focus:
        summary_data.append(["Focus", focus])
    if max_length:
        summary_data.append(["Max Length", f"{max_length} words"])

    summary_table = Table(summary_data, colWidths=[150, 300])
    summary_table.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (0, -1), colors.lightgrey),
        ('GRID', (0, 0), (-1, -1), 1, colors.black),
        ('PADDING', (0, 0), (-1, -1), 6),
        ('FONTSIZE', (0, 0), (-1, -1), 9),
    ]))
    story.append(summary_table)
    story.append(Spacer(1, 15))

    if processing_time is not None:
        story.append(Paragraph("Processing Time", heading_style))
        rows_per_min = (num_rows / processing_time) * 60 if processing_time > 0 else 0
        avg_time = processing_time / num_rows if num_rows > 0 else 0

        time_data = [
            ["Total Processing Time", f"{processing_time:.1f} seconds"],
            ["Average Time per Item", f"{avg_time:.2f} seconds"],
            ["Processing Rate", f"{rows_per_min:.1f} items/minute"],
        ]
        time_table = Table(time_data, colWidths=[180, 270])
        time_table.setStyle(TableStyle([
            ('BACKGROUND', (0, 0), (0, -1), colors.lightgrey),
            ('GRID', (0, 0), (-1, -1), 1, colors.black),
            ('PADDING', (0, 0), (-1, -1), 6),
            ('FONTSIZE', (0, 0), (-1, -1), 9),
        ]))
        story.append(time_table)

    story.append(Spacer(1, 15))
    story.append(Paragraph("Version Information", heading_style))
    version_data = [
        ["CatLLM Version", catllm_version or "unknown"],
        ["Python Version", python_version or "unknown"],
        ["Timestamp", datetime.now().strftime('%Y-%m-%d %H:%M:%S')],
    ]
    version_table = Table(version_data, colWidths=[180, 270])
    version_table.setStyle(TableStyle([
        ('BACKGROUND', (0, 0), (0, -1), colors.lightgrey),
        ('GRID', (0, 0), (-1, -1), 1, colors.black),
        ('PADDING', (0, 0), (-1, -1), 6),
        ('FONTSIZE', (0, 0), (-1, -1), 9),
    ]))
    story.append(version_table)

    story.append(Spacer(1, 30))
    story.append(Paragraph("Citation", heading_style))
    story.append(Paragraph("If you use CatLLM in your research, please cite:", normal_style))
    story.append(Spacer(1, 5))
    story.append(Paragraph("Soria, C. (2025). CatLLM: A Python package for LLM-based text classification. DOI: 10.5281/zenodo.15532316", normal_style))

    doc.build(story)
    return pdf_file.name


# Page config
st.set_page_config(
    page_title="CatLLM - Research Data Summarizer",
    page_icon="🐱",
    layout="wide"
)

# Initialize session state
if 'results' not in st.session_state:
    st.session_state.results = None
if 'survey_data' not in st.session_state:
    st.session_state.survey_data = None
if 'pdf_data' not in st.session_state:
    st.session_state.pdf_data = None

# Logo and title
col_logo, col_title = st.columns([1, 6])
with col_logo:
    st.image("logo.png", width=100)
with col_title:
    st.title("CatLLM - Research Data Summarizer")
    st.markdown("Generate concise summaries of survey responses and PDF documents using LLMs.")

# About section
with st.expander("About This App"):
    st.markdown("""
    **Privacy Notice:** Your data is sent to third-party LLM APIs for summarization. Do not upload sensitive, confidential, or personally identifiable information (PII).

    ---

    **CatLLM** is an open-source Python package for processing text and document data using Large Language Models.

    ### What It Does
    - **Summarize Text**: Generate concise summaries of survey responses or text data
    - **Summarize PDFs**: Extract key information from PDF documents page-by-page
    - **Focus Summaries**: Guide the model to focus on specific aspects of your data

    ### Beta Test - We Want Your Feedback!
    This app is currently in **beta** and **free to use** while CatLLM is under review for publication, made possible by **Bashir Ahmed's generous fellowship support**.

    - Found a bug? Have a feature request? Please open an issue on [GitHub](https://github.com/chrissoria/cat-llm)
    - Reach out directly: [chrissoria@berkeley.edu](mailto:chrissoria@berkeley.edu)

    ### Links
    - **PyPI**: [pip install cat-llm](https://pypi.org/project/cat-llm/)
    - **GitHub**: [github.com/chrissoria/cat-llm](https://github.com/chrissoria/cat-llm)
    - **Classifier App**: [CatLLM Survey Classifier](https://huggingface.co/spaces/CatLLM/survey-classifier)

    ### Citation
    If you use CatLLM in your research, please cite:
    ```
    Soria, C. (2025). CatLLM: A Python package for LLM-based text classification. DOI: 10.5281/zenodo.15532316
    ```
    """)

# Main layout
col_input, col_output = st.columns([1, 1])

with col_input:
    # Input type selector
    input_type_choice = st.radio(
        "Input Type",
        options=["Survey Responses", "PDF Documents"],
        horizontal=True,
        key="input_type_radio"
    )

    # Initialize variables
    input_data = None
    input_type_selected = "text"
    description = ""
    original_filename = "data"
    pdf_mode = "Image (visual documents)"

    if input_type_choice == "Survey Responses":
        input_type_selected = "text"

        uploaded_file = st.file_uploader(
            "Upload Data (CSV or Excel)",
            type=['csv', 'xlsx', 'xls'],
            key="survey_file"
        )

        if st.button("Try Example Dataset", key="example_btn"):
            st.session_state.example_loaded = True

        columns = []
        df = None
        if uploaded_file is not None:
            try:
                if uploaded_file.name.endswith('.csv'):
                    df = pd.read_csv(uploaded_file)
                else:
                    df = pd.read_excel(uploaded_file)
                columns = df.columns.tolist()
                st.success(f"Loaded {len(df):,} rows")
            except Exception as e:
                st.error(f"Error loading file: {e}")
        elif hasattr(st.session_state, 'example_loaded') and st.session_state.example_loaded:
            try:
                df = pd.read_csv("example_data.csv")
                columns = df.columns.tolist()
                st.success(f"Loaded example dataset ({len(df)} rows)")
            except Exception:
                pass

        selected_column = st.selectbox(
            "Column to Summarize",
            options=columns if columns else ["Upload a file first"],
            disabled=not columns,
            key="survey_column"
        )

        description = selected_column if columns else ""
        original_filename = uploaded_file.name if uploaded_file else "example_data.csv"

        if df is not None and columns and selected_column in columns:
            input_data = df[selected_column].tolist()

    else:  # PDF Documents
        input_type_selected = "pdf"

        pdf_files = st.file_uploader(
            "Upload PDF Document(s)",
            type=['pdf'],
            accept_multiple_files=True,
            key="pdf_files"
        )

        pdf_description = st.text_input(
            "Document Description",
            placeholder="e.g., 'research papers', 'interview transcripts'",
            help="Helps the LLM understand context",
            key="pdf_desc"
        )

        pdf_mode = st.radio(
            "Processing Mode",
            options=["Image (visual documents)", "Text (text-heavy)", "Both (comprehensive)"],
            key="pdf_mode"
        )

        if pdf_files:
            input_data = []
            pdf_name_map = {}  # Map temp paths to original filenames
            for f in pdf_files:
                with tempfile.NamedTemporaryFile(delete=False, suffix='.pdf') as tmp:
                    tmp.write(f.read())
                    input_data.append(tmp.name)
                    pdf_name_map[tmp.name] = f.name.replace('.pdf', '')
            st.session_state.pdf_name_map = pdf_name_map
            description = pdf_description or "document"
            original_filename = "pdf_files"
            st.success(f"Uploaded {len(pdf_files)} PDF file(s)")

    st.markdown("---")

    # Summarization options
    st.markdown("### Summarization Options")

    focus = st.text_input(
        "Focus (optional)",
        placeholder="e.g., 'main arguments', 'emotional content', 'key findings'",
        help="Guide the model to focus on specific aspects"
    )

    max_length = st.number_input(
        "Maximum Summary Length (words, optional)",
        min_value=0,
        max_value=1000,
        value=0,
        help="Leave at 0 for no limit"
    )
    max_length = max_length if max_length > 0 else None

    instructions = st.text_input(
        "Additional Instructions (optional)",
        placeholder="e.g., 'use bullet points', 'include quotes'",
        help="Custom instructions for the summarization"
    )

    st.markdown("---")

    # Model selection
    st.markdown("### Model Selection")
    model_tier = st.radio(
        "Model Tier",
        options=["Free Models", "Bring Your Own Key"],
        key="model_tier"
    )

    if model_tier == "Free Models":
        model_display = st.selectbox("Model", options=FREE_MODEL_DISPLAY_NAMES, key="model")
        model = FREE_MODELS_MAP[model_display]
        api_key = ""
    else:
        model = st.selectbox("Model", options=PAID_MODEL_CHOICES, key="model_paid")
        api_key = st.text_input("API Key", type="password", key="api_key")

    # Summarize button
    if st.button("Summarize Data", type="primary", use_container_width=True):
        if input_data is None:
            st.error("Please upload data first")
        else:
            mode = None
            if input_type_selected == "pdf":
                mode_mapping = {
                    "Image (visual documents)": "image",
                    "Text (text-heavy)": "text",
                    "Both (comprehensive)": "both"
                }
                mode = mode_mapping.get(pdf_mode, "image")

            actual_api_key, provider = get_api_key(model, model_tier, api_key)
            if not actual_api_key:
                st.error(f"{provider} API key not configured")
            else:
                model_source = get_model_source(model)
                items_list = input_data if isinstance(input_data, list) else [input_data]

                # Calculate estimated time
                num_items = len(items_list)
                if input_type_selected == "pdf":
                    total_pages = sum(count_pdf_pages(p) for p in items_list)
                    est_seconds = total_pages * 5
                else:
                    est_seconds = max(10, num_items * 2)

                est_time_str = f"{est_seconds:.0f}s" if est_seconds < 60 else f"{est_seconds/60:.1f}m"

                # Progress UI
                progress_bar = st.progress(0)
                status_text = st.empty()
                start_time = time.time()

                def progress_callback(current_idx, total, label=None):
                    progress = current_idx / total if total > 0 else 0
                    progress_bar.progress(min(progress, 1.0))

                    elapsed = time.time() - start_time
                    if current_idx > 0:
                        avg_time = elapsed / current_idx
                        eta_seconds = avg_time * (total - current_idx)
                        eta_str = f" | ETA: {eta_seconds:.0f}s" if eta_seconds < 60 else f" | ETA: {eta_seconds/60:.1f}m"
                    else:
                        eta_str = ""

                    label_str = f" ({label})" if label else ""
                    status_text.text(f"Processing item {current_idx+1} of {total}{label_str} ({progress*100:.0f}%){eta_str}")

                try:
                    # Build kwargs for summarize
                    summarize_kwargs = {
                        "input_data": items_list,
                        "api_key": actual_api_key,
                        "description": description,
                        "user_model": model,
                        "model_source": model_source,
                        "progress_callback": progress_callback,
                    }
                    if mode:
                        summarize_kwargs["mode"] = mode
                    if focus and focus.strip():
                        summarize_kwargs["focus"] = focus.strip()
                    if max_length:
                        summarize_kwargs["max_length"] = max_length
                    if instructions and instructions.strip():
                        summarize_kwargs["instructions"] = instructions.strip()

                    result_df = catllm.summarize(**summarize_kwargs)

                    processing_time = time.time() - start_time
                    total_items = len(result_df)
                    progress_bar.progress(1.0)
                    status_text.text(f"Completed {total_items} items in {processing_time:.1f}s")

                    # Replace temp paths with original filenames for PDF input
                    if input_type_selected == "pdf" and 'pdf_path' in result_df.columns:
                        pdf_name_map = st.session_state.get('pdf_name_map', {})
                        def replace_temp_path(val):
                            if pd.isna(val):
                                return val
                            val_str = str(val)
                            for temp_path, orig_name in pdf_name_map.items():
                                if temp_path in val_str:
                                    return val_str.replace(temp_path, orig_name + '.pdf')
                            return val_str
                        result_df['pdf_path'] = result_df['pdf_path'].apply(replace_temp_path)

                    # Save CSV
                    with tempfile.NamedTemporaryFile(mode='w', suffix='_summarized.csv', delete=False) as f:
                        result_df.to_csv(f.name, index=False)
                        csv_path = f.name

                    # Calculate success rate
                    if 'processing_status' in result_df.columns:
                        success_count = (result_df['processing_status'] == 'success').sum()
                        success_rate = (success_count / len(result_df)) * 100
                    else:
                        success_rate = 100.0

                    # Get version info
                    try:
                        catllm_version = catllm.__version__
                    except AttributeError:
                        catllm_version = "unknown"
                    python_version = sys.version.split()[0]

                    # Generate methodology report
                    pdf_path = generate_methodology_report_pdf(
                        model=model,
                        column_name=description,
                        num_rows=total_items,
                        model_source=model_source,
                        filename=original_filename,
                        success_rate=success_rate,
                        result_df=result_df,
                        processing_time=processing_time,
                        catllm_version=catllm_version,
                        python_version=python_version,
                        input_type=input_type_selected,
                        description=description,
                        focus=focus if focus else None,
                        max_length=max_length
                    )

                    # Generate code
                    code = generate_summarize_code(
                        input_type_selected, description, model, model_source,
                        focus=focus if focus else None,
                        max_length=max_length,
                        instructions=instructions if instructions else None,
                        mode=mode
                    )

                    st.session_state.results = {
                        'df': result_df,
                        'csv_path': csv_path,
                        'pdf_path': pdf_path,
                        'code': code,
                        'status': f"Summarized {total_items} items in {processing_time:.1f}s",
                    }
                    st.success(f"Summarized {total_items} items in {processing_time:.1f}s")
                    st.rerun()

                except Exception as e:
                    st.error(f"Error: {str(e)}")

with col_output:
    st.markdown("### Results")

    if st.session_state.results:
        results = st.session_state.results

        # Placeholder for future chart
        st.info("Summary visualization coming soon!")

        # Results dataframe
        display_df = results['df'].copy()
        cols_to_hide = ['model_response', 'json', 'raw_response', 'raw_json']
        display_df = display_df.drop(columns=[c for c in cols_to_hide if c in display_df.columns])
        st.dataframe(display_df, use_container_width=True)

        # Downloads
        col_dl1, col_dl2 = st.columns(2)
        with col_dl1:
            with open(results['csv_path'], 'rb') as f:
                st.download_button(
````
+
"Download Results (CSV)",
|
| 638 |
+
data=f,
|
| 639 |
+
file_name="summarized_results.csv",
|
| 640 |
+
mime="text/csv"
|
| 641 |
+
)
|
| 642 |
+
with col_dl2:
|
| 643 |
+
with open(results['pdf_path'], 'rb') as f:
|
| 644 |
+
st.download_button(
|
| 645 |
+
"Download Methodology Report (PDF)",
|
| 646 |
+
data=f,
|
| 647 |
+
file_name="methodology_report.pdf",
|
| 648 |
+
mime="application/pdf"
|
| 649 |
+
)
|
| 650 |
+
|
| 651 |
+
# Code
|
| 652 |
+
with st.expander("See the Code"):
|
| 653 |
+
st.code(results['code'], language='python')
|
| 654 |
+
else:
|
| 655 |
+
st.info("Upload data and click 'Summarize Data' to see results here.")
|
| 656 |
+
|
| 657 |
+
# Bottom buttons
|
| 658 |
+
col_reset, col_code = st.columns(2)
|
| 659 |
+
with col_reset:
|
| 660 |
+
if st.button("Reset", type="secondary", use_container_width=True):
|
| 661 |
+
st.session_state.results = None
|
| 662 |
+
if hasattr(st.session_state, 'example_loaded'):
|
| 663 |
+
del st.session_state.example_loaded
|
| 664 |
+
st.rerun()
|
| 665 |
+
|
| 666 |
+
with col_code:
|
| 667 |
+
if st.session_state.results:
|
| 668 |
+
if st.button("See in Code", use_container_width=True):
|
| 669 |
+
st.session_state.show_code_modal = True
|
| 670 |
+
|
| 671 |
+
# Code modal/dialog
|
| 672 |
+
if st.session_state.get('show_code_modal') and st.session_state.results:
|
| 673 |
+
st.markdown("---")
|
| 674 |
+
st.markdown("### Reproducibility Code")
|
| 675 |
+
st.markdown("Use this code to reproduce the summarization with the CatLLM Python package:")
|
| 676 |
+
st.code(st.session_state.results['code'], language='python')
|
| 677 |
+
if st.button("Close"):
|
| 678 |
+
st.session_state.show_code_modal = False
|
| 679 |
+
st.rerun()
|
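The temp-path cleanup in the diff above can be exercised on its own. The mapping and filenames below are illustrative stand-ins for what the app keeps in `st.session_state`, and plain `None` stands in for the `pd.isna()` check so the sketch has no pandas dependency.

```python
# Standalone sketch of the temp-path cleanup shown in the diff above.
# pdf_name_map is illustrative; in the app it comes from st.session_state.
pdf_name_map = {"/tmp/tmpabc123.pdf": "survey_report"}

def replace_temp_path(val, name_map):
    """Swap a temporary upload path back to the original filename."""
    if val is None:  # the app tests pd.isna(); None suffices here
        return val
    val_str = str(val)
    for temp_path, orig_name in name_map.items():
        if temp_path in val_str:
            return val_str.replace(temp_path, orig_name + ".pdf")
    return val_str

print(replace_temp_path("/tmp/tmpabc123.pdf", pdf_name_map))  # survey_report.pdf
```

In the app the same logic is applied over the `pdf_path` column with `Series.apply`, so download files show the user's original names rather than tempfile paths.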
example_data.csv
ADDED
@@ -0,0 +1,21 @@
Response
The weather was just too hot where I lived
I could no longer afford the rent in my old neighborhood
My company offered me a promotion but it required relocating
I wanted to be closer to my aging parents
The schools in my area were not good enough for my kids
I got accepted into graduate school across the country
My apartment had a terrible mold problem that the landlord refused to fix
I went through a divorce and needed a fresh start
The crime rate in my neighborhood kept getting worse
I was tired of the long commute to work every day
I fell in love with someone who lived in another city
The cost of living was much lower in the new area
I needed a bigger house because we were expecting twins
My doctor recommended a drier climate for my health
I got laid off and found a new job in a different state
I wanted to live somewhere with better outdoor recreation
My lease ended and my landlord decided to sell the building
I retired and wanted to move somewhere warmer
The noise from the construction next door was unbearable
I always dreamed of living near the ocean
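The example data is a single-column CSV with a `Response` header, which the app reads via pandas. A quick shape check needs only the standard library; the two rows below are copied from the file and inlined so the sketch runs without it on disk.

```python
import csv
import io

# Two rows copied from example_data.csv, inlined to keep the sketch self-contained.
sample = (
    "Response\n"
    "The weather was just too hot where I lived\n"
    "I went through a divorce and needed a fresh start\n"
)
rows = list(csv.DictReader(io.StringIO(sample)))
print(len(rows))  # 2
print(rows[0]["Response"])  # The weather was just too hot where I lived
```

Reading the real file works the same way with `open("example_data.csv", newline="")` in place of the `StringIO` buffer.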
logo.png
ADDED
requirements.txt
ADDED
@@ -0,0 +1,12 @@
streamlit>=1.32.0
cat-llm[pdf]>=0.1.15
mistralai
pydantic==2.10.6
huggingface_hub<0.27.0
pandas
openpyxl
requests
regex
reportlab
matplotlib
Pillow