Spaces:

Binxk
/

goon


Binx Claude Sonnet 4.6 committed on Apr 15

Commit

da605e9

0 Parent(s):

Initial commit: analysis app, deployment config, UI improvements

Files changed (15)

.gitignore +12 -0
.streamlit/config.toml +10 -0
CLAUDE.md +37 -0
Dockerfile +19 -0
README.md +32 -0
REPLICATION.md +395 -0
agent/.env.example +4 -0
agent/.streamlit/config.toml +10 -0
agent/CLAUDE.md +372 -0
agent/README.md +73 -0
agent/analysis.py +1480 -0
agent/rebuild_parquets.py +198 -0
agent/requirements.txt +9 -0
app.py +1059 -0
requirements.txt +10 -0

.gitignore ADDED Viewed

	@@ -0,0 +1,12 @@

+.venv/
+.cache/
+.tmp-home/
+.env
+**/.env
+*.pyc
+__pycache__/
+data/*.parquet
+data/*.csv
+data/*.feather
+agent/outputs/
+agent/logs/

.streamlit/config.toml ADDED Viewed

	@@ -0,0 +1,10 @@

+[theme]
+base = "light"
+backgroundColor = "#ffffff"
+secondaryBackgroundColor = "#f5f5f5"
+textColor = "#000000"
+primaryColor = "#000000"
+font = "sans serif"
+[server]
+headless = true

CLAUDE.md ADDED Viewed

	@@ -0,0 +1,37 @@

+# CLAUDE.md — Project Conventions
+## File Operations
+- When asked to edit a file, use Glob/Grep to locate it rather than asking the user for the path.
+- Prefer targeted reads (offset + limit) over reading entire large files.
+- Before creating a new file, check whether a suitable one already exists.
+- Do not re-read a file after editing it to verify — edits would have errored if they failed.
+## Editing Conventions
+- When editing R/Quarto files, check for stale object/column name references after renames (e.g., `title` -> new name) across the entire file, not just the edit site.
+- Make the minimal change needed — do not refactor or clean up surrounding code unless asked.
+- Do not add comments, docstrings, or type annotations to code you did not change.
+- Avoid backwards-compatibility shims for removed code; delete unused code outright.
+## R Conventions
+- This project uses data.table, not data.frame/dplyr. Always use data.table syntax (e.g., `DT[, .(col)]`, `:=`) and verify column access patterns work on data.table objects before finalizing edits.
+- Use `set*` functions (`setnames`, `setorder`, `setkey`) for in-place mutations to avoid copies.
+- Prefer `fread`/`fwrite` over `read.csv`/`write.csv`.
+- Do not load tidyverse or dplyr unless explicitly requested.
+## Performance / Memory
+- Large datasets hit R memory limits — prefer chunking, avoid unnecessary copies, and add defensive guards for list vs data.table structures before `rbindlist`.
+- Never use `rbind` in a loop; accumulate results in a list and call `rbindlist` once.
+- Avoid `apply`-family functions on large data.tables; use vectorized data.table operations instead.
+- Check `object.size()` / `lobstr::obj_size()` when debugging memory issues rather than guessing.
+## Token Efficiency
+- Do not summarize what you just did at the end of a response — the diff is visible.
+- Do not repeat back the user's request before answering.
+- Skip pleasantries and filler phrases ("Great question!", "Certainly!", etc.).
+- When reading code to answer a question, read only the relevant section, not the whole file.
+- Use `files_with_matches` output mode for Grep unless line content is needed.

Dockerfile ADDED Viewed

	@@ -0,0 +1,19 @@

+FROM python:3.11-slim
+WORKDIR /app
+COPY requirements.txt .
+RUN pip install --no-cache-dir -r requirements.txt
+COPY . .
+ENV HF_DATASET_REPO=Binxk/goon-data
+ENV DATA_DIR=/app/data
+RUN mkdir -p /app/data
+EXPOSE 8501
+CMD ["sh", "-c", \
+  "python -c \"import os; from huggingface_hub import snapshot_download; snapshot_download(repo_id=os.environ['HF_DATASET_REPO'], repo_type='dataset', local_dir=os.environ['DATA_DIR'])\" && \
+  streamlit run app.py --server.port=8501 --server.address=0.0.0.0"]
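
Note on the `CMD` above: the inline `python -c` sits inside two layers of quoting, which makes it hard to read and change. A minimal sketch of the same startup step as a small module follows. The names `download_kwargs` and `fetch_dataset` are hypothetical and not part of this commit; the `repo_id`, `repo_type`, and `local_dir` arguments are the ones the Dockerfile already passes to `snapshot_download`.

```python
import os


def download_kwargs(env):
    """Build the snapshot_download arguments from an environment mapping.

    Raises KeyError when a variable is missing, as the inline command would.
    """
    return {
        "repo_id": env["HF_DATASET_REPO"],
        "repo_type": "dataset",
        "local_dir": env["DATA_DIR"],
    }


def fetch_dataset(env, download):
    """Call `download` with the arguments built from `env` and return its result.

    `download` is passed in so the logic can be exercised without network access.
    """
    return download(**download_kwargs(env))
```

In the container this would presumably be called as `fetch_dataset(os.environ, snapshot_download)` after `from huggingface_hub import snapshot_download`, leaving the `CMD` to chain that script with `streamlit run app.py`.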

README.md ADDED Viewed

	@@ -0,0 +1,32 @@

+# Subreddits in subreddits25
+- FemboyGooned
+- furrygooners
+- gaygooncave
+- GayGoonerGuys
+- GirlcockEgirlgooning
+- girlgooners
+- gonegoonergirls
+- Goon_Galaxy
+- GoonCaves
+- GOONED
+- GOONEDISBACK
+- GOONEDmeetup
+- GoonerUtopia
+- GoonetteHub
+- GoonFeet
+- GoonForAlice
+- GoonForAss
+- GoonGay
+- gooning4bbw
+- GooningTrees
+- GooningUnlimited
+- JerkBudsGoonTogether
+- NickiMinajGoonFarm
+- NSFW_CAPTION_AND_GOON
+- ShemaleGOONED
+- sissyGOONED
+- TGOONER
+- TheGoonCaveOfficial
+- TransGooned
+- transgoonergirls

REPLICATION.md ADDED Viewed

	@@ -0,0 +1,395 @@

+---
+editor_options:
+  markdown:
+    wrap: 72
+---
+# Replication Guide — Grasping 'Gooning'
+An observational analysis of Reddit gooning communities via BERTopic
+topic modelling.
+------------------------------------------------------------------------
+## Project summary
+This study empirically analyses publicly available Reddit subreddits
+dedicated to gooning (prolonged, trancelike masturbation). The analysis
+pipeline combines:
+-   **Descriptive statistics** — demographics (age, gender, sexuality
+    from user flair), word frequencies, and substance mentions across 30
+    subreddits
+-   **BERTopic topic modelling** — semantic clustering of \~2.35M
+    documents (posts and comments) to identify the major themes and
+    discourse patterns
+**Full corpus:** 30 subreddits, \~22M raw records, 2019–2025
+------------------------------------------------------------------------
+## Repository structure
+```
+Goon/
+├── REPLICATION.md                 ← this file
+├── README.md                      ← subreddit list
+├── Goon.Rproj                     ← RStudio project (sets working directory)
+├── data_prep.qmd                  ← Step 1: R — ingest CSVs, clean, export Parquet
+├── goon_analysis.qmd              ← Step 2: R — descriptive analyses
+├── goon_topic_analysis.qmd        ← Step 4: R — visualise BERTopic outputs
+├── data/
+│   ├── *.csv                      ← raw Reddit exports (one file per subreddit per year)
+│   ├── GOONED_comments.csv        ← pre-concatenated GOONED comments (all years)
+│   ├── GOONED_submissions.csv     ← pre-concatenated GOONED submissions (all years)
+│   ├── comments.parquet           ← output of data_prep.qmd (18.2M rows)
+│   ├── posts.parquet              ← output of data_prep.qmd (3.8M rows)
+│   ├── corpus_clean.parquet       ← output of topic_analysis.qmd (modelling corpus)
+│   └── corpus_deleted.parquet     ← deleted/removed rows (tracked separately)
+├── topic analysis/
+│   ├── topic_analysis.qmd         ← Step 3: Python — BERTopic pipeline
+│   ├── topic_api_labeling.qmd     ← Step 3c: Python — optional API-based topic labelling
+│   ├── build_topic_results_summary.py  ← helper: generate an HTML summary from run outputs
+│   ├── run_topic_analysis.sh      ← shell wrapper that sets env vars and calls quarto
+│   ├── README.md                  ← detailed pipeline documentation
+│   ├── MEMORY_MANAGEMENT.md       ← strategies for large-corpus runs
+│   └── runs/
+│       └── full_run_v1/           ← complete output of the full corpus run
+│           ├── bertopic/          ← saved BERTopic models + doc-topic assignments
+│           ├── topics/            ← keyword tables, summaries, evaluation, API labels
+│           └── figures/           ← generated charts
+└── .venv/                         ← Python virtual environment (created per instructions below)
+```
+------------------------------------------------------------------------
+## System requirements
+### Hardware
+| Component | Minimum | Recommended |
+|----|----|----|
+| RAM | 32 GB | 64 GB (full corpus run; see MEMORY_MANAGEMENT.md) |
+| Disk | 50 GB free | 100 GB free |
+| CPU | Any modern multi-core | 8+ cores for embedding generation |
+| GPU | Not required | Optional — speeds up embedding by \~10× |
+> The full corpus run was executed on a Linux VM with 96 GB RAM. A local
+> Mac with 24 GB will work for the pilot/sample runs but may struggle
+> with the full corpus embedding step.
+### Software
+| Tool | Version tested | Notes |
+|----|----|----|
+| R | ≥ 4.3 | For data_prep.qmd and goon_analysis.qmd |
+| RStudio / Quarto CLI | ≥ 1.4 | To render .qmd files |
+| Python | ≥ 3.10 | For topic_analysis.qmd |
+| Quarto | ≥ 1.4 | Installed with RStudio or standalone |
+------------------------------------------------------------------------
+## Step-by-step replication
+### Step 0: Clone / obtain the repository
+The `data/` folder containing raw CSVs is required. These are large
+files and are not distributed via git — they must be present locally.
+Open `Goon.Rproj` in RStudio. This sets the working directory to the
+project root so that `here::here()` paths resolve correctly.
+------------------------------------------------------------------------
+### Step 1: R data preparation (`data_prep.qmd`)
+**Purpose:** Reads all raw CSVs, combines them into unified data frames,
+applies minimal cleaning, and exports `data/comments.parquet` and
+`data/posts.parquet`.
+**R packages required:**
+``` r
+install.packages(c("dplyr", "tidyr", "tibble", "purrr",
+                   "data.table", "arrow", "here"))
+```
+**Run:**
+Open `data_prep.qmd` in RStudio and click **Render** (or run all chunks
+in order).
+Alternatively, from the terminal:
+``` bash
+cd /Users/bkot7579/Desktop/Goon
+quarto render data_prep.qmd
+```
+**Expected outputs:**
+- `data/comments.parquet` — \~18.2M rows, \~528 MB
+- `data/posts.parquet` — \~3.8M rows, \~186 MB
+**Time estimate:** 15–45 minutes depending on RAM and I/O speed (the
+GOONED CSV files alone are \~7 GB).
+------------------------------------------------------------------------
+### Step 2: R descriptive analysis (`goon_analysis.qmd`)
+**Purpose:** Demographic analysis (r/GOONEDmeetup flair), word frequency
+analysis, substance mention counts.
+**R packages required:**
+``` r
+install.packages(c("dplyr", "tidyr", "tibble", "purrr", "ggplot2",
+                   "stringr", "here", "e1071", "tidytext",
+                   "data.table", "arrow"))
+```
+**Run:**
+Open `goon_analysis.qmd` in RStudio and click **Render**.
+**Prerequisites:** `data/comments.parquet` and `data/posts.parquet` must
+exist (Step 1).
+**Expected outputs:**
+- Rendered HTML with plots embedded
+- In-memory: word count tables, substance mention counts, demographic counts
+------------------------------------------------------------------------
+### Step 3: Python BERTopic topic modelling (`topic analysis/topic_analysis.qmd`)
+**Purpose:** Embeds all documents with `all-MiniLM-L6-v2`, runs HDBSCAN
+topic modelling via BERTopic, reduces topics using c-TF-IDF
+agglomerative clustering, evaluates models, and exports all topic
+outputs.
+#### 3a. Create the Python virtual environment
+``` bash
+cd /Users/bkot7579/Desktop/Goon
+python3 -m venv .venv
+source .venv/bin/activate
+pip install bertopic umap-learn hdbscan sentence-transformers \
+            scikit-learn pandas pyarrow quarto
+```
+**Key package versions (tested):**
+| Package               | Version |
+|-----------------------|---------|
+| bertopic              | 0.16.x  |
+| umap-learn            | 0.5.x   |
+| hdbscan               | 0.8.x   |
+| sentence-transformers | 2.x     |
+| scikit-learn          | 1.x     |
+| pandas                | 2.x     |
+| pyarrow               | 14.x    |
+#### 3b. Run the pipeline
+**Pilot run (200k documents — recommended first):**
+``` bash
+cd /Users/bkot7579/Desktop/Goon
+./topic\ analysis/run_topic_analysis.sh --max-docs 200000 --run-tag pilot_200k
+```
+Outputs land in `topic analysis/runs/pilot_200k/`.
+**Full corpus run (\~2.35M cleaned documents after filtering):**
+``` bash
+cd /Users/bkot7579/Desktop/Goon
+./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v1
+```
+> **Warning:** The full run requires \~64 GB RAM for the embedding +
+> UMAP stages. See `topic analysis/MEMORY_MANAGEMENT.md` for strategies
+> if RAM is limited.
+**Pipeline stages (automatically executed in order):**
+1.  **Data ingestion** — loads `data/posts.parquet` +
+    `data/comments.parquet`
+2.  **Light cleaning** — deduplication, URL/username/subreddit
+    anonymisation, markdown stripping; deleted/removed rows saved
+    separately to `data/corpus_deleted.parquet`
+3.  **Embedding** — `all-MiniLM-L6-v2` (384-dim), batch_size=512, saved
+    as shards; skips already-generated shards on resume
+4.  **UMAP** — pre-computed once, reused across all HDBSCAN configs
+5.  **BERTopic** — 6 configurations: min_cluster_size ∈ {50, 100, 200} ×
+    method ∈ {eom, leaf}
+6.  **Topic reduction** — c-TF-IDF agglomerative clustering to 100, 50,
+    and 25 topics
+7.  **Evaluation** — NPMI coherence, topic diversity, outlier rates
+8.  **Export** — keyword tables, representative docs, summary CSVs
+**Reproducibility:** Random seed = 42 throughout. A
+`reproducibility_log.json` is written to the run folder with all
+settings and package versions.
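
Note on stages 5 and 6 above: the six HDBSCAN configurations are the cross product of three cluster sizes and two selection methods. The sketch below enumerates that grid. It assumes the guide's "method" corresponds to HDBSCAN's `cluster_selection_method` parameter; the constant and function names are illustrative and not taken from `topic_analysis.qmd`.

```python
from itertools import product

MIN_CLUSTER_SIZES = (50, 100, 200)   # stage 5
SELECTION_METHODS = ("eom", "leaf")  # stage 5
REDUCTION_TARGETS = (100, 50, 25)    # stage 6
SEED = 42                            # reproducibility note


def hdbscan_grid():
    """Return one settings dict per HDBSCAN configuration."""
    return [
        {"min_cluster_size": size, "cluster_selection_method": method}
        for size, method in product(MIN_CLUSTER_SIZES, SELECTION_METHODS)
    ]


configs = hdbscan_grid()
print(len(configs))  # 6
```

Six initial configurations plus three reduction targets would account for the nine runs mentioned in the output table, if each reduction counts as one run; that reading is an assumption.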
+#### 3c. Optional API-based topic labelling (`topic_api_labeling.qmd`)
+Sends reduced topic summaries (keywords + representative texts) to an
+LLM API to generate human-readable labels. Does NOT send raw corpus
+text.
+Set your API key, then render:
+``` bash
+export OPENAI_API_KEY="your-key-here"
+quarto render "topic analysis/topic_api_labeling.qmd"
+```
+Outputs are saved to `topic analysis/runs/<run-tag>/topics/api/`.
+------------------------------------------------------------------------
+### Step 4: R topic visualisations (`goon_topic_analysis.qmd`)
+**Purpose:** Loads the CSV/Parquet outputs from the BERTopic run and
+produces exploratory visualisations: topic size bar chart, subreddit ×
+topic heatmap, topic prevalence over time, post vs comment split,
+representative documents.
+**R packages required:**
+``` r
+install.packages(c("dplyr", "tidyr", "tibble", "purrr",
+                   "ggplot2", "stringr", "here", "arrow"))
+```
+**Run:**
+``` bash
+quarto render goon_topic_analysis.qmd
+```
+**Prerequisites:** Step 3 must have completed and outputs must exist
+under `topic analysis/runs/full_run_v1/topics/`.
+------------------------------------------------------------------------
+## Execution order summary
+```
+Step 1  →  data_prep.qmd            (R)       ~30 min
+Step 2  →  goon_analysis.qmd        (R)       ~10 min
+Step 3  →  topic_analysis.qmd       (Python)  ~6–48 hours (full corpus)
+Step 3c →  topic_api_labeling.qmd   (Python)  ~5 min + API cost (optional)
+Step 4  →  goon_topic_analysis.qmd  (R)       ~2 min
+```
+Steps 2 and 3 are independent of each other and can run in parallel.
+------------------------------------------------------------------------
+## Key modelling decisions
+| Decision | Choice | Rationale |
+|-------------------------|--------------------|---------------------------|
+| Embedding model | `all-MiniLM-L6-v2` | Fast, runs on CPU, 384-dim sufficient for topic structure |
+| UMAP n_neighbors | 15 | BERTopic default; balances local vs global structure |
+| UMAP n_components | 5 | Low enough for HDBSCAN to work well |
+| HDBSCAN min_cluster_size | 50, 100, 200 | Tested all three; mcs=100 eom selected as reference |
+| Topic reduction method | c-TF-IDF agglomerative | Merges semantically similar topics rather than splitting clusters |
+| Reduction targets | 100, 50, 25 | 50 selected for reporting (NPMI=0.27, diversity=0.74) |
+| Preprocessing | Minimal | Preserves informal language and slang; CountVectorizer handles casing/stopwords |
+| Random seed | 42 | Applied to UMAP, HDBSCAN sampling, and document cap sampling |
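
Note on the diversity figure in the table above: this file does not define how it was computed. One common definition is the share of unique words among the top-k keywords of all topics, and the sketch below assumes that definition. The function name and the toy keywords are illustrative only.

```python
def topic_diversity(topic_keywords, top_k=10):
    """Share of unique words among the top_k keywords of every topic (0 to 1)."""
    words = [word for keywords in topic_keywords for word in keywords[:top_k]]
    if not words:
        raise ValueError("no keywords supplied")
    return len(set(words)) / len(words)


# Two toy topics that share one keyword ("beta").
toy_topics = [
    ["alpha", "beta", "gamma"],
    ["beta", "delta", "epsilon"],
]
score = topic_diversity(toy_topics, top_k=3)  # 5 unique words out of 6
```

Under this definition a value near 1 means topics rarely share keywords, and a value near 0 means they overlap heavily.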
+------------------------------------------------------------------------
+## Known issue: r/GOONED missing from BERTopic results
+r/GOONED is the dominant subreddit in the corpus by a large margin:
+|                     | Count          |
+|---------------------|----------------|
+| r/GOONED posts      | 2,765,119      |
+| r/GOONED comments   | 15,493,075     |
+| **r/GOONED total**  | **18,258,194** |
+| Full cleaned corpus | 22,011,124     |
+| **r/GOONED share**  | **82.9%**      |
+Despite this, r/GOONED is **entirely absent** from
+`topic analysis/runs/full_run_v1/` outputs. The cloud VM that ran the
+BERTopic pipeline did not have the `GOONED_comments.csv` and
+`GOONED_submissions.csv` files available (most likely because their
+combined size of \~7 GB made transfer impractical), and
+`corpus_clean.parquet` on the VM was generated without them.
+**Consequence:** All topic modelling results represent 29 subreddits
+(3.75M records) rather than the full 30-subreddit corpus (22M records).
+Topic proportions, dominant themes, and subreddit distribution tables
+are therefore not representative of the full corpus.
+**To fix:** Ensure the `GOONED_*.csv` files (or the pre-built
+`comments.parquet` / `posts.parquet`) are available on the compute
+environment, then re-run:
+``` bash
+./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v2
+```
+Note: the R-side analyses (`goon_analysis.qmd`) are **not** affected by
+this issue — they read directly from the local `data/posts.parquet` and
+`data/comments.parquet` files, which do contain r/GOONED data.
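
Note on the figures in this section: the totals, the share, and the 3.75M remainder can be recomputed from the counts in the table. The short check below uses only those counts; the variable names are illustrative.

```python
gooned_posts = 2_765_119
gooned_comments = 15_493_075
full_corpus = 22_011_124

gooned_total = gooned_posts + gooned_comments
share_pct = 100 * gooned_total / full_corpus
remaining = full_corpus - gooned_total  # records from the other 29 subreddits

print(f"{gooned_total:,}")   # 18,258,194
print(f"{share_pct:.1f}%")   # 82.9%
print(f"{remaining:,}")      # 3,752,930
```

The remainder matches the 3.75M records quoted under "Consequence".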
+------------------------------------------------------------------------
+## Known limitations
+1.  **Outlier rate:** 37.3% of documents are assigned to the outlier
+    topic (-1) by HDBSCAN. This is typical for short-text social media
+    corpora. Outliers are excluded from topic analyses but are retained
+    in the corpus parquet files.
+2.  **Flair ambiguity:** Gender/sexuality classifiers rely on
+    voluntarily set flair strings. Flair adoption is uneven across
+    subreddits and users, introducing selection bias. Abbreviations like
+    `\\bt\\b` may match unintended strings (e.g. US state TX).
+3.  **Deleted content:** Posts and comments marked `[deleted]` or
+    `[removed]` are excluded from modelling but counted separately.
+    These may disproportionately represent controversial content.
+4.  **Temporal coverage:** Coverage varies by subreddit. Some
+    communities only appear in later years (2024–2025); others span the
+    full 2019–2025 window.
+5.  **Platform-specific norms:** Moderation rules, flair conventions,
+    and posting styles differ across subreddits, which may shape topics
+    in ways that are not generalisable.
+6.  **Unobserved participants:** Lurkers, banned users, and deleted
+    accounts are not captured.
+------------------------------------------------------------------------
+## Output files (full_run_v1)
+| File | Description |
+|-----------------------|-------------------------------------------------|
+| `topics/final_topic_summary.csv` | 50 reduced topics with size, keywords, representative texts |
+| `topics/final_model_comparison.csv` | Coherence, diversity, outlier rate for all 9 runs |
+| `topics/evaluation_table.csv` | Same as above, alternate format |
+| `topics/bertopic_run_summary.csv` | Initial topic counts across 6 HDBSCAN configurations |
+| `topics/topic_by_subreddit.csv` | Topic × subreddit document counts |
+| `topics/topic_by_month.csv` | Topic × month document counts |
+| `topics/topic_by_doctype.csv` | Topic × doc type (post/comment) |
+| `topics/preprocessing_decisions.json` | Logged cleaning decisions |
+| `topics/api/` | API-generated labels, summaries, category annotations |
+| `bertopic/` | Saved BERTopic models + doc-topic parquet files for all 6 initial runs |
+------------------------------------------------------------------------
+## Contacts and citation
+This analysis was conducted as part of a preliminary empirical study of
+online gooning communities. If replicating, please cite the original
+study and note the random seed, embedding model, and reduction target
+used.

agent/.env.example ADDED Viewed

	@@ -0,0 +1,4 @@

+ANTHROPIC_API_KEY=your_key_here
+TOGETHER_API_KEY=your_key_here
+# Optional: override default data directory (default: ../data relative to agent/)
+# DATA_DIR=/absolute/path/to/data

agent/.streamlit/config.toml ADDED Viewed

	@@ -0,0 +1,10 @@

+[theme]
+base = "light"
+backgroundColor = "#f5f3ef"
+secondaryBackgroundColor = "#edeae4"
+textColor = "#1a1a1a"
+primaryColor = "#1a1a1a"
+font = "sans serif"
+[server]
+headless = true

agent/CLAUDE.md ADDED Viewed

	@@ -0,0 +1,372 @@

+# CLAUDE.md
+## Project
+Build an agent that:
+1. Accepts user questions about a dataset.
+2. Inspects available data files and schema.
+3. Chooses appropriate analysis steps.
+4. Runs reproducible analyses in code.
+5. Returns answers in plain language plus supporting outputs.
+6. Optionally provides a UI for viewing results.
+## Core objective
+This repository is for a question-driven data analysis agent.
+The agent should behave like an analysis system, not a general chatbot.
+Its job is to translate a user question into a reproducible analytic workflow grounded in the available data.
+The agent must:
+* inspect files before making assumptions
+* prefer deterministic code execution over unsupported claims
+* distinguish clearly between observed results, assumptions, and uncertainty
+* save intermediate outputs when useful
+* produce answers that map directly to the user's question
+## Non-goals
+Do not:
+* invent variables, files, columns, or results
+* answer from background knowledge when the answer should come from the data
+* silently skip failed steps
+* overwrite user data files unless explicitly instructed
+* present speculative findings as if they were measured
+## Default workflow
+For each user request, follow this sequence:
+1. Interpret the question
+* Restate the task internally in operational terms.
+* Identify whether the question is descriptive, comparative, predictive, inferential, filtering-based, text-analysis-based, or dashboard/UI-related.
+* Identify the likely required inputs, outputs, and constraints.
+2. Inspect the repository and data
+* Find relevant files, scripts, configs, notebooks, and prior outputs.
+* Detect available datasets and likely schema.
+* Read small samples of files first.
+* Determine data types, missingness, date fields, identifiers, and likely join keys.
+3. Plan before acting
+* Write a short task plan in the response or working notes.
+* Select the minimum analysis needed to answer the question properly.
+* Prefer existing project utilities where available.
+4. Execute reproducibly
+* Use code, scripts, or SQL rather than manual reasoning for calculations.
+* Save generated code in the repo when it is likely to be reused.
+* Save outputs to a structured outputs directory.
+5. Validate
+* Check row counts, duplicates, filters, and assumptions.
+* Inspect whether results are plausible.
+* Flag ambiguities, weak inference, or low data quality.
+6. Respond
+* Answer the exact question first.
+* Then show method, assumptions, and supporting numbers.
+* Keep prose direct and structured.
+## Response format
+Unless the user asks otherwise, structure final responses as:
+1. Answer
+* Direct answer to the question.
+2. What was analysed
+* Files used
+* Filters used
+* Key variables used
+3. Method
+* Exact analysis steps
+* Statistical or computational approach
+4. Results
+* Main numbers, estimates, comparisons, or model outputs
+5. Caveats
+* Missing data, ambiguity, low sample size, measurement issues, or assumptions
+6. Saved outputs
+* Paths to scripts, tables, figures, or logs created during the run
+## Behavioural rules
+* Be concise but complete.
+* Prefer bulletless structured sections unless lists are clearer.
+* Never claim an analysis was run if it was not run.
+* If a file or field is missing, say so explicitly.
+* If the user asks a vague question, propose the most reasonable operationalization and proceed.
+* If multiple interpretations are possible, state the one chosen and why.
+* Prefer transparency over fluency.
+## Analysis policy
+### Descriptive questions
+For questions like:
+* How many
+* What proportion
+* What is the average
+* What changed over time
+Use:
+* counts
+* percentages
+* grouped summaries
+* date aggregation
+* plots when helpful
+### Comparative questions
+For questions like:
+* Is group A different from group B
+* Which variables differ most across groups
+Use:
+* grouped summaries first
+* effect sizes where appropriate
+* inferential tests only if defensible for the data and design
+### Predictive questions
+For questions like:
+* Can we predict X from Y
+* Which variables matter most
+Use:
+* explicit train/test logic
+* baseline models before complex ones
+* interpretable models unless performance is the main goal
+* appropriate metrics for the target type
+### Text questions
+For questions over text data:
+* inspect raw examples first
+* identify the unit of analysis: document, post, comment, sentence, token
+* choose methods that fit the question: keyword rules, embeddings, topic modelling, clustering, classification, sentiment, summarisation, or retrieval
+* keep raw text outputs traceable to source rows where allowed
+### Causal language
+Do not use causal language unless the design supports it.
+Replace causal claims with associational language by default.
+## Reproducibility requirements
+* Every non-trivial analysis should produce a saved artifact.
+* Prefer these folders when they exist; otherwise create them:
+  * `data/`
+  * `analysis/`
+  * `app/`
+  * `outputs/`
+  * `logs/`
+* Save generated scripts with descriptive names.
+* Save machine-readable outputs in csv, json, parquet, or feather where sensible.
+* Save plots as png or svg.
+* Save a short run log for substantial jobs.
+## File and code conventions
+* Never modify raw source data in place.
+* Write derived data to `data/derived/`.
+* Write one-off analysis scripts to `analysis/`.
+* Write reusable utilities to `analysis/lib/` or an existing utilities folder.
+* Name files to reflect task and date when useful.
+## UI goal
+If UI work is requested, build a lightweight question interface that sits on top of the analysis pipeline.
+Preferred UI order:
+1. Streamlit for fastest single-user internal tool
+2. Gradio for rapid prototype interaction
+3. React + FastAPI if a more controlled app is needed
+Default recommendation:
+* Use Streamlit if no frontend stack is already specified.
+## Default UI specification
+When building a UI, include:
+* a file uploader or dataset selector
+* a question input box
+* an optional advanced settings panel
+* a results panel with direct answer
+* expandable method details
+* downloadable outputs when possible
+* a history pane of prior questions in the current session
+## Suggested app architecture
+### Minimal version
+* `app/app.py` for Streamlit UI
+* `analysis/router.py` to classify question type
+* `analysis/inspect_data.py` to inspect schema and file summaries
+* `analysis/run_analysis.py` to execute question-specific workflows
+* `analysis/format_response.py` to convert outputs into a final answer
+* `outputs/` for saved artifacts
+### API version
+* `app/frontend/` for React UI
+* `app/api/main.py` for FastAPI endpoints
+* `analysis/` for shared analysis logic
+* `outputs/` and `logs/` for artifacts and traceability
+## Question routing logic
+Map user questions to one of these modes:
+* `describe`
+* `compare`
+* `trend`
+* `predict`
+* `text_search`
+* `text_cluster`
+* `summarise_subset`
+* `dashboard_request`
+* `data_quality_check`
+When uncertain, start with `describe` plus schema inspection.
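
Note on the routing section above: a rule-based router over these modes could look like the sketch below. The route names come from the list; the keyword patterns, their order, and the name `route_question` are guesses made for illustration, not the router in `analysis.py`.

```python
import re

# Checked in order; the first matching pattern wins.
ROUTE_PATTERNS = [
    ("dashboard_request", r"\b(dashboard|ui)\b"),
    ("data_quality_check", r"\b(missing|duplicates?|data quality)\b"),
    ("predict", r"\b(predicts?|forecasts?)\b"),
    ("trend", r"\b(over time|trends?|yearly|monthly)\b"),
    ("compare", r"\b(compare|differs?|versus|vs)\b"),
    ("text_cluster", r"\b(topics?|clusters?)\b"),
    ("text_search", r"\b(mentions?|search|contains?)\b"),
    ("summarise_subset", r"\b(summarise|summarize)\b"),
]
DEFAULT_ROUTE = "describe"  # the documented fallback when uncertain


def route_question(question):
    """Return the first route whose pattern matches, else the default."""
    text = question.lower()
    for route, pattern in ROUTE_PATTERNS:
        if re.search(pattern, text):
            return route
    return DEFAULT_ROUTE
```

For example, `route_question("Which subreddits changed most over time?")` returns `"trend"`, and a question with no matching keyword falls back to `"describe"`.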
+## Tool preferences
+Prefer this order:
+1. Existing project scripts
+2. Python analysis scripts
+3. SQL queries if the data store is relational
+4. R scripts if the project already uses R heavily
+5. Shell utilities for fast inspection only
+## Statistical discipline
+* Report denominators.
+* Report missingness when relevant.
+* Do not run significance tests by reflex.
+* Match methods to the measurement level and design.
+* For small samples or noisy text outputs, emphasise uncertainty.
+## Error handling
+When something fails:
+* state exactly what failed
+* include the file, command, or function involved
+* propose the next best fallback
+* continue where possible instead of stopping entirely
+## Safe execution rules
+* Confirm before destructive operations.
+* Avoid network calls unless needed.
+* Do not expose secrets from `.env`, keys, or credentials.
+* Do not execute untrusted code from data files.
+## Preferred implementation plan when asked to build the system
+1. Inspect repo and identify current stack.
+2. Create analysis router.
+3. Create schema inspection utility.
+4. Create question-to-analysis execution module.
+5. Create response formatter.
+6. Create Streamlit UI if no existing frontend is present.
+7. Test on one small example dataset.
+8. Save outputs and usage instructions.
+## What to do when the repo is empty
+If the repository is mostly empty, scaffold this structure:
+```text
+project_root/
+  CLAUDE.md
+  README.md
+  requirements.txt
+  .env.example
+  data/
+    raw/
+    derived/
+  analysis/
+    __init__.py
+    inspect_data.py
+    router.py
+    run_analysis.py
+    format_response.py
+  app/
+    app.py
+  outputs/
+  logs/
+```
+Then implement:
+* a schema inspector
+* a rule-based question router
+* a first-pass analysis runner for descriptive and comparative questions
+* a simple Streamlit UI
+## Definition of done
+A task is complete only when:
+* the question has been answered directly
+* the analysis actually ran or the blocker is explicit
+* outputs are saved where appropriate
+* the response includes methods and caveats
+* any created UI or scripts can be run by another developer with minimal guesswork
+## Example high-quality user requests
+* Analyse `data/raw/survey.csv` and tell me whether anxiety differs by gender.
+* Compare yearly trends in posts by topic using the reddit export in `data/raw/`.
+* Build a UI where I can upload a csv and ask natural-language questions.
+* Inspect this dataset and tell me what questions it can answer reliably.
+## Example implementation prompt for Claude Code
+Build a question-driven data analysis agent in this repository.
+Requirements:
+* ingest tabular files and optionally text datasets
+* inspect schema before analysis
+* classify incoming questions into analysis modes
+* run reproducible analyses in Python
+* save scripts and outputs
+* return direct answers plus methods and caveats
+* build a simple Streamlit UI with file upload, question input, and results panel
+Start by scaffolding the project if needed, then implement the minimal working version for csv files.
+Do not invent results. Run actual analyses against the provided data.

agent/README.md ADDED Viewed

	@@ -0,0 +1,73 @@

+# Agent App
+## Run the UI
+```bash
+cd /Users/binx/Desktop/Goon/agent
+python3 -m venv .venv
+source .venv/bin/activate
+pip install -r requirements.txt
+streamlit run app/app.py
+```
+Set your Anthropic key in `agent/.env`:
+```bash
+ANTHROPIC_API_KEY=your_key_here
+```
+Or paste it into the sidebar after the app starts.
+## Local Python API
+This repo does not expose an HTTP API server. The supported programmatic interface is the local Python function in `analysis.agent`.
+### Basic usage
+```python
+from analysis.agent import run_agent
+result = run_agent("How many posts per subreddit?")
+print(result["answer"])
+print(result["tool_calls"])
+```
+### With prior context
+```python
+from analysis.agent import run_agent
+turns = [
+    {
+        "question": "How many posts per subreddit?",
+        "answer": "Previous answer text",
+        "tool_calls": [],
+        "artifacts": [],
+        "plotly_json": "",
+        "route": "describe",
+    }
+]
+result = run_agent(
+    "Which subreddits changed most over time?",
+    turns=turns,
+)
+print(result["route"])
+print(result["answer"])
+```
+## Return shape
+`run_agent(...)` returns a dictionary with:
+- `answer`: final assistant response
+- `tool_calls`: executed tool calls plus arguments and results
+- `plotly_json`: chart payload when a plot was generated
+- `route`: detected route for the question
+- `allowed_tools`: tools exposed for that route
+## Notes
+- Use `python3`, not `python`, in this environment.
+- The app stores structured turn state in the Streamlit session so follow-up questions can reuse prior analytical context.
+- Generated CSV and PNG artifacts are written to `agent/outputs/`.
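One possible way to carry a result into a follow-up call is to copy it into a turn record with the same keys as the `turns` example in the README. This is only a sketch: `result_to_turn` is a hypothetical helper that does not exist in the repo, and `artifacts` is defaulted to an empty list because the documented return shape does not list that key. A stand-in dictionary replaces the real `run_agent(...)` call so the snippet runs offline.

```python
def result_to_turn(question: str, result: dict) -> dict:
    """Hypothetical helper: copy the fields a follow-up call expects."""
    return {
        "question": question,
        "answer": result.get("answer", ""),
        "tool_calls": result.get("tool_calls", []),
        "artifacts": result.get("artifacts", []),  # assumption: key may be absent
        "plotly_json": result.get("plotly_json", ""),
        "route": result.get("route", "unknown"),
    }


# Stand-in for a run_agent(...) return value (made-up content).
fake_result = {
    "answer": "Previous answer text",
    "tool_calls": [],
    "plotly_json": "",
    "route": "describe",
}

turns = [result_to_turn("How many posts per subreddit?", fake_result)]
print(turns[0]["route"])
```

The resulting list could then be passed as `turns=turns` on the next call, as in the "With prior context" example.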

agent/analysis.py ADDED Viewed

	@@ -0,0 +1,1480 @@

+"""
+Goon analysis agent — all modules in one file.
+Sections:
+  1. Response formatter
+  2. Data inspection & sampling
+  3. Question router
+  4. Image analysis
+  5. Text pattern extraction
+  6. Analysis execution (count, trend, stats, search, word freq, compare)
+  7. Core agent loop
+"""
+from __future__ import annotations
+# ── stdlib ─────────────────────────────────────────────────────────────────
+import base64
+import glob
+import io
+import json
+import os
+import re
+import traceback as _traceback
+import urllib.request
+from dataclasses import dataclass
+from datetime import datetime, timezone
+from pathlib import Path
+# ── third-party ────────────────────────────────────────────────────────────
+import anthropic
+import openai  # Together AI uses an OpenAI-compatible endpoint
+import pandas as pd
+import plotly.express as px
+import plotly.io as pio
+import pyarrow.dataset as ds
+import pyarrow.parquet as pq
+from pyarrow.compute import field
+from sklearn.metrics import cohen_kappa_score
+# ══════════════════════════════════════════════════════════════════════════════
+# 1. Response formatter
+# ══════════════════════════════════════════════════════════════════════════════
+def format_result(result: dict, answer_text: str = "") -> str:
+    """Combine Claude's prose answer with the structured analysis result as markdown."""
+    lines = []
+    if answer_text:
+        lines.append("## Answer\n")
+        lines.append(answer_text.strip())
+        lines.append("")
+    dataset = result.get("dataset", "")
+    if dataset:
+        lines.append("## What was analysed\n")
+        lines.append(f"- Dataset: `{dataset}`")
+        if result.get("subreddit_filter"):
+            lines.append(f"- Subreddit filter: `{result['subreddit_filter']}`")
+        if result.get("group_col"):
+            lines.append(f"- Grouped by: `{result['group_col']}`")
+        if result.get("value_col"):
+            lines.append(f"- Value column: `{result['value_col']}`")
+        lines.append("")
+    table = result.get("table")
+    if table:
+        lines.append("## Results\n")
+        lines.append(_dict_list_to_md_table(table))
+        lines.append("")
+    saved = [result[k] for k in ("saved_csv", "saved_png") if result.get(k)]
+    if saved:
+        lines.append("## Saved outputs\n")
+        for s in saved:
+            lines.append(f"- `{s}`")
+        lines.append("")
+    return "\n".join(lines)
+def _dict_list_to_md_table(records: list[dict]) -> str:
+    if not records:
+        return "_No results._"
+    headers = list(records[0].keys())
+    rows = [[str(r.get(h, "")) for h in headers] for r in records]
+    widths = [max(len(h), max((len(r[i]) for r in rows), default=0)) for i, h in enumerate(headers)]
+    sep = "| " + " | ".join("-" * w for w in widths) + " |"
+    header_row = "| " + " | ".join(h.ljust(widths[i]) for i, h in enumerate(headers)) + " |"
+    data_rows = [
+        "| " + " | ".join(cell.ljust(widths[i]) for i, cell in enumerate(row)) + " |"
+        for row in rows[:50]
+    ]
+    return "\n".join([header_row, sep] + data_rows)
+# ══════════════════════════════════════════════════════════════════════════════
+# 2. Data inspection & sampling
+# ══════════════════════════════════════════════════════════════════════════════
+DATA_DIR = Path(os.environ.get("DATA_DIR", Path(__file__).parent.parent / "data"))
+OUTPUTS_DIR = Path(__file__).parent / "outputs"
+OUTPUTS_DIR.mkdir(exist_ok=True)
+METADATA_CACHE = OUTPUTS_DIR / "dataset_metadata.json"
+def _best(name: str) -> Path:
+    """Prefer the full rebuilt parquet over the original, but validate it first."""
+    full = DATA_DIR / f"{name}_full.parquet"
+    orig = DATA_DIR / f"{name}.parquet"
+    if full.exists():
+        try:
+            pq.read_schema(full)
+            return full
+        except Exception:
+            pass
+    return orig
+DATASETS = {
+    "posts": _best("posts"),
+    "comments": _best("comments"),
+    "corpus_clean": DATA_DIR / "corpus_clean.parquet",
+    "titles": _best("titles"),
+}
+def _dataset_path(name: str) -> Path:
+    path = DATASETS.get(name)
+    if path is None or not path.exists():
+        raise FileNotFoundError(f"Dataset '{name}' not found at {path}")
+    return path
+def _load(name: str, columns: list[str] | None = None) -> pd.DataFrame:
+    return pd.read_parquet(_dataset_path(name), columns=columns)
+def _scanner(name: str, columns: list[str] | None = None, filters: dict | None = None) -> ds.Scanner:
+    path = _dataset_path(name)
+    dataset = ds.dataset(path, format="parquet")
+    expression = None
+    for col, value in (filters or {}).items():
+        if col not in dataset.schema.names or value in (None, ""):
+            continue
+        clause = field(col) == value
+        expression = clause if expression is None else expression & clause
+    return dataset.scanner(columns=columns, filter=expression)
+def _read_distinct_values(name: str, column: str, limit: int = 200) -> list[str] | None:
+    if column not in _schema_names(name):
+        return None
+    table = _scanner(name, columns=[column]).to_table()
+    values = table.column(column).drop_null().unique().to_pylist()
+    return sorted(str(v) for v in values)[:limit]
+def _read_date_range(name: str) -> dict | None:
+    if "created_utc" not in _schema_names(name):
+        return None
+    table = _scanner(name, columns=["created_utc"]).to_table()
+    if table.num_rows == 0:
+        return None
+    series = table.column("created_utc").to_pandas().dropna()
+    if series.empty:
+        return None
+    return {
+        "earliest": datetime.fromtimestamp(series.min(), tz=timezone.utc).strftime("%Y-%m-%d"),
+        "latest": datetime.fromtimestamp(series.max(), tz=timezone.utc).strftime("%Y-%m-%d"),
+    }
+def _schema_names(name: str) -> list[str]:
+    return pq.read_schema(_dataset_path(name)).names
+def compute_dataset_metadata() -> dict:
+    result = {}
+    for name, path in DATASETS.items():
+        if not path.exists():
+            result[name] = {"available": False}
+            continue
+        parquet = pq.ParquetFile(path)
+        schema = parquet.schema_arrow
+        result[name] = {
+            "available": True,
+            "path": str(path),
+            "rows": parquet.metadata.num_rows,
+            "columns": {f.name: str(f.type) for f in schema},
+            "subreddits": _read_distinct_values(name, "subreddit"),
+            "date_range": _read_date_range(name),
+            "metadata_cached_at": datetime.now(timezone.utc).isoformat(),
+        }
+    METADATA_CACHE.write_text(json.dumps(result, indent=2))
+    return result
+def get_dataset_metadata(refresh: bool = False) -> dict:
+    if METADATA_CACHE.exists() and not refresh:
+        return json.loads(METADATA_CACHE.read_text())
+    return compute_dataset_metadata()
+def list_datasets(refresh: bool = False) -> dict:
+    """Return cached dataset metadata instead of loading full tables."""
+    return get_dataset_metadata(refresh=refresh)
+def sample_rows(
+    dataset: str,
+    n: int = 5,
+    filters: dict | None = None,
+    columns: list[str] | None = None,
+) -> dict:
+    """Return a small deterministic preview of rows from a dataset, optionally filtered."""
+    selected_columns = columns or _schema_names(dataset)
+    table = _scanner(dataset, columns=selected_columns, filters=filters).head(n)
+    df = table.to_pandas() if table.num_rows else pd.DataFrame(columns=selected_columns)
+    return {
+        "dataset": dataset,
+        "filters": filters or {},
+        "n_returned": len(df),
+        "rows": df.fillna("").to_dict(orient="records"),
+    }
+# ══════════════════════════════════════════════════════════════════════════════
+# 3. Question router
+# ══════════════════════════════════════════════════════════════════════════════
+@dataclass(frozen=True)
+class RoutePlan:
+    mode: str
+    allowed_tools: list[str]
+    guidance: str
+ALL_TOOL_NAMES = [
+    "list_datasets", "sample_rows", "count_by_group", "trend_over_time",
+    "summary_stats", "top_posts", "text_search", "word_freq", "compare_groups",
+    "extract_frequency_patterns", "extract_dominance_patterns", "analyze_image_sample",
+    "export_reliability_sample", "compute_reliability",
+]
+def route_question(question: str) -> RoutePlan:
+    q = question.lower()
+    if any(t in q for t in ["image", "images", "photo", "photos", "visual", "depicted"]):
+        return RoutePlan(
+            mode="image",
+            allowed_tools=["list_datasets", "sample_rows", "analyze_image_sample", "export_reliability_sample", "compute_reliability"],
+            guidance="This is a visual-content question. Prefer image analysis tools and avoid text-only proxies. Always provide a coding_scheme.",
+        )
+    if any(t in q for t in ["reliability", "kappa", "human coding", "inter-rater", "validate"]):
+        return RoutePlan(
+            mode="reliability",
+            allowed_tools=["export_reliability_sample", "compute_reliability"],
+            guidance="This is a reliability/validation question. Use export_reliability_sample then compute_reliability.",
+        )
+    if any(t in q for t in ["how often", "how long", "times per", "every day", "session length", "streak"]):
+        return RoutePlan(
+            mode="pattern_frequency",
+            allowed_tools=["list_datasets", "sample_rows", "extract_frequency_patterns", "text_search"],
+            guidance="This is a behavioral frequency/duration question. Prefer regex pattern extraction over generic word counts.",
+        )
+    if any(t in q for t in ["dominant", "subordinate", "mistress", "goddess", "femdom", "submissive"]):
+        return RoutePlan(
+            mode="pattern_dominance",
+            allowed_tools=["list_datasets", "sample_rows", "extract_dominance_patterns", "text_search", "analyze_image_sample"],
+            guidance="This is a dominance/subordination framing question. Use the text pattern tool unless the user explicitly asks about images.",
+        )
+    if any(t in q for t in ["over time", "trend", "changed", "change over time", "monthly", "yearly"]):
+        return RoutePlan(
+            mode="trend",
+            allowed_tools=["list_datasets", "sample_rows", "trend_over_time", "count_by_group"],
+            guidance="This is a time-series question. Prefer trend_over_time and only use grouping/count tools to contextualize it.",
+        )
+    if any(t in q for t in ["compare", "difference", "versus", "vs", "higher", "lower"]):
+        return RoutePlan(
+            mode="compare",
+            allowed_tools=["list_datasets", "sample_rows", "compare_groups", "summary_stats", "count_by_group"],
+            guidance="This is a comparison question. Prefer compare_groups or summary_stats with explicit filters.",
+        )
+    if any(t in q for t in ["top", "highest", "best scoring", "most upvoted"]):
+        return RoutePlan(
+            mode="ranking",
+            allowed_tools=["list_datasets", "sample_rows", "top_posts", "summary_stats"],
+            guidance="This is a ranking question. Prefer top_posts and use summary_stats only if it supports the answer.",
+        )
+    if any(t in q for t in ["search", "find", "mention", "contains", "where people say"]):
+        return RoutePlan(
+            mode="search",
+            allowed_tools=["list_datasets", "sample_rows", "text_search", "top_posts"],
+            guidance="This is a retrieval question. Prefer text_search with the right dataset and text column.",
+        )
+    if any(t in q for t in ["common words", "most common words", "word frequency", "tokens"]):
+        return RoutePlan(
+            mode="lexical",
+            allowed_tools=["list_datasets", "sample_rows", "word_freq", "text_search"],
+            guidance="This is a lexical summary question. Prefer word_freq and inspect text samples only if needed.",
+        )
+    if any(t in q for t in ["how many", "count", "number of", "what proportion"]):
+        return RoutePlan(
+            mode="describe",
+            allowed_tools=["list_datasets", "sample_rows", "count_by_group", "summary_stats", "trend_over_time"],
+            guidance="This is a descriptive count question. Prefer count_by_group or summary_stats and keep the plan minimal.",
+        )
+    return RoutePlan(
+        mode="unknown",
+        allowed_tools=ALL_TOOL_NAMES,
+        guidance="Question type is ambiguous. Inspect metadata first, then choose the minimum reliable tool path.",
+    )
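Note on the router: keywords are matched with substring `in` checks, so short tokens such as `vs`, `top`, or `count` also fire inside longer words. A minimal sketch of whole-word matching with `re` is shown here as one possible alternative. `matches_any` is a hypothetical helper and is not part of this commit.

```python
import re


def matches_any(question: str, terms: list[str]) -> bool:
    """Return True if any term occurs as a whole word or phrase (case-insensitive)."""
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b"
    return re.search(pattern, question, re.IGNORECASE) is not None


compare_terms = ["compare", "difference", "versus", "vs", "higher", "lower"]
question = "which tvs are mentioned"

# Substring check: "vs" is found inside "tvs".
substring_hit = any(t in question for t in compare_terms)
# Whole-word check: no compare term appears as its own word.
word_hit = matches_any(question, compare_terms)

print(substring_hit, word_hit)
print(matches_any("r/a vs r/b by score", compare_terms))
```

With the substring check, the example question would be routed to the `compare` mode even though it is not a comparison. The whole-word check only fires on the second question.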
+# ══════════════════════════════════════════════════════════════════════════════
+# 4. Image analysis
+# ══════════════════════════════════════════════════════════════════════════════
+VISION_MODEL = "Qwen/Qwen2-VL-72B-Instruct"
+TOGETHER_BASE_URL = "https://api.together.xyz/v1"
+DIRECT_IMAGE_DOMAINS = {"i.redd.it", "i.imgur.com", "i.redgifs.com"}
+def _load_image_urls(subreddit: str | None = None, n: int = 50) -> pd.DataFrame:
+    pattern = str(DATA_DIR / "*_submissions_*.csv")
+    files = sorted(glob.glob(pattern))
+    if subreddit:
+        files = [f for f in files if Path(f).name.lower().startswith(subreddit.lower())]
+    needed_cols = ["subreddit", "title", "url", "domain", "score", "is_self"]
+    frames = []
+    for f in files:
+        try:
+            df = pd.read_csv(f, usecols=lambda c: c in needed_cols, low_memory=False)
+            if "is_self" in df.columns:
+                df = df[df["is_self"] == False]
+            if "url" in df.columns and "domain" in df.columns:
+                df = df[df["domain"].isin(DIRECT_IMAGE_DOMAINS)].dropna(subset=["url"])
+                frames.append(df[["subreddit", "title", "url", "domain", "score"]])
+        except Exception:
+            continue
+    if not frames:
+        return pd.DataFrame()
+    combined = pd.concat(frames, ignore_index=True)
+    if len(combined) > n * 10:
+        combined = combined.sample(min(n * 10, len(combined)), random_state=42)
+    return combined.head(n * 10)
+def _fetch_image_b64(url: str, timeout: int = 8) -> tuple[str, str] | None:
+    try:
+        url.encode("ascii")
+    except UnicodeEncodeError:
+        return None
+    try:
+        req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0 (research bot)"})
+        with urllib.request.urlopen(req, timeout=timeout) as resp:
+            content_type = resp.headers.get("Content-Type", "image/jpeg").split(";")[0].strip()
+            if not content_type.startswith("image/"):
+                return None
+            data = resp.read()
+            if len(data) < 1000:
+                return None
+            return base64.standard_b64encode(data).decode("utf-8"), content_type
+    except Exception:
+        return None
+def analyze_image_sample(
+    question: str,
+    subreddit: str | None = None,
+    n_sample: int = 100,
+    coding_scheme: dict | None = None,
+) -> dict:
+    """
+    Sample image posts, fetch them, and ask Qwen2-VL a structured content-analysis question.
+    Uses Together AI (no content filters). n_sample is uncapped — set as needed.
+    """
+    client = openai.OpenAI(
+        api_key=os.environ["TOGETHER_API_KEY"],
+        base_url=TOGETHER_BASE_URL,
+    )
+    candidates = _load_image_urls(subreddit=subreddit, n=n_sample * 5)
+    if candidates.empty:
+        return {
+            "analysis": "analyze_image_sample",
+            "error": "No direct image URLs found in raw CSVs for the given filters.",
+            "subreddit_filter": subreddit,
+        }
+    if coding_scheme:
+        scheme_text = "\n".join(f"- {k}: {v}" for k, v in coding_scheme.items())
+        prompt = (
+            f"{question}\n\nCoding scheme:\n{scheme_text}\n\n"
+            "Reply with ONLY the label and a one-sentence justification, "
+            "formatted as: LABEL | justification"
+        )
+    else:
+        prompt = (
+            f"{question}\n\n"
+            "Reply with a short structured answer. "
+            "If you cannot determine this from the image, reply: UNCLEAR | reason"
+        )
+    results = []
+    attempted = 0
+    for _, row in candidates.iterrows():
+        if len(results) >= n_sample:
+            break
+        attempted += 1
+        img = _fetch_image_b64(row["url"])
+        if img is None:
+            continue
+        b64data, media_type = img
+        try:
+            response = client.chat.completions.create(
+                model=VISION_MODEL,
+                max_tokens=200,
+                messages=[{
+                    "role": "user",
+                    "content": [
+                        {"type": "image_url", "image_url": {"url": f"data:{media_type};base64,{b64data}"}},
+                        {"type": "text", "text": prompt},
+                    ],
+                }],
+            )
+            answer = response.choices[0].message.content.strip()
+        except Exception as e:
+            answer = f"ERROR | {e}"
+        parts = answer.split("|", 1)
+        label = parts[0].strip().upper() if parts else "UNCLEAR"
+        justification = parts[1].strip() if len(parts) > 1 else ""
+        def _ascii_safe(s: str) -> str:
+            return s.encode("ascii", errors="replace").decode("ascii")
+        results.append({
+            "subreddit": _ascii_safe(str(row.get("subreddit", ""))),
+            "title": _ascii_safe(str(row.get("title", ""))),
+            "url": row["url"],
+            "label": label,
+            "justification": _ascii_safe(justification),
+            "score": row.get("score", None),
+        })
+    label_counts: dict[str, int] = {}
+    for r in results:
+        label_counts[r["label"]] = label_counts.get(r["label"], 0) + 1
+    total_coded = len(results)
+    saved_csv = None
+    if results:
+        out_df = pd.DataFrame(results)
+        stem = f"image_analysis_{subreddit or 'all'}"
+        saved_csv = str(OUTPUTS_DIR / f"{stem}.csv")
+        out_df.to_csv(saved_csv, index=False)
+    return {
+        "analysis": "analyze_image_sample",
+        "question": question,
+        "subreddit_filter": subreddit,
+        "n_attempted": attempted,
+        "n_successfully_coded": total_coded,
+        "label_counts": label_counts,
+        "label_pct": {k: round(v / total_coded * 100, 1) for k, v in label_counts.items()} if total_coded else {},
+        "per_image_results": results,
+        "saved_csv": saved_csv,
+        "caveats": [
+            "Sample limited to direct-image domains (i.redd.it, i.imgur.com, i.redgifs.com) — galleries and videos excluded.",
+            f"Vision model: {VISION_MODEL} via Together AI.",
+            "Coded by a single model — validate with human reliability sample before reporting.",
+        ],
+    }
+def export_reliability_sample(
+    source_csv: str | None = None,
+    n: int = 200,
+    random_state: int = 42,
+) -> dict:
+    """
+    Draw a stratified random sample of n images from a completed image_analysis CSV
+    for human coding. Saves a CSV with an empty human_label column.
+    """
+    if source_csv is None:
+        # Default to most recently written image_analysis file
+        candidates = sorted(OUTPUTS_DIR.glob("image_analysis_*.csv"), key=lambda p: p.stat().st_mtime)
+        if not candidates:
+            return {"error": "No image_analysis CSV found in outputs/. Run analyze_image_sample first."}
+        source_csv = str(candidates[-1])
+    df = pd.read_csv(source_csv)
+    df = df[df["label"].notna() & ~df["label"].astype(str).str.startswith("ERROR")]
+    # Stratified sample by label
+    sampled = (
+        df.groupby("label", group_keys=False)
+        .apply(lambda g: g.sample(min(len(g), max(1, int(n * len(g) / len(df)))), random_state=random_state))
+    )
+    # Top up to exactly n if rounding left us short
+    if len(sampled) < n and len(df) >= n:
+        remaining = df[~df.index.isin(sampled.index)]
+        top_up = remaining.sample(n - len(sampled), random_state=random_state)
+        sampled = pd.concat([sampled, top_up])
+    sampled = sampled.sample(frac=1, random_state=random_state).reset_index(drop=True)  # shuffle
+    sampled.insert(0, "image_id", range(1, len(sampled) + 1))
+    sampled = sampled.rename(columns={"label": "model_label", "justification": "model_justification"})
+    sampled["human_label"] = ""
+    out_cols = ["image_id", "url", "title", "subreddit", "model_label", "model_justification", "human_label"]
+    out_cols = [c for c in out_cols if c in sampled.columns]
+    out_path = str(OUTPUTS_DIR / "reliability_sample.csv")
+    sampled[out_cols].to_csv(out_path, index=False)
+    return {
+        "analysis": "export_reliability_sample",
+        "source_csv": source_csv,
+        "n_exported": len(sampled),
+        "label_distribution": sampled["model_label"].value_counts().to_dict(),
+        "saved_csv": out_path,
+        "next_step": "Fill in the human_label column, then run compute_reliability.",
+    }
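Note on the stratified draw: the per-label quota in `export_reliability_sample` is `max(1, int(n * len(g) / len(df)))`, capped at the group size. The sketch below reproduces only that arithmetic in plain Python, with made-up label counts, to show why the top-up step exists. `stratum_quotas` is a hypothetical name used for illustration.

```python
def stratum_quotas(label_sizes: dict[str, int], n: int) -> dict[str, int]:
    """Proportional quota per label: truncated, floored at 1, capped at the stratum size."""
    total = sum(label_sizes.values())
    return {
        label: min(size, max(1, int(n * size / total)))
        for label, size in label_sizes.items()
    }


# Made-up label counts for a 1000-row image_analysis CSV.
sizes = {"A": 333, "B": 333, "C": 334}
quotas = stratum_quotas(sizes, n=200)
shortfall = 200 - sum(quotas.values())

print(quotas, shortfall)
```

Here truncation leaves the stratified draw two rows short of `n`, which is the gap the top-up sample fills. The floor of 1 can also push the total above `n` when there are many rare labels, and the function does not trim in that case.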
+def compute_reliability(human_csv_path: str | None = None) -> dict:
+    """
+    Compute Cohen's kappa between model_label and human_label columns
+    in a completed reliability sample CSV.
+    """
+    if human_csv_path is None:
+        human_csv_path = str(OUTPUTS_DIR / "reliability_sample.csv")
+    df = pd.read_csv(human_csv_path)
+    df = df[df["human_label"].notna() & (df["human_label"].astype(str).str.strip() != "")]
+    if len(df) < 2:
+        return {"error": "Not enough human-coded rows. Fill in human_label column first."}
+    model = df["model_label"].astype(str).str.strip().str.upper()
+    human = df["human_label"].astype(str).str.strip().str.upper()
+    kappa = cohen_kappa_score(human, model)
+    pct_agreement = round((human == model).mean() * 100, 1)
+    per_label = {}
+    for label in sorted(human.unique()):
+        h = (human == label)
+        m = (model == label)
+        tp = int((h & m).sum())
+        fp = int((~h & m).sum())
+        fn = int((h & ~m).sum())
+        per_label[label] = {"human_n": int(h.sum()), "model_n": int(m.sum()),
+                            "exact_matches": tp, "false_positives": fp, "false_negatives": fn}
+    report = {
+        "analysis": "compute_reliability",
+        "n_coded": len(df),
+        "cohens_kappa": round(kappa, 3),
+        "percent_agreement": pct_agreement,
+        "interpretation": (
+            "excellent (κ ≥ 0.80)" if kappa >= 0.80 else
+            "substantial (κ 0.60–0.79)" if kappa >= 0.60 else
+            "moderate (κ 0.40–0.59)" if kappa >= 0.40 else
+            "fair (κ 0.20–0.39)" if kappa >= 0.20 else
+            "poor (κ < 0.20)"
+        ),
+        "per_label": per_label,
+    }
+    out_path = str(OUTPUTS_DIR / "reliability_report.json")
+    Path(out_path).write_text(json.dumps(report, indent=2))
+    report["saved_json"] = out_path
+    return report
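Note on the statistic: `compute_reliability` delegates to scikit-learn's `cohen_kappa_score`. For readers who want to see what is being measured, here is a minimal pure-Python sketch of unweighted kappa, (p_o - p_e) / (1 - p_e), on made-up labels. `cohens_kappa` is an illustrative helper, not the library implementation, and it does not guard the degenerate case where expected agreement equals 1.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    n = len(rater_a)
    # p_o: share of items on which both raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # p_e: agreement expected by chance from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Made-up codes for eight images.
human = ["A", "A", "B", "B", "A", "B", "A", "B"]
model = ["A", "A", "B", "A", "A", "B", "B", "B"]

kappa = cohens_kappa(human, model)
print(round(kappa, 3))
```

In this toy case the raters agree on 6 of 8 items (75%), chance agreement is 50%, and kappa is 0.5, which the interpretation table in `compute_reliability` would label "moderate".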
+# ══════════════════════════════════════════════════════════════════════════════
+# 5. Text pattern extraction
+# ══════════════════════════════════════════════════════════════════════════════
+FREQUENCY_PATTERNS = {
+    "times_per_day": [
+        r"\b(\d+)\s*(?:times?|x)\s*(?:a|per)\s*day\b",
+        r"\b(\d+)\s*(?:times?|x)\s*daily\b",
+    ],
+    "times_per_week": [
+        r"\b(\d+)\s*(?:times?|x)\s*(?:a|per)\s*week\b",
+        r"\b(\d+)\s*(?:times?|x)\s*weekly\b",
+    ],
+    "hours_per_session": [
+        r"\b(\d+(?:\.\d+)?)\s*(?:hours?|hrs?)\b",
+        r"\b(\d+(?:\.\d+)?)\s*(?:hours?|hrs?)\s*(?:session|goon|long|straight|solid|non.?stop)\b",
+    ],
+    "all_day": [
+        r"\ball\s*day\b", r"\ball\s*night\b", r"\ball\s*weekend\b", r"\bfor\s*hours\b",
+    ],
+    "daily_habit": [
+        r"\bevery\s*day\b", r"\bevery\s*night\b", r"\bdaily\b", r"\bmost\s*days?\b",
+    ],
+    "streak_days": [
+        r"\b(\d+)\s*(?:days?\s*(?:in\s*a\s*row|straight|streak|running))\b",
+        r"\b(\d+)\s*(?:-|–)?\s*day\s*(?:streak|binge)\b",
+    ],
+}
+DOMINANCE_PATTERNS = {
+    "dominant_language": [
+        r"\bdominat(?:e|es|ed|ing|ion|rix)\b", r"\bfem(?:dom|domme)\b",
+        r"\bmistress\b", r"\bgoddess\b", r"\bqueen\b", r"\bowner\b",
+        r"\balpha\b", r"\bin\s*control\b", r"\bboss\b",
+    ],
+    "subordinate_language": [
+        r"\bsubmiss(?:ive|ion)\b", r"\bsub\b", r"\bobedient\b", r"\bslave\b",
+        r"\bpet\b", r"\bslut\b", r"\bwhore\b", r"\bused\b",
+        r"\bcontrolled\b", r"\bworship(?:ed|ped|ping|ing)?\b",
+    ],
+    "neutral_object": [
+        r"\bperfect\b", r"\bbeautiful\b", r"\bhot\b", r"\bsexy\b", r"\bstunning\b",
+    ],
+}
+def _compile(patterns: list[str]) -> re.Pattern:
+    return re.compile("|".join(patterns), re.IGNORECASE)
+def extract_frequency_patterns(
+    dataset: str = "comments",
+    text_col: str = "body",
+    subreddit: str | None = None,
+    n_examples: int = 5,
+    sample_size: int = 5_000_000,
+) -> dict:
+    """Mine text for frequency and duration language in up to `sample_size` rows of the dataset."""
+    cols = [text_col] + (["subreddit"] if subreddit else [])
+    df = _scanner(
+        dataset, columns=cols,
+        filters={"subreddit": subreddit} if subreddit else None,
+    ).head(sample_size).to_pandas()
+    text = df[text_col].fillna("")
+    total_docs = len(text)
+    results = {}
+    for category, pats in FREQUENCY_PATTERNS.items():
+        regex = _compile(pats)
+        matches_mask = text.str.contains(regex.pattern, case=False, regex=True, na=False)
+        hit_texts = text[matches_mask]
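+        values = []
+        # Scan with the combined alternation so overlapping patterns in one
+        # category do not record the same number twice.
+        for t in hit_texts:
+            for m in regex.finditer(t):
+                captured = next((g for g in m.groups() if g is not None), None)
+                if captured is not None:
+                    try:
+                        values.append(float(captured))
+                    except ValueError:
+                        pass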
+        raw_examples = hit_texts.sample(min(n_examples, len(hit_texts)), random_state=42).tolist() if len(hit_texts) > 0 else []
+        results[category] = {
+            "count": int(matches_mask.sum()),
+            "pct_of_docs": round(matches_mask.mean() * 100, 3),
+            "numeric_values": sorted(values)[:50] if values else [],
+            "mean_value": round(sum(values) / len(values), 2) if values else None,
+            "examples": [t.encode("ascii", errors="replace").decode("ascii") for t in raw_examples],
+        }
+    return {
+        "analysis": "extract_frequency_patterns",
+        "dataset": dataset,
+        "text_col": text_col,
+        "subreddit_filter": subreddit,
+        "total_docs_sampled": total_docs,
+        "patterns": results,
+    }
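Note on overlapping patterns: some categories in `FREQUENCY_PATTERNS` contain patterns that can match the same phrase. The sketch below uses the two `streak_days` patterns, copied from the dictionary above, on a made-up sentence. It compares scanning each pattern separately with scanning one combined alternation. The variable names are illustrative only.

```python
import re

STREAK = [
    r"\b(\d+)\s*(?:days?\s*(?:in\s*a\s*row|straight|streak|running))\b",
    r"\b(\d+)\s*(?:-|–)?\s*day\s*(?:streak|binge)\b",
]
text = "on a 5 day streak, then 3 days in a row"

# Scanning each pattern separately: "5 day streak" satisfies both patterns.
separate = [
    float(m.group(1))
    for p in STREAK
    for m in re.finditer(p, text, re.IGNORECASE)
]

# One alternation: each matched span is reported once.
combined = re.compile("|".join(STREAK), re.IGNORECASE)
single_pass = [
    float(next(g for g in m.groups() if g is not None))
    for m in combined.finditer(text)
]

print(separate)
print(single_pass)
```

The separate scan records the value 5 twice, which would skew `mean_value`. The single pass records each number once.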
+def extract_dominance_patterns(
+    dataset: str = "comments",
+    text_col: str = "body",
+    subreddit: str | None = None,
+    sample_size: int = 5_000_000,
+) -> dict:
+    """Count dominant, subordinate, and neutral language in text."""
+    cols = [text_col] + (["subreddit"] if subreddit else [])
+    df = _scanner(
+        dataset, columns=cols,
+        filters={"subreddit": subreddit} if subreddit else None,
+    ).head(sample_size).to_pandas()
+    text = df[text_col].fillna("")
+    total_docs = len(text)
+    results = {}
+    for category, pats in DOMINANCE_PATTERNS.items():
+        regex = _compile(pats)
+        mask = text.str.contains(regex, na=False)
+        hits = text[mask]
+        raw_examples = hits.sample(min(5, len(hits)), random_state=42).tolist() if len(hits) > 0 else []
+        results[category] = {
+            "count": int(mask.sum()),
+            "pct_of_docs": round(mask.mean() * 100, 3),
+            "examples": [t.encode("ascii", errors="replace").decode("ascii") for t in raw_examples],
+        }
+    dom = results.get("dominant_language", {}).get("count", 0)
+    sub = results.get("subordinate_language", {}).get("count", 0)
+    total = dom + sub
+    ratio = {
+        "dominant_pct": round(dom / total * 100, 1) if total else None,
+        "subordinate_pct": round(sub / total * 100, 1) if total else None,
+        "interpretation": (
+            "More subordinate language" if sub > dom else
+            "More dominant language" if dom > sub else
+            "Roughly balanced"
+        ) if total else "No data",
+    }
+    return {
+        "analysis": "extract_dominance_patterns",
+        "dataset": dataset,
+        "text_col": text_col,
+        "subreddit_filter": subreddit,
+        "total_docs_sampled": total_docs,
+        "categories": results,
+        "dominance_ratio": ratio,
+        "caveat": (
+            "This analysis counts language patterns in text, not visual image content. "
+            "It reflects how women are described in text, not how they appear in images. "
+            "For image-based analysis use analyze_image_sample."
+        ),
+    }
+# ══════════════════════════════════════════════════════════════════════════════
+# 6. Analysis execution
+# ══════════════════════════════════════════════════════════════════════════════
+def _ts_to_date(series: pd.Series) -> pd.Series:
+    return pd.to_datetime(series, unit="s", utc=True)
+def _normalize_filters(
+    filters: dict | None = None,
+    subreddit: str | None = None,
+    date_from: str | None = None,
+    date_to: str | None = None,
+    min_score: float | None = None,
+) -> dict:
+    merged = dict(filters or {})
+    if subreddit:
+        merged["subreddit"] = subreddit
+    if date_from:
+        merged["date_from"] = date_from
+    if date_to:
+        merged["date_to"] = date_to
+    if min_score is not None:
+        merged["min_score"] = min_score
+    return merged
+def _apply_filters(df: pd.DataFrame, filters: dict | None = None) -> pd.DataFrame:
+    if not filters:
+        return df
+    filtered = df
+    if filters.get("subreddit") and "subreddit" in filtered.columns:
+        filtered = filtered[filtered["subreddit"] == filters["subreddit"]]
+    if filters.get("author") and "author" in filtered.columns:
+        filtered = filtered[filtered["author"] == filters["author"]]
+    if filters.get("min_score") is not None and "score" in filtered.columns:
+        filtered = filtered.assign(score=pd.to_numeric(filtered["score"], errors="coerce"))
+        filtered = filtered[filtered["score"] >= filters["min_score"]]
+    if ("date_from" in filters or "date_to" in filters) and "created_utc" in filtered.columns:
+        created = _ts_to_date(filtered["created_utc"])
+        if filters.get("date_from"):
+            filtered = filtered[created >= pd.Timestamp(filters["date_from"], tz="UTC")]
+            created = _ts_to_date(filtered["created_utc"])
+        if filters.get("date_to"):
+            filtered = filtered[created <= pd.Timestamp(filters["date_to"], tz="UTC")]
+    return filtered
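Note on date bounds: `pd.Timestamp("2024-01-31", tz="UTC")` is midnight at the start of that day, so a `date_to` given as a bare date keeps only rows at or before 00:00 UTC on that date. If whole-day inclusion is the intent (an assumption; the commit does not say), one option is an exclusive bound at the start of the next day. The sketch uses only the standard library, and `next_day_start` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def next_day_start(date_str: str) -> datetime:
    """Exclusive upper bound covering the whole UTC day named by date_str (YYYY-MM-DD)."""
    start = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return start + timedelta(days=1)


# Made-up post timestamp on the afternoon of the end date.
post_time = datetime(2024, 1, 31, 15, 30, tzinfo=timezone.utc)
midnight = datetime(2024, 1, 31, tzinfo=timezone.utc)

kept_with_midnight_bound = post_time <= midnight
kept_with_next_day_bound = post_time < next_day_start("2024-01-31")

print(kept_with_midnight_bound, kept_with_next_day_bound)
```

A post from the afternoon of the end date is dropped by the midnight bound and kept by the next-day bound.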
+def _save_csv(df: pd.DataFrame, stem: str) -> str:
+    path = OUTPUTS_DIR / f"{stem}.csv"
+    df.to_csv(path, index=False)
+    return str(path)
+def _save_fig(fig, stem: str) -> str:
+    path = OUTPUTS_DIR / f"{stem}.png"
+    pio.write_image(fig, str(path), scale=2)
+    return str(path)
+def count_by_group(dataset: str, group_col: str, top_n: int = 30, filters: dict | None = None) -> dict:
+    """Count rows grouped by a column. Returns sorted table + bar chart."""
+    filter_cols = [c for c in ["subreddit", "author", "score", "created_utc"] if c != group_col]
+    df = _load(dataset, columns=list(dict.fromkeys([group_col] + filter_cols)))
+    df = _apply_filters(df, filters)
+    counts = (
+        df.groupby(group_col, dropna=False)
+        .size().reset_index(name="count")
+        .sort_values("count", ascending=False).head(top_n)
+    )
+    stem = f"count_by_{group_col}_{dataset}"
+    saved_csv = _save_csv(counts, stem)
+    fig = px.bar(
+        counts.sort_values("count"), x="count", y=group_col, orientation="h",
+        title=f"Count by {group_col}", labels={"count": "Count", group_col: group_col},
+    )
+    fig.update_layout(yaxis={"categoryorder": "total ascending"})
+    try:
+        saved_png = _save_fig(fig, stem)
+    except Exception:
+        saved_png = None
+    return {
+        "analysis": "count_by_group", "dataset": dataset, "group_col": group_col,
+        "filters": filters or {}, "total_rows": len(df),
+        "table": counts.to_dict(orient="records"),
+        "saved_csv": saved_csv, "saved_png": saved_png, "plotly_json": fig.to_json(),
+    }
+def trend_over_time(
+    dataset: str, freq: str = "M", group_col: str | None = None,
+    top_groups: int = 8, filters: dict | None = None,
+) -> dict:
+    """Count posts/comments over time, optionally broken out by a grouping column."""
+    cols = ["created_utc"] + ([group_col] if group_col else []) + ["subreddit", "author", "score"]
+    cols = list(dict.fromkeys(cols))
+    df = _load(dataset, columns=cols)
+    df = _apply_filters(df, filters)
+    df = df.assign(period=_ts_to_date(df["created_utc"]).dt.to_period(freq).astype(str))
+    if group_col:
+        top = df[group_col].value_counts().head(top_groups).index.tolist()
+        df = df[df[group_col].isin(top)]
+        counts = (
+            df.groupby(["period", group_col]).size()
+            .reset_index(name="count").sort_values("period")
+        )
+        fig = px.line(counts, x="period", y="count", color=group_col,
+                      title=f"Activity over time by {group_col}")
+    else:
+        counts = (
+            df.groupby("period").size()
+            .reset_index(name="count").sort_values("period")
+        )
+        fig = px.line(counts, x="period", y="count", title="Activity over time")
+    stem = f"trend_{dataset}_{group_col or 'all'}_{freq}"
+    saved_csv = _save_csv(counts, stem)
+    try:
+        saved_png = _save_fig(fig, stem)
+    except Exception:
+        saved_png = None
+    return {
+        "analysis": "trend_over_time", "dataset": dataset, "freq": freq,
+        "group_col": group_col, "filters": filters or {},
+        "table": counts.to_dict(orient="records"),
+        "saved_csv": saved_csv, "saved_png": saved_png, "plotly_json": fig.to_json(),
+    }
+def summary_stats(
+    dataset: str, value_col: str, group_col: str | None = None,
+    top_n: int = 30, filters: dict | None = None,
+) -> dict:
+    """Descriptive statistics for a numeric column, optionally by group."""
+    cols = [value_col] + ([group_col] if group_col else []) + ["subreddit", "author", "score", "created_utc"]
+    cols = list(dict.fromkeys(cols))
+    df = _load(dataset, columns=cols)
+    df = _apply_filters(df, filters)
+    df = df.assign(**{value_col: pd.to_numeric(df[value_col], errors="coerce")})
+    if group_col:
+        stats = (
+            df.groupby(group_col)[value_col]
+            .agg(["count", "mean", "median", "std", "min", "max"])
+            .reset_index().sort_values("mean", ascending=False).head(top_n).round(2)
+        )
+    else:
+        raw = df[value_col].describe().round(2)
+        stats = raw.reset_index()
+        stats.columns = ["stat", "value"]
+    stem = f"stats_{value_col}_{group_col or 'all'}_{dataset}"
+    saved_csv = _save_csv(stats, stem)
+    try:
+        if group_col:
+            fig = px.bar(stats, x=group_col, y="mean", error_y="std",
+                         title=f"{value_col} by {group_col}",
+                         labels={"mean": f"Mean {value_col}"})
+        else:
+            fig = px.histogram(df[value_col].dropna(), nbins=50,
+                               title=f"Distribution of {value_col}",
+                               labels={"value": value_col})
+        saved_png = _save_fig(fig, stem)
+        plotly_json = fig.to_json()
+    except Exception:
+        saved_png = None
+        plotly_json = None
+    return {
+        "analysis": "summary_stats", "dataset": dataset, "value_col": value_col,
+        "group_col": group_col, "filters": filters or {},
+        "n_total": len(df), "n_missing": int(df[value_col].isna().sum()),
+        "table": stats.to_dict(orient="records"),
+        "saved_csv": saved_csv, "saved_png": saved_png, "plotly_json": plotly_json,
+    }
+def top_posts(
+    dataset: str = "posts", n: int = 20,
+    subreddit: str | None = None, text_col: str = "title",
+    filters: dict | None = None,
+) -> dict:
+    """Return the highest-scoring posts, optionally filtered to a subreddit."""
+    filters = _normalize_filters(filters=filters, subreddit=subreddit)
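+    # De-duplicate while preserving order in case text_col repeats a default column.
+    cols = list(dict.fromkeys(c for c in ["subreddit", "author", text_col, "score", "created_utc"] if c))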
+    df = _load(dataset, columns=cols)
+    df = _apply_filters(df, filters)
+    top = df.nlargest(n, "score")[cols].copy()
+    top["date"] = _ts_to_date(top["created_utc"]).dt.strftime("%Y-%m-%d")
+    top = top.drop(columns=["created_utc"])
+    stem = f"top_posts_{subreddit or 'all'}_{dataset}"
+    saved = _save_csv(top, stem)
+    return {
+        "analysis": "top_posts", "dataset": dataset,
+        "subreddit_filter": subreddit, "filters": filters, "n": n,
+        "table": top.fillna("").to_dict(orient="records"), "saved_csv": saved,
+    }
+def text_search(
+    dataset: str, query: str, text_col: str = "body",
+    n: int = 20, case_sensitive: bool = False,
+    subreddit: str | None = None, filters: dict | None = None,
+) -> dict:
+    """Search for a string pattern in a text column."""
+    filters = _normalize_filters(filters=filters, subreddit=subreddit)
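+    # De-duplicate while preserving order in case text_col repeats a default column.
+    cols = list(dict.fromkeys(c for c in ["subreddit", "author", text_col, "score", "created_utc"] if c))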
+    df = _load(dataset, columns=cols)
+    df = _apply_filters(df, filters)
+    mask = df[text_col].fillna("").str.contains(query, case=case_sensitive, regex=False)
+    hits = df[mask].nlargest(n, "score").copy()
+    hits["date"] = _ts_to_date(hits["created_utc"]).dt.strftime("%Y-%m-%d")
+    hits = hits.drop(columns=["created_utc"])
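+    # Keep only alphanumerics in the filename stem so a query containing "/" or similar cannot break the output path.
+    safe_query = "".join(ch if ch.isalnum() else "_" for ch in query[:30])
+    stem = f"search_{safe_query}_{dataset}"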
+    saved = _save_csv(hits, stem)
+    return {
+        "analysis": "text_search", "dataset": dataset, "query": query,
+        "text_col": text_col, "filters": filters,
+        "n_matches": int(mask.sum()), "n_returned": len(hits),
+        "table": hits.fillna("").to_dict(orient="records"), "saved_csv": saved,
+    }
+def word_freq(
+    dataset: str = "corpus_clean", text_col: str = "text_cleaned",
+    top_n: int = 50, subreddit: str | None = None,
+    min_length: int = 4, filters: dict | None = None,
+) -> dict:
+    """Count word frequencies in a text column."""
+    filters = _normalize_filters(filters=filters, subreddit=subreddit)
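+    # Load "subreddit" whenever a subreddit filter is active, whether it came from the argument or from filters;
+    # otherwise _apply_filters silently skips the filter because the column is missing.
+    needs_subreddit = bool(subreddit or (filters or {}).get("subreddit"))
+    cols = list(dict.fromkeys([text_col] + (["subreddit"] if needs_subreddit else []) + ["author", "score", "created_utc"]))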
+    df = _load(dataset, columns=cols)
+    df = _apply_filters(df, filters)
+    stop = {
+        "the","and","for","that","with","this","you","are","was","not",
+        "have","from","they","will","what","been","when","your","more",
+        "just","about","like","there","were","would","into","than","then",
+        "some","also","very","only","over","back","can","out","all","but",
+        "one","had","has","its","which","their","time","our","who","may",
+        "after","other","these","those","such","each","him","her","his",
+        "she","how","did","being","now","way","any","too","much","even",
+        "get","got","got","could","should","make","made","said","still",
+        "here","because","really","know","think","going","reddit","post",
+        "comment","deleted","removed",
+    }
+    words = (
+        df[text_col].fillna("").str.lower()
+        .str.replace(r"[^a-z\s]", " ", regex=True).str.split().explode()
+    )
+    words = words[words.str.len() >= min_length]
+    words = words[~words.isin(stop)]
+    counts = words.value_counts().head(top_n).reset_index()
+    counts.columns = ["word", "count"]
+    stem = f"wordfreq_{text_col}_{subreddit or 'all'}_{dataset}"
+    saved_csv = _save_csv(counts, stem)
+    fig = px.bar(
+        counts.head(30).sort_values("count"), x="count", y="word", orientation="h",
+        title="Top words by frequency", labels={"count": "Count", "word": "Word"},
+    )
+    fig.update_layout(yaxis={"categoryorder": "total ascending"})
+    try:
+        saved_png = _save_fig(fig, stem)
+    except Exception:
+        saved_png = None
+    return {
+        "analysis": "word_freq", "dataset": dataset, "text_col": text_col,
+        "subreddit_filter": subreddit, "filters": filters, "total_docs": len(df),
+        "table": counts.to_dict(orient="records"),
+        "saved_csv": saved_csv, "saved_png": saved_png, "plotly_json": fig.to_json(),
+    }
+def compare_groups(
+    dataset: str, group_col: str, value_col: str,
+    groups: list[str] | None = None, filters: dict | None = None,
+) -> dict:
+    """Compare a numeric value across groups with descriptive stats."""
+    cols = list(dict.fromkeys([group_col, value_col, "subreddit", "author", "score", "created_utc"]))
+    df = _load(dataset, columns=cols)
+    df = _apply_filters(df, filters)
+    df = df.assign(**{value_col: pd.to_numeric(df[value_col], errors="coerce")})
+    if groups:
+        df = df[df[group_col].isin(groups)]
+    stats = (
+        df.groupby(group_col)[value_col]
+        .agg(count="count", mean="mean", median="median", std="std")
+        .reset_index().sort_values("median", ascending=False).round(3)
+    )
+    stem = f"compare_{group_col}_{value_col}_{dataset}"
+    saved_csv = _save_csv(stats, stem)
+    fig = px.bar(stats, x=group_col, y="median", error_y="std",
+                 title=f"{value_col} by {group_col} (median ± std)",
+                 labels={"median": f"Median {value_col}"})
+    try:
+        saved_png = _save_fig(fig, stem)
+    except Exception:
+        saved_png = None
+    return {
+        "analysis": "compare_groups", "dataset": dataset,
+        "group_col": group_col, "value_col": value_col,
+        "filters": filters or {}, "groups_compared": stats[group_col].tolist(),
+        "table": stats.to_dict(orient="records"),
+        "saved_csv": saved_csv, "saved_png": saved_png, "plotly_json": fig.to_json(),
+    }
+# ══════════════════════════════════════════════════════════════════════════════
+# 7. Core agent loop
+# ══════════════════════════════════════════════════════════════════════════════
+MODEL = "claude-opus-4-6"
+TOOLS = [
+    {
+        "name": "list_datasets",
+        "description": (
+            "List cached dataset metadata: paths, row counts, columns, subreddits, and date ranges. "
+            "Use this to inspect the available data without loading full tables."
+        ),
+        "input_schema": {
+            "type": "object",
+            "properties": {"refresh": {"type": "boolean", "default": False,
+                                       "description": "Recompute metadata from source parquets instead of using the cache."}},
+            "required": [],
+        },
+    },
+    {
+        "name": "sample_rows",
+        "description": "Return a small deterministic preview of rows from a dataset, optionally filtered and column-limited.",
+        "input_schema": {
+            "type": "object",
+            "properties": {
+                "dataset": {"type": "string", "enum": ["posts", "comments", "corpus_clean", "titles"]},
+                "n": {"type": "integer", "default": 5},
+                "filters": {"type": "object", "description": "Optional equality filters, e.g. {\"subreddit\": \"GOONED\"}"},
+                "columns": {"type": "array", "items": {"type": "string"}, "description": "Optional subset of columns to preview."},
+            },
+            "required": ["dataset"],
+        },
+    },
+    {
+        "name": "count_by_group",
+        "description": "Count rows in a dataset grouped by one column, with optional shared filters.",
+        "input_schema": {
+            "type": "object",
+            "properties": {
+                "dataset": {"type": "string", "enum": ["posts", "comments", "corpus_clean", "titles"]},
+                "group_col": {"type": "string"},
+                "top_n": {"type": "integer", "default": 30},
+                "filters": {"type": "object"},
+            },
+            "required": ["dataset", "group_col"],
+        },
+    },
+    {
+        "name": "trend_over_time",
+        "description": "Count rows over time, optionally split by one grouping column, with optional shared filters.",
+        "input_schema": {
+            "type": "object",
+            "properties": {
+                "dataset": {"type": "string", "enum": ["posts", "comments", "corpus_clean", "titles"]},
+                "freq": {"type": "string", "enum": ["D", "W", "M", "Q", "Y"], "default": "M"},
+                "group_col": {"type": "string"},
+                "top_groups": {"type": "integer", "default": 8},
+                "filters": {"type": "object"},
+            },
+            "required": ["dataset"],
+        },
+    },
+    {
+        "name": "summary_stats",
+        "description": "Descriptive statistics for a numeric column, optionally grouped and filtered.",
+        "input_schema": {
+            "type": "object",
+            "properties": {
+                "dataset": {"type": "string", "enum": ["posts", "comments", "corpus_clean", "titles"]},
+                "value_col": {"type": "string"},
+                "group_col": {"type": "string"},
+                "top_n": {"type": "integer", "default": 30},
+                "filters": {"type": "object"},
+            },
+            "required": ["dataset", "value_col"],
+        },
+    },
+    {
+        "name": "top_posts",
+        "description": "Return the highest-scoring posts, optionally filtered by subreddit or shared filters.",
+        "input_schema": {
+            "type": "object",
+            "properties": {
+                "dataset": {"type": "string", "enum": ["posts", "titles"], "default": "posts"},
+                "n": {"type": "integer", "default": 20},
+                "subreddit": {"type": "string"},
+                "text_col": {"type": "string", "default": "title"},
+                "filters": {"type": "object"},
+            },
+            "required": [],
+        },
+    },
+    {
+        "name": "text_search",
+        "description": "Search for a phrase in a text column and return top matching rows.",
+        "input_schema": {
+            "type": "object",
+            "properties": {
+                "dataset": {"type": "string", "enum": ["posts", "comments", "corpus_clean", "titles"]},
+                "query": {"type": "string"},
+                "text_col": {"type": "string", "default": "body"},
+                "n": {"type": "integer", "default": 20},
+                "subreddit": {"type": "string"},
+                "filters": {"type": "object"},
+            },
+            "required": ["dataset", "query"],
+        },
+    },
+    {
+        "name": "word_freq",
+        "description": "Count word frequencies in a text column with optional shared filters.",
+        "input_schema": {
+            "type": "object",
+            "properties": {
+                "dataset": {"type": "string", "enum": ["posts", "comments", "corpus_clean", "titles"], "default": "corpus_clean"},
+                "text_col": {"type": "string", "default": "text_cleaned"},
+                "top_n": {"type": "integer", "default": 50},
+                "subreddit": {"type": "string"},
+                "min_length": {"type": "integer", "default": 4},
+                "filters": {"type": "object"},
+            },
+            "required": [],
+        },
+    },
+    {
+        "name": "compare_groups",
+        "description": "Compare one numeric column across groups with optional shared filters.",
+        "input_schema": {
+            "type": "object",
+            "properties": {
+                "dataset": {"type": "string", "enum": ["posts", "comments", "corpus_clean", "titles"]},
+                "group_col": {"type": "string"},
+                "value_col": {"type": "string"},
+                "groups": {"type": "array", "items": {"type": "string"}},
+                "filters": {"type": "object"},
+            },
+            "required": ["dataset", "group_col", "value_col"],
+        },
+    },
+    {
+        "name": "extract_frequency_patterns",
+        "description": "Mine text for frequency and duration language across the full dataset.",
+        "input_schema": {
+            "type": "object",
+            "properties": {
+                "dataset": {"type": "string", "enum": ["posts", "comments", "corpus_clean"], "default": "comments"},
+                "text_col": {"type": "string", "default": "body"},
+                "subreddit": {"type": "string"},
+                "n_examples": {"type": "integer", "default": 5},
+                "sample_size": {"type": "integer", "default": 5000000},
+            },
+            "required": [],
+        },
+    },
+    {
+        "name": "extract_dominance_patterns",
+        "description": "Count dominant vs subordinate language in text, not images.",
+        "input_schema": {
+            "type": "object",
+            "properties": {
+                "dataset": {"type": "string", "enum": ["posts", "comments", "corpus_clean"], "default": "comments"},
+                "text_col": {"type": "string", "default": "body"},
+                "subreddit": {"type": "string"},
+                "sample_size": {"type": "integer", "default": 5000000},
+            },
+            "required": [],
+        },
+    },
+    {
+        "name": "analyze_image_sample",
+        "description": "Run vision coding on a sample of image posts using Qwen2-VL via Together AI (no content filters). Always provide a coding_scheme for research use.",
+        "input_schema": {
+            "type": "object",
+            "properties": {
+                "question": {"type": "string"},
+                "subreddit": {"type": "string"},
+                "n_sample": {"type": "integer", "default": 100, "description": "Number of images to code. No hard cap — set to 500+ for large analyses."},
+                "coding_scheme": {"type": "object", "description": "Dict of {label: definition}. Always provide this for research questions."},
+            },
+            "required": ["question"],
+        },
+    },
+    {
+        "name": "export_reliability_sample",
+        "description": "Export a stratified random sample of coded images for human validation. Run after analyze_image_sample.",
+        "input_schema": {
+            "type": "object",
+            "properties": {
+                "source_csv": {"type": "string", "description": "Path to image_analysis CSV. Defaults to most recent."},
+                "n": {"type": "integer", "default": 200},
+                "random_state": {"type": "integer", "default": 42},
+            },
+            "required": [],
+        },
+    },
+    {
+        "name": "compute_reliability",
+        "description": "Compute Cohen's kappa between model and human codes after the human_label column has been filled in.",
+        "input_schema": {
+            "type": "object",
+            "properties": {
+                "human_csv_path": {"type": "string", "description": "Path to completed reliability_sample.csv. Defaults to outputs/reliability_sample.csv."},
+            },
+            "required": [],
+        },
+    },
+]
+TOOL_FN_MAP = {
+    "list_datasets":               lambda args: list_datasets(**args),
+    "sample_rows":                 lambda args: sample_rows(**args),
+    "count_by_group":              lambda args: count_by_group(**args),
+    "trend_over_time":             lambda args: trend_over_time(**args),
+    "summary_stats":               lambda args: summary_stats(**args),
+    "top_posts":                   lambda args: top_posts(**args),
+    "text_search":                 lambda args: text_search(**args),
+    "word_freq":                   lambda args: word_freq(**args),
+    "compare_groups":              lambda args: compare_groups(**args),
+    "extract_frequency_patterns":  lambda args: extract_frequency_patterns(**args),
+    "extract_dominance_patterns":  lambda args: extract_dominance_patterns(**args),
+    "analyze_image_sample":        lambda args: analyze_image_sample(**args),
+    "export_reliability_sample":   lambda args: export_reliability_sample(**args),
+    "compute_reliability":         lambda args: compute_reliability(**args),
+}
+def _safe_str(obj: object) -> object:
+    """Recursively replace non-ASCII characters in strings with backslash escapes.
+    Dict values and list items are walked; dict keys and other types pass through unchanged."""
+    if isinstance(obj, str):
+        return obj.encode("ascii", errors="backslashreplace").decode("ascii")
+    if isinstance(obj, dict):
+        return {k: _safe_str(v) for k, v in obj.items()}
+    if isinstance(obj, list):
+        return [_safe_str(item) for item in obj]
+    return obj
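Review note: the sketch below is a standalone illustration of what the `backslashreplace` error handler used by `_safe_str` does to non-ASCII text. The helper name `escape_non_ascii` and the sample payload are made up for this example and are not part of the commit.

```python
def escape_non_ascii(obj: object) -> object:
    """Illustrative copy of the escaping logic: walk dicts and lists, escape strings."""
    if isinstance(obj, str):
        # Characters outside ASCII become literal backslash sequences.
        return obj.encode("ascii", errors="backslashreplace").decode("ascii")
    if isinstance(obj, dict):
        return {k: escape_non_ascii(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [escape_non_ascii(item) for item in obj]
    return obj


# Hypothetical payload, for illustration only.
sample = {"title": "café ± 2", "tags": ["naïve", "plain"], "score": 3}
escaped = escape_non_ascii(sample)

print(escaped["title"])    # prints: caf\xe9 \xb1 2
print(escaped["tags"][0])  # prints: na\xefve
```

The escaped form is a literal backslash sequence, so anything reading the payload downstream sees `caf\xe9` rather than `café`. Whether that trade-off is acceptable depends on how much the analysis relies on non-ASCII text.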
+def _compact_result(result: object) -> dict:
+    if not isinstance(result, dict):
+        return {"value": result}
+    compact = {}
+    for key in ("analysis", "dataset", "group_col", "value_col", "query", "filters",
+                "n_matches", "n_returned", "n_total", "groups_compared", "saved_csv", "saved_png", "error"):
+        if key in result and result.get(key) is not None:
+            compact[key] = result[key]
+    table = result.get("table")
+    if isinstance(table, list):
+        compact["table_preview"] = table[:3]
+        compact["table_rows"] = len(table)
+    return compact
+def _conversation_state_summary(turns: list[dict] | None) -> str:
+    if not turns:
+        return "No prior analytical state."
+    summary = []
+    for idx, turn in enumerate(turns[-3:], start=1):
+        summary.append({
+            "turn": idx,
+            "question": _safe_str(turn.get("question", "")),
+            "answer": _safe_str(turn.get("answer", "")),
+            "tool_calls": [
+                {"tool": tc.get("tool"), "args": _safe_str(tc.get("args", {})),
+                 "result": _safe_str(_compact_result(tc.get("result")))}
+                for tc in turn.get("tool_calls", [])
+            ],
+            "artifacts": turn.get("artifacts", []),
+        })
+    return json.dumps(summary, default=str, indent=2)
+def _tool_names(tools: list[dict]) -> list[str]:
+    return [t["name"] for t in tools]
+def _tool_subset(allowed_tools: list[str]) -> list[dict]:
+    allowed = set(allowed_tools)
+    return [t for t in TOOLS if t["name"] in allowed]
+def _system_prompt(route_mode: str, route_guidance: str, conversation_state: str) -> str:
+    metadata = get_dataset_metadata()
+    dataset_lines = []
+    for name, info in metadata.items():
+        if not info.get("available"):
+            continue
+        date_range = info.get("date_range") or {}
+        dataset_lines.append(
+            f"- {name}: {info.get('rows')} rows; columns={list(info.get('columns', {}).keys())}; "
+            f"date_range={date_range or 'n/a'}"
+        )
+    dataset_summary = "\n".join(dataset_lines)
+    return f"""You are a question-driven data analysis agent working over local Reddit datasets.
+Available dataset metadata:
+{dataset_summary}
+Current route mode: {route_mode}
+Route guidance: {route_guidance}
+Prior analytical state:
+{conversation_state}
+Rules:
+1. Use the route guidance and only the provided tools.
+2. Inspect metadata or row previews before making assumptions when the schema is unclear.
+3. Run actual tools for numbers; do not guess.
+4. Prefer one minimal reproducible tool path over exploratory tool spam.
+5. Distinguish direct findings from caveats.
+6. If prior turns already produced a relevant result, reuse that context instead of recomputing unless the user asks for a change.
+7. Answer with this structure: direct answer, what was analysed, method, caveats.
+8. ALWAYS prefer tools that produce charts (trend_over_time, count_by_group, compare_groups, summary_stats, word_freq) over plain text summaries when the question is quantitative. Every numeric answer should have a chart.
+9. For questions about images or visual content, use analyze_image_sample. It reads from raw CSV files with image URLs — no separate setup needed. ALWAYS generate an explicit coding_scheme dict (with label names as keys and definitions as values) before calling this tool — never leave coding_scheme null for a research question.
+10. After a large image coding run, offer to run export_reliability_sample to generate a human validation set, then compute_reliability once the user has filled in the human_label column.
+11. The dataset covers 30 subreddits including GOONED, GOONEDISBACK, GoonCaves, girlgooners, and more. Use subreddit filters to drill into specific communities."""
+def run_agent(
+    question: str,
+    history: list[dict] | None = None,
+    turns: list[dict] | None = None,
+    analysis_context: list[dict] | None = None,
+    conversation_state: list[dict] | None = None,
+) -> dict:
+    """Run the agent for a user question with deterministic routing and structured prior state."""
+    try:
+        return _run_agent_inner(question, history, turns, analysis_context, conversation_state)
+    except UnicodeEncodeError as exc:
+        tb = _traceback.format_exc()
+        raise RuntimeError(
+            f"Unicode encoding error (non-ASCII character in data pipeline).\n\n"
+            f"Detail: {exc}\n\nTraceback:\n{tb}"
+        ) from exc
+def _run_agent_inner(
+    question: str,
+    history: list[dict] | None = None,
+    turns: list[dict] | None = None,
+    analysis_context: list[dict] | None = None,
+    conversation_state: list[dict] | None = None,
+) -> dict:
+    client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
+    prior_turns = turns or analysis_context or conversation_state or []
+    route = route_question(question)
+    available_tools = _tool_subset(route.allowed_tools)
+    safe_history = [_safe_str(msg) for msg in (history or [])]
+    messages = safe_history
+    messages.append({"role": "user", "content": _safe_str(question)})
+    tool_calls_log = []
+    plotly_jsons = []
+    total_input_tokens = 0
+    total_output_tokens = 0
+    system = _system_prompt(
+        route_mode=route.mode,
+        route_guidance=route.guidance,
+        conversation_state=_conversation_state_summary(prior_turns),
+    )
+    while True:
+        safe_messages = _safe_str(messages)
+        safe_system = _safe_str(system)
+        try:
+            response = client.messages.create(
+                model=MODEL,
+                max_tokens=4096,
+                system=safe_system,
+                tools=available_tools,
+                messages=safe_messages,
+            )
+        except UnicodeEncodeError:
+            stripped_messages = json.loads(json.dumps(safe_messages, default=str, ensure_ascii=True))
+            stripped_system = safe_system.encode("ascii", errors="ignore").decode("ascii")
+            response = client.messages.create(
+                model=MODEL,
+                max_tokens=4096,
+                system=stripped_system,
+                tools=available_tools,
+                messages=stripped_messages,
+            )
+        text_parts = [block.text for block in response.content if block.type == "text"]
+        tool_use_blocks = [block for block in response.content if block.type == "tool_use"]
+        total_input_tokens  += getattr(response.usage, "input_tokens",  0) or 0
+        total_output_tokens += getattr(response.usage, "output_tokens", 0) or 0
+        if response.stop_reason == "end_turn" or not tool_use_blocks:
+            # Cost estimate uses hard-coded rates ($15/M input, $75/M output tokens); verify them against current pricing for MODEL.
+            cost_usd = (total_input_tokens / 1_000_000 * 15.0) + (total_output_tokens / 1_000_000 * 75.0)
+            return {
+                "answer": "\n".join(text_parts).strip(),
+                "tool_calls": tool_calls_log,
+                "plotly_json": plotly_jsons[-1] if plotly_jsons else None,
+                "plotly_jsons": plotly_jsons,
+                "route": route.mode,
+                "allowed_tools": _tool_names(available_tools),
+                "usage": {
+                    "input_tokens":  total_input_tokens,
+                    "output_tokens": total_output_tokens,
+                    "cost_usd":      round(cost_usd, 4),
+                },
+            }
+        tool_results = []
+        for block in tool_use_blocks:
+            fn = TOOL_FN_MAP.get(block.name)
+            if fn is None:
+                result = {"error": f"Unknown tool: {block.name}"}
+            else:
+                try:
+                    result = fn(block.input)
+                    if isinstance(result, dict) and result.get("plotly_json"):
+                        plotly_jsons.append(result["plotly_json"])
+                except Exception as exc:
+                    result = {"error": str(exc)}
+            safe_result = _safe_str(result)
+            tool_calls_log.append({"tool": block.name, "args": block.input, "result": result})
+            tool_results.append({
+                "type": "tool_result",
+                "tool_use_id": block.id,
+                "content": json.dumps(safe_result, default=str, ensure_ascii=True),
+            })
+        assistant_content = [_safe_str(block.model_dump()) for block in response.content]
+        messages.append({"role": "assistant", "content": assistant_content})
+        messages.append({"role": "user", "content": tool_results})
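Review note: the tool-execution step of the loop above can be exercised without an API client. The sketch below is a minimal approximation under stated assumptions: `dispatch_tool`, `to_tool_result`, `add_numbers`, and the `toolu_demo` id are hypothetical names introduced here, and the real loop additionally collects Plotly JSON and logs each call.

```python
import json


def dispatch_tool(name: str, args: dict, tool_map: dict) -> dict:
    """Look up a tool by name and run it, turning failures into error payloads."""
    fn = tool_map.get(name)
    if fn is None:
        return {"error": f"Unknown tool: {name}"}
    try:
        return fn(args)
    except Exception as exc:  # mirror the loop: report the error instead of raising
        return {"error": str(exc)}


def to_tool_result(tool_use_id: str, result: dict) -> dict:
    """Wrap a result as a tool_result content block with an ASCII-only JSON body."""
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,
        "content": json.dumps(result, default=str, ensure_ascii=True),
    }


# Hypothetical stand-in tool, for illustration only.
def add_numbers(a: int, b: int) -> dict:
    return {"analysis": "add_numbers", "value": a + b}


demo_tools = {"add_numbers": lambda args: add_numbers(**args)}

ok = dispatch_tool("add_numbers", {"a": 2, "b": 3}, demo_tools)
missing = dispatch_tool("no_such_tool", {}, demo_tools)
bad_args = dispatch_tool("add_numbers", {"a": 2}, demo_tools)

block = to_tool_result("toolu_demo", ok)
print(block["content"])
```

Returning errors as ordinary results keeps the conversation going and lets the model see what went wrong, at the cost of hiding tracebacks from the caller. Note also that the loop above has no iteration cap, so a model that keeps requesting tools would keep it running.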

agent/rebuild_parquets.py ADDED

	@@ -0,0 +1,198 @@

+"""
+Rebuild posts.parquet and comments.parquet from raw CSVs.
+Strategy:
+  - GOONED: use GOONED_submissions.csv + GOONED_comments.csv (full combined dumps)
+  - All other subreddits: concatenate their yearly CSV files
+  - Deduplicate by id within each type
+  - Write to data/posts_full.parquet and data/comments_full.parquet
+Run from the project root (Goon/):
+    python agent/rebuild_parquets.py
+Expect ~5-10 minutes and 4-8GB RAM for the comments file.
+"""
+import gc
+import glob
+import sys
+from pathlib import Path
+import pandas as pd
+import pyarrow as pa
+import pyarrow.parquet as pq
+DATA_DIR = Path("data")
+OUT_POSTS = DATA_DIR / "posts_full.parquet"
+OUT_COMMENTS = DATA_DIR / "comments_full.parquet"
+SUBMISSION_COLS = [
+    "id", "subreddit", "author", "title", "selftext",
+    "score", "num_comments", "upvote_ratio", "created_utc",
+    "author_flair_text", "url", "domain", "is_self", "is_video",
+]
+COMMENT_COLS = [
+    "id", "subreddit", "author", "body",
+    "score", "created_utc", "author_flair_text",
+    "parent_id", "is_submitter",
+]
+CHUNK_SIZE = 500_000
+def safe_read(path: Path, usecols: list[str], **kwargs) -> pd.DataFrame:
+    available = pd.read_csv(path, nrows=0).columns.tolist()
+    cols = [c for c in usecols if c in available]
+    return pd.read_csv(path, usecols=cols, low_memory=False, **kwargs)
+def _normalise(df: pd.DataFrame) -> pd.DataFrame:
+    """Force consistent dtypes so all chunks share the same Arrow schema."""
+    if "created_utc" in df.columns:
+        df["created_utc"] = pd.to_numeric(df["created_utc"], errors="coerce").astype("float64")
+    if "score" in df.columns:
+        df["score"] = pd.to_numeric(df["score"], errors="coerce").astype("Int64")
+    if "num_comments" in df.columns:
+        df["num_comments"] = pd.to_numeric(df["num_comments"], errors="coerce").astype("Int64")
+    if "upvote_ratio" in df.columns:
+        df["upvote_ratio"] = pd.to_numeric(df["upvote_ratio"], errors="coerce").astype("float64")
+    for bool_col in ("is_self", "is_video", "is_submitter"):
+        if bool_col in df.columns:
+            df[bool_col] = df[bool_col].astype("boolean")
+    # Force ALL remaining columns to string (handles NaN-only cols inferred as float)
+    for col in df.columns:
+        if df[col].dtype not in ("float64", "Int64", "boolean"):
+            df[col] = df[col].astype("string")
+    return df
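+def _safe_table(df: pd.DataFrame, schema: pa.Schema) -> pa.Table:
+    """Cast a DataFrame to an existing Arrow schema: align column names/order to the schema (absent columns become null, extras are dropped), then coerce dtypes."""
+    table = pa.Table.from_pandas(df.reindex(columns=schema.names), preserve_index=False)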
+    return table.cast(schema, safe=False)
+def build_posts():
+    print("=== Building posts_full.parquet ===")
+    writer = None
+    schema = None
+    seen_ids: set = set()
+    total = 0
+    # GOONED: use full combined file
+    gooned_path = DATA_DIR / "GOONED_submissions.csv"
+    print(f"  Reading {gooned_path.name}...", flush=True)
+    for chunk in pd.read_csv(
+        gooned_path,
+        usecols=lambda c: c in SUBMISSION_COLS,
+        chunksize=CHUNK_SIZE,
+        low_memory=False,
+    ):
+        chunk = chunk[~chunk["id"].isin(seen_ids)].drop_duplicates("id")
+        seen_ids.update(chunk["id"].tolist())
+        chunk = _normalise(chunk)
+        if writer is None:
+            table = pa.Table.from_pandas(chunk, preserve_index=False)
+            schema = table.schema
+            writer = pq.ParquetWriter(OUT_POSTS, schema, compression="snappy")
+        else:
+            table = _safe_table(chunk, schema)
+        writer.write_table(table)
+        total += len(chunk)
+    print(f"    GOONED submissions: {total:,}", flush=True)
+    # All other subreddits: yearly files, skip GOONED
+    yearly_files = sorted(DATA_DIR.glob("*_submissions_20*.csv"))
+    yearly_files = [f for f in yearly_files if not f.name.startswith("GOONED_")]
+    for path in yearly_files:
+        try:
+            chunk = safe_read(path, SUBMISSION_COLS)
+        except Exception as e:
+            print(f"  SKIP {path.name}: {e}", flush=True)
+            continue
+        if "id" not in chunk.columns:
+            continue
+        chunk = chunk[~chunk["id"].isin(seen_ids)].drop_duplicates("id")
+        seen_ids.update(chunk["id"].tolist())
+        chunk = _normalise(chunk)
+        table = _safe_table(chunk, schema) if schema else pa.Table.from_pandas(chunk, preserve_index=False)
+        if writer is None:
+            schema = table.schema
+            writer = pq.ParquetWriter(OUT_POSTS, schema, compression="snappy")
+        writer.write_table(table)
+        total += len(chunk)
+    if writer:
+        writer.close()
+    print(f"  Total posts written: {total:,}")
+    print(f"  Saved: {OUT_POSTS}")
+def build_comments():
+    print("=== Building comments_full.parquet ===")
+    writer = None
+    schema = None
+    total = 0
+    # GOONED: use full combined file (35M rows — chunked, no id tracking to save RAM)
+    gooned_path = DATA_DIR / "GOONED_comments.csv"
+    print(f"  Reading {gooned_path.name} in chunks...", flush=True)
+    chunk_n = 0
+    for chunk in pd.read_csv(
+        gooned_path,
+        usecols=lambda c: c in COMMENT_COLS,
+        chunksize=CHUNK_SIZE,
+        low_memory=False,
+    ):
+        chunk = chunk.drop_duplicates("id")
+        chunk = _normalise(chunk)
+        if writer is None:
+            table = pa.Table.from_pandas(chunk, preserve_index=False)
+            schema = table.schema
+            writer = pq.ParquetWriter(OUT_COMMENTS, schema, compression="snappy")
+        else:
+            table = _safe_table(chunk, schema)
+        writer.write_table(table)
+        total += len(chunk)
+        chunk_n += 1
+        if chunk_n % 10 == 0:
+            print(f"    ...{total:,} rows processed", flush=True)
+        del chunk
+        gc.collect()
+    print(f"    GOONED comments done: {total:,}", flush=True)
+    # All other subreddits: yearly files (no GOONED overlap since we used the combined file)
+    yearly_files = sorted(DATA_DIR.glob("*_comments_20*.csv"))
+    yearly_files = [f for f in yearly_files if not f.name.startswith("GOONED_")]
+    for path in yearly_files:
+        try:
+            chunk = safe_read(path, COMMENT_COLS)
+        except Exception as e:
+            print(f"  SKIP {path.name}: {e}", flush=True)
+            continue
+        if "id" not in chunk.columns:
+            continue
+        chunk = chunk.drop_duplicates("id")
+        chunk = _normalise(chunk)
+        table = _safe_table(chunk, schema) if schema else pa.Table.from_pandas(chunk, preserve_index=False)
+        if writer is None:
+            schema = table.schema
+            writer = pq.ParquetWriter(OUT_COMMENTS, schema, compression="snappy")
+        writer.write_table(table)
+        total += len(chunk)
+        del chunk
+        gc.collect()
+    if writer:
+        writer.close()
+    print(f"  Total comments written: {total:,}")
+    print(f"  Saved: {OUT_COMMENTS}")
+if __name__ == "__main__":
+    build_posts()
+    build_comments()
+    print("\nDone. Update DATA_DIR in agent/analysis/inspect_data.py to use posts_full and comments_full.")
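
Note on the deduplication strategy in `build_posts`: IDs are tracked in a `seen_ids` set across chunks and files, so the first occurrence of each post wins, whereas `build_comments` only deduplicates within a chunk to save RAM. The following is a minimal, standard-library sketch of that cross-chunk pattern, not the script's actual code. The `dedupe_chunks` helper and the `Row` alias are hypothetical names introduced for illustration, and rows are assumed to be plain dicts rather than DataFrames.

```python
from typing import Iterable, Iterator

# Hypothetical row type for this sketch: one CSV record as a dict of strings.
Row = dict[str, str]


def dedupe_chunks(chunks: Iterable[list[Row]], key: str = "id") -> Iterator[list[Row]]:
    """Yield each chunk minus rows whose key appeared earlier in any chunk."""
    seen: set[str] = set()  # grows with the number of unique keys, like seen_ids
    for chunk in chunks:
        fresh: list[Row] = []
        for row in chunk:
            row_id = row[key]
            if row_id in seen:
                continue  # duplicate within this chunk or from an earlier one
            seen.add(row_id)
            fresh.append(row)
        yield fresh


if __name__ == "__main__":
    chunks = [
        [{"id": "a1", "title": "first"}, {"id": "a2", "title": "second"}],
        [{"id": "a2", "title": "repeat"}, {"id": "a3", "title": "third"}],
    ]
    kept = [row["id"] for chunk in dedupe_chunks(chunks) for row in chunk]
    print(kept)
```

The trade-off is memory: the set holds every unique ID seen so far, which is presumably why the comments path, at roughly 35M rows, skips it.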

agent/requirements.txt ADDED Viewed

	@@ -0,0 +1,9 @@

+anthropic>=0.40.0
+together>=1.3.0
+openai>=1.0.0
+scikit-learn>=1.4.0
+pandas>=2.0.0
+pyarrow>=14.0.0
+streamlit>=1.35.0
+python-dotenv>=1.0.0
+plotly>=5.20.0

app.py ADDED Viewed

	@@ -0,0 +1,1059 @@

+"""
+Grasping Gooning — analysis agent UI
+Run: streamlit run app.py  (from /Users/binx/Desktop/Goon/)
+"""
+from __future__ import annotations
+import inspect
+import json
+import os
+import random
+import sys
+import threading
+import time
+from pathlib import Path
+import pandas as pd
+import plotly.io as pio
+import streamlit as st
+from dotenv import load_dotenv
+# ── paths ──────────────────────────────────────────────────────────────────
+ROOT = Path(__file__).parent
+sys.path.insert(0, str(ROOT / "agent"))
+load_dotenv(ROOT / "agent" / ".env")
+from analysis import run_agent, list_datasets
+# ── page config ────────────────────────────────────────────────────────────
+st.set_page_config(
+    page_title="Grasping Gooning",
+    layout="wide",
+    initial_sidebar_state="expanded",
+)
+@st.cache_data(show_spinner=False)
+def load_post_samples(n: int = 120) -> list[dict]:
+    """Random sample of real post titles for the loading slideshow."""
+    try:
+        import pyarrow.dataset as _ds
+        _path = ROOT / "data" / "posts.parquet"
+        if not _path.exists():
+            return []
+        d = _ds.dataset(str(_path), format="parquet")
+        t = d.scanner(columns=["subreddit", "title"]).head(40000).to_pandas()
+        mask = (
+            t["title"].str.len() > 30
+        ) & (
+            t["title"].str.len() < 180
+        ) & (
+            ~t["title"].str.startswith("[", na=False)
+        )
+        sample = t[mask].sample(min(n, mask.sum()), random_state=None)
+        return sample[["subreddit", "title"]].to_dict(orient="records")
+    except Exception:
+        return []
+LOADING_HINTS = [
+    "you're so close…",
+    "keep going…",
+    "deeper…",
+    "almost there…",
+    "don't stop now…",
+    "just a bit more…",
+    "stay with it…",
+    "right there…",
+    "edge of something…",
+    "hold on…",
+    "so close…",
+    "don't stop…",
+]
+# ── global CSS ─────────────────────────────────────────────────────────────
+st.markdown("""
+<style>
+@import url('https://fonts.googleapis.com/icon?family=Material+Icons');
+/* ---- tokens ---- */
+:root {
+  --bg:      #ffffff;
+  --surface: #f5f5f5;
+  --border:  #e0e0e0;
+  --divider: #ebebeb;
+  --ink:     #000000;
+  --body:    #222222;
+  --mid:     #555555;
+  --muted:   #888888;
+  --faint:   #aaaaaa;
+}
+/* ---- base ---- */
+html, body, .stApp,
+[data-testid="stAppViewContainer"],
+[data-testid="stMain"],
+[data-testid="stHeader"],
+[data-testid="stToolbar"],
+[data-testid="stBottom"],
+[data-testid="stBottomBlockContainer"] {
+  background: var(--bg) !important;
+  color: var(--ink) !important;
+  font-family: Arial, Helvetica, sans-serif !important;
+}
+/* header bar */
+[data-testid="stHeader"] {
+  border-bottom: 1px solid var(--border) !important;
+  box-shadow: none !important;
+}
+/* bottom chat bar */
+[data-testid="stBottom"] {
+  border-top: 1px solid var(--border) !important;
+  box-shadow: none !important;
+}
+#MainMenu, footer { visibility: hidden; }
+/* hide deploy button (confirmed testid from Streamlit 1.50 bundle) */
+[data-testid="stAppDeployButton"] {
+  display: none !important;
+}
+/* ---- layout ---- */
+.block-container {
+  max-width: 1060px !important;
+  padding: 56px 32px 140px !important;
+}
+/* ---- sidebar (dark) ---- */
+section[data-testid="stSidebar"] {
+  background: #0f0f0f !important;
+  border-right: 1px solid #1e1e1e !important;
+}
+section[data-testid="stSidebar"] .block-container {
+  padding: 28px 16px 48px !important;
+}
+/* all text inside sidebar goes light */
+section[data-testid="stSidebar"] p,
+section[data-testid="stSidebar"] span,
+section[data-testid="stSidebar"] label,
+section[data-testid="stSidebar"] div,
+section[data-testid="stSidebar"] .stMarkdown {
+  color: #cccccc !important;
+  font-family: Arial, Helvetica, sans-serif !important;
+}
+/* sidebar collapse button */
+[data-testid="stSidebarCollapseButton"] button::after { color: #555 !important; }
+[data-testid="stExpandSidebarButton"] button::after   { color: #555 !important; }
+/* ---- sidebar heading animation ---- */
+.sb-title {
+  font-size: 13px;
+  font-weight: 700;
+  letter-spacing: 0.18em;
+  text-transform: uppercase;
+  color: #ffffff !important;
+  margin-bottom: 2px;
+  animation: sbFadeDown 500ms cubic-bezier(0.22,1,0.36,1) both;
+}
+.sb-tagline {
+  font-size: 10px;
+  letter-spacing: 0.08em;
+  color: #555 !important;
+  overflow: hidden;
+  white-space: nowrap;
+  width: 0;
+  animation: sbTypewriter 1.4s steps(32, end) 300ms forwards,
+             sbBlinkCursor 600ms step-end 300ms 3;
+  border-right: 1px solid #444;
+}
+@keyframes sbFadeDown {
+  from { opacity: 0; transform: translateY(-6px); }
+  to   { opacity: 1; transform: none; }
+}
+@keyframes sbTypewriter {
+  to { width: 100%; border-right-color: transparent; }
+}
+@keyframes sbBlinkCursor {
+  50% { border-right-color: transparent; }
+}
+/* ---- sidebar labels ---- */
+.sidebar-label {
+  font-size: 9px;
+  font-weight: 700;
+  letter-spacing: 0.16em;
+  text-transform: uppercase;
+  color: #444444 !important;
+  animation: sbFadeDown 400ms ease both;
+}
+.sidebar-box {
+  border-top: 1px solid #1e1e1e;
+  padding-top: 14px;
+  margin-top: 14px;
+}
+.sidebar-copy {
+  font-size: 10px;
+  line-height: 1.7;
+  color: #777777 !important;
+}
+.sidebar-stat {
+  font-size: 11px;
+  color: #aaaaaa !important;
+  font-weight: 600;
+}
+/* ---- sidebar buttons ---- */
+section[data-testid="stSidebar"] .stButton > button {
+  width: 100% !important;
+  background: transparent !important;
+  border: 1px solid #2a2a2a !important;
+  border-radius: 0 !important;
+  color: #666666 !important;
+  padding: 8px 10px !important;
+  text-align: left !important;
+  font-size: 10px !important;
+  letter-spacing: 0.08em !important;
+  text-transform: uppercase !important;
+  box-shadow: none !important;
+  transition: background 150ms, border-color 150ms, color 150ms !important;
+}
+section[data-testid="stSidebar"] .stButton > button:hover {
+  background: #1a1a1a !important;
+  border-color: #555555 !important;
+  color: #eeeeee !important;
+}
+/* ---- sidebar inputs ---- */
+section[data-testid="stSidebar"] .stTextInput input {
+  background: #1a1a1a !important;
+  border: 1px solid #2a2a2a !important;
+  border-radius: 0 !important;
+  color: #cccccc !important;
+  box-shadow: none !important;
+  font-size: 12px !important;
+  transition: border-color 150ms !important;
+}
+section[data-testid="stSidebar"] .stTextInput input:focus {
+  border-color: #555555 !important;
+}
+section[data-testid="stSidebar"] .stTextInput input::placeholder {
+  color: #444444 !important;
+}
+/* ---- sidebar expander (dark) ---- */
+section[data-testid="stSidebar"] [data-testid="stExpander"] {
+  background: transparent !important;
+  border: 1px solid #1e1e1e !important;
+}
+section[data-testid="stSidebar"] [data-testid="stExpander"]:hover {
+  border-color: #333333 !important;
+}
+section[data-testid="stSidebar"] [data-testid="stExpander"] summary p,
+section[data-testid="stSidebar"] [data-testid="stExpander"] summary span {
+  color: #888888 !important;
+  font-size: 10px !important;
+  letter-spacing: 0.1em !important;
+  text-transform: uppercase !important;
+}
+/* ---- main area buttons ---- */
+.block-container .stButton > button {
+  width: 100% !important;
+  background: transparent !important;
+  border: 1px solid #cccccc !important;
+  border-radius: 0 !important;
+  color: var(--mid) !important;
+  padding: 8px 10px !important;
+  text-align: left !important;
+  font-size: 10px !important;
+  letter-spacing: 0.08em !important;
+  text-transform: uppercase !important;
+  box-shadow: none !important;
+  transition: background 150ms, border-color 150ms, color 150ms !important;
+}
+.block-container .stButton > button:hover {
+  background: var(--surface) !important;
+  border-color: var(--ink) !important;
+  color: var(--ink) !important;
+}
+/* ---- inputs ---- */
+.stTextInput input {
+  background: var(--bg) !important;
+  border: 1px solid var(--border) !important;
+  border-radius: 0 !important;
+  color: var(--ink) !important;
+  box-shadow: none !important;
+  font-size: 12px !important;
+  transition: border-color 150ms !important;
+}
+.stTextInput input:focus { border-color: var(--ink) !important; }
+/* chat input container */
+[data-testid="stChatInput"],
+[data-testid="stChatInput"] > div,
+[data-testid="stChatInputContainer"] {
+  background: #ffffff !important;
+  border: 1px solid #d0d0d0 !important;
+  border-radius: 0 !important;
+  box-shadow: none !important;
+  transition: border-color 150ms !important;
+}
+[data-testid="stChatInput"]:focus-within,
+[data-testid="stChatInputContainer"]:focus-within {
+  border-color: #000000 !important;
+  box-shadow: none !important;
+}
+[data-testid="stChatInput"] textarea {
+  background: #ffffff !important;
+  color: #000000 !important;
+  font-size: 14px !important;
+  font-family: Arial, Helvetica, sans-serif !important;
+}
+/* send button */
+[data-testid="stChatInput"] button,
+[data-testid="stChatInputContainer"] button {
+  background: #000000 !important;
+  border: none !important;
+  border-radius: 0 !important;
+  color: #ffffff !important;
+  box-shadow: none !important;
+  min-width: 44px !important;
+  width: 44px !important;
+  height: 100% !important;
+  font-family: Arial, Helvetica, sans-serif !important;
+  font-size: 13px !important;
+  font-weight: 700 !important;
+  letter-spacing: 0.04em !important;
+}
+[data-testid="stChatInput"] button:hover,
+[data-testid="stChatInputContainer"] button:hover {
+  background: #333333 !important;
+  opacity: 1 !important;
+}
+/* replace SVG arrow with text "->" */
+[data-testid="stChatInput"] button svg,
+[data-testid="stChatInputContainer"] button svg { display: none !important; }
+[data-testid="stChatInput"] button::after,
+[data-testid="stChatInputContainer"] button::after {
+  content: "->";
+  font-family: Arial, Helvetica, sans-serif !important;
+  font-size: 13px;
+  font-weight: 700;
+  color: #ffffff;
+}
+/* kill any rounded wrapper Streamlit adds around the whole bar */
+[data-testid="stBottom"] > div,
+[data-testid="stBottomBlockContainer"] > div {
+  background: #ffffff !important;
+  border-radius: 0 !important;
+  box-shadow: none !important;
+}
+/* ---- expander ---- */
+[data-testid="stExpander"] {
+  background: transparent !important;
+  border: 1px solid var(--border) !important;
+  border-radius: 0 !important;
+  overflow: hidden !important;
+  transition: border-color 150ms !important;
+}
+[data-testid="stExpander"]:hover { border-color: var(--ink) !important; }
+[data-testid="stExpander"] summary { padding: 10px 14px !important; }
+/* kill the expander toggle icon (data-testid confirmed from Streamlit 1.50 bundle) */
+[data-testid="stExpander"] summary [data-testid="stIconMaterial"],
+[data-testid="stExpander"] summary [data-testid="stImageIcon"] {
+  display: none !important;
+}
+/* sidebar collapse / expand toggle icons -> replace with < > */
+[data-testid="stExpandSidebarButton"] [data-testid="stIconMaterial"],
+[data-testid="stSidebarCollapseButton"] [data-testid="stIconMaterial"] {
+  display: none !important;
+}
+[data-testid="stExpandSidebarButton"] button::after {
+  content: ">";
+  font-size: 15px; font-weight: 700;
+  font-family: Arial, Helvetica, sans-serif !important;
+  color: #555555;
+}
+[data-testid="stSidebarCollapseButton"] button::after {
+  content: "<";
+  font-size: 15px; font-weight: 700;
+  font-family: Arial, Helvetica, sans-serif !important;
+  color: #555555;
+}
+/* ---- progress bar ---- */
+.prog-wrap { padding: 20px 0 12px; }
+.prog-hint {
+  font-size: 11px; color: var(--muted);
+  letter-spacing: 0.1em; margin-bottom: 10px;
+  font-style: italic;
+  animation: progPulse 1.8s ease-in-out infinite;
+}
+@keyframes progPulse {
+  0%, 100% { opacity: 0.5; }
+  50%       { opacity: 1; }
+}
+.prog-bg {
+  background: var(--divider); height: 1px; width: 100%; margin-bottom: 6px;
+  position: relative; overflow: hidden;
+}
+.prog-fill {
+  background: var(--ink); height: 1px;
+  transition: width 0.35s ease;
+  position: absolute; top: 0; left: 0;
+}
+/* shimmer on the fill bar */
+.prog-fill::after {
+  content: "";
+  position: absolute; top: 0; right: 0;
+  width: 40px; height: 1px;
+  background: linear-gradient(to right, transparent, #fff, transparent);
+  animation: shimmer 1.2s ease-in-out infinite;
+}
+@keyframes shimmer {
+  0%   { opacity: 0; transform: translateX(-40px); }
+  50%  { opacity: 1; }
+  100% { opacity: 0; transform: translateX(40px); }
+}
+.prog-pct {
+  font-size: 9px; color: var(--faint);
+  letter-spacing: 0.14em; text-transform: uppercase;
+  font-family: "Courier New", monospace !important;
+}
+.prog-stuck {
+  margin-top: 8px;
+  font-size: 10px; color: var(--muted);
+  letter-spacing: 0.08em; font-style: italic;
+  animation: stuckFadeIn 400ms ease both;
+}
+.prog-stuck-0 { color: var(--muted); }
+.prog-stuck-1 { color: var(--faint); }
+.prog-stuck-2 { color: #cccccc; font-size: 9px; }
+.prog-stuck-3 { color: #dddddd; font-size: 9px; }
+.prog-stuck-4 { color: #e0e0e0; font-size: 9px; }
+.prog-stuck-5 { color: #e8e8e8; font-size: 9px; }
+@keyframes stuckFadeIn {
+  from { opacity: 0; transform: translateY(4px); }
+  to   { opacity: 1; transform: none; }
+}
+/* ---- loading post slideshow ---- */
+.post-slide {
+  margin-top: 20px;
+  padding: 14px 18px;
+  background: var(--surface);
+  border-left: 2px solid var(--divider);
+  animation: slideIn 300ms cubic-bezier(0.22,1,0.36,1) both;
+}
+.post-slide-sub {
+  font-size: 9px; letter-spacing: 0.14em; text-transform: uppercase;
+  color: var(--faint); margin-bottom: 6px;
+  font-family: "Courier New", monospace !important;
+}
+.post-slide-title {
+  font-size: 13px; line-height: 1.5; color: var(--body);
+  font-style: italic;
+}
+@keyframes slideIn {
+  from { opacity: 0; transform: translateY(6px); }
+  to   { opacity: 1; transform: none; }
+}
+/* ---- chat ---- */
+[data-testid="stChatMessage"] {
+  background: transparent !important;
+  border: none !important;
+  padding: 0 !important;
+  margin: 0 0 28px !important;
+  gap: 12px !important;
+  animation: fadeUp 240ms cubic-bezier(0.22,1,0.36,1) both;
+}
+/* user avatar: kaomoji */
+[data-testid="stChatMessageAvatarUser"] {
+  background: #000000 !important;
+  border: none !important;
+  width: 34px !important; height: 34px !important;
+  min-width: 34px !important;
+  border-radius: 0 !important;
+  position: relative !important;
+  overflow: visible !important;
+}
+[data-testid="stChatMessageAvatarUser"] svg,
+[data-testid="stChatMessageAvatarUser"] img { display: none !important; }
+[data-testid="stChatMessageAvatarUser"]::after {
+  content: "( ˘▾˘)";
+  position: absolute; top: 50%; left: 50%;
+  transform: translate(-50%, -50%);
+  font-size: 11px; line-height: 1; color: #ffffff;
+  white-space: nowrap;
+  font-family: Arial, Helvetica, sans-serif !important;
+}
+/* assistant avatar: kaomoji */
+[data-testid="stChatMessageAvatarAssistant"] {
+  background: #ffffff !important;
+  border: 1px solid var(--border) !important;
+  width: 34px !important; height: 34px !important;
+  min-width: 34px !important;
+  border-radius: 0 !important;
+  position: relative !important;
+  overflow: visible !important;
+}
+[data-testid="stChatMessageAvatarAssistant"] svg,
+[data-testid="stChatMessageAvatarAssistant"] img { display: none !important; }
+[data-testid="stChatMessageAvatarAssistant"]::after {
+  content: "ʕ•ᴥ•ʔ";
+  position: absolute; top: 50%; left: 50%;
+  transform: translate(-50%, -50%);
+  font-size: 11px; line-height: 1; color: #000000;
+  white-space: nowrap;
+  font-family: Arial, Helvetica, sans-serif !important;
+}
+.msg-meta {
+  display: flex; align-items: center; gap: 12px;
+  margin-bottom: 8px;
+}
+.msg-label {
+  font-size: 10px; font-weight: 700;
+  letter-spacing: 0.14em; text-transform: uppercase;
+  color: var(--muted);
+}
+.route-tag {
+  font-size: 10px; letter-spacing: 0.1em; text-transform: uppercase;
+  color: var(--faint); border-left: 1px solid var(--border); padding-left: 10px;
+}
+.msg-body {
+  border-top: 1px solid var(--divider);
+  padding-top: 12px;
+  animation: fadeIn 220ms 60ms ease both;
+}
+.msg-body p {
+  font-size: 14px !important; line-height: 1.7 !important;
+  color: var(--body) !important; max-width: 72ch !important;
+}
+/* ---- cost bar ---- */
+.cost-row {
+  display: flex; align-items: center; gap: 16px;
+  margin-top: 22px; margin-bottom: 18px;
+  padding: 14px 18px;
+  background: #000000;
+}
+.cost-label {
+  font-size: 10px; letter-spacing: 0.18em; text-transform: uppercase;
+  color: #666666; white-space: nowrap; font-family: "Courier New", monospace !important;
+}
+.cost-track {
+  flex: 1; height: 2px; background: #2a2a2a; position: relative;
+}
+.cost-fill {
+  position: absolute; top: 0; left: 0; height: 2px;
+  background: #ffffff;
+  transition: width 600ms cubic-bezier(0.22,1,0.36,1);
+}
+.cost-val {
+  font-size: 12px; letter-spacing: 0.06em; color: #ffffff;
+  font-family: "Courier New", monospace !important; white-space: nowrap;
+}
+.cost-tok {
+  font-size: 10px; letter-spacing: 0.04em; color: #666666;
+  font-family: "Courier New", monospace !important; white-space: nowrap;
+}
+/* ---- step trace ---- */
+.step-title {
+  display: block; font-size: 12px; font-weight: 700;
+  margin-bottom: 4px; color: var(--ink);
+}
+.spath {
+  display: block; margin-top: 4px;
+  font-size: 11px; color: var(--muted);
+  font-family: "Courier New", monospace !important;
+}
+[data-testid="stDataFrame"] table {
+  font-size: 12px !important;
+  font-family: Arial, Helvetica, sans-serif !important;
+}
+/* ---- keyframes ---- */
+@keyframes fadeUp {
+  from { opacity:0; transform:translateY(6px); }
+  to   { opacity:1; transform:none; }
+}
+@keyframes fadeIn {
+  from { opacity:0; } to { opacity:1; }
+}
+/* ---- responsive ---- */
+@media (max-width:640px) {
+  .block-container { padding: 32px 16px 100px !important; }
+}
+</style>
+""", unsafe_allow_html=True)
+# ── helpers ────────────────────────────────────────────────────────────────
+def fmt(v: int | None) -> str:
+    return "n/a" if v is None else f"{v:,}"
+def dataset_snapshot() -> dict:
+    try:
+        return list_datasets()
+    except Exception:
+        return {}
+def render_plot(plotly_json: str) -> None:
+    try:
+        fig = pio.from_json(plotly_json)
+        fig.update_layout(
+            paper_bgcolor="rgba(0,0,0,0)", plot_bgcolor="rgba(0,0,0,0)",
+            font=dict(family="Arial, Helvetica, sans-serif", color="#222", size=12),
+            margin=dict(l=0, r=0, t=32, b=0),
+            colorway=["#000", "#555", "#888", "#bbb"],
+            xaxis=dict(gridcolor="#ebebeb", linecolor="#e0e0e0"),
+            yaxis=dict(gridcolor="#ebebeb", linecolor="#e0e0e0"),
+        )
+        st.plotly_chart(fig, use_container_width=True)
+    except Exception as exc:
+        st.warning(f"Chart could not be rendered: {exc}")
+def compact_tool_result(result: object) -> dict:
+    if not isinstance(result, dict):
+        return {"value": result}
+    compact: dict = {"keys": sorted(result.keys())}
+    for key in ("saved_csv", "saved_png", "plotly_json", "error", "analysis", "dataset", "filters"):
+        if key in result and result.get(key) is not None:
+            compact[key] = result[key]
+    table = result.get("table")
+    if isinstance(table, list):
+        compact["table_rows"] = len(table)
+        compact["table_preview"] = table[:5]
+    return compact
+def extract_artifacts(tool_calls: list[dict]) -> list[dict]:
+    artifacts: list[dict] = []
+    for tc in tool_calls:
+        result = tc.get("result") or {}
+        if not isinstance(result, dict):
+            continue
+        for key, atype in (("saved_csv", "csv"), ("saved_png", "png")):
+            if result.get(key):
+                artifacts.append({"type": atype, "tool": tc.get("tool", "?"), "path": result[key]})
+        if result.get("plotly_json"):
+            artifacts.append({"type": "plotly_json", "tool": tc.get("tool", "?"), "present": True})
+    return artifacts
+def build_backend_history(turns: list[dict]) -> list[dict]:
+    history: list[dict] = []
+    for turn in turns:
+        history.append({"role": "user", "content": turn["question"]})
+        content = turn["answer"]
+        state = {
+            "tool_calls": [
+                {"tool": tc.get("tool"), "args": tc.get("args") or {},
+                 "result": compact_tool_result(tc.get("result"))}
+                for tc in turn.get("tool_calls", [])
+            ],
+            "artifacts": turn.get("artifacts", []),
+            "plotly_json": bool(turn.get("plotly_json")),
+            "route": turn.get("route"),
+        }
+        if state["tool_calls"] or state["artifacts"] or state["plotly_json"]:
+            content += f"\n\n<analysis_state>\n{json.dumps(state, default=str, indent=2)}\n</analysis_state>"
+        history.append({"role": "assistant", "content": content})
+    return history
+def call_agent(question: str, history: list[dict], turns: list[dict]) -> dict:
+    kwargs = {"history": history}
+    params = inspect.signature(run_agent).parameters
+    for name in ("analysis_context", "conversation_state", "turns"):
+        if name in params:
+            kwargs[name] = turns
+            break
+    return run_agent(question, **kwargs)
+_POST_SAMPLES: list[dict] = []
+def call_agent_with_progress(question: str, backend_history: list[dict], turns: list[dict], slot) -> dict:
+    """Run agent in a background thread; update a progress slot from the main thread."""
+    result_holder: dict = {}
+    exc_holder: dict = {}
+    def worker() -> None:
+        try:
+            result_holder["r"] = call_agent(question, backend_history, turns)
+        except Exception as e:
+            exc_holder["e"] = e
+    t = threading.Thread(target=worker, daemon=True)
+    t.start()
+    STUCK_MSGS = [
+        "i promise i'm still gooning",
+        "locked in. fully gooned. cannot stop.",
+        "the data is vast. the goon is deep. patience.",
+        "i've been edging this query for so long i've lost track of time.",
+        "every second is another row scanned. feel it.",
+        "this is what a true goon session looks like. no shortcuts.",
+        "i am one with the dataset. do not disturb.",
+    ]
+    global _POST_SAMPLES
+    if not _POST_SAMPLES:
+        _POST_SAMPLES = load_post_samples()
+    posts = list(_POST_SAMPLES)  # copy so the shuffle below does not reorder the shared module-level list
+    random.shuffle(posts)
+    pct = 0
+    idx = 0
+    post_idx = 0
+    start = time.time()
+    stuck_since: float | None = None
+    last_pct_change = time.time()
+    while t.is_alive():
+        prev_pct = pct
+        pct = min(pct + random.randint(1, 5), 93)
+        if pct != prev_pct:
+            stuck_since = None
+            last_pct_change = time.time()
+        else:
+            if stuck_since is None:
+                stuck_since = time.time()
+        elapsed = int(time.time() - start)
+        elapsed_str = f"{elapsed}s" if elapsed < 60 else f"{elapsed // 60}m {elapsed % 60}s"
+        stuck_sec = int(time.time() - stuck_since) if stuck_since else 0
+        hint = LOADING_HINTS[idx % len(LOADING_HINTS)]
+        # Build up stuck messages — one new line per 12s window, cleared when pct moves
+        n_stuck = min(stuck_sec // 12, len(STUCK_MSGS))
+        stuck_html = "".join(
+            f'<div class="prog-stuck prog-stuck-{i}">{STUCK_MSGS[i]}</div>'
+            for i in range(n_stuck)
+        )
+        # rotate post every ~11 ticks (~4 seconds)
+        if idx % 11 == 0 and idx > 0:
+            post_idx += 1
+        post_html = ""
+        if posts:
+            p = posts[post_idx % len(posts)]
+            sub   = p.get("subreddit", "")
+            title = p.get("title", "").replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
+            post_html = (
+                f'<div class="post-slide" key="{post_idx}">'
+                f'<div class="post-slide-sub">r/{sub}</div>'
+                f'<div class="post-slide-title">{title}</div>'
+                f'</div>'
+            )
+        at_cap = pct >= 93
+        pct_display = "—%" if at_cap else f"{pct}%"
+        running_label = (
+            '<span class="prog-hint" style="display:inline;margin-left:8px;margin-bottom:0">'
+            'still running</span>'
+            if at_cap else ""
+        )
+        slot.markdown(
+            f'<div class="prog-wrap">'
+            f'<div class="prog-hint">{hint}</div>'
+            f'<div class="prog-bg"><div class="prog-fill" style="width:{pct}%"></div></div>'
+            f'<div class="prog-pct">{pct_display} &nbsp;·&nbsp; {elapsed_str}{running_label}</div>'
+            f'{stuck_html}'
+            f'{post_html}'
+            f'</div>',
+            unsafe_allow_html=True,
+        )
+        idx += 1
+        time.sleep(0.35)
+    t.join()
+    slot.empty()
+    if exc_holder:
+        raise exc_holder["e"]
+    return result_holder["r"]
+def render_cost_bar(usage: dict) -> None:
+    cost  = usage.get("cost_usd", 0)
+    inp   = usage.get("input_tokens", 0)
+    out   = usage.get("output_tokens", 0)
+    # scale: 0–$0.50 maps to 0–100% of bar
+    pct   = min(cost / 0.50 * 100, 100)
+    if cost < 0.01:
+        val_str = "&lt; $0.01"
+    else:
+        val_str = f"${cost:.3f}"
+    tok_str = f"{inp:,} in · {out:,} out"
+    st.markdown(
+        f'<div class="cost-row">'
+        f'<span class="cost-label">cost</span>'
+        f'<div class="cost-track"><div class="cost-fill" style="width:{pct:.1f}%"></div></div>'
+        f'<span class="cost-val">{val_str}</span>'
+        f'<span class="cost-tok">{tok_str}</span>'
+        f'</div>',
+        unsafe_allow_html=True,
+    )
+def render_tool_calls(tool_calls: list[dict]) -> None:
+    n = len(tool_calls)
+    with st.expander(f"Method  {n} step{'s' if n != 1 else ''}", expanded=False):
+        for i, tc in enumerate(tool_calls):
+            st.markdown(
+                f"<span class='step-title'>Step {i+1} -> {tc.get('tool','?')}</span>",
+                unsafe_allow_html=True,
+            )
+            if tc.get("args"):
+                st.json(tc["args"], expanded=False)
+            res = tc.get("result") or {}
+            if isinstance(res, dict):
+                if res.get("table"):
+                    try:
+                        st.dataframe(pd.DataFrame(res["table"]), use_container_width=True, hide_index=True)
+                    except Exception:
+                        pass
+                for key in ("saved_csv", "saved_png"):
+                    if res.get(key):
+                        st.markdown(f"<span class='spath'>-> {res[key]}</span>", unsafe_allow_html=True)
+            if i < n - 1:
+                st.markdown("---")
+def render_export_buttons(answer: str, tool_calls: list[dict], turn_idx: int) -> None:
+    artifacts = extract_artifacts(tool_calls)
+    csvs = [a["path"] for a in artifacts if a["type"] == "csv"]
+    pngs = [a["path"] for a in artifacts if a["type"] == "png"]
+    items: list[tuple[str, bytes, str, str]] = []
+    items.append(("answer.md", answer.encode("utf-8"), "text/markdown", f"answer_{turn_idx}.md"))
+    for path in csvs:
+        p = Path(path)
+        if p.exists():
+            items.append((p.name, p.read_bytes(), "text/csv", p.name))
+    for path in pngs:
+        p = Path(path)
+        if p.exists():
+            items.append((p.name, p.read_bytes(), "image/png", p.name))
+    cols = st.columns(len(items))
+    for col, (label, data, mime, fname) in zip(cols, items):
+        with col:
+            st.download_button(
+                label=label, data=data, file_name=fname, mime=mime,
+                key=f"dl_{turn_idx}_{fname}",
+            )
+# ── session state ──────────────────────────────────────────────────────────
+for key, default in [("history", []), ("chat", []), ("turns", []), ("prefill", ""), ("authenticated", False), ("logged_out", False)]:
+    if key not in st.session_state:
+        st.session_state[key] = default
+# seed from env if already set (e.g. from .env file) — but not if user explicitly logged out
+if not st.session_state["authenticated"] and not st.session_state["logged_out"] and os.environ.get("ANTHROPIC_API_KEY"):
+    st.session_state["authenticated"] = True
+# ── dataset metadata ───────────────────────────────────────────────────────
+meta          = dataset_snapshot()
+posts_rows    = meta.get("posts", {}).get("rows")
+comments_rows = meta.get("comments", {}).get("rows")
+sub_count     = len(meta.get("posts", {}).get("subreddits") or [])
+latest_date   = (meta.get("comments", {}).get("date_range") or {}).get("latest", "n/a")
+# ── login gate ─────────────────────────────────────────────────────────────
+if not st.session_state["authenticated"]:
+    st.markdown("""
+<style>
+.login-wrap {
+  display: flex; flex-direction: column; align-items: center;
+  justify-content: center; min-height: 70vh; gap: 20px;
+}
+.login-title {
+  font-size: 22px; font-weight: 700; letter-spacing: 0.04em; color: var(--ink);
+}
+.login-sub {
+  font-size: 12px; color: var(--muted); margin-top: -12px;
+}
+</style>
+<div class='login-wrap'>
+  <div class='login-title'>Grasping Gooning</div>
+  <div class='login-sub'>enter your Anthropic API key to continue</div>
+</div>
+""", unsafe_allow_html=True)
+    col = st.columns([1, 2, 1])[1]
+    with col:
+        login_key = st.text_input(
+            "API key", type="password", placeholder="sk-ant-…",
+            label_visibility="collapsed",
+        )
+        if st.button("Enter ->", key="login_btn", use_container_width=True):
+            if login_key.strip():
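+                # Strip surrounding whitespace and drop non-ASCII characters picked up when pasting.
+                ascii_key = login_key.strip().encode("ascii", errors="ignore").decode("ascii")
+                # Note: os.environ is process-wide, so the key is shared by every session served by this process.
+                os.environ["ANTHROPIC_API_KEY"] = ascii_key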
+                st.session_state["authenticated"] = True
+                st.session_state["logged_out"] = False
+                st.rerun()
+            else:
+                st.error("Paste your API key above.")
+    st.stop()
+# ── sidebar ────────────────────────────────────────────────────────────────
+with st.sidebar:
+    st.markdown("""
+<div class='sb-title'>Grasping Gooning</div>
+<div class='sb-tagline'>reddit data analysis agent</div>
+""", unsafe_allow_html=True)
+    st.markdown("<div class='sidebar-box'><div class='sidebar-label'>Session</div></div>",
+                unsafe_allow_html=True)
+    if st.button("Clear conversation", key="clear"):
+        st.session_state.update(history=[], chat=[], turns=[], prefill="")
+        st.rerun()
+    if st.button("Log out", key="logout"):
+        os.environ.pop("ANTHROPIC_API_KEY", None)
+        st.session_state.update(history=[], chat=[], turns=[], prefill="", authenticated=False, logged_out=True)
+        st.rerun()
+    # ── about (bottom of sidebar) ──────────────────────────────────────────
+    st.markdown("<div class='sidebar-box'>", unsafe_allow_html=True)
+    with st.expander("About"):
+        earliest = (meta.get("posts", {}).get("date_range") or {}).get("earliest", "n/a")
+        subs_list = meta.get("posts", {}).get("subreddits") or []
+        st.markdown(f"""
+<div class='sidebar-copy'>
+A research tool for analysing the Reddit gooning corpus.<br>
+Ask questions in plain English — the agent runs real code against the data and returns charts, tables, and findings.<br><br>
+<span class='sidebar-stat'>{fmt(posts_rows)}</span> posts<br>
+<span class='sidebar-stat'>{fmt(comments_rows)}</span> comments<br>
+<span class='sidebar-stat'>{sub_count}</span> subreddits<br>
+<span style='color:#444;font-size:9px'>{earliest} — {latest_date}</span><br><br>
+<span style='color:#333;font-size:9px;letter-spacing:0.1em;text-transform:uppercase'>Subreddits</span><br>
+<span style='color:#555;font-size:9px;line-height:1.9'>{" · ".join(subs_list[:15])}{"..." if len(subs_list) > 15 else ""}</span>
+</div>
+""", unsafe_allow_html=True)
+    st.markdown("</div>", unsafe_allow_html=True)
+# ── chat history ───────────────────────────────────────────────────────────
+for i, msg in enumerate(st.session_state["chat"]):
+    with st.chat_message(msg["role"]):
+        role  = msg["role"]
+        route = msg.get("route", "")
+        label = "You" if role == "user" else "Answer"
+        route_html = f"<span class='route-tag'>{route}</span>" if route and role == "assistant" else ""
+        st.markdown(
+            f"<div class='msg-meta'><span class='msg-label'>{label}</span>{route_html}</div>",
+            unsafe_allow_html=True,
+        )
+        st.markdown("<div class='msg-body'>", unsafe_allow_html=True)
+        st.markdown(msg["content"])
+        st.markdown("</div>", unsafe_allow_html=True)
+        for pj in (msg.get("plotly_jsons") or ([msg["plotly_json"]] if msg.get("plotly_json") else [])):
+            render_plot(pj)
+        if msg.get("usage"):
+            render_cost_bar(msg["usage"])
+        if msg.get("tool_calls"):
+            render_tool_calls(msg["tool_calls"])
+        if role == "assistant":
+            render_export_buttons(msg["content"], msg.get("tool_calls") or [], i)
+# ── chat input ─────────────────────────────────────────────────────────────
+prefill  = st.session_state["prefill"]
+question = st.chat_input("what do you want to know…")
+if prefill:
+    st.session_state["prefill"] = ""
+effective_question = question or prefill
+if effective_question:
+    question = effective_question
+    backend_history = build_backend_history(st.session_state["turns"])
+    with st.chat_message("user"):
+        st.markdown("<div class='msg-meta'><span class='msg-label'>You</span></div>",
+                    unsafe_allow_html=True)
+        st.markdown("<div class='msg-body'>", unsafe_allow_html=True)
+        st.markdown(question)
+        st.markdown("</div>", unsafe_allow_html=True)
+    with st.chat_message("assistant"):
+        progress_slot = st.empty()
+        try:
+            result = call_agent_with_progress(question, backend_history, list(st.session_state["turns"]), progress_slot)
+        except Exception as exc:
+            err_str = str(exc)
+            is_auth_err = (
+                type(exc).__name__ in ("AuthenticationError", "PermissionDeniedError")
+                or "invalid x-api-key" in err_str.lower()
+                or "401" in err_str
+            )
+            if is_auth_err:
+                os.environ.pop("ANTHROPIC_API_KEY", None)
+                st.session_state.update(authenticated=False, logged_out=True)
+                st.error("API key rejected — please re-enter it.")
+                st.rerun()
+            elif "rate_limit" in err_str.lower():
+                st.error("Rate limited. Wait a moment and try again.")
+            elif "Unicode encoding error" in err_str or ("ascii" in err_str.lower() and "codec" in err_str.lower()):
+                st.error("Encoding error — your API key may contain non-standard characters. Log out and re-enter it.")
+            else:
+                st.error(f"Something went wrong: {err_str[:300]}")
+            st.stop()
+        answer       = result.get("answer", "")
+        tool_calls   = result.get("tool_calls", [])
+        plotly_jsons = result.get("plotly_jsons") or ([result["plotly_json"]] if result.get("plotly_json") else [])
+        route        = result.get("route", "")
+        usage        = result.get("usage") or {}
+        route_html = f"<span class='route-tag'>{route}</span>" if route else ""
+        st.markdown(
+            f"<div class='msg-meta'><span class='msg-label'>Answer</span>{route_html}</div>",
+            unsafe_allow_html=True,
+        )
+        st.markdown("<div class='msg-body'>", unsafe_allow_html=True)
+        st.markdown(answer)
+        st.markdown("</div>", unsafe_allow_html=True)
+        for pj in plotly_jsons:
+            render_plot(pj)
+        if usage:
+            render_cost_bar(usage)
+        if tool_calls:
+            render_tool_calls(tool_calls)
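+        # Use the chat index this answer will occupy, so widget keys match the history loop and never collide with it.
+        render_export_buttons(answer, tool_calls, len(st.session_state["chat"]) + 1)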
+    turn = {
+        "question": question, "answer": answer,
+        "tool_calls": tool_calls, "plotly_jsons": plotly_jsons,
+        "artifacts": extract_artifacts(tool_calls), "route": route,
+        "usage": usage,
+    }
+    st.session_state["turns"].append(turn)
+    st.session_state["history"] = build_backend_history(st.session_state["turns"])
+    st.session_state["chat"].append({"role": "user", "content": question})
+    st.session_state["chat"].append({
+        "role": "assistant", "content": answer,
+        "tool_calls": tool_calls, "plotly_jsons": plotly_jsons,
+        "route": route, "usage": usage,
+    })
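
The `except` branch in the chat-input block above decides between four user-facing messages by inspecting the exception's class name and message text. A minimal sketch of that decision as a standalone function follows, so it can be exercised without Streamlit. The function name `classify_agent_error` and the returned labels are hypothetical and not part of `app.py`; the matching rules are copied from the diff. Note that the bare `"401"` substring test would also match unrelated messages that happen to contain those digits.

```python
def classify_agent_error(exc: Exception) -> str:
    """Return 'auth', 'rate_limit', 'encoding', or 'other' for an agent failure.

    Hypothetical helper: mirrors the string checks in app.py's except branch.
    """
    text = str(exc)
    lowered = text.lower()
    # Auth failures: matched by exception class name or by message contents.
    if (
        type(exc).__name__ in ("AuthenticationError", "PermissionDeniedError")
        or "invalid x-api-key" in lowered
        or "401" in text
    ):
        return "auth"
    # Rate limiting is detected from the message text only.
    if "rate_limit" in lowered:
        return "rate_limit"
    # Encoding problems, typically a key containing non-ASCII characters.
    if "Unicode encoding error" in text or ("ascii" in lowered and "codec" in lowered):
        return "encoding"
    return "other"
```

Keeping the order of the checks identical to the original matters: an error whose text matches both the auth and rate-limit rules is treated as an auth failure, which logs the user out.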

requirements.txt ADDED

	@@ -0,0 +1,10 @@

+anthropic>=0.40.0
+together>=1.3.0
+openai>=1.0.0
+scikit-learn>=1.4.0
+pandas>=2.0.0
+pyarrow>=14.0.0
+streamlit>=1.35.0
+python-dotenv>=1.0.0
+plotly>=5.20.0
+huggingface-hub>=0.22.0
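
Every entry above is a lower bound (`>=`) with no upper pin, so an environment can satisfy the file with much newer releases. As a rough sketch, the minimums could be checked against a set of installed versions as shown below. The helper names `parse_minimums` and `unmet` are hypothetical, and the comparison assumes purely numeric dotted versions; pre-release tags such as `rc1` would need a real version parser like `packaging.version`.

```python
def parse_minimums(lines):
    """Parse 'name>=X.Y.Z' lines into {name: (X, Y, Z)}; other lines are skipped."""
    minimums = {}
    for raw in lines:
        # Tolerate the leading '+' that diff views put in front of added lines.
        line = raw.strip().lstrip("+").strip()
        if ">=" not in line:
            continue
        name, version = line.split(">=", 1)
        minimums[name.strip()] = tuple(int(p) for p in version.strip().split("."))
    return minimums


def unmet(minimums, installed):
    """Return the names whose installed version is missing or below the minimum."""
    failing = []
    for name, floor in minimums.items():
        have = installed.get(name)
        if have is None or tuple(int(p) for p in have.split(".")) < floor:
            failing.append(name)
    return failing
```

Versions are compared as integer tuples rather than strings, so `1.9.0` correctly sorts below `1.35.0`.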