---
editor_options:
markdown:
wrap: 72
---
# Replication Guide — Grasping 'Gooning'
An observational analysis of Reddit gooning communities via BERTopic
topic modelling.
------------------------------------------------------------------------
## Project summary
This study empirically analyses publicly available Reddit subreddits
dedicated to gooning (prolonged, trancelike masturbation). The analysis
pipeline combines:
- **Descriptive statistics** — demographics (age, gender, sexuality
  from user flair), word frequencies, and substance mentions across 30
  subreddits
- **BERTopic topic modelling** — semantic clustering of \~2.35M
  documents (posts and comments) to identify the major themes and
  discourse patterns
**Full corpus:** 30 subreddits, \~22M raw records, 2019–2025
------------------------------------------------------------------------
## Repository structure
```
Goon/
├── REPLICATION.md ← this file
├── README.md ← subreddit list
├── Goon.Rproj ← RStudio project (sets working directory)
├── data_prep.qmd ← Step 1: R — ingest CSVs, clean, export Parquet
├── goon_analysis.qmd ← Step 2: R — descriptive analyses
├── goon_topic_analysis.qmd ← Step 4: R — visualise BERTopic outputs
├── data/
│   ├── *.csv ← raw Reddit exports (one file per subreddit per year)
│   ├── GOONED_comments.csv ← pre-concatenated GOONED comments (all years)
│   ├── GOONED_submissions.csv ← pre-concatenated GOONED submissions (all years)
│   ├── comments.parquet ← output of data_prep.qmd (18.2M rows)
│   ├── posts.parquet ← output of data_prep.qmd (3.8M rows)
│   ├── corpus_clean.parquet ← output of topic_analysis.qmd (modelling corpus)
│   └── corpus_deleted.parquet ← deleted/removed rows (tracked separately)
├── topic analysis/
│   ├── topic_analysis.qmd ← Step 3: Python — BERTopic pipeline
│   ├── topic_api_labeling.qmd ← Step 3b: Python — optional API-based topic labelling
│   ├── build_topic_results_summary.py ← helper: generate an HTML summary from run outputs
│   ├── run_topic_analysis.sh ← shell wrapper that sets env vars and calls quarto
│   ├── README.md ← detailed pipeline documentation
│   ├── MEMORY_MANAGEMENT.md ← strategies for large-corpus runs
│   └── runs/
│       └── full_run_v1/ ← complete output of the full corpus run
│           ├── bertopic/ ← saved BERTopic models + doc-topic assignments
│           ├── topics/ ← keyword tables, summaries, evaluation, API labels
│           └── figures/ ← generated charts
└── .venv/ ← Python virtual environment (created per instructions below)
```
------------------------------------------------------------------------
## System requirements
### Hardware
| Component | Minimum | Recommended |
|----|----|----|
| RAM | 32 GB | 64 GB (full corpus run; see MEMORY_MANAGEMENT.md) |
| Disk | 50 GB free | 100 GB free |
| CPU | Any modern multi-core | 8+ cores for embedding generation |
| GPU | Not required | Optional — speeds up embedding by \~10× |
> The full corpus run was executed on a Linux VM with 96 GB RAM. A local
> Mac with 24 GB will work for the pilot/sample runs but may struggle
> with the full corpus embedding step.
### Software
| Tool | Version tested | Notes |
|----|----|----|
| R | ≥ 4.3 | For data_prep.qmd and goon_analysis.qmd |
| RStudio / Quarto CLI | ≥ 1.4 | To render .qmd files |
| Python | ≥ 3.10 | For topic_analysis.qmd |
| Quarto | ≥ 1.4 | Installed with RStudio or standalone |
------------------------------------------------------------------------
## Step-by-step replication
### Step 0: Clone / obtain the repository
The `data/` folder containing raw CSVs is required. These are large
files and are not distributed via git β€” they must be present locally.
Open `Goon.Rproj` in RStudio. This sets the working directory to the
project root so that `here::here()` paths resolve correctly.
------------------------------------------------------------------------
### Step 1: R data preparation (`data_prep.qmd`)
**Purpose:** Reads all raw CSVs, combines them into unified data frames,
applies minimal cleaning, and exports `data/comments.parquet` and
`data/posts.parquet`.
**R packages required:**
``` r
install.packages(c("dplyr", "tidyr", "tibble", "purrr",
"data.table", "arrow", "here"))
```
**Run:**
Open `data_prep.qmd` in RStudio and click **Render** (or run all chunks
in order).
Alternatively, from the terminal:
``` bash
cd /Users/bkot7579/Desktop/Goon
quarto render data_prep.qmd
```
**Expected outputs:**

- `data/comments.parquet` — \~18.2M rows, \~528 MB
- `data/posts.parquet` — \~3.8M rows, \~186 MB
**Time estimate:** 15–45 minutes depending on RAM and I/O speed (the
GOONED CSV files alone are \~7 GB).
------------------------------------------------------------------------
### Step 2: R descriptive analysis (`goon_analysis.qmd`)
**Purpose:** Demographic analysis (r/GOONEDmeetup flair), word frequency
analysis, substance mention counts.
**R packages required:**
``` r
install.packages(c("dplyr", "tidyr", "tibble", "purrr", "ggplot2",
"stringr", "here", "e1071", "tidytext",
"data.table", "arrow"))
```
**Run:**
Open `goon_analysis.qmd` in RStudio and click **Render**.
**Prerequisites:** `data/comments.parquet` and `data/posts.parquet` must
exist (Step 1).
**Expected outputs:**

- Rendered HTML with plots embedded
- In-memory: word count tables, substance mention counts, demographic counts
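The flair-based demographic coding can be sketched as follows. The actual patterns in `goon_analysis.qmd` (which is R, not Python) may well differ; the regexes below are purely illustrative, and the standalone-letter ambiguity they allow is the one flagged under Known limitations:

``` python
import re

# Illustrative flair parser; flairs often resemble "28 M straight".
AGE_RE = re.compile(r"\b(1[89]|[2-9]\d)\b")  # ages 18-99
# Longer tokens first; standalone "t" is ambiguous (see Known limitations).
GENDER_RE = re.compile(r"\b(male|female|trans|nb|m|f|t)\b", re.I)

def parse_flair(flair):
    """Extract age and gender tokens from a flair string; None when absent."""
    age = AGE_RE.search(flair)
    gender = GENDER_RE.search(flair)
    return {"age": int(age.group()) if age else None,
            "gender": gender.group().lower() if gender else None}

print(parse_flair("28 M straight"))  # {'age': 28, 'gender': 'm'}
print(parse_flair("no flair set"))   # {'age': None, 'gender': None}
```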
------------------------------------------------------------------------
### Step 3: Python BERTopic topic modelling (`topic analysis/topic_analysis.qmd`)
**Purpose:** Embeds all documents with `all-MiniLM-L6-v2`, clusters the
embeddings with HDBSCAN via BERTopic, reduces topics using c-TF-IDF
agglomerative clustering, evaluates the resulting models, and exports
all topic outputs.
#### 3a. Create the Python virtual environment
``` bash
cd /Users/bkot7579/Desktop/Goon
python3 -m venv .venv
source .venv/bin/activate
pip install bertopic umap-learn hdbscan sentence-transformers \
scikit-learn pandas pyarrow quarto
```
**Key package versions (tested):**
| Package | Version |
|-----------------------|---------|
| bertopic | 0.16.x |
| umap-learn | 0.5.x |
| hdbscan | 0.8.x |
| sentence-transformers | 2.x |
| scikit-learn | 1.x |
| pandas | 2.x |
| pyarrow | 14.x |
#### 3b. Run the pipeline
**Pilot run (200k documents — recommended first):**
``` bash
cd /Users/bkot7579/Desktop/Goon
./topic\ analysis/run_topic_analysis.sh --max-docs 200000 --run-tag pilot_200k
```
Outputs land in `topic analysis/runs/pilot_200k/`.
**Full corpus run (\~2.35M cleaned documents after filtering):**
``` bash
cd /Users/bkot7579/Desktop/Goon
./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v1
```
> **Warning:** The full run requires \~64 GB RAM for the embedding +
> UMAP stages. See `topic analysis/MEMORY_MANAGEMENT.md` for strategies
> if RAM is limited.
**Pipeline stages (automatically executed in order):**
1. **Data ingestion** — loads `data/posts.parquet` +
   `data/comments.parquet`
2. **Light cleaning** — deduplication, URL/username/subreddit
   anonymisation, markdown stripping; deleted/removed rows saved
   separately to `data/corpus_deleted.parquet`
3. **Embedding** — `all-MiniLM-L6-v2` (384-dim), batch_size=512, saved
   as shards; skips already-generated shards on resume
4. **UMAP** — pre-computed once, reused across all HDBSCAN configs
5. **BERTopic** — 6 configurations: min_cluster_size ∈ {50, 100, 200} ×
   method ∈ {eom, leaf}
6. **Topic reduction** — c-TF-IDF agglomerative clustering to 100, 50,
   and 25 topics
7. **Evaluation** — NPMI coherence, topic diversity, outlier rates
8. **Export** — keyword tables, representative docs, summary CSVs
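The shard-and-resume behaviour of the embedding stage can be approximated like this (an illustrative sketch; the function and shard file names are ours, not the pipeline's). `embed` would typically be `SentenceTransformer("all-MiniLM-L6-v2").encode`:

``` python
import os

import numpy as np

def embed_in_shards(texts, embed, shard_dir, shard_size=100_000):
    """Embed texts in fixed-size shards saved to disk; on a re-run,
    shards already on disk are loaded rather than recomputed."""
    os.makedirs(shard_dir, exist_ok=True)
    parts = []
    for i in range(0, len(texts), shard_size):
        path = os.path.join(shard_dir, f"shard_{i // shard_size:05d}.npy")
        if os.path.exists(path):  # resume: skip finished work
            parts.append(np.load(path))
        else:
            emb = np.asarray(embed(texts[i:i + shard_size]), dtype=np.float32)
            np.save(path, emb)
            parts.append(emb)
    return np.vstack(parts)
```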
**Reproducibility:** Random seed = 42 throughout. A
`reproducibility_log.json` is written to the run folder with all
settings and package versions.
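A log of this shape can be produced with a few lines of standard-library Python. The field names below are illustrative; the real `reproducibility_log.json` may use different keys:

``` python
import json
import platform
import sys
from importlib import metadata

def write_reproducibility_log(path, seed=42, settings=None):
    """Dump run settings plus installed package versions to a JSON file."""
    packages = {}
    for pkg in ("bertopic", "umap-learn", "hdbscan", "sentence-transformers",
                "scikit-learn", "pandas", "pyarrow"):
        try:
            packages[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            packages[pkg] = None  # not installed in this environment
    log = {"seed": seed,
           "python": sys.version.split()[0],
           "platform": platform.platform(),
           "settings": settings or {},
           "package_versions": packages}
    with open(path, "w") as f:
        json.dump(log, f, indent=2)
    return log
```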
#### 3c. Optional API-based topic labelling (`topic_api_labeling.qmd`)
Sends reduced topic summaries (keywords + representative texts) to an
LLM API to generate human-readable labels. Does NOT send raw corpus
text.
Set your API key, then render:
``` bash
export OPENAI_API_KEY="your-key-here"
quarto render "topic analysis/topic_api_labeling.qmd"
```
Outputs are saved to `topic analysis/runs/<run-tag>/topics/api/`.
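The request can be sketched as follows: only keyword lists and short representative snippets enter the prompt, never raw corpus text. The helper and the model name in the commented-out call are assumptions for illustration, not necessarily what `topic_api_labeling.qmd` uses:

``` python
def label_prompt(keywords, snippets):
    """Build a labelling prompt from topic keywords and representative
    snippets (summaries only -- raw corpus text is never sent)."""
    lines = ["Suggest a short, human-readable label for one discussion topic.",
             "Top keywords: " + ", ".join(keywords),
             "Representative snippets:"]
    lines += [f"- {s}" for s in snippets]
    lines.append("Reply with the label only.")
    return "\n".join(lines)

# The API call itself (requires OPENAI_API_KEY in the environment):
# from openai import OpenAI
# resp = OpenAI().chat.completions.create(
#     model="gpt-4o-mini",  # assumed model; substitute the one actually used
#     messages=[{"role": "user", "content": label_prompt(kws, snips)}])
# label = resp.choices[0].message.content.strip()
```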
------------------------------------------------------------------------
### Step 4: R topic visualisations (`goon_topic_analysis.qmd`)
**Purpose:** Loads the CSV/Parquet outputs from the BERTopic run and
produces exploratory visualisations: topic size bar chart, subreddit ×
topic heatmap, topic prevalence over time, post vs comment split,
representative documents.
**R packages required:**
``` r
install.packages(c("dplyr", "tidyr", "tibble", "purrr",
"ggplot2", "stringr", "here", "arrow"))
```
**Run:**
``` bash
quarto render goon_topic_analysis.qmd
```
**Prerequisites:** Step 3 must have completed and outputs must exist
under `topic analysis/runs/full_run_v1/topics/`.
------------------------------------------------------------------------
## Execution order summary
```
Step 1  → data_prep.qmd (R)                 ~30 min
Step 2  → goon_analysis.qmd (R)             ~10 min
Step 3  → topic_analysis.qmd (Python)       ~6–48 hours (full corpus)
Step 3b → topic_api_labeling.qmd (Python)   ~5 min + API cost (optional)
Step 4  → goon_topic_analysis.qmd (R)       ~2 min
```
Steps 2 and 3 are independent of each other and can run in parallel.
------------------------------------------------------------------------
## Key modelling decisions
| Decision | Choice | Rationale |
|-------------------------|--------------------|---------------------------|
| Embedding model | `all-MiniLM-L6-v2` | Fast, runs on CPU, 384-dim sufficient for topic structure |
| UMAP n_neighbors | 15 | BERTopic default; balances local vs global structure |
| UMAP n_components | 5 | Low enough for HDBSCAN to work well |
| HDBSCAN min_cluster_size | 50, 100, 200 | Tested all three; mcs=100 eom selected as reference |
| Topic reduction method | c-TF-IDF agglomerative | Merges semantically similar topics rather than splitting clusters |
| Reduction targets | 100, 50, 25 | 50 selected for reporting (NPMI=0.27, diversity=0.74) |
| Preprocessing | Minimal | Preserves informal language and slang; CountVectorizer handles casing/stopwords |
| Random seed | 42 | Applied to UMAP, HDBSCAN sampling, and document cap sampling |
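Of the two evaluation metrics, topic diversity is simple enough to sketch directly. The function below uses the common definition (fraction of unique words across the top-k keywords of all topics); the pipeline's exact implementation may differ, and NPMI coherence additionally requires corpus co-occurrence statistics, so it is not shown:

``` python
def topic_diversity(topic_keywords, top_k=10):
    """Fraction of unique words among the top-k keywords of all topics;
    1.0 means no keyword is shared between any two topics."""
    words = [w for topic in topic_keywords for w in topic[:top_k]]
    return len(set(words)) / len(words)

# Two topics sharing one of their two keywords -> 3 unique / 4 total:
print(topic_diversity([["goon", "edge"], ["edge", "trance"]], top_k=2))  # 0.75
```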
------------------------------------------------------------------------
## Known issue: r/GOONED missing from BERTopic results
r/GOONED is the dominant subreddit in the corpus by a large margin:
| | Count |
|---------------------|----------------|
| r/GOONED posts | 2,765,119 |
| r/GOONED comments | 15,493,075 |
| **r/GOONED total** | **18,258,194** |
| Full cleaned corpus | 22,011,124 |
| **r/GOONED share** | **82.9%** |
Despite this, r/GOONED is **entirely absent** from
`topic analysis/runs/full_run_v1/` outputs. The cloud VM that ran the
BERTopic pipeline did not have the `GOONED_comments.csv` and
`GOONED_submissions.csv` files available (most likely because their
combined size of \~7 GB made transfer impractical), and
`corpus_clean.parquet` on the VM was generated without them.
**Consequence:** All topic modelling results represent 29 subreddits
(3.75M records) rather than the full 30-subreddit corpus (22M records).
Topic proportions, dominant themes, and subreddit distribution tables
are therefore not representative of the full corpus.
**To fix:** Ensure the `GOONED_*.csv` files (or the pre-built
`comments.parquet` / `posts.parquet`) are available on the compute
environment, then re-run:
``` bash
./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v2
```
Note: the R-side analyses (`goon_analysis.qmd`) are **not** affected by
this issue β€” they read directly from the local `data/posts.parquet` and
`data/comments.parquet` files, which do contain r/GOONED data.
------------------------------------------------------------------------
## Known limitations
1. **Outlier rate:** 37.3% of documents are assigned to the outlier
topic (-1) by HDBSCAN. This is typical for short-text social media
corpora. Outliers are excluded from topic analyses but are retained
in the corpus parquet files.
2. **Flair ambiguity:** Gender/sexuality classifiers rely on
voluntarily set flair strings. Flair adoption is uneven across
subreddits and users, introducing selection bias. Abbreviations like
`\bt\b` may match unintended strings (e.g. US state TX).
3. **Deleted content:** Posts and comments marked `[deleted]` or
`[removed]` are excluded from modelling but counted separately.
These may disproportionately represent controversial content.
4. **Temporal coverage:** Coverage varies by subreddit. Some
communities only appear in later years (2024–2025); others span the
full 2019–2025 window.
5. **Platform-specific norms:** Moderation rules, flair conventions,
and posting styles differ across subreddits, which may shape topics
in ways that are not generalisable.
6. **Unobserved participants:** Lurkers, banned users, and deleted
accounts are not captured.
------------------------------------------------------------------------
## Output files (full_run_v1)
| File | Description |
|-----------------------|-------------------------------------------------|
| `topics/final_topic_summary.csv` | 50 reduced topics with size, keywords, representative texts |
| `topics/final_model_comparison.csv` | Coherence, diversity, outlier rate for all 9 runs |
| `topics/evaluation_table.csv` | Same as above, alternate format |
| `topics/bertopic_run_summary.csv` | Initial topic counts across 6 HDBSCAN configurations |
| `topics/topic_by_subreddit.csv` | Topic × subreddit document counts |
| `topics/topic_by_month.csv` | Topic × month document counts |
| `topics/topic_by_doctype.csv` | Topic × doc type (post/comment) |
| `topics/preprocessing_decisions.json` | Logged cleaning decisions |
| `topics/api/` | API-generated labels, summaries, category annotations |
| `bertopic/` | Saved BERTopic models + doc-topic parquet files for all 6 initial runs |
------------------------------------------------------------------------
## Contacts and citation
This analysis was conducted as part of a preliminary empirical study of
online gooning communities. If replicating, please cite the original
study and note the random seed, embedding model, and reduction target
used.