---
editor_options:
  markdown:
    wrap: 72
---

# Replication Guide — Grasping 'Gooning'

An observational analysis of Reddit gooning communities via BERTopic
topic modelling.

------------------------------------------------------------------------

## Project summary

This study empirically analyses publicly available Reddit subreddits
dedicated to gooning (prolonged, trancelike masturbation). The analysis
pipeline combines:

- **Descriptive statistics** — demographics (age, gender, sexuality
  from user flair), word frequencies, and substance mentions across
  30 subreddits
- **BERTopic topic modelling** — semantic clustering of \~2.35M
  documents (posts and comments) to identify the major themes and
  discourse patterns

**Full corpus:** 30 subreddits, \~22M raw records, 2019–2025

------------------------------------------------------------------------

## Repository structure

```
Goon/
├── REPLICATION.md                     ← this file
├── README.md                          ← subreddit list
├── Goon.Rproj                         ← RStudio project (sets working directory)
├── data_prep.qmd                      ← Step 1: R — ingest CSVs, clean, export Parquet
├── goon_analysis.qmd                  ← Step 2: R — descriptive analyses
├── goon_topic_analysis.qmd            ← Step 4: R — visualise BERTopic outputs
├── data/
│   ├── *.csv                          ← raw Reddit exports (one file per subreddit per year)
│   ├── GOONED_comments.csv            ← pre-concatenated GOONED comments (all years)
│   ├── GOONED_submissions.csv         ← pre-concatenated GOONED submissions (all years)
│   ├── comments.parquet               ← output of data_prep.qmd (18.2M rows)
│   ├── posts.parquet                  ← output of data_prep.qmd (3.8M rows)
│   ├── corpus_clean.parquet           ← output of topic_analysis.qmd (modelling corpus)
│   └── corpus_deleted.parquet         ← deleted/removed rows (tracked separately)
├── topic analysis/
│   ├── topic_analysis.qmd             ← Step 3: Python — BERTopic pipeline
│   ├── topic_api_labeling.qmd         ← Step 3b: Python — optional API-based topic labelling
│   ├── build_topic_results_summary.py ← helper: generate an HTML summary from run outputs
│   ├── run_topic_analysis.sh          ← shell wrapper that sets env vars and calls quarto
│   ├── README.md                      ← detailed pipeline documentation
│   ├── MEMORY_MANAGEMENT.md           ← strategies for large-corpus runs
│   └── runs/
│       └── full_run_v1/               ← complete output of the full corpus run
│           ├── bertopic/              ← saved BERTopic models + doc-topic assignments
│           ├── topics/                ← keyword tables, summaries, evaluation, API labels
│           └── figures/               ← generated charts
└── .venv/                             ← Python virtual environment (created per instructions below)
```

------------------------------------------------------------------------

## System requirements

### Hardware

| Component | Minimum | Recommended |
|----|----|----|
| RAM | 32 GB | 64 GB (full corpus run; see MEMORY_MANAGEMENT.md) |
| Disk | 50 GB free | 100 GB free |
| CPU | Any modern multi-core | 8+ cores for embedding generation |
| GPU | Not required | Optional — speeds up embedding by \~10× |

> The full corpus run was executed on a Linux VM with 96 GB RAM. A local
> Mac with 24 GB will work for the pilot/sample runs but may struggle
> with the full corpus embedding step.

### Software

| Tool | Version tested | Notes |
|----|----|----|
| R | ≥ 4.3 | For data_prep.qmd and goon_analysis.qmd |
| RStudio / Quarto CLI | ≥ 1.4 | To render .qmd files |
| Python | ≥ 3.10 | For topic_analysis.qmd |
| Quarto | ≥ 1.4 | Installed with RStudio or standalone |

------------------------------------------------------------------------

## Step-by-step replication

### Step 0: Clone / obtain the repository

The `data/` folder containing raw CSVs is required. These are large
files and are not distributed via git — they must be present locally.

Open `Goon.Rproj` in RStudio. This sets the working directory to the
project root so that `here::here()` paths resolve correctly.
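Optionally, confirm the raw exports are actually in place before
rendering anything. A minimal sketch using only the Python standard
library (run from the project root; the `GOONED_*.csv` names come from
the repository structure above):

``` python
# Quick presence check for the raw Reddit exports (run from the project root).
from pathlib import Path

data_dir = Path("data")
csvs = sorted(data_dir.glob("*.csv"))
print(f"{len(csvs)} raw CSV files found in data/")

# The pre-concatenated r/GOONED exports are the largest and the easiest to miss:
for name in ("GOONED_comments.csv", "GOONED_submissions.csv"):
    status = "OK" if (data_dir / name).exists() else "MISSING"
    print(f"{name}: {status}")
```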
------------------------------------------------------------------------

### Step 1: R data preparation (`data_prep.qmd`)

**Purpose:** Reads all raw CSVs, combines them into unified data
frames, applies minimal cleaning, and exports `data/comments.parquet`
and `data/posts.parquet`.

**R packages required:**

``` r
install.packages(c("dplyr", "tidyr", "tibble", "purrr", "data.table",
                   "arrow", "here"))
```

**Run:** Open `data_prep.qmd` in RStudio and click **Render** (or run
all chunks in order). Alternatively, from the terminal:

``` bash
cd /Users/bkot7579/Desktop/Goon
quarto render data_prep.qmd
```

**Expected outputs:**

- `data/comments.parquet` — \~18.2M rows, \~528 MB
- `data/posts.parquet` — \~3.8M rows, \~186 MB

**Time estimate:** 15–45 minutes depending on RAM and I/O speed (the
GOONED CSV files alone are \~7 GB).

------------------------------------------------------------------------

### Step 2: R descriptive analysis (`goon_analysis.qmd`)

**Purpose:** Demographic analysis (r/GOONEDmeetup flair), word
frequency analysis, substance mention counts.

**R packages required:**

``` r
install.packages(c("dplyr", "tidyr", "tibble", "purrr", "ggplot2",
                   "stringr", "here", "e1071", "tidytext", "data.table",
                   "arrow"))
```

**Run:** Open `goon_analysis.qmd` in RStudio and click **Render**.

**Prerequisites:** `data/comments.parquet` and `data/posts.parquet`
must exist (Step 1).

**Expected outputs:**

- Rendered HTML with plots embedded
- In-memory: word count tables, substance mention counts, demographic
  counts

------------------------------------------------------------------------

### Step 3: Python BERTopic topic modelling (`topic analysis/topic_analysis.qmd`)

**Purpose:** Embeds all documents with `all-MiniLM-L6-v2`, runs HDBSCAN
topic modelling via BERTopic, reduces topics using c-TF-IDF
agglomerative clustering, evaluates models, and exports all topic
outputs.

#### 3a. Create the Python virtual environment

``` bash
cd /Users/bkot7579/Desktop/Goon
python3 -m venv .venv
source .venv/bin/activate
pip install bertopic umap-learn hdbscan sentence-transformers \
    scikit-learn pandas pyarrow quarto
```

**Key package versions (tested):**

| Package | Version |
|-----------------------|---------|
| bertopic | 0.16.x |
| umap-learn | 0.5.x |
| hdbscan | 0.8.x |
| sentence-transformers | 2.x |
| scikit-learn | 1.x |
| pandas | 2.x |
| pyarrow | 14.x |

#### 3b. Run the pipeline

**Pilot run (200k documents — recommended first):**

``` bash
cd /Users/bkot7579/Desktop/Goon
./topic\ analysis/run_topic_analysis.sh --max-docs 200000 --run-tag pilot_200k
```

Outputs land in `topic analysis/runs/pilot_200k/`.

**Full corpus run (\~2.35M cleaned documents after filtering):**

``` bash
cd /Users/bkot7579/Desktop/Goon
./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v1
```

> **Warning:** The full run requires \~64 GB RAM for the embedding +
> UMAP stages. See `topic analysis/MEMORY_MANAGEMENT.md` for strategies
> if RAM is limited.

**Pipeline stages (automatically executed in order; a minimal
configuration sketch follows this subsection):**

1. **Data ingestion** — loads `data/posts.parquet` +
   `data/comments.parquet`
2. **Light cleaning** — deduplication, URL/username/subreddit
   anonymisation, markdown stripping; deleted/removed rows saved
   separately to `data/corpus_deleted.parquet`
3. **Embedding** — `all-MiniLM-L6-v2` (384-dim), batch_size=512, saved
   as shards; skips already-generated shards on resume
4. **UMAP** — pre-computed once, reused across all HDBSCAN configs
5. **BERTopic** — 6 configurations: min_cluster_size ∈ {50, 100, 200}
   × method ∈ {eom, leaf}
6. **Topic reduction** — c-TF-IDF agglomerative clustering to 100, 50,
   and 25 topics
7. **Evaluation** — NPMI coherence, topic diversity, outlier rates
8. **Export** — keyword tables, representative docs, summary CSVs

**Reproducibility:** Random seed = 42 throughout. A
`reproducibility_log.json` is written to the run folder with all
settings and package versions.
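For orientation, here is a rough, hedged sketch of what the reference
configuration above (min_cluster_size=100, `eom`, seed 42) looks like
in code. It is not the pipeline itself: the `text` column name is an
assumption, and the authoritative settings live in
`topic analysis/topic_analysis.qmd`.

``` python
# Minimal sketch of the reference configuration (min_cluster_size=100, eom).
# Illustrative only: the "text" column name and paths are assumptions.
import pandas as pd
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

docs = pd.read_parquet("data/corpus_clean.parquet")["text"].astype(str).tolist()

# Stage 3: embeddings (384-dim MiniLM, batch_size=512)
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(docs, batch_size=512, show_progress_bar=True)

# Stages 4-5: UMAP + HDBSCAN with the documented settings and seed
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=100, cluster_selection_method="eom",
                        prediction_data=True)
vectorizer = CountVectorizer(stop_words="english", lowercase=True)

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model,
                       vectorizer_model=vectorizer, calculate_probabilities=False)
topics, _ = topic_model.fit_transform(docs, embeddings)

# Stage 6: merge to ~50 topics (stand-in for the c-TF-IDF agglomerative reduction)
topic_model.reduce_topics(docs, nr_topics=50)
print(topic_model.get_topic_info().head())
```

The real pipeline additionally shards embeddings, reuses one UMAP
projection across all six configurations, and writes the evaluation and
export files listed at the end of this guide.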
#### 3c. Optional API-based topic labelling (`topic_api_labeling.qmd`)

Sends reduced topic summaries (keywords + representative texts) to an
LLM API to generate human-readable labels. Does NOT send raw corpus
text.

Set your API key, then render:

``` bash
export OPENAI_API_KEY="your-key-here"
quarto render "topic analysis/topic_api_labeling.qmd"
```

Outputs are saved to `topic analysis/runs/<run-tag>/topics/api/`.

------------------------------------------------------------------------

### Step 4: R topic visualisations (`goon_topic_analysis.qmd`)

**Purpose:** Loads the CSV/Parquet outputs from the BERTopic run and
produces exploratory visualisations: topic size bar chart, subreddit ×
topic heatmap, topic prevalence over time, post vs comment split,
representative documents.

**R packages required:**

``` r
install.packages(c("dplyr", "tidyr", "tibble", "purrr", "ggplot2",
                   "stringr", "here", "arrow"))
```

**Run:**

``` bash
quarto render goon_topic_analysis.qmd
```

**Prerequisites:** Step 3 must have completed and outputs must exist
under `topic analysis/runs/full_run_v1/topics/`.
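Before rendering, it can help to confirm that the expected run outputs
exist and parse. A small hedged check (file names follow the output
table at the end of this guide; column names may differ by run):

``` python
# Sanity-check the BERTopic outputs that goon_topic_analysis.qmd expects.
# Paths follow the run layout above; printed column names may differ per run.
from pathlib import Path
import pandas as pd

run_topics = Path("topic analysis/runs/full_run_v1/topics")
for fname in ("final_topic_summary.csv", "topic_by_subreddit.csv", "topic_by_month.csv"):
    path = run_topics / fname
    if path.exists():
        df = pd.read_csv(path)
        print(f"{fname}: {df.shape[0]} rows, columns = {list(df.columns)}")
    else:
        print(f"{fname}: MISSING")
```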
------------------------------------------------------------------------

## Execution order summary

```
Step 1  → data_prep.qmd            (R)       ~30 min
Step 2  → goon_analysis.qmd        (R)       ~10 min
Step 3  → topic_analysis.qmd       (Python)  ~6–48 hours (full corpus)
Step 3b → topic_api_labeling.qmd   (Python)  ~5 min + API cost (optional)
Step 4  → goon_topic_analysis.qmd  (R)       ~2 min
```

Steps 2 and 3 are independent of each other and can run in parallel.

------------------------------------------------------------------------

## Key modelling decisions

| Decision | Choice | Rationale |
|-------------------------|--------------------|---------------------------|
| Embedding model | `all-MiniLM-L6-v2` | Fast, runs on CPU, 384-dim sufficient for topic structure |
| UMAP n_neighbors | 15 | BERTopic default; balances local vs global structure |
| UMAP n_components | 5 | Low enough for HDBSCAN to work well |
| HDBSCAN min_cluster_size | 50, 100, 200 | Tested all three; mcs=100 eom selected as reference |
| Topic reduction method | c-TF-IDF agglomerative | Merges semantically similar topics rather than splitting clusters |
| Reduction targets | 100, 50, 25 | 50 selected for reporting (NPMI=0.27, diversity=0.74) |
| Preprocessing | Minimal | Preserves informal language and slang; CountVectorizer handles casing/stopwords |
| Random seed | 42 | Applied to UMAP, HDBSCAN sampling, and document cap sampling |

------------------------------------------------------------------------

## Known issue: r/GOONED missing from BERTopic results

r/GOONED is the dominant subreddit in the corpus by a large margin:

|                     | Count          |
|---------------------|----------------|
| r/GOONED posts      | 2,765,119      |
| r/GOONED comments   | 15,493,075     |
| **r/GOONED total**  | **18,258,194** |
| Full cleaned corpus | 22,011,124     |
| **r/GOONED share**  | **82.9%**      |

Despite this, r/GOONED is **entirely absent** from
`topic analysis/runs/full_run_v1/` outputs. The cloud VM that ran the
BERTopic pipeline did not have the `GOONED_comments.csv` and
`GOONED_submissions.csv` files available (most likely because their
combined size of \~7 GB made transfer impractical), and
`corpus_clean.parquet` on the VM was generated without them.

**Consequence:** All topic modelling results represent 29 subreddits
(3.75M records) rather than the full 30-subreddit corpus (22M records).
Topic proportions, dominant themes, and subreddit distribution tables
are therefore not representative of the full corpus.

**To fix:** Ensure the `GOONED_*.csv` files (or the pre-built
`comments.parquet` / `posts.parquet`) are available on the compute
environment, then re-run:

``` bash
./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v2
```

Note: the R-side analyses (`goon_analysis.qmd`) are **not** affected by
this issue — they read directly from the local `data/posts.parquet` and
`data/comments.parquet` files, which do contain r/GOONED data.
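A quick way to diagnose this on the compute environment is to count
subreddits in the modelling corpus directly. A hedged sketch (the
`subreddit` column name is an assumption):

``` python
# Verify that r/GOONED is present in the corpus the pipeline will model.
# The "subreddit" column name is an assumption; adjust if the schema differs.
import pyarrow.parquet as pq

subs = pq.read_table("data/corpus_clean.parquet", columns=["subreddit"])
counts = subs.column("subreddit").to_pandas().value_counts()
print(counts.head(10))
print("r/GOONED present:", counts.index.str.contains("GOONED", case=False).any())
```

If r/GOONED does not appear in these counts, `corpus_clean.parquet` was
built without the `GOONED_*.csv` inputs, and the ingest/cleaning stages
need to re-run before `full_run_v2`.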
------------------------------------------------------------------------

## Known limitations

1. **Outlier rate:** 37.3% of documents are assigned to the outlier
   topic (-1) by HDBSCAN. This is typical for short-text social media
   corpora. Outliers are excluded from topic analyses but are retained
   in the corpus parquet files.
2. **Flair ambiguity:** Gender/sexuality classifiers rely on
   voluntarily set flair strings. Flair adoption is uneven across
   subreddits and users, introducing selection bias. Abbreviations like
   `\\bt\\b` may match unintended strings (e.g. US state TX).
3. **Deleted content:** Posts and comments marked `[deleted]` or
   `[removed]` are excluded from modelling but counted separately.
   These may disproportionately represent controversial content.
4. **Temporal coverage:** Coverage varies by subreddit. Some
   communities only appear in later years (2024–2025); others span the
   full 2019–2025 window.
5. **Platform-specific norms:** Moderation rules, flair conventions,
   and posting styles differ across subreddits, which may shape topics
   in ways that are not generalisable.
6. **Unobserved participants:** Lurkers, banned users, and deleted
   accounts are not captured.

------------------------------------------------------------------------

## Output files (full_run_v1)

| File | Description |
|-----------------------|-------------------------------------------------|
| `topics/final_topic_summary.csv` | 50 reduced topics with size, keywords, representative texts |
| `topics/final_model_comparison.csv` | Coherence, diversity, outlier rate for all 9 runs |
| `topics/evaluation_table.csv` | Same as above, alternate format |
| `topics/bertopic_run_summary.csv` | Initial topic counts across 6 HDBSCAN configurations |
| `topics/topic_by_subreddit.csv` | Topic × subreddit document counts |
| `topics/topic_by_month.csv` | Topic × month document counts |
| `topics/topic_by_doctype.csv` | Topic × doc type (post/comment) |
| `topics/preprocessing_decisions.json` | Logged cleaning decisions |
| `topics/api/` | API-generated labels, summaries, category annotations |
| `bertopic/` | Saved BERTopic models + doc-topic parquet files for all 6 initial runs |

------------------------------------------------------------------------

## Contacts and citation

This analysis was conducted as part of a preliminary empirical study of
online gooning communities. If replicating, please cite the original
study and note the random seed, embedding model, and reduction target
used.