---
editor_options:
markdown:
wrap: 72
---
# Replication Guide – Grasping 'Gooning'
An observational analysis of Reddit gooning communities via BERTopic
topic modelling.
------------------------------------------------------------------------
## Project summary
This study empirically analyses publicly available Reddit subreddits
dedicated to gooning (prolonged, trancelike masturbation). The analysis
pipeline combines:
- **Descriptive statistics** – demographics (age, gender, sexuality
  from user flair), word frequencies, and substance mentions across 30
  subreddits
- **BERTopic topic modelling** – semantic clustering of ~2.35M
  documents (posts and comments) to identify the major themes and
  discourse patterns

**Full corpus:** 30 subreddits, ~22M raw records, 2019–2025
------------------------------------------------------------------------
## Repository structure
```
Goon/
├── REPLICATION.md            – this file
├── README.md                 – subreddit list
├── Goon.Rproj                – RStudio project (sets working directory)
├── data_prep.qmd             – Step 1: R – ingest CSVs, clean, export Parquet
├── goon_analysis.qmd         – Step 2: R – descriptive analyses
├── goon_topic_analysis.qmd   – Step 4: R – visualise BERTopic outputs
├── data/
│   ├── *.csv                   – raw Reddit exports (one file per subreddit per year)
│   ├── GOONED_comments.csv     – pre-concatenated GOONED comments (all years)
│   ├── GOONED_submissions.csv  – pre-concatenated GOONED submissions (all years)
│   ├── comments.parquet        – output of data_prep.qmd (18.2M rows)
│   ├── posts.parquet           – output of data_prep.qmd (3.8M rows)
│   ├── corpus_clean.parquet    – output of topic_analysis.qmd (modelling corpus)
│   └── corpus_deleted.parquet  – deleted/removed rows (tracked separately)
├── topic analysis/
│   ├── topic_analysis.qmd              – Step 3: Python – BERTopic pipeline
│   ├── topic_api_labeling.qmd          – Step 3b: Python – optional API-based topic labelling
│   ├── build_topic_results_summary.py  – helper: generate an HTML summary from run outputs
│   ├── run_topic_analysis.sh           – shell wrapper that sets env vars and calls quarto
│   ├── README.md                       – detailed pipeline documentation
│   ├── MEMORY_MANAGEMENT.md            – strategies for large-corpus runs
│   └── runs/
│       └── full_run_v1/    – complete output of the full corpus run
│           ├── bertopic/   – saved BERTopic models + doc-topic assignments
│           ├── topics/     – keyword tables, summaries, evaluation, API labels
│           └── figures/    – generated charts
└── .venv/                  – Python virtual environment (created per instructions below)
```
------------------------------------------------------------------------
## System requirements
### Hardware
| Component | Minimum | Recommended |
|----|----|----|
| RAM | 32 GB | 64 GB (full corpus run; see MEMORY_MANAGEMENT.md) |
| Disk | 50 GB free | 100 GB free |
| CPU | Any modern multi-core | 8+ cores for embedding generation |
| GPU | Not required | Optional – speeds up embedding by ~10× |
> The full corpus run was executed on a Linux VM with 96 GB RAM. A local
> Mac with 24 GB will work for the pilot/sample runs but may struggle
> with the full corpus embedding step.
### Software
| Tool | Version tested | Notes |
|----|----|----|
| R | ≥ 4.3 | For data_prep.qmd and goon_analysis.qmd |
| RStudio / Quarto CLI | ≥ 1.4 | To render .qmd files |
| Python | ≥ 3.10 | For topic_analysis.qmd |
| Quarto | ≥ 1.4 | Installed with RStudio or standalone |
------------------------------------------------------------------------
## Step-by-step replication
### Step 0: Clone / obtain the repository
The `data/` folder containing raw CSVs is required. These are large
files and are not distributed via git – they must be present locally.
Open `Goon.Rproj` in RStudio. This sets the working directory to the
project root so that `here::here()` paths resolve correctly.
------------------------------------------------------------------------
### Step 1: R data preparation (`data_prep.qmd`)
**Purpose:** Reads all raw CSVs, combines them into unified data frames,
applies minimal cleaning, and exports `data/comments.parquet` and
`data/posts.parquet`.
**R packages required:**
``` r
install.packages(c("dplyr", "tidyr", "tibble", "purrr",
"data.table", "arrow", "here"))
```
**Run:**
Open `data_prep.qmd` in RStudio and click **Render** (or run all chunks
in order).
Alternatively, from the terminal:
``` bash
cd /path/to/Goon   # your local copy of the repository
quarto render data_prep.qmd
```
**Expected outputs:**

- `data/comments.parquet` – ~18.2M rows, ~528 MB
- `data/posts.parquet` – ~3.8M rows, ~186 MB

**Time estimate:** 15–45 minutes depending on RAM and I/O speed (the
GOONED CSV files alone are ~7 GB).
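Because Step 3 (Python) reads these Parquet files, an optional sanity
check from Python can confirm they were written correctly. The snippet
below only inspects Parquet metadata and is not part of the pipeline;
the expected row counts are approximate.

``` python
# Optional sanity check of the Step 1 outputs (not part of the pipeline).
# Expected: comments.parquet ~18.2M rows, posts.parquet ~3.8M rows.
import pyarrow.parquet as pq

for path in ("data/comments.parquet", "data/posts.parquet"):
    meta = pq.ParquetFile(path).metadata
    print(f"{path}: {meta.num_rows:,} rows, {meta.num_columns} columns")
```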
------------------------------------------------------------------------
### Step 2: R descriptive analysis (`goon_analysis.qmd`)
**Purpose:** Demographic analysis (r/GOONEDmeetup flair), word frequency
analysis, substance mention counts.
**R packages required:**
``` r
install.packages(c("dplyr", "tidyr", "tibble", "purrr", "ggplot2",
"stringr", "here", "e1071", "tidytext",
"data.table", "arrow"))
```
**Run:**
Open `goon_analysis.qmd` in RStudio and click **Render**.
**Prerequisites:** `data/comments.parquet` and `data/posts.parquet` must
exist (Step 1).
**Expected outputs:**

- Rendered HTML with plots embedded
- In-memory: word count tables, substance mention counts, demographic counts
------------------------------------------------------------------------
### Step 3: Python BERTopic topic modelling (`topic analysis/topic_analysis.qmd`)
**Purpose:** Embeds all documents with `all-MiniLM-L6-v2`, clusters
them with HDBSCAN inside BERTopic, reduces the resulting topics using
c-TF-IDF agglomerative clustering, evaluates the candidate models, and
exports all topic outputs.
#### 3a. Create the Python virtual environment
``` bash
cd /path/to/Goon
python3 -m venv .venv
source .venv/bin/activate
pip install bertopic umap-learn hdbscan sentence-transformers \
scikit-learn pandas pyarrow quarto
```
**Key package versions (tested):**
| Package | Version |
|-----------------------|---------|
| bertopic | 0.16.x |
| umap-learn | 0.5.x |
| hdbscan | 0.8.x |
| sentence-transformers | 2.x |
| scikit-learn | 1.x |
| pandas | 2.x |
| pyarrow | 14.x |
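To confirm that the environment you just created matches these
versions, the standard library's `importlib.metadata` can report what
is actually installed:

``` python
# Print installed versions of the key packages (compare against the table above).
from importlib.metadata import version, PackageNotFoundError

for pkg in ("bertopic", "umap-learn", "hdbscan", "sentence-transformers",
            "scikit-learn", "pandas", "pyarrow"):
    try:
        print(f"{pkg:22s} {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg:22s} MISSING")
```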
#### 3b. Run the pipeline
**Pilot run (200k documents – recommended first):**
``` bash
cd /path/to/Goon
./topic\ analysis/run_topic_analysis.sh --max-docs 200000 --run-tag pilot_200k
```
Outputs land in `topic analysis/runs/pilot_200k/`.
**Full corpus run (~2.35M cleaned documents after filtering):**
``` bash
cd /path/to/Goon
./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v1
```
> **Warning:** The full run requires ~64 GB RAM for the embedding +
> UMAP stages. See `topic analysis/MEMORY_MANAGEMENT.md` for strategies
> if RAM is limited.
**Pipeline stages (automatically executed in order):**
1. **Data ingestion** – loads `data/posts.parquet` +
   `data/comments.parquet`
2. **Light cleaning** – deduplication, URL/username/subreddit
   anonymisation, markdown stripping; deleted/removed rows saved
   separately to `data/corpus_deleted.parquet`
3. **Embedding** – `all-MiniLM-L6-v2` (384-dim), batch_size=512, saved
   as shards; skips already-generated shards on resume (see the sketch
   after this list)
4. **UMAP** – pre-computed once, reused across all HDBSCAN configs
5. **BERTopic** – 6 configurations: min_cluster_size ∈ {50, 100, 200} ×
   method ∈ {eom, leaf}
6. **Topic reduction** – c-TF-IDF agglomerative clustering to 100, 50,
   and 25 topics
7. **Evaluation** – NPMI coherence, topic diversity, outlier rates
8. **Export** – keyword tables, representative docs, summary CSVs
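A minimal sketch of the shard-and-resume embedding pattern in stage 3,
assuming hypothetical shard sizes and paths (the actual values live in
`topic_analysis.qmd`):

``` python
# Sketch: embed documents in fixed-size shards and skip shards that already
# exist on disk, so an interrupted run can resume. Shard size and paths are
# illustrative, not the values used by the pipeline.
from pathlib import Path
import numpy as np
from sentence_transformers import SentenceTransformer

SHARD_SIZE = 100_000                                   # docs per shard (assumed)
shard_dir = Path("runs/pilot_200k/embedding_shards")   # hypothetical location
shard_dir.mkdir(parents=True, exist_ok=True)
model = SentenceTransformer("all-MiniLM-L6-v2")        # 384-dim embeddings

def embed_in_shards(docs):
    for start in range(0, len(docs), SHARD_SIZE):
        out = shard_dir / f"shard_{start // SHARD_SIZE:05d}.npy"
        if out.exists():                               # resume: shard already done
            continue
        emb = model.encode(docs[start:start + SHARD_SIZE], batch_size=512)
        np.save(out, emb.astype(np.float32))

# Downstream stages np.load the shards in order and np.vstack them.
```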
**Reproducibility:** Random seed = 42 throughout. A
`reproducibility_log.json` is written to the run folder with all
settings and package versions.
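For orientation, stages 4–6 boil down to the sketch below: UMAP is run
once and the reduced embeddings are shared across the six HDBSCAN
configurations. Parameter values follow the decisions table later in
this guide; the identity reducer, function name, and anything else not
stated there are assumptions, not the exact code in
`topic_analysis.qmd`.

``` python
# Sketch: precompute UMAP once, then fit BERTopic with each HDBSCAN
# configuration on the already-reduced embeddings. Simplified illustration only.
from itertools import product
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

class IdentityReducer:
    """Stand-in reducer: the embeddings handed to BERTopic are already reduced."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X

def fit_all_configs(docs, embeddings):
    """docs: cleaned texts; embeddings: stacked MiniLM shard arrays."""
    reduced = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,  # min_dist assumed
                   metric="cosine", random_state=42).fit_transform(embeddings)
    vectorizer = CountVectorizer(lowercase=True, stop_words="english")
    models = {}
    for mcs, method in product((50, 100, 200), ("eom", "leaf")):
        clusterer = HDBSCAN(min_cluster_size=mcs, cluster_selection_method=method,
                            metric="euclidean", prediction_data=True)
        tm = BERTopic(umap_model=IdentityReducer(), hdbscan_model=clusterer,
                      vectorizer_model=vectorizer, calculate_probabilities=False)
        tm.fit_transform(docs, reduced)          # reuses the same UMAP output
        models[(mcs, method)] = tm
    return models
```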
#### 3c. Optional API-based topic labelling (`topic_api_labeling.qmd`)
Sends reduced topic summaries (keywords + representative texts) to an
LLM API to generate human-readable labels. Does NOT send raw corpus
text.
Set your API key, then render:
``` bash
export OPENAI_API_KEY="your-key-here"
quarto render "topic analysis/topic_api_labeling.qmd"
```
Outputs are saved to `topic analysis/runs/<run-tag>/topics/api/`.
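For orientation, here is a minimal sketch of the kind of request the
labelling step makes, assuming the official OpenAI Python client; the
actual prompt wording, model, and output format live in
`topic_api_labeling.qmd` and will differ.

``` python
# Sketch only: send a topic's keywords and a few representative excerpts to an
# LLM and get back a short label. Prompt text and model name are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def label_topic(keywords, representative_docs, model="gpt-4o-mini"):
    prompt = (
        "Propose a short (2-5 word) human-readable label for a discussion topic.\n"
        f"Keywords: {', '.join(keywords)}\n"
        "Representative excerpts:\n- " + "\n- ".join(representative_docs[:3])
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```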
------------------------------------------------------------------------
### Step 4: R topic visualisations (`goon_topic_analysis.qmd`)
**Purpose:** Loads the CSV/Parquet outputs from the BERTopic run and
produces exploratory visualisations: topic size bar chart, subreddit ×
topic heatmap, topic prevalence over time, post vs comment split,
representative documents.
**R packages required:**
``` r
install.packages(c("dplyr", "tidyr", "tibble", "purrr",
"ggplot2", "stringr", "here", "arrow"))
```
**Run:**
``` bash
quarto render goon_topic_analysis.qmd
```
**Prerequisites:** Step 3 must have completed and outputs must exist
under `topic analysis/runs/full_run_v1/topics/`.
------------------------------------------------------------------------
## Execution order summary
```
Step 1  – data_prep.qmd             (R)       ~30 min
Step 2  – goon_analysis.qmd         (R)       ~10 min
Step 3  – topic_analysis.qmd        (Python)  ~6–48 hours (full corpus)
Step 3b – topic_api_labeling.qmd    (Python)  ~5 min + API cost (optional)
Step 4  – goon_topic_analysis.qmd   (R)       ~2 min
```
Steps 2 and 3 are independent of each other and can run in parallel.
------------------------------------------------------------------------
## Key modelling decisions
| Decision | Choice | Rationale |
|-------------------------|--------------------|---------------------------|
| Embedding model | `all-MiniLM-L6-v2` | Fast, runs on CPU, 384-dim sufficient for topic structure |
| UMAP n_neighbors | 15 | BERTopic default; balances local vs global structure |
| UMAP n_components | 5 | Low enough for HDBSCAN to work well |
| HDBSCAN min_cluster_size | 50, 100, 200 | Tested all three; mcs=100 eom selected as reference |
| Topic reduction method | c-TF-IDF agglomerative | Merges semantically similar topics rather than splitting clusters |
| Reduction targets | 100, 50, 25 | 50 selected for reporting (NPMI=0.27, diversity=0.74) |
| Preprocessing | Minimal | Preserves informal language and slang; CountVectorizer handles casing/stopwords |
| Random seed | 42 | Applied to UMAP, HDBSCAN sampling, and document cap sampling |
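To illustrate the reduction targets concretely, BERTopic's built-in
reduction can be re-applied to a saved reference model. The model
filename and the text column below are assumptions; check the actual
names under `bertopic/` and in `corpus_clean.parquet`.

``` python
# Sketch: load the saved reference model (mcs=100, eom) and merge its topics
# down to one reporting target. Model path and column name are hypothetical.
import pandas as pd
from bertopic import BERTopic

model_path = "topic analysis/runs/full_run_v1/bertopic/model_mcs100_eom"  # assumed name
topic_model = BERTopic.load(model_path)
docs = pd.read_parquet("data/corpus_clean.parquet")["text"].tolist()      # assumed column

topic_model.reduce_topics(docs, nr_topics=50)   # targets 100 and 25 work the same way
print(topic_model.get_topic_info().head(10))
```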
------------------------------------------------------------------------
## Known issue: r/GOONED missing from BERTopic results
r/GOONED is the dominant subreddit in the corpus by a large margin:
| Subset | Count |
|---------------------|----------------|
| r/GOONED posts | 2,765,119 |
| r/GOONED comments | 15,493,075 |
| **r/GOONED total** | **18,258,194** |
| Full cleaned corpus | 22,011,124 |
| **r/GOONED share** | **82.9%** |
Despite this, r/GOONED is **entirely absent** from
`topic analysis/runs/full_run_v1/` outputs. The cloud VM that ran the
BERTopic pipeline did not have the `GOONED_comments.csv` and
`GOONED_submissions.csv` files available (most likely because their
combined size of ~7 GB made transfer impractical), and
`corpus_clean.parquet` on the VM was generated without them.
**Consequence:** All topic modelling results represent 29 subreddits
(3.75M records) rather than the full 30-subreddit corpus (22M records).
Topic proportions, dominant themes, and subreddit distribution tables
are therefore not representative of the full corpus.
**To fix:** Ensure the `GOONED_*.csv` files (or the pre-built
`comments.parquet` / `posts.parquet`) are available on the compute
environment, then re-run:
``` bash
./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v2
```
Note: the R-side analyses (`goon_analysis.qmd`) are **not** affected by
this issue β they read directly from the local `data/posts.parquet` and
`data/comments.parquet` files, which do contain r/GOONED data.
------------------------------------------------------------------------
## Known limitations
1. **Outlier rate:** 37.3% of documents are assigned to the outlier
topic (-1) by HDBSCAN. This is typical for short-text social media
corpora. Outliers are excluded from topic analyses but are retained
in the corpus parquet files.
2. **Flair ambiguity:** Gender/sexuality classifiers rely on
voluntarily set flair strings. Flair adoption is uneven across
subreddits and users, introducing selection bias. Short abbreviation
patterns such as `\bt\b` may match unintended strings (e.g. the US
state abbreviation TX).
3. **Deleted content:** Posts and comments marked `[deleted]` or
`[removed]` are excluded from modelling but counted separately.
These may disproportionately represent controversial content.
4. **Temporal coverage:** Coverage varies by subreddit. Some
communities only appear in later years (2024β2025); others span the
full 2019β2025 window.
5. **Platform-specific norms:** Moderation rules, flair conventions,
and posting styles differ across subreddits, which may shape topics
in ways that are not generalisable.
6. **Unobserved participants:** Lurkers, banned users, and deleted
accounts are not captured.
------------------------------------------------------------------------
## Output files (full_run_v1)
| File | Description |
|-----------------------|-------------------------------------------------|
| `topics/final_topic_summary.csv` | 50 reduced topics with size, keywords, representative texts |
| `topics/final_model_comparison.csv` | Coherence, diversity, outlier rate for all 9 runs |
| `topics/evaluation_table.csv` | Same as above, alternate format |
| `topics/bertopic_run_summary.csv` | Initial topic counts across 6 HDBSCAN configurations |
| `topics/topic_by_subreddit.csv` | Topic × subreddit document counts |
| `topics/topic_by_month.csv` | Topic × month document counts |
| `topics/topic_by_doctype.csv` | Topic × doc type (post/comment) |
| `topics/preprocessing_decisions.json` | Logged cleaning decisions |
| `topics/api/` | API-generated labels, summaries, category annotations |
| `bertopic/` | Saved BERTopic models + doc-topic parquet files for all 6 initial runs |
------------------------------------------------------------------------
## Contacts and citation
This analysis was conducted as part of a preliminary empirical study of
online gooning communities. If replicating, please cite the original
study and note the random seed, embedding model, and reduction target
used.