---
editor_options: 
  markdown: 
    wrap: 72
---

# Replication Guide – Grasping 'Gooning'

An observational analysis of Reddit gooning communities via BERTopic
topic modelling.

------------------------------------------------------------------------

## Project summary

This study empirically analyses publicly available Reddit subreddits
dedicated to gooning (prolonged, trancelike masturbation). The analysis
pipeline combines:

-   **Descriptive statistics** – demographics (age, gender, sexuality
    from user flair), word frequencies, and substance mentions across
    30 subreddits
-   **BERTopic topic modelling** – semantic clustering of \~2.35M
    documents (posts and comments) to identify the major themes and
    discourse patterns

**Full corpus:** 30 subreddits, \~22M raw records, 2019–2025

------------------------------------------------------------------------

## Repository structure

```         
Goon/
├── REPLICATION.md                 ← this file
├── README.md                      ← subreddit list
├── Goon.Rproj                     ← RStudio project (sets working directory)
├── data_prep.qmd                  ← Step 1: R – ingest CSVs, clean, export Parquet
├── goon_analysis.qmd              ← Step 2: R – descriptive analyses
├── goon_topic_analysis.qmd        ← Step 4: R – visualise BERTopic outputs
├── data/
│   ├── *.csv                      ← raw Reddit exports (one file per subreddit per year)
│   ├── GOONED_comments.csv        ← pre-concatenated GOONED comments (all years)
│   ├── GOONED_submissions.csv     ← pre-concatenated GOONED submissions (all years)
│   ├── comments.parquet           ← output of data_prep.qmd (18.2M rows)
│   ├── posts.parquet              ← output of data_prep.qmd (3.8M rows)
│   ├── corpus_clean.parquet       ← output of topic_analysis.qmd (modelling corpus)
│   └── corpus_deleted.parquet     ← deleted/removed rows (tracked separately)
├── topic analysis/
│   ├── topic_analysis.qmd         ← Step 3: Python – BERTopic pipeline
│   ├── topic_api_labeling.qmd     ← Step 3b: Python – optional API-based topic labelling
│   ├── build_topic_results_summary.py  ← helper: generate an HTML summary from run outputs
│   ├── run_topic_analysis.sh      ← shell wrapper that sets env vars and calls quarto
│   ├── README.md                  ← detailed pipeline documentation
│   ├── MEMORY_MANAGEMENT.md       ← strategies for large-corpus runs
│   └── runs/
│       └── full_run_v1/           ← complete output of the full corpus run
│           ├── bertopic/          ← saved BERTopic models + doc-topic assignments
│           ├── topics/            ← keyword tables, summaries, evaluation, API labels
│           └── figures/           ← generated charts
└── .venv/                         ← Python virtual environment (created per instructions below)
```

------------------------------------------------------------------------

## System requirements

### Hardware

| Component | Minimum | Recommended |
|----|----|----|
| RAM | 32 GB | 64 GB (full corpus run; see MEMORY_MANAGEMENT.md) |
| Disk | 50 GB free | 100 GB free |
| CPU | Any modern multi-core | 8+ cores for embedding generation |
| GPU | Not required | Optional – speeds up embedding by \~10× |

> The full corpus run was executed on a Linux VM with 96 GB RAM. A local
> Mac with 24 GB will work for the pilot/sample runs but may struggle
> with the full corpus embedding step.
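
If a GPU is available, the embedding stage can be pointed at it. The
sketch below is illustrative only and assumes PyTorch device detection;
the pipeline's own device handling may differ.

``` python
# Optional: run the MiniLM embedder on a GPU if one is detected.
# Illustrative only; the pipeline's device handling may differ.
import torch
from sentence_transformers import SentenceTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"
embedder = SentenceTransformer("all-MiniLM-L6-v2", device=device)
print(f"Embedding on: {device}")
```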

### Software

| Tool | Version tested | Notes |
|----|----|----|
| R | ≥ 4.3 | For data_prep.qmd and goon_analysis.qmd |
| RStudio / Quarto CLI | ≥ 1.4 | To render .qmd files |
| Python | ≥ 3.10 | For topic_analysis.qmd |
| Quarto | ≥ 1.4 | Installed with RStudio or standalone |

------------------------------------------------------------------------

## Step-by-step replication

### Step 0: Clone / obtain the repository

The `data/` folder containing raw CSVs is required. These are large
files and are not distributed via git – they must be present locally.

Open `Goon.Rproj` in RStudio. This sets the working directory to the
project root so that `here::here()` paths resolve correctly.

------------------------------------------------------------------------

### Step 1: R data preparation (`data_prep.qmd`)

**Purpose:** Reads all raw CSVs, combines them into unified data frames,
applies minimal cleaning, and exports `data/comments.parquet` and
`data/posts.parquet`.

**R packages required:**

``` r
install.packages(c("dplyr", "tidyr", "tibble", "purrr",
                   "data.table", "arrow", "here"))
```

**Run:**

Open `data_prep.qmd` in RStudio and click **Render** (or run all chunks
in order).

Alternatively, from the terminal:

``` bash
cd /Users/bkot7579/Desktop/Goon
quarto render data_prep.qmd
```

**Expected outputs:**

-   `data/comments.parquet` – \~18.2M rows, \~528 MB
-   `data/posts.parquet` – \~3.8M rows, \~186 MB

**Time estimate:** 15–45 minutes depending on RAM and I/O speed (the
GOONED CSV files alone are \~7 GB).
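
Once the Python environment from Step 3a exists, the exported Parquet
files can be sanity-checked against these row counts without loading
them fully. A minimal sketch using pyarrow, with paths relative to the
project root:

``` python
# Optional sanity check of Step 1 outputs: read row counts from the
# Parquet footers only (no full load). Expected counts are approximate.
import pyarrow.parquet as pq

for path, expected in [("data/comments.parquet", "~18.2M"),
                       ("data/posts.parquet", "~3.8M")]:
    n_rows = pq.ParquetFile(path).metadata.num_rows
    print(f"{path}: {n_rows:,} rows (expected {expected})")
```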

------------------------------------------------------------------------

### Step 2: R descriptive analysis (`goon_analysis.qmd`)

**Purpose:** Demographic analysis (r/GOONEDmeetup flair), word frequency
analysis, substance mention counts.

**R packages required:**

``` r
install.packages(c("dplyr", "tidyr", "tibble", "purrr", "ggplot2",
                   "stringr", "here", "e1071", "tidytext",
                   "data.table", "arrow"))
```

**Run:**

Open `goon_analysis.qmd` in RStudio and click **Render**.

**Prerequisites:** `data/comments.parquet` and `data/posts.parquet` must
exist (Step 1).

**Expected outputs:**

-   Rendered HTML with plots embedded
-   In-memory: word count tables, substance mention counts, demographic
    counts
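
The substance-mention counts are keyword tallies over the post and
comment text. The analysis itself is done in R inside
`goon_analysis.qmd`; the sketch below only illustrates the idea in
Python, with a hypothetical term list and an assumed `body` column
name.

``` python
# Illustrative keyword-tally sketch only; the real analysis, term list,
# and matching rules live in goon_analysis.qmd (R). The terms and the
# "body" column name below are placeholders/assumptions.
import re
import pandas as pd

comments = pd.read_parquet("data/comments.parquet", columns=["body"])
for term in ["poppers", "weed"]:  # hypothetical example terms
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    n = comments["body"].str.contains(pattern, na=False).sum()
    print(f"{term}: {n:,} comments with at least one mention")
```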

------------------------------------------------------------------------

### Step 3: Python BERTopic topic modelling (`topic analysis/topic_analysis.qmd`)

**Purpose:** Embeds all documents with `all-MiniLM-L6-v2`, clusters the
embeddings with HDBSCAN via BERTopic, reduces topics using c-TF-IDF
agglomerative clustering, evaluates the candidate models, and exports
all topic outputs.

#### 3a. Create the Python virtual environment

``` bash
cd /Users/bkot7579/Desktop/Goon
python3 -m venv .venv
source .venv/bin/activate

pip install bertopic umap-learn hdbscan sentence-transformers \
            scikit-learn pandas pyarrow quarto
```

**Key package versions (tested):**

| Package               | Version |
|-----------------------|---------|
| bertopic              | 0.16.x  |
| umap-learn            | 0.5.x   |
| hdbscan               | 0.8.x   |
| sentence-transformers | 2.x     |
| scikit-learn          | 1.x     |
| pandas                | 2.x     |
| pyarrow               | 14.x    |

#### 3b. Run the pipeline

**Pilot run (200k documents – recommended first):**

``` bash
cd /Users/bkot7579/Desktop/Goon
./topic\ analysis/run_topic_analysis.sh --max-docs 200000 --run-tag pilot_200k
```

Outputs land in `topic analysis/runs/pilot_200k/`.

**Full corpus run (\~2.35M cleaned documents after filtering):**

``` bash
cd /Users/bkot7579/Desktop/Goon
./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v1
```

> **Warning:** The full run requires \~64 GB RAM for the embedding +
> UMAP stages. See `topic analysis/MEMORY_MANAGEMENT.md` for strategies
> if RAM is limited.

**Pipeline stages (automatically executed in order):**

1.  **Data ingestion** – loads `data/posts.parquet` +
    `data/comments.parquet`
2.  **Light cleaning** – deduplication, URL/username/subreddit
    anonymisation, markdown stripping; deleted/removed rows saved
    separately to `data/corpus_deleted.parquet`
3.  **Embedding** – `all-MiniLM-L6-v2` (384-dim), batch_size=512, saved
    as shards; skips already-generated shards on resume
4.  **UMAP** – pre-computed once, reused across all HDBSCAN configs
5.  **BERTopic** – 6 configurations: min_cluster_size ∈ {50, 100, 200} ×
    method ∈ {eom, leaf} (see the sketch below)
6.  **Topic reduction** – c-TF-IDF agglomerative clustering to 100, 50,
    and 25 topics
7.  **Evaluation** – NPMI coherence, topic diversity, outlier rates
8.  **Export** – keyword tables, representative docs, summary CSVs

**Reproducibility:** Random seed = 42 throughout. A
`reproducibility_log.json` is written to the run folder with all
settings and package versions.
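
To make stages 3–6 concrete, here is a simplified single-configuration
sketch. It is illustrative only: the actual pipeline in
`topic analysis/topic_analysis.qmd` shards embeddings, reuses one
precomputed UMAP projection across configurations, and applies its own
agglomerative c-TF-IDF reduction, whereas this sketch uses BERTopic's
built-in `reduce_topics` as a stand-in and assumes a `text` column in
`corpus_clean.parquet`.

``` python
# Simplified sketch of stages 3-6 for one configuration (illustrative only).
import pandas as pd
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

docs = pd.read_parquet("data/corpus_clean.parquet")["text"].tolist()  # column name assumed

# Stage 3: 384-dim MiniLM embeddings
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(
    docs, batch_size=512, show_progress_bar=True)

# Stages 4-5: UMAP + HDBSCAN inside BERTopic
# (min_cluster_size=100 with "eom" selection is the reference configuration)
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=100, cluster_selection_method="eom",
                        prediction_data=True)
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model, verbose=True)
topics, _ = topic_model.fit_transform(docs, embeddings)

# Stage 6 (stand-in): merge similar topics down to 50 and export a summary
topic_model.reduce_topics(docs, nr_topics=50)
topic_model.get_topic_info().to_csv("topic_summary_sketch.csv", index=False)
```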

#### 3c. Optional API-based topic labelling (`topic_api_labeling.qmd`)

Sends reduced topic summaries (keywords + representative texts) to an
LLM API to generate human-readable labels. Does NOT send raw corpus
text.

Set your API key, then render:

``` bash
export OPENAI_API_KEY="your-key-here"
quarto render "topic analysis/topic_api_labeling.qmd"
```

Outputs are saved to `topic analysis/runs/<run-tag>/topics/api/`.
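
The prompts, batching, and output format are defined in the qmd itself.
As a rough illustration of the request shape only (the OpenAI Python
client and the model name below are assumptions, not confirmed
choices), labelling a single reduced topic might look like:

``` python
# Hypothetical single-topic labelling request (illustrative only; the real
# prompts, model choice, and batching live in topic_api_labeling.qmd).
import os
from openai import OpenAI  # assumes the OpenAI Python client is installed

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
keywords = ["keyword_1", "keyword_2", "keyword_3"]  # placeholder keywords
rep_docs = ["[representative text snippet]"]        # placeholder texts

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user",
               "content": ("Suggest a short, human-readable label for a topic "
                           f"with keywords {keywords} and representative "
                           f"texts {rep_docs}.")}],
)
print(response.choices[0].message.content)
```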

------------------------------------------------------------------------

### Step 4: R topic visualisations (`goon_topic_analysis.qmd`)

**Purpose:** Loads the CSV/Parquet outputs from the BERTopic run and
produces exploratory visualisations: topic size bar chart, subreddit ×
topic heatmap, topic prevalence over time, post vs comment split,
representative documents.

**R packages required:**

``` r
install.packages(c("dplyr", "tidyr", "tibble", "purrr",
                   "ggplot2", "stringr", "here", "arrow"))
```

**Run:**

``` bash
quarto render goon_topic_analysis.qmd
```

**Prerequisites:** Step 3 must have completed and outputs must exist
under `topic analysis/runs/full_run_v1/topics/`.

------------------------------------------------------------------------

## Execution order summary

```         
Step 1  →  data_prep.qmd            (R)       ~30 min
Step 2  →  goon_analysis.qmd        (R)       ~10 min
Step 3  →  topic_analysis.qmd       (Python)  ~6–48 hours (full corpus)
Step 3b →  topic_api_labeling.qmd   (Python)  ~5 min + API cost (optional)
Step 4  →  goon_topic_analysis.qmd  (R)       ~2 min
```

Steps 2 and 3 are independent of each other and can run in parallel.

------------------------------------------------------------------------

## Key modelling decisions

| Decision | Choice | Rationale |
|-------------------------|--------------------|---------------------------|
| Embedding model | `all-MiniLM-L6-v2` | Fast, runs on CPU, 384-dim sufficient for topic structure |
| UMAP n_neighbors | 15 | BERTopic default; balances local vs global structure |
| UMAP n_components | 5 | Low enough for HDBSCAN to work well |
| HDBSCAN min_cluster_size | 50, 100, 200 | Tested all three; mcs=100 eom selected as reference |
| Topic reduction method | c-TF-IDF agglomerative | Merges semantically similar topics rather than splitting clusters |
| Reduction targets | 100, 50, 25 | 50 selected for reporting (NPMI=0.27, diversity=0.74) |
| Preprocessing | Minimal | Preserves informal language and slang; CountVectorizer handles casing/stopwords |
| Random seed | 42 | Applied to UMAP, HDBSCAN sampling, and document cap sampling |
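
The "minimal preprocessing" decision works because cleaning is deferred
to the keyword-extraction step: the vectorizer handed to BERTopic
lowercases text and drops stopwords when building c-TF-IDF
representations, while the documents themselves stay raw. A sketch
under assumed settings (the exact `CountVectorizer` parameters in the
pipeline may differ):

``` python
# Illustrative only: documents are embedded raw, but keyword extraction
# handles casing and stopwords. Parameter values here (min_df, n-gram
# range) are assumptions, not the pipeline's exact settings.
from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

vectorizer_model = CountVectorizer(lowercase=True, stop_words="english",
                                   ngram_range=(1, 2), min_df=10)
topic_model = BERTopic(vectorizer_model=vectorizer_model)
```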

------------------------------------------------------------------------

## Known issue: r/GOONED missing from BERTopic results

r/GOONED is the dominant subreddit in the corpus by a large margin:

|                     | Count          |
|---------------------|----------------|
| r/GOONED posts      | 2,765,119      |
| r/GOONED comments   | 15,493,075     |
| **r/GOONED total**  | **18,258,194** |
| Full cleaned corpus | 22,011,124     |
| **r/GOONED share**  | **82.9%**      |

Despite this, r/GOONED is **entirely absent** from
`topic analysis/runs/full_run_v1/` outputs. The cloud VM that ran the
BERTopic pipeline did not have the `GOONED_comments.csv` and
`GOONED_submissions.csv` files available (most likely because their
combined size of \~7 GB made transfer impractical), and
`corpus_clean.parquet` on the VM was generated without them.

**Consequence:** All topic modelling results represent 29 subreddits
(3.75M records) rather than the full 30-subreddit corpus (22M records).
Topic proportions, dominant themes, and subreddit distribution tables
are therefore not representative of the full corpus.

**To fix:** Ensure the `GOONED_*.csv` files (or the pre-built
`comments.parquet` / `posts.parquet`) are available on the compute
environment, then re-run:

``` bash
./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v2
```

Note: the R-side analyses (`goon_analysis.qmd`) are **not** affected by
this issue β€” they read directly from the local `data/posts.parquet` and
`data/comments.parquet` files, which do contain r/GOONED data.
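
A quick way to confirm whether any given run included r/GOONED is to
check the exported per-subreddit table; a minimal check (the
`subreddit` column name is assumed):

``` python
# Check whether a run's subreddit breakdown includes r/GOONED.
# The "subreddit" column name is an assumption about the exported CSV.
import pandas as pd

tbl = pd.read_csv("topic analysis/runs/full_run_v1/topics/topic_by_subreddit.csv")
present = tbl["subreddit"].astype(str).str.contains("GOONED", case=False).any()
print("r/GOONED present in this run:", present)
```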

------------------------------------------------------------------------

## Known limitations

1.  **Outlier rate:** 37.3% of documents are assigned to the outlier
    topic (-1) by HDBSCAN. This is typical for short-text social media
    corpora. Outliers are excluded from topic analyses but are retained
    in the corpus parquet files.

2.  **Flair ambiguity:** Gender/sexuality classifiers rely on
    voluntarily set flair strings. Flair adoption is uneven across
    subreddits and users, introducing selection bias. Abbreviations like
    `\\bt\\b` may match unintended strings (e.g. US state TX).

3.  **Deleted content:** Posts and comments marked `[deleted]` or
    `[removed]` are excluded from modelling but counted separately.
    These may disproportionately represent controversial content.

4.  **Temporal coverage:** Coverage varies by subreddit. Some
    communities only appear in later years (2024–2025); others span the
    full 2019–2025 window.

5.  **Platform-specific norms:** Moderation rules, flair conventions,
    and posting styles differ across subreddits, which may shape topics
    in ways that are not generalisable.

6.  **Unobserved participants:** Lurkers, banned users, and deleted
    accounts are not captured.

------------------------------------------------------------------------

## Output files (full_run_v1)

| File | Description |
|-----------------------|-------------------------------------------------|
| `topics/final_topic_summary.csv` | 50 reduced topics with size, keywords, representative texts |
| `topics/final_model_comparison.csv` | Coherence, diversity, outlier rate for all 9 runs |
| `topics/evaluation_table.csv` | Same as above, alternate format |
| `topics/bertopic_run_summary.csv` | Initial topic counts across 6 HDBSCAN configurations |
| `topics/topic_by_subreddit.csv` | Topic × subreddit document counts |
| `topics/topic_by_month.csv` | Topic × month document counts |
| `topics/topic_by_doctype.csv` | Topic × doc type (post/comment) |
| `topics/preprocessing_decisions.json` | Logged cleaning decisions |
| `topics/api/` | API-generated labels, summaries, category annotations |
| `bertopic/` | Saved BERTopic models + doc-topic parquet files for all 6 initial runs |

------------------------------------------------------------------------

## Contacts and citation

This analysis was conducted as part of a preliminary empirical study of
online gooning communities. If replicating, please cite the original
study and note the random seed, embedding model, and reduction target
used.