---
editor_options:
markdown:
wrap: 72
---
# Replication Guide – Grasping 'Gooning'
An observational analysis of Reddit gooning communities via BERTopic
topic modelling.
------------------------------------------------------------------------
## Project summary
This study empirically analyses publicly available Reddit subreddits
dedicated to gooning (prolonged, trancelike masturbation). The analysis
pipeline combines:
- **Descriptive statistics** – demographics (age, gender, sexuality
  from user flair), word frequencies, and substance mentions across 30
  subreddits
- **BERTopic topic modelling** – semantic clustering of ~2.35M
  documents (posts and comments) to identify the major themes and
  discourse patterns

**Full corpus:** 30 subreddits, ~22M raw records, 2019–2025
------------------------------------------------------------------------
## Repository structure
```
Goon/
├── REPLICATION.md            – this file
├── README.md                 – subreddit list
├── Goon.Rproj                – RStudio project (sets working directory)
├── data_prep.qmd             – Step 1: R – ingest CSVs, clean, export Parquet
├── goon_analysis.qmd         – Step 2: R – descriptive analyses
├── goon_topic_analysis.qmd   – Step 4: R – visualise BERTopic outputs
├── data/
│   ├── *.csv                   – raw Reddit exports (one file per subreddit per year)
│   ├── GOONED_comments.csv     – pre-concatenated GOONED comments (all years)
│   ├── GOONED_submissions.csv  – pre-concatenated GOONED submissions (all years)
│   ├── comments.parquet        – output of data_prep.qmd (18.2M rows)
│   ├── posts.parquet           – output of data_prep.qmd (3.8M rows)
│   ├── corpus_clean.parquet    – output of topic_analysis.qmd (modelling corpus)
│   └── corpus_deleted.parquet  – deleted/removed rows (tracked separately)
├── topic analysis/
│   ├── topic_analysis.qmd              – Step 3: Python – BERTopic pipeline
│   ├── topic_api_labeling.qmd          – Step 3b: Python – optional API-based topic labelling
│   ├── build_topic_results_summary.py  – helper: generate an HTML summary from run outputs
│   ├── run_topic_analysis.sh           – shell wrapper that sets env vars and calls quarto
│   ├── README.md                       – detailed pipeline documentation
│   ├── MEMORY_MANAGEMENT.md            – strategies for large-corpus runs
│   └── runs/
│       └── full_run_v1/    – complete output of the full corpus run
│           ├── bertopic/   – saved BERTopic models + doc-topic assignments
│           ├── topics/     – keyword tables, summaries, evaluation, API labels
│           └── figures/    – generated charts
└── .venv/                  – Python virtual environment (created per instructions below)
```
------------------------------------------------------------------------
## System requirements
### Hardware
| Component | Minimum | Recommended |
|----|----|----|
| RAM | 32 GB | 64 GB (full corpus run; see MEMORY_MANAGEMENT.md) |
| Disk | 50 GB free | 100 GB free |
| CPU | Any modern multi-core | 8+ cores for embedding generation |
| GPU | Not required | Optional – speeds up embedding by ~10× |
> The full corpus run was executed on a Linux VM with 96 GB RAM. A local
> Mac with 24 GB will work for the pilot/sample runs but may struggle
> with the full corpus embedding step.
### Software
| Tool | Version tested | Notes |
|----|----|----|
| R | ≥ 4.3 | For data_prep.qmd and goon_analysis.qmd |
| RStudio / Quarto CLI | ≥ 1.4 | To render .qmd files |
| Python | ≥ 3.10 | For topic_analysis.qmd |
| Quarto | ≥ 1.4 | Installed with RStudio or standalone |
------------------------------------------------------------------------
## Step-by-step replication
### Step 0: Clone / obtain the repository
The `data/` folder containing raw CSVs is required. These are large
files and are not distributed via git – they must be present locally.
Open `Goon.Rproj` in RStudio. This sets the working directory to the
project root so that `here::here()` paths resolve correctly.
------------------------------------------------------------------------
### Step 1: R data preparation (`data_prep.qmd`)
**Purpose:** Reads all raw CSVs, combines them into unified data frames,
applies minimal cleaning, and exports `data/comments.parquet` and
`data/posts.parquet`.
**R packages required:**
``` r
install.packages(c("dplyr", "tidyr", "tibble", "purrr",
"data.table", "arrow", "here"))
```
**Run:**
Open `data_prep.qmd` in RStudio and click **Render** (or run all chunks
in order).
Alternatively, from the terminal:
``` bash
cd /path/to/Goon   # your local copy of the repository
quarto render data_prep.qmd
```
**Expected outputs:**

- `data/comments.parquet` – ~18.2M rows, ~528 MB
- `data/posts.parquet` – ~3.8M rows, ~186 MB

**Time estimate:** 15–45 minutes depending on RAM and I/O speed (the
GOONED CSV files alone are ~7 GB).
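Because Step 3 (Python) reads these Parquet files, an optional sanity
check from Python can confirm they were written correctly. The snippet
below only inspects Parquet metadata and is not part of the pipeline;
the expected row counts are approximate.

``` python
# Optional sanity check of the Step 1 outputs (not part of the pipeline).
# Expected: comments.parquet ~18.2M rows, posts.parquet ~3.8M rows.
import pyarrow.parquet as pq

for path in ("data/comments.parquet", "data/posts.parquet"):
    meta = pq.ParquetFile(path).metadata
    print(f"{path}: {meta.num_rows:,} rows, {meta.num_columns} columns")
```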
------------------------------------------------------------------------
### Step 2: R descriptive analysis (`goon_analysis.qmd`)
**Purpose:** Demographic analysis (r/GOONEDmeetup flair), word frequency
analysis, substance mention counts.
**R packages required:**
``` r
install.packages(c("dplyr", "tidyr", "tibble", "purrr", "ggplot2",
"stringr", "here", "e1071", "tidytext",
"data.table", "arrow"))
```
**Run:**
Open `goon_analysis.qmd` in RStudio and click **Render**.
**Prerequisites:** `data/comments.parquet` and `data/posts.parquet` must
exist (Step 1).
**Expected outputs:**

- Rendered HTML with plots embedded
- In-memory: word count tables, substance mention counts, demographic counts
------------------------------------------------------------------------
### Step 3: Python BERTopic topic modelling (`topic analysis/topic_analysis.qmd`)
**Purpose:** Embeds all documents with `all-MiniLM-L6-v2`, clusters
them with HDBSCAN inside BERTopic, reduces the resulting topics using
c-TF-IDF agglomerative clustering, evaluates the candidate models, and
exports all topic outputs.
#### 3a. Create the Python virtual environment
``` bash
cd /path/to/Goon
python3 -m venv .venv
source .venv/bin/activate
pip install bertopic umap-learn hdbscan sentence-transformers \
scikit-learn pandas pyarrow quarto
```
**Key package versions (tested):**
| Package | Version |
|-----------------------|---------|
| bertopic | 0.16.x |
| umap-learn | 0.5.x |
| hdbscan | 0.8.x |
| sentence-transformers | 2.x |
| scikit-learn | 1.x |
| pandas | 2.x |
| pyarrow | 14.x |
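To confirm that the environment you just created matches these
versions, the standard library's `importlib.metadata` can report what
is actually installed:

``` python
# Print installed versions of the key packages (compare against the table above).
from importlib.metadata import version, PackageNotFoundError

for pkg in ("bertopic", "umap-learn", "hdbscan", "sentence-transformers",
            "scikit-learn", "pandas", "pyarrow"):
    try:
        print(f"{pkg:22s} {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg:22s} MISSING")
```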
#### 3b. Run the pipeline
**Pilot run (200k documents – recommended first):**
``` bash
cd /path/to/Goon
./topic\ analysis/run_topic_analysis.sh --max-docs 200000 --run-tag pilot_200k
```
Outputs land in `topic analysis/runs/pilot_200k/`.
**Full corpus run (~2.35M cleaned documents after filtering):**
``` bash
cd /path/to/Goon
./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v1
```
> **Warning:** The full run requires ~64 GB RAM for the embedding +
> UMAP stages. See `topic analysis/MEMORY_MANAGEMENT.md` for strategies
> if RAM is limited.
**Pipeline stages (automatically executed in order):**
1. **Data ingestion** – loads `data/posts.parquet` +
   `data/comments.parquet`
2. **Light cleaning** – deduplication, URL/username/subreddit
   anonymisation, markdown stripping; deleted/removed rows saved
   separately to `data/corpus_deleted.parquet`
3. **Embedding** – `all-MiniLM-L6-v2` (384-dim), batch_size=512, saved
   as shards; skips already-generated shards on resume (see the sketch
   after this list)
4. **UMAP** – pre-computed once, reused across all HDBSCAN configs
5. **BERTopic** – 6 configurations: min_cluster_size ∈ {50, 100, 200} ×
   method ∈ {eom, leaf}
6. **Topic reduction** – c-TF-IDF agglomerative clustering to 100, 50,
   and 25 topics
7. **Evaluation** – NPMI coherence, topic diversity, outlier rates
8. **Export** – keyword tables, representative docs, summary CSVs
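A minimal sketch of the shard-and-resume embedding pattern in stage 3,
assuming hypothetical shard sizes and paths (the actual values live in
`topic_analysis.qmd`):

``` python
# Sketch: embed documents in fixed-size shards and skip shards that already
# exist on disk, so an interrupted run can resume. Shard size and paths are
# illustrative, not the values used by the pipeline.
from pathlib import Path
import numpy as np
from sentence_transformers import SentenceTransformer

SHARD_SIZE = 100_000                                   # docs per shard (assumed)
shard_dir = Path("runs/pilot_200k/embedding_shards")   # hypothetical location
shard_dir.mkdir(parents=True, exist_ok=True)
model = SentenceTransformer("all-MiniLM-L6-v2")        # 384-dim embeddings

def embed_in_shards(docs):
    for start in range(0, len(docs), SHARD_SIZE):
        out = shard_dir / f"shard_{start // SHARD_SIZE:05d}.npy"
        if out.exists():                               # resume: shard already done
            continue
        emb = model.encode(docs[start:start + SHARD_SIZE], batch_size=512)
        np.save(out, emb.astype(np.float32))

# Downstream stages np.load the shards in order and np.vstack them.
```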
**Reproducibility:** Random seed = 42 throughout. A
`reproducibility_log.json` is written to the run folder with all
settings and package versions.
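For orientation, stages 4–6 boil down to the sketch below: UMAP is run
once and the reduced embeddings are shared across the six HDBSCAN
configurations. Parameter values follow the decisions table later in
this guide; the identity reducer, function name, and anything else not
stated there are assumptions, not the exact code in
`topic_analysis.qmd`.

``` python
# Sketch: precompute UMAP once, then fit BERTopic with each HDBSCAN
# configuration on the already-reduced embeddings. Simplified illustration only.
from itertools import product
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
from sklearn.feature_extraction.text import CountVectorizer

class IdentityReducer:
    """Stand-in reducer: the embeddings handed to BERTopic are already reduced."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X

def fit_all_configs(docs, embeddings):
    """docs: cleaned texts; embeddings: stacked MiniLM shard arrays."""
    reduced = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,  # min_dist assumed
                   metric="cosine", random_state=42).fit_transform(embeddings)
    vectorizer = CountVectorizer(lowercase=True, stop_words="english")
    models = {}
    for mcs, method in product((50, 100, 200), ("eom", "leaf")):
        clusterer = HDBSCAN(min_cluster_size=mcs, cluster_selection_method=method,
                            metric="euclidean", prediction_data=True)
        tm = BERTopic(umap_model=IdentityReducer(), hdbscan_model=clusterer,
                      vectorizer_model=vectorizer, calculate_probabilities=False)
        tm.fit_transform(docs, reduced)          # reuses the same UMAP output
        models[(mcs, method)] = tm
    return models
```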
#### 3c. Optional API-based topic labelling (`topic_api_labeling.qmd`)
Sends reduced topic summaries (keywords + representative texts) to an
LLM API to generate human-readable labels. Does NOT send raw corpus
text.
Set your API key, then render:
``` bash
export OPENAI_API_KEY="your-key-here"
quarto render "topic analysis/topic_api_labeling.qmd"
```
Outputs are saved to `topic analysis/runs/<run-tag>/topics/api/`.
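For orientation, here is a minimal sketch of the kind of request the
labelling step makes, assuming the official OpenAI Python client; the
actual prompt wording, model, and output format live in
`topic_api_labeling.qmd` and will differ.

``` python
# Sketch only: send a topic's keywords and a few representative excerpts to an
# LLM and get back a short label. Prompt text and model name are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def label_topic(keywords, representative_docs, model="gpt-4o-mini"):
    prompt = (
        "Propose a short (2-5 word) human-readable label for a discussion topic.\n"
        f"Keywords: {', '.join(keywords)}\n"
        "Representative excerpts:\n- " + "\n- ".join(representative_docs[:3])
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```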
------------------------------------------------------------------------
### Step 4: R topic visualisations (`goon_topic_analysis.qmd`)
**Purpose:** Loads the CSV/Parquet outputs from the BERTopic run and
produces exploratory visualisations: topic size bar chart, subreddit ×
topic heatmap, topic prevalence over time, post vs comment split,
representative documents.
**R packages required:**
``` r
install.packages(c("dplyr", "tidyr", "tibble", "purrr",
"ggplot2", "stringr", "here", "arrow"))
```
**Run:**
``` bash
quarto render goon_topic_analysis.qmd
```
**Prerequisites:** Step 3 must have completed and outputs must exist
under `topic analysis/runs/full_run_v1/topics/`.
------------------------------------------------------------------------
## Execution order summary
```
Step 1  – data_prep.qmd             (R)       ~30 min
Step 2  – goon_analysis.qmd         (R)       ~10 min
Step 3  – topic_analysis.qmd        (Python)  ~6–48 hours (full corpus)
Step 3b – topic_api_labeling.qmd    (Python)  ~5 min + API cost (optional)
Step 4  – goon_topic_analysis.qmd   (R)       ~2 min
```
Steps 2 and 3 are independent of each other and can run in parallel.
------------------------------------------------------------------------
## Key modelling decisions
| Decision | Choice | Rationale |
|-------------------------|--------------------|---------------------------|
| Embedding model | `all-MiniLM-L6-v2` | Fast, runs on CPU, 384-dim sufficient for topic structure |
| UMAP n_neighbors | 15 | BERTopic default; balances local vs global structure |
| UMAP n_components | 5 | Low enough for HDBSCAN to work well |
| HDBSCAN min_cluster_size | 50, 100, 200 | Tested all three; mcs=100 eom selected as reference |
| Topic reduction method | c-TF-IDF agglomerative | Merges semantically similar topics rather than splitting clusters |
| Reduction targets | 100, 50, 25 | 50 selected for reporting (NPMI=0.27, diversity=0.74) |
| Preprocessing | Minimal | Preserves informal language and slang; CountVectorizer handles casing/stopwords |
| Random seed | 42 | Applied to UMAP, HDBSCAN sampling, and document cap sampling |
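To illustrate the reduction targets concretely, BERTopic's built-in
reduction can be re-applied to a saved reference model. The model
filename and the text column below are assumptions; check the actual
names under `bertopic/` and in `corpus_clean.parquet`.

``` python
# Sketch: load the saved reference model (mcs=100, eom) and merge its topics
# down to one reporting target. Model path and column name are hypothetical.
import pandas as pd
from bertopic import BERTopic

model_path = "topic analysis/runs/full_run_v1/bertopic/model_mcs100_eom"  # assumed name
topic_model = BERTopic.load(model_path)
docs = pd.read_parquet("data/corpus_clean.parquet")["text"].tolist()      # assumed column

topic_model.reduce_topics(docs, nr_topics=50)   # targets 100 and 25 work the same way
print(topic_model.get_topic_info().head(10))
```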
------------------------------------------------------------------------
## Known issue: r/GOONED missing from BERTopic results
r/GOONED is the dominant subreddit in the corpus by a large margin:
| Subset | Count |
|---------------------|----------------|
| r/GOONED posts | 2,765,119 |
| r/GOONED comments | 15,493,075 |
| **r/GOONED total** | **18,258,194** |
| Full cleaned corpus | 22,011,124 |
| **r/GOONED share** | **82.9%** |
Despite this, r/GOONED is **entirely absent** from
`topic analysis/runs/full_run_v1/` outputs. The cloud VM that ran the
BERTopic pipeline did not have the `GOONED_comments.csv` and
`GOONED_submissions.csv` files available (most likely because their
combined size of ~7 GB made transfer impractical), and
`corpus_clean.parquet` on the VM was generated without them.
**Consequence:** All topic modelling results represent 29 subreddits
(3.75M records) rather than the full 30-subreddit corpus (22M records).
Topic proportions, dominant themes, and subreddit distribution tables
are therefore not representative of the full corpus.
**To fix:** Ensure the `GOONED_*.csv` files (or the pre-built
`comments.parquet` / `posts.parquet`) are available on the compute
environment, then re-run:
``` bash
./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v2
```
Note: the R-side analyses (`goon_analysis.qmd`) are **not** affected by
this issue β they read directly from the local `data/posts.parquet` and
`data/comments.parquet` files, which do contain r/GOONED data.
------------------------------------------------------------------------
## Known limitations
1. **Outlier rate:** 37.3% of documents are assigned to the outlier
topic (-1) by HDBSCAN. This is typical for short-text social media
corpora. Outliers are excluded from topic analyses but are retained
in the corpus parquet files.
2. **Flair ambiguity:** Gender/sexuality classifiers rely on
voluntarily set flair strings. Flair adoption is uneven across
subreddits and users, introducing selection bias. Short abbreviation
patterns such as `\bt\b` may match unintended strings (e.g. the US
state abbreviation TX).
3. **Deleted content:** Posts and comments marked `[deleted]` or
`[removed]` are excluded from modelling but counted separately.
These may disproportionately represent controversial content.
4. **Temporal coverage:** Coverage varies by subreddit. Some
communities only appear in later years (2024β2025); others span the
full 2019β2025 window.
5. **Platform-specific norms:** Moderation rules, flair conventions,
and posting styles differ across subreddits, which may shape topics
in ways that are not generalisable.
6. **Unobserved participants:** Lurkers, banned users, and deleted
accounts are not captured.
------------------------------------------------------------------------
## Output files (full_run_v1)
| File | Description |
|-----------------------|-------------------------------------------------|
| `topics/final_topic_summary.csv` | 50 reduced topics with size, keywords, representative texts |
| `topics/final_model_comparison.csv` | Coherence, diversity, outlier rate for all 9 runs |
| `topics/evaluation_table.csv` | Same as above, alternate format |
| `topics/bertopic_run_summary.csv` | Initial topic counts across 6 HDBSCAN configurations |
| `topics/topic_by_subreddit.csv` | Topic × subreddit document counts |
| `topics/topic_by_month.csv` | Topic × month document counts |
| `topics/topic_by_doctype.csv` | Topic × doc type (post/comment) |
| `topics/preprocessing_decisions.json` | Logged cleaning decisions |
| `topics/api/` | API-generated labels, summaries, category annotations |
| `bertopic/` | Saved BERTopic models + doc-topic parquet files for all 6 initial runs |
------------------------------------------------------------------------
## Contacts and citation
This analysis was conducted as part of a preliminary empirical study of
online gooning communities. If replicating, please cite the original
study and note the random seed, embedding model, and reduction target
used.