---
editor_options:
  markdown:
    wrap: 72
---

# Replication Guide – Grasping 'Gooning'

An observational analysis of Reddit gooning communities via BERTopic
topic modelling.

------------------------------------------------------------------------

## Project summary

This study empirically analyses publicly available Reddit subreddits
dedicated to gooning (prolonged, trancelike masturbation). The analysis
pipeline combines:

- **Descriptive statistics** – demographics (age, gender, sexuality
  from user flair), word frequencies, and substance mentions across 30
  subreddits
- **BERTopic topic modelling** – semantic clustering of ~2.35M
  documents (posts and comments) to identify the major themes and
  discourse patterns

**Full corpus:** 30 subreddits, ~22M raw records, 2019–2025

------------------------------------------------------------------------

## Repository structure

```
Goon/
├── REPLICATION.md            – this file
├── README.md                 – subreddit list
├── Goon.Rproj                – RStudio project (sets working directory)
├── data_prep.qmd             – Step 1: R – ingest CSVs, clean, export Parquet
├── goon_analysis.qmd         – Step 2: R – descriptive analyses
├── goon_topic_analysis.qmd   – Step 4: R – visualise BERTopic outputs
├── data/
│   ├── *.csv                    – raw Reddit exports (one file per subreddit per year)
│   ├── GOONED_comments.csv      – pre-concatenated GOONED comments (all years)
│   ├── GOONED_submissions.csv   – pre-concatenated GOONED submissions (all years)
│   ├── comments.parquet         – output of data_prep.qmd (18.2M rows)
│   ├── posts.parquet            – output of data_prep.qmd (3.8M rows)
│   ├── corpus_clean.parquet     – output of topic_analysis.qmd (modelling corpus)
│   └── corpus_deleted.parquet   – deleted/removed rows (tracked separately)
├── topic analysis/
│   ├── topic_analysis.qmd              – Step 3: Python – BERTopic pipeline
│   ├── topic_api_labeling.qmd          – Step 3b: Python – optional API-based topic labelling
│   ├── build_topic_results_summary.py  – helper: generate an HTML summary from run outputs
│   ├── run_topic_analysis.sh           – shell wrapper that sets env vars and calls quarto
│   ├── README.md                       – detailed pipeline documentation
│   ├── MEMORY_MANAGEMENT.md            – strategies for large-corpus runs
│   └── runs/
│       └── full_run_v1/    – complete output of the full corpus run
│           ├── bertopic/   – saved BERTopic models + doc-topic assignments
│           ├── topics/     – keyword tables, summaries, evaluation, API labels
│           └── figures/    – generated charts
└── .venv/                  – Python virtual environment (created per instructions below)
```

------------------------------------------------------------------------

## System requirements

### Hardware

| Component | Minimum | Recommended |
|----|----|----|
| RAM | 32 GB | 64 GB (full corpus run; see MEMORY_MANAGEMENT.md) |
| Disk | 50 GB free | 100 GB free |
| CPU | Any modern multi-core | 8+ cores for embedding generation |
| GPU | Not required | Optional – speeds up embedding by ~10× |

> The full corpus run was executed on a Linux VM with 96 GB RAM. A local
> Mac with 24 GB will work for the pilot/sample runs but may struggle
> with the full corpus embedding step.

### Software

| Tool | Version tested | Notes |
|----|----|----|
| R | ≥ 4.3 | For data_prep.qmd and goon_analysis.qmd |
| RStudio / Quarto CLI | ≥ 1.4 | To render .qmd files |
| Python | ≥ 3.10 | For topic_analysis.qmd |
| Quarto | ≥ 1.4 | Installed with RStudio or standalone |

------------------------------------------------------------------------

## Step-by-step replication

### Step 0: Clone / obtain the repository

The `data/` folder containing raw CSVs is required. These are large
files and are not distributed via git – they must be present locally.

Open `Goon.Rproj` in RStudio. This sets the working directory to the
project root so that `here::here()` paths resolve correctly.

------------------------------------------------------------------------

### Step 1: R data preparation (`data_prep.qmd`)

**Purpose:** Reads all raw CSVs, combines them into unified data
frames, applies minimal cleaning, and exports `data/comments.parquet`
and `data/posts.parquet`.

**R packages required:**

``` r
install.packages(c("dplyr", "tidyr", "tibble", "purrr",
                   "data.table", "arrow", "here"))
```

**Run:**

Open `data_prep.qmd` in RStudio and click **Render** (or run all chunks
in order).

Alternatively, from the terminal:

``` bash
cd /path/to/Goon   # the project root
quarto render data_prep.qmd
```

**Expected outputs:**

- `data/comments.parquet` – ~18.2M rows, ~528 MB
- `data/posts.parquet` – ~3.8M rows, ~186 MB

**Time estimate:** 15–45 minutes depending on RAM and I/O speed (the
GOONED CSV files alone are ~7 GB).
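
For a quick sanity check on the exports, row counts can be read
straight from the Parquet metadata. A minimal sketch using pyarrow
(installed in Step 3a; R's `arrow` package exposes the same metadata):

``` python
import pyarrow.parquet as pq

# Row counts come from the file metadata, so this is cheap even for
# multi-GB files - no data pages are read.
for path in ("data/comments.parquet", "data/posts.parquet"):
    print(path, pq.ParquetFile(path).metadata.num_rows)
```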

------------------------------------------------------------------------

### Step 2: R descriptive analysis (`goon_analysis.qmd`)

**Purpose:** Demographic analysis (r/GOONEDmeetup flair), word
frequency analysis, substance mention counts.

**R packages required:**

``` r
install.packages(c("dplyr", "tidyr", "tibble", "purrr", "ggplot2",
                   "stringr", "here", "e1071", "tidytext",
                   "data.table", "arrow"))
```

**Run:**

Open `goon_analysis.qmd` in RStudio and click **Render**.

**Prerequisites:** `data/comments.parquet` and `data/posts.parquet`
must exist (Step 1).

**Expected outputs:**

- Rendered HTML with plots embedded
- In-memory: word count tables, substance mention counts, demographic
  counts

------------------------------------------------------------------------

### Step 3: Python BERTopic topic modelling (`topic analysis/topic_analysis.qmd`)

**Purpose:** Embeds all documents with `all-MiniLM-L6-v2`, runs HDBSCAN
topic modelling via BERTopic, reduces topics using c-TF-IDF
agglomerative clustering, evaluates models, and exports all topic
outputs.

#### 3a. Create the Python virtual environment

``` bash
cd /path/to/Goon
python3 -m venv .venv
source .venv/bin/activate

pip install bertopic umap-learn hdbscan sentence-transformers \
    scikit-learn pandas pyarrow quarto
```

**Key package versions (tested):**

| Package               | Version |
|-----------------------|---------|
| bertopic              | 0.16.x  |
| umap-learn            | 0.5.x   |
| hdbscan               | 0.8.x   |
| sentence-transformers | 2.x     |
| scikit-learn          | 1.x     |
| pandas                | 2.x     |
| pyarrow               | 14.x    |

#### 3b. Run the pipeline

**Pilot run (200k documents – recommended first):**

``` bash
cd /path/to/Goon
./topic\ analysis/run_topic_analysis.sh --max-docs 200000 --run-tag pilot_200k
```

Outputs land in `topic analysis/runs/pilot_200k/`.

**Full corpus run (~2.35M cleaned documents after filtering):**

``` bash
cd /path/to/Goon
./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v1
```

> **Warning:** The full run requires ~64 GB RAM for the embedding +
> UMAP stages. See `topic analysis/MEMORY_MANAGEMENT.md` for strategies
> if RAM is limited.

**Pipeline stages (automatically executed in order):**

1. **Data ingestion** – loads `data/posts.parquet` +
   `data/comments.parquet`
2. **Light cleaning** – deduplication, URL/username/subreddit
   anonymisation, markdown stripping; deleted/removed rows saved
   separately to `data/corpus_deleted.parquet`
3. **Embedding** – `all-MiniLM-L6-v2` (384-dim), batch_size=512, saved
   as shards; skips already-generated shards on resume
4. **UMAP** – pre-computed once, reused across all HDBSCAN configs
5. **BERTopic** – 6 configurations (sketched below):
   min_cluster_size ∈ {50, 100, 200} × method ∈ {eom, leaf}
6. **Topic reduction** – c-TF-IDF agglomerative clustering to 100, 50,
   and 25 topics
7. **Evaluation** – NPMI coherence, topic diversity, outlier rates
8. **Export** – keyword tables, representative docs, summary CSVs

**Reproducibility:** Random seed = 42 throughout. A
`reproducibility_log.json` is written to the run folder with all
settings and package versions.
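
To make stages 3–6 concrete, here is a minimal sketch of one stage-5
configuration (min_cluster_size = 100, eom), assuming the cleaned
corpus exposes a `text` column. The real pipeline shards embeddings to
disk and reuses a single UMAP projection across all six configurations,
so treat this as orientation rather than the authoritative code (that
lives in `topic analysis/topic_analysis.qmd`):

``` python
import pandas as pd
from bertopic import BERTopic
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer
from umap import UMAP

# Assumed column name; the actual schema is set in topic_analysis.qmd.
docs = pd.read_parquet("data/corpus_clean.parquet")["text"].tolist()

# Stage 3: 384-dim embeddings, computed once and reusable across runs.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(docs, batch_size=512, show_progress_bar=True)

# Stages 4-5: seeded UMAP feeding one HDBSCAN configuration.
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine",
                  random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=100,
                        cluster_selection_method="eom")

topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model,
                       verbose=True)
topics, _ = topic_model.fit_transform(docs, embeddings)

# Stage 6: merge down to 50 topics (BERTopic merges the most similar
# topics based on their c-TF-IDF representations).
topic_model.reduce_topics(docs, nr_topics=50)
```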

#### 3c. Optional API-based topic labelling (`topic_api_labeling.qmd`)

Sends reduced topic summaries (keywords + representative texts) to an
LLM API to generate human-readable labels. It does **not** send raw
corpus text.

Set your API key, then render:

``` bash
export OPENAI_API_KEY="your-key-here"
quarto render "topic analysis/topic_api_labeling.qmd"
```

Outputs are saved to `topic analysis/runs/<run-tag>/topics/api/`.
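
For orientation, a single labelling request can look like the
following. This is a hedged sketch assuming the OpenAI Python client;
the model name, prompt, and keyword set are illustrative, and the
actual request logic lives in `topic_api_labeling.qmd`:

``` python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

keywords = ["example", "keyword", "set"]  # placeholder keyword list
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{
        "role": "user",
        "content": ("Suggest a short, neutral label for a discussion "
                    f"topic with these keywords: {', '.join(keywords)}"),
    }],
)
print(response.choices[0].message.content)
```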

------------------------------------------------------------------------

### Step 4: R topic visualisations (`goon_topic_analysis.qmd`)

**Purpose:** Loads the CSV/Parquet outputs from the BERTopic run and
produces exploratory visualisations: topic size bar chart, subreddit ×
topic heatmap, topic prevalence over time, post vs comment split,
representative documents.

**R packages required:**

``` r
install.packages(c("dplyr", "tidyr", "tibble", "purrr",
                   "ggplot2", "stringr", "here", "arrow"))
```

**Run:**

``` bash
quarto render goon_topic_analysis.qmd
```

**Prerequisites:** Step 3 must have completed and outputs must exist
under `topic analysis/runs/full_run_v1/topics/`.

------------------------------------------------------------------------

## Execution order summary

```
Step 1  – data_prep.qmd            (R)       ~30 min
Step 2  – goon_analysis.qmd        (R)       ~10 min
Step 3  – topic_analysis.qmd       (Python)  ~6–48 hours (full corpus)
Step 3b – topic_api_labeling.qmd   (Python)  ~5 min + API cost (optional)
Step 4  – goon_topic_analysis.qmd  (R)       ~2 min
```

Steps 2 and 3 are independent of each other and can run in parallel.

------------------------------------------------------------------------

## Key modelling decisions

| Decision | Choice | Rationale |
|----|----|----|
| Embedding model | `all-MiniLM-L6-v2` | Fast, runs on CPU; 384 dimensions sufficient for topic structure |
| UMAP n_neighbors | 15 | BERTopic default; balances local vs global structure |
| UMAP n_components | 5 | Low enough for HDBSCAN to work well |
| HDBSCAN min_cluster_size | 50, 100, 200 | Tested all three; mcs=100 with eom selected as reference |
| Topic reduction method | c-TF-IDF agglomerative | Merges semantically similar topics rather than splitting clusters |
| Reduction targets | 100, 50, 25 | 50 selected for reporting (NPMI = 0.27, diversity = 0.74) |
| Preprocessing | Minimal | Preserves informal language and slang; CountVectorizer handles casing/stopwords |
| Random seed | 42 | Applied to UMAP, HDBSCAN sampling, and document cap sampling |
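
For reference, topic diversity (reported in the table above) is
conventionally computed as the share of unique words across each
topic's top-k keywords. A small sketch under that standard definition;
the run's own evaluation code lives in `topic_analysis.qmd`:

``` python
def topic_diversity(topic_keywords: list[list[str]], top_k: int = 10) -> float:
    """Share of unique words among the top-k keywords of every topic."""
    words = [w for kws in topic_keywords for w in kws[:top_k]]
    return len(set(words)) / len(words)

# A value of 0.74 means 74% of all top keywords appear in only one topic.
```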

------------------------------------------------------------------------

## Known issue: r/GOONED missing from BERTopic results

r/GOONED is the dominant subreddit in the corpus by a large margin:

|                     | Count          |
|---------------------|----------------|
| r/GOONED posts      | 2,765,119      |
| r/GOONED comments   | 15,493,075     |
| **r/GOONED total**  | **18,258,194** |
| Full cleaned corpus | 22,011,124     |
| **r/GOONED share**  | **82.9%**      |

Despite this, r/GOONED is **entirely absent** from the
`topic analysis/runs/full_run_v1/` outputs. The cloud VM that ran the
BERTopic pipeline did not have the `GOONED_comments.csv` and
`GOONED_submissions.csv` files available (most likely because their
combined size of ~7 GB made transfer impractical), and
`corpus_clean.parquet` on the VM was generated without them.

**Consequence:** All topic modelling results represent 29 subreddits
(3.75M records) rather than the full 30-subreddit corpus (22M records).
Topic proportions, dominant themes, and subreddit distribution tables
are therefore not representative of the full corpus.

**To fix:** Ensure the `GOONED_*.csv` files (or the pre-built
`comments.parquet` / `posts.parquet`) are available on the compute
environment, then re-run:

``` bash
./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v2
```
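
Before launching the new run, it is worth confirming that r/GOONED
actually made it into the modelling corpus. A quick sketch, assuming
`corpus_clean.parquet` carries a per-document `subreddit` column (the
actual column name is set in `topic_analysis.qmd`):

``` python
import pandas as pd

# Read only the subreddit column to keep memory use low.
subs = pd.read_parquet("data/corpus_clean.parquet", columns=["subreddit"])
counts = subs["subreddit"].value_counts()
print(counts.head(10))
print("GOONED present:", "GOONED" in counts.index)
```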

Note: the R-side analyses (`goon_analysis.qmd`) are **not** affected by
this issue – they read directly from the local `data/posts.parquet` and
`data/comments.parquet` files, which do contain r/GOONED data.

------------------------------------------------------------------------

## Known limitations

1. **Outlier rate:** 37.3% of documents are assigned to the outlier
   topic (-1) by HDBSCAN. This is typical for short-text social media
   corpora. Outliers are excluded from topic analyses but are retained
   in the corpus parquet files.

2. **Flair ambiguity:** Gender/sexuality classifiers rely on
   voluntarily set flair strings. Flair adoption is uneven across
   subreddits and users, introducing selection bias. Abbreviated
   patterns such as `\bt\b` may match unintended strings (e.g. the US
   state abbreviation TX).

3. **Deleted content:** Posts and comments marked `[deleted]` or
   `[removed]` are excluded from modelling but counted separately.
   These may disproportionately represent controversial content.

4. **Temporal coverage:** Coverage varies by subreddit. Some
   communities only appear in later years (2024–2025); others span the
   full 2019–2025 window.

5. **Platform-specific norms:** Moderation rules, flair conventions,
   and posting styles differ across subreddits, which may shape topics
   in ways that are not generalisable.

6. **Unobserved participants:** Lurkers, banned users, and deleted
   accounts are not captured.

------------------------------------------------------------------------

## Output files (full_run_v1)

| File | Description |
|-----------------------|-------------------------------------------------|
| `topics/final_topic_summary.csv` | 50 reduced topics with size, keywords, representative texts |
| `topics/final_model_comparison.csv` | Coherence, diversity, outlier rate for all 9 runs |
| `topics/evaluation_table.csv` | Same as above, alternate format |
| `topics/bertopic_run_summary.csv` | Initial topic counts across 6 HDBSCAN configurations |
| `topics/topic_by_subreddit.csv` | Topic × subreddit document counts |
| `topics/topic_by_month.csv` | Topic × month document counts |
| `topics/topic_by_doctype.csv` | Topic × doc type (post/comment) |
| `topics/preprocessing_decisions.json` | Logged cleaning decisions |
| `topics/api/` | API-generated labels, summaries, category annotations |
| `bertopic/` | Saved BERTopic models + doc-topic parquet files for all 6 initial runs |
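
For a first look at the headline table, something like the following
works; the exact column layout is whatever the pipeline exported:

``` python
import pandas as pd

summary = pd.read_csv(
    "topic analysis/runs/full_run_v1/topics/final_topic_summary.csv")
print(summary.shape)  # expect 50 rows, one per reduced topic
print(summary.head())
```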

------------------------------------------------------------------------

## Contacts and citation

This analysis was conducted as part of a preliminary empirical study of
online gooning communities. If replicating, please cite the original
study and note the random seed, embedding model, and reduction target
used.