---
editor_options:
markdown:
wrap: 72
---
# Replication Guide — Grasping 'Gooning'
An observational analysis of Reddit gooning communities via BERTopic
topic modelling.
------------------------------------------------------------------------
## Project summary
This study empirically analyses publicly available Reddit subreddits
dedicated to gooning (prolonged, trancelike masturbation). The analysis
pipeline combines:
- **Descriptive statistics** — demographics (age, gender, sexuality
  from user flair), word frequencies, and substance mentions across 30
  subreddits
- **BERTopic topic modelling** — semantic clustering of \~2.35M
  documents (posts and comments) to identify the major themes and
  discourse patterns
**Full corpus:** 30 subreddits, \~22M raw records, 2019–2025
------------------------------------------------------------------------
## Repository structure
```
Goon/
├── REPLICATION.md ← this file
├── README.md ← subreddit list
├── Goon.Rproj ← RStudio project (sets working directory)
├── data_prep.qmd ← Step 1: R — ingest CSVs, clean, export Parquet
├── goon_analysis.qmd ← Step 2: R — descriptive analyses
├── goon_topic_analysis.qmd ← Step 4: R — visualise BERTopic outputs
├── data/
│   ├── *.csv ← raw Reddit exports (one file per subreddit per year)
│   ├── GOONED_comments.csv ← pre-concatenated GOONED comments (all years)
│   ├── GOONED_submissions.csv ← pre-concatenated GOONED submissions (all years)
│   ├── comments.parquet ← output of data_prep.qmd (18.2M rows)
│   ├── posts.parquet ← output of data_prep.qmd (3.8M rows)
│   ├── corpus_clean.parquet ← output of topic_analysis.qmd (modelling corpus)
│   └── corpus_deleted.parquet ← deleted/removed rows (tracked separately)
├── topic analysis/
│   ├── topic_analysis.qmd ← Step 3: Python — BERTopic pipeline
│   ├── topic_api_labeling.qmd ← Step 3b: Python — optional API-based topic labelling
│   ├── build_topic_results_summary.py ← helper: generate an HTML summary from run outputs
│   ├── run_topic_analysis.sh ← shell wrapper that sets env vars and calls quarto
│   ├── README.md ← detailed pipeline documentation
│   ├── MEMORY_MANAGEMENT.md ← strategies for large-corpus runs
│   └── runs/
│       └── full_run_v1/ ← complete output of the full corpus run
│           ├── bertopic/ ← saved BERTopic models + doc-topic assignments
│           ├── topics/ ← keyword tables, summaries, evaluation, API labels
│           └── figures/ ← generated charts
└── .venv/ ← Python virtual environment (created per instructions below)
```
------------------------------------------------------------------------
## System requirements
### Hardware
| Component | Minimum | Recommended |
|----|----|----|
| RAM | 32 GB | 64 GB (full corpus run; see MEMORY_MANAGEMENT.md) |
| Disk | 50 GB free | 100 GB free |
| CPU | Any modern multi-core | 8+ cores for embedding generation |
| GPU | Not required | Optional — speeds up embedding by \~10× |
> The full corpus run was executed on a Linux VM with 96 GB RAM. A local
> Mac with 24 GB will work for the pilot/sample runs but may struggle
> with the full corpus embedding step.
### Software
| Tool | Version tested | Notes |
|----|----|----|
| R | ≥ 4.3 | For data_prep.qmd and goon_analysis.qmd |
| RStudio / Quarto CLI | ≥ 1.4 | To render .qmd files |
| Python | ≥ 3.10 | For topic_analysis.qmd |
| Quarto | ≥ 1.4 | Installed with RStudio or standalone |
------------------------------------------------------------------------
## Step-by-step replication
### Step 0: Clone / obtain the repository
The `data/` folder containing raw CSVs is required. These are large
files and are not distributed via git β€” they must be present locally.
Open `Goon.Rproj` in RStudio. This sets the working directory to the
project root so that `here::here()` paths resolve correctly.
------------------------------------------------------------------------
### Step 1: R data preparation (`data_prep.qmd`)
**Purpose:** Reads all raw CSVs, combines them into unified data frames,
applies minimal cleaning, and exports `data/comments.parquet` and
`data/posts.parquet`.
**R packages required:**
``` r
install.packages(c("dplyr", "tidyr", "tibble", "purrr",
"data.table", "arrow", "here"))
```
**Run:**
Open `data_prep.qmd` in RStudio and click **Render** (or run all chunks
in order).
Alternatively, from the terminal:
``` bash
cd /Users/bkot7579/Desktop/Goon
quarto render data_prep.qmd
```
**Expected outputs:**

- `data/comments.parquet` — \~18.2M rows, \~528 MB
- `data/posts.parquet` — \~3.8M rows, \~186 MB
**Time estimate:** 15–45 minutes depending on RAM and I/O speed (the
GOONED CSV files alone are \~7 GB).
------------------------------------------------------------------------
### Step 2: R descriptive analysis (`goon_analysis.qmd`)
**Purpose:** Demographic analysis (r/GOONEDmeetup flair), word frequency
analysis, substance mention counts.
**R packages required:**
``` r
install.packages(c("dplyr", "tidyr", "tibble", "purrr", "ggplot2",
"stringr", "here", "e1071", "tidytext",
"data.table", "arrow"))
```
**Run:**
Open `goon_analysis.qmd` in RStudio and click **Render**.
**Prerequisites:** `data/comments.parquet` and `data/posts.parquet` must
exist (Step 1).
**Expected outputs:**

- Rendered HTML with plots embedded
- In-memory: word count tables, substance mention counts, demographic counts
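The flair-based demographic coding can be sketched as follows. The actual patterns in `goon_analysis.qmd` (which is R, not Python) may well differ; the regexes below are purely illustrative, and the standalone-letter ambiguity they allow is the one flagged under Known limitations:

``` python
import re

# Illustrative flair parser; flairs often resemble "28 M straight".
AGE_RE = re.compile(r"\b(1[89]|[2-9]\d)\b")  # ages 18-99
# Longer tokens first; standalone "t" is ambiguous (see Known limitations).
GENDER_RE = re.compile(r"\b(male|female|trans|nb|m|f|t)\b", re.I)

def parse_flair(flair):
    """Extract age and gender tokens from a flair string; None when absent."""
    age = AGE_RE.search(flair)
    gender = GENDER_RE.search(flair)
    return {"age": int(age.group()) if age else None,
            "gender": gender.group().lower() if gender else None}

print(parse_flair("28 M straight"))  # {'age': 28, 'gender': 'm'}
print(parse_flair("no flair set"))   # {'age': None, 'gender': None}
```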
------------------------------------------------------------------------
### Step 3: Python BERTopic topic modelling (`topic analysis/topic_analysis.qmd`)
**Purpose:** Embeds all documents with `all-MiniLM-L6-v2`, clusters the
embeddings with HDBSCAN via BERTopic, reduces topics using c-TF-IDF
agglomerative clustering, evaluates the resulting models, and exports
all topic outputs.
#### 3a. Create the Python virtual environment
``` bash
cd /Users/bkot7579/Desktop/Goon
python3 -m venv .venv
source .venv/bin/activate
pip install bertopic umap-learn hdbscan sentence-transformers \
scikit-learn pandas pyarrow quarto
```
**Key package versions (tested):**
| Package | Version |
|-----------------------|---------|
| bertopic | 0.16.x |
| umap-learn | 0.5.x |
| hdbscan | 0.8.x |
| sentence-transformers | 2.x |
| scikit-learn | 1.x |
| pandas | 2.x |
| pyarrow | 14.x |
#### 3b. Run the pipeline
**Pilot run (200k documents — recommended first):**
``` bash
cd /Users/bkot7579/Desktop/Goon
./topic\ analysis/run_topic_analysis.sh --max-docs 200000 --run-tag pilot_200k
```
Outputs land in `topic analysis/runs/pilot_200k/`.
**Full corpus run (\~2.35M cleaned documents after filtering):**
``` bash
cd /Users/bkot7579/Desktop/Goon
./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v1
```
> **Warning:** The full run requires \~64 GB RAM for the embedding +
> UMAP stages. See `topic analysis/MEMORY_MANAGEMENT.md` for strategies
> if RAM is limited.
**Pipeline stages (automatically executed in order):**
1. **Data ingestion** — loads `data/posts.parquet` +
   `data/comments.parquet`
2. **Light cleaning** — deduplication, URL/username/subreddit
   anonymisation, markdown stripping; deleted/removed rows saved
   separately to `data/corpus_deleted.parquet`
3. **Embedding** — `all-MiniLM-L6-v2` (384-dim), batch_size=512, saved
   as shards; skips already-generated shards on resume
4. **UMAP** — pre-computed once, reused across all HDBSCAN configs
5. **BERTopic** — 6 configurations: min_cluster_size ∈ {50, 100, 200} ×
   method ∈ {eom, leaf}
6. **Topic reduction** — c-TF-IDF agglomerative clustering to 100, 50,
   and 25 topics
7. **Evaluation** — NPMI coherence, topic diversity, outlier rates
8. **Export** — keyword tables, representative docs, summary CSVs
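The shard-and-resume behaviour of the embedding stage can be approximated like this (an illustrative sketch; the function and shard file names are ours, not the pipeline's). `embed` would typically be `SentenceTransformer("all-MiniLM-L6-v2").encode`:

``` python
import os

import numpy as np

def embed_in_shards(texts, embed, shard_dir, shard_size=100_000):
    """Embed texts in fixed-size shards saved to disk; on a re-run,
    shards already on disk are loaded rather than recomputed."""
    os.makedirs(shard_dir, exist_ok=True)
    parts = []
    for i in range(0, len(texts), shard_size):
        path = os.path.join(shard_dir, f"shard_{i // shard_size:05d}.npy")
        if os.path.exists(path):  # resume: skip finished work
            parts.append(np.load(path))
        else:
            emb = np.asarray(embed(texts[i:i + shard_size]), dtype=np.float32)
            np.save(path, emb)
            parts.append(emb)
    return np.vstack(parts)
```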
**Reproducibility:** Random seed = 42 throughout. A
`reproducibility_log.json` is written to the run folder with all
settings and package versions.
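A log of this shape can be produced with a few lines of standard-library Python. The field names below are illustrative; the real `reproducibility_log.json` may use different keys:

``` python
import json
import platform
import sys
from importlib import metadata

def write_reproducibility_log(path, seed=42, settings=None):
    """Dump run settings plus installed package versions to a JSON file."""
    packages = {}
    for pkg in ("bertopic", "umap-learn", "hdbscan", "sentence-transformers",
                "scikit-learn", "pandas", "pyarrow"):
        try:
            packages[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            packages[pkg] = None  # not installed in this environment
    log = {"seed": seed,
           "python": sys.version.split()[0],
           "platform": platform.platform(),
           "settings": settings or {},
           "package_versions": packages}
    with open(path, "w") as f:
        json.dump(log, f, indent=2)
    return log
```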
#### 3c. Optional API-based topic labelling (`topic_api_labeling.qmd`)
Sends reduced topic summaries (keywords + representative texts) to an
LLM API to generate human-readable labels. Does NOT send raw corpus
text.
Set your API key, then render:
``` bash
export OPENAI_API_KEY="your-key-here"
quarto render "topic analysis/topic_api_labeling.qmd"
```
Outputs are saved to `topic analysis/runs/<run-tag>/topics/api/`.
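The request can be sketched as follows: only keyword lists and short representative snippets enter the prompt, never raw corpus text. The helper and the model name in the commented-out call are assumptions for illustration, not necessarily what `topic_api_labeling.qmd` uses:

``` python
def label_prompt(keywords, snippets):
    """Build a labelling prompt from topic keywords and representative
    snippets (summaries only -- raw corpus text is never sent)."""
    lines = ["Suggest a short, human-readable label for one discussion topic.",
             "Top keywords: " + ", ".join(keywords),
             "Representative snippets:"]
    lines += [f"- {s}" for s in snippets]
    lines.append("Reply with the label only.")
    return "\n".join(lines)

# The API call itself (requires OPENAI_API_KEY in the environment):
# from openai import OpenAI
# resp = OpenAI().chat.completions.create(
#     model="gpt-4o-mini",  # assumed model; substitute the one actually used
#     messages=[{"role": "user", "content": label_prompt(kws, snips)}])
# label = resp.choices[0].message.content.strip()
```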
------------------------------------------------------------------------
### Step 4: R topic visualisations (`goon_topic_analysis.qmd`)
**Purpose:** Loads the CSV/Parquet outputs from the BERTopic run and
produces exploratory visualisations: topic size bar chart, subreddit ×
topic heatmap, topic prevalence over time, post vs comment split,
representative documents.
**R packages required:**
``` r
install.packages(c("dplyr", "tidyr", "tibble", "purrr",
"ggplot2", "stringr", "here", "arrow"))
```
**Run:**
``` bash
quarto render goon_topic_analysis.qmd
```
**Prerequisites:** Step 3 must have completed and outputs must exist
under `topic analysis/runs/full_run_v1/topics/`.
------------------------------------------------------------------------
## Execution order summary
```
Step 1  → data_prep.qmd (R)                 ~30 min
Step 2  → goon_analysis.qmd (R)             ~10 min
Step 3  → topic_analysis.qmd (Python)       ~6–48 hours (full corpus)
Step 3b → topic_api_labeling.qmd (Python)   ~5 min + API cost (optional)
Step 4  → goon_topic_analysis.qmd (R)       ~2 min
```
Steps 2 and 3 are independent of each other and can run in parallel.
------------------------------------------------------------------------
## Key modelling decisions
| Decision | Choice | Rationale |
|-------------------------|--------------------|---------------------------|
| Embedding model | `all-MiniLM-L6-v2` | Fast, runs on CPU, 384-dim sufficient for topic structure |
| UMAP n_neighbors | 15 | BERTopic default; balances local vs global structure |
| UMAP n_components | 5 | Low enough for HDBSCAN to work well |
| HDBSCAN min_cluster_size | 50, 100, 200 | Tested all three; mcs=100 eom selected as reference |
| Topic reduction method | c-TF-IDF agglomerative | Merges semantically similar topics rather than splitting clusters |
| Reduction targets | 100, 50, 25 | 50 selected for reporting (NPMI=0.27, diversity=0.74) |
| Preprocessing | Minimal | Preserves informal language and slang; CountVectorizer handles casing/stopwords |
| Random seed | 42 | Applied to UMAP, HDBSCAN sampling, and document cap sampling |
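Of the two evaluation metrics, topic diversity is simple enough to sketch directly. The function below uses the common definition (fraction of unique words across the top-k keywords of all topics); the pipeline's exact implementation may differ, and NPMI coherence additionally requires corpus co-occurrence statistics, so it is not shown:

``` python
def topic_diversity(topic_keywords, top_k=10):
    """Fraction of unique words among the top-k keywords of all topics;
    1.0 means no keyword is shared between any two topics."""
    words = [w for topic in topic_keywords for w in topic[:top_k]]
    return len(set(words)) / len(words)

# Two topics sharing one of their two keywords -> 3 unique / 4 total:
print(topic_diversity([["goon", "edge"], ["edge", "trance"]], top_k=2))  # 0.75
```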
------------------------------------------------------------------------
## Known issue: r/GOONED missing from BERTopic results
r/GOONED is the dominant subreddit in the corpus by a large margin:
| | Count |
|---------------------|----------------|
| r/GOONED posts | 2,765,119 |
| r/GOONED comments | 15,493,075 |
| **r/GOONED total** | **18,258,194** |
| Full cleaned corpus | 22,011,124 |
| **r/GOONED share** | **82.9%** |
Despite this, r/GOONED is **entirely absent** from
`topic analysis/runs/full_run_v1/` outputs. The cloud VM that ran the
BERTopic pipeline did not have the `GOONED_comments.csv` and
`GOONED_submissions.csv` files available (most likely because their
combined size of \~7 GB made transfer impractical), and
`corpus_clean.parquet` on the VM was generated without them.
**Consequence:** All topic modelling results represent 29 subreddits
(3.75M records) rather than the full 30-subreddit corpus (22M records).
Topic proportions, dominant themes, and subreddit distribution tables
are therefore not representative of the full corpus.
**To fix:** Ensure the `GOONED_*.csv` files (or the pre-built
`comments.parquet` / `posts.parquet`) are available on the compute
environment, then re-run:
``` bash
./topic\ analysis/run_topic_analysis.sh --run-tag full_run_v2
```
Note: the R-side analyses (`goon_analysis.qmd`) are **not** affected by
this issue β€” they read directly from the local `data/posts.parquet` and
`data/comments.parquet` files, which do contain r/GOONED data.
------------------------------------------------------------------------
## Known limitations
1. **Outlier rate:** 37.3% of documents are assigned to the outlier
topic (-1) by HDBSCAN. This is typical for short-text social media
corpora. Outliers are excluded from topic analyses but are retained
in the corpus parquet files.
2. **Flair ambiguity:** Gender/sexuality classifiers rely on
voluntarily set flair strings. Flair adoption is uneven across
subreddits and users, introducing selection bias. Abbreviations like
`\bt\b` may match unintended strings (e.g. US state TX).
3. **Deleted content:** Posts and comments marked `[deleted]` or
`[removed]` are excluded from modelling but counted separately.
These may disproportionately represent controversial content.
4. **Temporal coverage:** Coverage varies by subreddit. Some
communities only appear in later years (2024–2025); others span the
full 2019–2025 window.
5. **Platform-specific norms:** Moderation rules, flair conventions,
and posting styles differ across subreddits, which may shape topics
in ways that are not generalisable.
6. **Unobserved participants:** Lurkers, banned users, and deleted
accounts are not captured.
------------------------------------------------------------------------
## Output files (full_run_v1)
| File | Description |
|-----------------------|-------------------------------------------------|
| `topics/final_topic_summary.csv` | 50 reduced topics with size, keywords, representative texts |
| `topics/final_model_comparison.csv` | Coherence, diversity, outlier rate for all 9 runs |
| `topics/evaluation_table.csv` | Same as above, alternate format |
| `topics/bertopic_run_summary.csv` | Initial topic counts across 6 HDBSCAN configurations |
| `topics/topic_by_subreddit.csv` | Topic × subreddit document counts |
| `topics/topic_by_month.csv` | Topic × month document counts |
| `topics/topic_by_doctype.csv` | Topic × doc type (post/comment) |
| `topics/preprocessing_decisions.json` | Logged cleaning decisions |
| `topics/api/` | API-generated labels, summaries, category annotations |
| `bertopic/` | Saved BERTopic models + doc-topic parquet files for all 6 initial runs |
------------------------------------------------------------------------
## Contacts and citation
This analysis was conducted as part of a preliminary empirical study of
online gooning communities. If replicating, please cite the original
study and note the random seed, embedding model, and reduction target
used.