
	---
	license: gpl-3.0
	tags:
	- graph-neural-networks
	- pytorch-geometric
	- graph-classification
	- snap
	language:
	- en
	metrics:
	- matthews_correlation
	- accuracy
	- f1
	- roc_auc
	library_name: pytorch
	pipeline_tag: graph-ml
	datasets:
	- reddit_threads
	---

	# Graph Classification on SNAP Reddit Threads

	Binary graph classification on [SNAP Reddit Threads](https://snap.stanford.edu/data/reddit_threads.html) with PyTorch Geometric. Full training code, preprocessing, metric tables, figures, and reproduction commands:

	https://github.com/pymlex/threads-gnn

	## Overview

	Nodes are Reddit users in a discussion thread. Undirected edges are reply relations. The label marks whether the thread is discussion-based. The dataset has 203,088 graphs, 11--97 nodes per graph, and no raw node features.

	We engineer 38 structural descriptors per node and compare three encoders under one protocol: GIN, PNA, and GAT. Each model uses four message-passing layers, hidden dimension 128, attention pooling, and a virtual node. Model selection uses the validation Matthews correlation coefficient (MCC). Experiments ran on Google Colab with an NVIDIA GPU, batch size 4096, learning rate 0.003, and early stopping with patience 8.
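
The 38 descriptors are defined in the repository and are not listed in this card. As a rough illustration of what a structural node descriptor looks like, the sketch below computes three common ones (degree, normalized degree, local clustering coefficient) in plain Python. The function name `structural_features` and the choice of features are assumptions made for this example, not the project's actual pipeline.

```python
from collections import defaultdict
from itertools import combinations


def structural_features(num_nodes, edges):
    """Return [degree, normalized degree, clustering coefficient] per node.

    Illustrative only: the repository's 38 descriptors are not reproduced here.
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbors[u].add(v)
            neighbors[v].add(u)

    features = []
    for node in range(num_nodes):
        nbrs = neighbors[node]
        degree = len(nbrs)
        # Count edges among the neighbors (closed triangles through `node`).
        links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbors[a])
        possible = degree * (degree - 1) / 2
        clustering = links / possible if possible else 0.0
        norm_degree = degree / (num_nodes - 1) if num_nodes > 1 else 0.0
        features.append([float(degree), norm_degree, clustering])
    return features


# Toy thread: a triangle (0, 1, 2) with one extra reply attached to node 2.
toy_features = structural_features(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
```

Features of this kind depend only on the graph topology, which is why they can stand in for the missing raw node attributes.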

	## Results

	### Architecture comparison

	| Architecture | Best val MCC | Test MCC | Test F1 | Test ROC-AUC |
	| --- | --- | --- | --- | --- |
	| GIN | 0.5609 | 0.5642 | 0.8017 | 0.8417 |
	| PNA | 0.5609 | 0.5635 | 0.8016 | 0.8419 |
	| GAT | 0.5592 | 0.5655 | 0.8002 | 0.8418 |

	Selected checkpoint: GIN (`model.pt`), chosen by best validation MCC. GIN leads PNA on validation MCC by only 6e-5, a gap that the four-decimal rounding in the table hides. On the held-out test split GAT reaches the highest MCC (0.5655), while ROC-AUC stays near 0.842 for all three encoders.
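
For readers unfamiliar with the selection metric, here is a minimal sketch of MCC computed from binary confusion-matrix counts. The helper name `matthews_corrcoef` mirrors the scikit-learn function but is a plain-Python stand-in, and the counts in the example are hypothetical, not taken from the reported runs.

```python
import math


def matthews_corrcoef(tp, fp, fn, tn):
    """MCC from binary confusion-matrix counts; 0.0 when undefined."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical counts, not taken from the reported runs.
example_mcc = matthews_corrcoef(tp=90, fp=30, fn=10, tn=70)
```

MCC uses all four cells of the confusion matrix, so a model that trades class-0 recall for class-1 recall is penalized more visibly than under F1 alone.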

	### Training curves

	![Training curves for GIN, PNA, and GAT](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/training_curves.png)

	Validation MCC rises in the first five epochs and plateaus near 0.55--0.56 for every encoder. Best checkpoints appear at epoch 31 for GIN, epoch 23 for PNA, and epoch 32 for GAT.
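
The best-epoch numbers above come from early stopping on validation MCC. A minimal sketch of such a tracker follows, assuming a "higher is better" score and a strict-improvement rule; the class name `EarlyStopping` and the validation values are illustrative, and the repository's own loop may differ.

```python
class EarlyStopping:
    """Track the best validation score and signal when patience runs out."""

    def __init__(self, patience=8):
        self.patience = patience
        self.best_score = float("-inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, score):
        """Record one epoch; return True when training should stop."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
            self.bad_epochs = 0  # a checkpoint would be saved here
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation MCC values for demonstration (patience shortened to 3).
stopper = EarlyStopping(patience=3)
history = [0.40, 0.52, 0.55, 0.54, 0.55, 0.53, 0.56]
stopped_at = None
for epoch, val_mcc in enumerate(history, start=1):
    if stopper.step(epoch, val_mcc):
        stopped_at = epoch
        break
```

With the card's patience of 8, a run whose best checkpoint falls at epoch 31 would under this rule continue until epoch 39 before stopping.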

	### Test ROC curves

	![Combined test ROC curves](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/test_roc_curves.png)

	Per-architecture plots:

	- GIN:
	![GIN ROC curve](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/gin_seed42/test_roc_curve.png)
	- PNA:
	![PNA ROC curve](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/pna_seed42/test_roc_curve.png)
	- GAT:
	![GAT ROC curve](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/gat_seed42/test_roc_curve.png)

	### Logit distributions on the test split

	![Combined logit histograms for class 1](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/test_logit_histograms.png)

	Densities are split by ground-truth label. Separation between the two classes reflects ranking quality beyond the fixed 0.5 probability threshold.

	- GIN:
	![GIN logit histogram](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/gin_seed42/test_logit_histogram.png)
	- PNA:
	![PNA logit histogram](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/pna_seed42/test_logit_histogram.png)
	- GAT:
	![GAT logit histogram](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/gat_seed42/test_logit_histogram.png)
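
To relate the histograms to the decision rule: if the model emits a single class-1 logit per graph (an assumption here; with two-class outputs the same holds for the logit difference), thresholding the sigmoid probability at 0.5 is the same as checking whether the logit is non-negative. The helper names and logit values below are hypothetical.

```python
import math


def sigmoid(logit):
    """Numerically stable logistic function."""
    if logit >= 0:
        return 1.0 / (1.0 + math.exp(-logit))
    z = math.exp(logit)
    return z / (1.0 + z)


def predict(logit, threshold=0.5):
    """Class 1 when the sigmoid probability reaches the threshold."""
    return int(sigmoid(logit) >= threshold)


# Hypothetical logits: thresholding probability at 0.5 equals checking logit >= 0.
logits = [-2.3, -0.1, 0.0, 0.4, 3.1]
by_probability = [predict(x) for x in logits]
by_sign = [int(x >= 0) for x in logits]
```

Moving the threshold away from 0.5 shifts the cut point along the logit axis of the histograms, trading recall on one class for recall on the other.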

	### Confusion matrices on the test split

	#### GIN

	![GIN test confusion matrix](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/gin_seed42/test_confusion_matrix.png)

	#### PNA

	![PNA test confusion matrix](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/pna_seed42/test_confusion_matrix.png)

	#### GAT

	![GAT test confusion matrix](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/gat_seed42/test_confusion_matrix.png)

	All models favour recall on the positive class: class-0 recall stays near 0.67--0.70, while class-1 recall exceeds 0.85. GAT yields the highest class-0 recall (0.700) and test accuracy (0.781).

	### Selected model test metrics (GIN)

	| Metric | Value |
	| --- | --- |
	| MCC | 0.5642 |
	| Accuracy | 0.7783 |
	| Balanced accuracy | 0.7758 |
	| Precision | 0.7400 |
	| Recall | 0.8745 |
	| F1 | 0.8017 |
	| ROC-AUC | 0.8417 |
	| PR-AUC | 0.8087 |
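
As a quick consistency check on the table, F1 and class-0 recall can be recovered from the other reported values. This is plain arithmetic on the rounded figures, so agreement is only expected to about three decimals; the variable names are chosen for this example.

```python
precision = 0.7400
recall_pos = 0.8745  # recall on class 1
balanced_accuracy = 0.7758

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall_pos / (precision + recall_pos)

# Balanced accuracy is the mean of the two per-class recalls,
# so class-0 recall follows from the other two numbers.
recall_neg = 2 * balanced_accuracy - recall_pos
```

This gives an F1 of about 0.802 and a class-0 recall of about 0.677, in line with the reported F1 and with the class-0 recall range quoted for the confusion matrices.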

	## Checkpoint

	| File | Description |
	| --- | --- |
	| `model.pt` | Best GIN checkpoint selected by validation MCC |
	| `config.json` | Experiment configuration for the selected run |
	| `final_metrics.json` | Validation and test metrics for the selected run |
	| `selected_model.json` | Architecture comparison and selection record |

	## Inference

	```python
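	from huggingface_hub import hf_hub_download
	import torch

	# Download the selected GIN checkpoint from the Hub (cached locally after the first call).
	checkpoint_path = hf_hub_download(repo_id="pymlex/threads-gnn", filename="model.pt")
	# weights_only=False unpickles arbitrary Python objects; only load checkpoints you trust.
	checkpoint = torch.load(checkpoint_path, map_location="cpu", weights_only=False)
	# The model weights are stored under the "model_state_dict" key.
	state_dict = checkpoint["model_state_dict"]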
	```

	Clone https://github.com/pymlex/threads-gnn for the full `GINClassifier` definition, structural feature pipeline, and batched inference over PyG `Data` objects.

	## Citation

	```bibtex
	@misc{threads_gnn,
	author = {Alex Zyukov},
	title = {Graph Classification on SNAP Reddit Threads},
	year = {2026},
	publisher = {GitHub},
	howpublished = {\url{https://github.com/pymlex/threads-gnn}},
	note = {Hugging Face model pymlex/threads-gnn}
	}
	```

	## References

	```bibtex
	@inproceedings{karateclub,
	title = {{Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs}},
	author = {Benedek Rozemberczki and Oliver Kiss and Rik Sarkar},
	year = {2020},
	pages = {3125--3132},
	booktitle = {Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20)},
	organization = {ACM},
	}
	@inproceedings{xu2019gin,
	title = {How Powerful are Graph Neural Networks?},
	author = {Keyulu Xu and Weihua Hu and Jure Leskovec and Stefanie Jegelka},
	booktitle = {International Conference on Learning Representations},
	year = {2019},
	}
	@inproceedings{corso2020pna,
	title = {Principal Neighbourhood Aggregation for Graph Nets},
	author = {Gabriele Corso and Luca Cavalleri and Dominique Beaini and Pietro Li{\`o} and Petar Veli{\v{c}}kovi{\'c}},
	booktitle = {Advances in Neural Information Processing Systems},
	year = {2020},
	}
	@inproceedings{velickovic2018gat,
	title = {Graph Attention Networks},
	author = {Petar Veli{\v{c}}kovi{\'c} and Guillem Cucurull and Arantxa Casanova and Adriana Romero and Pietro Li{\`o} and Yoshua Bengio},
	booktitle = {International Conference on Learning Representations},
	year = {2018},
	}
	```

	This project is released under the GPL-3.0 license.