---
license: gpl-3.0
tags:
- graph-neural-networks
- pytorch-geometric
- graph-classification
- snap
language:
- en
metrics:
- matthews_correlation
- accuracy
- f1
- roc-auc
library_name: pytorch
pipeline_tag: graph-ml
datasets:
- reddit_threads
---

# Graph Classification on SNAP Reddit Threads

Binary graph classification on [SNAP Reddit Threads](https://snap.stanford.edu/data/reddit_threads.html) with PyTorch Geometric.

Full training code, preprocessing, metric tables, figures, and reproduction commands: https://github.com/pymlex/threads-gnn

## Overview

Nodes are Reddit users in a discussion thread. Undirected edges are reply relations. The label marks whether the thread is discussion-based. The dataset has 203,088 graphs, 11--97 nodes per graph, and no raw node features.

We engineer 38 structural descriptors per node and compare three encoders under one protocol: GIN, PNA, and GAT. Each model uses four message-passing layers, hidden dimension 128, attention pooling, and a virtual node. Model selection uses validation Matthews correlation coefficient.

Experiments used Google Colab with an NVIDIA GPU, batch size 4096, learning rate 0.003, and early stopping with patience 8.

## Results

### Architecture comparison

| Architecture | Best val MCC | Test MCC | Test F1 | Test ROC-AUC |
| --- | --- | --- | --- | --- |
| GIN | 0.5609 | 0.5642 | 0.8017 | 0.8417 |
| PNA | 0.5609 | 0.5635 | 0.8016 | 0.8419 |
| GAT | 0.5592 | 0.5655 | 0.8002 | 0.8418 |

**Selected checkpoint: GIN** (`model.pt`), chosen by best validation MCC.

GIN leads validation MCC by a margin of 6e-5 over PNA. On the held-out test split GAT reaches the highest MCC 0.5655, while ROC-AUC stays near 0.842 for all three encoders.

### Training curves

![Training curves for GIN, PNA, and GAT](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/training_curves.png)

Validation MCC rises in the first five epochs and plateaus near 0.55--0.56 for every encoder.
Best checkpoints appear at epoch 31 for GIN, epoch 23 for PNA, and epoch 32 for GAT.

### Test ROC curves

![Combined test ROC curves](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/test_roc_curves.png)

Per-architecture plots:

- GIN: ![GIN ROC curve](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/gin_seed42/test_roc_curve.png)
- PNA: ![PNA ROC curve](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/pna_seed42/test_roc_curve.png)
- GAT: ![GAT ROC curve](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/gat_seed42/test_roc_curve.png)

### Logit distributions on the test split

![Combined logit histograms for class 1](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/test_logit_histograms.png)

Densities are split by ground-truth label. Separation between the two classes reflects ranking quality beyond the fixed 0.5 probability threshold.

- GIN: ![GIN logit histogram](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/gin_seed42/test_logit_histogram.png)
- PNA: ![PNA logit histogram](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/pna_seed42/test_logit_histogram.png)
- GAT: ![GAT logit histogram](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/gat_seed42/test_logit_histogram.png)

### Confusion matrices on the test split

#### GIN

![GIN test confusion matrix](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/gin_seed42/test_confusion_matrix.png)

#### PNA

![PNA test confusion matrix](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/pna_seed42/test_confusion_matrix.png)

#### GAT

![GAT test confusion matrix](https://raw.githubusercontent.com/pymlex/threads-gnn/main/runs/gat_seed42/test_confusion_matrix.png)

All models favour recall on the positive class. Class 0 recall stays near 0.67--0.70 while class 1 recall exceeds 0.85. GAT yields the highest class-0 recall 0.700 and test accuracy 0.781.
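
The per-class recall, accuracy, and MCC figures quoted above all follow from the four cells of a binary confusion matrix. The following is a minimal plain-Python sketch of those relationships. The function name `binary_metrics` is illustrative, and the counts are made up for the example; they are not the test-split counts from the figures.

```python
import math


def binary_metrics(tn, fp, fn, tp):
    """Derive common binary-classification metrics from confusion-matrix counts."""
    recall_pos = tp / (tp + fn)   # class-1 recall (sensitivity)
    recall_neg = tn / (tn + fp)   # class-0 recall (specificity)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    balanced_accuracy = (recall_pos + recall_neg) / 2
    f1 = 2 * precision * recall_pos / (precision + recall_pos)
    # MCC uses all four cells, so it penalises weak class-0 recall
    # even when class-1 recall is high.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "recall_pos": recall_pos,
        "recall_neg": recall_neg,
        "precision": precision,
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts, NOT the actual test-split confusion matrix.
metrics = binary_metrics(tn=680, fp=320, fn=130, tp=870)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With counts shaped like these (high class-1 recall, weaker class-0 recall), accuracy and F1 can look comfortable while MCC stays noticeably lower, which is one reason to select models on MCC.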

### Selected model test metrics (GIN)

| Metric | Value |
| --- | --- |
| MCC | 0.5642 |
| Accuracy | 0.7783 |
| Balanced accuracy | 0.7758 |
| Precision | 0.7400 |
| Recall | 0.8745 |
| F1 | 0.8017 |
| ROC-AUC | 0.8417 |
| PR-AUC | 0.8087 |

## Checkpoint

| File | Description |
| --- | --- |
| `model.pt` | Best GIN checkpoint selected by validation MCC |
| `config.json` | Experiment configuration for the selected run |
| `final_metrics.json` | Validation and test metrics for the selected run |
| `selected_model.json` | Architecture comparison and selection record |

## Inference

```python
from huggingface_hub import hf_hub_download
import torch

# Download the selected GIN checkpoint from the Hugging Face Hub.
checkpoint_path = hf_hub_download(repo_id="pymlex/threads-gnn", filename="model.pt")

# weights_only=False unpickles arbitrary Python objects, so load only checkpoints you trust.
checkpoint = torch.load(checkpoint_path, map_location="cpu", weights_only=False)

# The model weights are stored under the "model_state_dict" key.
state_dict = checkpoint["model_state_dict"]
```

The snippet above only loads the weights. Clone https://github.com/pymlex/threads-gnn for the full `GINClassifier` definition, the structural feature pipeline, and batched inference over PyG `Data` objects.
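
Because the dataset has no raw node features, model inputs come from the structural feature pipeline in the repository. As a rough illustration of what such descriptors can look like, the sketch below computes three common ones (degree, local clustering coefficient, mean neighbour degree) in plain Python. The function name `structural_features` and the choice of descriptors are assumptions made for this example; the project's actual 38 descriptors are defined in its repository and may differ.

```python
from itertools import combinations


def structural_features(num_nodes, edges):
    """Return [degree, clustering coefficient, mean neighbour degree] per node.

    Illustrative only: not the project's 38-descriptor pipeline.
    """
    # Build an undirected adjacency structure from the edge list.
    neighbours = [set() for _ in range(num_nodes)]
    for u, v in edges:
        if u != v:  # ignore self-loops
            neighbours[u].add(v)
            neighbours[v].add(u)

    features = []
    for node in range(num_nodes):
        nbrs = neighbours[node]
        degree = len(nbrs)
        # Clustering: fraction of neighbour pairs that are themselves connected.
        pairs = degree * (degree - 1) // 2
        links = sum(1 for a, b in combinations(nbrs, 2) if b in neighbours[a])
        clustering = links / pairs if pairs else 0.0
        mean_nbr_degree = (
            sum(len(neighbours[n]) for n in nbrs) / degree if degree else 0.0
        )
        features.append([float(degree), clustering, mean_nbr_degree])
    return features


# Toy thread: users 0, 1, 2 reply to each other; user 3 replies only to user 0.
edges = [(0, 1), (1, 2), (0, 2), (0, 3)]
features = structural_features(4, edges)
for node, row in enumerate(features):
    print(node, row)
```

Rows like these could be converted to a float tensor and passed as the `x` attribute of a PyG `Data` object alongside `edge_index`, provided the feature layout matches what the checkpoint was trained on.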

## Citation

```bibtex
@misc{threads_gnn,
  author = {Alex Zyukov},
  title = {Graph Classification on SNAP Reddit Threads},
  year = {2026},
  publisher = {GitHub},
  howpublished = {\url{https://github.com/pymlex/threads-gnn}},
  note = {Hugging Face model pymlex/threads-gnn}
}
```

## References

```bibtex
@inproceedings{karateclub,
  title = {{Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs}},
  author = {Benedek Rozemberczki and Oliver Kiss and Rik Sarkar},
  year = {2020},
  pages = {3125--3132},
  booktitle = {Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20)},
  organization = {ACM},
}

@inproceedings{xu2019gin,
  title = {How Powerful are Graph Neural Networks?},
  author = {Keyulu Xu and Weihua Hu and Jure Leskovec and Stefanie Jegelka},
  booktitle = {International Conference on Learning Representations},
  year = {2019},
}

@inproceedings{corso2020pna,
  title = {Principal Neighbourhood Aggregation for Graph Nets},
  author = {Gabriele Corso and Luca Cavalleri and Dominique Beaini and Pietro Li{\`o} and Petar Veli{\v{c}}kovi{\'c}},
  booktitle = {Advances in Neural Information Processing Systems},
  year = {2020},
}

@inproceedings{velickovic2018gat,
  title = {Graph Attention Networks},
  author = {Petar Veli{\v{c}}kovi{\'c} and Guillem Cucurull and Arantxa Casanova and Adriana Romero and Pietro Li{\`o} and Yoshua Bengio},
  booktitle = {International Conference on Learning Representations},
  year = {2018},
}
```

The project is released under the GPL-3.0 license.