Final Report: Citation-Aware High-Order Graph Recommendation

Abstract

This project frames academic paper recommendation as author-paper link prediction on a heterogeneous academic graph. The final system combines LightGCN collaborative filtering, explicit graph/meta-path features, content features from feature.pkl, BPR-MF scores, DeepWalk / Node2Vec random-walk embeddings, and a new family of citation-aware high-order propagation features. The best confirmed public leaderboard score is 0.96626 F1, achieved by submission_rich_rw7_highorder_directed_r0.500000.csv.

Data

The official data includes:

  • bipartite_train_ann.txt: author-paper training positives.
  • bipartite_test_ann.txt: author-paper pairs to predict.
  • author_file_ann.txt: author-author collaboration edges.
  • paper_file_ann.txt: paper-paper citation edges.
  • feature.pkl: 512-dimensional paper content features.

The graph has 6,611 authors and about 79,937 papers. The test set contains 2,047,262 author-paper pairs.

Baseline And Early Models

The initial notebook-style heterogeneous GNN baseline reached validation F1 around 0.885. Several variants were tried:

  • SAGEConv heterogeneous GNN with MLP decoder.
  • BPR ranking loss.
  • LightGBM structural feature baselines.
  • BPR-MF recommender baselines.
  • Multiple LightGCN variants.

The first stable confirmed public result was a 6-model LightGCN ensemble:

submissions/sub_ens6_t0.36.csv
public F1 = 0.93044

This ensemble averaged cosine scores from six LightGCN checkpoints and forced known train/test overlap positives to 1.
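The averaging and override step can be sketched as follows. This is a minimal illustration, not the project's code: the function name `ensemble_scores` and the toy arrays are hypothetical, and the 0.36 default reads the `t0.36` in the submission filename as a decision threshold, which is an assumption.

```python
import numpy as np

def ensemble_scores(score_list, known_positive_mask, threshold=0.36):
    """Average per-checkpoint scores, threshold them, and force known positives to 1.

    threshold=0.36 is an assumed reading of the 't0.36' filename suffix.
    """
    mean_score = np.mean(np.stack(score_list, axis=0), axis=0)
    labels = (mean_score >= threshold).astype(np.int64)
    # Pairs already seen as training positives are always predicted positive.
    labels[known_positive_mask] = 1
    return mean_score, labels

# Toy example: three "checkpoints" scoring four test pairs.
scores = [
    np.array([0.9, 0.1, 0.4, 0.2]),
    np.array([0.8, 0.2, 0.3, 0.1]),
    np.array([0.7, 0.3, 0.5, 0.0]),
]
known = np.array([False, False, False, True])
mean_score, labels = ensemble_scores(scores, known)
print(labels)
```

The last pair has a low mean score but is still labeled positive because it overlaps with the training positives.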

First Major Breakthrough: Feature Stacking

The first large improvement came from moving beyond pure LightGCN scores. The model stacked:

  • LightGCN score and rank features.
  • Author degree and paper degree.
  • Coauthor evidence.
  • Citation in/out degree.
  • Author-history and candidate-paper citation overlaps.
  • Meta-path counts such as A-A-P, A-P-P, and A-P-A-P.
  • Content similarity features.
  • BPR-MF score features.
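As a rough illustration of the meta-path count features listed above, the counts can be read off products of adjacency matrices. The toy matrices `AP`, `AA`, and `PP` below are invented for the example, and dense NumPy arrays are used only for readability; at the real graph size, sparse matrices would almost certainly be needed.

```python
import numpy as np

# Toy graph (hypothetical): 3 authors, 4 papers.
AP = np.array([[1, 1, 0, 0],
               [0, 1, 1, 0],
               [0, 0, 0, 1]])      # author-paper interactions
AA = np.array([[0, 1, 0],
               [1, 0, 0],
               [0, 0, 0]])         # author-author coauthor edges (symmetric)
PP = np.array([[0, 0, 1, 0],
               [0, 0, 0, 1],
               [0, 0, 0, 0],
               [0, 0, 0, 0]])      # paper-paper citations, PP[i, j] = 1 if i cites j

aap = AA @ AP            # A-A-P: papers written by an author's coauthors
app = AP @ PP            # A-P-P: papers cited by an author's own papers
apap = AP @ AP.T @ AP    # A-P-A-P: papers of authors who share a paper with the author

def metapath_features(author, paper):
    """Return the three path counts for one candidate (author, paper) pair."""
    return aap[author, paper], app[author, paper], apap[author, paper]

print(metapath_features(0, 2))
```

Note that the A-P-A-P product also counts walks that return through the author's own papers; whether to subtract those is a design choice this sketch leaves open.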

The second-stage model was LightGBM trained with out-of-fold (OOF) validation. With the content and BPR-MF features included, this stacker raised public F1 to about 0.95996.
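The out-of-fold idea can be sketched as below. To keep the example dependency-free, a least-squares linear model stands in for LightGBM; that substitution, the function name `oof_predictions`, the fold count, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def oof_predictions(X, y, n_folds=5, seed=0):
    """Score every row with a model that never saw that row during training."""
    rng = np.random.default_rng(seed)
    fold_id = rng.permutation(len(y)) % n_folds   # balanced random fold assignment
    Xb = np.hstack([X, np.ones((len(X), 1))])     # append a bias column
    oof = np.zeros(len(y))
    for f in range(n_folds):
        train, valid = fold_id != f, fold_id == f
        # Stand-in learner: ordinary least squares (NOT LightGBM).
        w, *_ = np.linalg.lstsq(Xb[train], y[train], rcond=None)
        oof[valid] = Xb[valid] @ w
    return oof, fold_id

# Synthetic stacking features: only the first column carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(float)
oof, fold_id = oof_predictions(X, y)
accuracy = np.mean((oof > 0.5) == (y > 0.5))
```

Because each row is scored by a model fit on the other folds, metrics computed on `oof` are a less optimistic estimate than metrics on the training fit.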

Second Major Breakthrough: DeepWalk / Node2Vec

The next improvement came from using random-walk graph embeddings as additional score sources. DeepWalk and Node2Vec were trained on mixed academic graphs built from author-paper, paper-paper citation, and author-author coauthor edges. For each author-paper pair, the model constructed:

  • dot product.
  • cosine similarity.
  • global rank.
  • author-wise rank / percentile.

Adding DeepWalk and Node2Vec to the content + BPR-MF stacker improved public F1 to about 0.96252. Further systematic random-walk experiments showed that higher-dimensional DeepWalk and longer walks improved validation, but larger random-walk ensembles began to overfit the seed202 validation split.
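The four per-pair features listed above could be derived from any pair of embedding tables roughly as follows. The helper name `embedding_pair_features`, the choice to rank by dot product, and the toy embeddings are assumptions; the report does not specify which score the ranks were computed from.

```python
import numpy as np

def embedding_pair_features(author_emb, paper_emb, pairs):
    """Dot product, cosine, global rank, and author-wise percentile for each pair."""
    a = author_emb[pairs[:, 0]]
    p = paper_emb[pairs[:, 1]]
    dot = np.sum(a * p, axis=1)
    norm = np.linalg.norm(a, axis=1) * np.linalg.norm(p, axis=1)
    cos = dot / np.maximum(norm, 1e-12)          # guard against zero vectors
    # Global rank over all pairs: 0 = lowest score, n-1 = highest.
    global_rank = np.argsort(np.argsort(dot))
    # Percentile of each pair among the candidates of the same author.
    pct = np.zeros(len(pairs))
    for author in np.unique(pairs[:, 0]):
        idx = np.where(pairs[:, 0] == author)[0]
        local_rank = np.argsort(np.argsort(dot[idx]))
        pct[idx] = (local_rank + 1) / len(idx)
    return dot, cos, global_rank, pct

# Toy embeddings (hypothetical): 2 authors, 3 papers, dimension 2.
author_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
paper_emb = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
pairs = np.array([[0, 0], [0, 1], [0, 2], [1, 1]])
dot, cos, global_rank, pct = embedding_pair_features(author_emb, paper_emb, pairs)
```

The author-wise percentile makes scores comparable across authors whose raw embedding norms differ, which is one plausible reason such rank features help a tree-based stacker.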

Third Major Breakthrough: High-Order Citation Propagation

The final and most important innovation was explicit high-order citation propagation. Instead of training more random-walk embeddings, we computed deterministic propagation features over typed meta-paths.

Let:

  • R be the row-normalized author-paper interaction matrix.
  • C be the row-normalized paper-paper citation matrix.
  • S be the row-normalized author-author coauthor matrix.

Author-history citation propagation is:

H_k = R C^k

For a candidate pair (a, p), H_k[a, p] measures whether candidate paper p is reachable from author a's historical papers through k citation steps.

Coauthor-based propagation is:

G_k = S R C^k

This captures whether candidate paper p is reachable from the historical papers of author a's collaborators.
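A minimal sketch of both propagation families, following the definitions of R, C, and S above. The helper names and the toy graph are hypothetical, and dense arrays are used only to keep the example short; the real matrices would need a sparse representation.

```python
import numpy as np

def row_normalize(m):
    """Divide each row by its sum; all-zero rows stay zero."""
    m = np.asarray(m, dtype=float)
    sums = m.sum(axis=1, keepdims=True)
    return np.divide(m, sums, out=np.zeros_like(m), where=sums > 0)

def propagate(R, C, S, k_max):
    """Return lists [H_1..H_k] and [G_1..G_k] with H_k = R C^k and G_k = S R C^k."""
    H, G = [], []
    current = R
    for _ in range(k_max):
        current = current @ C      # one more citation step
        H.append(current)
        G.append(S @ current)      # same paths, started from coauthors' histories
    return H, G

# Toy graph: author 0 wrote p0, author 1 wrote p1; p0 cites p1, p1 cites p2.
R = row_normalize([[1, 0, 0], [0, 1, 0]])
C = row_normalize([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
S = row_normalize([[0, 1], [1, 0]])
H, G = propagate(R, C, S, k_max=2)
print(H[1][0, 2], G[0][0, 2])
```

In this toy graph, paper p2 is two citation steps from author 0's own history (`H[1][0, 2]`) but only one step from the history of author 0's coauthor (`G[0][0, 2]`).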

The final version uses three citation directions:

  • forward citation.
  • backward citation.
  • undirected citation.

It also includes popularity-normalized scores:

propagation_score / log(1 + paper_degree + citation_degree)

This reduces the tendency to over-score globally popular papers.
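The three citation directions and the popularity normalization might be implemented along these lines. Several details are assumptions: `PP[i, j] = 1` is taken to mean paper i cites paper j, `paper_degree` is taken to be the number of authors linked to a paper, and `citation_degree` its undirected citation count. The report does not define these precisely.

```python
import numpy as np

def citation_variants(PP):
    """Build forward, backward, and undirected citation adjacency matrices."""
    PP = np.asarray(PP, dtype=float)
    return {
        "forward": PP,                           # i -> papers that i cites
        "backward": PP.T,                        # i -> papers that cite i
        "undirected": np.clip(PP + PP.T, 0, 1),  # either direction
    }

def popularity_normalize(score, paper_degree, citation_degree):
    """score / log(1 + paper_degree + citation_degree), applied per paper column."""
    denom = np.log1p(paper_degree + citation_degree)
    out = np.zeros_like(score, dtype=float)
    return np.divide(score, denom, out=out, where=denom > 0)

PP = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
variants = citation_variants(PP)
paper_degree = np.array([1, 1, 0])                   # assumed: authors per paper
citation_degree = variants["undirected"].sum(axis=0) # assumed: in + out citations
raw = np.array([[0.0, 0.5, 0.5]])                    # one author, equal raw scores
normalized = popularity_normalize(raw, paper_degree, citation_degree)
```

Papers 1 and 2 start with the same raw score, but paper 1 has more links and is therefore scaled down further.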

Validation Results

| Stage | Validation F1 | AUC |
| --- | --- | --- |
| rich content + 7 random-walk blocks | 0.964947 | 0.994555 |
| + undirected high-order propagation | 0.966556 | 0.994890 |
| + directed high-order propagation | 0.966874 | 0.994918 |

The final public submission is:

validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv
public F1 = 0.96626

Threshold Calibration Observation

An important finding is that validation-optimal probability thresholds do not transfer reliably to test. For the final model, the validation-optimal threshold was:

0.461730808

Applying this threshold directly to test produced a positive ratio of:

0.524195

The public-best final submission instead used rank cutoff:

rank top 50.0% -> positive
force known positives -> positive

This gives a stable test positive ratio of 0.500000.

The reason is that the validation set is an artificial 1:1 positive/negative split, while the test candidate distribution is different. LightGBM scores are strong ranking scores but are not calibrated probabilities under this distribution shift. Therefore, rank cutoff is more robust than transferring the raw validation probability threshold.
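The difference between the two decision rules can be illustrated with a small sketch. The function name `rank_cutoff_labels` and the toy scores are hypothetical; only the threshold value and the 50% ratio come from the report.

```python
import numpy as np

def rank_cutoff_labels(scores, positive_ratio=0.5, known_positive_mask=None):
    """Label the top `positive_ratio` fraction by score as positive."""
    n = len(scores)
    n_pos = int(round(n * positive_ratio))
    order = np.argsort(-scores, kind="stable")   # highest score first
    labels = np.zeros(n, dtype=np.int64)
    labels[order[:n_pos]] = 1
    if known_positive_mask is not None:
        labels[known_positive_mask] = 1          # known positives always win
    return labels

# Toy test scores whose distribution is shifted relative to validation.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.3, 0.2])
known = np.array([True, False, False, False, False, False])

threshold_labels = (scores >= 0.461730808).astype(np.int64)
rank_labels = rank_cutoff_labels(scores, 0.5, known)
print(threshold_labels.mean(), rank_labels.mean())
```

The fixed threshold lets the positive ratio drift with the score distribution, while the rank cutoff pins it. Forcing known positives can push the ratio slightly above the target if any of them fall outside the top half; in this toy case the known positive is already ranked first.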

Final Files

Best final submission:

validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv

Final validation summary:

validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv

Final test scores:

validation_runs/dynamic_seed202/high_order_graph_stack/rich_rw7_highorder_directed_test_pred.npy

High-order feature code:

code/high_order_graph_stack.py

Conclusion

The final system's gains come from combining three complementary signals:

  1. LightGCN-style collaborative filtering.
  2. Random-walk graph embedding proximity.
  3. Explicit citation-aware high-order meta-path propagation.

The high-order propagation features are the most distinctive final contribution. They preserve interpretable path semantics such as A-P-P^k and A-A-P-P^k, separate citation directionality, and reduce popularity bias through normalization. The gain also carried over to the public leaderboard (0.96626 F1, up from about 0.96252 for the earlier random-walk stacker) rather than appearing only on validation.