Final Report: Citation-Aware High-Order Graph Recommendation
Abstract
This project treats academic paper recommendation as author-paper link prediction on a
heterogeneous academic graph. The final system combines LightGCN collaborative filtering, explicit
graph/meta-path features, content features from feature.pkl, BPR-MF scores, DeepWalk / Node2Vec
random-walk embeddings, and a new citation-aware high-order propagation feature family. The best
confirmed public leaderboard score is 0.96626 F1, achieved by
submission_rich_rw7_highorder_directed_r0.500000.csv.
Data
The official data includes:
- bipartite_train_ann.txt: author-paper training positives.
- bipartite_test_ann.txt: author-paper pairs to predict.
- author_file_ann.txt: author-author collaboration edges.
- paper_file_ann.txt: paper-paper citation edges.
- feature.pkl: 512-dimensional paper content features.
The graph has 6,611 authors and about 79,937 papers. The test set contains 2,047,262 author-paper pairs.
Baseline And Early Models
The initial notebook-style heterogeneous GNN baseline reached validation F1 around 0.885. Several variants were tried:
- SAGEConv heterogeneous GNN with MLP decoder.
- BPR ranking loss.
- LightGBM structural feature baselines.
- BPR-MF recommender baselines.
- Multiple LightGCN variants.
The first stable confirmed public result was a 6-model LightGCN ensemble:
submissions/sub_ens6_t0.36.csv
public F1 = 0.93044
This ensemble averaged cosine scores from six LightGCN checkpoints and forced known train/test overlap positives to 1.
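The ensembling step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the `(author_emb, paper_emb)` checkpoint layout, the boolean `known_positive_mask`, and the `threshold` parameter are all assumptions made for the example.

```python
import numpy as np

def cosine_scores(author_emb, paper_emb, pairs):
    """Cosine similarity for each (author_index, paper_index) row of `pairs`."""
    a = author_emb[pairs[:, 0]]
    p = paper_emb[pairs[:, 1]]
    num = np.sum(a * p, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(p, axis=1) + 1e-12
    return num / den

def ensemble_predict(checkpoints, pairs, known_positive_mask, threshold):
    """Average cosine scores over checkpoints, threshold, then force known positives.

    `checkpoints` is a list of (author_emb, paper_emb) arrays (hypothetical layout).
    """
    per_model = [cosine_scores(a, p, pairs) for a, p in checkpoints]
    scores = np.mean(per_model, axis=0)
    labels = (scores >= threshold).astype(np.int64)
    labels[known_positive_mask] = 1  # train/test overlap pairs are always positive
    return scores, labels

# Toy usage with random embeddings standing in for six LightGCN checkpoints.
rng = np.random.default_rng(0)
checkpoints = [(rng.normal(size=(4, 8)), rng.normal(size=(5, 8))) for _ in range(6)]
pairs = np.array([[0, 1], [2, 3], [3, 4]])
known = np.array([False, True, False])
scores, labels = ensemble_predict(checkpoints, pairs, known, threshold=0.5)
```

Averaging scores before thresholding keeps a single decision boundary for the whole ensemble; the override is applied last so that no threshold choice can flip a known positive.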
First Major Breakthrough: Feature Stacking
The first large improvement came from moving beyond pure LightGCN scores. The model stacked:
- LightGCN score and rank features.
- Author degree and paper degree.
- Coauthor evidence.
- Citation in/out degree.
- Author-history and candidate-paper citation overlaps.
- Meta-path counts such as A-A-P, A-P-P, and A-P-A-P.
- Content similarity features.
- BPR-MF score features.
The second-stage model was LightGBM trained with out-of-fold (OOF) validation. With the content + BPR-MF stacker, public F1 rose to about 0.95996.
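As an illustration of the meta-path count features listed above, the sketch below computes A-A-P, A-P-P, and A-P-A-P path counts by multiplying adjacency matrices. It assumes binary dense matrices `B` (author-paper), `S` (author-author), and `C` (paper-paper, with `C[i, j] = 1` meaning paper i cites paper j); these names and the citation direction are assumptions for the example. Dense arrays are only workable at toy size, and a graph of the scale described in this report would need sparse matrices.

```python
import numpy as np

def metapath_counts(B, S, C, pairs):
    """Count typed paths between each (author, paper) pair.

    B: (n_authors, n_papers) binary author-paper matrix.
    S: (n_authors, n_authors) binary coauthor matrix.
    C: (n_papers, n_papers) binary citation matrix (assumed: row cites column).
    """
    a_idx, p_idx = pairs[:, 0], pairs[:, 1]
    aap = S @ B          # coauthors of a who wrote p
    app = B @ C          # papers of a that cite p
    apap = B @ B.T @ B   # author -> paper -> author -> paper walks
    return {
        "A-A-P": aap[a_idx, p_idx],
        "A-P-P": app[a_idx, p_idx],
        "A-P-A-P": apap[a_idx, p_idx],
    }

# Toy graph: 3 authors, 4 papers.
B = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 0, 1]])
S = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 0]])
C = np.zeros((4, 4), dtype=int)
C[0, 2] = 1  # paper 0 cites paper 2
C[1, 3] = 1  # paper 1 cites paper 3
pairs = np.array([[0, 2], [1, 3], [2, 0]])
counts = metapath_counts(B, S, C, pairs)
```

Each count becomes one column of the stacker's feature table, alongside the LightGCN, content, and BPR-MF scores.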
Second Major Breakthrough: DeepWalk / Node2Vec
The next improvement came from random-walk graph embedding score sources. DeepWalk and Node2Vec were trained on mixed academic graphs using author-paper, paper-paper citation, and author-author coauthor edges. For each author-paper pair, the model constructed:
- dot product.
- cosine similarity.
- global rank.
- author-wise rank / percentile.
Adding DeepWalk and Node2Vec to the content + BPR-MF stacker improved public F1 to about 0.96252. Further systematic random-walk experiments showed that higher-dimensional DeepWalk and longer walks improved validation, but larger random-walk ensembles began to overfit the seed202 validation split.
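The four per-pair features built from each embedding source can be sketched as below. This is a minimal version under stated assumptions: embeddings are given as separate author and paper arrays, ranks are computed on the dot product (the report does not say which score was ranked), rank 0 is the highest score, and ties are broken by input order.

```python
import numpy as np

def descending_rank(x):
    """Rank 0 for the largest value; ties broken by original order."""
    order = np.argsort(-x, kind="stable")
    ranks = np.empty(len(x), dtype=np.float64)
    ranks[order] = np.arange(len(x))
    return ranks

def pair_embedding_features(author_emb, paper_emb, pairs):
    """Dot, cosine, global rank, and author-wise rank percentile per pair."""
    a = author_emb[pairs[:, 0]]
    p = paper_emb[pairs[:, 1]]
    dot = np.sum(a * p, axis=1)
    cos = dot / (np.linalg.norm(a, axis=1) * np.linalg.norm(p, axis=1) + 1e-12)
    global_rank = descending_rank(dot)
    author_pct = np.zeros(len(pairs), dtype=np.float64)
    # Simple per-author loop; a grouped sort would be faster on millions of pairs.
    for author in np.unique(pairs[:, 0]):
        idx = np.flatnonzero(pairs[:, 0] == author)
        author_pct[idx] = descending_rank(dot[idx]) / max(len(idx) - 1, 1)
    return {"dot": dot, "cos": cos,
            "global_rank": global_rank, "author_pct": author_pct}

# Toy usage: 2 authors, 3 papers, 4 candidate pairs.
author_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
paper_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
pairs = np.array([[0, 0], [0, 1], [0, 2], [1, 2]])
feats = pair_embedding_features(author_emb, paper_emb, pairs)
```

The author-wise percentile matters because raw scores are not comparable across authors with very different degrees; ranking within an author's own candidate list removes that scale difference.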
Third Major Breakthrough: High-Order Citation Propagation
The final and most important innovation was explicit high-order citation propagation. Instead of training more random-walk embeddings, we computed deterministic propagation features over typed meta-paths.
Let:
- R be the row-normalized author-paper interaction matrix.
- C be the row-normalized paper-paper citation matrix.
- S be the row-normalized author-author coauthor matrix.
Author-history citation propagation is:
H_k = R C^k
For a candidate pair (a, p), H_k[a, p] measures whether candidate paper p is reachable from
author a's historical papers through k citation steps.
Coauthor-based propagation is:
G_k = S R C^k
This captures whether candidate paper p is reachable from the historical papers of author a's
collaborators.
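A small dense sketch of H_k = R C^k and G_k = S R C^k is given below. It is illustrative only: the function names and the `max_k` parameter are assumptions, the number of hops used in the project is not stated here, and the real graph would need sparse matrices rather than dense NumPy arrays.

```python
import numpy as np

def row_normalize(m):
    """Divide each row by its sum; all-zero rows stay zero."""
    m = np.asarray(m, dtype=np.float64)
    sums = m.sum(axis=1, keepdims=True)
    return np.divide(m, sums, out=np.zeros_like(m), where=sums > 0)

def propagation_features(author_paper, citation, coauthor, max_k):
    """Return lists [H_1..H_k] and [G_1..G_k] of dense score matrices."""
    R = row_normalize(author_paper)
    C = row_normalize(citation)
    S = row_normalize(coauthor)
    H = R
    H_list, G_list = [], []
    for _ in range(max_k):
        H = H @ C              # H_k = R C^k, built one citation step at a time
        H_list.append(H)
        G_list.append(S @ H)   # G_k = S R C^k
    return H_list, G_list

# Toy graph: 2 authors, 4 papers; paper 0 -> 2, 1 -> 2, 2 -> 3 (row cites column).
author_paper = np.array([[1, 1, 0, 0],
                         [0, 0, 1, 0]])
citation = np.zeros((4, 4))
citation[0, 2] = citation[1, 2] = citation[2, 3] = 1
coauthor = np.array([[0, 1],
                     [1, 0]])
H_list, G_list = propagation_features(author_paper, citation, coauthor, max_k=2)
# H_list[0][0, 2] is 1.0: both of author 0's papers cite paper 2.
```

Multiplying step by step reuses H_{k-1} instead of forming C^k explicitly, which keeps the cost to one matrix product per hop.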
The final version uses three citation directions:
- forward citation.
- backward citation.
- undirected citation.
It also includes popularity-normalized scores:
propagation_score / log(1 + paper_degree + citation_degree)
This reduces the tendency to over-score globally popular papers.
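The three citation directions and the popularity normalization can be sketched as follows. The sketch makes several assumptions: `citation[i, j] = 1` means paper i cites paper j, "backward" is taken to be the transpose, "undirected" is the symmetrized matrix, and a zero denominator (a paper with no degree at all) maps to a score of 0. The exact definitions of `paper_degree` and `citation_degree` in the project code are not given here, so they are passed in as plain arrays.

```python
import numpy as np

def citation_variants(citation):
    """Forward, backward, and undirected versions of a citation adjacency matrix."""
    c = np.asarray(citation, dtype=np.float64)
    return {
        "forward": c,
        "backward": c.T,
        "undirected": np.maximum(c, c.T),
    }

def popularity_normalize(score, paper_degree, citation_degree):
    """score / log(1 + paper_degree + citation_degree), with 0 where the log is 0."""
    score = np.asarray(score, dtype=np.float64)
    total = np.asarray(paper_degree, dtype=np.float64) + np.asarray(
        citation_degree, dtype=np.float64)
    denom = np.log1p(total)
    return np.divide(score, denom, out=np.zeros_like(score), where=denom > 0)

# Toy usage: the second paper has total degree 1, so its score is divided by log(2).
variants = citation_variants([[0, 1], [0, 0]])
normalized = popularity_normalize([1.0, 2.0], [0, 1], [0, 0])
```

Each variant can be row-normalized and fed through the same propagation, giving one feature block per direction.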
Validation Results
| Stage | Validation F1 | AUC |
|---|---|---|
| rich content + 7 random-walk blocks | 0.964947 | 0.994555 |
| + undirected high-order propagation | 0.966556 | 0.994890 |
| + directed high-order propagation | 0.966874 | 0.994918 |
The final public submission is:
validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv
public F1 = 0.96626
Threshold Calibration Observation
An important finding is that validation-optimal probability thresholds do not transfer reliably to test. For the final model, the validation-optimal threshold was:
0.461730808
Applying this threshold directly to test produced a positive ratio of:
0.524195
The public-best final submission instead used rank cutoff:
rank top 50.0% -> positive
force known positives -> positive
This gives a stable test positive ratio of 0.500000.
The reason is that the validation set is an artificial 1:1 positive/negative split, while the test candidate distribution is different. LightGBM scores are strong ranking scores but are not calibrated probabilities under this distribution shift. Therefore, rank cutoff is more robust than transferring the raw validation probability threshold.
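The rank-cutoff rule can be sketched as below. This is one possible reading of the two steps above, not the project's confirmed implementation: known positives are boosted to the top of the ranking before the cutoff is applied, so the positive ratio stays exactly at the target as long as known positives number no more than the cutoff. The function and parameter names are hypothetical.

```python
import numpy as np

def rank_cutoff_labels(scores, known_positive_mask, positive_ratio=0.5):
    """Label the top `positive_ratio` fraction positive, known positives first."""
    scores = np.asarray(scores, dtype=np.float64)
    known = np.asarray(known_positive_mask, dtype=bool)
    n_pos = int(round(positive_ratio * len(scores)))
    # Assumption: known positives are ranked above every model score.
    adjusted = np.where(known, np.inf, scores)
    order = np.argsort(-adjusted, kind="stable")
    labels = np.zeros(len(scores), dtype=np.int64)
    labels[order[:n_pos]] = 1
    return labels

# Toy usage: 6 candidates, one known positive with a low model score.
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3])
known = np.array([False, True, False, False, False, False])
labels = rank_cutoff_labels(scores, known, positive_ratio=0.5)
```

Because the rule depends only on the ordering of scores, any monotone miscalibration of the LightGBM outputs leaves the predicted labels unchanged, which is why it is less sensitive to the validation/test shift than a fixed probability threshold.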
Final Files
Best final submission:
validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv
Final validation summary:
validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv
Final test scores:
validation_runs/dynamic_seed202/high_order_graph_stack/rich_rw7_highorder_directed_test_pred.npy
High-order feature code:
code/high_order_graph_stack.py
Conclusion
The final system's gains come from combining three complementary signals:
- LightGCN-style collaborative filtering.
- Random-walk graph embedding proximity.
- Explicit citation-aware high-order meta-path propagation.
The high-order propagation features are the most distinctive final contribution. They preserve
interpretable path semantics such as A-P-P^k and A-A-P-P^k, separate citation directionality,
and reduce popularity bias through normalization. This turned out to be a real public leaderboard
improvement rather than only a validation gain.