CS3319 Project 2 Final Deliverable
This package contains the cleaned final artifacts for the CS3319 recommendation-system project. It preserves the core code, data, model checkpoints, cached scores, random-walk model weights, important submissions, and reports for the main stages of the work.
Best Confirmed Result
| Submission | Method | Public LB F1 |
|---|---|---|
| validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv | rich content + 7 random-walk blocks + directed high-order citation propagation + LightGBM, rank top 50% | 0.96626 |
The strongest validation run for this final method is:
validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv
rich_rw7_highorder_directed validation F1 = 0.966874
What Is Included
cs3319_final_deliverable/
  code/                  Core experiment and generation scripts.
  data_and_docs/         Official data files and course documents.
  checkpoints/           LightGCN checkpoints, including final_ens6.
  cached_scores/         Early cached BPR/LightGBM/ensemble score files.
  validation_runs/
    feature_cache/       Cached content and high-order graph features.
    dynamic_seed202/     Curated OOF scores, test scores, model weights, summaries, submissions.
  submissions/           Early confirmed LightGCN submissions.
  reports/               Exploration summary, preliminary report, final report.
  env/                   Environment exports / minimal requirements.
  notes/                 Experiment history.
  manifests/             File manifests from the original transfer package.
Main Stages Preserved
| Stage | Key files | Result |
|---|---|---|
| 6-model LightGCN ensemble | submissions/sub_ens6_t0.36.csv | public 0.93044 |
| Post95 stacker | validation_runs/dynamic_seed202/post95_submission/submission_post95_ens_r0.500.csv | public about 0.95760 |
| Content + BPR-MF stacker | validation_runs/dynamic_seed202/extra_bprmf_submission/submission_post95_content_mf_lgb_score_ge0.500.csv | public about 0.95996 |
| DeepWalk + Node2Vec stacker | validation_runs/dynamic_seed202/node2vec_deepwalk_submission/submission_content_mf_deepwalk_node2vec_lgb_th0.480000.csv | public about 0.96252 |
| High-order citation propagation | validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv | public 0.96626 |
Core Scripts
The most important scripts are:
| Script | Purpose |
|---|---|
| code/train_val_lgcn_ensemble.py | Dynamic validation LightGCN training and score generation. |
| code/generate_post95_submission.py | Post95 LightGCN + graph/content feature submission generator. |
| code/extra_score_sources_ablation.py | Content mean-cos, BPR-MF, and ranker score source ablations. |
| code/node2vec_deepwalk_ablation.py | Initial DeepWalk / Node2Vec score-source ablation. |
| code/randomwalk_systematic_ablation.py | Systematic random-walk feature experiments. |
| code/generate_randomwalk_ensemble_submission.py | Submission generation from selected random-walk feature blocks. |
| code/content_rich_ablation.py | Rich feature.pkl content feature construction. |
| code/high_order_graph_stack.py | Final high-order citation propagation experiment and submission generation. |
| code/error_group_calibration.py | Error analysis, threshold sweep, group calibration, boundary model. |
Final Method Summary
The final method is a LightGBM second-stage model over:
- LightGCN score / rank features.
- Explicit graph/meta-path features.
- Content mean-cos and top-k content similarity features.
- BPR-MF score features.
- Rich author-content profile features.
- Seven systematic DeepWalk / Node2Vec random-walk feature blocks.
- Aggregated random-walk agreement features.
- High-order citation propagation features:
  - `A-P-P^k` propagation paths, in forward-citation, backward-citation, and undirected-citation variants.
  - Popularity-normalized propagation scores.
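To make the high-order citation propagation idea concrete, here is a minimal sketch, not the code in `code/high_order_graph_stack.py`. It assumes `A-P-P^k` means "start from an author's known papers, then follow k citation hops", and it assumes a simple `1 / (1 + in-degree)` popularity normalization. The function name, argument names, and normalization formula are all illustrative assumptions.

```python
import numpy as np
import scipy.sparse as sp


def propagate_author_paper(author_paper, citation, k,
                           direction="forward", normalize_popularity=False):
    """Hypothetical helper: score (author, paper) pairs via k citation hops.

    author_paper: (n_authors, n_papers) 0/1 matrix of known author-paper links.
    citation:     (n_papers, n_papers) matrix, citation[i, j] = 1 if i cites j.
    """
    citation = sp.csr_matrix(citation, dtype=np.float64)
    if direction == "forward":        # follow "cites" edges
        step = citation
    elif direction == "backward":     # follow "is cited by" edges
        step = citation.T.tocsr()
    elif direction == "undirected":   # ignore edge direction
        step = (citation + citation.T).tocsr()
        step.data = np.ones_like(step.data)
    else:
        raise ValueError(f"unknown direction: {direction}")

    # A-P step, then k applications of the P-P step.
    scores = sp.csr_matrix(author_paper, dtype=np.float64)
    for _ in range(k):
        scores = scores @ step

    if normalize_popularity:
        # Assumed normalization: damp papers with many incoming edges.
        popularity = np.asarray(step.sum(axis=0)).ravel()
        scores = scores @ sp.diags(1.0 / (1.0 + popularity))
    return sp.csr_matrix(scores)


if __name__ == "__main__":
    ap = [[1, 0, 0], [0, 1, 0]]              # author 0 wrote paper 0, author 1 wrote paper 1
    cites = [[0, 1, 0], [0, 0, 1], [0, 0, 0]]  # paper 0 -> paper 1 -> paper 2
    print(propagate_author_paper(ap, cites, k=2).toarray())
```

Each `(direction, k, normalize_popularity)` combination yields one score per author-paper pair, which could then be read off as a feature column for the second-stage model.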
The final test decision uses rank cutoff rather than a raw probability threshold:
sort test pairs by final score
predict top 50% as positive
force train/test-overlap known positives to 1
This was more stable than transferring the validation-optimal probability threshold, for two reasons: the validation split is an artificial 1:1 positive/negative split, and LightGBM probabilities are not well calibrated across the validation-to-test distribution shift.
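The three-step decision rule above can be sketched as follows. This is an illustrative sketch, not the submission script itself. The function name, the `known_positive_mask` argument, and the tie-breaking behaviour are assumptions.

```python
import numpy as np


def rank_cutoff_predict(scores, ratio=0.5, known_positive_mask=None):
    """Hypothetical helper: label the top `ratio` fraction of pairs as positive."""
    scores = np.asarray(scores, dtype=float)
    n_positive = int(ratio * len(scores))

    # Sort by descending score; a stable sort keeps ties in input order.
    order = np.argsort(-scores, kind="stable")
    predictions = np.zeros(len(scores), dtype=int)
    predictions[order[:n_positive]] = 1

    # Pairs already known to be positive from train/test overlap are forced to 1.
    if known_positive_mask is not None:
        predictions[np.asarray(known_positive_mask, dtype=bool)] = 1
    return predictions


if __name__ == "__main__":
    print(rank_cutoff_predict([0.9, 0.1, 0.8, 0.3], ratio=0.5))
```

In this sketch the known-positive override runs after the cutoff, so the final positive ratio can end up slightly above `ratio`.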
Environment
Original environment notes are in:
env/environment-cs3319.yml
env/requirements-minimal.txt
The project was run with Python 3.10 and these core packages:
numpy
pandas
scipy
scikit-learn
lightgbm
xgboost
torch
torch-geometric
gensim
node2vec
networkx
Quick Verification
After unzipping the package, the fastest way to verify the final result is:
cd cs3319_final_deliverable
# Check the final validation metric.
cat validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv
# Check generated final submissions and their positive ratios.
cat validation_runs/dynamic_seed202/high_order_graph_stack/submission_summary.csv
# Confirm the best public submission file exists.
ls -lh validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv
Expected key validation row:
rich_rw7_highorder_directed validation F1 = 0.966873736337297
The corresponding public-best file is:
validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv
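The same check can be scripted. The column layout of `validation_summary.csv` is not documented in this README, so the sketch below assumes no column names and simply returns every row in which some cell mentions the method name. The helper name `find_rows` is hypothetical.

```python
import csv
from pathlib import Path


def find_rows(csv_path, needle):
    """Return every CSV row (as a dict) in which any cell contains `needle`."""
    with Path(csv_path).open(newline="") as handle:
        return [
            row
            for row in csv.DictReader(handle)
            if any(needle in str(value) for value in row.values())
        ]


if __name__ == "__main__":
    # Run from the package root (cs3319_final_deliverable/).
    summary = Path("validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv")
    if summary.exists():
        for row in find_rows(summary, "rich_rw7_highorder_directed"):
            print(row)
    else:
        print(f"not found: {summary}")
```

The printed row should contain the validation F1 value quoted above.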
Command Reproduction
The package includes cached feature matrices, random-walk model weights, OOF scores, test scores, and final submissions. The quickest way to inspect the final result is to read:
validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv
validation_runs/dynamic_seed202/high_order_graph_stack/submission_summary.csv
validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv
To regenerate the final high-order stack from the included cached features and random-walk weights:
cd cs3319_final_deliverable
python code/high_order_graph_stack.py \
--package-root . \
--split-seed 202 \
--seed 202 \
--n-splits 5 \
--make-submission
This rewrites:
validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv
validation_runs/dynamic_seed202/high_order_graph_stack/rich_rw7_highorder_directed_test_pred.npy
validation_runs/dynamic_seed202/high_order_graph_stack/submissions/
validation_runs/dynamic_seed202/high_order_graph_stack/submission_summary.csv
The final decision rule is rank-based. The public-best file uses:
ratio = 0.500000
instead of directly applying the validation probability threshold.
To regenerate the earlier 6-model LightGCN ensemble submissions from included checkpoints:
cd cs3319_final_deliverable
python code/generate_ens6_submission.py \
--package-root . \
--device cuda:0
If CUDA is unavailable, use:
python code/generate_ens6_submission.py \
--package-root . \
--device cpu
The confirmed early public file is:
submissions/sub_ens6_t0.36.csv
To regenerate the 7-block random-walk stack that feeds the final high-order experiment:
cd cs3319_final_deliverable
python code/generate_randomwalk_ensemble_submission.py \
--package-root . \
--split-seed 202 \
--main-val-score-file validation_runs/dynamic_seed202/dyn202_l2d512_bpr_bigbatch_more/scores/val_vanilla_ensemble_mean.npy \
--versions \
dw_base_d128_l40_w10_win10 \
dw_long_d128_l80_w10_win10 \
dw_highdim_d256_l40_w10_win10 \
dw_d256_l80_w10_win10 \
dw_seed3407_d128_l40_w10_win10 \
dw_graph_ap_pp \
n2v_p2_q1_d128_l40_w10_win10
The random-walk models required for the final stage are included under:
validation_runs/dynamic_seed202/randomwalk_systematic/models/
The cached high-order and rich content features are included under:
validation_runs/feature_cache/
Some scripts and older metadata files inherited from the original workspace still contain absolute paths. For the curated final artifacts, use the files already included in this deliverable, or adapt those paths so they are relative to the package root.
Reports
Read these in order:
reports/preliminary_report.md
reports/exploration_summary.md
reports/final_report.md
notes/experiment_history.md