# CS3319 Project 2 Final Deliverable

This package contains the cleaned final artifacts for the CS3319 recommendation-system project.
It preserves the core code, data, model checkpoints, cached scores, random-walk model weights,
important submissions, and reports for the main stages of the work.

## Best Confirmed Result

| Submission | Method | Public LB F1 |
|---|---|---:|
| `validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv` | rich content + 7 random-walk blocks + directed high-order citation propagation + LightGBM, rank top 50% | **0.96626** |

The strongest validation run for this final method is:

```text
validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv
rich_rw7_highorder_directed validation F1 = 0.966874
```

## What Is Included

```text
cs3319_final_deliverable/
  code/                 Core experiment and generation scripts.
  data_and_docs/        Official data files and course documents.
  checkpoints/          LightGCN checkpoints, including final_ens6.
  cached_scores/        Early cached BPR/LightGBM/ensemble score files.
  validation_runs/
    feature_cache/      Cached content and high-order graph features.
    dynamic_seed202/    Curated OOF scores, test scores, model weights, summaries, submissions.
  submissions/          Early confirmed LightGCN submissions.
  reports/              Exploration summary, preliminary report, final report.
  env/                  Environment exports / minimal requirements.
  notes/                Experiment history.
  manifests/            File manifests from the original transfer package.
```

## Main Stages Preserved

| Stage | Key files | Result |
|---|---|---:|
| 6-model LightGCN ensemble | `submissions/sub_ens6_t0.36.csv` | public 0.93044 |
| Post95 stacker | `validation_runs/dynamic_seed202/post95_submission/submission_post95_ens_r0.500.csv` | public about 0.95760 |
| Content + BPR-MF stacker | `validation_runs/dynamic_seed202/extra_bprmf_submission/submission_post95_content_mf_lgb_score_ge0.500.csv` | public about 0.95996 |
| DeepWalk + Node2Vec stacker | `validation_runs/dynamic_seed202/node2vec_deepwalk_submission/submission_content_mf_deepwalk_node2vec_lgb_th0.480000.csv` | public about 0.96252 |
| High-order citation propagation | `validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv` | public 0.96626 |

## Core Scripts

The most important scripts are:

| Script | Purpose |
|---|---|
| `code/train_val_lgcn_ensemble.py` | Dynamic validation LightGCN training and score generation. |
| `code/generate_post95_submission.py` | Post95 LightGCN + graph/content feature submission generator. |
| `code/extra_score_sources_ablation.py` | Content mean-cos, BPR-MF, and ranker score source ablations. |
| `code/node2vec_deepwalk_ablation.py` | Initial DeepWalk / Node2Vec score-source ablation. |
| `code/randomwalk_systematic_ablation.py` | Systematic random-walk feature experiments. |
| `code/generate_randomwalk_ensemble_submission.py` | Submission generation from selected random-walk feature blocks. |
| `code/content_rich_ablation.py` | Rich `feature.pkl` content feature construction. |
| `code/high_order_graph_stack.py` | Final high-order citation propagation experiment and submission generation. |
| `code/error_group_calibration.py` | Error analysis, threshold sweep, group calibration, boundary model. |

## Final Method Summary

The final method is a LightGBM second-stage model over:

- LightGCN score / rank features.
- Explicit graph/meta-path features.
- Content mean-cos and top-k content similarity features.
- BPR-MF score features.
- Rich author-content profile features.
- Seven systematic DeepWalk / Node2Vec random-walk feature blocks.
- Aggregated random-walk agreement features.
- High-order citation propagation features:
  - `A-P-P^k`
  - `A-A-P-P^k`
  - forward citation, backward citation, and undirected citation variants.
  - popularity-normalized propagation scores.

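The high-order citation propagation features listed above can be illustrated with a minimal pure-Python sketch. It is not the code in `code/high_order_graph_stack.py`: the function names, the dictionary-based graph representation, and the `1 / (1 + in-degree)` normalization are all assumptions chosen for illustration. The sketch counts k-hop citation walks starting from an author's own papers (the `A-P-P^k` pattern) in the forward, backward, or undirected direction.

```python
from collections import defaultdict


def build_adjacency(cites, direction="forward"):
    """Build a paper -> neighbours map from a citing -> [cited, ...] map."""
    if direction not in ("forward", "backward", "undirected"):
        raise ValueError(f"unknown direction: {direction}")
    adj = defaultdict(list)
    for citing, cited_list in cites.items():
        for cited in cited_list:
            if direction in ("forward", "undirected"):
                adj[citing].append(cited)
            if direction in ("backward", "undirected"):
                adj[cited].append(citing)
    return adj


def app_k_scores(start_papers, cites, k, direction="forward"):
    """Count k-hop citation walks from the start papers to each reachable paper."""
    adj = build_adjacency(cites, direction)
    frontier = defaultdict(float)
    for paper in start_papers:
        frontier[paper] += 1.0
    for _ in range(k):
        nxt = defaultdict(float)
        for paper, weight in frontier.items():
            for neighbour in adj.get(paper, ()):
                nxt[neighbour] += weight
        frontier = nxt
    return dict(frontier)


def popularity_normalize(scores, cites):
    """Divide each score by 1 + citations received (a hypothetical normalizer)."""
    in_degree = defaultdict(int)
    for cited_list in cites.values():
        for cited in cited_list:
            in_degree[cited] += 1
    return {p: s / (1.0 + in_degree.get(p, 0)) for p, s in scores.items()}
```

Under the same assumptions, an `A-A-P-P^k` variant would pass the papers of an author's co-authors as `start_papers` instead of the author's own papers.
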
The final test decision uses a rank cutoff rather than a raw probability threshold:

```text
sort test pairs by final score
predict top 50% as positive
force train/test-overlap known positives to 1
```

This was more stable than transferring the validation-optimal probability threshold because the
validation split is an artificial 1:1 positive/negative split and LightGBM probabilities are not
well calibrated across the validation-test distribution shift.

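The three-step decision rule above can be sketched as follows. This is a minimal illustration, not the project's submission code: the function name `rank_cutoff_labels`, the plain-list inputs, and the choice to floor the number of positives are assumptions.

```python
def rank_cutoff_labels(scores, ratio=0.5, known_positive=()):
    """Label the top `ratio` fraction of pairs by score as positive.

    `known_positive` holds indices of test pairs that also appear as
    positives in the training data; they are forced to 1 afterwards.
    """
    n = len(scores)
    n_pos = int(n * ratio)  # floor; assumption about how the cutoff is rounded
    # Sort indices by descending score; ties keep their original order.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    labels = [0] * n
    for i in order[:n_pos]:
        labels[i] = 1
    for i in known_positive:
        labels[i] = 1
    return labels
```

Because the known positives are forced after the cutoff, the final positive ratio can end up slightly above `ratio` when some of them fall outside the top-ranked half.
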
## Environment

Original environment notes are in:

```text
env/environment-cs3319.yml
env/requirements-minimal.txt
```

The project was run with Python 3.10 and these core packages:

```text
numpy
pandas
scipy
scikit-learn
lightgbm
xgboost
torch
torch-geometric
gensim
node2vec
networkx
```

## Quick Verification

After unzipping the package, the fastest way to verify the final result is:

```bash
cd cs3319_final_deliverable

# Check the final validation metric.
cat validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv

# Check generated final submissions and their positive ratios.
cat validation_runs/dynamic_seed202/high_order_graph_stack/submission_summary.csv

# Confirm the best public submission file exists.
ls -lh validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv
```

Expected key validation row:

```text
rich_rw7_highorder_directed validation F1 = 0.966873736337297
```

The corresponding public-best file is:

```text
validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv
```

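To cross-check a submission's positive ratio independently of `submission_summary.csv`, a small helper like the one below may be useful. It is a sketch under assumptions: the submission column names are not documented here, so the helper assumes a header row and a 0/1 label in the last column, and the name `positive_ratio` is hypothetical.

```python
import csv


def positive_ratio(csv_path, label_index=-1, has_header=True):
    """Return the fraction of rows whose label column equals 1.

    Assumes the label is stored as 0/1 in column `label_index`
    (the last column by default); adjust for the real file layout.
    """
    with open(csv_path, newline="") as f:
        rows = [row for row in csv.reader(f) if row]
    if has_header:
        rows = rows[1:]
    if not rows:
        raise ValueError(f"no data rows in {csv_path}")
    labels = [int(float(row[label_index])) for row in rows]
    return sum(labels) / len(labels)
```

For the public-best file, the result should be close to the rank cutoff ratio of 0.5, possibly slightly higher if known positives were forced to 1.
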
## Command Reproduction

The package includes cached feature matrices, random-walk model weights, OOF scores, test scores,
and final submissions. The quickest way to inspect the final result is to read:

```text
validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv
validation_runs/dynamic_seed202/high_order_graph_stack/submission_summary.csv
validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv
```

To regenerate the final high-order stack from the included cached features and random-walk weights:

```bash
cd cs3319_final_deliverable
python code/high_order_graph_stack.py \
  --package-root . \
  --split-seed 202 \
  --seed 202 \
  --n-splits 5 \
  --make-submission
```

This rewrites:

```text
validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv
validation_runs/dynamic_seed202/high_order_graph_stack/rich_rw7_highorder_directed_test_pred.npy
validation_runs/dynamic_seed202/high_order_graph_stack/submissions/
validation_runs/dynamic_seed202/high_order_graph_stack/submission_summary.csv
```

The final decision rule is rank-based. The public-best file uses:

```text
ratio = 0.500000
```

instead of directly applying the validation probability threshold.

To regenerate the earlier 6-model LightGCN ensemble submissions from included checkpoints:

```bash
cd cs3319_final_deliverable
python code/generate_ens6_submission.py \
  --package-root . \
  --device cuda:0
```

If CUDA is unavailable, use:

```bash
python code/generate_ens6_submission.py \
  --package-root . \
  --device cpu
```

The confirmed early public file is:

```text
submissions/sub_ens6_t0.36.csv
```

To regenerate the 7-block random-walk stack that feeds the final high-order experiment:

```bash
cd cs3319_final_deliverable
python code/generate_randomwalk_ensemble_submission.py \
  --package-root . \
  --split-seed 202 \
  --main-val-score-file validation_runs/dynamic_seed202/dyn202_l2d512_bpr_bigbatch_more/scores/val_vanilla_ensemble_mean.npy \
  --versions \
    dw_base_d128_l40_w10_win10 \
    dw_long_d128_l80_w10_win10 \
    dw_highdim_d256_l40_w10_win10 \
    dw_d256_l80_w10_win10 \
    dw_seed3407_d128_l40_w10_win10 \
    dw_graph_ap_pp \
    n2v_p2_q1_d128_l40_w10_win10
```

The random-walk models required for the final stage are included under:

```text
validation_runs/dynamic_seed202/randomwalk_systematic/models/
```

The cached high-order and rich content features are included under:

```text
validation_runs/feature_cache/
```

Some scripts inherited from the original workspace contain absolute paths in older metadata files.
For the curated final artifacts, use the files already included in this deliverable or adapt paths
relative to the package root.

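One way to adapt a stale absolute path is to look for the longest trailing part of it that exists under the package root. The helper below is a hypothetical sketch, not part of the deliverable: the name `rebase_to_package` is invented, and it assumes the old paths are POSIX-style and that the directory layout below the old workspace root matches this package.

```python
from pathlib import Path, PurePosixPath


def rebase_to_package(old_path, package_root):
    """Map an old absolute path onto `package_root`.

    Tries successively shorter suffixes of `old_path` and returns the
    first one that exists under `package_root`, or None if none does.
    """
    old = PurePosixPath(old_path)
    # Drop the leading "/" so that joinpath does not reset to the filesystem root.
    parts = old.parts[1:] if old.is_absolute() else old.parts
    root = Path(package_root)
    for start in range(len(parts)):
        candidate = root.joinpath(*parts[start:])
        if candidate.exists():
            return candidate
    return None
```

A bare file name is the shortest suffix tried, so a match on the file name alone should be checked by hand before it is trusted.
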
## Reports

Read these in order:

```text
reports/preliminary_report.md
reports/exploration_summary.md
reports/final_report.md
notes/experiment_history.md
```