# CS3319 Project 2 Final Deliverable

This package contains the cleaned final artifacts for the CS3319 recommendation-system project. It preserves the core code, data, model checkpoints, cached scores, random-walk model weights, important submissions, and reports for the main stages of the work.

## Best Confirmed Result

| Submission | Method | Public LB F1 |
|---|---|---:|
| `validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv` | rich content + 7 random-walk blocks + directed high-order citation propagation + LightGBM, rank top 50% | **0.96626** |

The strongest validation run for this final method is:

```text
validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv
rich_rw7_highorder_directed validation F1 = 0.966874
```

## What Is Included

```text
cs3319_final_deliverable/
  code/                 Core experiment and generation scripts.
  data_and_docs/        Official data files and course documents.
  checkpoints/          LightGCN checkpoints, including final_ens6.
  cached_scores/        Early cached BPR/LightGBM/ensemble score files.
  validation_runs/
    feature_cache/      Cached content and high-order graph features.
    dynamic_seed202/    Curated OOF scores, test scores, model weights, summaries, submissions.
  submissions/          Early confirmed LightGCN submissions.
  reports/              Exploration summary, preliminary report, final report.
  env/                  Environment exports / minimal requirements.
  notes/                Experiment history.
  manifests/            File manifests from the original transfer package.
```

## Main Stages Preserved

| Stage | Key files | Result |
|---|---|---:|
| 6-model LightGCN ensemble | `submissions/sub_ens6_t0.36.csv` | public 0.93044 |
| Post95 stacker | `validation_runs/dynamic_seed202/post95_submission/submission_post95_ens_r0.500.csv` | public about 0.95760 |
| Content + BPR-MF stacker | `validation_runs/dynamic_seed202/extra_bprmf_submission/submission_post95_content_mf_lgb_score_ge0.500.csv` | public about 0.95996 |
| DeepWalk + Node2Vec stacker | `validation_runs/dynamic_seed202/node2vec_deepwalk_submission/submission_content_mf_deepwalk_node2vec_lgb_th0.480000.csv` | public about 0.96252 |
| High-order citation propagation | `validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv` | public 0.96626 |

## Core Scripts

The most important scripts are:

| Script | Purpose |
|---|---|
| `code/train_val_lgcn_ensemble.py` | Dynamic validation LightGCN training and score generation. |
| `code/generate_post95_submission.py` | Post95 LightGCN + graph/content feature submission generator. |
| `code/extra_score_sources_ablation.py` | Content mean-cos, BPR-MF, and ranker score source ablations. |
| `code/node2vec_deepwalk_ablation.py` | Initial DeepWalk / Node2Vec score-source ablation. |
| `code/randomwalk_systematic_ablation.py` | Systematic random-walk feature experiments. |
| `code/generate_randomwalk_ensemble_submission.py` | Submission generation from selected random-walk feature blocks. |
| `code/content_rich_ablation.py` | Rich `feature.pkl` content feature construction. |
| `code/high_order_graph_stack.py` | Final high-order citation propagation experiment and submission generation. |
| `code/error_group_calibration.py` | Error analysis, threshold sweep, group calibration, boundary model. |

## Final Method Summary

The final method is a LightGBM second-stage model over:

- LightGCN score / rank features.
- Explicit graph/meta-path features.
- Content mean-cos and top-k content similarity features.
- BPR-MF score features.
- Rich author-content profile features.
- Seven systematic DeepWalk / Node2Vec random-walk feature blocks.
- Aggregated random-walk agreement features.
- High-order citation propagation features:
  - `A-P-P^k`
  - `A-A-P-P^k`
  - forward citation, backward citation, and undirected citation variants.
  - popularity-normalized propagation scores.

The final test decision uses rank cutoff rather than a raw probability threshold:

```text
sort test pairs by final score
predict top 50% as positive
force train/test-overlap known positives to 1
```

This was more stable than transferring the validation-optimal probability threshold because the validation split is an artificial 1:1 positive/negative split and LightGBM probabilities are not well calibrated across the validation-test distribution shift.

## Environment

Original environment notes are in:

```text
env/environment-cs3319.yml
env/requirements-minimal.txt
```

The project was run with Python 3.10 and these core packages:

```text
numpy
pandas
scipy
scikit-learn
lightgbm
xgboost
torch
torch-geometric
gensim
node2vec
networkx
```

## Quick Verification

After unzipping the package, the fastest way to verify the final result is:

```bash
cd cs3319_final_deliverable

# Check the final validation metric.
cat validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv

# Check generated final submissions and their positive ratios.
cat validation_runs/dynamic_seed202/high_order_graph_stack/submission_summary.csv

# Confirm the best public submission file exists.
ls -lh validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv
```

Expected key validation row:

```text
rich_rw7_highorder_directed validation F1 = 0.966873736337297
```

The corresponding public-best file is:

```text
validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv
```

## Command Reproduction

The package includes cached feature matrices, random-walk model weights, OOF scores, test scores, and final submissions. The quickest way to inspect the final result is to read:

```text
validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv
validation_runs/dynamic_seed202/high_order_graph_stack/submission_summary.csv
validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv
```

To regenerate the final high-order stack from the included cached features and random-walk weights:

```bash
cd cs3319_final_deliverable
python code/high_order_graph_stack.py \
  --package-root . \
  --split-seed 202 \
  --seed 202 \
  --n-splits 5 \
  --make-submission
```

This rewrites:

```text
validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv
validation_runs/dynamic_seed202/high_order_graph_stack/rich_rw7_highorder_directed_test_pred.npy
validation_runs/dynamic_seed202/high_order_graph_stack/submissions/
validation_runs/dynamic_seed202/high_order_graph_stack/submission_summary.csv
```

The final decision rule is rank-based. The public-best file uses:

```text
ratio = 0.500000
```

instead of directly applying the validation probability threshold.

To regenerate the earlier 6-model LightGCN ensemble submissions from included checkpoints:

```bash
cd cs3319_final_deliverable
python code/generate_ens6_submission.py \
  --package-root . \
  --device cuda:0
```

If CUDA is unavailable, use:

```bash
python code/generate_ens6_submission.py \
  --package-root . \
  --device cpu
```

The confirmed early public file is:

```text
submissions/sub_ens6_t0.36.csv
```

To regenerate the 7-block random-walk stack that feeds the final high-order experiment:

```bash
cd cs3319_final_deliverable
python code/generate_randomwalk_ensemble_submission.py \
  --package-root . \
  --split-seed 202 \
  --main-val-score-file validation_runs/dynamic_seed202/dyn202_l2d512_bpr_bigbatch_more/scores/val_vanilla_ensemble_mean.npy \
  --versions \
    dw_base_d128_l40_w10_win10 \
    dw_long_d128_l80_w10_win10 \
    dw_highdim_d256_l40_w10_win10 \
    dw_d256_l80_w10_win10 \
    dw_seed3407_d128_l40_w10_win10 \
    dw_graph_ap_pp \
    n2v_p2_q1_d128_l40_w10_win10
```

The random-walk models required for the final stage are included under:

```text
validation_runs/dynamic_seed202/randomwalk_systematic/models/
```

The cached high-order and rich content features are included under:

```text
validation_runs/feature_cache/
```

Some scripts inherited from the original workspace contain absolute paths in older metadata files. For the curated final artifacts, use the files already included in this deliverable or adapt paths relative to the package root.

## Reports

Read these in order:

```text
reports/preliminary_report.md
reports/exploration_summary.md
reports/final_report.md
notes/experiment_history.md
```
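
## Appendix: Rank-Cutoff Sketch

The rank-based decision rule from the Final Method Summary (sort by score, predict the top 50% as positive, force known positives to 1) can be illustrated with a minimal sketch. This is not the code in `code/high_order_graph_stack.py`. The function name `rank_cutoff_predict`, the `known_positive_mask` argument, the rounding of the cutoff count, and the tie-breaking order are all assumptions made for illustration, and the scores are toy values.

```python
import numpy as np


def rank_cutoff_predict(scores, ratio=0.5, known_positive_mask=None):
    """Label the top `ratio` fraction of pairs (by score) as positive.

    Hypothetical helper: the name, signature, rounding, and tie-breaking
    are illustrative assumptions, not taken from the project scripts.
    """
    scores = np.asarray(scores, dtype=float)
    # Number of pairs to mark positive (assumed: round to nearest integer).
    n_pos = int(round(ratio * scores.size))
    # Indices sorted by descending score; stable sort keeps ties in input order.
    order = np.argsort(-scores, kind="stable")
    preds = np.zeros(scores.size, dtype=np.int8)
    preds[order[:n_pos]] = 1
    # Force pairs already known to be positive (train/test overlap) to 1.
    if known_positive_mask is not None:
        preds[np.asarray(known_positive_mask, dtype=bool)] = 1
    return preds


# Toy example: six test pairs, one of which is a known positive.
scores = np.array([0.91, 0.15, 0.78, 0.40, 0.66, 0.05])
known = np.array([False, True, False, False, False, False])
preds = rank_cutoff_predict(scores, ratio=0.5, known_positive_mask=known)
print(preds.tolist())  # [1, 1, 1, 0, 1, 0]
```

Because the known-positive override is applied after the cutoff, the final positive ratio can end up slightly above `ratio`, as in the toy example where four of six pairs are labeled positive.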