CS3319 Project 2 Final Deliverable

This package contains the cleaned final artifacts for the CS3319 recommendation-system project. It preserves the core code, data, model checkpoints, cached scores, random-walk model weights, important submissions, and reports for the main stages of the work.

Best Confirmed Result

| Submission | Method | Public LB F1 |
| --- | --- | --- |
| `validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv` | rich content + 7 random-walk blocks + directed high-order citation propagation + LightGBM, rank top 50% | 0.96626 |

The strongest validation run for this final method is:

validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv
rich_rw7_highorder_directed validation F1 = 0.966874

What Is Included

cs3319_final_deliverable/
  code/                         Core experiment and generation scripts.
  data_and_docs/                Official data files and course documents.
  checkpoints/                  LightGCN checkpoints, including final_ens6.
  cached_scores/                Early cached BPR/LightGBM/ensemble score files.
  validation_runs/
    feature_cache/              Cached content and high-order graph features.
    dynamic_seed202/            Curated OOF scores, test scores, model weights, summaries, submissions.
  submissions/                  Early confirmed LightGCN submissions.
  reports/                      Exploration summary, preliminary report, final report.
  env/                          Environment exports / minimal requirements.
  notes/                        Experiment history.
  manifests/                    File manifests from the original transfer package.

Main Stages Preserved

| Stage | Key files | Result |
| --- | --- | --- |
| 6-model LightGCN ensemble | `submissions/sub_ens6_t0.36.csv` | public 0.93044 |
| Post95 stacker | `validation_runs/dynamic_seed202/post95_submission/submission_post95_ens_r0.500.csv` | public about 0.95760 |
| Content + BPR-MF stacker | `validation_runs/dynamic_seed202/extra_bprmf_submission/submission_post95_content_mf_lgb_score_ge0.500.csv` | public about 0.95996 |
| DeepWalk + Node2Vec stacker | `validation_runs/dynamic_seed202/node2vec_deepwalk_submission/submission_content_mf_deepwalk_node2vec_lgb_th0.480000.csv` | public about 0.96252 |
| High-order citation propagation | `validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv` | public 0.96626 |

Core Scripts

The most important scripts are:

| Script | Purpose |
| --- | --- |
| `code/train_val_lgcn_ensemble.py` | Dynamic validation LightGCN training and score generation. |
| `code/generate_post95_submission.py` | Post95 LightGCN + graph/content feature submission generator. |
| `code/extra_score_sources_ablation.py` | Content mean-cos, BPR-MF, and ranker score source ablations. |
| `code/node2vec_deepwalk_ablation.py` | Initial DeepWalk / Node2Vec score-source ablation. |
| `code/randomwalk_systematic_ablation.py` | Systematic random-walk feature experiments. |
| `code/generate_randomwalk_ensemble_submission.py` | Submission generation from selected random-walk feature blocks. |
| `code/content_rich_ablation.py` | Rich `feature.pkl` content feature construction. |
| `code/high_order_graph_stack.py` | Final high-order citation propagation experiment and submission generation. |
| `code/error_group_calibration.py` | Error analysis, threshold sweep, group calibration, boundary model. |

Final Method Summary

The final method is a LightGBM second-stage model over:

- LightGCN score / rank features.
- Explicit graph/meta-path features.
- Content mean-cos and top-k content similarity features.
- BPR-MF score features.
- Rich author-content profile features.
- Seven systematic DeepWalk / Node2Vec random-walk feature blocks.
- Aggregated random-walk agreement features.
- High-order citation propagation features:
  - A-P-P^k
  - A-A-P-P^k
  - forward citation, backward citation, and undirected citation variants
  - popularity-normalized propagation scores
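
As a rough illustration of the propagation features listed above, the sketch below computes A-P-P^k scores on a toy graph with dense NumPy arrays. The matrix names (`author_paper`, `citations`), the direction convention, and the in-degree normalization are assumptions made for this example; they are not taken from `code/high_order_graph_stack.py`, and a graph of realistic size would need sparse matrices.

```python
import numpy as np


def propagate(author_paper, citations, k, direction="forward"):
    """Return the author-by-paper matrix A-P-P^k for one citation direction.

    author_paper[a, p] = 1 if author a is linked to paper p.
    citations[i, j]    = 1 if paper i cites paper j (assumed convention).
    """
    if direction == "forward":
        step = citations
    elif direction == "backward":
        step = citations.T
    elif direction == "undirected":
        step = np.maximum(citations, citations.T)
    else:
        raise ValueError(f"unknown direction: {direction}")
    scores = author_paper.astype(float)
    for _ in range(k):
        scores = scores @ step  # follow one more citation hop
    return scores


def popularity_normalize(scores, citations):
    """Divide each paper column by 1 + its in-degree (hypothetical choice)."""
    in_degree = citations.sum(axis=0)
    return scores / (1.0 + in_degree)


# Toy graph: 2 authors, 3 papers; paper 0 cites paper 1, paper 1 cites paper 2.
author_paper = np.array([[1, 0, 0],
                         [0, 1, 0]])
citations = np.array([[0, 1, 0],
                      [0, 0, 1],
                      [0, 0, 0]])

one_hop = propagate(author_paper, citations, k=1)
two_hop = propagate(author_paper, citations, k=2)
backward = propagate(author_paper, citations, k=1, direction="backward")
normalized = popularity_normalize(one_hop, citations)
```

Under the same assumptions, an A-A-P-P^k variant would left-multiply `author_paper` by a co-author matrix before propagating. The feature for an (author, paper) pair is then the matching entry of the resulting matrix.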

The final test decision uses rank cutoff rather than a raw probability threshold:

sort test pairs by final score
predict top 50% as positive
force train/test-overlap known positives to 1

This was more stable than transferring the validation-optimal probability threshold, for two reasons: the validation split is an artificial 1:1 positive/negative split, and LightGBM probabilities are not well calibrated across the shift between the validation and test distributions.
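
The three-step decision rule above can be sketched as follows. This is a minimal illustration, not the code in the packaged scripts: the function name, the `known_positive` mask, and the toy scores are hypothetical.

```python
import numpy as np


def rank_cutoff_predict(scores, ratio=0.5, known_positive=None):
    """Label the top `ratio` fraction of pairs as positive (1), the rest 0.

    scores:         final model score per test pair, higher = more likely.
    known_positive: optional boolean mask of pairs already seen as positive
                    in the training data; these are forced to 1.
    """
    scores = np.asarray(scores, dtype=float)
    n_positive = int(round(ratio * len(scores)))
    # Sort descending; a stable sort keeps input order among tied scores.
    order = np.argsort(-scores, kind="stable")
    labels = np.zeros(len(scores), dtype=int)
    labels[order[:n_positive]] = 1
    if known_positive is not None:
        labels[np.asarray(known_positive, dtype=bool)] = 1
    return labels


# Toy example: six test pairs, one of which is a known positive.
scores = [0.91, 0.15, 0.62, 0.40, 0.08, 0.77]
known = [False, True, False, False, False, False]
labels = rank_cutoff_predict(scores, ratio=0.5, known_positive=known)
plain = rank_cutoff_predict(scores, ratio=0.5)
```

In this sketch the known positives are forced after the cutoff is applied, so the final positive share can end up slightly above `ratio`. Because only the ordering of the scores matters, the rule is unaffected by any monotone miscalibration of the LightGBM probabilities.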

Environment

Original environment notes are in:

env/environment-cs3319.yml
env/requirements-minimal.txt

The project was run with Python 3.10 and these core packages:

numpy
pandas
scipy
scikit-learn
lightgbm
xgboost
torch
torch-geometric
gensim
node2vec
networkx
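
Before running any script, it can help to check which of the packages above are importable. The helper below is a small sketch that uses only the standard library; the function name is hypothetical. Note that `scikit-learn` and `torch-geometric` are imported as `sklearn` and `torch_geometric`.

```python
import importlib.util

# Distribution name -> import name. Only two of them differ.
CORE_PACKAGES = {
    "numpy": "numpy",
    "pandas": "pandas",
    "scipy": "scipy",
    "scikit-learn": "sklearn",
    "lightgbm": "lightgbm",
    "xgboost": "xgboost",
    "torch": "torch",
    "torch-geometric": "torch_geometric",
    "gensim": "gensim",
    "node2vec": "node2vec",
    "networkx": "networkx",
}


def missing_packages(packages):
    """Return the distribution names whose import module cannot be found."""
    return [
        dist for dist, module in packages.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages(CORE_PACKAGES)
    print("missing:", ", ".join(missing) if missing else "none")
```

This only confirms that each module can be located; it does not check versions, which are recorded in the files under `env/`.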

Quick Verification

After unzipping the package, the fastest way to verify the final result is:

cd cs3319_final_deliverable

# Check the final validation metric.
cat validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv

# Check generated final submissions and their positive ratios.
cat validation_runs/dynamic_seed202/high_order_graph_stack/submission_summary.csv

# Confirm the best public submission file exists.
ls -lh validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv

Expected key validation row:

rich_rw7_highorder_directed validation F1 = 0.966873736337297

The corresponding public-best file is:

validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv
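
To locate the expected validation row without reading the whole CSV by eye, a short standard-library script can scan for it. The sketch below assumes only that the method name appears as an exact cell value somewhere in the row; the column names in the toy data (`method`, `f1`) are hypothetical, since the layout of `validation_summary.csv` is not described here.

```python
import csv
import io
from pathlib import Path


def rows_mentioning(csv_text, name):
    """Return each CSV row (as a dict) in which some cell equals `name`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if name in row.values()]


def find_in_file(csv_path, name):
    """Same lookup, reading the CSV from disk."""
    return rows_mentioning(Path(csv_path).read_text(encoding="utf-8"), name)


# Toy stand-in for validation_summary.csv; the column names are hypothetical.
toy_csv = (
    "method,f1\n"
    "placeholder_row,0.0\n"
    "rich_rw7_highorder_directed,0.966873736337297\n"
)
matches = rows_mentioning(toy_csv, "rich_rw7_highorder_directed")
```

From the package root, `find_in_file("validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv", "rich_rw7_highorder_directed")` would return the matching rows if the file stores the method name in a cell of its own.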

Command Reproduction

The package includes cached feature matrices, random-walk model weights, OOF scores, test scores, and final submissions. The quickest way to inspect the final result is to read:

validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv
validation_runs/dynamic_seed202/high_order_graph_stack/submission_summary.csv
validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv

To regenerate the final high-order stack from the included cached features and random-walk weights:

cd cs3319_final_deliverable
python code/high_order_graph_stack.py \
  --package-root . \
  --split-seed 202 \
  --seed 202 \
  --n-splits 5 \
  --make-submission

This rewrites:

validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv
validation_runs/dynamic_seed202/high_order_graph_stack/rich_rw7_highorder_directed_test_pred.npy
validation_runs/dynamic_seed202/high_order_graph_stack/submissions/
validation_runs/dynamic_seed202/high_order_graph_stack/submission_summary.csv

The final decision rule is rank-based. The public-best file uses:

ratio = 0.500000

instead of directly applying the validation probability threshold.

To regenerate the earlier 6-model LightGCN ensemble submissions from included checkpoints:

cd cs3319_final_deliverable
python code/generate_ens6_submission.py \
  --package-root . \
  --device cuda:0

If CUDA is unavailable, use:

python code/generate_ens6_submission.py \
  --package-root . \
  --device cpu

The confirmed early public file is:

submissions/sub_ens6_t0.36.csv

To regenerate the 7-block random-walk stack that feeds the final high-order experiment:

cd cs3319_final_deliverable
python code/generate_randomwalk_ensemble_submission.py \
  --package-root . \
  --split-seed 202 \
  --main-val-score-file validation_runs/dynamic_seed202/dyn202_l2d512_bpr_bigbatch_more/scores/val_vanilla_ensemble_mean.npy \
  --versions \
    dw_base_d128_l40_w10_win10 \
    dw_long_d128_l80_w10_win10 \
    dw_highdim_d256_l40_w10_win10 \
    dw_d256_l80_w10_win10 \
    dw_seed3407_d128_l40_w10_win10 \
    dw_graph_ap_pp \
    n2v_p2_q1_d128_l40_w10_win10
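
Most of the version names passed to `--versions` end in a `_d…_l…_w…_win…` suffix. The sketch below parses that suffix into integers. The meaning attached to each field (embedding dimension, walk length, walks per node, window size) is a guess from the naming pattern and is not confirmed by the scripts; names without the suffix, such as `dw_graph_ap_pp`, return `None`.

```python
import re

# Suffix pattern shared by most version names, e.g. "_d128_l40_w10_win10".
_SUFFIX = re.compile(r"_d(\d+)_l(\d+)_w(\d+)_win(\d+)$")


def parse_version(name):
    """Parse the numeric suffix of a random-walk version name.

    Returns a dict of integers, or None when the name has no such suffix.
    The key names are assumed interpretations, not confirmed by the source.
    """
    match = _SUFFIX.search(name)
    if match is None:
        return None
    dim, walk_length, num_walks, window = map(int, match.groups())
    return {
        "dim": dim,
        "walk_length": walk_length,
        "num_walks": num_walks,
        "window": window,
    }


base = parse_version("dw_base_d128_l40_w10_win10")
n2v = parse_version("n2v_p2_q1_d128_l40_w10_win10")
graph = parse_version("dw_graph_ap_pp")
```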

The random-walk models required for the final stage are included under:

validation_runs/dynamic_seed202/randomwalk_systematic/models/

The cached high-order and rich content features are included under:

validation_runs/feature_cache/

Some scripts and older metadata files inherited from the original workspace still contain absolute paths. For the curated final artifacts, use the files already included in this deliverable, or rewrite those paths relative to the package root.

Reports

Read these in order:

reports/preliminary_report.md
reports/exploration_summary.md
reports/final_report.md
notes/experiment_history.md