	# CS3319 Project 2 Final Deliverable

	This package contains the cleaned final artifacts for the CS3319 recommendation-system project.
	It preserves the core code, data, model checkpoints, cached scores, random-walk model weights,
	important submissions, and reports for the main stages of the work.

	## Best Confirmed Result

	| Submission | Method | Public LB F1 |
	|---|---|---:|
	| `validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv` | rich content + 7 random-walk blocks + directed high-order citation propagation + LightGBM, rank top 50% | 0.96626 |

	The strongest validation run for this final method is:

	```text
	validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv
	rich_rw7_highorder_directed validation F1 = 0.966874
	```

	## What Is Included

	```text
	cs3319_final_deliverable/
	  code/                 Core experiment and generation scripts.
	  data_and_docs/        Official data files and course documents.
	  checkpoints/          LightGCN checkpoints, including final_ens6.
	  cached_scores/        Early cached BPR/LightGBM/ensemble score files.
	  validation_runs/
	    feature_cache/      Cached content and high-order graph features.
	    dynamic_seed202/    Curated OOF scores, test scores, model weights, summaries, submissions.
	  submissions/          Early confirmed LightGCN submissions.
	  reports/              Exploration summary, preliminary report, final report.
	  env/                  Environment exports / minimal requirements.
	  notes/                Experiment history.
	  manifests/            File manifests from the original transfer package.
	```

	## Main Stages Preserved

	| Stage | Key files | Result |
	|---|---|---:|
	| 6-model LightGCN ensemble | `submissions/sub_ens6_t0.36.csv` | public 0.93044 |
	| Post95 stacker | `validation_runs/dynamic_seed202/post95_submission/submission_post95_ens_r0.500.csv` | public about 0.95760 |
	| Content + BPR-MF stacker | `validation_runs/dynamic_seed202/extra_bprmf_submission/submission_post95_content_mf_lgb_score_ge0.500.csv` | public about 0.95996 |
	| DeepWalk + Node2Vec stacker | `validation_runs/dynamic_seed202/node2vec_deepwalk_submission/submission_content_mf_deepwalk_node2vec_lgb_th0.480000.csv` | public about 0.96252 |
	| High-order citation propagation | `validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv` | public 0.96626 |

	## Core Scripts

	The most important scripts are:

	| Script | Purpose |
	|---|---|
	| `code/train_val_lgcn_ensemble.py` | Dynamic validation LightGCN training and score generation. |
	| `code/generate_post95_submission.py` | Post95 LightGCN + graph/content feature submission generator. |
	| `code/extra_score_sources_ablation.py` | Content mean-cos, BPR-MF, and ranker score source ablations. |
	| `code/node2vec_deepwalk_ablation.py` | Initial DeepWalk / Node2Vec score-source ablation. |
	| `code/randomwalk_systematic_ablation.py` | Systematic random-walk feature experiments. |
	| `code/generate_randomwalk_ensemble_submission.py` | Submission generation from selected random-walk feature blocks. |
	| `code/content_rich_ablation.py` | Rich `feature.pkl` content feature construction. |
	| `code/high_order_graph_stack.py` | Final high-order citation propagation experiment and submission generation. |
	| `code/error_group_calibration.py` | Error analysis, threshold sweep, group calibration, boundary model. |

	## Final Method Summary

	The final method is a LightGBM second-stage model over:

	- LightGCN score / rank features.
	- Explicit graph/meta-path features.
	- Content mean-cos and top-k content similarity features.
	- BPR-MF score features.
	- Rich author-content profile features.
	- Seven systematic DeepWalk / Node2Vec random-walk feature blocks.
	- Aggregated random-walk agreement features.
	- High-order citation propagation features:
	  - `A-P-P^k`
	  - `A-A-P-P^k`
	  - forward citation, backward citation, and undirected citation variants.
	  - popularity-normalized propagation scores.
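
One way to read the `A-P-P^k` family is as a meta-path product: an author-paper matrix multiplied by the k-th power of a paper citation matrix. The sketch below illustrates that reading only. It is not taken from `code/high_order_graph_stack.py`, which may normalize and aggregate differently. The names `row_normalize`, `propagate`, and `popularity_normalize` are hypothetical, the `1 / log(2 + in-degree)` damping is an assumed form of popularity normalization, and dense NumPy arrays are used for brevity where a real graph would need `scipy.sparse`.

```python
import numpy as np

def row_normalize(m):
    """Divide each row by its sum; all-zero rows stay zero."""
    sums = m.sum(axis=1, keepdims=True)
    return np.divide(m, sums, out=np.zeros_like(m, dtype=float), where=sums > 0)

def propagate(author_paper, cite, k, direction="forward"):
    """Author-by-paper score matrix for the meta-path A-P-P^k.

    author_paper: (n_authors, n_papers) 0/1 authorship matrix.
    cite:         (n_papers, n_papers) 0/1 matrix, cite[i, j] = 1 if paper i cites paper j.
    direction:    'forward' follows citations, 'backward' follows cited-by edges,
                  'undirected' ignores edge direction.
    """
    if direction == "forward":
        step = cite
    elif direction == "backward":
        step = cite.T
    elif direction == "undirected":
        step = np.maximum(cite, cite.T)
    else:
        raise ValueError(f"unknown direction: {direction}")
    step = row_normalize(step.astype(float))
    scores = row_normalize(author_paper.astype(float))
    for _ in range(k):
        scores = scores @ step  # one more citation hop
    return scores

def popularity_normalize(scores, cite):
    """Damp heavily cited papers by 1 / log(2 + in-degree) (assumed form)."""
    in_degree = cite.sum(axis=0)
    return scores / np.log(2.0 + in_degree)

# Toy graph: 2 authors, 3 papers; paper 0 cites paper 1, paper 1 cites paper 2.
author_paper = np.array([[1, 0, 0], [0, 1, 0]])
cite = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
one_hop = propagate(author_paper, cite, k=1)
two_hop = propagate(author_paper, cite, k=2)
back_hop = propagate(author_paper, cite, k=1, direction="backward")
```

Under the same assumptions, an `A-A-P-P^k` variant would left-multiply `author_paper` by a co-author matrix before the citation hops.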

	The final test decision uses rank cutoff rather than a raw probability threshold:

	```text
	sort test pairs by final score
	predict top 50% as positive
	force train/test-overlap known positives to 1
	```

	This was more stable than transferring the validation-optimal probability threshold because the
	validation split is an artificial 1:1 positive/negative split and LightGBM probabilities are not
	well calibrated across the validation-test distribution shift.
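
The rank-cutoff rule above can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the code in `code/high_order_graph_stack.py`: the function name `rank_cutoff_predict` and the `known_positive` mask are hypothetical, and ties are broken by original order here, which the real script may handle differently.

```python
import numpy as np

def rank_cutoff_predict(scores, ratio=0.5, known_positive=None):
    """Label the top `ratio` fraction of pairs (by score) as positive.

    scores:         1-D array of final model scores, one per test pair.
    ratio:          fraction of pairs to predict as positive.
    known_positive: optional boolean mask of pairs that also occur as
                    positives in the training data; these are forced to 1.
    """
    scores = np.asarray(scores, dtype=float)
    n_positive = int(round(ratio * len(scores)))
    # Stable sort on negated scores: descending order, ties keep input order.
    order = np.argsort(-scores, kind="stable")
    labels = np.zeros(len(scores), dtype=int)
    labels[order[:n_positive]] = 1
    if known_positive is not None:
        labels[np.asarray(known_positive, dtype=bool)] = 1
    return labels

scores = [0.91, 0.15, 0.62, 0.40]
known = [False, True, False, False]
labels = rank_cutoff_predict(scores, ratio=0.5, known_positive=known)
print(labels.tolist())  # [1, 1, 1, 0]
```

Because known positives are forced to 1 after the cutoff, the final positive ratio can end up slightly above `ratio`, as in the toy output.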

	## Environment

	Original environment notes are in:

	```text
	env/environment-cs3319.yml
	env/requirements-minimal.txt
	```

	The project was run with Python 3.10 and these core packages:

	```text
	numpy
	pandas
	scipy
	scikit-learn
	lightgbm
	xgboost
	torch
	torch-geometric
	gensim
	node2vec
	networkx
	```

	## Quick Verification

	After unzipping the package, the fastest way to verify the final result is:

	```bash
	cd cs3319_final_deliverable

	# Check the final validation metric.
	cat validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv

	# Check generated final submissions and their positive ratios.
	cat validation_runs/dynamic_seed202/high_order_graph_stack/submission_summary.csv

	# Confirm the best public submission file exists.
	ls -lh validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv
	```

	Expected key validation row:

	```text
	rich_rw7_highorder_directed validation F1 = 0.966873736337297
	```

	The corresponding public-best file is:

	```text
	validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv
	```
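
To double-check a submission's positive ratio independently of `submission_summary.csv`, a small standard-library script is enough. The sketch below assumes the submission CSV has a header row and stores the 0/1 label in its last column. Neither assumption is confirmed by this README, so adjust `label_column` to the real layout. The function name `positive_ratio` and the demo column names are hypothetical.

```python
import csv
import io

def positive_ratio(csv_file, label_column=-1):
    """Return (n_rows, fraction of rows whose label equals 1)."""
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row (assumed to be present)
    labels = [int(float(row[label_column])) for row in reader if row]
    return len(labels), sum(labels) / len(labels)

# Demo on an in-memory file; the column names here are made up.
demo = io.StringIO("Index,Predicted\n0,1\n1,0\n2,1\n3,0\n")
n_rows, ratio = positive_ratio(demo)
print(n_rows, ratio)  # 4 0.5
```

For the real file, pass `open(path, newline="")` instead of the in-memory demo. A ratio slightly above 0.5 would be consistent with known positives being forced to 1 after the rank cutoff.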

	## Command Reproduction

	The package includes cached feature matrices, random-walk model weights, OOF scores, test scores,
	and final submissions. The quickest way to inspect the final result is to read:

	```text
	validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv
	validation_runs/dynamic_seed202/high_order_graph_stack/submission_summary.csv
	validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv
	```

	To regenerate the final high-order stack from the included cached features and random-walk weights:

	```bash
	cd cs3319_final_deliverable
	python code/high_order_graph_stack.py \
	  --package-root . \
	  --split-seed 202 \
	  --seed 202 \
	  --n-splits 5 \
	  --make-submission
	```

	This rewrites:

	```text
	validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv
	validation_runs/dynamic_seed202/high_order_graph_stack/rich_rw7_highorder_directed_test_pred.npy
	validation_runs/dynamic_seed202/high_order_graph_stack/submissions/
	validation_runs/dynamic_seed202/high_order_graph_stack/submission_summary.csv
	```

	The final decision rule is rank-based. The public-best file uses:

	```text
	ratio = 0.500000
	```

	instead of directly applying the validation probability threshold.

	To regenerate the earlier 6-model LightGCN ensemble submissions from included checkpoints:

	```bash
	cd cs3319_final_deliverable
	python code/generate_ens6_submission.py \
	  --package-root . \
	  --device cuda:0
	```

	If CUDA is unavailable, use:

	```bash
	python code/generate_ens6_submission.py \
	  --package-root . \
	  --device cpu
	```

	The confirmed early public file is:

	```text
	submissions/sub_ens6_t0.36.csv
	```

	To regenerate the 7-block random-walk stack that feeds the final high-order experiment:

	```bash
	cd cs3319_final_deliverable
	python code/generate_randomwalk_ensemble_submission.py \
	  --package-root . \
	  --split-seed 202 \
	  --main-val-score-file validation_runs/dynamic_seed202/dyn202_l2d512_bpr_bigbatch_more/scores/val_vanilla_ensemble_mean.npy \
	  --versions \
	    dw_base_d128_l40_w10_win10 \
	    dw_long_d128_l80_w10_win10 \
	    dw_highdim_d256_l40_w10_win10 \
	    dw_d256_l80_w10_win10 \
	    dw_seed3407_d128_l40_w10_win10 \
	    dw_graph_ap_pp \
	    n2v_p2_q1_d128_l40_w10_win10
	```

	The random-walk models required for the final stage are included under:

	```text
	validation_runs/dynamic_seed202/randomwalk_systematic/models/
	```

	The cached high-order and rich content features are included under:

	```text
	validation_runs/feature_cache/
	```

	Some scripts inherited from the original workspace contain absolute paths in older metadata files.
	For the curated final artifacts, use the files already included in this deliverable or adapt paths
	relative to the package root.

	## Reports

	Read these in order:

	```text
	reports/preliminary_report.md
	reports/exploration_summary.md
	reports/final_report.md
	notes/experiment_history.md
	```