# CS3319 Project 2 Final Deliverable
This package contains the cleaned final artifacts for the CS3319 recommendation-system project.
It preserves the core code, data, model checkpoints, cached scores, random-walk model weights,
important submissions, and reports for the main stages of the work.
## Best Confirmed Result
| Submission | Method | Public LB F1 |
|---|---|---:|
| `validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv` | rich content + 7 random-walk blocks + directed high-order citation propagation + LightGBM, rank top 50% | **0.96626** |
The strongest validation run for this final method is:
```text
validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv
rich_rw7_highorder_directed validation F1 = 0.966874
```
## What Is Included
```text
cs3319_final_deliverable/
  code/                 Core experiment and generation scripts.
  data_and_docs/        Official data files and course documents.
  checkpoints/          LightGCN checkpoints, including final_ens6.
  cached_scores/        Early cached BPR/LightGBM/ensemble score files.
  validation_runs/
    feature_cache/      Cached content and high-order graph features.
    dynamic_seed202/    Curated OOF scores, test scores, model weights, summaries, submissions.
  submissions/          Early confirmed LightGCN submissions.
  reports/              Exploration summary, preliminary report, final report.
  env/                  Environment exports / minimal requirements.
  notes/                Experiment history.
  manifests/            File manifests from the original transfer package.
```
## Main Stages Preserved
| Stage | Key files | Result |
|---|---|---:|
| 6-model LightGCN ensemble | `submissions/sub_ens6_t0.36.csv` | public 0.93044 |
| Post95 stacker | `validation_runs/dynamic_seed202/post95_submission/submission_post95_ens_r0.500.csv` | public about 0.95760 |
| Content + BPR-MF stacker | `validation_runs/dynamic_seed202/extra_bprmf_submission/submission_post95_content_mf_lgb_score_ge0.500.csv` | public about 0.95996 |
| DeepWalk + Node2Vec stacker | `validation_runs/dynamic_seed202/node2vec_deepwalk_submission/submission_content_mf_deepwalk_node2vec_lgb_th0.480000.csv` | public about 0.96252 |
| High-order citation propagation | `validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv` | public 0.96626 |
## Core Scripts
The most important scripts are:
| Script | Purpose |
|---|---|
| `code/train_val_lgcn_ensemble.py` | Dynamic validation LightGCN training and score generation. |
| `code/generate_post95_submission.py` | Post95 LightGCN + graph/content feature submission generator. |
| `code/extra_score_sources_ablation.py` | Content mean-cos, BPR-MF, and ranker score source ablations. |
| `code/node2vec_deepwalk_ablation.py` | Initial DeepWalk / Node2Vec score-source ablation. |
| `code/randomwalk_systematic_ablation.py` | Systematic random-walk feature experiments. |
| `code/generate_randomwalk_ensemble_submission.py` | Submission generation from selected random-walk feature blocks. |
| `code/content_rich_ablation.py` | Rich `feature.pkl` content feature construction. |
| `code/high_order_graph_stack.py` | Final high-order citation propagation experiment and submission generation. |
| `code/error_group_calibration.py` | Error analysis, threshold sweep, group calibration, boundary model. |
## Final Method Summary
The final method is a LightGBM second-stage model over:
- LightGCN score / rank features.
- Explicit graph/meta-path features.
- Content mean-cos and top-k content similarity features.
- BPR-MF score features.
- Rich author-content profile features.
- Seven systematic DeepWalk / Node2Vec random-walk feature blocks.
- Aggregated random-walk agreement features.
- High-order citation propagation features:
  - `A-P-P^k`
  - `A-A-P-P^k`
  - forward citation, backward citation, and undirected citation variants.
  - popularity-normalized propagation scores.
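As a rough illustration of the `A-P-P^k` idea, the sketch below propagates author-paper links through `k` citation hops on a toy graph. It is a minimal dense NumPy example under stated assumptions, not the code in `code/high_order_graph_stack.py`: the function names, the row normalization, and the convention that `paper_paper[i, j] = 1` means paper `i` cites paper `j` are all hypothetical. A graph of realistic size would likely need sparse matrices, and the `A-A-P-P^k` and popularity-normalized variants are not shown.

```python
import numpy as np


def row_normalize(m):
    """Scale each row to sum to 1; all-zero rows stay zero."""
    m = np.asarray(m, dtype=np.float64)
    row_sums = m.sum(axis=1, keepdims=True)
    return np.divide(m, row_sums, out=np.zeros_like(m), where=row_sums > 0)


def propagate(author_paper, paper_paper, k, direction="forward"):
    """Return an author-by-paper score matrix for A-P-P^k.

    author_paper: (n_authors, n_papers) 0/1 matrix of known author-paper links.
    paper_paper:  (n_papers, n_papers) 0/1 matrix, [i, j] = 1 if i cites j
                  (an assumed convention for this sketch).
    """
    paper_paper = np.asarray(paper_paper, dtype=np.float64)
    if direction == "forward":
        cite = paper_paper
    elif direction == "backward":
        cite = paper_paper.T
    elif direction == "undirected":
        cite = paper_paper + paper_paper.T
    else:
        raise ValueError(f"unknown direction: {direction!r}")

    step = row_normalize(cite)
    scores = row_normalize(author_paper)
    for _ in range(k):
        # One more citation hop away from the author's own papers.
        scores = scores @ step
    return scores
```

Each entry of the returned matrix can then be looked up for a candidate (author, paper) pair and passed to the second-stage model as one feature per `(k, direction)` combination.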
The final test decision uses rank cutoff rather than a raw probability threshold:
```text
sort test pairs by final score
predict top 50% as positive
force train/test-overlap known positives to 1
```
This was more stable than transferring the validation-optimal probability threshold: the
validation split is an artificial 1:1 positive/negative split, and LightGBM probabilities are not
well calibrated under the validation-to-test distribution shift.
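The rank-cutoff rule can be sketched in a few lines of NumPy. This is an illustrative version only: the function name `rank_cutoff_predict` and its arguments are hypothetical, and how the real scripts break ties or identify train/test-overlap positives is not specified here.

```python
import numpy as np


def rank_cutoff_predict(scores, ratio=0.5, known_positive=None):
    """Label the top `ratio` fraction of pairs as positive.

    scores:         1-D array of final model scores, one per test pair.
    ratio:          fraction of pairs to predict as positive.
    known_positive: optional boolean mask of pairs already known to be
                    positive (for example, pairs that also occur in train).
    """
    scores = np.asarray(scores, dtype=np.float64)
    n_pos = int(round(ratio * scores.size))

    # Stable sort on negated scores: highest score first, ties keep input order.
    order = np.argsort(-scores, kind="stable")
    pred = np.zeros(scores.size, dtype=np.int64)
    pred[order[:n_pos]] = 1

    if known_positive is not None:
        # Known positives are forced to 1 regardless of their rank.
        pred[np.asarray(known_positive, dtype=bool)] = 1
    return pred
```

Because known positives are forced to 1 after the cutoff, the final positive ratio can end up slightly above `ratio`.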
## Environment
Original environment notes are in:
```text
env/environment-cs3319.yml
env/requirements-minimal.txt
```
The project was run with Python 3.10 and these core packages:
```text
numpy
pandas
scipy
scikit-learn
lightgbm
xgboost
torch
torch-geometric
gensim
node2vec
networkx
```
## Quick Verification
After unzipping the package, the fastest way to verify the final result is:
```bash
cd cs3319_final_deliverable
# Check the final validation metric.
cat validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv
# Check generated final submissions and their positive ratios.
cat validation_runs/dynamic_seed202/high_order_graph_stack/submission_summary.csv
# Confirm the best public submission file exists.
ls -lh validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv
```
Expected key validation row:
```text
rich_rw7_highorder_directed validation F1 = 0.966873736337297
```
The corresponding public-best file is:
```text
validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv
```
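To double-check a submission's positive ratio independently of `submission_summary.csv`, a small standard-library sketch such as the following may help. It assumes the submission CSV has a header row and a 0/1 label in its last column (or in the column named by `label_col`); neither assumption is confirmed by this README, so adjust it to the actual file layout.

```python
import csv


def positive_ratio(csv_path, label_col=None):
    """Return the fraction of rows whose label equals 1.

    Assumes a header row; uses the last column unless `label_col` is given.
    """
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        idx = header.index(label_col) if label_col is not None else len(header) - 1
        # Skip blank lines; accept labels written as "1" or "1.0".
        labels = [int(float(row[idx])) for row in reader if row]
    return sum(labels) / len(labels)
```

For the public-best file, the result should be close to 0.5, possibly slightly higher because known positives are forced to 1.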
## Command Reproduction
The package includes cached feature matrices, random-walk model weights, OOF scores, test scores,
and final submissions. The quickest way to inspect the final result is to read:
```text
validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv
validation_runs/dynamic_seed202/high_order_graph_stack/submission_summary.csv
validation_runs/dynamic_seed202/high_order_graph_stack/submissions/submission_rich_rw7_highorder_directed_r0.500000.csv
```
To regenerate the final high-order stack from the included cached features and random-walk weights:
```bash
cd cs3319_final_deliverable
python code/high_order_graph_stack.py \
  --package-root . \
  --split-seed 202 \
  --seed 202 \
  --n-splits 5 \
  --make-submission
```
This rewrites:
```text
validation_runs/dynamic_seed202/high_order_graph_stack/validation_summary.csv
validation_runs/dynamic_seed202/high_order_graph_stack/rich_rw7_highorder_directed_test_pred.npy
validation_runs/dynamic_seed202/high_order_graph_stack/submissions/
validation_runs/dynamic_seed202/high_order_graph_stack/submission_summary.csv
```
The final decision rule is rank-based. The public-best file uses:
```text
ratio = 0.500000
```
instead of directly applying the validation probability threshold.
To regenerate the earlier 6-model LightGCN ensemble submissions from included checkpoints:
```bash
cd cs3319_final_deliverable
python code/generate_ens6_submission.py \
  --package-root . \
  --device cuda:0
```
If CUDA is unavailable, use:
```bash
python code/generate_ens6_submission.py \
  --package-root . \
  --device cpu
```
The confirmed early public file is:
```text
submissions/sub_ens6_t0.36.csv
```
To regenerate the 7-block random-walk stack that feeds the final high-order experiment:
```bash
cd cs3319_final_deliverable
python code/generate_randomwalk_ensemble_submission.py \
  --package-root . \
  --split-seed 202 \
  --main-val-score-file validation_runs/dynamic_seed202/dyn202_l2d512_bpr_bigbatch_more/scores/val_vanilla_ensemble_mean.npy \
  --versions \
    dw_base_d128_l40_w10_win10 \
    dw_long_d128_l80_w10_win10 \
    dw_highdim_d256_l40_w10_win10 \
    dw_d256_l80_w10_win10 \
    dw_seed3407_d128_l40_w10_win10 \
    dw_graph_ap_pp \
    n2v_p2_q1_d128_l40_w10_win10
```
The random-walk models required for the final stage are included under:
```text
validation_runs/dynamic_seed202/randomwalk_systematic/models/
```
The cached high-order and rich content features are included under:
```text
validation_runs/feature_cache/
```
Some scripts and older metadata files inherited from the original workspace still contain absolute
paths. For the curated final artifacts, use the files already included in this deliverable, or
rewrite those paths relative to the package root.
## Reports
Read these in order:
```text
reports/preliminary_report.md
reports/exploration_summary.md
reports/final_report.md
notes/experiment_history.md
```