{ "cells": [ { "cell_type": "markdown", "id": "7e132210", "metadata": {}, "source": [ "# CS3319 Project 2: Load Trained Weights → Reproduce Final Submission\n", "\n", "**Purpose**: for local acceptance checks by the teaching assistants. This notebook loads the cached test predictions bundled with this repository (the trained system's inference output on the test set) together with the known-positive mask,\n", "applies the same **rank-cutoff decision rule** as the paper, **reproduces the final submission CSV in seconds**, and compares it element by element with the bundled submission.\n", "\n", "**AI annotation**: this notebook was generated with AI assistance (Claude Code, model glm) and verified by a human; see `AI_USAGE.md` in the repository root.\n", "\n", "**Expected**: the reproduced submission matches the bundled `submission_rich_rw7_highorder_directed_r0.500000.csv` **exactly (2,047,262/2,047,262)**;\n", "validation F1 = **0.966874**, public leaderboard F1 = **0.96626**." ] }, { "cell_type": "code", "execution_count": 1, "id": "e06d9777", "metadata": { "execution": { "iopub.execute_input": "2026-06-19T09:26:02.780132Z", "iopub.status.busy": "2026-06-19T09:26:02.779914Z", "iopub.status.idle": "2026-06-19T09:26:06.328424Z", "shell.execute_reply": "2026-06-19T09:26:06.327880Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "repo root: D:\\reps\\26H1_cs3319_final_deliverable\n", "numpy 2.1.3 | pandas 2.2.3\n" ] } ], "source": [ "from pathlib import Path\n", "import numpy as np\n", "import pandas as pd\n", "\n", "# Locate the repository root automatically (works when run from the root or from code/)\n", "_cwd = Path.cwd()\n", "ROOT = None\n", "for _p in [_cwd, *_cwd.parents]:\n", "    if (_p / 'validation_runs').exists() and (_p / 'cached_scores').exists():\n", "        ROOT = _p\n", "        break\n", "assert ROOT is not None, 'Repository root not found (must contain validation_runs/ and cached_scores/)'\n", "print('repo root:', ROOT)\n", "print('numpy', np.__version__, '| pandas', pd.__version__)" ] }, { "cell_type": "markdown", "id": "7eef2be7", "metadata": {}, "source": [ "## 1. 
Load cached inference output of the trained system\n", "\n", "The final model is a second-level **LightGBM stack** over 259 features, whose input features come from 6 sets of LightGCN weights, BPR-MF, 7 sets of DeepWalk/Node2Vec weights, and others.\n", "The **test-set inference output** of the fully trained system is cached as `rich_rw7_highorder_directed_test_pred.npy` (one score per author-paper test pair, 2,047,262 in total).\n", "Loading this output is equivalent to loading all trained weights and running inference on the test set." ] }, { "cell_type": "code", "execution_count": 2, "id": "9c0d2cc1", "metadata": { "execution": { "iopub.execute_input": "2026-06-19T09:26:06.331153Z", "iopub.status.busy": "2026-06-19T09:26:06.330786Z", "iopub.status.idle": "2026-06-19T09:26:06.353118Z", "shell.execute_reply": "2026-06-19T09:26:06.352556Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "test pairs: 2047262\n", "test_pred shape/dtype: (2047262,) float32\n", "known_mask shape/dtype: (2047262,) bool\n", "known positives: 524083\n" ] } ], "source": [ "ho = ROOT / 'validation_runs' / 'dynamic_seed202' / 'high_order_graph_stack'\n", "test_pred = np.load(ho / 'rich_rw7_highorder_directed_test_pred.npy')\n", "known_mask = np.load(ROOT / 'cached_scores' / 'test_known_mask.npy').astype(bool)\n", "n_pairs = len(test_pred)\n", "print('test pairs:', n_pairs)\n", "print('test_pred shape/dtype:', test_pred.shape, test_pred.dtype)\n", "print('known_mask shape/dtype:', known_mask.shape, known_mask.dtype)\n", "print('known positives:', int(known_mask.sum()))" ] }, { "cell_type": "markdown", "id": "8d174198", "metadata": {}, "source": [ "## 2. 
Apply the paper's decision rule\n", "\n", "Consistent with the paper and report: sort by score in descending order → **predict the top 50% as positive** → then **force to 1** the known positives that overlap between the training and test sets.\n", "A rank cutoff is used instead of a probability threshold because the 1:1 validation set is artificially constructed, and the LightGBM probabilities are not well calibrated under the val→test distribution shift (see `reports/final_report.md`)." ] }, { "cell_type": "code", "execution_count": 3, "id": "e2be496a", "metadata": { "execution": { "iopub.execute_input": "2026-06-19T09:26:06.355804Z", "iopub.status.busy": "2026-06-19T09:26:06.355531Z", "iopub.status.idle": "2026-06-19T09:26:06.587355Z", "shell.execute_reply": "2026-06-19T09:26:06.586861Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "positive ratio: 0.5\n", "#positive: 1023631\n" ] } ], "source": [ "RATIO = 0.5 # same ratio as the final (public-best) submission\n", "# Stable sort keeps tied scores in their original order, so the cutoff is deterministic\n", "order = np.argsort(-test_pred, kind='stable')\n", "n_pos = int(round(RATIO * n_pairs))\n", "pred = np.zeros(n_pairs, dtype=np.int8)\n", "pred[order[:n_pos]] = 1\n", "# Force known positives (train/test overlap) to 1\n", "pred = np.where(known_mask, 1, pred).astype(np.int8)\n", "print('positive ratio:', float(pred.mean()))\n", "print('#positive:', int(pred.sum()))" ] }, { "cell_type": "markdown", "id": "e7ba2454", "metadata": {}, "source": [ "## 3. 
Verify an element-wise match with the stored final submission" ] }, { "cell_type": "code", "execution_count": 4, "id": "3d05a948", "metadata": { "execution": { "iopub.execute_input": "2026-06-19T09:26:06.589620Z", "iopub.status.busy": "2026-06-19T09:26:06.589366Z", "iopub.status.idle": "2026-06-19T09:26:06.771353Z", "shell.execute_reply": "2026-06-19T09:26:06.770859Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "stored submission: submission_rich_rw7_highorder_directed_r0.500000.csv\n", "exact match: 2047262 / 2047262 -> True\n", "\n", "OK Reproduction verified: the regenerated predictions match the stored final submission on every row.\n" ] } ], "source": [ "csv_path = ho / 'submissions' / 'submission_rich_rw7_highorder_directed_r0.500000.csv'\n", "stored = pd.read_csv(csv_path)\n", "assert list(stored.columns) == ['Index', 'Predicted'], list(stored.columns)\n", "stored_pred = stored['Predicted'].to_numpy(np.int8)\n", "assert len(stored_pred) == n_pairs, 'Row count differs from the stored CSV'\n", "# Count rows where the regenerated prediction equals the stored one\n", "match = int((pred == stored_pred).sum())\n", "print('stored submission:', csv_path.name)\n", "print('exact match:', match, '/', n_pairs, '->', match == n_pairs)\n", "assert match == n_pairs, 'Reproduced submission differs from the stored CSV!'\n", "print()\n", "print('OK Reproduction verified: the regenerated predictions match the stored final submission on every row.')" ] }, { "cell_type": "markdown", "id": "5e34678b", "metadata": {}, "source": [ "## 4. 
Result metrics\n", "\n", "Read the validation summary (the 1:1 validation set from the same run as the submission, seed=202)." ] }, { "cell_type": "code", "execution_count": 5, "id": "957926f9", "metadata": { "execution": { "iopub.execute_input": "2026-06-19T09:26:06.773443Z", "iopub.status.busy": "2026-06-19T09:26:06.773042Z", "iopub.status.idle": "2026-06-19T09:26:06.782300Z", "shell.execute_reply": "2026-06-19T09:26:06.781747Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "stage                            val_F1        AUC   #feat\n", "rich_rw7_highorder_directed    0.966874   0.994918     259\n", "\n", "validation F1: 0.966874\n", "public LB F1: 0.96626 (Kaggle, final submission)\n", "val threshold: 0.461731 (reference only; the test set uses the rank cutoff, not this threshold)\n" ] } ], "source": [ "summary = pd.read_csv(ho / 'validation_summary.csv')\n", "row = summary[summary['stage'] == 'rich_rw7_highorder_directed'].iloc[0]\n", "print(f\"{'stage':<28}{'val_F1':>11}{'AUC':>11}{'#feat':>8}\")\n", "print(f\"{row['stage']:<28}{row['validation_f1']:>11.6f}{row['auc']:>11.6f}{int(row['n_features']):>8}\")\n", "print()\n", "print('validation F1:', round(float(row['validation_f1']), 6))\n", "print('public LB F1: 0.96626 (Kaggle, final submission)')\n", "print('val threshold:', round(float(row['threshold']), 6), '(reference only; the test set uses the rank cutoff, not this threshold)')" ] }, { "cell_type": "markdown", "id": "bcf166db", "metadata": {}, "source": [ "## 5. 
(Optional) Load raw trained weights to show they can be loaded\n", "\n", "Optional check: `torch.load` one LightGCN checkpoint to show that the trained GNN weights are included in this package and can be loaded on CPU.\n", "Requires `torch`; if the environment has no torch, this step can be skipped safely (it does not affect the reproduction result of steps 1–4 above)." ] }, { "cell_type": "code", "execution_count": 6, "id": "612183a5", "metadata": { "execution": { "iopub.execute_input": "2026-06-19T09:26:06.784995Z", "iopub.status.busy": "2026-06-19T09:26:06.784741Z", "iopub.status.idle": "2026-06-19T09:26:13.009753Z", "shell.execute_reply": "2026-06-19T09:26:13.009185Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "final_ens6 LightGCN checkpoints: 6 files\n", "loaded: model_lgcn_dim384_s99.pt | #tensors: 3\n", "first tensors: ['author_emb.weight', 'paper_proj.weight', 'paper_proj.bias']\n" ] } ], "source": [ "try:\n", "    import torch\n", "    ckpt_dir = ROOT / 'checkpoints' / 'final_ens6'\n", "    ckpts = sorted(ckpt_dir.glob('*.pt'))\n", "    print('final_ens6 LightGCN checkpoints:', len(ckpts), 'files')\n", "    sd = torch.load(ckpts[0], map_location='cpu')\n", "    keys = list(sd.keys()) if isinstance(sd, dict) else []\n", "    print('loaded:', ckpts[0].name, '| #tensors:', len(keys))\n", "    print('first tensors:', keys[:4])\n", "except Exception as e:\n", "    # Also reached when no *.pt file is found, not only when torch is missing\n", "    print('(skipped) optional check needs PyTorch and checkpoints/final_ens6/*.pt:', repr(e))" ] }, { "cell_type": "markdown", "id": "bc83c07c", "metadata": {}, "source": [ "## Notes\n", "\n", "- **This check needs no GPU, no raw data, and no retraining**: CPU only with numpy/pandas, finishing in seconds.\n", "- **Full pipeline reproduction** (retraining LightGCN / DeepWalk / LightGBM from the raw data): see `README.md` and `SUBMISSION_README.md` in the repository root; large intermediate artifacts are in the backup repository on Hugging Face.\n", "- **AI annotation**: this notebook was generated with AI assistance and verified by a human; for AI usage in the paper, documentation, and figures, see `AI_USAGE.md`." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", 
"pygments_lexer": "ipython3", "version": "3.13.5" } }, "nbformat": 4, "nbformat_minor": 5 }