{
 "cells": [
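  {
   "cell_type": "markdown",
   "id": "7e132210",
   "metadata": {},
   "source": [
    "# CS3319 Project 2: Load Trained Weights → Reproduce Final Submission\n",
    "\n",
    "**Purpose**: for local acceptance checking by the TAs. This notebook loads the cached test predictions of the trained system bundled in this repository, together with the known-positive mask,\n",
    "applies the same **rank-cutoff decision rule** as the paper, **reproduces the final submission CSV in seconds**, and compares it element by element with the bundled submission.\n",
    "\n",
    "**AI annotation**: this notebook was generated with AI assistance (Claude Code, model glm) and verified by a human; see `AI_USAGE.md` in the repository root.\n",
    "\n",
    "**Expected**: the reproduced submission is **identical** to the bundled `submission_rich_rw7_highorder_directed_r0.500000.csv` **(2,047,262/2,047,262 predictions match)**;\n",
    "validation F1 = **0.966874**, public leaderboard F1 = **0.96626**."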
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "e06d9777",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-19T09:26:02.780132Z",
     "iopub.status.busy": "2026-06-19T09:26:02.779914Z",
     "iopub.status.idle": "2026-06-19T09:26:06.328424Z",
     "shell.execute_reply": "2026-06-19T09:26:06.327880Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "repo root : D:\\reps\\26H1_cs3319_final_deliverable\n",
      "numpy 2.1.3 | pandas 2.2.3\n"
     ]
    }
   ],
   "source": [
    "from pathlib import Path\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "# Locate the repository root automatically (works when run from the repo root or from code/)\n",
    "_cwd = Path.cwd()\n",
    "ROOT = None\n",
    "for _p in [_cwd, *_cwd.parents]:\n",
    "    if (_p / 'validation_runs').exists() and (_p / 'cached_scores').exists():\n",
    "        ROOT = _p\n",
    "        break\n",
    "assert ROOT is not None, 'Repository root not found (must contain validation_runs/ and cached_scores/)'\n",
    "print('repo root :', ROOT)\n",
    "print('numpy', np.__version__, '| pandas', pd.__version__)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7eef2be7",
   "metadata": {},
   "source": [
    "## 1. 加载训练系统的缓存推理输出 (Load cached inference output of the trained system)\n",
    "\n",
    "最终模型 = 259 维特征上的 **LightGBM 二级堆叠**，其输入特征来自 6 个 LightGCN 权重、BPR-MF、7 个 DeepWalk/Node2Vec 权重等。\n",
    "完整训练得到的**测试集推理输出**已缓存为 `rich_rw7_highorder_directed_test_pred.npy`（每个测试对作者-论文一个分数，共 2,047,262 个）。\n",
    "加载该输出，即等价于「加载全部已训练权重并在测试集上完成推理」。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "9c0d2cc1",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-19T09:26:06.331153Z",
     "iopub.status.busy": "2026-06-19T09:26:06.330786Z",
     "iopub.status.idle": "2026-06-19T09:26:06.353118Z",
     "shell.execute_reply": "2026-06-19T09:26:06.352556Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "test pairs              : 2047262\n",
      "test_pred  shape/dtype  : (2047262,) float32\n",
      "known_mask shape/dtype  : (2047262,) bool\n",
      "known positives         : 524083\n"
     ]
    }
   ],
   "source": [
    "ho = ROOT / 'validation_runs' / 'dynamic_seed202' / 'high_order_graph_stack'\n",
    "test_pred = np.load(ho / 'rich_rw7_highorder_directed_test_pred.npy')\n",
    "known_mask = np.load(ROOT / 'cached_scores' / 'test_known_mask.npy').astype(bool)\n",
    "n_pairs = len(test_pred)\n",
    "print('test pairs              :', n_pairs)\n",
    "print('test_pred  shape/dtype  :', test_pred.shape, test_pred.dtype)\n",
    "print('known_mask shape/dtype  :', known_mask.shape, known_mask.dtype)\n",
    "print('known positives         :', int(known_mask.sum()))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8d174198",
   "metadata": {},
   "source": [
    "## 2. 应用决策规则 (Apply the paper's decision rule)\n",
    "\n",
    "与论文/报告一致：按分数降序排序 → **取前 50% 预测为正** → 再把「训练/测试交叠的已知正样本」**强制置 1**。\n",
    "使用 rank cutoff 而非概率阈值的原因：1:1 验证集为人工构造，LightGBM 概率在 val→test 分布偏移下未良好校准（详见报告 `reports/final_report.md`）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "e2be496a",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-19T09:26:06.355804Z",
     "iopub.status.busy": "2026-06-19T09:26:06.355531Z",
     "iopub.status.idle": "2026-06-19T09:26:06.587355Z",
     "shell.execute_reply": "2026-06-19T09:26:06.586861Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "positive ratio: 0.5\n",
      "#positive      : 1023631\n"
     ]
    }
   ],
   "source": [
    "RATIO = 0.5  # same as the final (public-best) submission\n",
    "order = np.argsort(-test_pred, kind='stable')\n",
    "n_pos = int(round(RATIO * n_pairs))\n",
    "pred = np.zeros(n_pairs, dtype=np.int8)\n",
    "pred[order[:n_pos]] = 1\n",
    "pred = np.where(known_mask, 1, pred).astype(np.int8)\n",
    "print('positive ratio:', float(pred.mean()))\n",
    "print('#positive      :', int(pred.sum()))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e7ba2454",
   "metadata": {},
   "source": [
    "## 3. 与内置最终提交逐位比对 (Verify byte-identical to the stored final submission)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "3d05a948",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-19T09:26:06.589620Z",
     "iopub.status.busy": "2026-06-19T09:26:06.589366Z",
     "iopub.status.idle": "2026-06-19T09:26:06.771353Z",
     "shell.execute_reply": "2026-06-19T09:26:06.770859Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "stored submission : submission_rich_rw7_highorder_directed_r0.500000.csv\n",
      "exact match       : 2047262 / 2047262 -> True\n",
      "\n",
      "OK Reproduction verified: regenerated predictions match the stored final submission on every row.\n"
     ]
    }
   ],
   "source": [
    "csv_path = ho / 'submissions' / 'submission_rich_rw7_highorder_directed_r0.500000.csv'\n",
    "stored = pd.read_csv(csv_path)\n",
    "assert list(stored.columns) == ['Index', 'Predicted'], list(stored.columns)\n",
    "stored_pred = stored['Predicted'].to_numpy(np.int8)\n",
    "# Guard against a length mismatch before the element-wise comparison\n",
    "assert len(stored_pred) == n_pairs, (len(stored_pred), n_pairs)\n",
    "match = int((pred == stored_pred).sum())\n",
    "print('stored submission :', csv_path.name)\n",
    "print('exact match       :', match, '/', n_pairs, '->', match == n_pairs)\n",
    "assert match == n_pairs, 'Reproduced submission does not match the stored CSV!'\n",
    "print()\n",
    "print('OK Reproduction verified: regenerated predictions match the stored final submission on every row.')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e34678b",
   "metadata": {},
   "source": [
    "## 4. 结果指标 (Result metrics)\n",
    "\n",
    "读取验证汇总（与提交同源 1:1 验证集，seed=202）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "957926f9",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-19T09:26:06.773443Z",
     "iopub.status.busy": "2026-06-19T09:26:06.773042Z",
     "iopub.status.idle": "2026-06-19T09:26:06.782300Z",
     "shell.execute_reply": "2026-06-19T09:26:06.781747Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "stage                            val_F1        AUC   #feat\n",
      "rich_rw7_highorder_directed    0.966874   0.994918     259\n",
      "\n",
      "validation F1 : 0.966874\n",
      "public LB F1  : 0.96626   (Kaggle, final submission)\n",
      "val threshold : 0.461731 (reference only; the test set uses a rank cutoff, not this threshold)\n"
     ]
    }
   ],
   "source": [
    "summary = pd.read_csv(ho / 'validation_summary.csv')\n",
    "row = summary[summary['stage'] == 'rich_rw7_highorder_directed'].iloc[0]\n",
    "print(f\"{'stage':<28}{'val_F1':>11}{'AUC':>11}{'#feat':>8}\")\n",
    "print(f\"{row['stage']:<28}{row['validation_f1']:>11.6f}{row['auc']:>11.6f}{int(row['n_features']):>8}\")\n",
    "print()\n",
    "print('validation F1 :', round(float(row['validation_f1']), 6))\n",
    "print('public LB F1  : 0.96626   (Kaggle, final submission)')\n",
    "print('val threshold :', round(float(row['threshold']), 6), '(reference only; the test set uses a rank cutoff, not this threshold)')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bcf166db",
   "metadata": {},
   "source": [
    "## 5. （可选）直接加载原始模型权重，证明权重可载入 (Optional: load raw trained weights)\n",
    "\n",
    "可选验证：`torch.load` 一个 LightGCN checkpoint，证明训练好的 GNN 权重确实存在于本包且可在 CPU 载入。\n",
    "需要 `torch`；若环境无 torch 可安全跳过（不影响上面第 1–4 步的复现结论）。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "612183a5",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2026-06-19T09:26:06.784995Z",
     "iopub.status.busy": "2026-06-19T09:26:06.784741Z",
     "iopub.status.idle": "2026-06-19T09:26:13.009753Z",
     "shell.execute_reply": "2026-06-19T09:26:13.009185Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "final_ens6 LightGCN checkpoints: 6 files\n",
      "loaded: model_lgcn_dim384_s99.pt | #tensors: 3\n",
      "first tensors: ['author_emb.weight', 'paper_proj.weight', 'paper_proj.bias']\n"
     ]
    }
   ],
   "source": [
    "try:\n",
    "    import torch\n",
    "    ckpt_dir = ROOT / 'checkpoints' / 'final_ens6'\n",
    "    ckpts = sorted(ckpt_dir.glob('*.pt'))\n",
    "    print('final_ens6 LightGCN checkpoints:', len(ckpts), 'files')\n",
    "    sd = torch.load(ckpts[0], map_location='cpu')\n",
    "    keys = list(sd.keys()) if isinstance(sd, dict) else []\n",
    "    print('loaded:', ckpts[0].name, '| #tensors:', len(keys))\n",
    "    print('first tensors:', keys[:4])\n",
    "except Exception as e:\n",
    "    print('(skipped) could not load torch weights (requires PyTorch and the checkpoint files):', repr(e))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bc83c07c",
   "metadata": {},
   "source": [
    "## 说明 (Notes)\n",
    "\n",
    "- **本测试无需 GPU、无需原始数据、无需重训**：纯 CPU + numpy/pandas，数秒完成。\n",
    "- **完整管线复现**（从原始数据重训 LightGCN / DeepWalk / LightGBM）见仓库根 `README.md` 与 `SUBMISSION_README.md`；大体积中间产物在 Hugging Face 备份仓库。\n",
    "- **AI 标注**：本 notebook 由 AI 辅助生成并经人工核验；论文 / 文档 / 图表的 AI 使用情况见 `AI_USAGE.md`。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}