{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# ESM-2 embedding extraction on Colab T4\n",
    "\n",
    "Resumes the run from the checkpoint produced on the M2. Same model + `sample_n` so the existing rows in `data/embeddings.jsonl` are byte-compatible. Only `--batch-size` changes (8 → 64) to use the T4 properly.\n",
    "\n",
    "**Before you start:**\n",
    "1. Runtime → Change runtime type → **T4 GPU**\n",
    "2. Have on your laptop:\n",
    "   - `data/embeddings.jsonl` (the resume checkpoint)\n",
    "   - `data/bacdive_phenotypes.parquet` (the strain list)\n",
    "3. A GitHub Personal Access Token with `repo` scope (the repo is private)\n",
    "4. Your `NCBI_API_KEY`\n",
    "\n",
    "Estimated wall-clock: **1–3 hr** for the remaining ~15K genomes on T4.\n",
    "\n",
    "**Recommended: enable Step 5 (Drive durability).** Without it, a Colab disconnect wipes everything since the last manual download. With it, the JSONL lives in your Drive and survives session loss — just rerun the cells next time and it picks up where it left off."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Verify GPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!nvidia-smi"
   ]
  },
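  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Optional: a minimal sketch of the same check in Python, returning the GPU names as a list so the result is easy to read at a glance. `gpu_names` is a hypothetical helper, not part of the repo; it assumes `nvidia-smi` is on the PATH whenever a GPU runtime is attached."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import shutil, subprocess\n",
    "\n",
    "# Hypothetical helper: GPU names reported by nvidia-smi, or [] if none is visible.\n",
    "def gpu_names():\n",
    "    if shutil.which('nvidia-smi') is None:\n",
    "        return []\n",
    "    result = subprocess.run(\n",
    "        ['nvidia-smi', '--query-gpu=name', '--format=csv,noheader'],\n",
    "        capture_output=True, text=True,\n",
    "    )\n",
    "    if result.returncode != 0:\n",
    "        return []\n",
    "    return [line.strip() for line in result.stdout.splitlines() if line.strip()]\n",
    "\n",
    "names = gpu_names()\n",
    "print(names if names else 'No GPU visible: check Runtime → Change runtime type')"
   ]
  },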
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Clone the repo (private, needs PAT)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from getpass import getpass\n",
    "import os, subprocess\n",
    "\n",
    "pat = getpass('GitHub PAT (with repo scope): ')\n",
    "url = f'https://{pat}@github.com/miyu-horiuchi/microbe-model.git'\n",
    "subprocess.run(['git', 'clone', url], check=True)\n",
    "os.chdir('microbe-model')\n",
    "del pat, url  # don't keep the token around\n",
    "!git log --oneline -3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Install (~3 min)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q -e \".[embeddings]\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Upload the checkpoint files\n",
    "\n",
    "Pick **both** `data/embeddings.jsonl` (~13 MB) and `data/bacdive_phenotypes.parquet` (~1.6 MB) when the file picker opens.\n",
    "\n",
    "If you're rerunning after a disconnect and you've already done Step 5 once, you can skip this cell — Drive already has your latest jsonl. But you still need the parquet, so the easiest path is to upload both every time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from google.colab import files\n",
    "import os, shutil\n",
    "\n",
    "os.makedirs('data', exist_ok=True)\n",
    "uploaded = files.upload()\n",
    "for fname in uploaded:\n",
    "    shutil.move(fname, os.path.join('data', fname))\n",
    "!ls -la data/"
   ]
  },
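  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Optional sanity check, not part of the original workflow: a minimal sketch that confirms every line of the uploaded checkpoint parses as JSON, since an interrupted write can leave a truncated last line. `check_jsonl` is a hypothetical helper introduced here; it assumes only that the file is JSON Lines and knows nothing about the row schema."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json, os\n",
    "\n",
    "# Hypothetical helper: returns (parseable_rows, unparseable_line_numbers); (0, []) if the file is missing.\n",
    "def check_jsonl(path):\n",
    "    good, bad = 0, []\n",
    "    if not os.path.exists(path):\n",
    "        return good, bad\n",
    "    with open(path) as f:\n",
    "        for lineno, line in enumerate(f, start=1):\n",
    "            if not line.strip():\n",
    "                continue  # ignore blank lines\n",
    "            try:\n",
    "                json.loads(line)\n",
    "                good += 1\n",
    "            except json.JSONDecodeError:\n",
    "                bad.append(lineno)\n",
    "    return good, bad\n",
    "\n",
    "good, bad = check_jsonl('data/embeddings.jsonl')\n",
    "print(f'{good:,} parseable rows, {len(bad)} unparseable line(s): {bad[:5]}')"
   ]
  },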
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Mount Google Drive for durability *(strongly recommended)*\n",
    "\n",
    "Mounts your Drive at `/content/drive`, then **symlinks** `data/embeddings.jsonl` to a file inside Drive. The extraction script writes to that path unchanged — but the data physically lives in your Drive.\n",
    "\n",
    "What this buys you:\n",
    "- Colab disconnects → your laptop sleeps → your browser closes: doesn't matter. The jsonl keeps whatever was already flushed.\n",
    "- Next session: just rerun all the cells. Step 5 detects the existing Drive file and reuses it; the extraction script skips genomes already done.\n",
    "\n",
    "First run will pop a Google auth flow — sign in with the Google account that owns the Drive."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from google.colab import drive\n",
    "import os, shutil\n",
    "\n",
    "drive.mount('/content/drive')\n",
    "\n",
    "DRIVE_DIR = '/content/drive/MyDrive/microbe-model-embeddings'\n",
    "os.makedirs(DRIVE_DIR, exist_ok=True)\n",
    "\n",
    "drive_jsonl = f'{DRIVE_DIR}/embeddings.jsonl'\n",
    "local_jsonl = 'data/embeddings.jsonl'\n",
    "\n",
    "# Seed Drive from the just-uploaded local file IF Drive doesn't already have a (longer) checkpoint.\n",
    "def _rows(path):\n",
    "    return sum(1 for _ in open(path)) if os.path.exists(path) else 0\n",
    "\n",
    "drive_rows = _rows(drive_jsonl)\n",
    "local_rows = _rows(local_jsonl)\n",
    "print(f'rows in Drive: {drive_rows:,}  |  rows in local upload: {local_rows:,}')\n",
    "\n",
    "if local_rows > drive_rows:\n",
    "    shutil.copy(local_jsonl, drive_jsonl)\n",
    "    print(f'Local upload was ahead — copied {local_rows:,} rows to Drive.')\n",
    "elif drive_rows > 0:\n",
    "    print(f'Drive checkpoint is current ({drive_rows:,} rows). Reusing it.')\n",
    "else:\n",
    "    open(drive_jsonl, 'a').close()\n",
    "    print('No checkpoint anywhere — starting fresh on Drive.')\n",
    "\n",
    "# Replace the local file with a symlink pointing into Drive.\n",
    "if os.path.lexists(local_jsonl):\n",
    "    os.remove(local_jsonl)\n",
    "os.symlink(drive_jsonl, local_jsonl)\n",
    "\n",
    "print(f'\\ndata/embeddings.jsonl -> {drive_jsonl}')\n",
    "!ls -la data/embeddings.jsonl"
   ]
  },
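  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Optional: a small sketch that confirms the symlink resolves to a real file before the long run starts. If Drive failed to mount, the link would dangle and the reported size would be 0. `describe_link` is a hypothetical helper, not part of the repo."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# Hypothetical helper: returns (is_symlink, resolved_target, size_in_bytes) for a path.\n",
    "def describe_link(path):\n",
    "    is_link = os.path.islink(path)\n",
    "    target = os.path.realpath(path)  # follows the symlink chain\n",
    "    size = os.path.getsize(target) if os.path.exists(target) else 0\n",
    "    return is_link, target, size\n",
    "\n",
    "is_link, target, size = describe_link('data/embeddings.jsonl')\n",
    "print(f'symlink: {is_link}  |  target: {target}  |  size: {size:,} bytes')"
   ]
  },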
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Set NCBI API key"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "from getpass import getpass\n",
    "os.environ['NCBI_API_KEY'] = getpass('NCBI_API_KEY: ')\n",
    "# write to .env so the script's config loader picks it up too\n",
    "with open('.env', 'w') as f:\n",
    "    f.write(f\"NCBI_API_KEY={os.environ['NCBI_API_KEY']}\\n\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Keep the session alive\n",
    "\n",
    "Colab disconnects after ~90 min of UI inactivity. Run this in your browser console (F12 → Console) before walking away — it clicks the connect button every minute:\n",
    "\n",
    "```js\n",
    "setInterval(() => document.querySelector('colab-toolbar-button#connect')?.click(), 60000);\n",
    "```\n",
    "\n",
    "If you've enabled Step 5, this is less critical — even a disconnect doesn't cost you progress. But it still helps avoid having to manually rerun the cells."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Run extraction\n",
    "\n",
    "Same model + `sample_n` as the M2 run. `batch_size=64` to use T4 properly (M2 was at 8). Resumable — reads `data/embeddings.jsonl`, skips genomes already done."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python scripts/11_extract_embeddings.py \\\n",
    "    --model facebook/esm2_t6_8M_UR50D \\\n",
    "    --sample-n 20 \\\n",
    "    --batch-size 64"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 9. Download results\n",
    "\n",
    "If Step 5 is enabled, the JSONL is already in your Drive at `MyDrive/microbe-model-embeddings/embeddings.jsonl` — you can pull it from drive.google.com directly. The parquet is still local; the cell below also copies it into Drive for safekeeping, then offers both as browser downloads."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os, shutil\n",
    "from google.colab import files\n",
    "\n",
    "DRIVE_DIR = '/content/drive/MyDrive/microbe-model-embeddings'\n",
    "if os.path.isdir('/content/drive/MyDrive') and os.path.exists('data/embeddings.parquet'):\n",
    "    os.makedirs(DRIVE_DIR, exist_ok=True)\n",
    "    shutil.copy('data/embeddings.parquet', f'{DRIVE_DIR}/embeddings.parquet')\n",
    "    print(f'Mirrored parquet to {DRIVE_DIR}/embeddings.parquet')\n",
    "\n",
    "files.download('data/embeddings.jsonl')\n",
    "files.download('data/embeddings.parquet')"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "gpuType": "T4",
   "provenance": []
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  },
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}