{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# ESM-2 embedding extraction on Colab T4\n", "\n", "Resumes the run from the checkpoint produced on the M2. The model and `sample_n` are the same, so the existing rows in `data/embeddings.jsonl` stay byte-compatible with the new ones. Only `--batch-size` changes (8 → 64), to make full use of the T4.\n", "\n", "**Before you start:**\n", "1. Runtime → Change runtime type → **T4 GPU**\n", "2. Have on your laptop:\n", "   - `data/embeddings.jsonl` (the resume checkpoint)\n", "   - `data/bacdive_phenotypes.parquet` (the strain list)\n", "3. A GitHub Personal Access Token with `repo` scope (the repo is private)\n", "4. Your `NCBI_API_KEY`\n", "\n", "Estimated wall-clock: **1–3 hr** for the remaining ~15K genomes on T4.\n", "\n", "**Recommended: enable Step 5 (Drive durability).** Without it, a Colab disconnect wipes everything since the last manual download. With it, the JSONL lives in your Drive and survives session loss; next time, just rerun the cells and the run picks up where it left off." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Verify GPU" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!nvidia-smi" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Clone the repo (private, needs PAT)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from getpass import getpass\n", "import os, subprocess\n", "\n", "REPO = 'github.com/miyu-horiuchi/microbe-model.git'\n", "pat = getpass('GitHub PAT (with repo scope): ')\n", "subprocess.run(['git', 'clone', f'https://{pat}@{REPO}'], check=True)\n", "os.chdir('microbe-model')\n", "# git saves the clone URL, token included, in .git/config; reset it to the token-free URL\n", "subprocess.run(['git', 'remote', 'set-url', 'origin', f'https://{REPO}'], check=True)\n", "del pat  # don't keep the token in the notebook's memory either\n", "!git log --oneline -3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. 
Install (~3 min)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install -q -e \".[embeddings]\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Upload the checkpoint files\n", "\n", "Pick **both** `data/embeddings.jsonl` (~13 MB) and `data/bacdive_phenotypes.parquet` (~1.6 MB) when the file picker opens.\n", "\n", "If you're rerunning after a disconnect and Step 5 has already run once, Drive holds your latest jsonl, so re-uploading it is optional. You still need the parquet, which is never copied to Drive, so the easiest path is to upload both every time. Step 5 keeps whichever jsonl has more rows, so an older upload won't overwrite newer progress on Drive." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from google.colab import files\n", "import os, shutil\n", "\n", "os.makedirs('data', exist_ok=True)\n", "uploaded = files.upload()\n", "for fname in uploaded:\n", "    shutil.move(fname, os.path.join('data', fname))\n", "!ls -la data/" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Mount Google Drive for durability *(strongly recommended)*\n", "\n", "Mounts your Drive at `/content/drive`, then **symlinks** `data/embeddings.jsonl` to a file inside Drive. The extraction script writes to that path unchanged, but the data physically lives in your Drive.\n", "\n", "What this buys you:\n", "- If Colab disconnects, your laptop sleeps, or your browser closes, it doesn't matter: the jsonl keeps whatever was already flushed.\n", "- Next session: just rerun all the cells. Step 5 detects the existing Drive file and reuses it; the extraction script skips genomes already done.\n", "\n", "The first run will pop a Google auth flow; sign in with the Google account that owns the Drive." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from google.colab import drive\n", "import os, shutil\n", "\n", "drive.mount('/content/drive')\n", "\n", "DRIVE_DIR = '/content/drive/MyDrive/microbe-model-embeddings'\n", "os.makedirs(DRIVE_DIR, exist_ok=True)\n", "\n", "drive_jsonl = f'{DRIVE_DIR}/embeddings.jsonl'\n", "local_jsonl = 'data/embeddings.jsonl'\n", "\n", "# Seed Drive from the just-uploaded local file IF Drive doesn't already have a (longer) checkpoint.\n", "def _rows(path):\n", "    # Count lines (one JSON record per line); a missing file counts as 0 rows.\n", "    if not os.path.exists(path):\n", "        return 0\n", "    with open(path) as f:\n", "        return sum(1 for _ in f)\n", "\n", "drive_rows = _rows(drive_jsonl)\n", "local_rows = _rows(local_jsonl)\n", "print(f'rows in Drive: {drive_rows:,} | rows in local upload: {local_rows:,}')\n", "\n", "if local_rows > drive_rows:\n", "    shutil.copy(local_jsonl, drive_jsonl)\n", "    print(f'Local upload was ahead; copied {local_rows:,} rows to Drive.')\n", "elif drive_rows > 0:\n", "    print(f'Drive checkpoint is current ({drive_rows:,} rows). Reusing it.')\n", "else:\n", "    open(drive_jsonl, 'a').close()\n", "    print('No checkpoint anywhere; starting fresh on Drive.')\n", "\n", "# Replace the local file with a symlink pointing into Drive.\n", "os.makedirs('data', exist_ok=True)  # in case Step 4 was skipped\n", "if os.path.lexists(local_jsonl):\n", "    os.remove(local_jsonl)\n", "os.symlink(drive_jsonl, local_jsonl)\n", "\n", "print(f'\\ndata/embeddings.jsonl -> {drive_jsonl}')\n", "!ls -la data/embeddings.jsonl" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Set NCBI API key" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os\n", "from getpass import getpass\n", "os.environ['NCBI_API_KEY'] = getpass('NCBI_API_KEY: ')\n", "# write to .env so the script's config loader picks it up too\n", "with open('.env', 'w') as f:\n", "    f.write(f\"NCBI_API_KEY={os.environ['NCBI_API_KEY']}\\n\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. 
Keep the session alive\n", "\n", "Colab disconnects after ~90 min of UI inactivity. Run this in your browser console (F12 → Console) before walking away — it clicks the connect button every minute:\n", "\n", "```js\n", "setInterval(() => document.querySelector('colab-toolbar-button#connect')?.click(), 60000);\n", "```\n", "\n", "If you've enabled Step 5, this is less critical — even a disconnect doesn't cost you progress. But it still helps avoid having to manually rerun the cells." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. Run extraction\n", "\n", "Same model + `sample_n` as the M2 run. `batch_size=64` to use T4 properly (M2 was at 8). Resumable — reads `data/embeddings.jsonl`, skips genomes already done." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!python scripts/11_extract_embeddings.py \\\n", " --model facebook/esm2_t6_8M_UR50D \\\n", " --sample-n 20 \\\n", " --batch-size 64" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 9. Download results\n", "\n", "If Step 5 is enabled, the JSONL is already in your Drive at `MyDrive/microbe-model-embeddings/embeddings.jsonl` — you can pull it from drive.google.com directly. The parquet is still local; the cell below also copies it into Drive for safekeeping, then offers both as browser downloads." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import os, shutil\n", "from google.colab import files\n", "\n", "DRIVE_DIR = '/content/drive/MyDrive/microbe-model-embeddings'\n", "if os.path.isdir('/content/drive/MyDrive') and os.path.exists('data/embeddings.parquet'):\n", "    os.makedirs(DRIVE_DIR, exist_ok=True)\n", "    shutil.copy('data/embeddings.parquet', f'{DRIVE_DIR}/embeddings.parquet')\n", "    print(f'Mirrored parquet to {DRIVE_DIR}/embeddings.parquet')\n", "\n", "# Offer each output as a browser download; skip any file the run did not produce.\n", "for path in ('data/embeddings.jsonl', 'data/embeddings.parquet'):\n", "    if os.path.exists(path):\n", "        files.download(path)\n", "    else:\n", "        print(f'{path} not found; skipping download.')" ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "T4", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }