{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# SF-Cluster \u2014 frustration-guided MSA subset builder\n",
    "\n",
    "**What this notebook does.** Installs the open-source `sf_cluster` package, downloads a small KaiB demo bundle (a 364-sequence MSA + a per-residue Frustration Index matrix from FrustrAI-Seq), and builds two flavours of stratified MSA subsets (`mosaic` and `gradient`) using the contrast-HV/LV score. Everything runs on CPU in roughly two minutes.\n",
    "\n",
    "**Who it is for.** Biologists who want reproducible, frustration-stratified MSA slices to feed into an AF-Cluster-style multi-conformer prediction loop.\n",
    "\n",
    "**What you do next.** Take the 12 mosaic or 12 gradient A3M subsets written in \u00a75, run each through ColabFold AF2, and aggregate the predictions per the SF-Cluster \u00a79.1 hit criterion (summarised in \u00a77).\n",
    "\n",
    "---\n",
    "\n",
    "> ## LIMITATIONS \u2014 please read\n",
    "> A controlled comparison on the Main-21 cases shows that **uniform random subsampling performs equivalently on most cases**. The frustration signal is **not** the active ingredient here \u2014 depth reduction is. See the OSS README for the full ablation.\n",
    ">\n",
    "> Use this tool when you want **stratified, reproducible MSA subsets** with a clear provenance story \u2014 not as a guaranteed conformational diversity engine. It is a research baseline, not a turnkey accuracy improvement."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Install the package\n",
    "\n",
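    "Pulls the OSS release from Hugging Face. The package is pure Python and depends only on `numpy` and `scipy`. The notebook itself also imports `huggingface_hub` (\u00a72) and `matplotlib` (\u00a74); install them separately if your environment lacks them."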
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "!pip install -q git+https://huggingface.co/ChatterjeeLab/SF-Cluster"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Download the KaiB demo bundle\n",
    "\n",
    "Three files, ~200 KB total: a filtered MSA, a per-residue FI matrix from FrustrAI-Seq, and the parallel sequence-ID list."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "from huggingface_hub import hf_hub_download\n",
    "from pathlib import Path\n",
    "import os\n",
    "\n",
    "REPO = 'ChatterjeeLab/SF-Cluster'\n",
    "FILES = ['examples/data/KaiB_filtered.a3m',\n",
    "         'examples/data/KaiB_fi_matrix.npy',\n",
    "         'examples/data/KaiB_seq_ids.txt']\n",
    "\n",
    "local = {}\n",
    "for fname in FILES:\n",
    "    p = hf_hub_download(repo_id=REPO, filename=fname, repo_type='model')\n",
    "    local[fname] = p\n",
    "    print(f'{fname:50s}  {os.path.getsize(p)/1024:7.1f} KB  ->  {p}')\n",
    "\n",
    "A3M = local['examples/data/KaiB_filtered.a3m']\n",
    "FI  = local['examples/data/KaiB_fi_matrix.npy']\n",
    "IDS = local['examples/data/KaiB_seq_ids.txt']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Build the pool and stratified subsets\n",
    "\n",
    "The `pool_msa` call ties the MSA records to their per-residue FI vectors. `contrast_hvlv` computes the per-sequence high-variance / low-variance FI contrast (see README for the formula). `method_mosaic` and `method_gradient` then deterministically draw 12 subsets of 32 sequences each."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sf_cluster import pool_msa, contrast_hvlv, method_mosaic, method_gradient\n",
    "\n",
    "pool = pool_msa(A3M, FI)\n",
    "print(f'pool: N_seq={pool.n_seq}, L={pool.n_cols}, query={pool.headers[0]!r}')\n",
    "\n",
    "score = contrast_hvlv(pool.fi_matrix)\n",
    "print(f'contrast_hvlv: shape={score.shape}, '\n",
    "      f'min={score.min():+.3f}, median={np.median(score):+.3f}, max={score.max():+.3f}')\n",
    "\n",
    "mosaic_subsets   = method_mosaic(score)\n",
    "gradient_subsets = method_gradient(score)\n",
    "\n",
    "def summarize(name, subsets):\n",
    "    print(f'\\n[{name}]  {len(subsets)} subsets')\n",
    "    print(f'{\"subset_id\":>10}  {\"n_seqs\":>7}  {\"mean_contrast\":>14}')\n",
    "    for i, sub in enumerate(subsets):\n",
    "        m = float(np.mean(score[sub]))\n",
    "        print(f'{i:>10d}  {len(sub):>7d}  {m:>+14.4f}')\n",
    "\n",
    "summarize('mosaic',   mosaic_subsets)\n",
    "summarize('gradient', gradient_subsets)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Visualise\n",
    "\n",
    "Three plots: the contrast score distribution with tercile / quartile boundaries marked, the per-subset mean contrast score for both methods, and the pairwise sequence-overlap heatmap between mosaic and gradient subsets."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "\n",
    "fig, axes = plt.subplots(1, 3, figsize=(15, 4))\n",
    "\n",
    "# (a) score histogram with tercile + quartile lines\n",
    "ax = axes[0]\n",
    "ax.hist(score, bins=40, color='#4477AA', edgecolor='white', alpha=0.85)\n",
    "sorted_s = np.sort(score)\n",
    "N = len(sorted_s)\n",
    "terciles = [sorted_s[N//3], sorted_s[2*N//3]]\n",
    "quartiles = [sorted_s[N//4], sorted_s[N//2], sorted_s[3*N//4]]\n",
    "# label only the first line of each family so the legend gets one entry per family\n",
    "for k, t in enumerate(terciles):\n",
    "    ax.axvline(t, color='#CC6677', linestyle='--', label='tercile (mosaic)' if k == 0 else None)\n",
    "for k, q in enumerate(quartiles):\n",
    "    ax.axvline(q, color='#117733', linestyle=':',  label='quartile (gradient)' if k == 0 else None)\n",
    "ax.set_xlabel('contrast_hvlv')\n",
    "ax.set_ylabel('count')\n",
    "ax.set_title('(a) per-sequence contrast score')\n",
    "ax.legend(fontsize=8)\n",
    "\n",
    "# (b) per-subset mean contrast\n",
    "ax = axes[1]\n",
    "m_means = np.array([score[s].mean() for s in mosaic_subsets])\n",
    "g_means = np.array([score[s].mean() for s in gradient_subsets])\n",
    "w = 0.4\n",
    "# size the x positions from the data rather than assuming 12 subsets\n",
    "ax.bar(np.arange(len(m_means)) - w/2, m_means, width=w, label='mosaic',   color='#4477AA')\n",
    "ax.bar(np.arange(len(g_means)) + w/2, g_means, width=w, label='gradient', color='#CC6677')\n",
    "ax.axhline(0, color='black', lw=0.5)\n",
    "ax.set_xlabel('subset id')\n",
    "ax.set_ylabel('mean contrast_hvlv')\n",
    "ax.set_title('(b) per-subset mean score')\n",
    "ax.legend(fontsize=8)\n",
    "\n",
    "# (c) pairwise overlap heatmap (mosaic x gradient)\n",
    "ax = axes[2]\n",
    "M = np.zeros((len(mosaic_subsets), len(gradient_subsets)), dtype=int)\n",
    "for i, si in enumerate(mosaic_subsets):\n",
    "    set_i = set(si)\n",
    "    for j, sj in enumerate(gradient_subsets):\n",
    "        M[i, j] = len(set_i & set(sj))\n",
    "im = ax.imshow(M, cmap='magma', aspect='auto')\n",
    "ax.set_xlabel('gradient subset')\n",
    "ax.set_ylabel('mosaic subset')\n",
    "ax.set_title('(c) sequence overlap (count)')\n",
    "plt.colorbar(im, ax=ax, fraction=0.046, pad=0.04)\n",
    "\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Write subsets to A3M files\n",
    "\n",
    "Each subset is written as a ColabFold-compatible A3M with the query as the first record. Downstream you would feed one A3M per AF2 run."
   ]
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "from sf_cluster import build_subsets\n",
    "\n",
    "out_mosaic   = Path('./subsets_mosaic')\n",
    "out_gradient = Path('./subsets_gradient')\n",
    "\n",
    "_, _, _, mosaic_paths   = build_subsets(A3M, FI, method='mosaic',   out_dir=out_mosaic)\n",
    "_, _, _, gradient_paths = build_subsets(A3M, FI, method='gradient', out_dir=out_gradient)\n",
    "\n",
    "print(f'mosaic   -> {len(mosaic_paths):2d} files in {out_mosaic}/')\n",
    "print(f'gradient -> {len(gradient_paths):2d} files in {out_gradient}/')\n",
    "\n",
    "sample = Path(mosaic_paths[0])  # works whether the entries are str or Path\n",
    "print(f'\\nFirst 3 records of {sample.name}:')\n",
    "with open(sample) as f:\n",
    "    lines = f.read().splitlines()\n",
    "shown = 0\n",
    "i = 0\n",
    "while i < len(lines) and shown < 3:\n",
    "    if lines[i].startswith('>'):\n",
    "        print(' ', lines[i])\n",
    "        if i+1 < len(lines):\n",
    "            seq = lines[i+1]\n",
    "            print(' ', seq[:80] + ('...' if len(seq) > 80 else ''))\n",
    "        shown += 1\n",
    "        i += 2\n",
    "    else:\n",
    "        i += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Bring your own protein\n",
    "\n",
    "The demo bundle is tiny and CPU-friendly. For your own target:\n",
    "\n",
    "1. **Build an MSA.** Use the official [ColabFold notebook](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb) (`mmseqs2_uniref_env` mode) to generate a deep `.a3m`, then filter it (e.g. 25%-gap filter) to obtain `your_msa.a3m`.\n",
    "2. **Compute the FI matrix.** Run [FrustrAI-Seq](https://huggingface.co/leuschj/FrustrAI-Seq) on `your_msa.a3m` to obtain a per-residue Frustration Index matrix `your_fi.npy` of shape `(N_seq, L)`. **A GPU is required for this step.** See the FrustrAI-Seq model card for inference details.\n",
    "3. **Re-run the cells above.** Skip the download in \u00a72, set `A3M` and `FI` to your own file paths, and re-execute from \u00a73 onward. The package will raise a `ValueError` if `N_seq` disagrees between the two."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Next: run AF2 on each subset\n",
    "\n",
    "Feed each subset A3M into the official [ColabFold AlphaFold2 notebook](https://colab.research.google.com/github/sokrypton/ColabFold/blob/main/AlphaFold2.ipynb) \u2014 one subset per AF2 run. Aggregate per the SF-Cluster \u00a79.1 hit criterion:\n",
    "\n",
    "- C\u03b1 RMSD \u2264 3.0 \u00c5 on the `common_core` residues vs. each reference state,\n",
    "- mean pLDDT \u2265 70 overall,\n",
    "- mean pLDDT \u2265 70 inside the `switch_region`.\n",
    "\n",
    "**Compute budget disclosure (per `docs/protocol_lock.md`).** The SF-Cluster paper locks AF2 at 3 recycles \u00d7 4 seeds \u00d7 5 models for KaiB / Mpt53, and 0 recycles \u00d7 8 seeds \u00d7 5 models for the GA/GB cases. The GA/GB cases were further trimmed to **4 subsets per case** during refinement to stay within the compute envelope. Global seed: `20260422`. Per-case seed = `hash(case_name) mod 2^31`; per-subset seed = `base_seed + subset_index`. All inference uses `templates=OFF`, `relax=OFF`, `dropout=OFF`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 8. Citation, license, companion repo\n",
    "\n",
    "```bibtex\n",
    "@misc{sf_cluster_2026,\n",
    "  title  = {SF-Cluster: frustration-guided MSA subset builders for AF2 multi-conformer prediction},\n",
    "  author = {Cao, Hanqun and {Chatterjee Lab}},\n",
    "  year   = {2026},\n",
    "  note   = {Workshop release. Companion code: https://huggingface.co/ChatterjeeLab/SF-Cluster},\n",
    "  url    = {https://huggingface.co/ChatterjeeLab/SF-Cluster}\n",
    "}\n",
    "```\n",
    "\n",
    "**License:** MIT. See `LICENSE` in the OSS repo.\n",
    "\n",
    "**Companion private dev repo.** Full Phase II benchmark code (DBSCAN baselines, all four arms, evaluation harness, region partition ablation) lives in the SF-Cluster private dev repository. The OSS release here is a slim, dependency-light subset \u2014 only the `mosaic` and `gradient` arms and their scoring function \u2014 intended for reuse, not full reproduction of the benchmark."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}