[Open in Colab](https://colab.research.google.com/github/ChatterjeeLab/SF-Cluster/blob/main/examples/SF_Cluster_Demo.ipynb)
> **Or run from Hugging Face:** open <https://colab.research.google.com/> → *File* → *Open notebook* → *URL* tab → paste
> `https://huggingface.co/ChatterjeeLab/SF-Cluster/resolve/main/examples/SF_Cluster_Demo.ipynb`
## Demo
A self-contained, CPU-only Colab notebook is provided at
[`examples/SF_Cluster_Demo.ipynb`](examples/SF_Cluster_Demo.ipynb). It installs the
package, downloads a small KaiB demo bundle (filtered MSA + FrustrAI-Seq FI matrix,
~200 KB), builds 12 mosaic and 12 gradient subsets, visualises the contrast-score
distribution and per-subset means, and writes A3M files ready for AF2. Expected
end-to-end runtime on a free Colab CPU instance: **~2 minutes**.
# SF-Cluster (workshop OSS release)
Frustration-guided MSA subset builders for AlphaFold2 multi-conformer
prediction. This is the open-source workshop distribution of two subset
methods from the SF-Cluster benchmark:
- **mosaic**: each subset mixes high / mid / low contrast-FI sequences.
- **gradient**: each subset is homogeneous within a contrast-FI quartile.
The contrast score is computed from a per-residue Frustration Index (FI)
matrix produced by [FrustrAI-Seq](https://github.com/leuschj/FrustrAI-Seq)
(HF model: `leuschj/FrustrAI-Seq`).
This package is dependency-light (`numpy`, `scipy`), provides a CLI, and is
designed to be a drop-in replacement for random / uniform MSA subsampling in
[AF-Cluster](https://github.com/HWaymentSteele/AF_Cluster)-style pipelines.
## Algorithm
Given a filtered MSA `A` of `N` sequences over `L` match-state columns, and a
per-residue FI matrix `F ∈ ℝ^{N×L}`:
1. **Column variance**: `v_l = Var_i(F_{i,l})` over sequences.
2. **High-variance mask**: `HV = {l : v_l ≥ percentile(v, 80)}`,
`LV = ¬HV`.
3. **Contrast score** per sequence:
```
contrast_hvlv(i) = mean_{l ∈ HV} F_{i,l} − mean_{l ∈ LV} F_{i,l}
```
4. **Mosaic** (N_SUBSETS = 12, TARGET_SIZE = 32):
sort pool by `contrast_hvlv`, tri-stratify into low/mid/high terciles;
for each subset `s ∈ {0..11}`, draw `11 high + 11 low + 10 mid` with
`np.random.default_rng(seed=s)`.
5. **Gradient** (N_SUBSETS = 12, TARGET_SIZE = 32):
split sorted pool into 4 quartiles; for each bin `b ∈ {0..3}` and
`s ∈ {0..2}` draw 32 sequences from that bin only with
`np.random.default_rng(seed=10*b + s)`.
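
The five steps above can be sketched in plain NumPy. This is an illustrative reading of the description, not the package's implementation: the function names are hypothetical, population variance (`ddof=0`) is assumed, terciles and quartiles are formed with `np.array_split`, and the draw order and the without-replacement sampling inside each stratum are assumptions. It also assumes each stratum holds at least as many sequences as are drawn from it (the gradient sketch falls back to replacement otherwise).

```python
import numpy as np

def contrast_score(fi, hv_percentile=80.0):
    """Steps 1-3: mean FI over high-variance columns minus mean over the rest."""
    col_var = fi.var(axis=0)                                 # v_l, variance over sequences
    hv = col_var >= np.percentile(col_var, hv_percentile)    # HV mask; LV is ~hv
    return fi[:, hv].mean(axis=1) - fi[:, ~hv].mean(axis=1)

def mosaic_subsets(score, n_subsets=12, n_high=11, n_low=11, n_mid=10):
    """Step 4: each subset mixes high, low and mid terciles."""
    order = np.argsort(score)                    # pool indices, ascending contrast
    low, mid, high = np.array_split(order, 3)    # assumed tercile construction
    subsets = []
    for s in range(n_subsets):
        rng = np.random.default_rng(seed=s)
        picks = np.concatenate([
            rng.choice(high, n_high, replace=False),
            rng.choice(low, n_low, replace=False),
            rng.choice(mid, n_mid, replace=False),
        ])
        subsets.append(picks.tolist())
    return subsets

def gradient_subsets(score, n_bins=4, per_bin=3, size=32):
    """Step 5: each subset is drawn from a single quartile."""
    order = np.argsort(score)
    subsets = []
    for b, members in enumerate(np.array_split(order, n_bins)):
        for s in range(per_bin):
            rng = np.random.default_rng(seed=10 * b + s)
            replace = len(members) < size        # only when the bin is too small
            subsets.append(rng.choice(members, size, replace=replace).tolist())
    return subsets
```

With a synthetic `(200, 50)` FI matrix, both builders return 12 lists of 32 pool indices, and calling them again on the same scores returns the same lists.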
## Install
```bash
pip install -e .
```
Python ≥ 3.10. Dependencies: `numpy`, `scipy`.
## Inputs
You need two files per case:
1. A filtered A3M file (ColabFold-style). Lowercase insertion-state letters
are preserved verbatim in output subsets; only match-state (uppercase)
columns are scored.
2. A per-residue FI matrix `.npy` of shape `(N_seq, L)`, where `N_seq` is
the number of sequences in the A3M and `L` is the number of match-state
columns.
The FI matrix is produced by FrustrAI-Seq. We do not bundle weights; see
`https://github.com/leuschj/FrustrAI-Seq` (model card:
`https://huggingface.co/leuschj/FrustrAI-Seq`) for inference instructions.
A reference usage pattern is documented in `examples/run_demo.sh`.
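
Before building subsets, it can help to confirm that the two inputs agree in shape. The sketch below is a hypothetical pre-flight check, not part of the package: `read_a3m`, `match_columns` and `check_inputs` are illustrative names, and it assumes that every character other than a lowercase letter (uppercase residues and `-` gaps) counts as a match-state column, and that lines starting with `#` are comments.

```python
import numpy as np

def read_a3m(text):
    """Parse A3M text into (headers, sequences); headers lose their leading '>'."""
    headers, seqs = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):     # skip blanks and comment lines
            continue
        if line.startswith(">"):
            headers.append(line[1:])
            seqs.append("")
        elif seqs:
            seqs[-1] += line                     # tolerate wrapped sequence lines
    return headers, seqs

def match_columns(seq):
    """Count match-state columns: everything except lowercase insertions."""
    return sum(1 for c in seq if not c.islower())

def check_inputs(a3m_text, fi):
    """Raise ValueError unless fi has shape (N_seq, L) for this A3M."""
    _, seqs = read_a3m(a3m_text)
    widths = {match_columns(s) for s in seqs}
    if len(widths) != 1:
        raise ValueError(f"inconsistent match-state widths: {sorted(widths)}")
    expected = (len(seqs), widths.pop())
    if fi.shape != expected:
        raise ValueError(f"FI matrix shape {fi.shape} != expected {expected}")
    return expected
```

A typical call would be `check_inputs(open("filtered.a3m").read(), np.load("fi_matrix.npy"))`.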
## CLI
```bash
sf-cluster build \
--a3m path/to/filtered.a3m \
--fi path/to/fi_matrix.npy \
--method mosaic \
--n-subsets 12 \
--subset-size 32 \
--seed 20260422 \
--out subsets/kaib_mosaic/
```
Outputs:
```
subsets/kaib_mosaic/
├── mosaic_subset_000.a3m
├── mosaic_subset_001.a3m
├── ...
├── mosaic_subset_011.a3m
├── mosaic_subset_index.tsv # subset_id, pool_index, header, score
└── mosaic_meta.json # provenance + score stats
```
## Library
```python
from sf_cluster import pool_msa, contrast_hvlv, method_mosaic, method_gradient
pool = pool_msa("filtered.a3m", "fi_matrix.npy")
score = contrast_hvlv(pool.fi_matrix) # (N,) per-sequence
subsets = method_mosaic(score) # list[list[int]] of 12 × 32
# or
subsets = method_gradient(score)
```
Each subset is a list of indices into `pool.headers` / `pool.sequences`.
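
As an illustration of how such an index list maps back to sequences, the sketch below renders one subset as A3M text. `subset_to_a3m` is a hypothetical helper, not a package function; plain lists stand in for `pool.headers` and `pool.sequences`, and headers are assumed to be stored without the leading `>`.

```python
def subset_to_a3m(headers, sequences, indices):
    """Render the selected pool entries as A3M text, one header/sequence pair each."""
    lines = []
    for i in indices:
        lines.append(">" + headers[i])
        lines.append(sequences[i])       # insertion-state letters are kept verbatim
    return "\n".join(lines) + "\n"

# Stand-in data; in practice these would come from the pooled MSA.
demo_headers = ["query", "hit1", "hit2"]
demo_sequences = ["MKTL", "MKaTL", "M-TL"]
demo_text = subset_to_a3m(demo_headers, demo_sequences, [0, 2])
```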
## Reproducibility
All RNG draws use `np.random.default_rng(seed=...)` with method-specific
deterministic seeds (see Algorithm §4–§5). Re-running the same A3M + FI
matrix yields byte-identical subset assignments. The CLI also records a
provenance JSON (`{method}_meta.json`) capturing inputs, sizes, and the
package version.
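
One way to confirm this on your own runs is to fingerprint the subset assignments and compare across invocations. The sketch below is a stand-alone illustration: `draw` and `fingerprint` are hypothetical helpers, and the draw is a simplified stand-in for the package's sampling, used only to show that a seeded `np.random.default_rng` reproduces the same indices.

```python
import hashlib
import json

import numpy as np

def draw(seed, pool_size=100, k=32):
    """Draw k distinct pool indices with a seeded generator."""
    rng = np.random.default_rng(seed=seed)
    return rng.choice(pool_size, size=k, replace=False).tolist()

def fingerprint(subsets):
    """SHA-256 over a canonical JSON encoding of the subset index lists."""
    payload = json.dumps(subsets, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

run_a = [draw(s) for s in range(12)]
run_b = [draw(s) for s in range(12)]
same = fingerprint(run_a) == fingerprint(run_b)   # identical seeds, identical draws
```

The same `fingerprint` call can be applied to the `list[list[int]]` returned by the library functions from two separate runs.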
## Limitations
- **No frustration model included.** You must run FrustrAI-Seq separately to
obtain the `(N_seq, L)` FI matrix. This package only handles the
scoring + subset-construction stage.
- **No AF2 runner included.** The package emits A3M files; downstream
inference (AF2 / ColabFold) is the user's responsibility.
- **Only `mosaic` and `gradient` arms are open-sourced here.** The other
SF-Cluster arms (`region_cluster`, `contrast_nc`) require additional
feature pipelines and are intentionally excluded from this workshop
release.
- **Subsets are not disjoint.** A sequence can appear in more than one
subset, and gradient draws from a single quartile with replacement when
that quartile holds fewer than `subset_size` sequences.
- **Empirical caveat (read this).** A controlled comparison shows that uniform
subsampling performs equivalently on most Main-21 cases; see the paper for
boundary conditions under which contrast-FI stratification yields a
measurable lift over random subsampling. Treat this package as a research
baseline, not a turnkey accuracy improvement.
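
The overlap between subsets noted above can be measured on any `list[list[int]]` of subsets. The helpers below are a hypothetical sketch, not package functions: one counts how many subsets each pool index appears in, the other counts repeated draws inside a single subset.

```python
from collections import Counter

def subset_membership(subsets):
    """Map each pool index to the number of subsets that contain it."""
    counts = Counter()
    for subset in subsets:
        counts.update(set(subset))       # count each index once per subset
    return counts

def repeats_within(subset):
    """Number of draws in a subset that duplicate an earlier draw."""
    return len(subset) - len(set(subset))

demo_subsets = [[0, 1, 2], [2, 3, 3]]
demo_counts = subset_membership(demo_subsets)
shared = sorted(i for i, n in demo_counts.items() if n > 1)
```

In this toy example index 2 is shared by both subsets and the second subset contains one repeated draw.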
## Citation
If you use this code, please cite the SF-Cluster paper (forthcoming) and
[FrustrAI-Seq](https://github.com/leuschj/FrustrAI-Seq).
## License
MIT. See `LICENSE`.