
[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ChatterjeeLab/SF-Cluster/blob/main/examples/SF_Cluster_Demo.ipynb)

> **Or run from Hugging Face:** open <https://colab.research.google.com/> → *File* → *Open notebook* → *URL* tab → paste
> `https://huggingface.co/ChatterjeeLab/SF-Cluster/resolve/main/examples/SF_Cluster_Demo.ipynb`

## Demo

A self-contained, CPU-only Colab notebook is provided at
[`examples/SF_Cluster_Demo.ipynb`](examples/SF_Cluster_Demo.ipynb). It installs the
package, downloads a small KaiB demo bundle (filtered MSA + FrustrAI-Seq FI matrix,
~200 KB), builds 12 mosaic and 12 gradient subsets, visualises the contrast-score
distribution and per-subset means, and writes A3M files ready for AF2. Expected
end-to-end runtime on a free Colab CPU instance: **~2 minutes**.

# SF-Cluster (workshop OSS release)

Frustration-guided MSA subset builders for AlphaFold2 multi-conformer
prediction. This is the open-source workshop distribution of two subset
methods from the SF-Cluster benchmark:

- **mosaic** — each subset mixes high / mid / low contrast-FI sequences.
- **gradient** — each subset is homogeneous within a contrast-FI quartile.

The contrast score is computed from a per-residue Frustration Index (FI)
matrix produced by [FrustrAI-Seq](https://github.com/leuschj/FrustrAI-Seq)
(HF model: `leuschj/FrustrAI-Seq`).

This package is dependency-light (`numpy`, `scipy`), provides a CLI, and is
designed to be a drop-in replacement for random / uniform MSA subsampling in
[AF-Cluster](https://github.com/HWaymentSteele/AF_Cluster)-style pipelines.

## Algorithm

Given a filtered MSA `A` of `N` sequences over `L` match-state columns, and a
per-residue FI matrix `F ∈ ℝ^{N×L}`:

1. **Column variance**: `v_l = Var_i(F_{i,l})` over sequences.
2. **High-variance mask**: `HV = {l : v_l ≥ percentile(v, 80)}`,
   `LV = ¬HV`.
3. **Contrast score** per sequence:
   ```
   contrast_hvlv(i) = mean_{l ∈ HV} F_{i,l} − mean_{l ∈ LV} F_{i,l}
   ```
4. **Mosaic** (N_SUBSETS = 12, TARGET_SIZE = 32):
   sort pool by `contrast_hvlv`, tri-stratify into low/mid/high terciles;
   for each subset `s ∈ {0..11}`, draw `11 high + 11 low + 10 mid` with
   `np.random.default_rng(seed=s)`.
5. **Gradient** (N_SUBSETS = 12, TARGET_SIZE = 32):
   split sorted pool into 4 quartiles; for each bin `b ∈ {0..3}` and
   `s ∈ {0..2}` draw 32 sequences from that bin only with
   `np.random.default_rng(seed=10*b + s)`.
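
The five steps above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the package's implementation. The function names (`contrast_score`, `mosaic_subsets`, `gradient_subsets`) are hypothetical, variance is taken as population variance (`ddof=0`), terciles and quartiles are cut with `np.array_split`, the draw order within a mosaic subset (high, low, mid) is a guess, and sampling falls back to replacement whenever a stratum is smaller than the requested draw.

```python
import numpy as np

def contrast_score(fi, hv_percentile=80.0):
    """Steps 1-3: mean FI over high-variance columns minus mean over the rest."""
    col_var = fi.var(axis=0)                                # step 1: variance over sequences
    hv = col_var >= np.percentile(col_var, hv_percentile)   # step 2: HV mask, LV is ~hv
    if hv.all():
        raise ValueError("LV is empty: all columns have the same variance")
    return fi[:, hv].mean(axis=1) - fi[:, ~hv].mean(axis=1)  # step 3

def _draw(rng, pool, size):
    # Assumption: sample with replacement only when the pool is too small.
    return rng.choice(pool, size=size, replace=len(pool) < size)

def mosaic_subsets(score, n_subsets=12, n_high=11, n_low=11, n_mid=10):
    """Step 4: every subset mixes high, low and mid contrast terciles."""
    low, mid, high = np.array_split(np.argsort(score), 3)
    subsets = []
    for s in range(n_subsets):
        rng = np.random.default_rng(seed=s)
        picks = [_draw(rng, high, n_high), _draw(rng, low, n_low), _draw(rng, mid, n_mid)]
        subsets.append(np.concatenate(picks).tolist())
    return subsets

def gradient_subsets(score, n_bins=4, per_bin=3, size=32):
    """Step 5: every subset is drawn from a single contrast quartile."""
    bins = np.array_split(np.argsort(score), n_bins)
    subsets = []
    for b, pool in enumerate(bins):
        for s in range(per_bin):
            rng = np.random.default_rng(seed=10 * b + s)
            subsets.append(_draw(rng, pool, size).tolist())
    return subsets

# Toy FI matrix: 200 sequences, 50 match-state columns (random, for shape only).
fi = np.random.default_rng(0).normal(size=(200, 50))
score = contrast_score(fi)
mosaic = mosaic_subsets(score)
gradient = gradient_subsets(score)
print(len(mosaic), len(mosaic[0]), len(gradient), len(gradient[0]))  # 12 32 12 32
```

With 200 toy sequences every stratum is larger than the requested draw, so no replacement occurs. On a small pool the `_draw` fallback would start repeating sequences.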

## Install

```bash
pip install -e .
```

Python ≥ 3.10. Dependencies: `numpy`, `scipy`.

## Inputs

You need two files per case:

1. A filtered A3M file (ColabFold-style). Lowercase insertion-state letters
   are preserved verbatim in output subsets; only match-state (uppercase)
   columns are scored.
2. A per-residue FI matrix `.npy` of shape `(N_seq, L)`, where `N_seq` is
   the number of sequences in the A3M and `L` is the number of match-state
   columns.

The FI matrix is produced by FrustrAI-Seq. We do not bundle weights — see
`https://github.com/leuschj/FrustrAI-Seq` (model card:
`https://huggingface.co/leuschj/FrustrAI-Seq`) for inference instructions.
A reference usage pattern is documented in `examples/run_demo.sh`.
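
Before building subsets it can help to confirm that the two inputs agree on shape. The sketch below is a hypothetical stand-alone check, not part of the package: `read_a3m`, `match_state_length` and `check_inputs` are illustrative names, match-state columns are assumed to be uppercase letters plus `-` gaps, and lines starting with `#` (as in ColabFold-style files) are assumed to be metadata and skipped.

```python
import numpy as np

def read_a3m(text):
    """Parse A3M text into (headers, sequences); sequences may wrap across lines."""
    headers, seqs = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # assumption: '#' lines are metadata
            continue
        if line.startswith(">"):
            headers.append(line[1:])
            seqs.append("")
        elif seqs:
            seqs[-1] += line
    return headers, seqs

def match_state_length(seq):
    """Count match-state columns: uppercase residues and '-' gaps."""
    return sum(1 for c in seq if c.isupper() or c == "-")

def check_inputs(a3m_text, fi):
    """Return (N_seq, L) if the A3M and FI matrix agree, else raise ValueError."""
    _, seqs = read_a3m(a3m_text)
    lengths = {match_state_length(s) for s in seqs}
    if len(lengths) != 1:
        raise ValueError(f"inconsistent match-state lengths: {sorted(lengths)}")
    n_cols = lengths.pop()
    if fi.shape != (len(seqs), n_cols):
        raise ValueError(f"FI shape {fi.shape} != expected {(len(seqs), n_cols)}")
    return len(seqs), n_cols

# Toy alignment: lowercase letters are insertions and do not count as columns.
demo_a3m = "#5\t1\n>query\nMKTAY\n>hit1\nMK-AYg\n>hit2\nmMKTaAY\n"
print(check_inputs(demo_a3m, np.zeros((3, 5))))  # (3, 5)
```

For real files, the text would come from reading the A3M from disk and the matrix from `np.load` on the `.npy` file.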

## CLI

```bash
sf-cluster build \
    --a3m   path/to/filtered.a3m \
    --fi    path/to/fi_matrix.npy \
    --method mosaic \
    --n-subsets 12 \
    --subset-size 32 \
    --seed 20260422 \
    --out   subsets/kaib_mosaic/
```

Outputs:
```
subsets/kaib_mosaic/
├── mosaic_subset_000.a3m
├── mosaic_subset_001.a3m
├── ...
├── mosaic_subset_011.a3m
├── mosaic_subset_index.tsv   # subset_id, pool_index, header, score
└── mosaic_meta.json          # provenance + score stats
```
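
The index TSV can be grouped back into per-subset membership with the standard library. The sketch below assumes the four columns listed above in that order, tab-separated, and tolerates an optional header row; the exact format of `subset_id` and the presence of a header row are not specified here, so IDs are kept as strings. `load_subset_index` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

COLUMNS = ["subset_id", "pool_index", "header", "score"]

def load_subset_index(handle):
    """Group index rows by subset_id -> list of (pool_index, header, score)."""
    members = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row == COLUMNS:      # skip blank lines and an optional header row
            continue
        subset_id, pool_index, header, score = row
        members[subset_id].append((int(pool_index), header, float(score)))
    return dict(members)

# Made-up rows for illustration only.
sample = (
    "subset_id\tpool_index\theader\tscore\n"
    "0\t17\tseqA\t0.42\n"
    "0\t3\tseqB\t-0.10\n"
    "1\t17\tseqA\t0.42\n"
)
index = load_subset_index(io.StringIO(sample))
print(sorted(index), len(index["0"]))  # ['0', '1'] 2
```

For a real run, pass an open file handle on `mosaic_subset_index.tsv` instead of the `io.StringIO` object.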

## Library

```python
from sf_cluster import pool_msa, contrast_hvlv, method_mosaic, method_gradient

pool = pool_msa("filtered.a3m", "fi_matrix.npy")
score = contrast_hvlv(pool.fi_matrix)         # (N,) per-sequence
subsets = method_mosaic(score)                # list[list[int]] of 12 × 32
# or
subsets = method_gradient(score)
```

Each subset is a list of indices into `pool.headers` / `pool.sequences`.
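
Because a subset is only a list of indices, turning it into an A3M file is a lookup over headers and sequences. The sketch below uses plain lists standing in for `pool.headers` and `pool.sequences`. `write_subset_a3m` is a hypothetical helper rather than a package function, and it writes records in the order given without adding a query sequence.

```python
import tempfile
from pathlib import Path

def write_subset_a3m(headers, sequences, indices, path):
    """Write the selected records to `path`, one header line and one sequence line each."""
    lines = []
    for i in indices:
        header = headers[i]
        # Assumption: stored headers may or may not carry the leading '>'.
        lines.append(header if header.startswith(">") else ">" + header)
        lines.append(sequences[i])  # lowercase insertion letters are kept verbatim
    Path(path).write_text("\n".join(lines) + "\n")

# Toy stand-ins for pool.headers / pool.sequences and two small subsets.
headers = ["query", "hit1", "hit2", "hit3"]
sequences = ["MKTAY", "MK-AYg", "mMKTAY", "MKT-Y"]
subsets = [[0, 2], [1, 3]]

with tempfile.TemporaryDirectory() as tmp:
    for k, indices in enumerate(subsets):
        write_subset_a3m(headers, sequences, indices, Path(tmp) / f"subset_{k:03d}.a3m")
    written = (Path(tmp) / "subset_000.a3m").read_text()

print(written)
```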

## Reproducibility

All RNG draws use `np.random.default_rng(seed=...)` with method-specific
deterministic seeds (see steps 4 and 5 of the Algorithm section). Re-running
with the same A3M and FI matrix yields byte-identical subset assignments. The CLI also records a
provenance JSON (`{method}_meta.json`) capturing inputs, sizes, and the
package version.
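
One way to confirm that two runs produced the same assignments is to hash them. The sketch below is a generic illustration with hypothetical helpers (`draw_subsets`, `fingerprint`) and toy sizes. It does not call the package and only shows that seeded `default_rng` draws repeat exactly.

```python
import hashlib
import json

import numpy as np

def draw_subsets(n_items, n_subsets, size):
    """Draw `n_subsets` index lists, one seeded generator per subset."""
    return [
        np.random.default_rng(seed=s).choice(n_items, size=size, replace=False).tolist()
        for s in range(n_subsets)
    ]

def fingerprint(subsets):
    """SHA-256 over a canonical JSON dump of the subset assignments."""
    payload = json.dumps(subsets, separators=(",", ":")).encode()
    return hashlib.sha256(payload).hexdigest()

first = draw_subsets(500, 12, 32)
second = draw_subsets(500, 12, 32)
print(fingerprint(first) == fingerprint(second))  # True
```

NumPy's compatibility policy allows `Generator` methods to change their output stream between releases, so comparing fingerprints is most meaningful when the NumPy version is pinned as well.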

## Limitations

- **No frustration model included.** You must run FrustrAI-Seq separately to
  obtain the `(N_seq, L)` FI matrix. This package only handles the
  scoring + subset-construction stage.
- **No AF2 runner included.** The package emits A3M files; downstream
  inference (AF2 / ColabFold) is the user's responsibility.
- **Only `mosaic` and `gradient` arms are open-sourced here.** The other
  SF-Cluster arms (`region_cluster`, `contrast_nc`) require additional
  feature pipelines and are intentionally excluded from this workshop
  release.
- **No uniqueness guarantee across subsets.** The same sequence can appear
  in more than one subset. In addition, gradient draws from a single
  quartile with replacement if the quartile is smaller than `subset_size`,
  so a sequence can then repeat within one subset.
- **Empirical caveat (read this).** A controlled comparison shows that
  uniform subsampling performs equivalently on most Main-21 cases. See the
  paper for the boundary conditions under which contrast-FI stratification
  yields a measurable lift over random subsampling. Treat this package as a
  research baseline, not a turnkey accuracy improvement.

## Citation

If you use this code, please cite the SF-Cluster paper (forthcoming) and
[FrustrAI-Seq](https://github.com/leuschj/FrustrAI-Seq).

## License

MIT. See `LICENSE`.