[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ChatterjeeLab/SF-Cluster/blob/main/examples/SF_Cluster_Demo.ipynb)

> **Or run from Hugging Face:** open <https://colab.research.google.com/> → *File* → *Open notebook* → *URL* tab → paste
> `https://huggingface.co/ChatterjeeLab/SF-Cluster/resolve/main/examples/SF_Cluster_Demo.ipynb`

## Demo

A self-contained, CPU-only Colab notebook is provided at
[`examples/SF_Cluster_Demo.ipynb`](examples/SF_Cluster_Demo.ipynb). It installs the
package, downloads a small KaiB demo bundle (filtered MSA + FrustrAI-Seq FI matrix,
~200 KB), builds 12 mosaic and 12 gradient subsets, visualises the contrast-score
distribution and per-subset means, and writes A3M files ready for AF2. Expected
end-to-end runtime on a free Colab CPU instance: **~2 minutes**.

# SF-Cluster (workshop OSS release)

Frustration-guided MSA subset builders for AlphaFold2 multi-conformer
prediction. This is the open-source workshop distribution of two subset
methods from the SF-Cluster benchmark:

- **mosaic**: each subset mixes high / mid / low contrast-FI sequences.
- **gradient**: each subset is homogeneous within a contrast-FI quartile.

The contrast score is computed from a per-residue Frustration Index (FI)
matrix produced by [FrustrAI-Seq](https://github.com/leuschj/FrustrAI-Seq)
(HF model: `leuschj/FrustrAI-Seq`).

This package is dependency-light (`numpy`, `scipy`), provides a CLI, and is
designed to be a drop-in replacement for random / uniform MSA subsampling in
[AF-Cluster](https://github.com/HWaymentSteele/AF_Cluster)-style pipelines.

## Algorithm

Given a filtered MSA `A` of `N` sequences over `L` match-state columns, and a
per-residue FI matrix `F ∈ ℝ^{N×L}`:

1. **Column variance**: `v_l = Var_i(F_{i,l})` over sequences.
2. **High-variance mask**: `HV = {l : v_l ≥ percentile(v, 80)}`,
   `LV = ¬HV`.
3. **Contrast score** per sequence:
   ```
   contrast_hvlv(i) = mean_{l ∈ HV} F_{i,l} − mean_{l ∈ LV} F_{i,l}
   ```
4. **Mosaic** (N_SUBSETS = 12, TARGET_SIZE = 32):
   sort pool by `contrast_hvlv`, tri-stratify into low/mid/high terciles;
   for each subset `s ∈ {0..11}`, draw `11 high + 11 low + 10 mid` with
   `np.random.default_rng(seed=s)`.
5. **Gradient** (N_SUBSETS = 12, TARGET_SIZE = 32):
   split sorted pool into 4 quartiles; for each bin `b ∈ {0..3}` and
   `s ∈ {0..2}` draw 32 sequences from that bin only with
   `np.random.default_rng(seed=10*b + s)`.
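
The five steps above can be sketched in plain NumPy. This is an illustrative reading of the description, not the package's code: the `sketch_*` names are hypothetical, terciles and quartiles are assumed to be near-equal splits of the sorted pool, and falling back to sampling with replacement when a stratum is too small is an assumption for the mosaic case.

```python
import numpy as np

def sketch_contrast(fi, hv_percentile=80.0):
    """Steps 1-3: mean FI over high-variance columns minus mean over the rest."""
    col_var = fi.var(axis=0)                           # v_l, variance over sequences
    hv = col_var >= np.percentile(col_var, hv_percentile)
    return fi[:, hv].mean(axis=1) - fi[:, ~hv].mean(axis=1)

def _draw(rng, pool, k):
    # Assumption: sample without replacement unless the stratum is too small.
    return rng.choice(pool, size=k, replace=len(pool) < k)

def sketch_mosaic(score, n_subsets=12, n_high=11, n_low=11, n_mid=10):
    """Step 4: every subset mixes high, low and mid terciles (11 + 11 + 10 = 32)."""
    low, mid, high = np.array_split(np.argsort(score), 3)
    subsets = []
    for s in range(n_subsets):
        rng = np.random.default_rng(seed=s)
        parts = [_draw(rng, high, n_high), _draw(rng, low, n_low), _draw(rng, mid, n_mid)]
        subsets.append(np.concatenate(parts).tolist())
    return subsets

def sketch_gradient(score, n_bins=4, per_bin=3, size=32):
    """Step 5: every subset comes from a single quartile of the sorted pool."""
    bins = np.array_split(np.argsort(score), n_bins)
    subsets = []
    for b, pool in enumerate(bins):
        for s in range(per_bin):
            rng = np.random.default_rng(seed=10 * b + s)
            subsets.append(_draw(rng, pool, size).tolist())
    return subsets

# Synthetic FI matrix, only to exercise the functions.
fi_demo = np.random.default_rng(0).normal(size=(200, 50))
score_demo = sketch_contrast(fi_demo)
mosaic_demo = sketch_mosaic(score_demo)
gradient_demo = sketch_gradient(score_demo)
```

Each returned subset is a list of row indices into the FI matrix, which matches the index-based convention used elsewhere in this README.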

## Install

```bash
pip install -e .
```

Python ≥ 3.10. Dependencies: `numpy`, `scipy`.

## Inputs

You need two files per case:

1. A filtered A3M file (ColabFold-style). Lowercase insertion-state letters
   are preserved verbatim in output subsets; only match-state (uppercase)
   columns are scored.
2. A per-residue FI matrix `.npy` of shape `(N_seq, L)`, where `N_seq` is
   the number of sequences in the A3M and `L` is the number of match-state
   columns.

The FI matrix is produced by FrustrAI-Seq. We do not bundle weights; see
`https://github.com/leuschj/FrustrAI-Seq` (model card:
`https://huggingface.co/leuschj/FrustrAI-Seq`) for inference instructions.
A reference usage pattern is documented in `examples/run_demo.sh`.
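
Before building subsets it can help to confirm that the two inputs agree in shape. The following is a minimal sketch of such a check, not part of the package: `read_a3m_text`, `match_columns`, and `check_inputs` are hypothetical helpers, and it assumes that every non-lowercase character (uppercase letters and `-` gaps) counts as one match-state column.

```python
import numpy as np

def read_a3m_text(text):
    """Parse A3M text into (headers, sequences); lowercase insertions are kept."""
    headers, seqs = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blanks and ColabFold '#' lines
            continue
        if line.startswith(">"):
            headers.append(line[1:])
            seqs.append("")
        elif seqs:
            seqs[-1] += line                   # tolerate wrapped sequence lines
    return headers, seqs

def match_columns(seq):
    """Count match-state columns: everything except lowercase insertion letters."""
    return sum(1 for c in seq if not c.islower())

def check_inputs(seqs, fi):
    """Raise ValueError unless fi has shape (N_seq, L) for this alignment."""
    widths = {match_columns(s) for s in seqs}
    if len(widths) != 1:
        raise ValueError(f"inconsistent match-state widths: {sorted(widths)}")
    expected = (len(seqs), widths.pop())
    if fi.shape != expected:
        raise ValueError(f"FI matrix shape {fi.shape} != expected {expected}")
    return expected

a3m_demo = ">query\nMKV\n>hit1\nM-aV\n>hit2\nMKbcV\n"
headers_demo, seqs_demo = read_a3m_text(a3m_demo)
shape_demo = check_inputs(seqs_demo, np.zeros((3, 3)))
```

For real files, the text would come from reading the `.a3m` file and the matrix from `np.load` on the `.npy` file.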

## CLI

```bash
sf-cluster build \
    --a3m   path/to/filtered.a3m \
    --fi    path/to/fi_matrix.npy \
    --method mosaic \
    --n-subsets 12 \
    --subset-size 32 \
    --seed 20260422 \
    --out   subsets/kaib_mosaic/
```

Outputs:
```
subsets/kaib_mosaic/
├── mosaic_subset_000.a3m
├── mosaic_subset_001.a3m
├── ...
├── mosaic_subset_011.a3m
├── mosaic_subset_index.tsv   # subset_id, pool_index, header, score
└── mosaic_meta.json          # provenance + score stats
```
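
The index file can be regrouped into per-subset index lists with the standard library. This sketch relies on assumptions the listing above does not confirm: that the file is tab-separated, that its columns appear in the order given in the comment, and that an optional header row starts with `subset_id`. `load_subset_index` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Column order taken from the comment in the output listing (assumed, not verified).
COLUMNS = ("subset_id", "pool_index", "header", "score")

def load_subset_index(handle):
    """Group pool indices by subset id from a tab-separated index file."""
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0] == COLUMNS[0]:    # skip blank lines and a header row
            continue
        record = dict(zip(COLUMNS, row))
        groups[record["subset_id"]].append(int(record["pool_index"]))
    return dict(groups)

# Made-up rows, only to show the expected layout.
tsv_demo = io.StringIO(
    "subset_id\tpool_index\theader\tscore\n"
    "0\t5\tseqA\t0.12\n"
    "0\t9\tseqB\t-0.30\n"
    "1\t5\tseqA\t0.12\n"
)
index_demo = load_subset_index(tsv_demo)
```

With a real run, the handle would be an open file such as `open("subsets/kaib_mosaic/mosaic_subset_index.tsv", newline="")`.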

## Library

```python
from sf_cluster import pool_msa, contrast_hvlv, method_mosaic, method_gradient

pool = pool_msa("filtered.a3m", "fi_matrix.npy")
score = contrast_hvlv(pool.fi_matrix)         # (N,) per-sequence
subsets = method_mosaic(score)                # list[list[int]] of 12 × 32
# or
subsets = method_gradient(score)
```

Each subset is a list of indices into `pool.headers` / `pool.sequences`.
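
Because subsets are plain index lists, turning one into A3M text needs only the headers and sequences. The helper below is a hypothetical sketch, not the package's writer. The `query_index` option, which places one sequence first, is an assumption based on AF2-style pipelines generally reading the first A3M entry as the query; whether the package does this is not stated here.

```python
def format_subset_a3m(headers, sequences, indices, query_index=None):
    """Render the selected entries as A3M text, sequences copied verbatim."""
    order = list(indices)
    if query_index is not None:
        # Assumption: put the query first and avoid listing it twice.
        order = [query_index] + [i for i in order if i != query_index]
    lines = []
    for i in order:
        lines.append(">" + headers[i])
        lines.append(sequences[i])
    return "\n".join(lines) + "\n"

headers_demo = ["query", "hitA", "hitB"]
sequences_demo = ["MKV", "M-V", "MKaV"]
a3m_out = format_subset_a3m(headers_demo, sequences_demo, [2, 1], query_index=0)
```

In practice `headers` and `sequences` would be `pool.headers` and `pool.sequences`, and the returned string would be written to one file per subset.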

## Reproducibility

All RNG draws use `np.random.default_rng(seed=...)` with method-specific
deterministic seeds (see Algorithm §4–§5). Re-running the same A3M + FI
matrix yields byte-identical subset assignments. The CLI also records a
provenance JSON (`{method}_meta.json`) capturing inputs, sizes, and the
package version.
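
The determinism claim rests on a property of NumPy's `Generator`: the same seed produces the same draws. A small illustration follows, where `draw_indices` is a hypothetical stand-in for one subset draw. NumPy does not promise identical `Generator` streams across NumPy releases, so pinning the NumPy version is a reasonable precaution if results must match across machines.

```python
import numpy as np

def draw_indices(seed, pool_size=100, k=32):
    """One seeded draw of k distinct indices from range(pool_size)."""
    rng = np.random.default_rng(seed=seed)
    return rng.choice(pool_size, size=k, replace=False)

first = draw_indices(3)
repeat = draw_indices(3)      # same seed, so the same indices in the same order
other = draw_indices(4)       # a different seed gives a different draw

same_seed_matches = bool(np.array_equal(first, repeat))
```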

## Limitations

- **No frustration model included.** You must run FrustrAI-Seq separately to
  obtain the `(N_seq, L)` FI matrix. This package only handles the
  scoring + subset-construction stage.
- **No AF2 runner included.** The package emits A3M files; downstream
  inference (AF2 / ColabFold) is the user's responsibility.
- **Only `mosaic` and `gradient` arms are open-sourced here.** The other
  SF-Cluster arms (`region_cluster`, `contrast_nc`) require additional
  feature pipelines and are intentionally excluded from this workshop
  release.
- **No uniqueness guarantee across subsets.** A sequence can appear in
  multiple subsets. In addition, gradient draws from a single quartile with
  replacement if the quartile is smaller than `subset_size`, so duplicates
  can then occur within a subset.
- **Empirical caveat (read this).** Controlled comparison shows uniform
  subsampling performs equivalently on most Main-21 cases; see the paper for
  boundary conditions under which contrast-FI stratification yields a
  measurable lift over random subsampling. Treat this package as a research
  baseline, not a turnkey accuracy improvement.
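
To see how much sequence reuse a given run produced, the index lists can be tallied directly. This is a minimal sketch with hypothetical helper names; it works on any `list[list[int]]` of subset indices.

```python
from collections import Counter

def reuse_counts(subsets):
    """Map each pool index to the number of distinct subsets that contain it."""
    counts = Counter()
    for subset in subsets:
        counts.update(set(subset))   # set(): count a subset once even if it repeats an index
    return counts

def shared_fraction(subsets):
    """Fraction of used sequences that appear in more than one subset."""
    counts = reuse_counts(subsets)
    if not counts:
        return 0.0
    return sum(1 for n in counts.values() if n > 1) / len(counts)

subsets_demo = [[0, 1, 2], [2, 3, 3], [4]]
counts_demo = reuse_counts(subsets_demo)
fraction_demo = shared_fraction(subsets_demo)
```

A high shared fraction on a small MSA suggests the subsets are less independent than their count implies.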

## Citation

If you use this code, please cite the SF-Cluster paper (forthcoming) and
[FrustrAI-Seq](https://github.com/leuschj/FrustrAI-Seq).

## License

MIT. See `LICENSE`.