FRET-FACS: FRET classification pipeline
Scripts, pre-trained models, and datasets for binary FRET classifiers (highFRET vs lowFRET) on designed protein variants, plus utilities for ESM embedding extraction and structure-based descriptors.
Paper: Learning the structural diversity in random protein sequence space (bioRxiv 2026; doi:10.64898/2026.04.30.722084). See LICENSE for terms and the recommended citation.
Hub repository: huggingface.co/neuwirtt/FRET-FACS
Download
pip install -U "huggingface_hub[cli]"
hf download neuwirtt/FRET-FACS --local-dir FRET-FACS
cd FRET-FACS
Quick start
Install evaluation dependencies:
pip install -r evaluation/requirements.txt
Evaluate with bundled weights (paths are relative to the repo root):
python evaluation/evaluate_nn_one_hot.py --fasta your_seqs.fa --output eval_out/
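Before running an evaluator, it can help to sanity-check the input FASTA. The sketch below is not part of the repository. It assumes that the one-hot models expect only the 20 standard amino acids and that a record is identified by the first whitespace-delimited token of its header; neither is confirmed by this README. The helper names `read_fasta` and `find_problem_records` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumption: the one-hot models expect the 20 standard amino acids only.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            tokens = line[1:].split()
            # Assumption: the record ID is the first token of the header.
            header, chunks = (tokens[0] if tokens else ""), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def find_problem_records(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each problematic header to its sorted non-standard characters.

    Empty sequences are reported with an empty list.
    """
    problems = {}
    for header, seq in read_fasta(lines):
        bad = sorted(set(seq) - STANDARD_AA)
        if bad or not seq:
            problems[header] = bad
    return problems


if __name__ == "__main__":
    demo = [">v1", "MKTAYIAK", ">v2 some description", "MKXB*", ">v3", "ACDE", "FGHI"]
    print(find_problem_records(demo))  # {'v2': ['*', 'B', 'X']}
```

For a real file, pass an open handle, for example `find_problem_records(open("your_seqs.fa"))`.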
Train a new model (example: one-hot neural net):
pip install -r models/requirements_models_training.txt
python models/nn_one_hot.py --fasta seqs.fa --csv labels.csv --output run_out/
ESM models require per-sequence .pt embedding files. Extract them first (see extract_embeddings/), then train, or pass them to the evaluators via --emb-mean / --emb-token.
See the README in each subdirectory for full CLI options, CSV column schemas, and artifact naming.
Repository layout
| Directory | Purpose |
|---|---|
| models/ | Training scripts (one-hot NN/RF, ESM NN/RF, composition logistic-regression baseline) |
| evaluation/ | Inference and supervised scoring on new FASTA inputs using checkpoints in weights/ |
| weights/ | Pre-trained checkpoints, calibration CSVs, and reproducibility metadata (see weights/README.md) |
| extract_embeddings/ | Extract fair-esm .pt embedding caches for ESM-based models |
| datasets/ | FRET variant labels and natural-protein reference FASTAs |
| structure_predictions_processing/ | Batch PDB + DSSP structure statistics (pdb_dssp_analyses.py) |
| similarity_split/ | MMseqs2 cluster-based train/test split (mmseqs_cluster_split.py) |
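To illustrate what a cluster-based split means, here is a minimal sketch. It does not reproduce similarity_split/mmseqs_cluster_split.py. It assumes a two-column `representative<TAB>member` TSV as written by `mmseqs easy-cluster`, and assigns whole clusters to either train or test so that similar sequences never straddle the split. The function names, the `test_fraction` default, and the greedy fill strategy are all assumptions made for illustration.

```python
import csv
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def read_cluster_tsv(path: str) -> List[Tuple[str, str]]:
    """Read (representative, member) pairs from a two-column TSV file."""
    with open(path, newline="") as handle:
        return [(row[0], row[1]) for row in csv.reader(handle, delimiter="\t") if row]


def group_clusters(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group member IDs under their cluster representative."""
    clusters = defaultdict(list)
    for rep, member in pairs:
        clusters[rep].append(member)
    return dict(clusters)


def cluster_split(
    clusters: Dict[str, List[str]], test_fraction: float = 0.2, seed: int = 0
) -> Tuple[Set[str], Set[str]]:
    """Assign whole clusters to test until roughly test_fraction of members is reached."""
    reps = sorted(clusters)               # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(reps)
    target = test_fraction * sum(len(m) for m in clusters.values())
    train, test = set(), set()
    for rep in reps:
        # A cluster is never divided: all of its members go to the same side.
        (test if len(test) < target else train).update(clusters[rep])
    return train, test


if __name__ == "__main__":
    pairs = [("a", "a"), ("a", "b"), ("a", "c"), ("d", "d"), ("d", "e"), ("f", "f")]
    train_ids, test_ids = cluster_split(group_clusters(pairs))
    print(sorted(train_ids), sorted(test_ids))
```

Because clusters are kept intact, the test set can overshoot the requested fraction by up to one cluster.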
weights/ bundles
| Subdirectory | Evaluator |
|---|---|
| weights/nn_one_hot/ | evaluation/evaluate_nn_one_hot.py |
| weights/nn_mean_pertoken_esm/ | evaluation/evaluate_nn_mean_pertoken_esm.py |
| weights/rf_one_hot/ | evaluation/evaluate_rf_one_hot.py |
| weights/rf_mean_pertoken_esm/ | evaluation/evaluate_rf_mean_pertoken_esm.py |
Each bundle contains inference artifacts plus model_parameters.json, random_seed.txt, and hyperparameter CSVs for reproducibility. See weights/README.md for the full file list.
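A quick way to confirm that a downloaded bundle is complete is to check for the reproducibility files named above. This sketch checks only model_parameters.json and random_seed.txt, the two file names this README gives explicitly. The helper names are hypothetical, and the JSON is loaded without assuming anything about its keys.

```python
import json
import tempfile
from pathlib import Path
from typing import List

# Only the file names stated in this README; other bundle files are not checked.
EXPECTED_FILES = ("model_parameters.json", "random_seed.txt")


def missing_bundle_files(bundle_dir) -> List[str]:
    """Return the expected file names that are absent from bundle_dir."""
    bundle = Path(bundle_dir)
    return [name for name in EXPECTED_FILES if not (bundle / name).is_file()]


def load_model_parameters(bundle_dir):
    """Parse model_parameters.json; its schema is not assumed here."""
    with open(Path(bundle_dir) / "model_parameters.json") as handle:
        return json.load(handle)


if __name__ == "__main__":
    # Demo on a throwaway directory; point this at e.g. weights/nn_one_hot/ instead.
    with tempfile.TemporaryDirectory() as tmp:
        print(missing_bundle_files(tmp))
```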
The composition logistic-regression (LR) model is not pre-bundled. Train it with models/lr_sequence_composition_baseline.py, then pass --model-dir to evaluation/evaluate_lr_sequence_composition.py.
ESM embedding caches (.pt per FASTA header) are not included; create them with extract_embeddings/ and place or symlink mean_embeddings_pt/ and per_token_embeddings_pt/ next to the ESM checkpoint dirs.
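Since the caches are keyed by FASTA header, a missing file is an easy failure to hit. The sketch below lists headers that have no matching cache file. It assumes each cache is named `<header>.pt`, where `<header>` is the first whitespace-delimited token of the FASTA header line; check the extract_embeddings/ README for the actual naming. Both helper names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable, List


def fasta_headers(fasta_path) -> List[str]:
    """Return the first whitespace-delimited token of every '>' header line."""
    headers = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                tokens = line[1:].split()
                if tokens:
                    headers.append(tokens[0])
    return headers


def missing_embeddings(headers: Iterable[str], emb_dir) -> List[str]:
    """List headers with no '<header>.pt' file in emb_dir (naming is an assumption)."""
    emb_dir = Path(emb_dir)
    return [h for h in headers if not (emb_dir / f"{h}.pt").is_file()]


if __name__ == "__main__":
    # Self-contained demo on a temporary directory.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        fasta = root / "seqs.fa"
        fasta.write_text(">v1\nMKT\n>v2 note\nACD\n")
        mean_dir = root / "mean_embeddings_pt"
        mean_dir.mkdir()
        (mean_dir / "v1.pt").touch()
        print(missing_embeddings(fasta_headers(fasta), mean_dir))  # ['v2']
```

Run the check once against mean_embeddings_pt/ and once against per_token_embeddings_pt/ before starting an ESM evaluation.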
Datasets
| File | Rows | What it contains |
|---|---|---|
| datasets/full_dataset_mmseq.csv | ~94k | Full MMseqs2-clustered table: split=train (…) and split=test (…) |
| datasets/test_only_mmseq.csv | ~17k | Held-out test split only (balanced highFRET/lowFRET); same rows as split=test in the full file |
datasets/natural_proteins/: SCOP globular sets and UniProt coiled-coil reference FASTAs
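The relationship between the two CSVs can be checked with a few lines of standard-library Python. This sketch relies only on the `split` column implied by the table above. The other column names in the demo (`sequence_id`, `label`) are hypothetical placeholders; see the datasets README for the real schema.

```python
import csv
import io
from collections import Counter
from typing import Dict, Iterable, List


def read_rows(handle) -> List[Dict[str, str]]:
    """Read a CSV with a header row into a list of dicts."""
    return list(csv.DictReader(handle))


def split_counts(rows: Iterable[Dict[str, str]]) -> Counter:
    """Count rows per value of the 'split' column."""
    return Counter(row["split"] for row in rows)


def select_split(rows: Iterable[Dict[str, str]], name: str) -> List[Dict[str, str]]:
    """Keep only the rows whose 'split' column equals name."""
    return [row for row in rows if row["split"] == name]


if __name__ == "__main__":
    # Tiny in-memory stand-in; 'sequence_id' and 'label' are hypothetical columns.
    demo_csv = io.StringIO(
        "sequence_id,label,split\n"
        "v1,highFRET,train\n"
        "v2,lowFRET,train\n"
        "v3,highFRET,test\n"
    )
    rows = read_rows(demo_csv)
    print(split_counts(rows))
    print(select_split(rows, "test"))
```

For the real data, open the file with `open("datasets/full_dataset_mmseq.csv", newline="")` and pass the handle to `read_rows`; the rows returned by `select_split(rows, "test")` should then correspond to datasets/test_only_mmseq.csv.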
Citation
@article{Buchel2026structural,
title = {Learning the structural diversity in random protein sequence space},
author = {Buchel, Filip and Neuwirthova, Tereza and Tureckiova, Theodora and
Fuertes, Gustavo and Benda, Ales and Panek, Dalibor and Fricek, Matus and
AlQuraishi, Mohammed and Hlouchova, Klara},
journal = {bioRxiv},
year = {2026},
doi = {10.64898/2026.04.30.722084},
url = {https://www.biorxiv.org/content/10.64898/2026.04.30.722084v1}
}