FRET-FACS: FRET classification pipeline

Scripts, pre-trained models, and datasets for binary FRET classifiers (highFRET vs lowFRET) on designed protein variants, plus utilities for ESM embedding extraction and structure-based descriptors.

Paper: Learning the structural diversity in random protein sequence space (bioRxiv 2026; doi:10.64898/2026.04.30.722084). See LICENSE for terms and the recommended citation.

Hub repository: huggingface.co/neuwirtt/FRET-FACS

Download

pip install -U "huggingface_hub[cli]"
hf download neuwirtt/FRET-FACS --local-dir FRET-FACS
cd FRET-FACS
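
The same download can also be scripted from Python. This is a minimal sketch, assuming the huggingface_hub package installed above: snapshot_download with repo_id and local_dir belongs to that library, while the download_repo wrapper is a name introduced here for illustration only.

```python
from pathlib import Path

REPO_ID = "neuwirtt/FRET-FACS"  # hub repository named in this README


def download_repo(local_dir: str = "FRET-FACS") -> Path:
    """Download a full snapshot of the repository and return its local path."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=REPO_ID, local_dir=local_dir))

# Usage (requires network access):
#   repo_root = download_repo()
```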

Quick start

Install evaluation dependencies:

pip install -r evaluation/requirements.txt

Evaluate with bundled weights (paths are relative to the repo root):

python evaluation/evaluate_nn_one_hot.py --fasta your_seqs.fa --output eval_out/

Train a new model (example: one-hot neural net):

pip install -r models/requirements_models_training.txt
python models/nn_one_hot.py --fasta seqs.fa --csv labels.csv --output run_out/
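
For orientation, one-hot encoding turns each residue into a 20-dimensional indicator vector. The sketch below only illustrates the idea and is not taken from models/nn_one_hot.py. The alphabet order, the zero-padding to a fixed length, and the all-zero row for non-standard residues are all assumptions.

```python
from typing import Dict, List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet order
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def read_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}, keeping the first header token."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = (line[1:].split() or [""])[0]
            records[header] = ""
        elif header is not None:
            records[header] += line.upper()
    return records


def one_hot(seq: str, max_len: int) -> List[List[int]]:
    """Return a max_len x 20 matrix; rows past the sequence end stay zero."""
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, aa in enumerate(seq[:max_len]):
        idx = INDEX.get(aa)
        if idx is not None:  # non-standard residues are left as all-zero rows
            matrix[pos][idx] = 1
    return matrix
```

The actual input shape and padding expected by the training script are documented in models/, so treat the dimensions here as placeholders.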

ESM models require per-sequence .pt embedding files. Extract them first (see extract_embeddings/), then train, or point the evaluators at the caches with --emb-mean / --emb-token.
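
The scripts in extract_embeddings/ are the supported route. Purely to show what a per-sequence embedding cache involves, here is a hedged sketch using the fair-esm package. The checkpoint (esm2_t33_650M_UR50D), the layer, the one-tensor-per-file layout, and the extract function itself are assumptions and may not match what the repository's scripts produce.

```python
from pathlib import Path
from typing import List, Tuple

LAYER = 33  # final layer of the assumed esm2_t33_650M_UR50D checkpoint


def residue_slice(seq_len: int) -> slice:
    """Token positions of the residues, skipping the BOS token at index 0."""
    return slice(1, seq_len + 1)


def extract(records: List[Tuple[str, str]], out_root: str) -> None:
    """Write <header>.pt files with per-token and mean embeddings (assumed layout)."""
    # Lazy imports: torch and fair-esm are only needed when extraction runs.
    import torch
    import esm

    model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
    model.eval()
    batch_converter = alphabet.get_batch_converter()

    mean_dir = Path(out_root) / "mean_embeddings_pt"
    token_dir = Path(out_root) / "per_token_embeddings_pt"
    mean_dir.mkdir(parents=True, exist_ok=True)
    token_dir.mkdir(parents=True, exist_ok=True)

    for header, seq in records:
        # One sequence per batch, so no padding positions need masking.
        _, _, tokens = batch_converter([(header, seq)])
        with torch.no_grad():
            result = model(tokens, repr_layers=[LAYER])
        per_token = result["representations"][LAYER][0, residue_slice(len(seq))]
        torch.save(per_token.clone(), token_dir / f"{header}.pt")
        torch.save(per_token.mean(dim=0), mean_dir / f"{header}.pt")
```

The mean embedding is simply the average of the per-token vectors over residue positions, which is why both caches can be written in one pass.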

See the README in each subdirectory for full CLI options, CSV column schemas, and artifact naming.

Repository layout

| Directory | Purpose |
| --- | --- |
| models/ | Training scripts (one-hot NN/RF, ESM NN/RF, composition logistic-regression baseline) |
| evaluation/ | Inference and supervised scoring on new FASTA inputs using checkpoints in weights/ |
| weights/ | Pre-trained checkpoints, calibration CSVs, and reproducibility metadata (see weights/README.md) |
| extract_embeddings/ | Extract fair-esm .pt embedding caches for ESM-based models |
| datasets/ | FRET variant labels and natural-protein reference FASTAs |
| structure_predictions_processing/ | Batch PDB + DSSP structure statistics (pdb_dssp_analyses.py) |
| similarity_split/ | MMseqs2 cluster-based train/test split (mmseqs_cluster_split.py) |
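
A cluster-based split, as provided by similarity_split/, keeps every member of a sequence cluster on the same side of the train/test boundary so that near-identical sequences do not leak between splits. The sketch below illustrates the idea and is not the logic of mmseqs_cluster_split.py. It assumes a two-column, tab-separated cluster file (representative, member) of the kind MMseqs2 clustering writes. The test_fraction and seed parameters are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List


def parse_clusters(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group member IDs by cluster representative."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rep, member = line.split("\t")[:2]
        clusters[rep].append(member)
    return dict(clusters)


def split_by_cluster(
    clusters: Dict[str, List[str]], test_fraction: float = 0.2, seed: int = 0
) -> Dict[str, str]:
    """Assign whole clusters to 'test' until the target size is reached."""
    reps = sorted(clusters)
    random.Random(seed).shuffle(reps)  # deterministic for a given seed
    target = test_fraction * sum(len(m) for m in clusters.values())
    assignment: Dict[str, str] = {}
    n_test = 0
    for rep in reps:
        split = "test" if n_test < target else "train"
        if split == "test":
            n_test += len(clusters[rep])
        for member in clusters[rep]:
            assignment[member] = split
    return assignment
```

Because clusters are assigned whole, the realized test fraction can overshoot the target when a large cluster is drawn early.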

weights/ bundles

| Subdirectory | Evaluator |
| --- | --- |
| weights/nn_one_hot/ | evaluation/evaluate_nn_one_hot.py |
| weights/nn_mean_pertoken_esm/ | evaluation/evaluate_nn_mean_pertoken_esm.py |
| weights/rf_one_hot/ | evaluation/evaluate_rf_one_hot.py |
| weights/rf_mean_pertoken_esm/ | evaluation/evaluate_rf_mean_pertoken_esm.py |

Each bundle contains inference artifacts plus model_parameters.json, random_seed.txt, and hyperparameter CSVs for reproducibility. See weights/README.md for the full file list.

The composition logistic-regression (LR) baseline is not pre-bundled: train it with models/lr_sequence_composition_baseline.py, then pass --model-dir to evaluation/evaluate_lr_sequence_composition.py.

ESM embedding caches (.pt per FASTA header) are not included; create them with extract_embeddings/ and place or symlink mean_embeddings_pt/ and per_token_embeddings_pt/ next to the ESM checkpoint dirs.
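
Before running an ESM evaluator, it can help to confirm that every FASTA header has a matching cache file in both directories. The sketch below is one way to check this. It assumes each file is named <header>.pt using the full header line, which may differ from the naming used by extract_embeddings/. The function names are introduced here for illustration.

```python
from pathlib import Path
from typing import Dict, List

CACHE_DIRS = ("mean_embeddings_pt", "per_token_embeddings_pt")


def fasta_headers(fasta_path: str) -> List[str]:
    """Return FASTA headers (text after '>') in file order."""
    headers = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                headers.append(line[1:].strip())
    return headers


def missing_embeddings(fasta_path: str, root: str) -> Dict[str, List[str]]:
    """Map each cache directory to the headers lacking a <header>.pt file."""
    headers = fasta_headers(fasta_path)
    return {
        name: [h for h in headers if not (Path(root) / name / f"{h}.pt").is_file()]
        for name in CACHE_DIRS
    }
```

An empty list for both directories means the caches cover the whole FASTA input under the assumed naming scheme.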

Datasets

| File | Rows | What it contains |
| --- | --- | --- |
| datasets/full_dataset_mmseq.csv | ~94k | Full MMseq-clustered table: split=train (77k) + split=test (17k) |
| datasets/test_only_mmseq.csv | ~17k | Held-out test split only (balanced highFRET/lowFRET); same rows as split=test in the full file |

  • datasets/natural_proteins/: SCOP globular sets and UniProt coiled-coil reference FASTAs
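
To verify the split sizes and class balance described above, the rows can be tallied by split and label. This is a minimal sketch: the split column name is inferred from the split=train / split=test notation, and the label column name is a hypothetical placeholder. The real CSV schema is documented in the subdirectory READMEs.

```python
import csv
from collections import Counter
from typing import Dict, Iterable


def count_by_split(rows: Iterable[Dict[str, str]], label_column: str = "label") -> Counter:
    """Count rows per (split, label) pair."""
    counts: Counter = Counter()
    for row in rows:
        counts[(row["split"], row[label_column])] += 1
    return counts

# Usage, assuming the column names above:
#   with open("datasets/full_dataset_mmseq.csv", newline="") as fh:
#       print(count_by_split(csv.DictReader(fh)))
```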

Citation

@article{Buchel2026structural,
  title   = {Learning the structural diversity in random protein sequence space},
  author  = {Buchel, Filip and Neuwirthova, Tereza and Tureckiova, Theodora and
             Fuertes, Gustavo and Benda, Ales and Panek, Dalibor and Fricek, Matus and
             AlQuraishi, Mohammed and Hlouchova, Klara},
  journal = {bioRxiv},
  year    = {2026},
  doi     = {10.64898/2026.04.30.722084},
  url     = {https://www.biorxiv.org/content/10.64898/2026.04.30.722084v1}
}