FRET-FACS — FRET classification pipeline

Scripts, pre-trained models, and datasets for binary FRET classifiers (highFRET vs lowFRET) on designed protein variants, plus utilities for ESM embedding extraction and structure-based descriptors.

Paper: Learning the structural diversity in random protein sequence space (bioRxiv 2026; doi:10.64898/2026.04.30.722084). See LICENSE for terms and the recommended citation.

Hub repository: huggingface.co/neuwirtt/FRET-FACS

Download

pip install -U "huggingface_hub[cli]"
hf download neuwirtt/FRET-FACS --local-dir FRET-FACS
cd FRET-FACS

Quick start

Install evaluation dependencies:

pip install -r evaluation/requirements.txt

Evaluate with bundled weights (paths are relative to the repo root):

python evaluation/evaluate_nn_one_hot.py --fasta your_seqs.fa --output eval_out/

Train a new model (example — one-hot neural net):

pip install -r models/requirements_models_training.txt
python models/nn_one_hot.py --fasta seqs.fa --csv labels.csv --output run_out/

ESM-based models require one .pt embedding file per sequence. Extract these first with the scripts in extract_embeddings/, then either train on them or pass their locations to the evaluators via --emb-mean and --emb-token.
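Before training or evaluating an ESM-based model, it can help to confirm that every FASTA record has an embedding file. The following is a minimal sketch, not part of the repository. It assumes that each cache file is named `<header>.pt`, where `<header>` is the full FASTA header line without the leading `>`; check the README in extract_embeddings/ for the actual naming. The function names are hypothetical.

```python
from pathlib import Path


def fasta_headers(fasta_path):
    """Return the header of every FASTA record, without '>' or surrounding whitespace."""
    headers = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                headers.append(line[1:].strip())
    return headers


def missing_embeddings(fasta_path, emb_dir):
    """List headers with no '<header>.pt' file in emb_dir.

    The '<header>.pt' naming is an assumption; adjust it if the
    extraction scripts name their output differently.
    """
    emb_dir = Path(emb_dir)
    return [
        header
        for header in fasta_headers(fasta_path)
        if not (emb_dir / f"{header}.pt").is_file()
    ]

# Example call (paths are placeholders):
# missing_embeddings("seqs.fa", "mean_embeddings_pt")
```

An empty list means every record in the FASTA file has a matching cache file under the assumed naming.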

See the README in each subdirectory for full CLI options, CSV column schemas, and artifact naming.

Repository layout

Directory	Purpose
`models/`	Training scripts (one-hot NN/RF, ESM NN/RF, composition logistic-regression baseline)
`evaluation/`	Inference and supervised scoring on new FASTA inputs using checkpoints in `weights/`
`weights/`	Pre-trained checkpoints, calibration CSVs, and reproducibility metadata (see `weights/README.md`)
`extract_embeddings/`	Extract fair-esm `.pt` embedding caches for ESM-based models
`datasets/`	FRET variant labels and natural-protein reference FASTAs
`structure_predictions_processing/`	Batch PDB + DSSP structure statistics (`pdb_dssp_analyses.py`)
`similarity_split/`	MMseqs2 cluster-based train/test split (`mmseqs_cluster_split.py`)

`weights/` bundles

Subdirectory	Evaluator
`weights/nn_one_hot/`	`evaluation/evaluate_nn_one_hot.py`
`weights/nn_mean_pertoken_esm/`	`evaluation/evaluate_nn_mean_pertoken_esm.py`
`weights/rf_one_hot/`	`evaluation/evaluate_rf_one_hot.py`
`weights/rf_mean_pertoken_esm/`	`evaluation/evaluate_rf_mean_pertoken_esm.py`

Each bundle contains inference artifacts plus model_parameters.json, random_seed.txt, and hyperparameter CSVs for reproducibility. See weights/README.md for the full file list.
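After downloading, a quick check can confirm that each bundle and its evaluator are present. This sketch is not part of the repository: the bundle-to-evaluator mapping is copied from the table above, and only the two reproducibility files named in this section are checked, since the full file list lives in weights/README.md. The function name `check_bundles` is hypothetical.

```python
from pathlib import Path

# Bundle directory -> evaluator script, as listed in the table above.
BUNDLES = {
    "weights/nn_one_hot": "evaluation/evaluate_nn_one_hot.py",
    "weights/nn_mean_pertoken_esm": "evaluation/evaluate_nn_mean_pertoken_esm.py",
    "weights/rf_one_hot": "evaluation/evaluate_rf_one_hot.py",
    "weights/rf_mean_pertoken_esm": "evaluation/evaluate_rf_mean_pertoken_esm.py",
}

# Reproducibility files that each bundle is described as containing.
REQUIRED_FILES = ("model_parameters.json", "random_seed.txt")


def check_bundles(repo_root="."):
    """Return {bundle: [missing paths]} for every bundle with something missing."""
    root = Path(repo_root)
    problems = {}
    for bundle, evaluator in BUNDLES.items():
        # Collect required files that are absent from the bundle directory.
        issues = [
            name for name in REQUIRED_FILES
            if not (root / bundle / name).is_file()
        ]
        # The evaluator script is reported by its repo-relative path.
        if not (root / evaluator).is_file():
            issues.append(evaluator)
        if issues:
            problems[bundle] = issues
    return problems
```

Calling `check_bundles("FRET-FACS")` from the parent of the download directory returns an empty dict when nothing is missing.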

The composition logistic-regression (LR) baseline is not pre-bundled. Train it with models/lr_sequence_composition_baseline.py, then pass the resulting model directory via --model-dir to evaluation/evaluate_lr_sequence_composition.py.

ESM embedding caches (one .pt file per FASTA header) are not included. Create them with the scripts in extract_embeddings/, then place or symlink mean_embeddings_pt/ and per_token_embeddings_pt/ next to the ESM checkpoint directories.
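The symlink option could be scripted roughly as follows. This is a sketch under stated assumptions: `cache_root` is wherever the extraction step wrote the two cache directories, and `target_dir` is the directory that should end up containing the links (for example `weights/`, if "next to the ESM checkpoint dirs" means as siblings of them). Neither location is confirmed by this README, and `link_embedding_caches` is a hypothetical helper.

```python
import os
from pathlib import Path

# Cache directory names taken from the text above.
CACHE_NAMES = ("mean_embeddings_pt", "per_token_embeddings_pt")


def link_embedding_caches(cache_root, target_dir, names=CACHE_NAMES):
    """Symlink each cache directory from cache_root into an existing target_dir.

    Returns the links that were created. Entries that already exist in
    target_dir are left untouched.
    """
    cache_root = Path(cache_root).resolve()
    target_dir = Path(target_dir)
    created = []
    for name in names:
        source = cache_root / name
        link = target_dir / name
        if not source.is_dir():
            raise FileNotFoundError(f"missing embedding cache: {source}")
        if link.exists() or link.is_symlink():
            continue  # something is already in place; do not overwrite it
        os.symlink(source, link, target_is_directory=True)
        created.append(link)
    return created
```

Symlinking avoids copying the per-token caches, which can be large. On Windows, creating symlinks may require elevated privileges, in which case copying the directories is the simpler route.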

Datasets

File	Rows	What it contains
`datasets/full_dataset_mmseq.csv`	~94k	Full MMseq-clustered table: `split=train` (~77k) + `split=test` (~17k)
`datasets/test_only_mmseq.csv`	~17k	Held-out test split only (balanced highFRET/lowFRET); same rows as `split=test` in the full file
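To verify the split sizes or pull out the held-out rows, the full table can be read with the standard library. The sketch below relies only on the `split` column and its `train`/`test` values, which are the only schema details given here; all other columns pass through unchanged. The function names are hypothetical.

```python
import csv
from collections import Counter


def split_counts(csv_path):
    """Count rows per value of the 'split' column."""
    with open(csv_path, newline="") as handle:
        return Counter(row["split"] for row in csv.DictReader(handle))


def held_out_rows(csv_path):
    """Return the rows whose 'split' column equals 'test', as dicts."""
    with open(csv_path, newline="") as handle:
        return [row for row in csv.DictReader(handle) if row["split"] == "test"]

# Example call (run from the repo root):
# split_counts("datasets/full_dataset_mmseq.csv")
```

The counts should roughly match the row totals in the table above, and `held_out_rows` on the full file should return the same rows as `datasets/test_only_mmseq.csv`.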

datasets/natural_proteins/ — SCOP globular sets and UniProt coiled-coil reference FASTAs

Citation

@article{Buchel2026structural,
  title   = {Learning the structural diversity in random protein sequence space},
  author  = {Buchel, Filip and Neuwirthova, Tereza and Tureckiova, Theodora and
             Fuertes, Gustavo and Benda, Ales and Panek, Dalibor and Fricek, Matus and
             AlQuraishi, Mohammed and Hlouchova, Klara},
  journal = {bioRxiv},
  year    = {2026},
  doi     = {10.64898/2026.04.30.722084},
  url     = {https://www.biorxiv.org/content/10.64898/2026.04.30.722084v1}
}
