ESMFold structure statistics (PDB + DSSP)

Batch-compute structure and sequence descriptors from protein structure models and pre-generated DSSP files. Produces one CSV row per structure.

Files

| File | Description |
| --- | --- |
| pdb_dssp_analyses.py | Main analysis script |
| requirements.txt | Python dependencies |

Requirements

  • Python 3.8+
  • Packages in requirements.txt (biopython, numpy, pandas)
  • PDB files (.pdb) — required for sequence extraction, GRAVY, pI, ASA, radius of gyration, N–C distance, and pLDDT
  • DSSP files (.dssp) — required for secondary-structure fractions; must be generated beforehand (see below)

Install:

pip install -r requirements.txt

Input data

PDB structures

Place all model .pdb files in one folder and pass it as --input_files. The script reads coordinates and B-factors from these files.
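As a rough illustration of what reading coordinates and B-factors involves, the sketch below parses Cα records from fixed-width PDB text using only the standard library. It is not the code in pdb_dssp_analyses.py, which depends on Biopython. The names `read_ca_records`, `mean_b_factor`, and `ATOM_FMT` are hypothetical, and averaging over Cα atoms only (rather than all atoms) is an assumption.

```python
def read_ca_records(pdb_text):
    """Return (x, y, z, b_factor) for every C-alpha ATOM record in PDB text."""
    records = []
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns (1-based): atom name 13-16, x 31-38,
        # y 39-46, z 47-54, B-factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            x, y, z = (float(line[i:i + 8]) for i in (30, 38, 46))
            records.append((x, y, z, float(line[60:66])))
    return records


def mean_b_factor(records):
    """Mean of the B-factor column; NaN for an empty list (an assumption)."""
    if not records:
        return float("nan")
    return sum(r[3] for r in records) / len(records)


# Hypothetical two-residue example, built to match the fixed-width ATOM layout.
ATOM_FMT = "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f%6.2f%6.2f"
example = "\n".join([
    ATOM_FMT % (1, " N  ", "MET", "A", 1, 0.0, 0.0, 0.0, 1.0, 80.0),
    ATOM_FMT % (2, " CA ", "MET", "A", 1, 1.5, 0.0, 0.0, 1.0, 90.0),
    ATOM_FMT % (3, " CA ", "GLY", "A", 2, 5.3, 0.0, 0.0, 1.0, 70.0),
])
ca = read_ca_records(example)
print(len(ca), mean_b_factor(ca))
```

For ESMFold models the B-factor column carries the per-residue confidence, so the mean returned here corresponds to a mean pLDDT.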

DSSP secondary structure

DSSP assignments must exist as .dssp files in a separate folder (--dssp_dir). Generate them with mkdssp from the PDB-REDO DSSP package:

input_folder="/path/to/pdbs"
output_folder="/path/to/dssp"
mkdir -p "$output_folder"

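# Loop over the .pdb files directly; quoting keeps paths with spaces intact.
for pdb in "$input_folder"/*.pdb; do
  [ -e "$pdb" ] || continue          # skip if the folder holds no .pdb files
  name="$(basename "$pdb" .pdb)"     # filename stem shared by PDB and DSSP
  echo "$name"
  mkdssp-4.4.0-linux-x64 --write-other "$pdb" "$output_folder/$name.dssp" \
    --output-format=dssp --min-pp-stretch=2
done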

Use the mkdssp binary name from your installation if it differs from mkdssp-4.4.0-linux-x64.

Matching PDB and DSSP

  • Both folders should be flat (no subfolders).
  • Each name.pdb must have a matching name.dssp (same filename stem).
  • Only paired structures are analyzed; unmatched files are skipped with a warning.
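The pairing rule above can be sketched as follows. This is a minimal illustration, not the script's own implementation: `pair_by_stem` is a hypothetical helper, and it returns the unmatched stems instead of emitting a warning.

```python
from pathlib import Path


def pair_by_stem(pdb_dir, dssp_dir):
    """Pair name.pdb with name.dssp by filename stem in two flat folders."""
    pdbs = {p.stem: p for p in Path(pdb_dir).glob("*.pdb")}
    dssps = {p.stem: p for p in Path(dssp_dir).glob("*.dssp")}
    # Stems present in both folders are analyzed.
    paired = {s: (pdbs[s], dssps[s]) for s in sorted(pdbs.keys() & dssps.keys())}
    # Stems present in only one folder would be skipped.
    unmatched = sorted(pdbs.keys() ^ dssps.keys())
    return paired, unmatched
```

Calling `pair_by_stem("/path/to/pdbs", "/path/to/dssp")` before a long run is a quick way to see which structures would be skipped.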

Usage

python pdb_dssp_analyses.py \
  --input_files /path/to/pdbs \
  --dssp_dir /path/to/dssp \
  --output /path/to/stats.csv

If --output is omitted, the CSV is written to the current directory as stats.csv.

Output columns

sequence_id is the PDB/DSSP filename without extension.

| Column | Source | Description |
| --- | --- | --- |
| sequence_id | Filename | Structure identifier |
| sequence | PDB | One-letter sequence (chain A, or first chain) |
| GRAVY | Sequence | Mean Kyte–Doolittle hydropathy |
| pI | Sequence | Isoelectric point (Biopython) |
| helix | DSSP | Fraction helix of any type (3₁₀, α, π, or polyproline II): G, H, I, P |
| strand | DSSP | Fraction β-strand (E) |
| disorder | DSSP | Fraction coil / irregular: -, C, space, B, T, S |
| structured | DSSP | Fraction structured: G, H, I, E, P |
| alpha-helix | DSSP | Fraction α-helix (H) |
| helix-3 | DSSP | Fraction 3₁₀-helix (G) |
| helix-5 | DSSP | Fraction π-helix (I) |
| helix-PPII | DSSP | Fraction polyproline II (P) |
| betabridge | DSSP | Fraction bridge (B) |
| turn | DSSP | Fraction turn (T) |
| bend | DSSP | Fraction bend (S) |
| loops | DSSP | Fraction B + T + S |
| ASA | PDB | Total solvent-accessible surface area (Å², Shrake–Rupley) |
| Rg | PDB | Radius of gyration (Å) |
| NtoCdistance | PDB | Cα distance N-terminus to C-terminus (Å) |
| pLDDT | PDB | Mean B-factor (pLDDT proxy for ESMFold models) |

Secondary-structure fractions use raw DSSP one-letter codes.
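To make the grouped columns concrete, here is a minimal sketch that computes the helix, strand, disorder, structured, and loops fractions from a string of per-residue DSSP codes. The groupings follow the column descriptions in this section; the names `GROUPS` and `ss_fractions` are hypothetical, and returning NaN for an empty string is an assumption rather than documented script behavior.

```python
# Raw DSSP codes counted in each grouped column.
GROUPS = {
    "helix": "GHIP",
    "strand": "E",
    "disorder": "-C BTS",   # includes the blank (space) code
    "structured": "GHIEP",
    "loops": "BTS",
}


def ss_fractions(codes):
    """Return the fraction of residues whose DSSP code falls in each group."""
    n = len(codes)
    if n == 0:
        # Assumed convention for an empty assignment.
        return {name: float("nan") for name in GROUPS}
    return {
        name: sum(c in members for c in codes) / n
        for name, members in GROUPS.items()
    }


fractions = ss_fractions("HHHHEE--TS")
print(fractions)
```

With these groupings, helix, strand, and disorder together cover every code listed in this section, so the three fractions sum to 1 for such input.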

Notes

  • ASA is computed from the PDB structure, not from DSSP relative accessibility.
  • If Biopython cannot parse a .dssp file, the script falls back to a simple line parser for classic DSSP output format.
  • Structures with fewer than two Cα atoms get NaN for NtoCdistance.
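The two geometric descriptors can be sketched with the standard library alone. This is an illustration under assumptions, not the script's code: the radius of gyration here is unweighted (every atom counts equally), whereas the script may weight by atomic mass, and the function names are hypothetical. The NaN rule for fewer than two Cα atoms follows the note above.

```python
import math


def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) tuples."""
    n = len(coords)
    if n == 0:
        return float("nan")  # assumed convention for an empty structure
    # Centroid of the coordinates.
    centroid = tuple(sum(c[i] for c in coords) / n for i in range(3))
    # Root mean squared distance from the centroid.
    mean_sq = sum(math.dist(c, centroid) ** 2 for c in coords) / n
    return math.sqrt(mean_sq)


def n_to_c_distance(ca_coords):
    """Distance between the first and last C-alpha; NaN with fewer than two."""
    if len(ca_coords) < 2:
        return float("nan")
    return math.dist(ca_coords[0], ca_coords[-1])


ca = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(radius_of_gyration(ca), n_to_c_distance(ca))
```

`math.dist` is available from Python 3.8, which matches the minimum version listed under Requirements.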

References

  • Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure. Biopolymers 22, 2577–2637.
  • PDB-REDO DSSP / mkdssp