ESMFold structure statistics (PDB + DSSP)

Batch-compute structure and sequence descriptors from protein structure models and pre-generated DSSP files. Produces one CSV row per structure.

Files

File	Description
`pdb_dssp_analyses.py`	Main analysis script
`requirements.txt`	Python dependencies

Requirements

Python 3.8+
Packages in requirements.txt (biopython, numpy, pandas)
PDB files (.pdb) — required for sequence extraction, GRAVY, pI, ASA, radius of gyration, N–C distance, and pLDDT
DSSP files (.dssp) — required for secondary-structure fractions; must be generated beforehand (see below)

Install:

pip install -r requirements.txt

Input data

PDB structures

Place all model .pdb files in one folder and pass it as --input_files. The script reads coordinates and B-factors from these files.
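
As a rough illustration of where the pLDDT value comes from, the sketch below averages the B-factor field of ATOM records using the fixed-column PDB layout. The function name `mean_bfactor` and the `ca_only` switch are hypothetical, and whether the script averages per residue or per atom is an assumption here, not something this README states.

```python
def mean_bfactor(pdb_lines, ca_only=True):
    """Average the B-factor field (columns 61-66) over ATOM records.

    ca_only=True restricts the average to C-alpha atoms, i.e. one value per
    residue. This choice is an assumption, not necessarily what the script does.
    """
    values = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        if ca_only and line[12:16].strip() != "CA":
            continue
        values.append(float(line[60:66]))
    return sum(values) / len(values) if values else float("nan")


# Three made-up ATOM records laid out in the fixed-column PDB format.
example = [
    "ATOM      1  N   MET A   1      10.000  10.000  10.000  1.00 60.00",
    "ATOM      2  CA  MET A   1      11.000  10.000  10.000  1.00 90.00",
    "ATOM      3  CA  ALA A   2      14.000  14.000  10.000  1.00 70.00",
]
per_residue_mean = mean_bfactor(example)               # averages 90.00 and 70.00
all_atom_mean = mean_bfactor(example, ca_only=False)   # averages all three records
```

For a real model, pass an open file handle instead of the example list.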

DSSP secondary structure

DSSP assignments must exist as .dssp files in a separate folder (--dssp_dir). Generate them with mkdssp from the PDB-REDO DSSP package:

input_folder="/path/to/pdbs"
output_folder="/path/to/dssp"
mkdir -p "$output_folder"

for pdb in "$input_folder"/*.pdb; do
  i=$(basename "$pdb" .pdb)
  echo "$i"
  mkdssp-4.4.0-linux-x64 --write-other "$input_folder/$i.pdb" "$output_folder/$i.dssp" \
    --output-format=dssp --min-pp-stretch=2
done

Use the mkdssp binary name from your installation if it differs from mkdssp-4.4.0-linux-x64.

Matching PDB and DSSP

Both folders should be flat (no subfolders).
Each name.pdb must have a matching name.dssp (same filename stem).
Only paired structures are analyzed; unmatched files are skipped with a warning.
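
The pairing rule above can be sketched with `pathlib` as follows. This is a minimal illustration, not the script's actual code: the function name `pair_by_stem` and the file names in the demonstration are hypothetical.

```python
import tempfile
import warnings
from pathlib import Path


def pair_by_stem(pdb_dir, dssp_dir):
    """Return ({stem: (pdb_path, dssp_path)}, sorted list of unmatched stems)."""
    # glob() is not recursive, which matches the flat-folder requirement.
    pdbs = {p.stem: p for p in Path(pdb_dir).glob("*.pdb")}
    dssps = {p.stem: p for p in Path(dssp_dir).glob("*.dssp")}
    shared = sorted(pdbs.keys() & dssps.keys())
    paired = {stem: (pdbs[stem], dssps[stem]) for stem in shared}
    unmatched = sorted(pdbs.keys() ^ dssps.keys())
    for stem in unmatched:
        warnings.warn(f"skipping {stem}: no matching .pdb/.dssp partner")
    return paired, unmatched


# Demonstration on throwaway folders with hypothetical file names.
with tempfile.TemporaryDirectory() as pdb_dir, tempfile.TemporaryDirectory() as dssp_dir:
    for name in ("model_1.pdb", "model_2.pdb"):
        (Path(pdb_dir) / name).touch()
    (Path(dssp_dir) / "model_1.dssp").touch()
    paired, unmatched = pair_by_stem(pdb_dir, dssp_dir)
    paired_ids = list(paired)
```

Here `model_2` has no DSSP partner, so it ends up in `unmatched` and triggers a warning.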

Usage

python pdb_dssp_analyses.py \
  --input_files /path/to/pdbs \
  --dssp_dir /path/to/dssp \
  --output /path/to/stats.csv

If --output is omitted, the CSV is written to the current directory as stats.csv.
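
Once the CSV exists, it can be filtered with the standard library alone. The sketch below assumes a comma-delimited file with a header row containing `sequence_id` and `pLDDT`; the function name `confident_ids`, the demo rows, and the threshold are invented for illustration, and the pLDDT scale depends on how the models were written.

```python
import csv
import io


def confident_ids(csv_handle, min_plddt):
    """Return the sequence_id of every row whose pLDDT is at least min_plddt."""
    reader = csv.DictReader(csv_handle)
    return [row["sequence_id"] for row in reader if float(row["pLDDT"]) >= min_plddt]


# Invented rows with only three of the output columns, standing in for stats.csv.
demo_csv = io.StringIO(
    "sequence_id,Rg,pLDDT\n"
    "model_1,12.4,91.2\n"
    "model_2,18.9,54.7\n"
)
kept = confident_ids(demo_csv, min_plddt=70.0)
```

With a real file, open it with `open("stats.csv", newline="")` and pass the handle in place of `demo_csv`.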

Output columns

sequence_id is the PDB/DSSP filename without extension.

Column	Source	Description
`sequence_id`	Filename	Structure identifier
`sequence`	PDB	One-letter sequence of chain A, or of the first chain if no chain A is present
`GRAVY`	Sequence	Mean Kyte–Doolittle hydropathy
`pI`	Sequence	Isoelectric point (Biopython)
`helix`	DSSP	Fraction helix of any type (3₁₀, α, π, or PPII): G, H, I, P
`strand`	DSSP	Fraction β-strand (E)
`disorder`	DSSP	Fraction coil / irregular: -, C, space, B, T, S
`structured`	DSSP	Fraction structured: G, H, I, E, P
`alpha-helix`	DSSP	Fraction H
`helix-3`	DSSP	Fraction 3₁₀ (G)
`helix-5`	DSSP	Fraction π-helix (I)
`helix-PPII`	DSSP	Fraction polyproline II (P)
`betabridge`	DSSP	Fraction bridge (B)
`turn`	DSSP	Fraction turn (T)
`bend`	DSSP	Fraction bend (S)
`loops`	DSSP	Fraction B + T + S
`ASA`	PDB	Total solvent-accessible surface area (Å², Shrake–Rupley)
`Rg`	PDB	Radius of gyration (Å)
`NtoCdistance`	PDB	Distance between the Cα atoms of the first and last residues (Å)
`pLDDT`	PDB	Mean B-factor (ESMFold writes per-residue pLDDT to the B-factor column)
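
To make the GRAVY column concrete, here is a dependency-free sketch that averages Kyte-Doolittle hydropathy values over a sequence. The script may well obtain this value through Biopython (whose `ProteinAnalysis` class provides `gravy()` and `isoelectric_point()`); the function below is only an illustration of the definition, and its handling of empty or non-standard input is an assumption.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Mean hydropathy over all residues.

    Returns NaN for an empty sequence and raises KeyError on letters outside
    the 20 standard amino acids (both behaviours are assumptions).
    """
    if not sequence:
        return float("nan")
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence.upper()) / len(sequence)


hydrophobic = gravy("AAAA")   # every residue scores 1.8
mixed = gravy("AR")           # mean of 1.8 and -4.5
```

Positive values indicate a more hydrophobic sequence overall, negative values a more hydrophilic one.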

Secondary-structure fractions use raw DSSP one-letter codes.
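
The groupings in the table can be sketched as a single pass over the per-residue DSSP codes. This is an illustration under one stated assumption: each fraction is divided by the number of residues that carry a DSSP code. The function name `ss_fractions` is hypothetical.

```python
from collections import Counter

# Column name -> DSSP codes counted for that column (taken from the table above).
GROUPS = {
    "helix": "GHIP",
    "strand": "E",
    "disorder": "-C BTS",
    "structured": "GHIEP",
    "alpha-helix": "H",
    "helix-3": "G",
    "helix-5": "I",
    "helix-PPII": "P",
    "betabridge": "B",
    "turn": "T",
    "bend": "S",
    "loops": "BTS",
}


def ss_fractions(codes):
    """Map each output column to its fraction of residues in `codes`."""
    total = len(codes)
    if total == 0:
        return {name: float("nan") for name in GROUPS}
    counts = Counter(codes)  # missing codes count as zero
    return {
        name: sum(counts[code] for code in members) / total
        for name, members in GROUPS.items()
    }


# Ten residues: four H, two E, two '-', one T, one S.
fractions = ss_fractions("HHHHEE--TS")
```

With these groupings, `structured` and `disorder` together cover every code listed in the table, so they sum to 1 for any string made of those codes.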

Notes

ASA is computed from the PDB structure, not from DSSP relative accessibility.
If Biopython cannot parse a .dssp file, the script falls back to a simple line parser for classic DSSP output format.
Structures with fewer than two Cα atoms get NaN for NtoCdistance.
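
The two geometric descriptors, including the NaN rule for short structures, can be sketched from a list of Cα coordinates. The radius of gyration here is the unweighted root-mean-square distance of the Cα atoms from their centroid; the script may instead use all atoms or mass weighting, so treat that as an assumption. Function names are hypothetical, and `math.dist` requires Python 3.8+.

```python
import math


def n_to_c_distance(ca_coords):
    """Distance between the first and last C-alpha; NaN with fewer than two."""
    if len(ca_coords) < 2:
        return float("nan")
    return math.dist(ca_coords[0], ca_coords[-1])


def radius_of_gyration(ca_coords):
    """Unweighted RMS distance of the points from their centroid."""
    if not ca_coords:
        return float("nan")
    n = len(ca_coords)
    centroid = tuple(sum(axis) / n for axis in zip(*ca_coords))
    return math.sqrt(sum(math.dist(c, centroid) ** 2 for c in ca_coords) / n)


# Two made-up C-alpha positions, 5 Å apart.
coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
end_to_end = n_to_c_distance(coords)
rg = radius_of_gyration(coords)
single_residue = n_to_c_distance(coords[:1])  # NaN: only one C-alpha
```

For two points the radius of gyration is half their separation, which makes this example easy to check by hand.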

References

Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure. Biopolymers 22, 2577–2637.
PDB-REDO DSSP / mkdssp