ESMFold structure statistics (PDB + DSSP)
Batch-compute structure and sequence descriptors from protein structure models and pre-generated DSSP files. Produces one CSV row per structure.
Files
| File | Description |
|---|---|
| pdb_dssp_analyses.py | Main analysis script |
| requirements.txt | Python dependencies |
Requirements
- Python 3.8+
- Packages in requirements.txt (biopython, numpy, pandas)
- PDB files (.pdb): required for sequence extraction, GRAVY, pI, ASA, radius of gyration, N–C distance, and pLDDT
- DSSP files (.dssp): required for secondary-structure fractions; must be generated beforehand (see below)
Install:
pip install -r requirements.txt
Input data
PDB structures
Place all model .pdb files in one folder and pass it as --input_files. The script reads coordinates and B-factors from these files.
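As a rough illustration of what reading coordinates and B-factors involves, the sketch below parses Cα ATOM records by fixed PDB columns and derives the mean B-factor, an unweighted radius of gyration, and the N-to-C Cα distance. It is not the script's actual code: the function names are hypothetical, and using Cα atoms only, ignoring chains and alternate locations, and leaving Rg unweighted by mass are all simplifying assumptions.

```python
import math

def read_ca_records(pdb_text):
    """Return (x, y, z, b_factor) for every CA atom found in ATOM records."""
    records = []
    for line in pdb_text.splitlines():
        # Fixed-column PDB layout (1-indexed): atom name 13-16,
        # x/y/z 31-54 in three 8-character fields, B-factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            x, y, z = (float(line[i:i + 8]) for i in (30, 38, 46))
            records.append((x, y, z, float(line[60:66])))
    return records

def mean_b_factor(records):
    """Mean of the B-factor column (holds pLDDT in ESMFold models)."""
    if not records:
        return float("nan")
    return sum(r[3] for r in records) / len(records)

def radius_of_gyration(coords):
    """Unweighted Rg: root-mean-square distance from the centroid."""
    if not coords:
        return float("nan")
    n = len(coords)
    centroid = [sum(c[i] for c in coords) / n for i in range(3)]
    return math.sqrt(sum(math.dist(c, centroid) ** 2 for c in coords) / n)

def n_to_c_distance(coords):
    """Distance between the first and last CA; NaN with fewer than two."""
    if len(coords) < 2:
        return float("nan")
    return math.dist(coords[0], coords[-1])
```

Typical use would be `records = read_ca_records(open(path).read())` followed by `coords = [r[:3] for r in records]`, then passing `coords` or `records` to the three helpers.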
DSSP secondary structure
DSSP assignments must exist as .dssp files in a separate folder (--dssp_dir). Generate them with mkdssp from the PDB-REDO DSSP package:
input_folder="/path/to/pdbs"
output_folder="/path/to/dssp"
mkdir -p "$output_folder"
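# Loop over the .pdb files directly; a glob is safer than parsing `ls` output
for pdb in "$input_folder"/*.pdb; do
    [ -e "$pdb" ] || continue          # skip if the folder has no .pdb files
    name="$(basename "$pdb" .pdb)"     # filename stem shared by PDB and DSSP
    echo "$name"
    mkdssp-4.4.0-linux-x64 --write-other "$pdb" "$output_folder/$name.dssp" \
        --output-format=dssp --min-pp-stretch=2
done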
Use the mkdssp binary name from your installation if it differs from mkdssp-4.4.0-linux-x64.
Matching PDB and DSSP
- Both folders should be flat (no subfolders).
- Each name.pdb must have a matching name.dssp (same filename stem).
- Only paired structures are analyzed; unmatched files are skipped with a warning.
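The pairing rule can be pictured with a short sketch that matches files by filename stem and warns about anything left over. The function name `pair_structures` and the warning text are hypothetical; the script may implement the matching differently.

```python
import warnings
from pathlib import Path

def pair_structures(pdb_dir, dssp_dir):
    """Map each shared filename stem to its (pdb_path, dssp_path) pair."""
    # Non-recursive globs: both folders are expected to be flat.
    pdbs = {p.stem: p for p in Path(pdb_dir).glob("*.pdb")}
    dssps = {p.stem: p for p in Path(dssp_dir).glob("*.dssp")}

    # Stems present in only one folder are reported and skipped.
    for stem in sorted(pdbs.keys() ^ dssps.keys()):
        warnings.warn(f"no matching PDB/DSSP pair for '{stem}'; skipping")

    shared = sorted(pdbs.keys() & dssps.keys())
    return {stem: (pdbs[stem], dssps[stem]) for stem in shared}
```

With `a.pdb` and `b.pdb` in one folder and `a.dssp` and `c.dssp` in the other, only `a` is returned, and `b` and `c` each trigger a warning.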
Usage
python pdb_dssp_analyses.py \
--input_files /path/to/pdbs \
--dssp_dir /path/to/dssp \
--output /path/to/stats.csv
If --output is omitted, the CSV is written to the current directory as stats.csv.
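For readers who want to wrap or reimplement the command line, the following is a minimal sketch of an argparse setup consistent with the flags above. Only the three flag names and the stats.csv default come from this README; marking the two folders as required, the help strings, and the function name `build_parser` are assumptions.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the documented command-line flags."""
    parser = argparse.ArgumentParser(
        description="Compute per-structure statistics from PDB and DSSP files."
    )
    # Assumption: both input folders are mandatory.
    parser.add_argument("--input_files", required=True,
                        help="folder containing the .pdb models")
    parser.add_argument("--dssp_dir", required=True,
                        help="folder containing the matching .dssp files")
    # A relative default resolves against the current working directory.
    parser.add_argument("--output", default="stats.csv",
                        help="path of the CSV file to write")
    return parser
```

Calling `build_parser().parse_args()` without `--output` leaves `args.output` as `stats.csv`.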
Output columns
sequence_id is the PDB/DSSP filename without extension.
| Column | Source | Description |
|---|---|---|
| sequence_id | Filename | Structure identifier |
| sequence | PDB | One-letter sequence (chain A, or first chain) |
| GRAVY | Sequence | Mean Kyte–Doolittle hydropathy |
| pI | Sequence | Isoelectric point (Biopython) |
| helix | DSSP | Fraction helix of any type (3₁₀, α, π, PPII): G, H, I, P |
| strand | DSSP | Fraction β-strand (E) |
| disorder | DSSP | Fraction coil / irregular: -, C, space, B, T, S |
| structured | DSSP | Fraction structured: G, H, I, E, P |
| alpha-helix | DSSP | Fraction α-helix (H) |
| helix-3 | DSSP | Fraction 3₁₀-helix (G) |
| helix-5 | DSSP | Fraction π-helix (I) |
| helix-PPII | DSSP | Fraction polyproline II (P) |
| betabridge | DSSP | Fraction bridge (B) |
| turn | DSSP | Fraction turn (T) |
| bend | DSSP | Fraction bend (S) |
| loops | DSSP | Fraction B + T + S |
| ASA | PDB | Total solvent-accessible surface area (Ų, Shrake–Rupley) |
| Rg | PDB | Radius of gyration (Å) |
| NtoCdistance | PDB | Cα distance N-terminus to C-terminus (Å) |
| pLDDT | PDB | Mean B-factor (pLDDT proxy for ESMFold models) |
Secondary-structure fractions use raw DSSP one-letter codes.
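To make the column definitions concrete, here is a minimal sketch that turns a string of DSSP one-letter codes into the fractions listed in the table. The groupings are taken from the table; the function name `ss_fractions` and the choice to return NaN for an empty string are assumptions.

```python
# DSSP codes counted for each output column, as listed in the table.
SS_GROUPS = {
    "helix": "GHIP",
    "strand": "E",
    "disorder": "-C BTS",   # includes the blank (space) code
    "structured": "GHIEP",
    "alpha-helix": "H",
    "helix-3": "G",
    "helix-5": "I",
    "helix-PPII": "P",
    "betabridge": "B",
    "turn": "T",
    "bend": "S",
    "loops": "BTS",
}

def ss_fractions(ss):
    """Return the fraction of residues falling in each DSSP code group."""
    n = len(ss)
    if n == 0:
        # Assumption: no residues means every fraction is undefined.
        return {name: float("nan") for name in SS_GROUPS}
    return {
        name: sum(code in codes for code in ss) / n
        for name, codes in SS_GROUPS.items()
    }
```

For the string `HHHHEE-TSB`, this gives helix 0.4, strand 0.2, disorder 0.4, structured 0.6, and loops 0.3. With these groupings, structured and disorder sum to 1 whenever every code belongs to one of the two groups.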
Notes
- ASA is computed from the PDB structure, not from DSSP relative accessibility.
- If Biopython cannot parse a .dssp file, the script falls back to a simple line parser for classic DSSP output format.
- Structures with fewer than two Cα atoms get NaN for NtoCdistance.
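A line parser for the classic DSSP format can be quite small, because the residue table uses fixed columns: the amino acid sits in column 14 and the secondary-structure code in column 17. The sketch below is one possible way to write such a fallback, not the script's own parser; the name `parse_classic_dssp` is hypothetical.

```python
def parse_classic_dssp(text):
    """Return (sequence, secondary_structure) from classic DSSP output."""
    seq, ss = [], []
    in_table = False
    for line in text.splitlines():
        # The residue table starts after the "  #  RESIDUE AA STRUCTURE" header.
        if line.lstrip().startswith("#  RESIDUE"):
            in_table = True
            continue
        if not in_table or len(line) < 17:
            continue
        aa = line[13]          # column 14: one-letter amino acid
        if aa == "!":          # chain-break marker, not a residue
            continue
        if aa.islower():       # lowercase letters mark bridged cysteines
            aa = "C"
        seq.append(aa)
        ss.append(line[16])    # column 17: DSSP code, blank means no assignment
    return "".join(seq), "".join(ss)
```

Blank codes are kept as spaces, so the returned string stays aligned with the sequence and the blanks can be counted under disorder.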
References
- Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure. Biopolymers 22, 2577–2637.
- PDB-REDO DSSP / mkdssp