# ESMFold structure statistics (PDB + DSSP)

Batch-compute structure and sequence descriptors from protein structure models and pre-generated DSSP files. Produces one CSV row per structure.

## Files

| File | Description |
|------|-------------|
| `pdb_dssp_analyses.py` | Main analysis script |
| `requirements.txt` | Python dependencies |

## Requirements

- Python 3.8+
- Packages in `requirements.txt` (`biopython`, `numpy`, `pandas`)
- **PDB files** (`.pdb`): required for sequence extraction, GRAVY, pI, ASA, radius of gyration, N–C distance, and pLDDT
- **DSSP files** (`.dssp`): required for secondary-structure fractions; must be generated beforehand (see below)

Install:

```bash
pip install -r requirements.txt
```

## Input data

### PDB structures

Place all model `.pdb` files in one folder and pass it as `--input_files`. The script reads coordinates and B-factors from these files.

### DSSP secondary structure

DSSP assignments must exist as `.dssp` files in a separate folder (`--dssp_dir`). Generate them with **mkdssp** from the [PDB-REDO DSSP](https://github.com/PDB-REDO/dssp) package:

```bash
input_folder="/path/to/pdbs"
output_folder="/path/to/dssp"
mkdir -p "$output_folder"

# Run mkdssp on every .pdb file, writing <stem>.dssp to the output folder
for pdb in "$input_folder"/*.pdb; do
    stem="$(basename "$pdb" .pdb)"
    echo "$stem"
    mkdssp-4.4.0-linux-x64 --write-other "$pdb" "$output_folder/$stem.dssp" \
        --output-format=dssp --min-pp-stretch=2
done
```

If your installation uses a different binary name (for example plain `mkdssp`), substitute it for `mkdssp-4.4.0-linux-x64`.

### Matching PDB and DSSP

- Both folders should be **flat** (no subfolders).
- Each `name.pdb` must have a matching `name.dssp` (same filename stem).
- Only paired structures are analyzed; unmatched files are skipped with a warning.

## Usage

```bash
python pdb_dssp_analyses.py \
    --input_files /path/to/pdbs \
    --dssp_dir /path/to/dssp \
    --output /path/to/stats.csv
```

If `--output` is omitted, the CSV is written to the current directory as `stats.csv`.
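
Because unmatched files are skipped, it can help to check the pairing before a long run. The following is a minimal sketch of the filename-stem rule described under "Matching PDB and DSSP", assuming both folders are flat. `pair_by_stem` is a hypothetical helper written for illustration; it is not part of `pdb_dssp_analyses.py`, and the script's own matching logic may differ in detail.

```python
from pathlib import Path


def pair_by_stem(pdb_dir, dssp_dir):
    """Pair name.pdb with name.dssp by filename stem (flat folders only).

    Returns (paired, pdb_only, dssp_only), where `paired` maps each shared
    stem to a (pdb_path, dssp_path) tuple and the other two are sorted lists
    of stems found in only one of the folders.
    """
    pdb = {p.stem: p for p in Path(pdb_dir).glob("*.pdb")}
    dssp = {p.stem: p for p in Path(dssp_dir).glob("*.dssp")}

    shared = sorted(pdb.keys() & dssp.keys())
    paired = {stem: (pdb[stem], dssp[stem]) for stem in shared}
    pdb_only = sorted(pdb.keys() - dssp.keys())
    dssp_only = sorted(dssp.keys() - pdb.keys())
    return paired, pdb_only, dssp_only
```

Calling `pair_by_stem("/path/to/pdbs", "/path/to/dssp")` and inspecting the two "only" lists shows which structures would be left out of the CSV.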

## Output columns

`sequence_id` is the PDB/DSSP filename without extension.

| Column | Source | Description |
|--------|--------|-------------|
| `sequence_id` | Filename | Structure identifier |
| `sequence` | PDB | One-letter sequence (chain A, or first chain) |
| `GRAVY` | Sequence | Mean Kyte–Doolittle hydropathy |
| `pI` | Sequence | Isoelectric point (Biopython) |
| `helix` | DSSP | Fraction helix (3₁₀, α, π, or PPII): G, H, I, P |
| `strand` | DSSP | Fraction β-strand (E) |
| `disorder` | DSSP | Fraction coil / irregular: -, C, space, B, T, S |
| `structured` | DSSP | Fraction structured: G, H, I, E, P |
| `alpha-helix` | DSSP | Fraction α-helix (H) |
| `helix-3` | DSSP | Fraction 3₁₀-helix (G) |
| `helix-5` | DSSP | Fraction π-helix (I) |
| `helix-PPII` | DSSP | Fraction polyproline II (P) |
| `betabridge` | DSSP | Fraction bridge (B) |
| `turn` | DSSP | Fraction turn (T) |
| `bend` | DSSP | Fraction bend (S) |
| `loops` | DSSP | Fraction B + T + S |
| `ASA` | PDB | Total solvent-accessible surface area (Å², Shrake–Rupley) |
| `Rg` | PDB | Radius of gyration (Å) |
| `NtoCdistance` | PDB | Cα distance from N-terminus to C-terminus (Å) |
| `pLDDT` | PDB | Mean B-factor (pLDDT proxy for ESMFold models) |

Secondary-structure fractions use **raw DSSP one-letter codes**.

## Notes

- **ASA** is computed from the PDB structure, not from DSSP relative accessibility.
- If Biopython cannot parse a `.dssp` file, the script falls back to a simple line parser for the classic DSSP output format.
- Structures with fewer than two Cα atoms get `NaN` for `NtoCdistance`.

## References

- Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure. *Biopolymers* 22, 2577–2637.
- [PDB-REDO DSSP / mkdssp](https://github.com/PDB-REDO/dssp)