| # ESMFold structure statistics (PDB + DSSP) |
|
|
| Batch-compute structure and sequence descriptors from protein structure models and pre-generated DSSP files. Produces one CSV row per structure. |
|
|
| ## Files |
|
|
| | File | Description | |
| |------|-------------| |
| | `pdb_dssp_analyses.py` | Main analysis script | |
| | `requirements.txt` | Python dependencies | |
|
|
| ## Requirements |
|
|
| - Python 3.8+ |
| - Packages in `requirements.txt` (`biopython`, `numpy`, `pandas`) |
| - **PDB files** (`.pdb`) — required for sequence extraction, GRAVY, pI, ASA, radius of gyration, N–C distance, and pLDDT |
| - **DSSP files** (`.dssp`) — required for secondary-structure fractions; must be generated beforehand (see below) |
|
|
| Install: |
|
|
| ```bash |
| pip install -r requirements.txt |
| ``` |
|
|
| ## Input data |
|
|
| ### PDB structures |
|
|
| Place all model `.pdb` files in one folder and pass it as `--input_files`. The script reads coordinates and B-factors from these files. |
|
|
| ### DSSP secondary structure |
|
|
| DSSP assignments must exist as `.dssp` files in a separate folder (`--dssp_dir`). Generate them with **mkdssp** from the [PDB-REDO DSSP](https://github.com/PDB-REDO/dssp) package: |
|
|
| ```bash |
| input_folder="/path/to/pdbs" |
| output_folder="/path/to/dssp" |
| mkdir -p "$output_folder" |
| |
| # Loop over the .pdb files directly; quoting keeps paths with spaces intact. |
| for pdb in "$input_folder"/*.pdb; do |
|     name=$(basename "$pdb" .pdb) |
|     echo "$name" |
|     mkdssp-4.4.0-linux-x64 --write-other "$pdb" "$output_folder/$name.dssp" \ |
|         --output-format=dssp --min-pp-stretch=2 |
| done |
| ``` |
|
|
| If your installation names the binary differently, replace `mkdssp-4.4.0-linux-x64` with that name (for example `mkdssp`). |
|
|
| ### Matching PDB and DSSP |
|
|
| - Both folders should be **flat** (no subfolders). |
| - Each `name.pdb` must have a matching `name.dssp` (same filename stem). |
| - Only paired structures are analyzed; unmatched files are skipped with a warning. |
|
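The pairing rule above can be sketched in a few lines. This is only an illustration under stated assumptions, not the script's actual code: `pair_by_stem` is a hypothetical helper name, and the warning text is made up.

```python
from pathlib import Path
import warnings


def pair_by_stem(pdb_dir, dssp_dir):
    """Return {stem: (pdb_path, dssp_path)} for files present in both flat folders."""
    # glob() is non-recursive, which matches the "flat folder" requirement.
    pdbs = {p.stem: p for p in Path(pdb_dir).glob("*.pdb")}
    dssps = {p.stem: p for p in Path(dssp_dir).glob("*.dssp")}

    # Stems found in only one folder have no partner and are left out.
    for stem in sorted(set(pdbs) ^ set(dssps)):
        warnings.warn(f"no matching PDB/DSSP pair for '{stem}', skipping")

    shared = sorted(set(pdbs) & set(dssps))
    return {stem: (pdbs[stem], dssps[stem]) for stem in shared}
```

With `a.pdb` and `b.pdb` in one folder and `a.dssp` and `c.dssp` in the other, only `a` would be returned, and `b` and `c` would each trigger a warning.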
|
| ## Usage |
|
|
| ```bash |
| python pdb_dssp_analyses.py \ |
| --input_files /path/to/pdbs \ |
| --dssp_dir /path/to/dssp \ |
| --output /path/to/stats.csv |
| ``` |
|
|
| If `--output` is omitted, the CSV is written to the current directory as `stats.csv`. |
|
|
| ## Output columns |
|
|
| `sequence_id` is the PDB/DSSP filename without extension. |
|
|
| | Column | Source | Description | |
| |--------|--------|-------------| |
| | `sequence_id` | Filename | Structure identifier | |
| | `sequence` | PDB | One-letter sequence (chain A, or first chain) | |
| | `GRAVY` | Sequence | Mean Kyte–Doolittle hydropathy | |
| | `pI` | Sequence | Isoelectric point (Biopython) | |
| `helix` | DSSP | Fraction helix of any type (3₁₀, α, π, or polyproline II): G, H, I, P |
| | `strand` | DSSP | Fraction β-strand (E) | |
| | `disorder` | DSSP | Fraction coil / irregular: -, C, space, B, T, S | |
| | `structured` | DSSP | Fraction structured: G, H, I, E, P | |
| | `alpha-helix` | DSSP | Fraction H | |
| | `helix-3` | DSSP | Fraction 3₁₀ (G) | |
| | `helix-5` | DSSP | Fraction π-helix (I) | |
| | `helix-PPII` | DSSP | Fraction polyproline II (P) | |
| | `betabridge` | DSSP | Fraction bridge (B) | |
| | `turn` | DSSP | Fraction turn (T) | |
| | `bend` | DSSP | Fraction bend (S) | |
| | `loops` | DSSP | Fraction B + T + S | |
| | `ASA` | PDB | Total solvent-accessible surface area (Ų, Shrake–Rupley) | |
| | `Rg` | PDB | Radius of gyration (Å) | |
| | `NtoCdistance` | PDB | Cα distance N-terminus to C-terminus (Å) | |
| `pLDDT` | PDB | Mean B-factor (ESMFold writes per-residue pLDDT to the B-factor column) |
|
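For orientation, `GRAVY` is the arithmetic mean of per-residue Kyte–Doolittle hydropathy values. The sketch below shows that calculation in plain Python. It is not taken from the script, which may instead call Biopython's `ProteinAnalysis`. The function name `gravy` and the choice to skip non-standard residues are assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Mean hydropathy over standard residues; NaN if none are present."""
    # Assumption: letters outside the 20 standard residues (e.g. X) are ignored.
    values = [KD[aa] for aa in sequence.upper() if aa in KD]
    return sum(values) / len(values) if values else float("nan")
```

Positive values indicate a hydrophobic sequence on average, negative values a hydrophilic one.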
|
| Secondary-structure fractions use **raw DSSP one-letter codes**. |
|
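The code groupings in the table can be written as a lookup from column name to DSSP codes. The following is a minimal sketch, not the script's implementation. It assumes the input is a string with one DSSP code per residue and that each fraction is divided by the total number of residues in that string. `GROUPS` and `ss_fractions` are hypothetical names.

```python
from collections import Counter

# Code groups copied from the output-column table.
GROUPS = {
    "helix": "GHIP",
    "strand": "E",
    "disorder": "-C BTS",  # includes the blank (space) code
    "structured": "GHIEP",
    "alpha-helix": "H",
    "helix-3": "G",
    "helix-5": "I",
    "helix-PPII": "P",
    "betabridge": "B",
    "turn": "T",
    "bend": "S",
    "loops": "BTS",
}


def ss_fractions(ss):
    """Map each column name to the fraction of residues whose code is in its group."""
    total = len(ss)
    if total == 0:
        return {name: float("nan") for name in GROUPS}
    counts = Counter(ss)
    return {
        name: sum(counts[code] for code in codes) / total
        for name, codes in GROUPS.items()
    }
```

Because B, T, and S appear in both `disorder` and `loops`, the columns overlap and are not expected to sum to 1.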
|
| ## Notes |
|
|
| - **ASA** is computed from the PDB structure, not from DSSP relative accessibility. |
| - If Biopython cannot parse a `.dssp` file, the script falls back to a simple line parser for classic DSSP output format. |
| - Structures with fewer than two Cα atoms get `NaN` for `NtoCdistance`. |
|
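The two geometric descriptors can be illustrated without any dependencies. This sketch assumes an unweighted radius of gyration over whatever coordinates are passed in. The script may use all atoms, only Cα atoms, or mass weighting, and the function names here are hypothetical.

```python
import math


def radius_of_gyration(coords):
    """Unweighted Rg: root-mean-square distance of the points from their centroid."""
    n = len(coords)
    if n == 0:
        return float("nan")
    centroid = [sum(c[i] for c in coords) / n for i in range(3)]
    return math.sqrt(sum(math.dist(c, centroid) ** 2 for c in coords) / n)


def n_to_c_distance(ca_coords):
    """Distance between the first and last Cα; NaN with fewer than two atoms."""
    if len(ca_coords) < 2:
        return float("nan")
    return math.dist(ca_coords[0], ca_coords[-1])
```

Coordinates are `(x, y, z)` tuples in Å, so both results are in Å as well. `math.dist` requires Python 3.8 or newer, which matches the stated minimum version.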
|
| ## References |
|
|
| - Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure. *Biopolymers* 22, 2577–2637. |
| - [PDB-REDO DSSP / mkdssp](https://github.com/PDB-REDO/dssp) |
|
|