# ESMFold structure statistics (PDB + DSSP)

Batch-compute structure and sequence descriptors from protein structure models and pre-generated DSSP files. Produces one CSV row per structure.

## Files

| File | Description |
|------|-------------|
| `pdb_dssp_analyses.py` | Main analysis script |
| `requirements.txt` | Python dependencies |

## Requirements

- Python 3.8+
- Packages in `requirements.txt` (`biopython`, `numpy`, `pandas`)
- **PDB files** (`.pdb`) — required for sequence extraction, GRAVY, pI, ASA, radius of gyration, N–C distance, and pLDDT
- **DSSP files** (`.dssp`) — required for secondary-structure fractions; must be generated beforehand (see below)

Install:

```bash
pip install -r requirements.txt
```

## Input data

### PDB structures

Place all model `.pdb` files in one folder and pass the folder path via `--input_files`. The script reads coordinates and B-factors from these files.

### DSSP secondary structure

DSSP assignments must exist as `.dssp` files in a separate folder (`--dssp_dir`). Generate them with **mkdssp** from the [PDB-REDO DSSP](https://github.com/PDB-REDO/dssp) package:

```bash
input_folder="/path/to/pdbs"
output_folder="/path/to/dssp"
mkdir -p "$output_folder"

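# Loop over the .pdb files with a glob instead of parsing `ls` output
for pdb in "$input_folder"/*.pdb; do
  [ -e "$pdb" ] || continue          # skip the literal pattern if no .pdb files exist
  stem="$(basename "$pdb" .pdb)"     # filename without the .pdb extension
  echo "$stem"
  mkdssp-4.4.0-linux-x64 --write-other "$pdb" "$output_folder/$stem.dssp" \
    --output-format=dssp --min-pp-stretch=2
done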
```

Use the `mkdssp` binary name from your installation if it differs from `mkdssp-4.4.0-linux-x64`.

### Matching PDB and DSSP

- Both folders should be **flat** (no subfolders).
- Each `name.pdb` must have a matching `name.dssp` (same filename stem).
- Only paired structures are analyzed; unmatched files are skipped with a warning.
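
The pairing rule above can be sketched in a few lines. This is an illustration only, not the code in `pdb_dssp_analyses.py`; the function name `pair_by_stem` and the warning text are hypothetical.

```python
from pathlib import Path
import warnings


def pair_by_stem(pdb_dir, dssp_dir):
    """Return {stem: (pdb_path, dssp_path)} for stems present in both flat folders."""
    pdbs = {p.stem: p for p in Path(pdb_dir).glob("*.pdb")}
    dssps = {p.stem: p for p in Path(dssp_dir).glob("*.dssp")}

    # Stems found in only one folder are reported and skipped.
    for stem in sorted(set(pdbs) ^ set(dssps)):
        missing = ".dssp" if stem in pdbs else ".pdb"
        warnings.warn(f"{stem}: no matching {missing} file, skipping")

    return {stem: (pdbs[stem], dssps[stem]) for stem in sorted(set(pdbs) & set(dssps))}
```

`glob` is not recursive here, so files in subfolders are ignored, which matches the flat-folder requirement.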

## Usage

```bash
python pdb_dssp_analyses.py \
  --input_files /path/to/pdbs \
  --dssp_dir /path/to/dssp \
  --output /path/to/stats.csv
```

If `--output` is omitted, the CSV is written to the current directory as `stats.csv`.
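
Once the CSV exists, it can be post-processed with the standard library. The sketch below selects structures by mean pLDDT; the helper name `confident_ids` and the default cutoff of 70 are hypothetical, and the cutoff assumes pLDDT is stored on a 0 to 100 scale, which should be checked against your models.

```python
import csv


def confident_ids(csv_path, min_plddt=70.0):
    """Return sequence_id values whose mean pLDDT is at least min_plddt."""
    with open(csv_path, newline="") as handle:
        return [
            row["sequence_id"]
            for row in csv.DictReader(handle)
            if float(row["pLDDT"]) >= min_plddt  # cutoff is an assumption, not a script default
        ]
```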

## Output columns

`sequence_id` is the PDB/DSSP filename without extension.

| Column | Source | Description |
|--------|--------|-------------|
| `sequence_id` | Filename | Structure identifier |
| `sequence` | PDB | One-letter sequence (chain A, or first chain) |
| `GRAVY` | Sequence | Mean Kyte–Doolittle hydropathy |
| `pI` | Sequence | Isoelectric point (Biopython) |
| `helix` | DSSP | Fraction helix of any type (3₁₀, α, π, or polyproline II): G, H, I, P |
| `strand` | DSSP | Fraction β-strand (E) |
| `disorder` | DSSP | Fraction coil / irregular: -, C, space, B, T, S |
| `structured` | DSSP | Fraction structured: G, H, I, E, P |
| `alpha-helix` | DSSP | Fraction H |
| `helix-3` | DSSP | Fraction 3₁₀ (G) |
| `helix-5` | DSSP | Fraction π-helix (I) |
| `helix-PPII` | DSSP | Fraction polyproline II (P) |
| `betabridge` | DSSP | Fraction bridge (B) |
| `turn` | DSSP | Fraction turn (T) |
| `bend` | DSSP | Fraction bend (S) |
| `loops` | DSSP | Fraction B + T + S |
| `ASA` | PDB | Total solvent-accessible surface area (Å², Shrake–Rupley) |
| `Rg` | PDB | Radius of gyration (Å) |
| `NtoCdistance` | PDB | Cα distance N-terminus to C-terminus (Å) |
| `pLDDT` | PDB | Mean B-factor (ESMFold models store pLDDT in the B-factor column) |
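
For orientation, the two geometric columns `Rg` and `NtoCdistance` can be sketched as follows. This is not the script's implementation: the function names are hypothetical, and the radius of gyration here is unweighted over whichever atoms are passed in, whereas the script may use mass weighting or a different atom selection.

```python
import math


def radius_of_gyration(coords):
    """Unweighted radius of gyration, in the units of the input (e.g. Å)."""
    n = len(coords)
    if n == 0:
        return math.nan
    # Geometric centre of the coordinates (assumption: no mass weighting).
    center = [sum(c[i] for c in coords) / n for i in range(3)]
    mean_sq = sum(math.dist(c, center) ** 2 for c in coords) / n
    return math.sqrt(mean_sq)


def n_to_c_distance(ca_coords):
    """Distance between the first and last Cα; NaN with fewer than two Cα atoms."""
    if len(ca_coords) < 2:
        return math.nan
    return math.dist(ca_coords[0], ca_coords[-1])
```

Both functions take plain `(x, y, z)` tuples, so they work with coordinates from any PDB parser.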

Secondary-structure fractions use **raw DSSP one-letter codes**.
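
As a rough illustration of how such fractions follow from a per-residue DSSP string, the sketch below groups codes exactly as listed in the output-column table. The names `GROUPS` and `ss_fractions` are hypothetical, and dividing by the total number of residues in the string is an assumption about the denominator.

```python
from collections import Counter

# Code groups copied from the output-column table; a space is a valid DSSP code.
GROUPS = {
    "helix": "GHIP",
    "strand": "E",
    "disorder": "-C BTS",
    "structured": "GHIEP",
    "loops": "BTS",
}


def ss_fractions(ss):
    """Map a per-residue DSSP string to the fraction of residues in each group."""
    total = len(ss)
    if total == 0:
        return {name: float("nan") for name in GROUPS}
    counts = Counter(ss)
    return {
        name: sum(counts[code] for code in codes) / total
        for name, codes in GROUPS.items()
    }
```

With these groups, `structured` and `disorder` sum to 1 for any string that contains only the codes listed in the table.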

## Notes

- **ASA** is computed from the PDB structure, not from DSSP relative accessibility.
- If Biopython cannot parse a `.dssp` file, the script falls back to a simple line parser for classic DSSP output format.
- Structures with fewer than two Cα atoms get `NaN` for `NtoCdistance`.
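
A simple line parser for the classic DSSP format, like the fallback mentioned above, might look like the sketch below. It is not the parser shipped in the script; the name `parse_classic_dssp` is hypothetical. It relies on the fixed-column layout of classic DSSP output, where the per-residue table follows the `  #  RESIDUE` header line, the amino acid sits in column 14, and the secondary-structure code sits in column 17 (1-based).

```python
def parse_classic_dssp(lines):
    """Return (amino_acids, ss_codes) strings from classic-format DSSP lines."""
    aa, ss = [], []
    in_table = False
    for line in lines:
        if line.startswith("  #  RESIDUE"):
            in_table = True  # per-residue records start on the next line
            continue
        if not in_table or len(line) < 17:
            continue
        if line[13] == "!":  # chain-break marker, not a residue
            continue
        aa.append(line[13])  # one-letter amino acid (column 14, 1-based)
        ss.append(line[16])  # DSSP code; a space means no assignment
    return "".join(aa), "".join(ss)
```

Usage: `with open("name.dssp") as fh: seq, ss = parse_classic_dssp(fh)`. Lines must keep their original column positions, so avoid stripping leading whitespace before parsing.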

## References

- Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. *Biopolymers* 22, 2577–2637.
- [PDB-REDO DSSP / mkdssp](https://github.com/PDB-REDO/dssp)