# ESMFold structure statistics (PDB + DSSP)
Batch-compute structure and sequence descriptors from protein structure models and pre-generated DSSP files. Produces one CSV row per structure.
## Files
| File | Description |
|------|-------------|
| `pdb_dssp_analyses.py` | Main analysis script |
| `requirements.txt` | Python dependencies |
## Requirements
- Python 3.8+
- Packages in `requirements.txt` (`biopython`, `numpy`, `pandas`)
- **PDB files** (`.pdb`) — required for sequence extraction, GRAVY, pI, ASA, radius of gyration, N–C distance, and pLDDT
- **DSSP files** (`.dssp`) — required for secondary-structure fractions; must be generated beforehand (see below)
Install:
```bash
pip install -r requirements.txt
```
## Input data
### PDB structures
Place all model `.pdb` files in one folder and pass it as `--input_files`. The script reads coordinates and B-factors from these files.
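As a rough illustration of what "reading coordinates and B-factors" involves, the sketch below parses `ATOM` records using the fixed-column PDB layout. It is dependency-free and is not the code in `pdb_dssp_analyses.py`, which presumably relies on Biopython. The function names `read_atoms` and `mean_bfactor` are hypothetical, and averaging over all atoms (rather than, say, Cα atoms only) is an assumption.

```python
from statistics import mean

def read_atoms(pdb_text):
    """Parse ATOM records from PDB text using fixed column positions."""
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue  # skip HETATM, TER, REMARK, etc.
        atoms.append({
            "name": line[12:16].strip(),    # atom name, columns 13-16
            "chain": line[21],              # chain ID, column 22
            "xyz": (float(line[30:38]),     # x, columns 31-38
                    float(line[38:46]),     # y, columns 39-46
                    float(line[46:54])),    # z, columns 47-54
            "bfactor": float(line[60:66]),  # B-factor, columns 61-66
        })
    return atoms

def mean_bfactor(atoms):
    """Mean B-factor over all parsed atoms (assumption); NaN if none."""
    return mean(a["bfactor"] for a in atoms) if atoms else float("nan")
```

For a model on disk, something like `mean_bfactor(read_atoms(open("model.pdb").read()))` would give the value that the B-factor column encodes.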
### DSSP secondary structure
DSSP assignments must exist as `.dssp` files in a separate folder (`--dssp_dir`). Generate them with **mkdssp** from the [PDB-REDO DSSP](https://github.com/PDB-REDO/dssp) package:
```bash
input_folder="/path/to/pdbs"
output_folder="/path/to/dssp"
mkdir -p "$output_folder"
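# Loop over the .pdb files directly instead of parsing `ls` output,
# and quote paths so names containing spaces are handled.
for pdb in "$input_folder"/*.pdb; do
    [ -e "$pdb" ] || continue   # no .pdb files matched
    name="$(basename "$pdb" .pdb)"
    echo "$name"
    mkdssp-4.4.0-linux-x64 --write-other "$pdb" "$output_folder/$name.dssp" \
        --output-format=dssp --min-pp-stretch=2
done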
```
If your installation names the binary differently (for example plain `mkdssp`), use that name in place of `mkdssp-4.4.0-linux-x64`.
### Matching PDB and DSSP
- Both folders should be **flat** (no subfolders).
- Each `name.pdb` must have a matching `name.dssp` (same filename stem).
- Only paired structures are analyzed; unmatched files are skipped with a warning.
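The pairing rule above can be sketched as follows. This is a minimal illustration, not the script's actual implementation; the function name `pair_by_stem` and the warning text are hypothetical.

```python
from pathlib import Path
import warnings

def pair_by_stem(pdb_dir, dssp_dir):
    """Return {stem: (pdb_path, dssp_path)} for stems present in both flat folders."""
    pdbs = {p.stem: p for p in Path(pdb_dir).glob("*.pdb")}
    dssps = {p.stem: p for p in Path(dssp_dir).glob("*.dssp")}
    # Stems found in only one of the two folders are reported and skipped.
    for stem in sorted(set(pdbs) ^ set(dssps)):
        warnings.warn(f"skipping unpaired structure: {stem}")
    return {s: (pdbs[s], dssps[s]) for s in sorted(set(pdbs) & set(dssps))}
```

Because `glob("*.pdb")` does not recurse, files in subfolders are never seen, which is why both folders need to be flat.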
## Usage
```bash
python pdb_dssp_analyses.py \
--input_files /path/to/pdbs \
--dssp_dir /path/to/dssp \
--output /path/to/stats.csv
```
If `--output` is omitted, the CSV is written to the current directory as `stats.csv`.
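A command-line interface with this behaviour could be declared with `argparse` roughly as below. The flag names and the `stats.csv` default come from this README; marking the two folder arguments as required, the help strings, and the `build_parser` name are assumptions, and the real script may differ.

```python
import argparse

def build_parser():
    """Hypothetical parser mirroring the flags documented above."""
    parser = argparse.ArgumentParser(
        description="Structure statistics from PDB and DSSP files")
    parser.add_argument("--input_files", required=True, help="folder of .pdb files")
    parser.add_argument("--dssp_dir", required=True, help="folder of .dssp files")
    # A relative default resolves against the current working directory.
    parser.add_argument("--output", default="stats.csv", help="output CSV path")
    return parser

args = build_parser().parse_args(["--input_files", "pdbs", "--dssp_dir", "dssp"])
print(args.output)  # stats.csv
```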
## Output columns
`sequence_id` is the PDB/DSSP filename without extension.
| Column | Source | Description |
|--------|--------|-------------|
| `sequence_id` | Filename | Structure identifier |
| `sequence` | PDB | One-letter sequence (chain A if present, otherwise the first chain) |
| `GRAVY` | Sequence | Mean Kyte–Doolittle hydropathy |
| `pI` | Sequence | Isoelectric point (Biopython) |
| `helix` | DSSP | Fraction helix of any type (3₁₀, α, π, or polyproline II): G, H, I, P |
| `strand` | DSSP | Fraction β-strand (E) |
| `disorder` | DSSP | Fraction coil / irregular: -, C, space, B, T, S |
| `structured` | DSSP | Fraction structured: G, H, I, E, P |
| `alpha-helix` | DSSP | Fraction H |
| `helix-3` | DSSP | Fraction 3₁₀ (G) |
| `helix-5` | DSSP | Fraction π-helix (I) |
| `helix-PPII` | DSSP | Fraction polyproline II (P) |
| `betabridge` | DSSP | Fraction bridge (B) |
| `turn` | DSSP | Fraction turn (T) |
| `bend` | DSSP | Fraction bend (S) |
| `loops` | DSSP | Fraction B + T + S |
| `ASA` | PDB | Total solvent-accessible surface area (Ų, Shrake–Rupley) |
| `Rg` | PDB | Radius of gyration (Å) |
| `NtoCdistance` | PDB | Cα distance N-terminus to C-terminus (Å) |
| `pLDDT` | PDB | Mean B-factor (pLDDT proxy for ESMFold models) |
Secondary-structure fractions use **raw DSSP one-letter codes**.
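To make the grouped columns concrete, the sketch below computes them from a string of per-residue DSSP codes. The groupings are copied from the table; the `GROUPS` and `ss_fractions` names and the NaN result for an empty input are assumptions rather than the script's confirmed behaviour.

```python
# Raw DSSP codes counted toward each grouped column (from the table above).
GROUPS = {
    "helix": "GHIP",
    "strand": "E",
    "disorder": "-C BTS",   # includes the blank (space) code
    "structured": "GHIEP",
    "loops": "BTS",
}

def ss_fractions(codes):
    """Fraction of residues whose raw DSSP code falls in each group."""
    n = len(codes)
    if n == 0:
        return {name: float("nan") for name in GROUPS}
    return {name: sum(c in members for c in codes) / n
            for name, members in GROUPS.items()}
```

With these groupings `helix`, `strand`, and `disorder` partition the listed DSSP codes, so the three fractions sum to 1, while `structured` equals `helix + strand`.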
## Notes
- **ASA** is computed from the PDB structure, not from DSSP relative accessibility.
- If Biopython cannot parse a `.dssp` file, the script falls back to a simple line parser for classic DSSP output format.
- Structures with fewer than two Cα atoms get `NaN` for `NtoCdistance`.
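The two geometric descriptors and the `NaN` rule can be illustrated with the standard library alone. This is a sketch under stated assumptions: the radius of gyration here is unweighted (every atom counts equally), whereas the script may weight by atomic mass, and the function names are hypothetical.

```python
import math

def n_to_c_distance(ca_coords):
    """Distance between the first and last Cα; NaN with fewer than two Cα atoms."""
    if len(ca_coords) < 2:
        return float("nan")
    return math.dist(ca_coords[0], ca_coords[-1])

def radius_of_gyration(coords):
    """Unweighted radius of gyration: RMS distance of atoms from their centroid."""
    if not coords:
        return float("nan")
    n = len(coords)
    centroid = tuple(sum(axis) / n for axis in zip(*coords))
    return math.sqrt(sum(math.dist(p, centroid) ** 2 for p in coords) / n)
```

Both functions take `(x, y, z)` tuples in Å, so the results are in Å as well. `math.dist` requires Python 3.8 or newer, which matches the stated requirement.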
## References
- Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure. *Biopolymers* 22, 2577–2637.
- [PDB-REDO DSSP / mkdssp](https://github.com/PDB-REDO/dssp)