neuwirtt

Initial release: FRET-FACS pipeline, weights, and datasets

	# ESMFold structure statistics (PDB + DSSP)

	Batch-compute structure and sequence descriptors from protein structure models and pre-generated DSSP files. Produces one CSV row per structure.

	## Files

	| File | Description |
	|------|-------------|
	| `pdb_dssp_analyses.py` | Main analysis script |
	| `requirements.txt` | Python dependencies |

	## Requirements

	- Python 3.8+
	- Packages in `requirements.txt` (`biopython`, `numpy`, `pandas`)
	- PDB files (`.pdb`) — required for sequence extraction, GRAVY, pI, ASA, radius of gyration, N–C distance, and pLDDT
	- DSSP files (`.dssp`) — required for secondary-structure fractions; must be generated beforehand (see below)

	Install:

	```bash
	pip install -r requirements.txt
	```

	## Input data

	### PDB structures

	Place all model `.pdb` files in one folder and pass it as `--input_files`. The script reads coordinates and B-factors from these files.
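
As an illustration of where the B-factor sits in a PDB file, the sketch below reads `ATOM` records by fixed columns and averages the temperature-factor field (columns 61-66), which holds pLDDT in ESMFold models. The function name `mean_bfactor` and the choice to average over Cα atoms only are assumptions made for this example; `pdb_dssp_analyses.py` may parse structures differently (for instance with Biopython) and may average over a different set of atoms.

```python
import math

def mean_bfactor(pdb_lines, atom_name="CA"):
    """Average the B-factor column over ATOM records with the given atom name."""
    values = []
    for line in pdb_lines:
        # PDB is fixed-width: atom name is columns 13-16,
        # temperature factor (B-factor) is columns 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == atom_name:
            values.append(float(line[60:66]))
    # No matching atoms: return NaN instead of raising (an assumption of this sketch)
    return sum(values) / len(values) if values else math.nan
```

Any iterable of lines works as input, including an open file handle.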

	### DSSP secondary structure

	DSSP assignments must exist as `.dssp` files in a separate folder (`--dssp_dir`). Generate them with mkdssp from the [PDB-REDO DSSP](https://github.com/PDB-REDO/dssp) package:

	```bash
	input_folder="/path/to/pdbs"
	output_folder="/path/to/dssp"
	mkdir -p "$output_folder"

	# Write one .dssp file per .pdb file, keeping the same filename stem
	for pdb in "$input_folder"/*.pdb; do
	    name=$(basename "$pdb" .pdb)
	    echo "$name"
	    mkdssp-4.4.0-linux-x64 --write-other "$pdb" "$output_folder/$name.dssp" \
	        --output-format=dssp --min-pp-stretch=2
	done
	```

	Use the `mkdssp` binary name from your installation if it differs from `mkdssp-4.4.0-linux-x64`.

	### Matching PDB and DSSP

	- Both folders should be flat (no subfolders).
	- Each `name.pdb` must have a matching `name.dssp` (same filename stem).
	- Only paired structures are analyzed; unmatched files are skipped with a warning.
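
The matching rule above can be sketched with `pathlib`. This is a minimal illustration, not the script's actual code: the function name `pair_by_stem` and its return shape are hypothetical.

```python
from pathlib import Path

def pair_by_stem(pdb_dir, dssp_dir):
    """Return (pairs, unmatched) for flat folders of .pdb and .dssp files."""
    # glob("*.pdb") does not descend into subfolders, matching the flat layout
    pdbs = {p.stem: p for p in Path(pdb_dir).glob("*.pdb")}
    dssps = {p.stem: p for p in Path(dssp_dir).glob("*.dssp")}
    # Stems present in both folders are the structures to analyze
    shared = sorted(pdbs.keys() & dssps.keys())
    pairs = [(stem, pdbs[stem], dssps[stem]) for stem in shared]
    # Stems present in only one folder would be skipped with a warning
    unmatched = sorted(pdbs.keys() ^ dssps.keys())
    return pairs, unmatched
```

Running a check like this before the analysis is a quick way to see which structures would be skipped.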

	## Usage

	```bash
	python pdb_dssp_analyses.py \
	--input_files /path/to/pdbs \
	--dssp_dir /path/to/dssp \
	--output /path/to/stats.csv
	```

	If `--output` is omitted, the CSV is written to the current directory as `stats.csv`.
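
For readers who want to wrap or reimplement the command line, the options and the `stats.csv` default could be declared with `argparse` roughly as follows. This is a sketch under assumptions: the helper name `build_parser`, the help strings, and marking the two folder options as required are not taken from the script.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Structure statistics from PDB and DSSP files")
    # Assumed to be required; the actual script may define defaults
    parser.add_argument("--input_files", required=True,
                        help="Folder containing .pdb files")
    parser.add_argument("--dssp_dir", required=True,
                        help="Folder containing .dssp files")
    # Default mirrors the documented behavior: stats.csv in the current directory
    parser.add_argument("--output", default="stats.csv",
                        help="Path of the output CSV")
    return parser

# Parse an explicit argument list with --output left out
args = build_parser().parse_args(
    ["--input_files", "/path/to/pdbs", "--dssp_dir", "/path/to/dssp"])
print(args.output)  # stats.csv
```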

	## Output columns

	`sequence_id` is the PDB/DSSP filename without extension.

	| Column | Source | Description |
	|--------|--------|-------------|
	| `sequence_id` | Filename | Structure identifier |
	| `sequence` | PDB | One-letter sequence (chain A, or first chain) |
	| `GRAVY` | Sequence | Mean Kyte–Doolittle hydropathy |
	| `pI` | Sequence | Isoelectric point (Biopython) |
	| `helix` | DSSP | Fraction helix of any type (3₁₀, α, π, or PPII): G, H, I, P |
	| `strand` | DSSP | Fraction β-strand (E) |
	| `disorder` | DSSP | Fraction coil / irregular: -, C, space, B, T, S |
	| `structured` | DSSP | Fraction structured: G, H, I, E, P |
	| `alpha-helix` | DSSP | Fraction α-helix (H) |
	| `helix-3` | DSSP | Fraction 3₁₀-helix (G) |
	| `helix-5` | DSSP | Fraction π-helix (I) |
	| `helix-PPII` | DSSP | Fraction polyproline II (P) |
	| `betabridge` | DSSP | Fraction β-bridge (B) |
	| `turn` | DSSP | Fraction turn (T) |
	| `bend` | DSSP | Fraction bend (S) |
	| `loops` | DSSP | Fraction B + T + S |
	| `ASA` | PDB | Total solvent-accessible surface area (Å², Shrake–Rupley) |
	| `Rg` | PDB | Radius of gyration (Å) |
	| `NtoCdistance` | PDB | Distance between the N-terminal and C-terminal Cα atoms (Å) |
	| `pLDDT` | PDB | Mean B-factor (pLDDT proxy for ESMFold models) |

	Secondary-structure fractions use raw DSSP one-letter codes.
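
To make the groupings concrete, the sketch below computes the aggregate fractions from a string of per-residue DSSP codes, using the code sets listed in the table. The function name `ss_fractions` and the dictionary output are illustrative assumptions, as is returning an empty dictionary for an empty input.

```python
# Code groups as listed in the output-columns table
HELIX = set("GHIP")
STRAND = set("E")
DISORDER = set("-C BTS")   # includes the blank (space) code
LOOPS = set("BTS")

def ss_fractions(codes):
    """Return aggregate secondary-structure fractions for a DSSP code string."""
    n = len(codes)
    if n == 0:
        return {}

    def frac(group):
        # Share of residues whose code belongs to the group
        return sum(c in group for c in codes) / n

    return {
        "helix": frac(HELIX),
        "strand": frac(STRAND),
        "disorder": frac(DISORDER),
        "structured": frac(HELIX | STRAND),
        "loops": frac(LOOPS),
    }
```

Because every code falls into exactly one of the helix, strand, or disorder groups, those three fractions sum to 1, and `loops` is a subset of `disorder`.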

	## Notes

	- ASA is computed from the PDB structure, not from DSSP relative accessibility.
	- If Biopython cannot parse a `.dssp` file, the script falls back to a simple line parser for classic DSSP output format.
	- Structures with fewer than two Cα atoms get `NaN` for `NtoCdistance`.
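
The two geometric descriptors can be sketched in plain Python. The radius of gyration here is the unweighted form, $R_g = \sqrt{\tfrac{1}{N}\sum_i \lVert \mathbf{r}_i - \mathbf{r}_c \rVert^2}$ with $\mathbf{r}_c$ the centroid; whether the script weights atoms by mass or uses all atoms rather than Cα only is not stated, so treat that as an assumption. The function names are hypothetical.

```python
import math

def radius_of_gyration(coords):
    """Unweighted Rg: root-mean-square distance of the points from their centroid."""
    n = len(coords)
    if n == 0:
        return math.nan
    centroid = [sum(c[i] for c in coords) / n for i in range(3)]
    mean_sq = sum(math.dist(c, centroid) ** 2 for c in coords) / n
    return math.sqrt(mean_sq)

def n_to_c_distance(ca_coords):
    """Distance between the first and last Cα; NaN with fewer than two Cα atoms."""
    if len(ca_coords) < 2:
        return math.nan
    return math.dist(ca_coords[0], ca_coords[-1])
```

`math.dist` is available from Python 3.8, which matches the stated minimum version.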

	## References

	- Kabsch, W. & Sander, C. (1983). Dictionary of protein secondary structure. Biopolymers 22, 2577–2637.
	- [PDB-REDO DSSP / mkdssp](https://github.com/PDB-REDO/dssp)