# PDB 3Di Chains Dataset
This repository contains chain-level sequences and 3Di tokens derived from RCSB PDB structures, with per-chain polymer class labels (`prot`, `DNA`, `RNA`, `other`). The files were merged from 1k-entry chunk folders (e.g., `1000_1999`) and cleaned to a consistent CSV schema.
The current data covers only about 120K proteins (PDB entries).
## Files
- `3di_chains_chaintag_filtered_prot_only.csv`: **use this one, the main file.** Repeated protein chains and DNA/RNA chains are filtered out. **Main file used for SFT.**
- `3di_chains_chaintag_all.csv` (474 MB): contains **all chains** (including **D-amino**, **RNA**, **DNA**, and any others).
- `3di_chains_chaintag_prot_only.csv` (464 MB): contains **protein chains only** (L- and D-amino acids treated as protein).
- `3di_chains_chaintag_sample_1000.csv`: a 1,000-row random sample for quick inspection.
All raw PDB files are available at https://drive.google.com/drive/folders/1jdz5c_EoNCpqXXmDdklr1tlZtzN8b4jY?usp=sharing
## Schema
All CSVs use the same columns:
- `index`: global row index from the source CSV (0-based)
- `pdb_id`: 4-character PDB code (e.g., `9B4J`)
- `chain_id`: chain identifier (alphanumeric, may include digits/letters)
- `aa_seq`: amino-acid sequence (when available)
- `threeDi_seq`: Foldseek 3Di token sequence (used for SFT)
- `combined_seq`: helper concatenation (3Di/AA) used upstream
- `seq_len`: chain sequence length (prefer AA length; else derived)
- `chunk`: source folder name (e.g., `1000_1999`)
- `path`: absolute path to the structure file used
- `polymer_class`: one of `prot`, `DNA`, `RNA`, `other`
  (D-amino peptides are classified as `prot`)
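
As a rough illustration of the schema above, the following sketch reads rows with Python's standard `csv` module, checks that the ten documented columns are present, and keeps only `prot` chains. The helper names (`read_chains`, `protein_chains`) and the two sample rows are hypothetical placeholders, not values from the dataset. It also assumes `index` and `seq_len` parse as integers, which the real files may or may not guarantee.

```python
import csv
import io

# Column names as documented in the Schema section.
EXPECTED_COLUMNS = [
    "index", "pdb_id", "chain_id", "aa_seq", "threeDi_seq",
    "combined_seq", "seq_len", "chunk", "path", "polymer_class",
]

# Hypothetical sample rows (made-up values, for illustration only).
# combined_seq is left empty because its exact format is not documented here.
SAMPLE_CSV = """index,pdb_id,chain_id,aa_seq,threeDi_seq,combined_seq,seq_len,chunk,path,polymer_class
0,1ABC,A,MKT,DVL,,3,1000_1999,/data/1000_1999/1abc.pdb,prot
1,1ABC,B,,,,12,1000_1999,/data/1000_1999/1abc.pdb,DNA
"""


def read_chains(handle):
    """Read chain rows from an open CSV handle and validate the header."""
    reader = csv.DictReader(handle)
    missing = set(EXPECTED_COLUMNS) - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = []
    for row in reader:
        # Assumption: both fields are plain integers in the cleaned CSVs.
        row["index"] = int(row["index"])
        row["seq_len"] = int(row["seq_len"])
        rows.append(row)
    return rows


def protein_chains(rows):
    """Keep only rows labeled as protein chains."""
    return [r for r in rows if r["polymer_class"] == "prot"]


rows = read_chains(io.StringIO(SAMPLE_CSV))
prot_rows = protein_chains(rows)
print(len(rows), "chains read,", len(prot_rows), "protein chain(s) kept")
```

To run the same helpers against a downloaded file, one option is `with open("3di_chains_chaintag_sample_1000.csv", newline="") as f: rows = read_chains(f)`, assuming the file sits in the working directory.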
## Basic Stats (from `_all`)
### Basic counts
- Proteins (unique `pdb_id`): 111629
- Chains (rows): 418335
- Avg chains per protein: 3.75

### Chain-level composition
- prot: 391317 (93.54%)
- DNA: 18575 (4.44%)
- RNA: 8375 (2.00%)
- other: 68 (0.02%)

### Protein-level presence (non-exclusive)
- Proteins with any prot: 109267 (97.88%)
- Proteins with any DNA: 7334 (6.57%)
- Proteins with any RNA: 4622 (4.14%)

### Chain length (residues)
- mean: 254.88, Q25: 103, Q50: 208, Q75: 335

### Protein total length (sum of chains)
- mean: 955.19, Q25: 270, Q50: 516, Q75: 1089
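
The statistics above could be recomputed along the following lines. This is a minimal sketch, not the script that produced the reported numbers: `summarize` is a hypothetical helper, the four toy rows are made up, and the quartile method (`inclusive`, i.e. linear interpolation) is an assumption, since the method actually used is not stated. The last lines only check that the reported class counts add up to the reported chain total and reproduce the average chains per protein.

```python
from collections import Counter, defaultdict
from statistics import mean, quantiles


def summarize(rows):
    """Compute entry/chain counts, class composition, and length statistics."""
    entries = {r["pdb_id"] for r in rows}
    class_counts = Counter(r["polymer_class"] for r in rows)
    # Presence is non-exclusive: an entry counts once for each class it contains.
    pairs = {(r["pdb_id"], r["polymer_class"]) for r in rows}
    presence = Counter(cls for _, cls in pairs)
    chain_lengths = [int(r["seq_len"]) for r in rows]
    total_by_entry = defaultdict(int)
    for r in rows:
        total_by_entry[r["pdb_id"]] += int(r["seq_len"])
    return {
        "entries": len(entries),
        "chains": len(rows),
        "avg_chains_per_entry": len(rows) / len(entries),
        "class_counts": dict(class_counts),
        "entries_with_class": dict(presence),
        "chain_length_mean": mean(chain_lengths),
        # Assumption: quartiles by linear interpolation between data points.
        "chain_length_quartiles": quantiles(chain_lengths, n=4, method="inclusive"),
        "entry_length_mean": mean(list(total_by_entry.values())),
    }


# Toy rows with made-up values, using only the columns the summary needs.
toy_rows = [
    {"pdb_id": "1ABC", "polymer_class": "prot", "seq_len": "10"},
    {"pdb_id": "1ABC", "polymer_class": "DNA", "seq_len": "20"},
    {"pdb_id": "2XYZ", "polymer_class": "prot", "seq_len": "30"},
    {"pdb_id": "2XYZ", "polymer_class": "prot", "seq_len": "40"},
]
summary = summarize(toy_rows)
print(summary)

# Consistency check on the figures reported in this README.
REPORTED_CLASS_COUNTS = {"prot": 391317, "DNA": 18575, "RNA": 8375, "other": 68}
REPORTED_CHAINS = 418335
REPORTED_ENTRIES = 111629
classes_sum_to_total = sum(REPORTED_CLASS_COUNTS.values()) == REPORTED_CHAINS
avg_chains = round(REPORTED_CHAINS / REPORTED_ENTRIES, 2)
print(classes_sum_to_total, avg_chains)
```

For the full `_all` file, the rows from `csv.DictReader` can be passed to `summarize` directly, since it only relies on the `pdb_id`, `polymer_class`, and `seq_len` columns.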