| # PDB 3Di Chains Dataset | |
| This repository contains chain-level sequences and 3Di tokens derived from RCSB PDB structures, with per-chain polymer class labels (`prot`, `DNA`, `RNA`, `other`). The files are merged from 1k-entry chunk folders (e.g., `1000_1999`) and cleaned to a consistent CSV schema. | |
| The current data is based on only 120K proteins. | |
| ## Files | |
| - `3di_chains_chaintag_filtered_prot_only.csv` **<- Use this one, the main file** | |
| - `3di_chains_chaintag_all.csv` (474 MB) | |
| - `3di_chains_chaintag_prot_only.csv` (464 MB) | |
| - `3di_chains_chaintag_sample_1000.csv` | |
| - `3di_chains_chaintag_filtered_prot_only.csv` has repeated protein chains and all DNA/RNA chains filtered out. **Main file used for SFT.** | |
| - `_all` contains **all chains** (including **D-amino**, **RNA**, **DNA**, and any others). | |
| - `_prot_only` contains **protein chains only** (L- and D-amino acids treated as protein). | |
| - `_sample_1000` is a 1,000-row random sample for quick inspection. | |
| All raw PDB files are in https://drive.google.com/drive/folders/1jdz5c_EoNCpqXXmDdklr1tlZtzN8b4jY?usp=sharing | |
| ## Schema | |
| All CSVs use the same columns: | |
| - `index`: global row index from the source CSV (0-based) | |
| - `pdb_id`: 4-character PDB code (e.g., `9B4J`) | |
| - `chain_id`: chain identifier (may contain letters and digits) | |
| - `aa_seq`: amino-acid sequence (when available) | |
| - `threeDi_seq`: Foldseek 3Di token sequence (used for SFT) | |
| - `combined_seq`: helper concatenation (3Di/AA) used upstream | |
| - `seq_len`: chain sequence length (AA length when available; otherwise derived) | |
| - `chunk`: source folder name (e.g., `1000_1999`) | |
| - `path`: absolute path to the structure file used | |
| - `polymer_class`: one of `prot`, `DNA`, `RNA`, `other` | |
| (D-amino peptides are classified as `prot`) | |
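The schema above can be read with pandas roughly as follows. This is a minimal sketch, not the loader used to build the dataset: `load_chains` is a hypothetical helper name, and the two rows in `toy_csv` are made-up values for illustration only. The one deliberate choice is `keep_default_na=False`, because pandas would otherwise parse a chain ID such as `NA` as a missing value; identifier and sequence columns are also read as plain strings so no numeric inference is applied to them.

```python
import io

import pandas as pd

# Columns from the Schema section that should stay as text.
STRING_COLUMNS = [
    "pdb_id", "chain_id", "aa_seq", "threeDi_seq",
    "combined_seq", "chunk", "path", "polymer_class",
]


def load_chains(source):
    """Load a chains CSV from a file path or file-like object (hypothetical helper)."""
    return pd.read_csv(
        source,
        dtype={col: str for col in STRING_COLUMNS},
        keep_default_na=False,  # keep chain IDs such as "NA" as text, not NaN
    )


# Toy input with made-up rows; real use would pass a CSV path instead.
toy_csv = io.StringIO(
    "index,pdb_id,chain_id,aa_seq,threeDi_seq,combined_seq,seq_len,chunk,path,polymer_class\n"
    "0,1E10,NA,ACDE,DVVL,,4,1000_1999,/toy/1e10.pdb,prot\n"
    "1,1E10,B,,,,3,1000_1999,/toy/1e10.pdb,DNA\n"
)
df = load_chains(toy_csv)
print(df[["pdb_id", "chain_id", "seq_len", "polymer_class"]])
```

With these settings a missing `aa_seq` comes back as an empty string rather than `NaN`, so a check like `df["aa_seq"] != ""` selects chains that have an amino-acid sequence.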
| ## Basic Stats (from `_all`) | |
| === BASIC COUNTS === | |
| Proteins (unique pdb_id) : 111629 | |
| Chains (rows) : 418335 | |
| Avg chains per protein : 3.75 | |
| === CHAIN-LEVEL COMPOSITION === | |
| prot: 391317 (93.54%) | |
| DNA: 18575 (4.44%) | |
| RNA: 8375 (2.00%) | |
| other: 68 (0.02%) | |
| === PROTEIN-LEVEL PRESENCE (non-exclusive) === | |
| Proteins with any prot: 109267 (97.88%) | |
| Proteins with any DNA : 7334 (6.57%) | |
| Proteins with any RNA : 4622 (4.14%) | |
| === CHAIN LENGTH (residues) === | |
| mean: 254.88 | Q25: 103 | Q50: 208 | Q75: 335 | |
| === PROTEIN TOTAL LENGTH (sum of chains) === | |
| mean: 955.19 | Q25: 270 | Q50: 516 | Q75: 1089 | |
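The script that produced these numbers is not included here. The sketch below shows one plausible way to recompute the same quantities from a DataFrame with the `pdb_id`, `polymer_class`, and `seq_len` columns; `basic_stats` and `length_summary` are hypothetical names, the toy rows are made up, and quartiles use pandas' default linear interpolation, which may differ slightly from the original computation.

```python
import pandas as pd


def length_summary(lengths):
    """Return the mean and quartiles of a numeric Series."""
    q = lengths.quantile([0.25, 0.5, 0.75])
    return {
        "mean": float(lengths.mean()),
        "Q25": float(q.loc[0.25]),
        "Q50": float(q.loc[0.5]),
        "Q75": float(q.loc[0.75]),
    }


def basic_stats(df):
    """Recompute entry-level and chain-level summary statistics."""
    n_entries = df["pdb_id"].nunique()
    n_chains = len(df)
    return {
        "entries": n_entries,
        "chains": n_chains,
        "chains_per_entry": n_chains / n_entries,
        # Chain-level composition: rows per polymer class.
        "class_counts": df["polymer_class"].value_counts().to_dict(),
        # Entry-level presence: unique pdb_id values containing each class.
        "entries_with_class": df.groupby("polymer_class")["pdb_id"].nunique().to_dict(),
        "chain_length": length_summary(df["seq_len"]),
        # Total length per entry is the sum of its chain lengths.
        "entry_total_length": length_summary(df.groupby("pdb_id")["seq_len"].sum()),
    }


# Toy data with made-up values, only to show the output shape.
toy = pd.DataFrame(
    {
        "pdb_id": ["AAAA", "AAAA", "BBBB"],
        "polymer_class": ["prot", "DNA", "prot"],
        "seq_len": [100, 20, 50],
    }
)
stats = basic_stats(toy)
print(stats)
```

Percentages such as those listed above follow by dividing `class_counts` by `chains` and `entries_with_class` by `entries`.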