# PDB 3Di Chains Dataset

This repository contains chain-level sequences and 3Di tokens derived from RCSB PDB structures, with per-chain polymer class labels (`prot`, `DNA`, `RNA`, `other`). Files are merged from the chunked 1k folders and cleaned to a consistent CSV schema. The current data is based on only 120K proteins.

## Files

- `3di_chains_chaintag_filtered_prot_only.csv`: **use this one; it is the main file, used for SFT.** Repeated protein chains and DNA/RNA chains are filtered out.
- `3di_chains_chaintag_all.csv` (474 MB): contains **all chains** (including **D-amino**, **RNA**, **DNA**, and any others).
- `3di_chains_chaintag_prot_only.csv` (464 MB): contains **protein chains only** (L- and D-amino acids are both treated as protein).
- `3di_chains_chaintag_sample_1000.csv`: a 1,000-row random sample for quick inspection.

All raw PDB files are available at https://drive.google.com/drive/folders/1jdz5c_EoNCpqXXmDdklr1tlZtzN8b4jY?usp=sharing

## Schema

All CSVs use the same columns:

- `index`: global row index from the source CSV (0-based)
- `pdb_id`: 4-character PDB code (e.g., `9B4J`)
- `chain_id`: chain identifier (alphanumeric; may include digits and letters)
- `aa_seq`: amino-acid sequence (when available)
- `threeDi_seq`: Foldseek 3Di token sequence (used for SFT)
- `combined_seq`: helper concatenation (3Di/AA) used upstream
- `seq_len`: chain sequence length (the AA length when available; otherwise derived)
- `chunk`: source folder name (e.g., `1000_1999`)
- `path`: absolute path to the structure file used
- `polymer_class`: one of `prot`, `DNA`, `RNA`, `other` (D-amino peptides are classified as `prot`)

## Basic Stats (from `_all`)

**Basic counts**

- Proteins (unique `pdb_id`): 111629
- Chains (rows): 418335
- Avg chains per protein: 3.75

**Chain-level composition**

- `prot`: 391317 (93.54%)
- `DNA`: 18575 (4.44%)
- `RNA`: 8375 (2.00%)
- `other`: 68 (0.02%)

**Protein-level presence (non-exclusive)**

- Proteins with any `prot`: 109267 (97.88%)
- Proteins with any `DNA`: 7334 (6.57%)
- Proteins with any `RNA`: 4622 (4.14%)

**Chain length (residues)**

- mean: 254.88 | Q25: 103 | Q50: 208 | Q75: 335

**Protein total length (sum of chains)**

- mean: 955.19 | Q25: 270 | Q50: 516 | Q75: 1089
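
The counts above can be recomputed from any of the CSVs using the documented columns. The following is a minimal sketch, not the script that produced the published numbers: the function name `summarize_chains` is hypothetical, and the three-row sample is made up purely so the snippet runs without downloading the data. It relies only on the `pdb_id` and `polymer_class` columns from the schema.

```python
import csv
import io
from collections import Counter


def summarize_chains(rows):
    """Recompute basic counts from an iterable of row dicts.

    `rows` is expected to look like csv.DictReader output with at least
    the `pdb_id` and `polymer_class` columns described in the schema.
    """
    chains = 0
    class_counts = Counter()   # chains per polymer class
    entries = set()            # unique pdb_id values
    entries_by_class = {}      # polymer class -> set of pdb_id values
    for row in rows:
        chains += 1
        cls = row["polymer_class"]
        class_counts[cls] += 1
        entries.add(row["pdb_id"])
        entries_by_class.setdefault(cls, set()).add(row["pdb_id"])
    return {
        "entries": len(entries),
        "chains": chains,
        "avg_chains_per_entry": chains / len(entries) if entries else 0.0,
        "chain_composition": dict(class_counts),
        "entries_with_class": {c: len(s) for c, s in entries_by_class.items()},
    }


# Made-up illustrative rows (NOT real dataset content); only the header
# names are taken from the schema.
SAMPLE_CSV = """index,pdb_id,chain_id,aa_seq,threeDi_seq,combined_seq,seq_len,chunk,path,polymer_class
0,1ABC,A,MKT,dvv,,3,0_999,/tmp/1abc.pdb,prot
1,1ABC,B,,,,4,0_999,/tmp/1abc.pdb,DNA
2,2XYZ,A,GSH,dpv,,3,0_999,/tmp/2xyz.pdb,prot
"""

summary = summarize_chains(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(summary)
```

To run the same summary on a real file, and assuming it sits in the working directory with a header row matching the schema, open it with `open("3di_chains_chaintag_all.csv", newline="")` and pass `csv.DictReader(f)` to `summarize_chains`. The function streams rows one at a time, so the 474 MB file does not need to fit in memory.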