# MuSProt Dataset Documentation

**MuSProt** (Multistate Protein Database) is a million-scale multimodal database for multistate proteins, designed to support programmable protein design and AI model development. It links experimentally observed conformational states of identical protein sequences from the PDB, organizes them into state clusters and transition relationships, and enriches each record with structural similarity, experimental context, state-specific function rankings and transition fidelity labels. Users can search, browse and download MuSProt records to study conformational diversity, state-dependent functions and feasible protein state transitions.

---

## Download

The full dataset is distributed as a single SQLite database file (`MuSProt.db`, ~6.3 GB). Download it from the dataset page using the **Download DB** button.

To obtain detailed protein structures and atom coordinates, download the corresponding entry from the [Protein Data Bank (PDB)](https://rcsb.org) as an `mmCIF` or `.pdb` file, then select the chain using the record's PDB entry ID and chain ID.
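
As a rough illustration of that lookup, the sketch below builds a download URL from a `pdb_id`. The URL pattern `https://files.rcsb.org/download/<id>.cif` and the helper names `mmcif_url` and `fetch_mmcif` are assumptions made for this example; they are not part of MuSProt.

```python
from urllib.request import urlretrieve

# Assumed URL pattern for RCSB file downloads; not defined by MuSProt.
RCSB_DOWNLOAD = "https://files.rcsb.org/download/{pdb_id}.cif"


def mmcif_url(pdb_id: str) -> str:
    """Build the mmCIF download URL for one PDB entry ID."""
    return RCSB_DOWNLOAD.format(pdb_id=pdb_id.strip().lower())


def fetch_mmcif(pdb_id: str, dest_dir: str = ".") -> str:
    """Download the mmCIF file for `pdb_id` and return the local path."""
    path = f"{dest_dir}/{pdb_id.strip().lower()}.cif"
    urlretrieve(mmcif_url(pdb_id), path)
    return path


# Example (requires network access, so it is left commented out):
# path = fetch_mmcif("1ivo")
```

Once the file is parsed, the chain of interest is the one whose author chain ID matches the record's `auth_asym_id`.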

---

## Database Schema

The database contains two tables: **`node`** and **`edge`**.
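
One way to confirm the layout of a downloaded file is to read the table and column names back from SQLite itself. The sketch below does this with `sqlite_master` and `PRAGMA table_info`; the helper name `describe_tables` is hypothetical, and the demo runs against a small in-memory stand-in holding only a few of the documented columns rather than the real `MuSProt.db`.

```python
import sqlite3


def describe_tables(conn: sqlite3.Connection) -> dict:
    """Map each table name to its (column name, declared type) pairs."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    return {
        # PRAGMA table_info rows are (cid, name, type, notnull, default, pk).
        t: [(col[1], col[2]) for col in conn.execute(f'PRAGMA table_info("{t}")')]
        for t in tables
    }


# In-memory stand-in (assumption: only a subset of the documented columns).
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE node (uniprot_id TEXT, pdb_id TEXT, auth_asym_id TEXT)")
demo.execute("CREATE TABLE edge (pdb_id_A TEXT, pdb_id_B TEXT, TM1 TEXT)")
schema = describe_tables(demo)
```

Against the real file, replace `":memory:"` with the path to `MuSProt.db` and skip the `CREATE TABLE` statements.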

---

### Table: `node`

Each row represents a single protein chain instance (one PDB entry + chain).

| Column | Type | Description |
|---|---|---|
| `uniprot_id` | TEXT | UniProt accession number |
| `pdb_id` | TEXT | 4-character PDB entry ID (lowercase) |
| `auth_asym_id` | TEXT | Author chain identifier (e.g. `A`) |
| `base_label` | TEXT | Canonical label combining UniProt ID and chain state |
| `sequence` | TEXT | SEQRES amino acid sequence |
| `sequence_length` | INT | Number of residues in the chain |
| `original_metals` | TEXT | Metal ions present in the structure |
| `original_ligands` | TEXT | Small-molecule ligands present in the structure |
| `sequence_id` | TEXT | Internal sequence cluster identifier |
| `state_id` | TEXT | Conformational state cluster this chain is assigned to (`0`, `1`, `2`, …) |
| `CATH_ID` | TEXT | CATH domain assignment |
| `cath_class` | TEXT | CATH class (e.g. `1` = Mainly Alpha) |
| `cath_arch` | TEXT | CATH architecture |
| `cath_topo` | TEXT | CATH topology |
| `cath_homology` | TEXT | CATH homology superfamily |
| `cath_superfamily` | TEXT | Full CATH superfamily code (e.g. `1.10.10.10`) |
| `domain_length` | INT | Length of the matched CATH domain |
| `experimental_method` | TEXT | Structure determination method (e.g. `X-RAY DIFFRACTION`, `ELECTRON MICROSCOPY`, `SOLUTION NMR`) |
| `pH` | FLOAT | pH of the experimental / crystallization condition |
| `temp_K` | FLOAT | Temperature of the experiment, in Kelvin |
| `experimental_details` | TEXT | Free-text crystallization / sample-preparation details |
| `resolution` | FLOAT | Experimental resolution in Å (lower is sharper; empty for methods without a resolution) |
| `Rosetta` | FLOAT | Rosetta total energy score |
| `FoldX` | FLOAT | FoldX total energy score |
| `EvoEF2` | FLOAT | EvoEF2 total energy score |
| `RW` | FLOAT | Random-Walk (RW) energy score |
| `RW+` | FLOAT | Random-Walk+ (RWplus) energy score |
| `ranked_functions` | TEXT | JSON-encoded list of ranked GO/functional annotations |
| `chain_composition` | TEXT | Quaternary composition of the deposited assembly: `monomeric`, `homomeric`, or `heteromeric` |
| `non_protein_polymer_binding` | TEXT | Non-protein polymer bound in the structure (`DNA`, `RNA`, or `DNA/RNA`); empty when none is present |
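
To see how these columns combine, the sketch below counts the distinct `state_id` values per protein to list those observed in more than one conformational state. It assumes that `state_id` values are scoped to a `sequence_id`, which the schema does not state explicitly. The rows are invented for the demo and the accessions are placeholders.

```python
import sqlite3

# In-memory stand-in with a subset of the `node` columns; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (uniprot_id TEXT, sequence_id TEXT, "
    "pdb_id TEXT, auth_asym_id TEXT, state_id TEXT)"
)
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?, ?)", [
    ("DEMO1", "s1", "1aaa", "A", "0"),
    ("DEMO1", "s1", "1aab", "A", "0"),
    ("DEMO1", "s1", "1aac", "B", "1"),
    ("DEMO2", "s2", "2bbb", "A", "0"),
])

# Assumption: states are numbered per sequence cluster, so group by both IDs.
MULTISTATE_SQL = """
    SELECT uniprot_id,
           sequence_id,
           COUNT(*)                 AS n_chains,
           COUNT(DISTINCT state_id) AS n_states
    FROM node
    GROUP BY uniprot_id, sequence_id
    HAVING COUNT(DISTINCT state_id) > 1
    ORDER BY n_states DESC, uniprot_id
"""
multistate = conn.execute(MULTISTATE_SQL).fetchall()
conn.close()
```

On the full database the same query can be passed to `pd.read_sql` to obtain a DataFrame instead of a list of tuples.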


---

### Table: `edge`

Each row represents a pairwise structural comparison between two chain instances (A and B) that share the same UniProt identity.

| Column | Type | Description |
|---|---|---|
| `pdb_id_A` | TEXT | PDB ID of chain A |
| `auth_asym_id_A` | TEXT | Chain identifier of chain A |
| `pdb_id_B` | TEXT | PDB ID of chain B |
| `auth_asym_id_B` | TEXT | Chain identifier of chain B |
| `TM1` | FLOAT | TM-score of the alignment (chain A as reference) |
| `RMSD` | FLOAT | Root-mean-square deviation of Cα atoms (Å) |
| `structure_sim` | FLOAT | Composite structural similarity score |
| `delta_Rosetta` | FLOAT | Rosetta energy difference (B − A) |
| `delta_FoldX` | FLOAT | FoldX energy difference (B − A) |
| `delta_EvoEF2` | FLOAT | EvoEF2 energy difference (B − A) |
| `delta_RW` | FLOAT | Random-Walk (RW) energy difference (B − A) |
| `delta_RW+` | FLOAT | Random-Walk+ (RWplus) energy difference (B − A) |
| `state_id_A` | TEXT | Conformational state cluster of chain A |
| `state_id_B` | TEXT | Conformational state cluster of chain B |
| `avg_sim` | TEXT | Average structural similarity within the state cluster (`>0.95` or a numeric value) |
| `state_fidelity` | TEXT | State-level transition fidelity label: `identical`, `high`, `medium`, or `low` |
| `observation_fidelity` | TEXT | Observation-level transition fidelity label: `identical`, `high`, `medium`, or `low` |
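
The sketch below selects pairs that cross state clusters at a given `state_fidelity`. Column names containing `+`, such as `delta_RW+`, must be wrapped in double quotes in SQL. The helper name `cross_state_edges` is hypothetical, and the rows are invented purely to make the example runnable.

```python
import sqlite3

# In-memory stand-in with a subset of the `edge` columns; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute('''CREATE TABLE edge (
    pdb_id_A TEXT, auth_asym_id_A TEXT, pdb_id_B TEXT, auth_asym_id_B TEXT,
    TM1 TEXT, state_id_A TEXT, state_id_B TEXT,
    "delta_RW+" TEXT, state_fidelity TEXT)''')
conn.executemany("INSERT INTO edge VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", [
    ("1aaa", "A", "1aab", "A", "0.99", "0", "0", "-1.5", "identical"),
    ("1aaa", "A", "1aac", "B", "0.71", "0", "1", "-120.0", "high"),
    ("1aac", "B", "1aaa", "A", "0.69", "1", "0", "120.0", "high"),
    ("1aaa", "A", "1aad", "A", "0.45", "0", "2", "35.0", "low"),
])


def cross_state_edges(conn, fidelity):
    """Return A->B pairs whose chains sit in different state clusters."""
    sql = """
        SELECT pdb_id_A, auth_asym_id_A, pdb_id_B, auth_asym_id_B,
               CAST(TM1 AS REAL)         AS tm1,
               CAST("delta_RW+" AS REAL) AS delta_rw_plus
        FROM edge
        WHERE state_id_A <> state_id_B
          AND state_fidelity = ?
        ORDER BY pdb_id_A, pdb_id_B
    """
    return conn.execute(sql, (fidelity,)).fetchall()


high = cross_state_edges(conn, "high")
conn.close()
```

Passing the fidelity label as a bound parameter (`?`) avoids building the SQL string by hand.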

---

## Usage Examples

### Python (sqlite3)

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("MuSProt.db")

# Load all chains for a UniProt entry
df_nodes = pd.read_sql(
    "SELECT * FROM node WHERE uniprot_id = 'P00533'",
    conn
)

# Find all structural neighbours of a given chain
df_edges = pd.read_sql(
    "SELECT * FROM edge WHERE pdb_id_A = '1ivo' AND auth_asym_id_A = 'A'",
    conn
)

conn.close()
```

### Filter by structural similarity

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("MuSProt.db")

# Retrieve pairs with high TM-score and low RMSD
df = pd.read_sql("""
    SELECT *
    FROM edge
    WHERE CAST(TM1 AS REAL) > 0.8
      AND CAST(RMSD AS REAL) < 2.0
    LIMIT 1000
""", conn)

conn.close()
```

### Join nodes and edges

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("MuSProt.db")

# Get full info for both chains in each pair
df = pd.read_sql("""
    SELECT
        e.pdb_id_A, e.auth_asym_id_A,
        e.pdb_id_B, e.auth_asym_id_B,
        e.TM1, e.RMSD,
        nA.sequence_length AS len_A,
        nB.sequence_length AS len_B,
        nA.cath_superfamily
    FROM edge e
    JOIN node nA ON e.pdb_id_A = nA.pdb_id AND e.auth_asym_id_A = nA.auth_asym_id
    JOIN node nB ON e.pdb_id_B = nB.pdb_id AND e.auth_asym_id_B = nB.auth_asym_id
    WHERE nA.uniprot_id = 'P00533'
    LIMIT 500
""", conn)

conn.close()
```

---

## Notes

- All numeric fields (energies, TM-score, RMSD, lengths) are stored as `TEXT` in the database file, even where the schema tables list them as `INT` or `FLOAT`; cast them with `CAST(col AS REAL)` or `CAST(col AS INTEGER)` as needed.
- `ranked_functions` in the `node` table is a JSON string. Parse it with `json.loads()`.
- The `edge` table is directional: (A→B) and (B→A) are separate rows and may differ slightly in TM-score.
- Energy delta values represent B − A; a negative delta means chain B is lower energy than chain A.
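
The notes on `TEXT` storage and JSON encoding can be handled on the Python side with two small helpers, sketched below. The names `parse_ranked_functions` and `to_float` are hypothetical, and the sample JSON string is invented; the actual structure of the entries in `ranked_functions` is not specified here.

```python
import json


def parse_ranked_functions(raw):
    """Decode a `ranked_functions` value; return [] for NULL or empty text."""
    if raw is None or not raw.strip():
        return []
    return json.loads(raw)


def to_float(raw):
    """Convert a TEXT-encoded number to float; return None when it is empty."""
    if raw is None or str(raw).strip() == "":
        return None
    return float(raw)


# Invented sample values, standing in for fields read from a `node` row.
functions = parse_ranked_functions('["function A", "function B"]')
resolution = to_float("2.35")
missing_resolution = to_float("")  # e.g. a method without a resolution
```

`to_float` is not suitable for `avg_sim`, which may hold the literal string `>0.95` and needs separate handling.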