# MuSProt Dataset Documentation

**MuSProt** (Multistate Protein Database) is a million-scale multimodal database for multistate proteins, designed to support programmable protein design and AI model development. It links experimentally observed conformational states of identical protein sequences from the PDB, organizes them into state clusters and transition relationships, and enriches each record with structural similarity, experimental context, state-specific function rankings and transition fidelity labels. Users can search, browse and download MuSProt records to study conformational diversity, state-dependent functions and feasible protein state transitions.

---

## Download

The full dataset is distributed as a single SQLite database file (`MuSProt.db`, ~6.3 GB). Download it from the dataset page using the **Download DB** button.

For detailed protein structures and atomic coordinates, download the corresponding entry from the [Protein Data Bank (PDB)](https://rcsb.org) in `mmCIF` or `.pdb` format, then select the chain of interest using the PDB entry ID (`pdb_id`) and chain ID (`auth_asym_id`).
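
As a rough illustration of the coordinate lookup described above, the sketch below builds a download URL from a `pdb_id` and saves the mmCIF file locally. The `files.rcsb.org/download` endpoint and the helper names `mmcif_url` and `fetch_mmcif` are assumptions made for this example and are not part of MuSProt. Chain selection by `auth_asym_id` is left to an mmCIF parser of your choice.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumed RCSB file-download endpoint; adjust if you use a mirror.
RCSB_DOWNLOAD = "https://files.rcsb.org/download"


def mmcif_url(pdb_id: str) -> str:
    """Return the assumed mmCIF download URL for a 4-character PDB ID."""
    return f"{RCSB_DOWNLOAD}/{pdb_id.lower()}.cif"


def fetch_mmcif(pdb_id: str, out_dir: str = ".") -> Path:
    """Download the mmCIF file for `pdb_id` into `out_dir` (needs network access)."""
    target = Path(out_dir) / f"{pdb_id.lower()}.cif"
    urlretrieve(mmcif_url(pdb_id), str(target))
    return target


# Only the URL is built here; call fetch_mmcif("1ivo") to actually download.
print(mmcif_url("1IVO"))
```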

---

## Database Schema

The database contains two tables: **`node`** and **`edge`**.
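
Before running queries against a multi-gigabyte file, it can help to confirm the table and column names on your own copy. The sketch below is one way to do that with the standard `sqlite3` module. The helpers `open_readonly` and `list_columns` are hypothetical names, and the in-memory tables are a small stand-in that mirrors only a few of the documented columns.

```python
import sqlite3


def open_readonly(path: str) -> sqlite3.Connection:
    """Open an SQLite file read-only via a URI so it cannot be modified by accident."""
    return sqlite3.connect(f"file:{path}?mode=ro", uri=True)


def list_columns(conn: sqlite3.Connection, table: str) -> list:
    """Return (column name, declared type) pairs for `table`."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return [(row[1], row[2]) for row in conn.execute(f'PRAGMA table_info("{table}")')]


# Stand-in database; with the real file use: conn = open_readonly("MuSProt.db")
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE node (pdb_id TEXT, auth_asym_id TEXT, state_id TEXT)")
demo.execute("CREATE TABLE edge (pdb_id_A TEXT, pdb_id_B TEXT, TM1 FLOAT)")

tables = [
    row[0]
    for row in demo.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)
print(list_columns(demo, "edge"))
```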

---

### Table: `node`

Each row represents a single protein chain instance (one PDB entry + chain).

| Column | Type | Description |
|---|---|---|
| `uniprot_id` | TEXT | UniProt accession number |
| `pdb_id` | TEXT | 4-character PDB entry ID (lowercase) |
| `auth_asym_id` | TEXT | Author chain identifier (e.g. `A`) |
| `base_label` | TEXT | Canonical label combining UniProt ID and chain state |
| `sequence` | TEXT | SEQRES amino acid sequence |
| `sequence_length` | INT | Number of residues in the chain |
| `original_metals` | TEXT | Metal ions present in the structure |
| `original_ligands` | TEXT | Small-molecule ligands present in the structure |
| `sequence_id` | TEXT | Internal sequence cluster identifier |
| `state_id` | TEXT | Conformational state cluster this chain is assigned to (`0`, `1`, `2`, …) |
| `CATH_ID` | TEXT | CATH domain assignment |
| `cath_class` | TEXT | CATH class (e.g. `1` = Mainly Alpha) |
| `cath_arch` | TEXT | CATH architecture |
| `cath_topo` | TEXT | CATH topology |
| `cath_homology` | TEXT | CATH homology superfamily |
| `cath_superfamily` | TEXT | Full CATH superfamily code (e.g. `1.10.10.10`) |
| `domain_length` | INT | Length of the matched CATH domain |
| `experimental_method` | TEXT | Structure determination method (e.g. `X-RAY DIFFRACTION`, `ELECTRON MICROSCOPY`, `SOLUTION NMR`) |
| `pH` | FLOAT | pH of the experimental / crystallization condition |
| `temp_K` | FLOAT | Temperature of the experiment, in Kelvin |
| `experimental_details` | TEXT | Free-text crystallization / sample-preparation details |
| `resolution` | FLOAT | Experimental resolution in Å (lower is sharper; empty for methods without a resolution) |
| `Rosetta` | FLOAT | Rosetta total energy score |
| `FoldX` | FLOAT | FoldX total energy score |
| `EvoEF2` | FLOAT | EvoEF2 total energy score |
| `RW` | FLOAT | Random-Walk (RW) energy score |
| `RW+` | FLOAT | Random-Walk+ (RWplus) energy score |
| `ranked_functions` | TEXT | JSON-encoded list of ranked GO/functional annotations |
| `chain_composition` | TEXT | Quaternary composition of the deposited assembly: `monomeric`, `homomeric`, or `heteromeric` |
| `non_protein_polymer_binding` | TEXT | Non-protein polymer bound in the structure (`DNA`, `RNA`, or `DNA/RNA`); empty when none is present |

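To see how chains are distributed across conformational states, one option is to group `node` rows by `sequence_id` and `state_id`. The sketch below uses an in-memory table with made-up rows (`X00001`, `seq1`, `1abc` and similar values are placeholders, not real records). It assumes that `state_id` values are scoped to a `sequence_id`, which is why both columns appear in the grouping. The helper name `chains_per_state` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (uniprot_id TEXT, pdb_id TEXT, auth_asym_id TEXT, "
    "sequence_id TEXT, state_id TEXT)"
)
# Placeholder rows for illustration only.
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?, ?, ?)",
    [
        ("X00001", "1abc", "A", "seq1", "0"),
        ("X00001", "1abc", "B", "seq1", "0"),
        ("X00001", "2def", "A", "seq1", "1"),
        ("X00002", "3ghi", "A", "seq2", "0"),
    ],
)


def chains_per_state(conn: sqlite3.Connection, uniprot_id: str) -> dict:
    """Count chain instances per (sequence_id, state_id) for one UniProt accession."""
    rows = conn.execute(
        "SELECT sequence_id, state_id, COUNT(*) FROM node "
        "WHERE uniprot_id = ? GROUP BY sequence_id, state_id",
        (uniprot_id,),
    )
    return {(seq, state): n for seq, state, n in rows}


counts = chains_per_state(conn, "X00001")
print(counts)
```

The query passes the accession as a bound parameter (`?`) rather than formatting it into the SQL string, which avoids quoting problems when the value comes from user input.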

---

### Table: `edge`

Each row represents a pairwise structural comparison between two chain instances (A and B) that share the same UniProt identity.

| Column | Type | Description |
|---|---|---|
| `pdb_id_A` | TEXT | PDB ID of chain A |
| `auth_asym_id_A` | TEXT | Chain identifier of chain A |
| `pdb_id_B` | TEXT | PDB ID of chain B |
| `auth_asym_id_B` | TEXT | Chain identifier of chain B |
| `TM1` | FLOAT | TM-score of the alignment (chain A as reference) |
| `RMSD` | FLOAT | Root-mean-square deviation of Cα atoms (Å) |
| `structure_sim` | FLOAT | Composite structural similarity score |
| `delta_Rosetta` | FLOAT | Rosetta energy difference (B − A) |
| `delta_FoldX` | FLOAT | FoldX energy difference (B − A) |
| `delta_EvoEF2` | FLOAT | EvoEF2 energy difference (B − A) |
| `delta_RW` | FLOAT | Random-Walk (RW) energy difference (B − A) |
| `delta_RW+` | FLOAT | Random-Walk+ (RWplus) energy difference (B − A) |
| `state_id_A` | TEXT | Conformational state cluster of chain A |
| `state_id_B` | TEXT | Conformational state cluster of chain B |
| `avg_sim` | TEXT | Average structural similarity within the state cluster (`>0.95` or a numeric value) |
| `state_fidelity` | TEXT | State-level transition fidelity label: `identical`, `high`, `medium`, or `low` |
| `observation_fidelity` | TEXT | Observation-level transition fidelity label: `identical`, `high`, `medium`, or `low` |

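A common use of the `edge` table is to pull out comparisons that connect two different states and carry a particular fidelity label. The sketch below shows one way to express that filter. The rows are placeholders, only a subset of the documented columns is created, and the helper name `cross_state_edges` is hypothetical. Which fidelity labels are appropriate depends on your application.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE edge (pdb_id_A TEXT, auth_asym_id_A TEXT, pdb_id_B TEXT, "
    "auth_asym_id_B TEXT, state_id_A TEXT, state_id_B TEXT, state_fidelity TEXT)"
)
# Placeholder rows for illustration only.
conn.executemany(
    "INSERT INTO edge VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("1abc", "A", "2def", "A", "0", "1", "high"),
        ("2def", "A", "1abc", "A", "1", "0", "high"),
        ("1abc", "A", "1abc", "B", "0", "0", "identical"),
        ("1abc", "A", "3ghi", "A", "0", "2", "low"),
    ],
)


def cross_state_edges(conn: sqlite3.Connection, fidelity=("high",)) -> list:
    """Return edges whose two chains sit in different states with a given label."""
    placeholders = ", ".join("?" for _ in fidelity)
    sql = (
        "SELECT pdb_id_A, auth_asym_id_A, pdb_id_B, auth_asym_id_B, state_fidelity "
        "FROM edge "
        "WHERE state_id_A <> state_id_B "
        f"AND state_fidelity IN ({placeholders}) "
        "ORDER BY pdb_id_A, pdb_id_B"
    )
    return conn.execute(sql, tuple(fidelity)).fetchall()


transitions = cross_state_edges(conn)
print(transitions)
```
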
---

## Usage Examples

### Python (sqlite3)

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect("MuSProt.db")

# Load all chains for a UniProt entry
df_nodes = pd.read_sql(
    "SELECT * FROM node WHERE uniprot_id = 'P00533'",
    conn
)

# Find all structural neighbours of a given chain
df_edges = pd.read_sql(
    "SELECT * FROM edge WHERE pdb_id_A = '1ivo' AND auth_asym_id_A = 'A'",
    conn
)

conn.close()
```

### Filter by structural similarity

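```python
# Retrieve pairs with high TM-score and low RMSD.
# Requires `pd` and an open `conn` as in the first example
# (run this before calling conn.close()).
df = pd.read_sql("""
    SELECT *
    FROM edge
    WHERE CAST(TM1 AS REAL) > 0.8
      AND CAST(RMSD AS REAL) < 2.0
    LIMIT 1000
""", conn)
```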

### Join nodes and edges

```python
# Get chain-level info for both chains in each pair.
# Requires `pd` and an open `conn` as in the first example
# (run this before calling conn.close()).
df = pd.read_sql("""
    SELECT
        e.pdb_id_A, e.auth_asym_id_A,
        e.pdb_id_B, e.auth_asym_id_B,
        e.TM1, e.RMSD,
        nA.sequence_length AS len_A,
        nB.sequence_length AS len_B,
        nA.cath_superfamily
    FROM edge e
    JOIN node nA ON e.pdb_id_A = nA.pdb_id AND e.auth_asym_id_A = nA.auth_asym_id
    JOIN node nB ON e.pdb_id_B = nB.pdb_id AND e.auth_asym_id_B = nB.auth_asym_id
    WHERE nA.uniprot_id = 'P00533'
    LIMIT 500
""", conn)
```

---

## Notes

- All numeric fields (energies, TM-score, RMSD, lengths) are stored as `TEXT`, even though the schema tables above list their logical types as `FLOAT` or `INT`; cast them with `CAST(col AS REAL)` or `CAST(col AS INTEGER)` as needed.
- `ranked_functions` in the `node` table is a JSON string. Parse it with `json.loads()`.
- The `edge` table is directional: (A→B) and (B→A) are separate rows and may differ slightly in TM-score.
- Energy delta values represent B − A; a negative delta means chain B is lower energy than chain A.
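
The notes above can be sketched together on a toy database: casting text-stored numbers, parsing `ranked_functions`, and reading the sign of an energy delta. The rows and the function strings are placeholders, the element format inside `ranked_functions` is an assumption (only "a JSON-encoded list" is documented), and the helper `parse_ranked` is a hypothetical name.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (pdb_id TEXT, auth_asym_id TEXT, "
    "sequence_length TEXT, ranked_functions TEXT)"
)
conn.execute("CREATE TABLE edge (pdb_id_A TEXT, pdb_id_B TEXT, delta_Rosetta TEXT)")

# Placeholder rows; numbers are deliberately stored as text.
conn.execute(
    "INSERT INTO node VALUES (?, ?, ?, ?)",
    ("1abc", "A", "120", json.dumps(["placeholder function 1", "placeholder function 2"])),
)
conn.executemany(
    "INSERT INTO edge VALUES (?, ?, ?)",
    [("1abc", "2def", "-12.5"), ("2def", "1abc", "12.5")],
)


def parse_ranked(raw):
    """Decode a ranked_functions value, treating NULL or empty text as an empty list."""
    return json.loads(raw) if raw else []


# CAST converts the text-stored length into an integer inside SQL.
length, raw = conn.execute(
    "SELECT CAST(sequence_length AS INTEGER), ranked_functions FROM node"
).fetchone()
functions = parse_ranked(raw)

# A negative delta (B - A) means chain B has the lower energy.
b_lower = conn.execute(
    "SELECT pdb_id_A, pdb_id_B FROM edge WHERE CAST(delta_Rosetta AS REAL) < 0"
).fetchall()

print(length, functions, b_lower)
```

Because the `edge` table is directional, the reverse row (`2def` to `1abc`) carries the opposite sign and is excluded by the same filter.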