MuSProt Dataset Documentation
MuSProt (Multistate Protein Database) is a million-scale multimodal database for multistate proteins, designed to support programmable protein design and AI model development. It links experimentally observed conformational states of identical protein sequences from the PDB, organizes them into state clusters and transition relationships, and enriches each record with structural similarity, experimental context, state-specific function rankings and transition fidelity labels. Users can search, browse and download MuSProt records to study conformational diversity, state-dependent functions and feasible protein state transitions.
Download
The full dataset is distributed as a single SQLite database file (MuSProt.db, ~6.3 GB). Download it from the dataset page using the Download DB button.
For detailed protein structures and atom coordinates, download the corresponding entries from the Protein Data Bank (PDB) as mmCIF or .pdb files and select the chain of interest using the PDB entry ID and chain ID.
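As a minimal sketch of that step, the helper below builds the download URL for an entry's mmCIF file and saves the file locally. It assumes the public RCSB file server pattern `https://files.rcsb.org/download/<ID>.cif`; the function names and the `structures` output directory are illustrative and not part of MuSProt. Selecting the chain by `auth_asym_id` is left to the structure parser of your choice.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumption: RCSB serves mmCIF files under this URL pattern.
RCSB_URL = "https://files.rcsb.org/download/{pdb_id}.cif"


def mmcif_url(pdb_id: str) -> str:
    """Return the mmCIF download URL for a 4-character PDB entry ID."""
    pdb_id = pdb_id.strip()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"not a 4-character PDB ID: {pdb_id!r}")
    return RCSB_URL.format(pdb_id=pdb_id.upper())


def fetch_mmcif(pdb_id: str, out_dir: str = "structures") -> Path:
    """Download the mmCIF file for one entry and return the local path."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    target = out / f"{pdb_id.strip().lower()}.cif"
    if not target.exists():  # skip entries that were already downloaded
        urlretrieve(mmcif_url(pdb_id), str(target))
    return target
```

Calling `fetch_mmcif("1ivo")` would then place `structures/1ivo.cif` on disk, ready to be filtered down to the chain listed in the `node` table.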
Database Schema
The database contains two tables: node and edge.
Table: node
Each row represents a single protein chain instance (one PDB entry + chain).
| Column | Type | Description |
|---|---|---|
| `uniprot_id` | TEXT | UniProt accession number |
| `pdb_id` | TEXT | 4-character PDB entry ID (lowercase) |
| `auth_asym_id` | TEXT | Author chain identifier (e.g. A) |
| `base_label` | TEXT | Canonical label combining UniProt ID and chain state |
| `sequence` | TEXT | SEQRES amino acid sequence |
| `sequence_length` | INT | Number of residues in the chain |
| `original_metals` | TEXT | Metal ions present in the structure |
| `original_ligands` | TEXT | Small-molecule ligands present in the structure |
| `sequence_id` | TEXT | Internal sequence cluster identifier |
| `state_id` | TEXT | Conformational state cluster this chain is assigned to (0, 1, 2, …) |
| `CATH_ID` | TEXT | CATH domain assignment |
| `cath_class` | TEXT | CATH class (e.g. 1 = Mainly Alpha) |
| `cath_arch` | TEXT | CATH architecture |
| `cath_topo` | TEXT | CATH topology |
| `cath_homology` | TEXT | CATH homology superfamily |
| `cath_superfamily` | TEXT | Full CATH superfamily code (e.g. 1.10.10.10) |
| `domain_length` | INT | Length of the matched CATH domain |
| `experimental_method` | TEXT | Structure determination method (e.g. X-RAY DIFFRACTION, ELECTRON MICROSCOPY, SOLUTION NMR) |
| `pH` | FLOAT | pH of the experimental / crystallization condition |
| `temp_K` | FLOAT | Temperature of the experiment, in Kelvin |
| `experimental_details` | TEXT | Free-text crystallization / sample-preparation details |
| `resolution` | FLOAT | Experimental resolution in Å (lower is sharper; empty for methods without a resolution) |
| `Rosetta` | FLOAT | Rosetta total energy score |
| `FoldX` | FLOAT | FoldX total energy score |
| `EvoEF2` | FLOAT | EvoEF2 total energy score |
| `RW` | FLOAT | Random-Walk (RW) energy score |
| `RW+` | FLOAT | Random-Walk+ (RWplus) energy score |
| `ranked_functions` | TEXT | JSON-encoded list of ranked GO/functional annotations |
| `chain_composition` | TEXT | Quaternary composition of the deposited assembly: monomeric, homomeric, or heteromeric |
| `non_protein_polymer_binding` | TEXT | Non-protein polymer bound in the structure (DNA, RNA, or DNA/RNA); empty when none is present |
Table: edge
Each row represents a pairwise structural comparison between two chain instances (A and B) that share the same UniProt identity.
| Column | Type | Description |
|---|---|---|
| `pdb_id_A` | TEXT | PDB ID of chain A |
| `auth_asym_id_A` | TEXT | Chain identifier of chain A |
| `pdb_id_B` | TEXT | PDB ID of chain B |
| `auth_asym_id_B` | TEXT | Chain identifier of chain B |
| `TM1` | FLOAT | TM-score of the alignment (chain A as reference) |
| `RMSD` | FLOAT | Root-mean-square deviation of Cα atoms (Å) |
| `structure_sim` | FLOAT | Composite structural similarity score |
| `delta_Rosetta` | FLOAT | Rosetta energy difference (B − A) |
| `delta_FoldX` | FLOAT | FoldX energy difference (B − A) |
| `delta_EvoEF2` | FLOAT | EvoEF2 energy difference (B − A) |
| `delta_RW` | FLOAT | Random-Walk (RW) energy difference (B − A) |
| `delta_RW+` | FLOAT | Random-Walk+ (RWplus) energy difference (B − A) |
| `state_id_A` | TEXT | Conformational state cluster of chain A |
| `state_id_B` | TEXT | Conformational state cluster of chain B |
| `avg_sim` | TEXT | Average structural similarity within the state cluster (>0.95 or a numeric value) |
| `state_fidelity` | TEXT | State-level transition fidelity label: identical, high, medium, or low |
| `observation_fidelity` | TEXT | Observation-level transition fidelity label: identical, high, medium, or low |
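To illustrate how the state and fidelity columns fit together, the sketch below counts edges that connect two different state clusters, grouped by their `state_fidelity` label. The function name is hypothetical, and the query only assumes the `edge` columns listed above.

```python
import sqlite3


def count_cross_state_edges(conn: sqlite3.Connection) -> dict:
    """Count edges linking two different state clusters, per state_fidelity label."""
    rows = conn.execute(
        """
        SELECT state_fidelity, COUNT(*)
        FROM edge
        WHERE state_id_A <> state_id_B   -- keep only state-changing pairs
        GROUP BY state_fidelity
        """
    ).fetchall()
    return {label: n for label, n in rows}
```

Pass it a connection opened with `sqlite3.connect("MuSProt.db")`. Because the query scans the whole `edge` table, it may take a while on the full database.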
Usage Examples
Python (sqlite3)
```python
import sqlite3

import pandas as pd

conn = sqlite3.connect("MuSProt.db")

# Load all chains for a UniProt entry
df_nodes = pd.read_sql(
    "SELECT * FROM node WHERE uniprot_id = 'P00533'",
    conn,
)

# Find all structural neighbours of a given chain
df_edges = pd.read_sql(
    "SELECT * FROM edge WHERE pdb_id_A = '1ivo' AND auth_asym_id_A = 'A'",
    conn,
)

conn.close()
```
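The queries above hard-code the accession and chain in the SQL string. When those values come from user input or a loop, bound parameters are usually safer. The sketch below uses only the standard `sqlite3` module; the function name `neighbours` and the `min_tm` threshold are illustrative. `pd.read_sql` accepts the same `?` placeholders through its `params` argument.

```python
import sqlite3


def neighbours(conn, pdb_id, chain_id, min_tm=0.0):
    """Return edges starting at (pdb_id, chain_id) with TM1 >= min_tm, best first."""
    cur = conn.cursor()
    cur.row_factory = sqlite3.Row  # rows become addressable by column name
    sql = """
        SELECT pdb_id_B, auth_asym_id_B,
               CAST(TM1 AS REAL)  AS tm,
               CAST(RMSD AS REAL) AS rmsd
        FROM edge
        WHERE pdb_id_A = ? AND auth_asym_id_A = ?
          AND CAST(TM1 AS REAL) >= ?
        ORDER BY tm DESC
    """
    # PDB IDs are stored in lowercase, so normalise the input before binding.
    cur.execute(sql, (pdb_id.lower(), chain_id, min_tm))
    return [dict(row) for row in cur.fetchall()]
```

For example, `neighbours(conn, "1ivo", "A", min_tm=0.8)` would list the partners of chain A in entry 1ivo whose TM-score is at least 0.8.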
Filter by structural similarity
```python
# Retrieve pairs with high TM-score and low RMSD.
# Assumes `pd` is imported and `conn` is an open connection to MuSProt.db.
df = pd.read_sql("""
    SELECT *
    FROM edge
    WHERE CAST(TM1 AS REAL) > 0.8
      AND CAST(RMSD AS REAL) < 2.0
    LIMIT 1000
""", conn)
```
Join nodes and edges
# Get full info for both chains in each pair
df = pd.read_sql("""
SELECT
e.pdb_id_A, e.auth_asym_id_A,
e.pdb_id_B, e.auth_asym_id_B,
e.TM1, e.RMSD,
nA.sequence_length AS len_A,
nB.sequence_length AS len_B,
nA.cath_superfamily
FROM edge e
JOIN node nA ON e.pdb_id_A = nA.pdb_id AND e.auth_asym_id_A = nA.auth_asym_id
JOIN node nB ON e.pdb_id_B = nB.pdb_id AND e.auth_asym_id_B = nB.auth_asym_id
WHERE nA.uniprot_id = 'P00533'
LIMIT 500
""", conn)
Notes
- All numeric fields (energies, TM-score, RMSD, lengths) are stored as TEXT; cast them with `CAST(col AS REAL)` or `CAST(col AS INTEGER)` as needed.
- `ranked_functions` in the `node` table is a JSON string. Parse it with `json.loads()`.
- The `edge` table is directional: (A→B) and (B→A) are separate rows and may differ slightly in TM-score.
- Energy delta values represent B − A; a negative delta means chain B is lower energy than chain A.
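The first two notes can be handled on the Python side with small helpers such as the ones sketched below. Both names are hypothetical. The helpers assume that missing values appear as NULL or empty strings, and they make no assumption about the shape of the items inside `ranked_functions`. Do not pass `avg_sim` to `to_float`, since that column may hold the literal `>0.95`.

```python
import json


def parse_ranked_functions(raw):
    """Decode the ranked_functions JSON string; return [] for empty or NULL values."""
    if raw is None or not str(raw).strip():
        return []
    return json.loads(raw)


def to_float(raw):
    """Convert a TEXT-stored numeric field to float; return None when it is empty."""
    if raw is None or not str(raw).strip():
        return None
    return float(raw)
```

With a pandas DataFrame of `node` rows, these can be applied column-wise, for instance `df_nodes["ranked_functions"].map(parse_ranked_functions)`.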