MuSProt Dataset Documentation

MuSProt (Multistate Protein Database) is a million-scale multimodal database for multistate proteins, designed to support programmable protein design and AI model development. It links experimentally observed conformational states of identical protein sequences from the PDB, organizes them into state clusters and transition relationships, and enriches each record with structural similarity, experimental context, state-specific function rankings and transition fidelity labels. Users can search, browse and download MuSProt records to study conformational diversity, state-dependent functions and feasible protein state transitions.


Download

The full dataset is distributed as a single SQLite database file (MuSProt.db, ~6.3 GB). Download it from the dataset page using the Download DB button.
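Once the file is downloaded, a quick sanity check is to open it and list its tables. The sketch below uses only the standard-library sqlite3 module; the helper name list_tables and the local path MuSProt.db are assumptions for illustration, not part of the dataset tooling.

```python
import os
import sqlite3


def list_tables(conn):
    """Return the names of all tables in an open SQLite connection, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    path = "MuSProt.db"  # assumed download location
    if os.path.exists(path):
        # Open read-only so a wrong path never creates an empty database file.
        conn = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
        print(list_tables(conn))
        conn.close()
```

If the download is intact, the printed list should include the node and edge tables described in the schema.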

The database does not include atomic coordinates. For detailed protein structures, download the corresponding entries from the Protein Data Bank (PDB) and extract the relevant chain from the mmCIF or .pdb file using the PDB entry ID and chain ID.
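As a minimal sketch of that step, the helpers below build a download URL for an entry and pull one chain out of legacy PDB-format text. Both function names are hypothetical, and the URL pattern (files.rcsb.org/download/) is an assumption about the RCSB file service rather than something specified by MuSProt. The chain filter relies on the fixed-column PDB format, where the chain ID sits in column 22; for mmCIF files the equivalent field is _atom_site.auth_asym_id, and a dedicated parser is the safer choice there.

```python
def rcsb_download_url(pdb_id, fmt="cif"):
    """Build a download URL for a PDB entry (assumed RCSB URL pattern)."""
    if fmt not in ("cif", "pdb"):
        raise ValueError("fmt must be 'cif' or 'pdb'")
    return f"https://files.rcsb.org/download/{pdb_id.lower()}.{fmt}"


def extract_chain_from_pdb_text(pdb_text, chain_id):
    """Keep ATOM/HETATM records whose chain ID (column 22) equals chain_id."""
    kept = []
    for line in pdb_text.splitlines():
        is_coordinate = line.startswith(("ATOM", "HETATM"))
        # Column 22 in the PDB format is index 21 in a Python string.
        if is_coordinate and len(line) > 21 and line[21] == chain_id:
            kept.append(line)
    return kept
```

Here pdb_id and chain_id correspond to the pdb_id and auth_asym_id columns of the node table. The legacy PDB format only holds single-character chain IDs, so entries with longer author chain identifiers need the mmCIF route.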


Database Schema

The database contains two tables: node and edge.


Table: node

Each row represents a single protein chain instance (one PDB entry + chain).

| Column | Type | Description |
|---|---|---|
| uniprot_id | TEXT | UniProt accession number |
| pdb_id | TEXT | 4-character PDB entry ID (lowercase) |
| auth_asym_id | TEXT | Author chain identifier (e.g. A) |
| base_label | TEXT | Canonical label combining UniProt ID and chain state |
| sequence | TEXT | SEQRES amino acid sequence |
| sequence_length | INT | Number of residues in the chain |
| original_metals | TEXT | Metal ions present in the structure |
| original_ligands | TEXT | Small-molecule ligands present in the structure |
| sequence_id | TEXT | Internal sequence cluster identifier |
| state_id | TEXT | Conformational state cluster this chain is assigned to (0, 1, 2, …) |
| CATH_ID | TEXT | CATH domain assignment |
| cath_class | TEXT | CATH class (e.g. 1 = Mainly Alpha) |
| cath_arch | TEXT | CATH architecture |
| cath_topo | TEXT | CATH topology |
| cath_homology | TEXT | CATH homology superfamily |
| cath_superfamily | TEXT | Full CATH superfamily code (e.g. 1.10.10.10) |
| domain_length | INT | Length of the matched CATH domain |
| experimental_method | TEXT | Structure determination method (e.g. X-RAY DIFFRACTION, ELECTRON MICROSCOPY, SOLUTION NMR) |
| pH | FLOAT | pH of the experimental / crystallization condition |
| temp_K | FLOAT | Temperature of the experiment, in Kelvin |
| experimental_details | TEXT | Free-text crystallization / sample-preparation details |
| resolution | FLOAT | Experimental resolution in Å (lower is sharper; empty for methods without a resolution) |
| Rosetta | FLOAT | Rosetta total energy score |
| FoldX | FLOAT | FoldX total energy score |
| EvoEF2 | FLOAT | EvoEF2 total energy score |
| RW | FLOAT | Random-Walk (RW) energy score |
| RW+ | FLOAT | Random-Walk+ (RWplus) energy score |
| ranked_functions | TEXT | JSON-encoded list of ranked GO/functional annotations |
| chain_composition | TEXT | Quaternary composition of the deposited assembly: monomeric, homomeric, or heteromeric |
| non_protein_polymer_binding | TEXT | Non-protein polymer bound in the structure (DNA, RNA, or DNA/RNA); empty when none is present |
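Because every node row carries a state_id, the chains observed for one protein can be grouped by conformational state with a single query. The following is a minimal sketch, assuming the column names listed above; the helper name chains_by_state is hypothetical, and the query uses a ? placeholder instead of string formatting so the accession is passed safely.

```python
import sqlite3
from collections import defaultdict


def chains_by_state(conn, uniprot_id):
    """Map each state_id to the (pdb_id, auth_asym_id) chains assigned to it."""
    query = (
        "SELECT state_id, pdb_id, auth_asym_id FROM node "
        "WHERE uniprot_id = ? "
        "ORDER BY state_id, pdb_id, auth_asym_id"
    )
    groups = defaultdict(list)
    for state_id, pdb_id, chain in conn.execute(query, (uniprot_id,)):
        groups[state_id].append((pdb_id, chain))
    return dict(groups)
```

Called as chains_by_state(conn, "P00533") on an open connection to MuSProt.db, this returns a dictionary keyed by state_id, which gives a quick view of how many distinct states a protein has and which chains represent each one.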

Table: edge

Each row represents a pairwise structural comparison between two chain instances (A and B) that share the same UniProt identity.

| Column | Type | Description |
|---|---|---|
| pdb_id_A | TEXT | PDB ID of chain A |
| auth_asym_id_A | TEXT | Chain identifier of chain A |
| pdb_id_B | TEXT | PDB ID of chain B |
| auth_asym_id_B | TEXT | Chain identifier of chain B |
| TM1 | FLOAT | TM-score of the alignment (chain A as reference) |
| RMSD | FLOAT | Root-mean-square deviation of Cα atoms (Å) |
| structure_sim | FLOAT | Composite structural similarity score |
| delta_Rosetta | FLOAT | Rosetta energy difference (B − A) |
| delta_FoldX | FLOAT | FoldX energy difference (B − A) |
| delta_EvoEF2 | FLOAT | EvoEF2 energy difference (B − A) |
| delta_RW | FLOAT | Random-Walk (RW) energy difference (B − A) |
| delta_RW+ | FLOAT | Random-Walk+ (RWplus) energy difference (B − A) |
| state_id_A | TEXT | Conformational state cluster of chain A |
| state_id_B | TEXT | Conformational state cluster of chain B |
| avg_sim | TEXT | Average structural similarity within the state cluster (>0.95 or a numeric value) |
| state_fidelity | TEXT | State-level transition fidelity label: identical, high, medium, or low |
| observation_fidelity | TEXT | Observation-level transition fidelity label: identical, high, medium, or low |
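One way to use these columns is to list the edges that leave a given chain and cross into a different state cluster, filtered by fidelity label. The sketch below is illustrative only: the helper name state_transitions is hypothetical, and filtering on state_fidelity is just one possible choice. Note that column names containing +, such as delta_RW+, must be wrapped in double quotes in SQL.

```python
import sqlite3

ALLOWED_FIDELITY = ("identical", "high", "medium", "low")


def state_transitions(conn, pdb_id, chain, fidelity="high"):
    """Return edges from one chain to chains in a different state cluster."""
    if fidelity not in ALLOWED_FIDELITY:
        raise ValueError(f"fidelity must be one of {ALLOWED_FIDELITY}")
    query = """
        SELECT pdb_id_B, auth_asym_id_B, state_id_A, state_id_B,
               CAST("delta_RW+" AS REAL) AS delta_rwplus
        FROM edge
        WHERE pdb_id_A = ? AND auth_asym_id_A = ?
          AND state_id_A <> state_id_B
          AND state_fidelity = ?
        ORDER BY pdb_id_B, auth_asym_id_B
    """
    return conn.execute(query, (pdb_id, chain, fidelity)).fetchall()
```

Each returned tuple holds the partner chain, both state IDs, and the RWplus energy difference as a float, so state_transitions(conn, "1ivo", "A") would list the high-fidelity cross-state partners of chain A in entry 1ivo, if any exist.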

Usage Examples

Python (sqlite3)

import sqlite3
import pandas as pd

conn = sqlite3.connect("MuSProt.db")

# Load all chains for a UniProt entry
df_nodes = pd.read_sql(
    "SELECT * FROM node WHERE uniprot_id = 'P00533'",
    conn
)

# Find all structural neighbours of a given chain
df_edges = pd.read_sql(
    "SELECT * FROM edge WHERE pdb_id_A = '1ivo' AND auth_asym_id_A = 'A'",
    conn
)

conn.close()

Filter by structural similarity

import sqlite3
import pandas as pd

conn = sqlite3.connect("MuSProt.db")

# Retrieve pairs with high TM-score and low RMSD
df = pd.read_sql("""
    SELECT *
    FROM edge
    WHERE CAST(TM1 AS REAL) > 0.8
      AND CAST(RMSD AS REAL) < 2.0
    LIMIT 1000
""", conn)

conn.close()

Join nodes and edges

import sqlite3
import pandas as pd

conn = sqlite3.connect("MuSProt.db")

# Get chain lengths and CATH superfamily alongside each pair's similarity scores
df = pd.read_sql("""
    SELECT
        e.pdb_id_A, e.auth_asym_id_A,
        e.pdb_id_B, e.auth_asym_id_B,
        e.TM1, e.RMSD,
        nA.sequence_length AS len_A,
        nB.sequence_length AS len_B,
        nA.cath_superfamily
    FROM edge e
    JOIN node nA ON e.pdb_id_A = nA.pdb_id AND e.auth_asym_id_A = nA.auth_asym_id
    JOIN node nB ON e.pdb_id_B = nB.pdb_id AND e.auth_asym_id_B = nB.auth_asym_id
    WHERE nA.uniprot_id = 'P00533'
    LIMIT 500
""", conn)

conn.close()

Notes

  • All numeric fields (energies, TM-score, RMSD, lengths) are stored as TEXT; cast them with CAST(col AS REAL) or CAST(col AS INTEGER) as needed.
  • ranked_functions in the node table is a JSON string. Parse it with json.loads().
  • The edge table is directional: (A→B) and (B→A) are separate rows and may differ slightly in TM-score.
  • Energy delta values represent B − A; a negative delta means chain B is lower energy than chain A.
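The notes above can be sketched as a few small helpers applied to values read from the database. All three function names are hypothetical. parse_avg_sim assumes the non-numeric form of avg_sim is the literal string >0.95, and parse_ranked_functions makes no assumption about the structure inside the JSON beyond it being valid.

```python
import json


def parse_ranked_functions(value):
    """Decode the ranked_functions JSON string; empty or NULL becomes []."""
    if value is None or value == "":
        return []
    return json.loads(value)


def parse_avg_sim(value):
    """Return (number, is_lower_bound) for an avg_sim cell like '>0.95' or '0.87'."""
    text = str(value).strip()
    if text.startswith(">"):
        # The stored value is a lower bound rather than an exact average.
        return float(text[1:]), True
    return float(text), False


def lower_energy_chain(delta):
    """Interpret a B - A energy delta: which chain has the lower energy?"""
    delta = float(delta)  # values may arrive as TEXT
    if delta < 0:
        return "B"
    if delta > 0:
        return "A"
    return "tie"
```

Keeping the lower-bound flag separate from the number avoids silently treating >0.95 as exactly 0.95 in later comparisons.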
