MuSProt Dataset Documentation

MuSProt (Multistate Protein Database) is a million-scale multimodal database for multistate proteins, designed to support programmable protein design and AI model development. It links experimentally observed conformational states of identical protein sequences from the PDB, organizes them into state clusters and transition relationships, and enriches each record with structural similarity, experimental context, state-specific function rankings and transition fidelity labels. Users can search, browse and download MuSProt records to study conformational diversity, state-dependent functions and feasible protein state transitions.

Download

The full dataset is distributed as a single SQLite database file (MuSProt.db, ~6.3 GB). Download it from the dataset page using the Download DB button.

For detailed protein structures and atomic coordinates, download the corresponding entries from the Protein Data Bank (PDB) and extract the relevant chain from the mmCIF or .pdb file using the PDB entry ID and chain ID.
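As a minimal sketch of the lookup step, the helper below turns a `pdb_id` value into a download URL. It assumes RCSB's public file-download URL pattern and classic 4-character IDs; the function name `rcsb_mmcif_url` is hypothetical and not part of MuSProt.

```python
def rcsb_mmcif_url(pdb_id: str) -> str:
    """Build the RCSB download URL for one entry's mmCIF file.

    Assumption: the files.rcsb.org download pattern and a 4-character ID.
    """
    pdb_id = pdb_id.strip().lower()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"expected a 4-character PDB ID, got {pdb_id!r}")
    return f"https://files.rcsb.org/download/{pdb_id}.cif"


# Example: the entry used in the usage examples of this document
print(rcsb_mmcif_url("1IVO"))
```

Within the downloaded mmCIF file, the chain can then be selected by matching the `auth_asym_id` value against the `_atom_site.auth_asym_id` field.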

Database Schema

The database contains two tables: node and edge.
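Before writing queries, it can help to check which columns and declared types the downloaded file actually contains. The sketch below uses SQLite's `PRAGMA table_info`; the helper name `column_types` is hypothetical, and the demo runs on an in-memory stand-in table rather than on `MuSProt.db`.

```python
import sqlite3


def column_types(conn: sqlite3.Connection, table: str) -> dict[str, str]:
    """Return {column name: declared type} for one table."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk)
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    return {name: decl_type for _, name, decl_type, *_ in rows}


# In-memory stand-in with a few columns; replace with
# sqlite3.connect("MuSProt.db") to inspect the real file.
demo = sqlite3.connect(":memory:")
demo.execute('CREATE TABLE node (pdb_id TEXT, auth_asym_id TEXT, "RW+" TEXT)')
print(column_types(demo, "node"))
```

Column names that contain `+`, such as `RW+` and `delta_RW+`, need double quotes when used in SQL, as in the `CREATE TABLE` statement above.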

Table: `node`

Each row represents a single protein chain instance (one PDB entry + chain).

| Column | Type | Description |
|---|---|---|
| `uniprot_id` | TEXT | UniProt accession number |
| `pdb_id` | TEXT | 4-character PDB entry ID (lowercase) |
| `auth_asym_id` | TEXT | Author chain identifier (e.g. `A`) |
| `base_label` | TEXT | Canonical label combining UniProt ID and chain state |
| `sequence` | TEXT | SEQRES amino acid sequence |
| `sequence_length` | INT | Number of residues in the chain |
| `original_metals` | TEXT | Metal ions present in the structure |
| `original_ligands` | TEXT | Small-molecule ligands present in the structure |
| `sequence_id` | TEXT | Internal sequence cluster identifier |
| `state_id` | TEXT | Conformational state cluster this chain is assigned to (`0`, `1`, `2`, …) |
| `CATH_ID` | TEXT | CATH domain assignment |
| `cath_class` | TEXT | CATH class (e.g. `1` = Mainly Alpha) |
| `cath_arch` | TEXT | CATH architecture |
| `cath_topo` | TEXT | CATH topology |
| `cath_homology` | TEXT | CATH homology superfamily |
| `cath_superfamily` | TEXT | Full CATH superfamily code (e.g. `1.10.10.10`) |
| `domain_length` | INT | Length of the matched CATH domain |
| `experimental_method` | TEXT | Structure determination method (e.g. `X-RAY DIFFRACTION`, `ELECTRON MICROSCOPY`, `SOLUTION NMR`) |
| `pH` | FLOAT | pH of the experimental / crystallization condition |
| `temp_K` | FLOAT | Temperature of the experiment, in Kelvin |
| `experimental_details` | TEXT | Free-text crystallization / sample-preparation details |
| `resolution` | FLOAT | Experimental resolution in Å (lower is sharper; empty for methods without a resolution) |
| `Rosetta` | FLOAT | Rosetta total energy score |
| `FoldX` | FLOAT | FoldX total energy score |
| `EvoEF2` | FLOAT | EvoEF2 total energy score |
| `RW` | FLOAT | Random-Walk (RW) energy score |
| `RW+` | FLOAT | Random-Walk+ (RWplus) energy score |
| `ranked_functions` | TEXT | JSON-encoded list of ranked GO/functional annotations |
| `chain_composition` | TEXT | Quaternary composition of the deposited assembly: `monomeric`, `homomeric`, or `heteromeric` |
| `non_protein_polymer_binding` | TEXT | Non-protein polymer bound in the structure (`DNA`, `RNA`, or `DNA/RNA`); empty when none is present |
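The `sequence_id` and `state_id` columns make it possible to list sequences observed in more than one conformational state. The sketch below is one way to do that with plain `sqlite3`. It assumes that `state_id` values are numbered within each `sequence_id` group, which the schema does not state explicitly; the demo rows are invented for illustration.

```python
import sqlite3

# Assumption: state_id values are numbered within each sequence_id group.
MULTISTATE_SQL = """
    SELECT sequence_id,
           COUNT(DISTINCT state_id) AS n_states,
           COUNT(*)                 AS n_chains
    FROM node
    GROUP BY sequence_id
    HAVING COUNT(DISTINCT state_id) > 1
    ORDER BY n_states DESC, sequence_id
"""


def multistate_sequences(conn):
    """Return (sequence_id, n_states, n_chains) for sequences with 2+ states."""
    return conn.execute(MULTISTATE_SQL).fetchall()


# Toy stand-in for MuSProt.db; all rows below are made up.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE node (pdb_id TEXT, auth_asym_id TEXT, sequence_id TEXT, state_id TEXT)"
)
demo.executemany(
    "INSERT INTO node VALUES (?, ?, ?, ?)",
    [
        ("aaaa", "A", "seq1", "0"),
        ("bbbb", "A", "seq1", "1"),
        ("cccc", "B", "seq1", "1"),
        ("dddd", "A", "seq2", "0"),
    ],
)
print(multistate_sequences(demo))
```

On the real database, pass a connection to `MuSProt.db` instead of the in-memory demo.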

Table: `edge`

Each row represents a pairwise structural comparison between two chain instances (A and B) that share the same UniProt identity.

| Column | Type | Description |
|---|---|---|
| `pdb_id_A` | TEXT | PDB ID of chain A |
| `auth_asym_id_A` | TEXT | Chain identifier of chain A |
| `pdb_id_B` | TEXT | PDB ID of chain B |
| `auth_asym_id_B` | TEXT | Chain identifier of chain B |
| `TM1` | FLOAT | TM-score of the alignment (chain A as reference) |
| `RMSD` | FLOAT | Root-mean-square deviation of Cα atoms (Å) |
| `structure_sim` | FLOAT | Composite structural similarity score |
| `delta_Rosetta` | FLOAT | Rosetta energy difference (B − A) |
| `delta_FoldX` | FLOAT | FoldX energy difference (B − A) |
| `delta_EvoEF2` | FLOAT | EvoEF2 energy difference (B − A) |
| `delta_RW` | FLOAT | Random-Walk (RW) energy difference (B − A) |
| `delta_RW+` | FLOAT | Random-Walk+ (RWplus) energy difference (B − A) |
| `state_id_A` | TEXT | Conformational state cluster of chain A |
| `state_id_B` | TEXT | Conformational state cluster of chain B |
| `avg_sim` | TEXT | Average structural similarity within the state cluster (`>0.95` or a numeric value) |
| `state_fidelity` | TEXT | State-level transition fidelity label: `identical`, `high`, `medium`, or `low` |
| `observation_fidelity` | TEXT | Observation-level transition fidelity label: `identical`, `high`, `medium`, or `low` |
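Because `avg_sim` can hold either a number or the string `>0.95`, a plain SQL cast is not safe for this column: SQLite evaluates `CAST('>0.95' AS REAL)` to `0.0`. The sketch below parses the value in Python instead. The helper name `parse_avg_sim` and its return shape are illustrative choices, not part of MuSProt.

```python
import sqlite3


def parse_avg_sim(value):
    """Parse an avg_sim cell.

    Returns None for an empty cell, otherwise (is_lower_bound, number),
    where is_lower_bound is True for values written like '>0.95'.
    """
    if value is None or not str(value).strip():
        return None
    text = str(value).strip()
    if text.startswith(">"):
        return True, float(text[1:])
    return False, float(text)


print(parse_avg_sim(">0.95"))
print(parse_avg_sim("0.87"))

# What a naive SQL cast does with the thresholded form
naive = sqlite3.connect(":memory:").execute("SELECT CAST('>0.95' AS REAL)").fetchone()[0]
print(naive)
```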

Usage Examples

Python (sqlite3 and pandas)

import sqlite3
import pandas as pd

conn = sqlite3.connect("MuSProt.db")

# Load all chains for a UniProt entry
df_nodes = pd.read_sql(
    "SELECT * FROM node WHERE uniprot_id = 'P00533'",
    conn
)

# Find all structural neighbours of a given chain
df_edges = pd.read_sql(
    "SELECT * FROM edge WHERE pdb_id_A = '1ivo' AND auth_asym_id_A = 'A'",
    conn
)

conn.close()

Filter by structural similarity

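import sqlite3
import pandas as pd

# Open a connection (the first example closes its own)
conn = sqlite3.connect("MuSProt.db")

# Retrieve pairs with high TM-score and low RMSD
df = pd.read_sql("""
    SELECT *
    FROM edge
    WHERE CAST(TM1 AS REAL) > 0.8
      AND CAST(RMSD AS REAL) < 2.0
    LIMIT 1000
""", conn)

conn.close()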

Join nodes and edges

import sqlite3
import pandas as pd

# Open a connection (the first example closes its own)
conn = sqlite3.connect("MuSProt.db")

# Get full info for both chains in each pair
df = pd.read_sql("""
    SELECT
        e.pdb_id_A, e.auth_asym_id_A,
        e.pdb_id_B, e.auth_asym_id_B,
        e.TM1, e.RMSD,
        nA.sequence_length AS len_A,
        nB.sequence_length AS len_B,
        nA.cath_superfamily
    FROM edge e
    JOIN node nA ON e.pdb_id_A = nA.pdb_id AND e.auth_asym_id_A = nA.auth_asym_id
    JOIN node nB ON e.pdb_id_B = nB.pdb_id AND e.auth_asym_id_B = nB.auth_asym_id
    WHERE nA.uniprot_id = 'P00533'
    LIMIT 500
""", conn)

conn.close()
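The same join pattern can summarize state transitions for one protein. The sketch below counts edges whose two chains sit in different state clusters, grouped by `state_fidelity`. It assumes each (`pdb_id`, `auth_asym_id`) pair appears once in `node`, since duplicate rows would inflate the counts. The accession `X00000` and all demo rows are invented.

```python
import sqlite3

TRANSITION_SQL = """
    SELECT e.state_id_A, e.state_id_B, e.state_fidelity, COUNT(*) AS n_pairs
    FROM edge e
    JOIN node nA ON e.pdb_id_A = nA.pdb_id AND e.auth_asym_id_A = nA.auth_asym_id
    WHERE nA.uniprot_id = ?
      AND e.state_id_A <> e.state_id_B
    GROUP BY e.state_id_A, e.state_id_B, e.state_fidelity
    ORDER BY e.state_id_A, e.state_id_B, e.state_fidelity
"""


def state_transitions(conn, uniprot_id):
    """Count cross-state edges per (state A, state B, fidelity label)."""
    return conn.execute(TRANSITION_SQL, (uniprot_id,)).fetchall()


# Toy stand-in for MuSProt.db with only the columns the query needs.
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE node (uniprot_id TEXT, pdb_id TEXT, auth_asym_id TEXT);
    CREATE TABLE edge (pdb_id_A TEXT, auth_asym_id_A TEXT,
                       pdb_id_B TEXT, auth_asym_id_B TEXT,
                       state_id_A TEXT, state_id_B TEXT, state_fidelity TEXT);
    INSERT INTO node VALUES
        ('X00000', 'aaaa', 'A'), ('X00000', 'bbbb', 'A'), ('X00000', 'cccc', 'A');
    INSERT INTO edge VALUES
        ('aaaa', 'A', 'bbbb', 'A', '0', '1', 'high'),
        ('aaaa', 'A', 'cccc', 'A', '0', '1', 'high'),
        ('bbbb', 'A', 'cccc', 'A', '1', '1', 'identical'),
        ('bbbb', 'A', 'aaaa', 'A', '1', '0', 'medium');
""")
print(state_transitions(demo, "X00000"))
```

Passing the accession as a bound parameter (`?`) rather than formatting it into the SQL string keeps the query reusable across proteins.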

Notes

Although the schema tables list FLOAT and INT types, all numeric fields (energies, TM-score, RMSD, lengths) are stored as TEXT; cast them with `CAST(col AS REAL)` or `CAST(col AS INTEGER)` as needed.
`ranked_functions` in the `node` table is a JSON string; parse it with `json.loads()`.
The `edge` table is directional: (A→B) and (B→A) are separate rows, and their TM-scores may differ slightly because `TM1` uses chain A as the reference.
Energy delta values represent B − A; a negative delta means chain B has lower energy than chain A.
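The notes above can be combined into a short sketch: cast TEXT columns in SQL, look up both directions of a pair, and decode `ranked_functions`. The helper names, the toy rows, and the assumption that the JSON list is ordered best-first with plain-string entries are illustrative and not confirmed by the schema.

```python
import json
import sqlite3

PAIR_SQL = """
    SELECT CAST(TM1 AS REAL), CAST(delta_Rosetta AS REAL)
    FROM edge
    WHERE pdb_id_A = ? AND auth_asym_id_A = ?
      AND pdb_id_B = ? AND auth_asym_id_B = ?
"""


def both_directions(conn, a, b):
    """a and b are (pdb_id, auth_asym_id) tuples.

    Returns the (TM1, delta_Rosetta) row for A->B and for B->A;
    either entry is None when that row is absent.
    """
    forward = conn.execute(PAIR_SQL, (*a, *b)).fetchone()
    backward = conn.execute(PAIR_SQL, (*b, *a)).fetchone()
    return forward, backward


def top_functions(raw, k=3):
    """Decode a ranked_functions cell and keep the first k entries.

    Assumption: the JSON list is ordered best-first.
    """
    return json.loads(raw)[:k] if raw else []


# Toy stand-in for MuSProt.db; values are stored as TEXT, as in the notes.
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE edge (pdb_id_A TEXT, auth_asym_id_A TEXT, "
    "pdb_id_B TEXT, auth_asym_id_B TEXT, TM1 TEXT, delta_Rosetta TEXT)"
)
demo.executemany(
    "INSERT INTO edge VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("aaaa", "A", "bbbb", "A", "0.875", "-12.5"),
        ("bbbb", "A", "aaaa", "A", "0.75", "12.5"),
    ],
)
forward, backward = both_directions(demo, ("aaaa", "A"), ("bbbb", "A"))
print(forward, backward)

# Made-up annotation strings; the real entry format may differ.
print(top_functions('["kinase activity", "ATP binding"]', k=1))
```

In the toy data, the forward row has a negative `delta_Rosetta`, meaning chain B (`bbbb`) scores lower in energy than chain A (`aaaa`).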
