	# MuSProt Dataset Documentation

	MuSProt (Multistate Protein Database) is a million-scale multimodal database for multistate proteins, designed to support programmable protein design and AI model development. It links experimentally observed conformational states of identical protein sequences from the PDB, organizes them into state clusters and transition relationships, and enriches each record with structural similarity, experimental context, state-specific function rankings and transition fidelity labels. Users can search, browse and download MuSProt records to study conformational diversity, state-dependent functions and feasible protein state transitions.

	---

	## Download

	The full dataset is distributed as a single SQLite database file (`MuSProt.db`, ~6.3 GB). Download it from the dataset page using the Download DB button.

	For detailed protein structures and atomic coordinates, download the corresponding entry from the [Protein Data Bank (PDB)](https://rcsb.org) as an mmCIF (`.cif`) or `.pdb` file, then select the chain by PDB entry ID and chain ID (`pdb_id` and `auth_asym_id` in the `node` table).
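As a minimal sketch of that lookup, the helpers below build a download URL from a `pdb_id` value and cache the mmCIF file locally. The URL pattern `https://files.rcsb.org/download/<id>.cif`, the helper names `mmcif_url` and `fetch_mmcif`, and the `structures` output directory are assumptions made for illustration and are not part of MuSProt; check the RCSB documentation before relying on them.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Assumed RCSB file-download URL pattern; verify it against the RCSB documentation.
RCSB_DOWNLOAD_URL = "https://files.rcsb.org/download/{pdb_id}.cif"


def mmcif_url(pdb_id: str) -> str:
    """Build the download URL for one PDB entry (MuSProt stores IDs in lowercase)."""
    pdb_id = pdb_id.strip().lower()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"not a 4-character PDB ID: {pdb_id!r}")
    return RCSB_DOWNLOAD_URL.format(pdb_id=pdb_id)


def fetch_mmcif(pdb_id: str, out_dir: str = "structures") -> Path:
    """Download the mmCIF file for `pdb_id` unless a local copy already exists."""
    target = Path(out_dir) / f"{pdb_id.strip().lower()}.cif"
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(mmcif_url(pdb_id), str(target))
    return target
```

Calling `fetch_mmcif("1ivo")` would save `structures/1ivo.cif`. Extracting a single chain by `auth_asym_id` still requires a structure parser, which is outside the scope of this sketch.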

	---

	## Database Schema

	The database contains two tables: `node` and `edge`.
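To check the schema of a downloaded copy against the tables documented here, SQLite's `PRAGMA table_info` can list each column with its declared type. The sketch below uses only the standard library; the helper name `list_columns` and the `ALLOWED_TABLES` whitelist are illustrative assumptions.

```python
import sqlite3

# The two tables named in this documentation.
ALLOWED_TABLES = ("node", "edge")


def list_columns(conn: sqlite3.Connection, table: str) -> list[tuple[str, str]]:
    """Return (column name, declared type) pairs for a MuSProt table."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table!r}")
    # PRAGMA statements cannot take bound parameters, hence the whitelist above.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return [(row[1], row[2]) for row in rows]
```

For example, `list_columns(sqlite3.connect("MuSProt.db"), "node")` should return one pair per column of the `node` table.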

	---

	### Table: `node`

	Each row represents a single protein chain instance (one PDB entry + chain).

	| Column | Type | Description |
	|---|---|---|
	| `uniprot_id` | TEXT | UniProt accession number |
	| `pdb_id` | TEXT | 4-character PDB entry ID (lowercase) |
	| `auth_asym_id` | TEXT | Author chain identifier (e.g. `A`) |
	| `base_label` | TEXT | Canonical label combining UniProt ID and chain state |
	| `sequence` | TEXT | SEQRES amino acid sequence |
	| `sequence_length` | INT | Number of residues in the chain |
	| `original_metals` | TEXT | Metal ions present in the structure |
	| `original_ligands` | TEXT | Small-molecule ligands present in the structure |
	| `sequence_id` | TEXT | Internal sequence cluster identifier |
	| `state_id` | TEXT | Conformational state cluster this chain is assigned to (`0`, `1`, `2`, …) |
	| `CATH_ID` | TEXT | CATH domain assignment |
	| `cath_class` | TEXT | CATH class (e.g. `1` = Mainly Alpha) |
	| `cath_arch` | TEXT | CATH architecture |
	| `cath_topo` | TEXT | CATH topology |
	| `cath_homology` | TEXT | CATH homology superfamily |
	| `cath_superfamily` | TEXT | Full CATH superfamily code (e.g. `1.10.10.10`) |
	| `domain_length` | INT | Length of the matched CATH domain |
	| `experimental_method` | TEXT | Structure determination method (e.g. `X-RAY DIFFRACTION`, `ELECTRON MICROSCOPY`, `SOLUTION NMR`) |
	| `pH` | FLOAT | pH of the experimental / crystallization condition |
	| `temp_K` | FLOAT | Temperature of the experiment, in Kelvin |
	| `experimental_details` | TEXT | Free-text crystallization / sample-preparation details |
	| `resolution` | FLOAT | Experimental resolution in Å (lower is sharper; empty for methods without a resolution) |
	| `Rosetta` | FLOAT | Rosetta total energy score |
	| `FoldX` | FLOAT | FoldX total energy score |
	| `EvoEF2` | FLOAT | EvoEF2 total energy score |
	| `RW` | FLOAT | Random-Walk (RW) energy score |
	| `RW+` | FLOAT | Random-Walk+ (RWplus) energy score |
	| `ranked_functions` | TEXT | JSON-encoded list of ranked GO/functional annotations |
	| `chain_composition` | TEXT | Quaternary composition of the deposited assembly: `monomeric`, `homomeric`, or `heteromeric` |
	| `non_protein_polymer_binding` | TEXT | Non-protein polymer bound in the structure (`DNA`, `RNA`, or `DNA/RNA`); empty when none is present |
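Because each `node` row carries a `state_id`, the observed chains of one protein can be grouped by conformational state. The following is a small sketch using a parameterized query; the helper name `chains_by_state` is hypothetical, and it assumes that `state_id` values are comparable within a single `uniprot_id`.

```python
import sqlite3
from collections import defaultdict


def chains_by_state(conn: sqlite3.Connection, uniprot_id: str) -> dict:
    """Map each state_id to the (pdb_id, auth_asym_id) chains assigned to it."""
    rows = conn.execute(
        "SELECT state_id, pdb_id, auth_asym_id FROM node "
        "WHERE uniprot_id = ? ORDER BY state_id, pdb_id, auth_asym_id",
        (uniprot_id,),  # bound parameter instead of string formatting
    )
    states = defaultdict(list)
    for state_id, pdb_id, chain in rows:
        states[state_id].append((pdb_id, chain))
    return dict(states)
```

With an open connection to `MuSProt.db`, `chains_by_state(conn, "P00533")` would return a dictionary keyed by state cluster.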


	---

	### Table: `edge`

	Each row represents a pairwise structural comparison between two chain instances (A and B) that share the same UniProt identity.

	| Column | Type | Description |
	|---|---|---|
	| `pdb_id_A` | TEXT | PDB ID of chain A |
	| `auth_asym_id_A` | TEXT | Chain identifier of chain A |
	| `pdb_id_B` | TEXT | PDB ID of chain B |
	| `auth_asym_id_B` | TEXT | Chain identifier of chain B |
	| `TM1` | FLOAT | TM-score of the alignment (chain A as reference) |
	| `RMSD` | FLOAT | Root-mean-square deviation of Cα atoms (Å) |
	| `structure_sim` | FLOAT | Composite structural similarity score |
	| `delta_Rosetta` | FLOAT | Rosetta energy difference (B − A) |
	| `delta_FoldX` | FLOAT | FoldX energy difference (B − A) |
	| `delta_EvoEF2` | FLOAT | EvoEF2 energy difference (B − A) |
	| `delta_RW` | FLOAT | Random-Walk (RW) energy difference (B − A) |
	| `delta_RW+` | FLOAT | Random-Walk+ (RWplus) energy difference (B − A) |
	| `state_id_A` | TEXT | Conformational state cluster of chain A |
	| `state_id_B` | TEXT | Conformational state cluster of chain B |
	| `avg_sim` | TEXT | Average structural similarity within the state cluster (`>0.95` or a numeric value) |
	| `state_fidelity` | TEXT | State-level transition fidelity label: `identical`, `high`, `medium`, or `low` |
	| `observation_fidelity` | TEXT | Observation-level transition fidelity label: `identical`, `high`, `medium`, or `low` |
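Two details of this table are easy to trip over in queries: column names containing `+` (such as `delta_RW+`) must be double-quoted in SQL, and `avg_sim` may hold a thresholded string such as `>0.95` rather than a plain number. The sketch below illustrates both while selecting transitions between different state clusters; the helper names `parse_avg_sim` and `cross_state_edges` are assumptions, not part of the dataset.

```python
import sqlite3


def parse_avg_sim(value):
    """Convert avg_sim text to (is_lower_bound, number); '>0.95' gives (True, 0.95)."""
    text = str(value).strip()
    if text.startswith(">"):
        return True, float(text[1:])
    return False, float(text)


def cross_state_edges(conn: sqlite3.Connection, pdb_id: str, chain: str, limit: int = 100):
    """Edges from one chain to chains assigned to a different state cluster."""
    sql = """
        SELECT pdb_id_B, auth_asym_id_B, state_id_A, state_id_B,
               CAST(TM1 AS REAL) AS tm1,
               CAST("delta_RW+" AS REAL) AS delta_rw_plus,  -- quoted because of '+'
               state_fidelity
        FROM edge
        WHERE pdb_id_A = ? AND auth_asym_id_A = ?
          AND state_id_A <> state_id_B
        ORDER BY tm1 DESC
        LIMIT ?
    """
    return conn.execute(sql, (pdb_id, chain, limit)).fetchall()
```

`parse_avg_sim` raises `ValueError` on empty or non-numeric input, so callers that expect missing values should guard for them.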

	---

	## Usage Examples

	### Python (sqlite3)

	```python
	import sqlite3
	import pandas as pd

	conn = sqlite3.connect("MuSProt.db")

	# Load all chains for a UniProt entry
	df_nodes = pd.read_sql(
	    "SELECT * FROM node WHERE uniprot_id = 'P00533'",
	    conn,
	)

	# Find all structural neighbours of a given chain
	df_edges = pd.read_sql(
	    "SELECT * FROM edge WHERE pdb_id_A = '1ivo' AND auth_asym_id_A = 'A'",
	    conn,
	)

	conn.close()
	```

	### Filter by structural similarity

	```python
	import sqlite3
	import pandas as pd

	conn = sqlite3.connect("MuSProt.db")

	# Retrieve pairs with high TM-score and low RMSD
	df = pd.read_sql("""
	    SELECT *
	    FROM edge
	    WHERE CAST(TM1 AS REAL) > 0.8
	      AND CAST(RMSD AS REAL) < 2.0
	    LIMIT 1000
	""", conn)

	conn.close()
	```

	### Join nodes and edges

	```python
	import sqlite3
	import pandas as pd

	conn = sqlite3.connect("MuSProt.db")

	# Get full info for both chains in each pair
	df = pd.read_sql("""
	    SELECT
	        e.pdb_id_A, e.auth_asym_id_A,
	        e.pdb_id_B, e.auth_asym_id_B,
	        e.TM1, e.RMSD,
	        nA.sequence_length AS len_A,
	        nB.sequence_length AS len_B,
	        nA.cath_superfamily
	    FROM edge e
	    JOIN node nA ON e.pdb_id_A = nA.pdb_id AND e.auth_asym_id_A = nA.auth_asym_id
	    JOIN node nB ON e.pdb_id_B = nB.pdb_id AND e.auth_asym_id_B = nB.auth_asym_id
	    WHERE nA.uniprot_id = 'P00533'
	    LIMIT 500
	""", conn)

	conn.close()
	```

	---

	## Notes

	- Although the schema tables above list `INT` and `FLOAT` types, all numeric fields (energies, TM-score, RMSD, lengths) are stored as `TEXT`; cast them with `CAST(col AS REAL)` or `CAST(col AS INTEGER)` as needed.
	- `ranked_functions` in the `node` table is a JSON string. Parse it with `json.loads()`.
	- The `edge` table is directional: (A→B) and (B→A) are separate rows and may differ slightly in TM-score.
	- Energy delta values represent B − A; a negative delta means chain B is lower energy than chain A.
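The notes above can be applied on the Python side with a few small helpers, sketched here under stated assumptions: missing values are taken to be `NULL` or empty strings, nothing is assumed about the structure of the items inside `ranked_functions`, and the names `to_float`, `parse_ranked_functions`, and `is_lower_energy` are hypothetical.

```python
import json


def to_float(value):
    """Convert a TEXT-stored numeric field to float; empty or missing values become None."""
    if value is None or str(value).strip() == "":
        return None
    return float(value)


def parse_ranked_functions(value):
    """Decode the ranked_functions JSON string; empty or missing values give an empty list."""
    if value is None or str(value).strip() == "":
        return []
    return json.loads(value)


def is_lower_energy(delta):
    """True when a B - A energy delta is negative, i.e. chain B is lower in energy than chain A."""
    d = to_float(delta)
    return d is not None and d < 0
```

These helpers operate on single values, so they can be applied to rows returned by `sqlite3` or mapped over a pandas column with `Series.map`.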
