Protein SSP: Artifacts Repository
This repository hosts all large binary artifacts for the Protein Secondary Structure Predictor, a deep learning pipeline that predicts α-Helix (H), β-Sheet (E), and Coil (C) secondary structure from raw amino acid sequences.
Decoupling these artifacts from the application code bypasses Hugging Face Space storage limits (1 GB) and Git LFS constraints, keeping the Space lightweight and fast to build while the app still downloads what it needs on-demand at runtime.
Project Links
| Resource | URL |
|---|---|
| Live Streamlit App | huggingface.co/spaces/Chimera418/protein-ssp |
| Source Code | github.com/Chimera418/protein-ssp |
| This Artifacts Repo | huggingface.co/Chimera418/protein-ssp-artifacts |
Repository Layout
protein-ssp-artifacts/
├── models/      # ~1.5 GB, downloaded by the Streamlit app at runtime
├── embeddings/  # ~26.7 GB, pre-computed for offline research
├── data/        # ~1.1 GB, curated dataset CSVs
└── raw_data/    # ~2.2 GB, raw RCSB/PISCES source files
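Because the four folders add up to roughly 31.5 GB, it is usually worth fetching only the folder you need. The sketch below uses `snapshot_download` from `huggingface_hub` with glob filters; the helper names `folder_patterns` and `download_folders` are illustrative and do not appear in the app.

```python
REPO_ID = "Chimera418/protein-ssp-artifacts"


def folder_patterns(folders):
    """Turn folder names such as "models" into glob patterns such as "models/*"."""
    return [f"{name.strip('/')}/*" for name in folders]


def download_folders(folders, local_dir="."):
    """Download only the listed top-level folders of the artifacts repo.

    Hypothetical helper: it needs network access and the huggingface_hub package.
    """
    # Imported lazily so the pattern helper can be used without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=folder_patterns(folders),
        local_dir=local_dir,
    )


if __name__ == "__main__":
    # Example: fetch the curated CSVs (~1.1 GB) without touching the embeddings.
    print(folder_patterns(["data"]))
    # download_folders(["data"])  # uncomment to actually download
```

Filtering by pattern keeps the ~26.7 GB embeddings folder off disk unless it is explicitly requested.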
Folder Contents
/models: Trained Model Weights & Transforms (~1.5 GB)
These files are lazily pulled by the Streamlit app on first prediction per mode, then cached locally:
| File | Size | Description |
|---|---|---|
| phase_8_best_model_Rostlab_prot_t5_xl_uniref50.pt | ~23 MB | PyTorch model for Direct mode (1024-dim input) |
| phase_8_best_model_filtered_embeddings.pt | ~23 MB | PyTorch model for Pearson-filtered mode (1017-dim input) |
| phase_8_best_model_pca_embeddings.pt | ~21 MB | PyTorch model for PCA Pipeline mode (739-dim input), best overall |
| phase_8_best_model_final_features.pt | ~17 MB | PyTorch model for Feature Selected V1 (109-dim input) |
| phase_8_best_model_final_features_v2.pt | ~16 MB | PyTorch model for Feature Selected V2 (12-dim input) |
| pca_model.pkl | ~6 MB | Fitted PCA transform (1017 → 739 principal components) |
| keep_indices.pkl | ~3 KB | Pearson filter mask: column indices to retain after Phase 5 filtering |
| feature_selector_mask.pkl | ~1.3 GB | ExtraTrees V1 selection mask: 109 PCA-space indices (Phase 7) |
| feature_selector_mask_v2.pkl | ~2 KB | ExtraTrees V2 selection mask: top-12 refined indices (Phase 7.5) |
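The transform files above form a chain of dimensionality reductions (1024 → 1017 → 739 → 109 → 12). The sketch below reproduces that chain in NumPy on synthetic data. It rests on several assumptions: the PCA step behaves like a standard non-whitened projection `(X - mean) @ components.T`, the masks are plain integer index arrays, and the V2 indices refer to columns of the V1 output. The actual pickle contents may be structured differently.

```python
import numpy as np


def apply_transform_chain(emb, keep_idx, pca_mean, pca_components, v1_idx, v2_idx):
    """Return the per-residue feature matrix for each reduced mode.

    emb has shape (n_residues, 1024); every other argument is a stand-in
    for the contents of one of the .pkl files listed in the table.
    """
    filtered = emb[:, keep_idx]                     # Pearson filter: 1024 -> 1017
    pca = (filtered - pca_mean) @ pca_components.T  # PCA projection: 1017 -> 739
    v1 = pca[:, v1_idx]                             # ExtraTrees V1: 739 -> 109
    v2 = v1[:, v2_idx]                              # V2 refinement: 109 -> 12 (assumed to index V1 columns)
    return {"pearson": filtered, "pca": pca, "v1": v1, "v2": v2}


# Synthetic stand-ins with the documented shapes; real values live in the .pkl files.
rng = np.random.default_rng(0)
emb = rng.normal(size=(50, 1024))
keep_idx = np.sort(rng.choice(1024, size=1017, replace=False))
pca_mean = rng.normal(size=1017)
pca_components = rng.normal(size=(739, 1017))
v1_idx = np.sort(rng.choice(739, size=109, replace=False))
v2_idx = np.sort(rng.choice(109, size=12, replace=False))

features = apply_transform_chain(emb, keep_idx, pca_mean, pca_components, v1_idx, v2_idx)
for mode, matrix in features.items():
    print(mode, matrix.shape)
```

Each mode's model expects the matrix produced at the matching stage, so a prediction in Feature Selected V2 mode would need every transform before it.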
All five PyTorch models share the same architecture: 1D-CNN → BiLSTM → Multi-Head Attention → Linear head (3 classes).
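For readers who want a concrete picture of that layer stack, here is a minimal PyTorch sketch. Only the order of the layers and the 3-class output come from this README; the class name `SSPNet`, the kernel size, the channel widths, the number of heads, and the ReLU activation are all assumptions, so this module will not load the published checkpoints.

```python
import torch
from torch import nn


class SSPNet(nn.Module):
    """Illustrative 1D-CNN -> BiLSTM -> multi-head attention -> linear head."""

    def __init__(self, in_dim=739, conv_channels=128, lstm_hidden=64,
                 num_heads=4, num_classes=3):
        super().__init__()
        # Same-length convolution over the residue axis (hyperparameters assumed).
        self.conv = nn.Conv1d(in_dim, conv_channels, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(conv_channels, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * lstm_hidden, num_heads,
                                          batch_first=True)
        self.head = nn.Linear(2 * lstm_hidden, num_classes)

    def forward(self, x):
        # x: (batch, length, in_dim). Conv1d expects channels before length.
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        h, _ = self.lstm(h)        # (batch, length, 2 * lstm_hidden)
        h, _ = self.attn(h, h, h)  # self-attention across residues
        return self.head(h)        # per-residue logits: (batch, length, 3)


if __name__ == "__main__":
    model = SSPNet(in_dim=739).eval()
    with torch.no_grad():
        logits = model(torch.randn(2, 120, 739))  # two sequences of 120 residues
    print(logits.shape)
```

Only `in_dim` would change between the five modes (1024, 1017, 739, 109, or 12), which fits the gradual drop in checkpoint size listed in the table.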
/embeddings: Pre-computed LLM Embeddings (~26.7 GB)
Per-residue embeddings pre-computed with Rostlab/prot_t5_xl_uniref50 across all ~9,000 training proteins. They are used for offline model training, validation, and batch experiments, and are not needed to run the Streamlit app.
| File | Approx. Size | Description |
|---|---|---|
| Rostlab_prot_t5_xl_uniref50.pkl | ~9.44 GB | Raw 1024-dim ProtT5 embeddings for all proteins |
| filtered_embeddings.pkl | ~9.38 GB | After Phase 5 Pearson filter (1017-dim) |
| pca_embeddings.pkl | ~6.81 GB | After Phase 6 PCA (739-dim) |
| final_features.pkl | ~1.00 GB | After Phase 7 ExtraTrees selection (109-dim) |
| final_features_v2.pkl | ~0.11 GB | After Phase 7.5 top-12 refinement (12-dim) |
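The listed sizes scale almost exactly with the feature dimension, which is what one would expect if each file is dominated by a dense per-residue matrix. The short check below makes that arithmetic explicit. The float32 storage used for the residue estimate is an assumption, not something this README states.

```python
RAW_DIM = 1024
RAW_SIZE_GB = 9.44  # listed size of Rostlab_prot_t5_xl_uniref50.pkl

# (dimension, listed size in GB) for each reduced embedding file.
LISTED = {
    "filtered_embeddings.pkl": (1017, 9.38),
    "pca_embeddings.pkl": (739, 6.81),
    "final_features.pkl": (109, 1.00),
    "final_features_v2.pkl": (12, 0.11),
}


def expected_size_gb(dim):
    """Scale the raw file size linearly with the number of feature columns."""
    return RAW_SIZE_GB * dim / RAW_DIM


for name, (dim, listed_gb) in LISTED.items():
    print(f"{name}: expected ~{expected_size_gb(dim):.2f} GB, listed {listed_gb:.2f} GB")

# Rough residue count, assuming 4-byte floats and negligible pickle overhead.
approx_residues = RAW_SIZE_GB * 1e9 / (RAW_DIM * 4)
print(f"approx. residues: {approx_residues:,.0f}")
```

Under those assumptions the raw file corresponds to roughly 2.3 million residues. This kind of estimate is a quick way to judge whether a file will fit in memory before loading it.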
/data: Curated Dataset CSVs (~1.1 GB)
Intermediate and final dataset files produced by the data curation pipeline (Phases 1–7):
| File | Description |
|---|---|
| protein_sequences_raw.csv | All sequences parsed from RCSB PDB ss.txt.gz (Phase 1 output) |
| protein_sequences_curated.csv | After deduplication, length filtering, invalid-AA removal, and PISCES redundancy culling at ≤70% identity (Phase 2 output) |
| protein_labelled_curated.csv | Curated sequences with per-residue SST8 and SST3 labels matched from RCSB (Phase 3 output) |
| filtered_protein_embeddings.csv | Pearson-filtered embedding matrix as CSV (Phase 5 output) |
| pca_protein_embeddings.csv | PCA-reduced embedding matrix as CSV (Phase 6 output) |
| final_selected_features.csv | ExtraTrees V1 selected features as CSV (Phase 7 output) |
| final_selected_features_v2.csv | ExtraTrees V2 top-12 features as CSV (Phase 7.5 output) |
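The Phase 2 and Phase 3 rows describe two simple per-sequence operations: filtering sequences, and collapsing 8-state labels (SST8) into the 3 states the models predict (SST3). The sketch below illustrates both. The length bounds are borrowed from the `len40-10000` part of the PISCES filename, and the 8-to-3 mapping is the common DSSP reduction (H/G/I → H, E/B → E, everything else → C); neither is confirmed by this README as the exact rule the pipeline uses. PISCES identity culling is an external step and is not shown.

```python
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids

# Common DSSP 8-state to 3-state reduction (assumed, not taken from the pipeline).
SST8_TO_SST3 = {
    "H": "H", "G": "H", "I": "H",  # helices
    "E": "E", "B": "E",            # strands and bridges
    "T": "C", "S": "C", "C": "C", " ": "C", "-": "C",  # turns, bends, coil
}


def curate(sequences, min_len=40, max_len=10000):
    """Deduplicate, then drop sequences that are too short, too long, or contain invalid residues."""
    seen = set()
    kept = []
    for seq in sequences:
        seq = seq.strip().upper()
        if seq in seen:
            continue  # exact duplicate of an earlier sequence
        seen.add(seq)
        if not (min_len <= len(seq) <= max_len):
            continue
        if not set(seq) <= VALID_AA:
            continue  # contains a non-standard residue such as X or U
        kept.append(seq)
    return kept


def sst8_to_sst3(labels):
    """Map a per-residue SST8 string to SST3; raises KeyError on unknown symbols."""
    return "".join(SST8_TO_SST3[ch] for ch in labels)


if __name__ == "__main__":
    print(curate(["ACDE", "acde", "ACDX", "AC", "ACDEFG"], min_len=3, max_len=5))
    print(sst8_to_sst3("HGIEBTS "))
```

Raising on unknown label symbols, rather than silently mapping them to coil, makes label corruption visible early.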
/raw_data: Raw Source Data (~2.2 GB)
Original files downloaded directly from RCSB PDB and the Dunbrack PISCES server:
| File | Description |
|---|---|
| pisces_lists_2026_05_14.tar.gz | PISCES culled PDB list archive (~1.9 GB), used to enforce ≤70% sequence identity in Phase 2 |
| 2026-05-16-ss.txt.gz / 2026-05-17-ss.txt.gz | RCSB PDB secondary structure annotation files (ss.txt.gz), used in Phases 1 & 3 |
| 2026-05-16-source.idx / 2026-05-17-source.idx | RCSB PDB organism source index files, used to annotate sequences with organism names in Phase 1 |
| 2026-05-16-ss.csv | Parsed ss.txt.gz intermediate CSV |
| cullpdb_pc25.0_res0.0-2.0_len40-10000_R0.25_Xray_d2026_05_14_chains9162 | PISCES culled list (pc25, resolution ≤2.0 Å, R-factor ≤0.25) used as a cross-reference |
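As a rough illustration of how the ss.txt.gz files can be turned into per-chain records, the sketch below parses the FASTA-like layout those files are generally understood to use: a header such as `>101M:A:sequence` or `>101M:A:secstr`, followed by one or more wrapped lines. The function names are hypothetical, and the project's own Phase 1 parser may differ.

```python
import gzip
from collections import defaultdict


def parse_ss_lines(lines):
    """Group sequence and secstr text by (pdb_id, chain).

    Spaces inside secstr lines are kept because they encode unassigned residues.
    """
    records = defaultdict(dict)
    key = kind = None
    for line in lines:
        line = line.rstrip("\r\n")  # strip line endings only, never spaces
        if line.startswith(">"):
            pdb_id, chain, kind = line[1:].split(":")
            key = (pdb_id, chain)
            records[key][kind] = ""
        elif key is not None:
            records[key][kind] += line  # join wrapped lines
    return dict(records)


def parse_ss_file(path):
    """Parse a gzip-compressed ss.txt file from disk."""
    with gzip.open(path, "rt") as handle:
        return parse_ss_lines(handle)


if __name__ == "__main__":
    sample = [
        ">101M:A:sequence\n", "MVLSEG\n",
        ">101M:A:secstr\n", "   HHH\n",
    ]
    print(parse_ss_lines(sample))
```

A record is usable for labelling only when its `sequence` and `secstr` strings have equal length, which is a cheap check to run on every parsed chain.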
How the App Uses This Repository
The Streamlit app (app.py) uses the huggingface_hub SDK to lazily pull model files the first time each prediction mode is used:
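import os


def ensure_model_exists(local_path):
    """Download local_path from the artifacts repo if it is not already on disk."""
    # Hub filenames always use forward slashes, even on Windows.
    hf_path = local_path.replace("\\", "/")
    if not os.path.exists(local_path):
        # Imported lazily so the dependency loads only when a download is needed.
        from huggingface_hub import hf_hub_download
        hf_hub_download(
            repo_id="Chimera418/protein-ssp-artifacts",
            filename=hf_path,
            local_dir=".",  # recreates the repo-relative path under the working directory
        )
    return local_path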
This function is called before loading every model weight or pickle file. Once a file has been downloaded it stays on disk, so later calls find it locally and skip the download.
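One way to picture "lazily pulled per mode" is as a lookup from prediction mode to the files that mode needs. The sketch below encodes that idea. The mode keys, the helper names, and especially the assumption that each mode needs every transform earlier in the chain are illustrative guesses based on the /models table, not a description of app.py.

```python
MODE_ORDER = ["direct", "pearson", "pca", "v1", "v2"]

MODEL_FILES = {
    "direct": "phase_8_best_model_Rostlab_prot_t5_xl_uniref50.pt",
    "pearson": "phase_8_best_model_filtered_embeddings.pt",
    "pca": "phase_8_best_model_pca_embeddings.pt",
    "v1": "phase_8_best_model_final_features.pt",
    "v2": "phase_8_best_model_final_features_v2.pt",
}

# Transforms in the order they would be applied (assumed cumulative).
TRANSFORM_CHAIN = [
    "keep_indices.pkl",
    "pca_model.pkl",
    "feature_selector_mask.pkl",
    "feature_selector_mask_v2.pkl",
]


def required_files(mode):
    """List the repo-relative paths a mode would need under the cumulative assumption."""
    depth = MODE_ORDER.index(mode)  # raises ValueError for an unknown mode
    names = TRANSFORM_CHAIN[:depth] + [MODEL_FILES[mode]]
    return [f"models/{name}" for name in names]


def ensure_mode(mode, ensure):
    """Call an ensure function (for example ensure_model_exists) on every required file."""
    return [ensure(path) for path in required_files(mode)]


if __name__ == "__main__":
    for mode in MODE_ORDER:
        print(mode, required_files(mode))
```

Under this assumption, Direct mode would download about 23 MB, while the two Feature Selected modes would also pull the ~1.3 GB feature_selector_mask.pkl.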
Model Performance Summary
All five models were trained in Phase 8 and evaluated on a held-out test set:
| Mode | Input Dims | Q3 Accuracy | Macro F1 | AUC |
|---|---|---|---|---|
| Direct (raw ProtT5) | 1024 | 85.03% | 0.8494 | 0.9685 |
| Pearson-filtered | 1017 | 85.41% | 0.8530 | 0.9683 |
| PCA Pipeline (best) | 739 | 85.67% | 0.8563 | 0.9692 |
| Feature Selected V1 | 109 | 84.21% | 0.8413 | 0.9617 |
| Feature Selected V2 | 12 | 82.30% | 0.8215 | 0.9600 |
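The PCA Pipeline has the highest Q3 accuracy, Macro F1, and AUC of the five modes while using about 28% fewer input dimensions than raw ProtT5 (739 vs. 1024), giving it the best balance of accuracy and computational efficiency.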
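For reference, Q3 accuracy is the fraction of residues whose 3-state label is predicted correctly, and macro F1 is the unweighted mean of the per-class F1 scores. The following is a minimal pure-Python sketch of both metrics on label strings. It is not the project's evaluation code, and the example strings are made up.

```python
def q3_accuracy(true_labels, pred_labels):
    """Fraction of residues whose predicted H/E/C label matches the reference."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label strings must have equal length")
    matches = sum(t == p for t, p in zip(true_labels, pred_labels))
    return matches / len(true_labels)


def macro_f1(true_labels, pred_labels, classes="HEC"):
    """Unweighted mean of per-class F1; a class with no support and no predictions scores 0."""
    pairs = list(zip(true_labels, pred_labels))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    truth = "HHHHEEECCC"
    guess = "HHHEEEECCC"  # one helix residue mislabelled as strand
    print(f"Q3 = {q3_accuracy(truth, guess):.4f}")
    print(f"macro F1 = {macro_f1(truth, guess):.4f}")
```

Macro F1 weights the three classes equally, so it penalizes a model that does well on the most frequent class but poorly on a rarer one, which plain Q3 accuracy can hide.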
License
This project and all contained artifacts are licensed under the MIT License.