🧬 Protein SSP: Artifacts Repository

This repository hosts all large binary artifacts for the Protein Secondary Structure Predictor, a deep learning pipeline that predicts α-Helix (H), β-Sheet (E), and Coil (C) secondary structure from raw amino acid sequences.

Decoupling these artifacts from the application code sidesteps the Hugging Face Space storage limit (1 GB) and Git LFS constraints. The Space stays lightweight and fast to build, and the app downloads what it needs on demand at runtime.

🔗 Project Links


📂 Repository Layout

protein-ssp-artifacts/
├── models/          # ~1.5 GB, downloaded by the Streamlit app at runtime
├── embeddings/      # ~26.7 GB, pre-computed for offline research
├── data/            # ~1.1 GB, curated dataset CSVs
└── raw_data/        # ~2.2 GB, raw RCSB/PISCES source files

📁 Folder Contents

/models: Trained Model Weights & Transforms (~1.5 GB)

These files are lazily pulled by the Streamlit app on first prediction per mode, then cached locally:

| File | Size | Description |
|---|---|---|
| phase_8_best_model_Rostlab_prot_t5_xl_uniref50.pt | ~23 MB | PyTorch model for Direct mode (1024-dim input) |
| phase_8_best_model_filtered_embeddings.pt | ~23 MB | PyTorch model for Pearson-filtered mode (1017-dim input) |
| phase_8_best_model_pca_embeddings.pt | ~21 MB | PyTorch model for PCA Pipeline mode (739-dim input) ★ Best |
| phase_8_best_model_final_features.pt | ~17 MB | PyTorch model for Feature Selected V1 (109-dim input) |
| phase_8_best_model_final_features_v2.pt | ~16 MB | PyTorch model for Feature Selected V2 (12-dim input) |
| pca_model.pkl | ~6 MB | Fitted PCA transform (1017 → 739 principal components) |
| keep_indices.pkl | ~3 KB | Pearson filter mask: column indices to retain after Phase 5 filtering |
| feature_selector_mask.pkl | ~1.3 GB | ExtraTrees V1 selection mask: 109 PCA-space indices (Phase 7) |
| feature_selector_mask_v2.pkl | ~2 KB | ExtraTrees V2 selection mask: top-12 refined indices (Phase 7.5) |

All five PyTorch models share the same architecture: 1D-CNN → BiLSTM → Multi-Head Attention → Linear head (3 classes).
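
The layer order above can be sketched in PyTorch as follows. This is a minimal illustration, not the trained models' actual configuration: the class name `SSPNetSketch`, the channel width, kernel size, hidden size, head count, and ReLU activation are all assumptions, so the published checkpoints are not expected to load into it.

```python
import torch
from torch import nn


class SSPNetSketch(nn.Module):
    """Hypothetical 1D-CNN -> BiLSTM -> Multi-Head Attention -> Linear head."""

    def __init__(self, input_dim=739, conv_channels=128, lstm_hidden=64,
                 num_heads=4, num_classes=3):
        super().__init__()
        # All sizes below are illustrative assumptions.
        self.conv = nn.Conv1d(input_dim, conv_channels, kernel_size=3, padding=1)
        self.lstm = nn.LSTM(conv_channels, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * lstm_hidden, num_heads,
                                          batch_first=True)
        self.head = nn.Linear(2 * lstm_hidden, num_classes)

    def forward(self, x):
        # x: (batch, residues, input_dim); Conv1d expects channels first.
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        h, _ = self.lstm(h)          # (batch, residues, 2 * lstm_hidden)
        h, _ = self.attn(h, h, h)    # self-attention over residues
        return self.head(h)          # per-residue logits for H, E, C


model = SSPNetSketch(input_dim=12)
logits = model(torch.randn(2, 50, 12))  # 2 proteins, 50 residues each
```

Only `input_dim` would differ between the five modes (1024, 1017, 739, 109, or 12); the output always has one 3-way logit vector per residue.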

/embeddings: Pre-computed LLM Embeddings (~26.7 GB)

Per-residue embeddings pre-computed with Rostlab/prot_t5_xl_uniref50 across all ~9,000 training proteins. These files are used for offline model training, validation, and batch experiments; the Streamlit app does not need them.

| File | Approx. Size | Description |
|---|---|---|
| Rostlab_prot_t5_xl_uniref50.pkl | ~9.44 GB | Raw 1024-dim ProtT5 embeddings for all proteins |
| filtered_embeddings.pkl | ~9.38 GB | After Phase 5 Pearson filter (1017-dim) |
| pca_embeddings.pkl | ~6.81 GB | After Phase 6 PCA (739-dim) |
| final_features.pkl | ~1.00 GB | After Phase 7 ExtraTrees selection (109-dim) |
| final_features_v2.pkl | ~0.11 GB | After Phase 7.5 top-12 refinement (12-dim) |
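
The listed sizes are consistent with each file storing the same residues at a smaller feature width. The sketch below checks that by scaling the raw file's size by the ratio of dimensions; it assumes a constant number of bytes per value across files and ignores pickle overhead, and the helper name `expected_size_gb` is hypothetical.

```python
RAW_DIM = 1024       # width of the raw ProtT5 embeddings
RAW_SIZE_GB = 9.44   # listed size of Rostlab_prot_t5_xl_uniref50.pkl

# Feature width of each derived file, taken from the table above.
STAGE_DIMS = {
    "filtered_embeddings.pkl": 1017,
    "pca_embeddings.pkl": 739,
    "final_features.pkl": 109,
    "final_features_v2.pkl": 12,
}


def expected_size_gb(dim):
    """Scale the raw file size linearly with the feature width."""
    return RAW_SIZE_GB * dim / RAW_DIM


for name, dim in STAGE_DIMS.items():
    print(f"{name}: ~{expected_size_gb(dim):.2f} GB")
```

The estimates round to 9.38, 6.81, 1.00, and 0.11 GB, matching the table, which is a quick way to estimate disk needs before downloading a given stage.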

/data: Curated Dataset CSVs (~1.1 GB)

Intermediate and final dataset files produced by the data curation pipeline (Phases 1 through 7.5):

| File | Description |
|---|---|
| protein_sequences_raw.csv | All sequences parsed from RCSB PDB ss.txt.gz (Phase 1 output) |
| protein_sequences_curated.csv | After deduplication, length filtering, invalid-AA removal, and PISCES redundancy culling at ≤70% identity (Phase 2 output) |
| protein_labelled_curated.csv | Curated sequences with per-residue SST8 and SST3 labels matched from RCSB (Phase 3 output) |
| filtered_protein_embeddings.csv | Pearson-filtered embedding matrix as CSV (Phase 5 output) |
| pca_protein_embeddings.csv | PCA-reduced embedding matrix as CSV (Phase 6 output) |
| final_selected_features.csv | ExtraTrees V1 selected features as CSV (Phase 7 output) |
| final_selected_features_v2.csv | ExtraTrees V2 top-12 features as CSV (Phase 7.5 output) |
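
The first three Phase 2 steps (deduplication, length filtering, invalid-AA removal) could look roughly like the sketch below. It is an illustration under assumptions, not the pipeline's code: the function name `curate` is hypothetical, the 40 to 10000 length bounds are borrowed from the PISCES list name in /raw_data, and "invalid-AA removal" is read here as dropping any sequence that contains a letter outside the 20 standard amino acids. PISCES culling is not covered.

```python
# The 20 standard amino acids (one-letter codes).
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")


def curate(sequences, min_len=40, max_len=10000):
    """Deduplicate, length-filter, and drop sequences with non-standard residues."""
    seen = set()
    kept = []
    for seq in sequences:
        seq = seq.strip().upper()
        if seq in seen:                      # exact duplicate
            continue
        seen.add(seq)
        if not (min_len <= len(seq) <= max_len):
            continue                         # too short or too long
        if not set(seq) <= VALID_AA:
            continue                         # contains e.g. X, B, Z, U
        kept.append(seq)
    return kept


raw = ["A" * 40, "A" * 40, "A" * 39, "A" * 39 + "X", "ac" * 20]
curated = curate(raw)
```

Here only the first poly-A sequence and the normalized `"AC" * 20` sequence survive.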

/raw_data: Raw Source Data (~2.2 GB)

Original files downloaded directly from RCSB PDB and the Dunbrack PISCES server:

| File | Description |
|---|---|
| pisces_lists_2026_05_14.tar.gz | PISCES culled PDB list archive (~1.9 GB), used to enforce ≤70% sequence identity in Phase 2 |
| 2026-05-16-ss.txt.gz / 2026-05-17-ss.txt.gz | RCSB PDB secondary structure annotation files (ss.txt.gz), used in Phases 1 & 3 |
| 2026-05-16-source.idx / 2026-05-17-source.idx | RCSB PDB organism source index files, used to annotate sequences with organism names in Phase 1 |
| 2026-05-16-ss.csv | Parsed ss.txt.gz intermediate CSV |
| cullpdb_pc25.0_res0.0-2.0_len40-10000_R0.25_Xray_d2026_05_14_chains9162 | PISCES culled list (pc25, resolution ≤2.0 Å, R-factor ≤0.25) used as a cross-reference |
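
The PISCES list name encodes its culling parameters, so they can be recovered without opening the file. The parser below is a sketch written against the one file name listed above; the function name and the reading of the `len` and `chains` fields (chain length range and chain count) are assumptions, and other PISCES naming variants may not match this pattern.

```python
import re

# Pattern derived from the single file name in this repository (assumption).
PISCES_NAME = re.compile(
    r"cullpdb_pc(?P<pc>[\d.]+)"
    r"_res(?P<res_min>[\d.]+)-(?P<res_max>[\d.]+)"
    r"_len(?P<len_min>\d+)-(?P<len_max>\d+)"
    r"_R(?P<r_max>[\d.]+)"
    r"_(?P<method>[A-Za-z+]+)"
    r"_d(?P<date>\d{4}_\d{2}_\d{2})"
    r"_chains(?P<chains>\d+)"
)


def parse_pisces_name(name):
    """Return the culling parameters encoded in a PISCES list file name."""
    m = PISCES_NAME.fullmatch(name)
    if m is None:
        raise ValueError(f"unrecognized PISCES list name: {name!r}")
    return {
        "max_identity_pct": float(m["pc"]),
        "resolution_range": (float(m["res_min"]), float(m["res_max"])),
        "length_range": (int(m["len_min"]), int(m["len_max"])),
        "max_r_factor": float(m["r_max"]),
        "method": m["method"],
        "date": m["date"].replace("_", "-"),
        "chains": int(m["chains"]),
    }


info = parse_pisces_name(
    "cullpdb_pc25.0_res0.0-2.0_len40-10000_R0.25_Xray_d2026_05_14_chains9162"
)
```

For this file, `info` reports a 25% identity cutoff, a 0.0 to 2.0 resolution range, an R-factor cutoff of 0.25, and 9162 chains.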

⚙️ How the App Uses This Repository

The Streamlit app (app.py) uses the huggingface_hub SDK to lazily pull model files the first time each prediction mode is used:

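import os

def ensure_model_exists(local_path):
    """Download local_path from the artifacts repo unless it is already on disk."""
    # Hub file names use forward slashes, so normalize Windows-style paths.
    hf_path = local_path.replace("\\", "/")
    if not os.path.exists(local_path):
        # Imported lazily so the dependency is only loaded when a download is needed.
        from huggingface_hub import hf_hub_download
        hf_hub_download(
            repo_id="Chimera418/protein-ssp-artifacts",
            filename=hf_path,
            local_dir="."
        )
    return local_path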

The app calls this function before loading any model weight or pickle file. Because the function returns early when the file already exists on disk, later runs skip the download.
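
To make the per-mode lazy loading concrete, the sketch below lists which /models files each prediction mode would need. The mode keys, the `required_files` helper, and the assumption that each mode applies every earlier transform in order (Pearson filter, then PCA, then the V1 mask, then the V2 mask) are illustrative guesses based on the file descriptions, not taken from app.py.

```python
MODEL_DIR = "models"

# Transforms in pipeline order (Phase 5, 6, 7, 7.5).
TRANSFORMS = [
    "keep_indices.pkl",
    "pca_model.pkl",
    "feature_selector_mask.pkl",
    "feature_selector_mask_v2.pkl",
]

# Hypothetical mode keys -> (weights file, number of transforms applied first).
MODES = {
    "direct": ("phase_8_best_model_Rostlab_prot_t5_xl_uniref50.pt", 0),
    "pearson": ("phase_8_best_model_filtered_embeddings.pt", 1),
    "pca": ("phase_8_best_model_pca_embeddings.pt", 2),
    "features_v1": ("phase_8_best_model_final_features.pt", 3),
    "features_v2": ("phase_8_best_model_final_features_v2.pt", 4),
}


def required_files(mode):
    """Return repo-relative paths a mode needs, transforms first, weights last."""
    weights, n_transforms = MODES[mode]
    names = TRANSFORMS[:n_transforms] + [weights]
    return [f"{MODEL_DIR}/{name}" for name in names]


pca_files = required_files("pca")
```

Each returned path could then be passed to `ensure_model_exists`, so a user who only ever runs the PCA mode would never download the ~1.3 GB `feature_selector_mask.pkl`.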


🎯 Model Performance Summary

All five models were trained in Phase 8 and evaluated on a held-out test set:

| Mode | Input Dims | Q3 Accuracy | Macro F1 | AUC |
|---|---|---|---|---|
| Direct (raw ProtT5) | 1024 | 85.03% | 0.8494 | 0.9685 |
| Pearson-filtered | 1017 | 85.41% | 0.8530 | 0.9683 |
| PCA Pipeline ★ | 739 | 85.67% | 0.8563 | 0.9692 |
| Feature Selected V1 | 109 | 84.21% | 0.8413 | 0.9617 |
| Feature Selected V2 | 12 | 82.30% | 0.8215 | 0.9600 |

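The PCA Pipeline scores highest on all three metrics while using fewer input dimensions (739) than the Direct and Pearson-filtered modes. Feature Selected V2 gives up about 3.4 points of Q3 accuracy in exchange for a 12-dimensional input.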
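
For readers unfamiliar with the metrics, the sketch below shows the conventional definitions of Q3 accuracy and macro F1 over per-residue labels. It assumes labels are pooled across all residues in the test set; whether the reported numbers were pooled this way or averaged per protein is not stated here, and the function names are illustrative.

```python
def q3_accuracy(true, pred):
    """Fraction of residues whose predicted 3-state label matches the true one."""
    if len(true) != len(pred):
        raise ValueError("label sequences must have the same length")
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def macro_f1(true, pred, classes="HEC"):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(true, pred))
        fp = sum(t != c and p == c for t, p in zip(true, pred))
        fn = sum(t == c and p != c for t, p in zip(true, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


true_labels = "HHHEEECCC"
pred_labels = "HHHEECCCC"   # one sheet residue mislabeled as coil
acc = q3_accuracy(true_labels, pred_labels)
f1 = macro_f1(true_labels, pred_labels)
```

In this toy case 8 of 9 residues are correct, and the per-class F1 scores are 1.0 (H), 0.8 (E), and 6/7 (C). Macro F1 weights the three classes equally regardless of how many residues each has.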


📜 License

This project and all contained artifacts are licensed under the MIT License.
