🧬 Protein SSP — Artifacts Repository

This repository hosts all large binary artifacts for the Protein Secondary Structure Predictor — a deep learning pipeline that predicts α-Helix (H), β-Sheet (E), and Coil (C) secondary structure from raw amino acid sequences.

Keeping these artifacts out of the application repository avoids the Hugging Face Space storage limit (1 GB) and Git LFS constraints. The Space stays lightweight and fast to build, and the app downloads only the files it needs, on demand, at runtime.

🔗 Project Links

| Resource | URL |
|---|---|
| Live Streamlit App | huggingface.co/spaces/Chimera418/protein-ssp |
| Source Code | github.com/Chimera418/protein-ssp |
| This Artifacts Repo | huggingface.co/Chimera418/protein-ssp-artifacts |

📂 Repository Layout

protein-ssp-artifacts/
├── models/          # ~1.5 GB — downloaded by the Streamlit app at runtime
├── embeddings/      # ~26.7 GB — pre-computed for offline research
├── data/            # ~1.1 GB — curated dataset CSVs
└── raw_data/        # ~2.2 GB — raw RCSB/PISCES source files

📁 Folder Contents

`/models` — Trained Model Weights & Transforms (~1.5 GB)

The Streamlit app downloads these files the first time each prediction mode is used, then caches them locally:

| File | Size | Description |
|---|---|---|
| `phase_8_best_model_Rostlab_prot_t5_xl_uniref50.pt` | ~23 MB | PyTorch model for Direct mode (1024-dim input) |
| `phase_8_best_model_filtered_embeddings.pt` | ~23 MB | PyTorch model for Pearson-filtered mode (1017-dim input) |
| `phase_8_best_model_pca_embeddings.pt` | ~21 MB | PyTorch model for PCA Pipeline mode (739-dim input) ★ Best |
| `phase_8_best_model_final_features.pt` | ~17 MB | PyTorch model for Feature Selected V1 (109-dim input) |
| `phase_8_best_model_final_features_v2.pt` | ~16 MB | PyTorch model for Feature Selected V2 (12-dim input) |
| `pca_model.pkl` | ~6 MB | Fitted PCA transform (1017 → 739 principal components) |
| `keep_indices.pkl` | ~3 KB | Pearson filter mask: column indices to retain after Phase 5 filtering |
| `feature_selector_mask.pkl` | ~1.3 GB | ExtraTrees V1 selection mask: 109 PCA-space indices (Phase 7) |
| `feature_selector_mask_v2.pkl` | ~2 KB | ExtraTrees V2 selection mask: top-12 refined indices (Phase 7.5) |

All five PyTorch models share the same architecture: 1D-CNN → BiLSTM → Multi-Head Attention → Linear head (3 classes).
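
The layer order described above can be sketched in PyTorch roughly as follows. This is an illustration only, not the trained network: the class name `SSPNetSketch`, the kernel size, the channel widths, and the number of attention heads are all assumptions, since this README specifies only the sequence of layer types and the 3-class output.

```python
import torch
from torch import nn


class SSPNetSketch(nn.Module):
    """Hypothetical 1D-CNN -> BiLSTM -> multi-head attention -> linear head."""

    def __init__(self, in_dim, conv_channels=128, lstm_hidden=128,
                 num_heads=4, num_classes=3):
        super().__init__()
        # Conv1d expects (batch, channels, length); padding keeps the length.
        self.conv = nn.Conv1d(in_dim, conv_channels, kernel_size=5, padding=2)
        self.lstm = nn.LSTM(conv_channels, lstm_hidden,
                            batch_first=True, bidirectional=True)
        # A bidirectional LSTM doubles the feature width.
        self.attn = nn.MultiheadAttention(2 * lstm_hidden, num_heads,
                                          batch_first=True)
        self.head = nn.Linear(2 * lstm_hidden, num_classes)

    def forward(self, x):
        # x: (batch, residues, in_dim) per-residue feature vectors.
        h = torch.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        h, _ = self.lstm(h)
        h, _ = self.attn(h, h, h)  # self-attention over residues
        return self.head(h)        # (batch, residues, 3) logits for H / E / C


# Example: the PCA Pipeline mode feeds 739-dim inputs.
model = SSPNetSketch(in_dim=739)
logits = model(torch.zeros(2, 50, 739))
print(tuple(logits.shape))
```

Only `in_dim` would need to change between the five modes (1024, 1017, 739, 109, or 12) if the checkpoints do share one architecture, as stated above.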

`/embeddings` — Pre-computed LLM Embeddings (~26.7 GB)

These are per-residue embeddings, pre-computed with Rostlab/prot_t5_xl_uniref50 for all ~9,000 training proteins. They are used for offline model training, validation, and batch experiments, and are not needed to run the Streamlit app.

| File | Approx. Size | Description |
|---|---|---|
| `Rostlab_prot_t5_xl_uniref50.pkl` | ~9.44 GB | Raw 1024-dim ProtT5 embeddings for all proteins |
| `filtered_embeddings.pkl` | ~9.38 GB | After Phase 5 Pearson filter (1017-dim) |
| `pca_embeddings.pkl` | ~6.81 GB | After Phase 6 PCA (739-dim) |
| `final_features.pkl` | ~1.00 GB | After Phase 7 ExtraTrees selection (109-dim) |
| `final_features_v2.pkl` | ~0.11 GB | After Phase 7.5 top-12 refinement (12-dim) |
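
The dimension progression in this table (1024 → 1017 → 739 → 109 → 12) can be traced with a short NumPy sketch. Everything here is a stand-in: the arrays are random, the PCA is represented by a hypothetical mean vector and component matrix rather than the pickled object, and the V2 mask is assumed to index into the 109 V1 features, which this README does not confirm.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for one protein: 50 residues x 1024 ProtT5 dimensions.
raw = rng.normal(size=(50, 1024))

# Hypothetical stand-ins for the fitted transforms (random, not the real ones).
keep_indices = np.sort(rng.choice(1024, size=1017, replace=False))  # Phase 5
pca_mean = rng.normal(size=1017)                                    # Phase 6
pca_components = rng.normal(size=(739, 1017))
mask_v1 = np.sort(rng.choice(739, size=109, replace=False))         # Phase 7
mask_v2 = np.sort(rng.choice(109, size=12, replace=False))          # Phase 7.5 (assumed)

filtered = raw[:, keep_indices]                      # drop correlated columns
reduced = (filtered - pca_mean) @ pca_components.T   # centre, then project
selected_v1 = reduced[:, mask_v1]                    # keep 109 components
selected_v2 = selected_v1[:, mask_v2]                # keep 12 of those

for name, arr in [("raw", raw), ("filtered", filtered), ("pca", reduced),
                  ("v1", selected_v1), ("v2", selected_v2)]:
    print(name, arr.shape)
```

Each mode's model consumes the output of the matching stage, so a mode only needs the transforms upstream of it.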

`/data` — Curated Dataset CSVs (~1.1 GB)

Intermediate and final dataset files produced by the data curation pipeline (Phases 1–7.5):

| File | Description |
|---|---|
| `protein_sequences_raw.csv` | All sequences parsed from RCSB PDB `ss.txt.gz` (Phase 1 output) |
| `protein_sequences_curated.csv` | After deduplication, length filtering, invalid-AA removal, and PISCES redundancy culling at ≤70% identity (Phase 2 output) |
| `protein_labelled_curated.csv` | Curated sequences with per-residue SST8 and SST3 labels matched from RCSB (Phase 3 output) |
| `filtered_protein_embeddings.csv` | Pearson-filtered embedding matrix as CSV (Phase 5 output) |
| `pca_protein_embeddings.csv` | PCA-reduced embedding matrix as CSV (Phase 6 output) |
| `final_selected_features.csv` | ExtraTrees V1 selected features as CSV (Phase 7 output) |
| `final_selected_features_v2.csv` | ExtraTrees V2 top-12 features as CSV (Phase 7.5 output) |
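
`protein_labelled_curated.csv` carries both 8-state (SST8) and 3-state (SST3) labels. One widely used reduction from the eight DSSP states to H/E/C is sketched below. Whether this pipeline uses exactly this grouping is an assumption, and the names `SST8_TO_SST3` and `reduce_sst8` are hypothetical.

```python
# A common DSSP 8-state -> 3-state reduction (assumed, not confirmed by this repo).
SST8_TO_SST3 = {
    "H": "H", "G": "H", "I": "H",                          # helix types -> H
    "E": "E", "B": "E",                                    # strand and bridge -> E
    "T": "C", "S": "C", "C": "C", "-": "C", " ": "C",      # everything else -> coil
}


def reduce_sst8(sst8):
    """Map a per-residue 8-state string to its 3-state equivalent."""
    return "".join(SST8_TO_SST3[state] for state in sst8)


print(reduce_sst8("CHHHHGGTEEEEBSC"))  # prints CHHHHHHCEEEEECC
```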

`/raw_data` — Raw Source Data (~2.2 GB)

Original files downloaded directly from RCSB PDB and the Dunbrack PISCES server:

| File | Description |
|---|---|
| `pisces_lists_2026_05_14.tar.gz` | PISCES culled PDB list archive (~1.9 GB), used to enforce ≤70% sequence identity in Phase 2 |
| `2026-05-16-ss.txt.gz` / `2026-05-17-ss.txt.gz` | RCSB PDB secondary structure annotation files (`ss.txt.gz`), used in Phases 1 & 3 |
| `2026-05-16-source.idx` / `2026-05-17-source.idx` | RCSB PDB organism source index files, used to annotate sequences with organism names in Phase 1 |
| `2026-05-16-ss.csv` | Parsed `ss.txt.gz` intermediate CSV |
| `cullpdb_pc25.0_res0.0-2.0_len40-10000_R0.25_Xray_d2026_05_14_chains9162` | PISCES culled list (pc25, resolution ≤2.0 Å, R-factor ≤0.25) used as a cross-reference |

⚙️ How the App Uses This Repository

The Streamlit app (app.py) uses the huggingface_hub SDK to lazily pull model files the first time each prediction mode is used:

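```python
import os


def ensure_model_exists(local_path):
    # Repo paths use forward slashes, even when the app runs on Windows.
    hf_path = local_path.replace("\\", "/")
    if not os.path.exists(local_path):
        # Imported lazily so the hub client is only loaded when a download is needed.
        from huggingface_hub import hf_hub_download
        # local_dir="." mirrors the repo layout, so the file lands at local_path.
        hf_hub_download(
            repo_id="Chimera418/protein-ssp-artifacts",
            filename=hf_path,
            local_dir="."
        )
    return local_path
```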

The app calls this function before loading any model weight or pickle file. Once a file has been downloaded it stays on disk, so later calls skip the download and return immediately.
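
The download-once behaviour can be checked without network access by swapping the real download for a stand-in. In the sketch below, `MODE_FILES`, `ensure_files`, and `fake_download` are hypothetical names; the file names come from the `/models` table, but which files each mode actually needs is an assumption.

```python
import os
import tempfile

# Hypothetical mapping from prediction mode to the files it needs.
MODE_FILES = {
    "direct": ["models/phase_8_best_model_Rostlab_prot_t5_xl_uniref50.pt"],
    "pca": [
        "models/keep_indices.pkl",
        "models/pca_model.pkl",
        "models/phase_8_best_model_pca_embeddings.pt",
    ],
}


def ensure_files(mode, root, download):
    """Fetch only the files missing for `mode`; return their local paths."""
    paths = []
    for rel in MODE_FILES[mode]:
        local = os.path.join(root, rel)
        if not os.path.exists(local):
            download(rel, local)
        paths.append(local)
    return paths


calls = []


def fake_download(rel, local):
    # Stand-in for hf_hub_download: writes a placeholder and records the call.
    os.makedirs(os.path.dirname(local), exist_ok=True)
    with open(local, "wb") as f:
        f.write(b"placeholder")
    calls.append(rel)


with tempfile.TemporaryDirectory() as root:
    ensure_files("pca", root, fake_download)
    ensure_files("pca", root, fake_download)  # second call finds the files on disk
    print(len(calls))  # prints 3: each file was fetched once
```

Passing the downloader as an argument keeps the caching logic testable; in the app, that argument would wrap `hf_hub_download`.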

🎯 Model Performance Summary

All five models were trained in Phase 8 and evaluated on a held-out test set:

| Mode | Input Dims | Q3 Accuracy | Macro F1 | AUC |
|---|---|---|---|---|
| Direct (raw ProtT5) | 1024 | 85.03% | 0.8494 | 0.9685 |
| Pearson-filtered | 1017 | 85.41% | 0.8530 | 0.9683 |
| PCA Pipeline ★ | 739 | 85.67% | 0.8563 | 0.9692 |
| Feature Selected V1 | 109 | 84.21% | 0.8413 | 0.9617 |
| Feature Selected V2 | 12 | 82.30% | 0.8215 | 0.9600 |

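The PCA Pipeline has the highest Q3 accuracy, macro F1, and AUC of the five modes while using 739 input dimensions instead of the raw 1024. Feature Selected V2 stays within about 3.4 percentage points of that Q3 accuracy with only 12 input dimensions.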
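
For readers unfamiliar with the metrics in the table, the sketch below computes Q3 accuracy (the fraction of residues whose predicted H/E/C state matches the true state) and macro F1 (the unweighted mean of the per-class F1 scores) on a toy example. The function names are hypothetical, and whether the reported numbers pool residues across proteins or average per protein is not stated in this README.

```python
def q3_accuracy(true, pred):
    """Fraction of residues with the correct 3-state label."""
    return sum(t == p for t, p in zip(true, pred)) / len(true)


def macro_f1(true, pred, classes="HEC"):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(true, pred))
        fp = sum(t != c and p == c for t, p in zip(true, pred))
        fn = sum(t == c and p != c for t, p in zip(true, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


true = "HHHHEEEECC"
pred = "HHHCEEECCC"
print(q3_accuracy(true, pred))           # 0.8 (8 of 10 residues correct)
print(round(macro_f1(true, pred), 4))    # 0.7937
```

Macro F1 weights the three classes equally, so it drops faster than Q3 when a less frequent class is predicted poorly.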

📜 License

This project and all contained artifacts are licensed under the MIT License.
