| title: README | |
| emoji: 🔥 | |
| colorFrom: yellow | |
| colorTo: gray | |
| sdk: static | |
| pinned: false | |
| # Project PhenoSeq: Protein Network Analysis for Phenotypic Outcomes | |
| While demonstrating promising results in basic prediction tasks, the project identified key areas for improvement in protein-phenotype relationship modeling. The findings provide a foundation for future work in protein network analysis and phenotype prediction. | |
| *This project is a step toward understanding protein-phenotype relationships, and it highlights areas for future research and development in computational biology.* | |
| ## Project Overview | |
| PhenoSeq investigates how protein networks contribute to organism-scale phenotypes, particularly cancer growth and organism longevity. The project combines protein embeddings from ESM (Evolutionary Scale Modeling) with graph neural networks to predict phenotypic outcomes through protein-protein interactions (PPIs). | |
| ## Core Objectives | |
| 1. Develop predictive models for understanding biological drivers of complex diseases | |
| 2. Create frameworks for inferring oncogenic potential of genetic mutations | |
| 3. Analyze clinical significance of protein modifications using sequence embeddings | |
| 4. Establish connections between protein networks and phenotypic outcomes | |
| ## Data Sources | |
| The project used three public databases: | |
| - DepMap: CRISPR-based experimental data measuring the effect of protein deletion on cancer cell proliferation | |
| - TCGA (The Cancer Genome Atlas): patient data used for the survival analysis | |
| - Longevity Database: Species longevity information | |
| ## Methodological Approach | |
| ### Model Development | |
| The team developed three distinct models: | |
| 1. **Baseline Model** | |
|    - Fully connected network predicting CRISPR scores from embeddings | |
|    - Achieved a Pearson correlation of 0.55 with ground truth | |
|    - Outperformed a K-nearest neighbors baseline | |
|    - Performance correlated with proximity to the training set | |
| 2. **Cell Line-Specific Model** | |
|    - Incorporated cell line identity through a one-hot embedding | |
|    - Included mutation status (wild type vs mutated) | |
|    - Achieved a Pearson correlation of 0.44 with ground truth | |
|    - Limited success in predicting cell line-specific differences | |
| 3. **PPI-Informed Model** | |
|    - Integrated protein-protein interaction data | |
|    - Results comparable to the cell line-specific model | |
|    - Limited additional performance gain from PPI integration | |
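The inputs described for the cell line-specific model (protein embedding, one-hot cell line identity, mutation status) can be pictured as one concatenated feature vector. The sketch below is only an illustration under that assumption: the function names, the tiny 3-dimensional embedding, and the choice to encode mutation status as a single 0/1 flag are hypothetical and not taken from the project code.

```python
from typing import List, Sequence


def one_hot(index: int, size: int) -> List[float]:
    """Return a vector of length `size` with a 1.0 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("index must be in [0, size)")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def build_input(
    embedding: Sequence[float],
    cell_line_index: int,
    n_cell_lines: int,
    is_mutated: bool,
) -> List[float]:
    """Concatenate embedding, one-hot cell line identity, and a mutation flag.

    Assumption: mutation status is a single binary feature
    (0.0 = wild type, 1.0 = mutated).
    """
    mutation_flag = [1.0 if is_mutated else 0.0]
    return list(embedding) + one_hot(cell_line_index, n_cell_lines) + mutation_flag


# Toy example: a made-up 3-dimensional embedding and 4 cell lines.
features = build_input([0.2, -0.1, 0.7], cell_line_index=1, n_cell_lines=4, is_mutated=True)
print(features)  # [0.2, -0.1, 0.7, 0.0, 1.0, 0.0, 0.0, 1.0]
```

With real ESM embeddings and the 1,150 cell lines mentioned in the DepMap section, the same layout gives a vector of embedding dimension + 1,150 + 1 entries per protein-cell line pair.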
| ### Additional Analyses | |
| - Species Longevity Analysis | |
|   - Challenges in cross-phylogenetic prediction | |
|   - Limited success across different orders of the phylogenetic tree | |
| - TCGA Patient Survival Analysis | |
|   - Achieved significant correlations | |
|   - Performance below initial expectations | |
| ## Key Findings | |
| 1. ESM3 embeddings contain valuable functional information | |
| 2. Simple models can outperform basic baselines | |
| 3. The current approach has limited ability to capture subtle effects | |
| 4. Predicting mutation-specific impacts remains challenging | |
|  | |
| ## Future Directions | |
| 1. Integration of additional data types: | |
|    - Copy number variation | |
|    - Transcriptomic information | |
| 2. Exploration of amino acid level embeddings | |
| 3. Enhanced signal processing methods | |
| 4. Improved model architectures | |
| ## Technical Achievements | |
| - Successful implementation of protein embedding analysis | |
| - Development of multiple predictive models | |
| - Integration of complex biological datasets | |
| - Novel approaches to phenotype prediction | |
| ## Limitations and Challenges | |
| 1. Limited success in cell line-specific predictions | |
| 2. Challenges in cross-phylogenetic predictions | |
| 3. Limited ability to detect subtle effects | |
| 4. Data integration complexities | |
| ## Impact and Applications | |
| - Enhanced understanding of disease mechanisms | |
| - Improved drug target identification | |
| - Better prediction of genetic mutation effects | |
| - Advanced protein function analysis | |
| # PhenoSeq Longevity Analysis Component | |
| This analysis revealed both the potential and limitations of using protein sequence data for predicting species longevity, highlighting the importance of taxonomic relationships in such predictions. | |
| ## Overview | |
| The longevity analysis component of PhenoSeq investigated the relationship between protein sequences and species lifespan across different taxonomic orders, with a particular focus on Primates, Chiroptera (bats), and Cetacea (whales). | |
| ## Key Findings | |
|  | |
| ### 1. Taxonomic Order Analysis | |
| - The study examined lifespan distributions across multiple orders including: | |
|   - Rodentia | |
|   - Artiodactyla | |
|   - Carnivora | |
|   - Primates | |
|   - Chiroptera | |
|   - Cetacea | |
|   - Diprotodontia | |
|   - Perissodactyla | |
| ### 2. Prediction Performance | |
| - Mean predictions across orders were relatively successful | |
| - However, predictions within individual orders showed limited accuracy | |
| - High-performing proteins were not well conserved between different orders | |
|  | |
| ### 3. Model Architecture Insights | |
| - Later layers in the neural network did not provide significant additional information | |
| - Training curves showed convergence but with limitations in prediction accuracy | |
| ### 4. Protein Embedding Analysis | |
| - Analysis of protein ALDOB showed that: | |
|   - Nearest neighbor species in embedding space typically belonged to the same Order/Family | |
|   - Strong taxonomic clustering was observed in the embedding space | |
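The nearest-neighbor observation for ALDOB can be checked with a simple procedure: for each species, find the most similar other species in embedding space and see whether the two share a taxonomic order. The sketch below assumes cosine similarity as the distance measure and uses made-up 2-dimensional vectors and placeholder species names; the project's actual metric and embeddings are not specified in this document.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbor(query: str, embeddings: Dict[str, Sequence[float]]) -> str:
    """Name of the species (other than `query`) with the most similar embedding."""
    others = (name for name in embeddings if name != query)
    return max(others, key=lambda name: cosine_similarity(embeddings[query], embeddings[name]))


def same_order_fraction(
    embeddings: Dict[str, Sequence[float]], orders: Dict[str, str]
) -> float:
    """Fraction of species whose nearest neighbor belongs to the same order."""
    hits = sum(orders[s] == orders[nearest_neighbor(s, embeddings)] for s in embeddings)
    return hits / len(embeddings)


# Toy data: hypothetical species names and invented vectors, not real ALDOB embeddings.
toy_embeddings = {
    "primate_a": [1.0, 0.1],
    "primate_b": [0.9, 0.2],
    "bat_a": [0.1, 1.0],
    "bat_b": [0.2, 0.9],
}
toy_orders = {
    "primate_a": "Primates",
    "primate_b": "Primates",
    "bat_a": "Chiroptera",
    "bat_b": "Chiroptera",
}

fraction = same_order_fraction(toy_embeddings, toy_orders)
print(fraction)  # 1.0 for this toy data
```

A fraction close to 1.0 on real data would correspond to the strong taxonomic clustering reported above.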
| ### 5. Hierarchical Prediction Accuracy | |
| Correlation strength increased with taxonomic specificity: | |
| - Order level: r = 0.8 (271 species across 12 orders) | |
| - Family level: r = 0.92 (191 species across 27 families) | |
| - Genus level: r = 0.97 (47 species across 15 genera) | |
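This document does not say how the level-wise correlations were computed. One simple way to probe how much lifespan signal a taxonomic level carries on its own is a leave-one-out taxon-mean predictor: each species is predicted as the mean lifespan of the other species in its taxon, and Pearson's r is computed between predictions and observations. The sketch below illustrates that technique only; it is an assumption, not the project's method, and the lifespans and taxon labels are invented.

```python
import math
from collections import defaultdict
from typing import List, Sequence, Tuple


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


def leave_one_out_taxon_mean(
    lifespans: Sequence[float], taxa: Sequence[str]
) -> Tuple[List[float], List[float]]:
    """Predict each species as the mean of the other members of its taxon.

    Species that are alone in their taxon are skipped, since no
    other member is available to form a prediction.
    """
    members = defaultdict(list)
    for i, taxon in enumerate(taxa):
        members[taxon].append(i)

    predicted, observed = [], []
    for indices in members.values():
        if len(indices) < 2:
            continue
        total = sum(lifespans[i] for i in indices)
        for i in indices:
            predicted.append((total - lifespans[i]) / (len(indices) - 1))
            observed.append(lifespans[i])
    return predicted, observed


# Invented lifespans (years) and placeholder taxon labels.
toy_lifespans = [3.0, 4.0, 30.0, 40.0, 70.0, 80.0, 12.0]
toy_taxa = ["A", "A", "B", "B", "C", "C", "D"]

predicted, observed = leave_one_out_taxon_mean(toy_lifespans, toy_taxa)
r = pearson(predicted, observed)
print(len(predicted), round(r, 2))
```

Running the same procedure with order, family, and genus labels in turn shows how the correlation changes as the grouping becomes more specific. Dropping singleton taxa would also mean fewer species at finer levels, though whether that explains the falling species counts listed above is not stated here.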
|  | |
| ## Technical Limitations | |
| - Limited success in cross-order predictions | |
| - Difficulty in generalizing predictions across distant phylogenetic relationships | |
| - Need for order/family-specific modeling approaches | |
| ## Key Insights | |
| - Strong within-taxon predictions | |
| - Decreasing accuracy with increasing phylogenetic distance | |
| - Need for taxonomic stratification in prediction models | |
| - High predictive power at genus level suggests strong genetic influence on longevity within closely related species | |
|  | |
| # PhenoSeq DepMap Analysis Component | |
| This analysis demonstrated both the potential and current limitations of using protein sequence data to predict cancer-relevant protein functions, highlighting areas for future improvement in protein-phenotype prediction models. | |
|  | |
| ## Overview | |
| The DepMap component investigated protein function in cancer through CRISPR-based knockout experiments, analyzing 9,353 proteins across 1,150 different cell lines to understand their effects on cancer cell growth. | |
| ## Three Models | |
| 1. **Baseline Model** | |
|    - Input: Average protein embedding across all cell lines | |
|    - Output: Average CrisprScore across all cell lines | |
|    - Architecture: Simple feedforward network using ESM3-open-small embeddings | |
|    - Performance: Achieved Pearson correlation of 0.55 | |
|    - Outperformed KNN baseline across all K values | |
| 2. **Cell-line-specific Model** | |
|    - Predicted CrisprScore effects for each protein-cell line combination | |
|    - Performance: Achieved Pearson correlation of 0.44 | |
|    - Limited success in predicting protein-specific differences between cell lines | |
|    - Poor correlation (r = 0.01) for individual proteins like MYC across cancer types | |
| 3. **PPI-informed Model** | |
|    - Incorporated protein-protein interaction networks | |
|    - Aimed to predict CrisprScore effects by propagating signals through PPI networks | |
|    - Results similar to cell-line-specific model | |
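To make "propagating signals through PPI networks" concrete, the sketch below shows the simplest non-learned form of the idea: each protein's score is blended with the mean score of its interaction partners. This is a hand-written simplification for illustration, not the project's graph neural network; the mixing weight `alpha`, the function name, and the toy proteins `P1` to `P4` are all hypothetical.

```python
from typing import Dict, Iterable, Tuple


def propagate(
    scores: Dict[str, float],
    edges: Iterable[Tuple[str, str]],
    alpha: float = 0.5,
    steps: int = 1,
) -> Dict[str, float]:
    """Blend each node's score with the mean score of its neighbors.

    `alpha` is the weight given to the neighbor mean (assumed value).
    Every protein named in `edges` must also appear in `scores`.
    Proteins without interaction partners keep their score unchanged.
    """
    neighbors = {node: set() for node in scores}
    for a, b in edges:
        neighbors[a].add(b)  # PPI edges are treated as undirected
        neighbors[b].add(a)

    current = dict(scores)
    for _ in range(steps):
        updated = {}
        for node, value in current.items():
            partners = neighbors[node]
            if partners:
                partner_mean = sum(current[p] for p in partners) / len(partners)
                updated[node] = (1 - alpha) * value + alpha * partner_mean
            else:
                updated[node] = value
        current = updated
    return current


# Toy network: P1 - P2 - P3 form a chain, P4 has no partners.
toy_scores = {"P1": -1.0, "P2": 0.0, "P3": 0.0, "P4": 0.5}
toy_edges = [("P1", "P2"), ("P2", "P3")]

smoothed = propagate(toy_scores, toy_edges, alpha=0.5, steps=1)
print(smoothed)  # {'P1': -0.5, 'P2': -0.25, 'P3': 0.0, 'P4': 0.5}
```

After one step the strong effect at `P1` has leaked to its direct partner `P2` but not yet to `P3`; more steps spread it further along the chain. A graph neural network learns the mixing rather than fixing it by hand.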
| ## Key Findings | |
|  | |
| ### Model Performance | |
| - Baseline model showed strong general prediction capability | |
| - Distance to nearest neighbors in training set affected performance | |
| - Larger networks didn't necessarily improve performance | |
| - Model outperformed nearest-neighbor lookup, suggesting it learned generalizable features rather than memorizing training examples | |
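The KNN baseline and the "distance to nearest neighbors" effect can both be illustrated with a few lines of code: predict a protein's score as the mean score of its K closest training embeddings, and record how far away the closest one is. The sketch below assumes Euclidean distance and uses invented 2-dimensional embeddings and scores; the distance metric and K values used in the project are not stated in this document.

```python
import math
from typing import List, Sequence, Tuple


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(
    query: Sequence[float],
    train_x: List[Sequence[float]],
    train_y: List[float],
    k: int,
) -> Tuple[float, float]:
    """Return (mean score of the k nearest training points, distance to the nearest one)."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: euclidean(query, pair[0]))
    top = ranked[:k]
    prediction = sum(score for _, score in top) / len(top)
    nearest_distance = euclidean(query, ranked[0][0])
    return prediction, nearest_distance


# Invented training embeddings and scores.
toy_train_x = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
toy_train_y = [-1.0, -0.8, -0.6, 0.2]

prediction, nearest_distance = knn_predict([0.1, 0.1], toy_train_x, toy_train_y, k=3)
print(round(prediction, 3), round(nearest_distance, 3))  # -0.8 0.141
```

Grouping test proteins by `nearest_distance` and comparing prediction error across groups is one way to examine how performance depends on proximity to the training set.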
|  | |
| ### Technical Insights | |
| - Hyperparameter sweeps showed similar training patterns across: | |
|   - Different numbers of layers | |
|   - Various hidden dimensions | |
| - Model struggled with fine-grained predictions of mutation effects | |
| ### Limitations | |
| - Poor performance in predicting effects of small sequence differences | |
| - Limited ability to distinguish between mutations of the same protein | |
| - Challenges in cell-line-specific predictions | |
| ## Technical Details | |
| - CrisprScore distribution showed varied effects of protein deletion | |
| - Different proteins showed distinct patterns of effect across cell lines | |
| - Model performance was consistent across different architectural choices | |
| ## Future Implications | |
| - Need for improved mutation-specific prediction capabilities | |
| - Potential for enhanced protein function understanding | |
| - Opportunity for better cancer-specific protein effect prediction | |