title: Crystallization Component Predictor
emoji: 🔬
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: 1.29.0
app_file: app.py
pinned: false
license: mit
🔬 Crystallization Component Predictor
An interactive machine learning application for predicting optimal protein crystallization components based on experimental parameters.
🎯 What Does This App Do?
This tool predicts three critical crystallization parameters:
- Component Name: The chemical compound most likely to produce crystals
- Concentration: The optimal molarity for the component
- pH: The ideal acidity/basicity level for crystallization
Quick Start
- Select a model (Advanced Baseline recommended)
- Input your crystallization parameters:
  - Crystallization method
  - Temperature
  - pH
  - Matthews coefficient
  - Solvent content
- Click "Predict Components"
- Review predictions and download results
Model Performance
| Model | Component Name Accuracy | Concentration R² | pH R² |
|---|---|---|---|
| Simple Baseline | 61.12% | N/A | 95.58% |
| Advanced Baseline (recommended) | 64.18% | 47.33% | 99.34% |
| Transformer | 53.85% | 18.72% | 99.27% |
Recommended: Advanced Baseline for best overall performance
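
The README does not say how these metrics were computed. As a minimal sketch, assuming the conventional definitions (accuracy as the fraction of exact label matches, R² as the coefficient of determination), the two metric types can be reproduced on a handful of made-up values. The component names and numbers below are illustrative only and are not taken from the training data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Illustrative values only (hypothetical, not from the app's data).
names_true = ["PEG 3350", "ammonium sulfate", "PEG 3350", "sodium chloride"]
names_pred = ["PEG 3350", "ammonium sulfate", "sodium chloride", "sodium chloride"]
ph_true = [6.5, 7.0, 8.5, 5.5]
ph_pred = [6.4, 7.1, 8.3, 5.6]

name_accuracy = accuracy(names_true, names_pred)
ph_r2 = r_squared(ph_true, ph_pred)
print(f"Name accuracy: {name_accuracy:.2%}, pH R²: {ph_r2:.2%}")
```

Reporting R² as a percentage, as the table does, simply multiplies the coefficient by 100.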
🔬 Features
- Two Model Approaches: Choose between Simple and Advanced Baseline
- Interactive UI: Easy-to-use sliders and dropdowns
- Top-5 Predictions: View confidence scores for multiple candidates
- Visual pH Scale: Intuitive pH visualization
- Downloadable Results: Export predictions as CSV
- Performance Charts: Compare model accuracies
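
The top-5 ranking and CSV export can be sketched with the standard library alone. This is an assumption about how such a feature might work, not the app's actual code: `top_k` and `to_csv` are hypothetical helper names, and the class names and probabilities are made up.

```python
import csv
import io


def top_k(class_names, probabilities, k=5):
    """Return the k (name, probability) pairs with the highest probability."""
    ranked = sorted(
        zip(class_names, probabilities), key=lambda pair: pair[1], reverse=True
    )
    return ranked[:k]


def to_csv(ranked):
    """Serialize ranked predictions to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["rank", "component", "confidence"])
    for rank, (name, probability) in enumerate(ranked, start=1):
        writer.writerow([rank, name, f"{probability:.4f}"])
    return buffer.getvalue()


# Hypothetical classifier output: one probability per known component.
classes = ["PEG 400", "PEG 3350", "MPD", "ammonium sulfate",
           "sodium chloride", "lithium sulfate"]
probabilities = [0.04, 0.40, 0.10, 0.25, 0.15, 0.06]

top5 = top_k(classes, probabilities)
csv_text = to_csv(top5)
print(csv_text)
```

In a Streamlit app, text like `csv_text` could be handed to `st.download_button` to offer the file for download.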
🛠️ Technical Details
Simple Baseline
- Random Forest for component classification
- XGBoost for pH regression
- 4 numerical features + TF-IDF of crystallization method
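
A feature layout like this can be sketched with a scikit-learn pipeline. This is a guess at the structure, not the trained model: it assumes the four numerical features are the app's four numeric inputs, the column names are hypothetical, and the toy rows are invented. Only the component classifier is shown; the pH regressor would reuse the same feature step with a regressor in place of the classifier.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical column names for the four numeric inputs.
NUMERIC_COLS = ["temperature_k", "ph", "matthews_coefficient", "solvent_content"]


def build_name_classifier(random_state=0):
    """Scaled numeric features + TF-IDF of the method text -> Random Forest."""
    features = ColumnTransformer([
        ("numeric", StandardScaler(), NUMERIC_COLS),
        # A string (not a list) selects a 1-D column, as text vectorizers expect.
        ("method", TfidfVectorizer(), "method"),
    ])
    return Pipeline([
        ("features", features),
        ("classifier", RandomForestClassifier(n_estimators=50,
                                              random_state=random_state)),
    ])


# Invented toy rows, only to show the expected input shape.
toy = pd.DataFrame({
    "method": ["VAPOR DIFFUSION, HANGING DROP", "VAPOR DIFFUSION, SITTING DROP",
               "BATCH", "VAPOR DIFFUSION, HANGING DROP", "MICROBATCH",
               "VAPOR DIFFUSION, SITTING DROP"],
    "temperature_k": [293.0, 291.0, 277.0, 298.0, 277.0, 293.0],
    "ph": [6.5, 7.0, 5.5, 8.0, 4.6, 7.5],
    "matthews_coefficient": [2.4, 2.1, 3.0, 2.6, 2.2, 2.8],
    "solvent_content": [48.0, 41.0, 59.0, 52.0, 44.0, 56.0],
})
labels = ["PEG 3350", "ammonium sulfate", "PEG 3350",
          "ammonium sulfate", "PEG 3350", "ammonium sulfate"]

model = build_name_classifier().fit(toy, labels)
predicted = model.predict(toy.iloc[[0]])
print(predicted[0])
```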
Advanced Baseline (Recommended)
- Ensemble of Random Forest, XGBoost, LightGBM, and CatBoost
- 8 engineered features including interaction terms
- Separate models for name, concentration, and pH
- Log-transformed concentration predictions
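
The log-transformed concentration target can be illustrated with a small wrapper that fits in log space and maps predictions back to molar units. The README does not say which transform was used; `log1p`/`expm1` is an assumption here, and `MeanRegressor` is a deliberately trivial stand-in for the real ensemble.

```python
import math


class MeanRegressor:
    """Trivial stand-in model: always predicts the mean of its training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class LogTargetRegressor:
    """Fit the wrapped model on log1p(y) and invert predictions with expm1."""

    def __init__(self, base):
        self.base = base

    def fit(self, X, y):
        self.base.fit(X, [math.log1p(value) for value in y])
        return self

    def predict(self, X):
        return [math.expm1(value) for value in self.base.predict(X)]


# Made-up concentrations in mol/L spanning two orders of magnitude.
concentrations = [0.01, 0.1, 1.0, 2.0]
rows = [[0.0]] * len(concentrations)  # features are ignored by MeanRegressor

model = LogTargetRegressor(MeanRegressor()).fit(rows, concentrations)
prediction = model.predict([[0.0]])[0]
arithmetic_mean = sum(concentrations) / len(concentrations)
print(f"log-space prediction: {prediction:.3f} M, plain mean: {arithmetic_mean:.3f} M")
```

Fitting in log space reduces the pull of a few very high concentrations on the prediction. scikit-learn's `TransformedTargetRegressor` offers the same pattern for real estimators.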
Models Included
- simple_baseline/: Simple baseline models
  - model_component_name.pkl: Component classifier
  - model_component_ph.pkl: pH regressor
  - label_encoder_name.pkl: Label encoder
  - scaler.pkl: Feature scaler
  - tfidf.pkl: TF-IDF vectorizer
- advanced_baseline/: Advanced baseline models
  - model_component_name.pkl: Enhanced component classifier
  - model_component_conc.pkl: Concentration regressor
  - model_component_ph.pkl: Enhanced pH regressor
  - label_encoder_name.pkl: Label encoder
  - scaler.pkl: Feature scaler
  - tfidf.pkl: TF-IDF vectorizer
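
Given the directory layout listed above, a loader might first check that every artifact is present and then deserialize it. This sketch assumes the `.pkl` files were written with Joblib (it is listed under Dependencies, but the README does not confirm it). The helper names are hypothetical.

```python
from pathlib import Path

# File names as listed in this README, grouped by model directory.
ARTIFACTS = {
    "simple_baseline": [
        "model_component_name.pkl",
        "model_component_ph.pkl",
        "label_encoder_name.pkl",
        "scaler.pkl",
        "tfidf.pkl",
    ],
    "advanced_baseline": [
        "model_component_name.pkl",
        "model_component_conc.pkl",
        "model_component_ph.pkl",
        "label_encoder_name.pkl",
        "scaler.pkl",
        "tfidf.pkl",
    ],
}


def artifact_paths(model_dir, root="."):
    """Build the expected path of every artifact for one model directory."""
    return [Path(root) / model_dir / name for name in ARTIFACTS[model_dir]]


def missing_artifacts(model_dir, root="."):
    """Return the expected artifact paths that do not exist on disk."""
    return [path for path in artifact_paths(model_dir, root) if not path.is_file()]


def load_artifacts(model_dir, root="."):
    """Load all artifacts into a dict keyed by file stem (assumes Joblib pickles)."""
    import joblib  # imported lazily so the path helpers work without it

    missing = missing_artifacts(model_dir, root)
    if missing:
        raise FileNotFoundError(f"Missing artifacts: {[str(p) for p in missing]}")
    return {path.stem: joblib.load(path) for path in artifact_paths(model_dir, root)}


print([path.name for path in artifact_paths("advanced_baseline")])
```

Pickle-based files can execute arbitrary code when loaded, so only load artifacts from a source you trust.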
📦 Dependencies
- Python 3.9+
- Streamlit
- Scikit-learn
- XGBoost
- LightGBM
- CatBoost
- Pandas
- NumPy
- Joblib
Use Cases
- Structural Biology: Plan crystallization experiments
- Drug Discovery: Optimize protein crystal conditions
- Research: Explore crystallization parameter space
- Education: Learn about protein crystallization
Background
Protein crystallization is essential for determining 3D protein structures via X-ray crystallography. This tool uses machine learning trained on historical crystallization data from the Protein Data Bank (PDB) to predict optimal conditions.
Input Parameters Explained
- Crystallization Method: Technique used (vapor diffusion, batch, etc.)
- Temperature: Affects protein stability and crystal growth (typically 277-298 K)
- pH: Critical for protein solubility and crystal formation (0-14 scale)
- Matthews Coefficient: Ratio of unit cell volume to the total protein molecular weight in the cell (Å³/Da)
- Solvent Content: Percentage of solvent in crystal lattice (typically 30-70%)
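
The last two parameters are linked. As a short sketch using the conventional definitions (not code from this app): the Matthews coefficient is V_M = V_cell / (Z × MW), and the solvent fraction is commonly estimated as 1 − 1.23 / V_M, where the constant 1.23 assumes a typical protein density. The function names and the example crystal below are hypothetical.

```python
def matthews_coefficient(cell_volume_a3, molecular_weight_da, molecules_per_cell):
    """V_M in Å³/Da: unit cell volume divided by total protein mass in the cell."""
    return cell_volume_a3 / (molecular_weight_da * molecules_per_cell)


def solvent_content_percent(vm):
    """Estimated solvent content (%) from V_M, using the 1.23 Å³/Da constant."""
    return 100.0 * (1.0 - 1.23 / vm)


# Hypothetical crystal: 240,000 Å³ cell, 25 kDa protein, 4 molecules per cell.
vm = matthews_coefficient(240_000.0, 25_000.0, 4)
solvent = solvent_content_percent(vm)
print(f"V_M = {vm:.2f} Å³/Da, solvent content = {solvent:.1f}%")
```

Under this estimate, the typical 30-70% solvent range corresponds to V_M values of roughly 1.8 to 4.1 Å³/Da.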
⚠️ Important Notes
- Validation Required: Always validate predictions experimentally
- Research Tool: For research and educational purposes
- Starting Point: Use predictions as a guide, not absolute truth
- Protein-Specific: Results may vary based on your specific protein
🤝 Contributing
This is a research project. Feedback and suggestions are welcome!
License
MIT License - Free to use for research and educational purposes
Acknowledgments
- Training data derived from Protein Data Bank (PDB)
- Built with Streamlit and ensemble ML models
- Inspired by advances in computational structural biology
Contact & Support
For questions or issues, please open an issue on the repository.
Note: This tool provides predictions based on historical data. Always conduct proper experimental validation. Crystallization is a complex process influenced by many factors not captured by these models alone.