| title: Crystallization Component Predictor | |
| emoji: 🔬 | |
| colorFrom: blue | |
| colorTo: purple | |
| sdk: streamlit | |
| sdk_version: 1.29.0 | |
| app_file: app.py | |
| pinned: false | |
| license: mit | |
| # 🔬 Crystallization Component Predictor | |
| An interactive machine learning application for predicting optimal protein crystallization components based on experimental parameters. | |
| ## 🎯 What Does This App Do? | |
| This tool predicts three critical crystallization parameters: | |
| 1. **Component Name**: The chemical compound most likely to produce crystals | |
| 2. **Concentration**: The optimal molarity for the component | |
| 3. **pH**: The ideal acidity/basicity level for crystallization | |
| ## 🚀 Quick Start | |
| 1. Select a model (Advanced Baseline recommended) | |
| 2. Input your crystallization parameters: | |
|    - Crystallization method | |
|    - Temperature | |
|    - pH | |
|    - Matthews coefficient | |
|    - Solvent content | |
| 3. Click "Predict Components" | |
| 4. Review predictions and download results | |
| ## 📊 Model Performance | |
| | Model | Name Accuracy | Conc R² | pH R² | |
| |-------|--------------|---------|-------| | |
| | Simple Baseline | 61.12% | N/A | 95.58% | | |
| | **Advanced Baseline** ⭐ | **64.18%** | **47.33%** | **99.34%** | |
| | Transformer | 53.85% | 18.72% | 99.27% | | |
| **Recommended:** Advanced Baseline for best overall performance | |
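
For readers unfamiliar with the two metrics in the table, here is a minimal sketch of how classification accuracy and R² are conventionally defined. It assumes the table reports R² as a percentage (so 47.33% would correspond to R² = 0.4733); the function names are illustrative and not taken from the app's code.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up values, for illustration only.
name_accuracy = accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"])
perfect_fit = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only_fit = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

A model that always predicts the mean of the targets scores R² = 0, which gives a useful baseline when reading the concentration column.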
| ## 🔬 Features | |
| - **Two Model Approaches**: Choose between Simple and Advanced Baseline | |
| - **Interactive UI**: Easy-to-use sliders and dropdowns | |
| - **Top-5 Predictions**: View confidence scores for multiple candidates | |
| - **Visual pH Scale**: Intuitive pH visualization | |
| - **Downloadable Results**: Export predictions as CSV | |
| - **Performance Charts**: Compare model accuracies | |
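
The top-5 and CSV export features can be sketched roughly as follows. This is an illustration using only the standard library, not the app's actual code: `top_k` and `to_csv` are hypothetical helpers, and the component names and probabilities are made up.

```python
import csv
import io


def top_k(probabilities, class_names, k=5):
    """Return the k most probable (name, probability) pairs, best first."""
    ranked = sorted(
        zip(class_names, probabilities),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


def to_csv(rows):
    """Serialize ranked (name, probability) pairs to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["rank", "component", "confidence"])
    for rank, (name, prob) in enumerate(rows, start=1):
        writer.writerow([rank, name, f"{prob:.4f}"])
    return buffer.getvalue()


# Made-up class probabilities, for illustration only.
class_names = ["PEG 3350", "AMMONIUM SULFATE", "SODIUM CHLORIDE",
               "PEG 4000", "MPD", "TRIS", "HEPES"]
probabilities = [0.31, 0.22, 0.05, 0.18, 0.09, 0.08, 0.07]

top5 = top_k(probabilities, class_names)
csv_text = to_csv(top5)
```

In a Streamlit app, a string like `csv_text` could be handed to `st.download_button` to offer the file to the user; whether this app does so is an assumption.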
| ## 🛠️ Technical Details | |
| ### Simple Baseline | |
| - Random Forest for component classification | |
| - XGBoost for pH regression | |
| - 4 numerical features + TF-IDF of crystallization method | |
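
A minimal sketch of how scaled numerical features and TF-IDF text features can be combined for a Random Forest classifier. It assumes the four numerical features are temperature, pH, Matthews coefficient, and solvent content (the numeric inputs listed in Quick Start); the README does not confirm this, and the rows below are made up. The XGBoost pH regressor is not shown.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy rows, made up for illustration only.
methods = [
    "VAPOR DIFFUSION, HANGING DROP",
    "VAPOR DIFFUSION, SITTING DROP",
    "BATCH",
    "VAPOR DIFFUSION, HANGING DROP",
    "MICROBATCH",
    "VAPOR DIFFUSION, SITTING DROP",
]
# Assumed column order: temperature (K), pH, Matthews coefficient, solvent content (%)
numeric = np.array([
    [293.0, 7.5, 2.4, 48.0],
    [277.0, 6.0, 2.9, 57.0],
    [291.0, 8.0, 2.1, 41.0],
    [298.0, 5.5, 3.2, 61.0],
    [293.0, 7.0, 2.6, 52.0],
    [289.0, 6.5, 2.2, 44.0],
])
component_names = ["PEG 3350", "AMMONIUM SULFATE", "PEG 3350",
                   "SODIUM CHLORIDE", "PEG 3350", "AMMONIUM SULFATE"]

scaler = StandardScaler()
tfidf = TfidfVectorizer()

# Scale the numeric columns, vectorize the method text, then stack side by side.
X_numeric = csr_matrix(scaler.fit_transform(numeric))
X_text = tfidf.fit_transform(methods)
X = hstack([X_numeric, X_text]).tocsr()

classifier = RandomForestClassifier(n_estimators=50, random_state=0)
classifier.fit(X, component_names)

# Apply the already-fitted scaler and vectorizer to a new sample.
new_X = hstack([
    csr_matrix(scaler.transform([[293.0, 7.2, 2.5, 50.0]])),
    tfidf.transform(["VAPOR DIFFUSION, HANGING DROP"]),
]).tocsr()
prediction = classifier.predict(new_X)[0]
```

Reusing the fitted `scaler` and `tfidf` at prediction time is what makes it necessary to ship them alongside the model files.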
| ### Advanced Baseline (Recommended) | |
| - Ensemble of Random Forest, XGBoost, LightGBM, and CatBoost | |
| - 8 engineered features including interaction terms | |
| - Separate models for name, concentration, and pH | |
| - Log-transformed concentration predictions | |
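
The log-transformed concentration target and the interaction terms can be illustrated with a short sketch. Several things here are assumptions: a single scikit-learn Random Forest stands in for the four-model ensemble, `log1p`/`expm1` is one common choice of log transform, the two interaction terms are hypothetical, and all data is randomly generated.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy inputs (made up): temperature (K), pH, Matthews coefficient, solvent content (%)
X_raw = np.column_stack([
    rng.uniform(277, 298, 60),
    rng.uniform(4, 9, 60),
    rng.uniform(1.8, 3.5, 60),
    rng.uniform(30, 70, 60),
])


def add_interactions(X):
    """Append two hypothetical interaction terms to the raw features."""
    temp, ph, vm, solvent = X.T
    return np.column_stack([X, temp * ph, vm * solvent])


# Concentrations are positive and right-skewed, which motivates a log transform.
y_conc = rng.lognormal(mean=-1.0, sigma=1.0, size=60)

# Fit on log1p(y); predictions are mapped back to molar units with expm1.
model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=50, random_state=0),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(add_interactions(X_raw), y_conc)
pred = model.predict(add_interactions(X_raw[:5]))
```

Training in log space keeps a few very high concentrations from dominating the squared-error loss, and the inverse transform guarantees non-negative predictions here.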
| ### Models Included | |
| - `simple_baseline/`: Simple baseline models | |
|   - `model_component_name.pkl`: Component classifier | |
|   - `model_component_ph.pkl`: pH regressor | |
|   - `label_encoder_name.pkl`: Label encoder | |
|   - `scaler.pkl`: Feature scaler | |
|   - `tfidf.pkl`: TF-IDF vectorizer | |
| - `advanced_baseline/`: Advanced baseline models | |
|   - `model_component_name.pkl`: Enhanced component classifier | |
|   - `model_component_conc.pkl`: Concentration regressor | |
|   - `model_component_ph.pkl`: Enhanced pH regressor | |
|   - `label_encoder_name.pkl`: Label encoder | |
|   - `scaler.pkl`: Feature scaler | |
|   - `tfidf.pkl`: TF-IDF vectorizer | |
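
One way to load a model directory laid out as above is sketched below. It assumes the `.pkl` files were written with Joblib (listed under Dependencies) and can be read with `joblib.load`; `load_artifacts` and `ARTIFACT_FILES` are hypothetical names, and the demo writes placeholder objects to a temporary directory instead of real models.

```python
import tempfile
from pathlib import Path

import joblib

# File names as listed in the README, grouped by model directory.
ARTIFACT_FILES = {
    "simple_baseline": [
        "model_component_name.pkl", "model_component_ph.pkl",
        "label_encoder_name.pkl", "scaler.pkl", "tfidf.pkl",
    ],
    "advanced_baseline": [
        "model_component_name.pkl", "model_component_conc.pkl",
        "model_component_ph.pkl", "label_encoder_name.pkl",
        "scaler.pkl", "tfidf.pkl",
    ],
}


def load_artifacts(root, model_name):
    """Load every expected artifact for one model, keyed by file stem."""
    model_dir = Path(root) / model_name
    artifacts = {}
    for filename in ARTIFACT_FILES[model_name]:
        path = model_dir / filename
        if not path.exists():
            raise FileNotFoundError(f"Missing artifact: {path}")
        artifacts[path.stem] = joblib.load(path)
    return artifacts


# Demo with placeholder objects standing in for the real models.
with tempfile.TemporaryDirectory() as tmp:
    for name, files in ARTIFACT_FILES.items():
        (Path(tmp) / name).mkdir()
        for filename in files:
            joblib.dump({"placeholder": filename}, Path(tmp) / name / filename)
    loaded = load_artifacts(tmp, "advanced_baseline")
```

Failing early on a missing file gives a clearer error than a later `KeyError` at prediction time. Pickle-based files should only be loaded from sources you trust.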
| ## 📦 Dependencies | |
| - Python 3.9+ | |
| - Streamlit | |
| - Scikit-learn | |
| - XGBoost | |
| - LightGBM | |
| - CatBoost | |
| - Pandas | |
| - NumPy | |
| - Joblib | |
| ## 🎓 Use Cases | |
| - **Structural Biology**: Plan crystallization experiments | |
| - **Drug Discovery**: Optimize protein crystal conditions | |
| - **Research**: Explore crystallization parameter space | |
| - **Education**: Learn about protein crystallization | |
| ## 📚 Background | |
| Protein crystallization is essential for determining 3D protein structures via X-ray crystallography. This tool uses machine learning trained on historical crystallization data from the Protein Data Bank (PDB) to predict optimal conditions. | |
| ### Input Parameters Explained | |
| - **Crystallization Method**: Technique used (vapor diffusion, batch, etc.) | |
| - **Temperature**: Affects protein stability and crystal growth (typically 277-298 K) | |
| - **pH**: Critical for protein solubility and crystal formation (0-14 scale) | |
| - **Matthews Coefficient**: Crystal volume per unit of protein molecular weight (Å³/Da) | |
| - **Solvent Content**: Percentage of solvent in crystal lattice (typically 30-70%) | |
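
The Matthews coefficient and solvent content in the list above are related by standard crystallographic formulas, sketched here. The function and parameter names are illustrative, the example numbers are made up, and the constant 1.23 assumes a typical protein partial specific volume of about 0.74 cm³/g.

```python
def matthews_coefficient(cell_volume_a3, molecular_weight_da, molecules_per_cell):
    """V_M = unit cell volume / (molecular weight x molecules in the cell), in Å³/Da."""
    return cell_volume_a3 / (molecular_weight_da * molecules_per_cell)


def solvent_fraction(v_m):
    """Approximate solvent fraction of the crystal: 1 - 1.23 / V_M."""
    return 1.0 - 1.23 / v_m


# Made-up example: 240,000 Å³ cell holding 4 copies of a 25 kDa protein.
v_m = matthews_coefficient(240_000.0, 25_000.0, 4)
solvent = solvent_fraction(v_m)
```

With these numbers V_M comes out at 2.4 Å³/Da and the solvent fraction at roughly 0.49, inside the typical 30-70% range quoted above.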
| ## ⚠️ Important Notes | |
| - **Validation Required**: Always validate predictions experimentally | |
| - **Research Tool**: For research and educational purposes | |
| - **Starting Point**: Use predictions as a guide, not absolute truth | |
| - **Protein-Specific**: Results may vary based on your specific protein | |
| ## 🤝 Contributing | |
| This is a research project. Feedback and suggestions are welcome! | |
| ## 📄 License | |
| MIT License - Free to use for research and educational purposes | |
| ## 🙏 Acknowledgments | |
| - Training data derived from Protein Data Bank (PDB) | |
| - Built with Streamlit and ensemble ML models | |
| - Inspired by advances in computational structural biology | |
| ## 📞 Contact & Support | |
| For questions or issues, please open an issue on the repository. | |
| --- | |
| **Note**: This tool provides predictions based on historical data. Always conduct proper experimental validation. Crystallization is a complex process influenced by many factors not captured by these models alone. | |