---
title: Crystallization Component Predictor
emoji: 🔬
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: 1.29.0
app_file: app.py
pinned: false
license: mit
---

# 🔬 Crystallization Component Predictor

An interactive machine learning application for predicting optimal protein crystallization components based on experimental parameters.

## 🎯 What Does This App Do?

This tool predicts three critical crystallization parameters:

1. **Component Name**: The chemical compound most likely to produce crystals
2. **Concentration**: The optimal molarity for the component
3. **pH**: The ideal acidity/basicity level for crystallization

## 🚀 Quick Start

1. Select a model (Advanced Baseline recommended)
2. Input your crystallization parameters:
   - Crystallization method
   - Temperature
   - pH
   - Matthews coefficient
   - Solvent content
3. Click "Predict Components"
4. Review predictions and download results

## 📊 Model Performance

| Model | Name Accuracy | Conc R² | pH R² |
|-------|--------------|---------|-------|
| Simple Baseline | 61.12% | N/A | 95.58% |
| **Advanced Baseline** ⭐ | **64.18%** | **47.33%** | **99.34%** |
| Transformer | 53.85% | 18.72% | 99.27% |

**Recommended:** Advanced Baseline for best overall performance

## 🔬 Features

- **Two Model Approaches**: Choose between Simple and Advanced Baseline
- **Interactive UI**: Easy-to-use sliders and dropdowns
- **Top-5 Predictions**: View confidence scores for multiple candidates
- **Visual pH Scale**: Intuitive pH visualization
- **Downloadable Results**: Export predictions as CSV
- **Performance Charts**: Compare model accuracies

## 🛠️ Technical Details

### Simple Baseline

- Random Forest for component classification
- XGBoost for pH regression
- 4 numerical features + TF-IDF of crystallization method

### Advanced Baseline (Recommended)

- Ensemble of Random Forest, XGBoost, LightGBM, and CatBoost
- 8 engineered features including interaction terms
- Separate models for name, concentration, and pH
- Log-transformed
concentration predictions

### Models Included

- `simple_baseline/`: Simple baseline models
  - `model_component_name.pkl`: Component classifier
  - `model_component_ph.pkl`: pH regressor
  - `label_encoder_name.pkl`: Label encoder
  - `scaler.pkl`: Feature scaler
  - `tfidf.pkl`: TF-IDF vectorizer
- `advanced_baseline/`: Advanced baseline models
  - `model_component_name.pkl`: Enhanced component classifier
  - `model_component_conc.pkl`: Concentration regressor
  - `model_component_ph.pkl`: Enhanced pH regressor
  - `label_encoder_name.pkl`: Label encoder
  - `scaler.pkl`: Feature scaler
  - `tfidf.pkl`: TF-IDF vectorizer

## 📦 Dependencies

- Python 3.9+
- Streamlit
- Scikit-learn
- XGBoost
- LightGBM
- CatBoost
- Pandas
- NumPy
- Joblib

## 🎓 Use Cases

- **Structural Biology**: Plan crystallization experiments
- **Drug Discovery**: Optimize protein crystal conditions
- **Research**: Explore crystallization parameter space
- **Education**: Learn about protein crystallization

## 📖 Background

Protein crystallization is essential for determining 3D protein structures via X-ray crystallography. This tool uses machine learning models trained on historical crystallization data from the Protein Data Bank (PDB) to predict optimal conditions.

### Input Parameters Explained

- **Crystallization Method**: Technique used (vapor diffusion, batch, etc.)
- **Temperature**: Affects protein stability and crystal growth (typically 277-298 K)
- **pH**: Critical for protein solubility and crystal formation (0-14 scale)
- **Matthews Coefficient**: Crystal volume per unit of protein molecular weight (Å³/Da)
- **Solvent Content**: Percentage of the crystal volume occupied by solvent (typically 30-70%)

## ⚠️ Important Notes

- **Validation Required**: Always validate predictions experimentally
- **Research Tool**: For research and educational purposes
- **Starting Point**: Use predictions as a guide, not absolute truth
- **Protein-Specific**: Results may vary based on your specific protein

## 🤝 Contributing

This is a research project.
Feedback and suggestions are welcome!

## 📄 License

MIT License - Free to use for research and educational purposes

## 🙏 Acknowledgments

- Training data derived from Protein Data Bank (PDB)
- Built with Streamlit and ensemble ML models
- Inspired by advances in computational structural biology

## 📞 Contact & Support

For questions or issues, please open an issue on the repository.

---

**Note**: This tool provides predictions based on historical data. Always conduct proper experimental validation. Crystallization is a complex process influenced by many factors not captured by these models alone.
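
### Appendix: Estimating Matthews Coefficient and Solvent Content

If you do not already have values for the Matthews coefficient and solvent content inputs, they can be estimated from unit cell data. The sketch below uses the standard Matthews relations and is not taken from this app's code: the function names are hypothetical, and the 1.23 constant assumes a typical protein partial specific volume of about 0.74 cm³/g.

```python
def matthews_coefficient(cell_volume_a3, n_asymmetric_units,
                         molecules_per_asu, molecular_weight_da):
    """Return V_M in Å³/Da: unit cell volume divided by the total
    protein mass in the cell (hypothetical helper, not part of the app)."""
    n_molecules = n_asymmetric_units * molecules_per_asu
    return cell_volume_a3 / (n_molecules * molecular_weight_da)


def solvent_fraction(vm):
    """Approximate solvent fraction (0-1) from V_M.

    Assumes a protein partial specific volume of ~0.74 cm³/g,
    which gives the commonly used constant 1.23.
    """
    return 1.0 - 1.23 / vm


if __name__ == "__main__":
    # Illustrative numbers only: a 20 kDa protein, 4 asymmetric units
    # per cell, one molecule each, in a 200,000 Å³ unit cell.
    vm = matthews_coefficient(200_000.0, 4, 1, 20_000.0)  # 2.5 Å³/Da
    print(f"Matthews coefficient: {vm:.2f} Å³/Da")
    print(f"Solvent content: {solvent_fraction(vm):.1%}")
```

Multiply the solvent fraction by 100 to get the percentage the app expects, and treat the result as an estimate, since the number of molecules per asymmetric unit is often not known in advance.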