---
title: Crystallization Component Predictor
emoji: 🔬
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: 1.29.0
app_file: app.py
pinned: false
license: mit
---
# 🔬 Crystallization Component Predictor
An interactive machine learning application for predicting optimal protein crystallization components based on experimental parameters.
## 🎯 What Does This App Do?
This tool predicts three critical crystallization parameters:
1. **Component Name**: The chemical compound most likely to produce crystals
2. **Concentration**: The optimal molarity for the component
3. **pH**: The ideal acidity/basicity level for crystallization
## 🚀 Quick Start
1. Select a model (Advanced Baseline recommended)
2. Input your crystallization parameters:
   - Crystallization method
   - Temperature
   - pH
   - Matthews coefficient
   - Solvent content
3. Click "Predict Components"
4. Review predictions and download results
## 📊 Model Performance
| Model | Name Accuracy | Concentration R² | pH R² |
|-------|---------------|------------------|-------|
| Simple Baseline | 61.12% | N/A | 95.58% |
| **Advanced Baseline** ⭐ | **64.18%** | **47.33%** | **99.34%** |
| Transformer | 53.85% | 18.72% | 99.27% |
**Recommended:** Advanced Baseline for best overall performance
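
The R² columns report the coefficient of determination as a percentage. As a minimal sketch of how such a score can be computed (the function name `r2_percent` and the sample values are illustrative assumptions, not taken from this project's evaluation code):

```python
def r2_percent(y_true, y_pred):
    """Coefficient of determination, R^2 = 1 - SS_res / SS_tot, as a percentage."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and the same length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the true values around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 100.0 * (1.0 - ss_res / ss_tot)


# Illustrative pH values only; not data from this project.
observed = [6.5, 7.0, 7.5, 8.0]
predicted = [6.6, 6.9, 7.6, 7.9]
score = r2_percent(observed, predicted)
```

A score near 100% means the predictions explain almost all of the variance in the target, which is why the pH figures above sit much closer to 100% than the concentration figures.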
## 🔬 Features
- **Two Model Approaches**: Choose between Simple and Advanced Baseline
- **Interactive UI**: Easy-to-use sliders and dropdowns
- **Top-5 Predictions**: View confidence scores for multiple candidates
- **Visual pH Scale**: Intuitive pH visualization
- **Downloadable Results**: Export predictions as CSV
- **Performance Charts**: Compare model accuracies
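
The Top-5 and CSV export features can be pictured with the sketch below. It is not the app's code: the helper names `top_k` and `to_csv`, the column names, and the component labels and probabilities are all hypothetical placeholders.

```python
import csv
import io


def top_k(probabilities, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def to_csv(rows):
    """Serialize ranked predictions to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["rank", "component", "confidence"])
    for rank, (label, prob) in enumerate(rows, start=1):
        writer.writerow([rank, label, f"{prob:.4f}"])
    return buffer.getvalue()


# Hypothetical class probabilities; labels and values are placeholders.
probs = {
    "PEG 3350": 0.41,
    "ammonium sulfate": 0.22,
    "sodium chloride": 0.15,
    "MPD": 0.09,
    "PEG 4000": 0.08,
    "glycerol": 0.05,
}
best = top_k(probs)
csv_text = to_csv(best)
```

In a Streamlit app, text like `csv_text` would typically be handed to a download widget; how this app wires that up is not documented here.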
## 🛠️ Technical Details
### Simple Baseline
- Random Forest for component classification
- XGBoost for pH regression
- 4 numerical features + TF-IDF of crystallization method
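
As a rough sketch of how numerical features and a TF-IDF encoding of the method text can be combined for a Random Forest classifier: the snippet below assumes the four numerical features are temperature, pH, Matthews coefficient, and solvent content, and that the numeric block is standardized before concatenation. The toy rows and labels are invented for illustration, and the actual training code may differ.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy rows for illustration only: [temperature K, pH, Matthews coeff, solvent %].
numeric = np.array([
    [293.0, 7.5, 2.4, 48.0],
    [277.0, 6.0, 2.1, 41.0],
    [298.0, 8.0, 3.0, 59.0],
    [291.0, 5.5, 2.6, 52.0],
])
methods = [
    "vapor diffusion, hanging drop",
    "vapor diffusion, sitting drop",
    "batch",
    "vapor diffusion, hanging drop",
]
labels = ["PEG 3350", "ammonium sulfate", "sodium chloride", "PEG 3350"]

scaler = StandardScaler()
tfidf = TfidfVectorizer()

# Scale the numeric block, vectorize the method text, then join column-wise.
X = hstack([
    csr_matrix(scaler.fit_transform(numeric)),
    tfidf.fit_transform(methods),
]).tocsr()

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X, labels)

# New samples must pass through the same fitted scaler and vectorizer.
query = hstack([
    csr_matrix(scaler.transform([[293.0, 7.0, 2.3, 46.0]])),
    tfidf.transform(["vapor diffusion, sitting drop"]),
]).tocsr()
prediction = clf.predict(query)[0]
```

Reusing the fitted `scaler` and `tfidf` at prediction time is what makes the saved `scaler.pkl` and `tfidf.pkl` files necessary alongside the model itself.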
### Advanced Baseline (Recommended)
- Ensemble of Random Forest, XGBoost, LightGBM, and CatBoost
- 8 engineered features including interaction terms
- Separate models for name, concentration, and pH
- Log-transformed concentration predictions
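
Concentrations can span several orders of magnitude, which is a common reason to train a regressor on a log scale and invert the transform at prediction time. The sketch below assumes a `log(1 + x)` transform and a plain average of per-model outputs; neither detail is confirmed by this README, and the helper names are hypothetical.

```python
import math


def to_log_space(concentration_molar):
    """Map a non-negative concentration to log space using log(1 + x)."""
    if concentration_molar < 0:
        raise ValueError("concentration must be non-negative")
    return math.log1p(concentration_molar)


def from_log_space(log_value):
    """Invert to_log_space, returning a concentration in the original units."""
    return math.expm1(log_value)


def ensemble_concentration(log_predictions):
    """Average several models' log-space predictions, then invert once."""
    mean_log = sum(log_predictions) / len(log_predictions)
    return from_log_space(mean_log)


# Hypothetical per-model outputs, already in log space.
per_model = [to_log_space(c) for c in (0.10, 0.20, 0.15)]
combined = ensemble_concentration(per_model)
```

Averaging in log space and inverting once gives a result closer to a geometric mean than an arithmetic one, so a single model with a very large prediction has less pull on the combined value.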
### Models Included
- `simple_baseline/`: Simple baseline models
  - `model_component_name.pkl`: Component classifier
  - `model_component_ph.pkl`: pH regressor
  - `label_encoder_name.pkl`: Label encoder
  - `scaler.pkl`: Feature scaler
  - `tfidf.pkl`: TF-IDF vectorizer
- `advanced_baseline/`: Advanced baseline models
  - `model_component_name.pkl`: Enhanced component classifier
  - `model_component_conc.pkl`: Concentration regressor
  - `model_component_ph.pkl`: Enhanced pH regressor
  - `label_encoder_name.pkl`: Label encoder
  - `scaler.pkl`: Feature scaler
  - `tfidf.pkl`: TF-IDF vectorizer
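
One way to load these files is sketched below. It assumes the `.pkl` files were written with Joblib (listed under Dependencies, though the README does not say how the models were saved) and that the directories sit next to `app.py`. The helper names and dictionary keys are hypothetical.

```python
from pathlib import Path

# File names as listed above; the concentration regressor exists only
# in advanced_baseline/.
COMMON_FILES = {
    "name_model": "model_component_name.pkl",
    "ph_model": "model_component_ph.pkl",
    "label_encoder": "label_encoder_name.pkl",
    "scaler": "scaler.pkl",
    "tfidf": "tfidf.pkl",
}
ADVANCED_ONLY = {"conc_model": "model_component_conc.pkl"}


def artifact_paths(model_dir):
    """Return a mapping of artifact keys to paths inside model_dir."""
    model_dir = Path(model_dir)
    files = dict(COMMON_FILES)
    if model_dir.name == "advanced_baseline":
        files.update(ADVANCED_ONLY)
    return {key: model_dir / name for key, name in files.items()}


def load_artifacts(model_dir):
    """Load every artifact with joblib; raises if a file is missing."""
    import joblib  # imported lazily so path handling works without it

    return {key: joblib.load(path) for key, path in artifact_paths(model_dir).items()}


paths = artifact_paths("advanced_baseline")
```

Calling `load_artifacts("advanced_baseline")` would then return the fitted objects keyed by role. Pickle-based files execute code on load, so they should only be loaded from a trusted source.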
## 📦 Dependencies
- Python 3.9+
- Streamlit
- Scikit-learn
- XGBoost
- LightGBM
- CatBoost
- Pandas
- NumPy
- Joblib
## 🎓 Use Cases
- **Structural Biology**: Plan crystallization experiments
- **Drug Discovery**: Optimize protein crystal conditions
- **Research**: Explore crystallization parameter space
- **Education**: Learn about protein crystallization
## 📖 Background
Protein crystallization is essential for determining 3D protein structures via X-ray crystallography. This tool uses machine learning models trained on historical crystallization data from the Protein Data Bank (PDB) to predict likely crystallization conditions.
### Input Parameters Explained
- **Crystallization Method**: Technique used (vapor diffusion, batch, etc.)
- **Temperature**: Affects protein stability and crystal growth (typically 277-298 K, roughly 4-25 °C)
- **pH**: Critical for protein solubility and crystal formation (0-14 scale)
- **Matthews Coefficient**: Ratio of unit cell volume to the total protein molecular weight in the cell (Å³/Da)
- **Solvent Content**: Percentage of solvent in crystal lattice (typically 30-70%)
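
The last two parameters are related. The Matthews coefficient is V_M = V_cell / (Z × MW), where Z is the number of protein molecules in the unit cell, and solvent content is commonly estimated as 1 − 1.23 / V_M. The sketch below applies both formulas. The function names and example numbers are illustrative, and the constant 1.23 assumes a typical protein partial specific volume of about 0.74 cm³/g.

```python
def matthews_coefficient(cell_volume_A3, molecular_weight_Da, molecules_per_cell):
    """V_M = V_cell / (Z * MW), in cubic angstroms per dalton."""
    if molecular_weight_Da <= 0 or molecules_per_cell <= 0:
        raise ValueError("molecular weight and molecule count must be positive")
    return cell_volume_A3 / (molecules_per_cell * molecular_weight_Da)


def solvent_content_percent(v_m):
    """Approximate solvent content (%) from V_M using 1 - 1.23 / V_M."""
    if v_m <= 0:
        raise ValueError("Matthews coefficient must be positive")
    return 100.0 * (1.0 - 1.23 / v_m)


# Illustrative crystal: 240,000 cubic-angstrom cell, 25 kDa protein, Z = 4.
v_m = matthews_coefficient(240_000.0, 25_000.0, 4)
solvent = solvent_content_percent(v_m)
```

For this example V_M is 2.4 Å³/Da and the estimated solvent content is 48.75%, which falls inside the typical 30-70% range quoted above.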
## ⚠️ Important Notes
- **Validation Required**: Always validate predictions experimentally
- **Research Tool**: For research and educational purposes
- **Starting Point**: Use predictions as a guide, not absolute truth
- **Protein-Specific**: Results may vary based on your specific protein
## 🀝 Contributing
This is a research project. Feedback and suggestions are welcome!
## 📄 License
MIT License. Free to use, modify, and distribute under its terms, including for research and educational purposes.
## 🙏 Acknowledgments
- Training data derived from Protein Data Bank (PDB)
- Built with Streamlit and ensemble ML models
- Inspired by advances in computational structural biology
## 📞 Contact & Support
For questions or issues, please open an issue on the repository.
---
**Note**: This tool provides predictions based on historical data. Always conduct proper experimental validation. Crystallization is a complex process influenced by many factors not captured by these models alone.