---
title: Crystallization Component Predictor
emoji: 🔬
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: 1.29.0
app_file: app.py
pinned: false
license: mit
---

# 🔬 Crystallization Component Predictor

An interactive machine learning application for predicting optimal protein crystallization components based on experimental parameters.

## 🎯 What Does This App Do?

This tool predicts three critical crystallization parameters:
1. **Component Name**: The chemical compound most likely to produce crystals
2. **Concentration**: The optimal molarity for the component
3. **pH**: The ideal acidity/basicity level for crystallization

## 🚀 Quick Start

1. Select a model (Advanced Baseline recommended)
2. Input your crystallization parameters:
   - Crystallization method
   - Temperature
   - pH
   - Matthews coefficient
   - Solvent content
3. Click "Predict Components"
4. Review predictions and download results

## 📊 Model Performance

| Model | Component Name Accuracy | Concentration R² | pH R² |
|-------|-------------------------|------------------|-------|
| Simple Baseline | 61.12% | N/A | 95.58% |
| **Advanced Baseline** ⭐ | **64.18%** | **47.33%** | **99.34%** |
| Transformer | 53.85% | 18.72% | 99.27% |

**Recommended:** Advanced Baseline for best overall performance
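
The R² columns appear to report the coefficient of determination scaled to a percentage, so 47.33% would correspond to R² = 0.4733. That reading is an assumption, since the evaluation code is not part of this README. A minimal pure-Python sketch of the standard definition, using made-up pH values:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the observations around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


# Illustrative pH values only; these are not taken from the app's dataset.
observed = [6.5, 7.0, 7.5, 8.0]
predicted = [6.6, 6.9, 7.6, 7.9]
print(f"R² = {r2_score(observed, predicted):.4f}")
```

A value near 1 means the predictions explain most of the variance in the observed values; a value near 0 means they do little better than always predicting the mean.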

## 🔬 Features

- **Two Selectable Models**: Choose between Simple and Advanced Baseline (the Transformer in the performance table is not among the bundled models)
- **Interactive UI**: Easy-to-use sliders and dropdowns
- **Top-5 Predictions**: View confidence scores for multiple candidates
- **Visual pH Scale**: Intuitive pH visualization
- **Downloadable Results**: Export predictions as CSV
- **Performance Charts**: Compare model accuracies
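
The Top-5 and CSV export features can be sketched roughly as below. This is not the app's actual code: it assumes the classifier exposes one probability per class (as scikit-learn's `predict_proba` does), and the helper names `top_k_predictions` and `predictions_to_csv`, the component names, and the probabilities are all hypothetical.

```python
import csv
import io


def top_k_predictions(class_names, probabilities, k=5):
    """Return the k (name, probability) pairs with the highest probability."""
    if len(class_names) != len(probabilities):
        raise ValueError("class_names and probabilities must have equal length")
    ranked = sorted(
        zip(class_names, probabilities), key=lambda pair: pair[1], reverse=True
    )
    return ranked[:k]


def predictions_to_csv(ranked):
    """Serialize ranked predictions to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["rank", "component", "confidence"])
    for rank, (name, prob) in enumerate(ranked, start=1):
        writer.writerow([rank, name, f"{prob:.4f}"])
    return buffer.getvalue()


# Hypothetical class names and probabilities, for illustration only.
names = ["PEG 3350", "ammonium sulfate", "sodium chloride", "PEG 4000", "MPD", "glycerol"]
probs = [0.31, 0.22, 0.05, 0.18, 0.14, 0.10]
top5 = top_k_predictions(names, probs)
print(predictions_to_csv(top5))
```

In a Streamlit app, a string like the one returned by `predictions_to_csv` could be passed to `st.download_button` to offer the file for download.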

## 🛠️ Technical Details

### Simple Baseline
- Random Forest for component classification
- XGBoost for pH regression
- 4 numerical features + TF-IDF of crystallization method

### Advanced Baseline (Recommended)
- Ensemble of Random Forest, XGBoost, LightGBM, and CatBoost
- 8 engineered features including interaction terms
- Separate models for name, concentration, and pH
- Log-transformed concentration predictions
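
Log-transforming a regression target is a common way to handle values that span several orders of magnitude, as concentrations often do. The README does not say which transform the Advanced Baseline uses; the sketch below assumes `log(1 + x)` and its inverse purely for illustration, and the function names are hypothetical.

```python
import math


def to_log_space(concentration_molar):
    """Map a non-negative concentration to log space using log(1 + x)."""
    if concentration_molar < 0:
        raise ValueError("concentration must be non-negative")
    return math.log1p(concentration_molar)


def from_log_space(log_value):
    """Invert to_log_space: exp(y) - 1 recovers the original concentration."""
    return math.expm1(log_value)


# Illustrative concentrations in mol/L; not values from the training data.
for conc in (0.001, 0.1, 2.0):
    encoded = to_log_space(conc)   # what a regressor would be trained on
    decoded = from_log_space(encoded)  # what would be reported to the user
    print(f"{conc} M -> {encoded:.6f} -> {decoded:.6f} M")
```

Under this assumption, a model trained on the encoded values would have its raw output passed through `from_log_space` before being shown as a molar concentration.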

### Models Included
- `simple_baseline/`: Simple baseline models
  - `model_component_name.pkl`: Component classifier
  - `model_component_ph.pkl`: pH regressor
  - `label_encoder_name.pkl`: Label encoder
  - `scaler.pkl`: Feature scaler
  - `tfidf.pkl`: TF-IDF vectorizer
  
- `advanced_baseline/`: Advanced baseline models
  - `model_component_name.pkl`: Enhanced component classifier
  - `model_component_conc.pkl`: Concentration regressor
  - `model_component_ph.pkl`: Enhanced pH regressor
  - `label_encoder_name.pkl`: Label encoder
  - `scaler.pkl`: Feature scaler
  - `tfidf.pkl`: TF-IDF vectorizer
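
Before launching the app, it can help to confirm that every artifact listed above is present. The sketch below uses only the standard library; the file names come from the listing, while the assumption that both directories sit directly under the repository root, and the helper name `missing_artifacts`, are hypothetical.

```python
from pathlib import Path

# File names are taken from the listing above; the root directory is an assumption.
EXPECTED_ARTIFACTS = {
    "simple_baseline": [
        "model_component_name.pkl",
        "model_component_ph.pkl",
        "label_encoder_name.pkl",
        "scaler.pkl",
        "tfidf.pkl",
    ],
    "advanced_baseline": [
        "model_component_name.pkl",
        "model_component_conc.pkl",
        "model_component_ph.pkl",
        "label_encoder_name.pkl",
        "scaler.pkl",
        "tfidf.pkl",
    ],
}


def missing_artifacts(root):
    """Return relative paths of expected artifact files not found under root."""
    root = Path(root)
    missing = []
    for model_dir, file_names in EXPECTED_ARTIFACTS.items():
        for file_name in file_names:
            if not (root / model_dir / file_name).is_file():
                missing.append(f"{model_dir}/{file_name}")
    return missing


# "." assumes the script is run from the repository root.
for path in missing_artifacts("."):
    print(f"missing: {path}")
```

If the files were saved with Joblib (which is listed among the dependencies), each one could then be loaded with `joblib.load`. Pickle-based files should only be loaded from sources you trust.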

## 📦 Dependencies

- Python 3.9+
- Streamlit
- Scikit-learn
- XGBoost
- LightGBM
- CatBoost
- Pandas
- NumPy
- Joblib

## 🎓 Use Cases

- **Structural Biology**: Plan crystallization experiments
- **Drug Discovery**: Optimize protein crystal conditions
- **Research**: Explore crystallization parameter space
- **Education**: Learn about protein crystallization

## 📖 Background

Protein crystallization is essential for determining 3D protein structures via X-ray crystallography. This tool uses machine learning trained on historical crystallization data from the Protein Data Bank (PDB) to predict optimal conditions.

### Input Parameters Explained

- **Crystallization Method**: Technique used (vapor diffusion, batch, etc.)
- **Temperature**: Affects protein stability and crystal growth (typically 277-298 K)
- **pH**: Critical for protein solubility and crystal formation (0-14 scale)
- **Matthews Coefficient**: Crystal volume per unit of protein molecular weight (Å³/Da)
- **Solvent Content**: Percentage of solvent in crystal lattice (typically 30-70%)
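
The Matthews coefficient and solvent content are closely related. The sketch below uses the standard definition V_M = V_cell / (Z × MW), where Z is the number of protein molecules in the unit cell, and the common approximation that the solvent fraction is about 1 − 1.23 / V_M. The function names and the example crystal are hypothetical, and the app may compute or source these values differently.

```python
def matthews_coefficient(cell_volume_a3, molecules_per_cell, molecular_weight_da):
    """V_M in Å³/Da: unit cell volume divided by total protein mass in the cell."""
    if cell_volume_a3 <= 0 or molecules_per_cell <= 0 or molecular_weight_da <= 0:
        raise ValueError("all inputs must be positive")
    return cell_volume_a3 / (molecules_per_cell * molecular_weight_da)


def solvent_content_percent(v_m):
    """Approximate solvent content (%) from V_M via 1 - 1.23 / V_M."""
    if v_m <= 0:
        raise ValueError("V_M must be positive")
    return 100.0 * (1.0 - 1.23 / v_m)


# Hypothetical crystal: 400,000 Å³ unit cell, 4 molecules of a 40 kDa protein.
v_m = matthews_coefficient(400_000.0, 4, 40_000.0)
print(f"V_M = {v_m:.2f} Å³/Da, solvent ≈ {solvent_content_percent(v_m):.1f}%")
```

Because the two quantities are linked by this approximation, a Matthews coefficient and solvent content entered together should be roughly consistent with each other.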

## ⚠️ Important Notes

- **Validation Required**: Always validate predictions experimentally
- **Research Tool**: For research and educational purposes
- **Starting Point**: Use predictions as a guide, not absolute truth
- **Protein-Specific**: Results may vary based on your specific protein

## 🤝 Contributing

This is a research project. Feedback and suggestions are welcome!

## 📄 License

Released under the MIT License, which permits use, modification, and distribution, including for research and educational purposes.

## 🙏 Acknowledgments

- Training data derived from Protein Data Bank (PDB)
- Built with Streamlit and ensemble ML models
- Inspired by advances in computational structural biology

## 📞 Contact & Support

For questions or issues, please open an issue on the repository.

---

**Note**: This tool provides predictions based on historical data. Always conduct proper experimental validation. Crystallization is a complex process influenced by many factors not captured by these models alone.