metadata
title: Crystallization Component Predictor
emoji: 🔬
colorFrom: blue
colorTo: purple
sdk: streamlit
sdk_version: 1.29.0
app_file: app.py
pinned: false
license: mit

🔬 Crystallization Component Predictor

An interactive machine learning application for predicting optimal protein crystallization components based on experimental parameters.

🎯 What Does This App Do?

This tool predicts three critical crystallization parameters:

  1. Component Name: The chemical compound most likely to produce crystals
  2. Concentration: The optimal molarity for the component
  3. pH: The ideal acidity/basicity level for crystallization

🚀 Quick Start

  1. Select a model (Advanced Baseline recommended)
  2. Input your crystallization parameters:
    • Crystallization method
    • Temperature
    • pH
    • Matthews coefficient
    • Solvent content
  3. Click "Predict Components"
  4. Review predictions and download results

📊 Model Performance

| Model | Name Accuracy | Conc R² | pH R² |
| --- | --- | --- | --- |
| Simple Baseline | 61.12% | N/A | 95.58% |
| Advanced Baseline ⭐ | 64.18% | 47.33% | 99.34% |
| Transformer | 53.85% | 18.72% | 99.27% |

Recommended: Advanced Baseline, which scores highest on all three metrics above.
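This README does not say how the metrics were computed, so the following is only a minimal sketch of the standard definitions: accuracy as the fraction of exact label matches, and R² as 1 - SS_res / SS_tot (reported above as a percentage). The function names and the sample values are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


# Made-up labels and pH values, for illustration only.
names_true = ["PEG 3350", "ammonium sulfate", "sodium chloride", "PEG 3350"]
names_pred = ["PEG 3350", "sodium chloride", "sodium chloride", "PEG 3350"]
ph_true = [6.5, 7.0, 8.0, 5.5]
ph_pred = [6.4, 7.1, 7.9, 5.6]

name_accuracy = accuracy(names_true, names_pred)
ph_r2 = r_squared(ph_true, ph_pred)
print(f"name accuracy: {name_accuracy:.2%}, pH R^2: {ph_r2:.2%}")
```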

🔬 Features

  • Two Model Approaches: Choose between Simple and Advanced Baseline
  • Interactive UI: Easy-to-use sliders and dropdowns
  • Top-5 Predictions: View confidence scores for multiple candidates
  • Visual pH Scale: Intuitive pH visualization
  • Downloadable Results: Export predictions as CSV
  • Performance Charts: Compare model accuracies
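As a rough illustration of the Top-5 and CSV export features, the sketch below ranks class probabilities and serializes the top entries. It is not the app's actual code: the helper names, the column names, and the probabilities are all assumptions made for the example.

```python
import csv
import io


def top_k(class_probs, k=5):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(class_probs.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


def to_csv(rows):
    """Serialize ranked predictions to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["rank", "component", "confidence"])  # hypothetical columns
    for rank, (label, prob) in enumerate(rows, start=1):
        writer.writerow([rank, label, f"{prob:.4f}"])
    return buffer.getvalue()


# Made-up probabilities, for illustration only.
example_probs = {
    "PEG 3350": 0.41,
    "ammonium sulfate": 0.22,
    "sodium chloride": 0.14,
    "PEG 4000": 0.11,
    "MPD": 0.08,
    "lithium sulfate": 0.04,
}

top5 = top_k(example_probs)
csv_text = to_csv(top5)
print(csv_text)
```

In a Streamlit app, text like `csv_text` could be handed to `st.download_button` to offer the file for download.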

🛠️ Technical Details

Simple Baseline

  • Random Forest for component classification
  • XGBoost for pH regression
  • 4 numerical features + TF-IDF of crystallization method
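To make the feature layout concrete, here is a dependency-free sketch of how four scaled numeric columns might be concatenated with a TF-IDF encoding of the method string. It assumes the four numeric features are temperature, pH, Matthews coefficient, and solvent content (the README does not list them), and the tokenizer and sample rows are invented. The IDF formula follows the smoothed default of scikit-learn's `TfidfVectorizer`; the real app presumably uses that class together with a fitted scaler.

```python
import math
import re
from collections import Counter


def tfidf_vectors(documents):
    """Smoothed TF-IDF with L2 normalization over lowercase word tokens."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(documents)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        vectors.append([v / norm for v in raw])
    return vocab, vectors


def standardize(rows):
    """Scale each numeric column to zero mean and unit variance."""
    n_rows, n_cols = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n_rows for j in range(n_cols)]
    stds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n_rows) or 1.0
        for j in range(n_cols)
    ]
    return [[(r[j] - means[j]) / stds[j] for j in range(n_cols)] for r in rows]


# Made-up rows: temperature (K), pH, Matthews coefficient, solvent content (%).
numeric = [
    [291.0, 6.5, 2.4, 48.0],
    [277.0, 7.5, 2.9, 57.0],
    [298.0, 5.0, 2.1, 41.0],
]
methods = [
    "VAPOR DIFFUSION, HANGING DROP",
    "VAPOR DIFFUSION, SITTING DROP",
    "BATCH MODE",
]

vocab, text_vectors = tfidf_vectors(methods)
scaled = standardize(numeric)
# Final feature row = 4 scaled numeric values followed by the TF-IDF vector.
features = [num + vec for num, vec in zip(scaled, text_vectors)]
print(len(features[0]), "features per sample")
```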

Advanced Baseline (Recommended)

  • Ensemble of Random Forest, XGBoost, LightGBM, and CatBoost
  • 8 engineered features including interaction terms
  • Separate models for name, concentration, and pH
  • Log-transformed concentration predictions
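The README does not describe how the four learners are combined or which log transform is applied to concentration. The sketch below shows one common setup under those stated assumptions: soft voting (averaging class probabilities across models) and a `log(1 + x)` transform that is inverted after prediction. All names and numbers are hypothetical.

```python
import math


def average_probabilities(per_model_probs):
    """Soft voting: average each class's probability across models."""
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    return [
        sum(probs[c] for probs in per_model_probs) / n_models
        for c in range(n_classes)
    ]


def to_log_scale(concentration_molar):
    """Map a non-negative concentration to log scale with log(1 + x)."""
    return math.log1p(concentration_molar)


def from_log_scale(log_value):
    """Invert to_log_scale with exp(x) - 1."""
    return math.expm1(log_value)


# Made-up probabilities from four models over three classes.
per_model = [
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
]
ensemble_probs = average_probabilities(per_model)
best_class = max(range(len(ensemble_probs)), key=ensemble_probs.__getitem__)

# A regressor trained on log-scale targets predicts in log scale,
# so its output is mapped back to molar units before display.
log_target = to_log_scale(1.6)
recovered = from_log_scale(log_target)
print(best_class, round(recovered, 6))
```

Training on a log scale is typically chosen when targets span several orders of magnitude, which is plausible for reagent concentrations.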

Models Included

  • simple_baseline/: Simple baseline models

    • model_component_name.pkl: Component classifier
    • model_component_ph.pkl: pH regressor
    • label_encoder_name.pkl: Label encoder
    • scaler.pkl: Feature scaler
    • tfidf.pkl: TF-IDF vectorizer
  • advanced_baseline/: Advanced baseline models

    • model_component_name.pkl: Enhanced component classifier
    • model_component_conc.pkl: Concentration regressor
    • model_component_ph.pkl: Enhanced pH regressor
    • label_encoder_name.pkl: Label encoder
    • scaler.pkl: Feature scaler
    • tfidf.pkl: TF-IDF vectorizer
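A small sketch for checking that the files listed above are present before starting the app. The file names come from this README; the helper functions and the assumption that both folders sit under a common root directory are illustrative.

```python
from pathlib import Path

# File names as listed in this README; the directory layout is assumed.
ARTIFACTS = {
    "simple_baseline": [
        "model_component_name.pkl",
        "model_component_ph.pkl",
        "label_encoder_name.pkl",
        "scaler.pkl",
        "tfidf.pkl",
    ],
    "advanced_baseline": [
        "model_component_name.pkl",
        "model_component_conc.pkl",
        "model_component_ph.pkl",
        "label_encoder_name.pkl",
        "scaler.pkl",
        "tfidf.pkl",
    ],
}


def artifact_paths(model_dir, root="."):
    """Return the expected file paths for one model folder."""
    return [Path(root) / model_dir / name for name in ARTIFACTS[model_dir]]


def missing_artifacts(model_dir, root="."):
    """Return the expected paths that do not exist on disk."""
    return [p for p in artifact_paths(model_dir, root) if not p.is_file()]


for model_dir in ARTIFACTS:
    missing = missing_artifacts(model_dir)
    status = "ok" if not missing else f"missing {len(missing)} file(s)"
    print(f"{model_dir}: {status}")
```

Since Joblib is listed as a dependency, each existing path can presumably be loaded with `joblib.load(path)`.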

📦 Dependencies

  • Python 3.9+
  • Streamlit
  • Scikit-learn
  • XGBoost
  • LightGBM
  • CatBoost
  • Pandas
  • NumPy
  • Joblib

🎓 Use Cases

  • Structural Biology: Plan crystallization experiments
  • Drug Discovery: Optimize protein crystal conditions
  • Research: Explore crystallization parameter space
  • Education: Learn about protein crystallization

📖 Background

Protein crystallization is essential for determining 3D protein structures via X-ray crystallography. This tool uses machine learning trained on historical crystallization data from the Protein Data Bank (PDB) to predict optimal conditions.

Input Parameters Explained

  • Crystallization Method: Technique used (vapor diffusion, batch, etc.)
  • Temperature: Affects protein stability and crystal growth (typically 277-298 K)
  • pH: Critical for protein solubility and crystal formation (0-14 scale)
  • Matthews Coefficient: Ratio of unit cell volume to the total protein molecular weight in the cell (ų/Da)
  • Solvent Content: Percentage of solvent in crystal lattice (typically 30-70%)
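If you need to derive the last two inputs yourself, the sketch below uses the standard Matthews relations: V_M = V_cell / (Z × MW), and solvent fraction ≈ 1 - 1.23 / V_M, where the constant 1.23 assumes a typical protein partial specific volume of about 0.74 cm³/g. The function names and the example cell are hypothetical.

```python
def matthews_coefficient(cell_volume_A3, molecular_weight_Da, molecules_per_cell):
    """V_M in cubic angstroms per dalton: V_cell / (Z * MW)."""
    if min(cell_volume_A3, molecular_weight_Da, molecules_per_cell) <= 0:
        raise ValueError("all inputs must be positive")
    return cell_volume_A3 / (molecular_weight_Da * molecules_per_cell)


def solvent_fraction(vm):
    """Approximate solvent fraction from V_M: 1 - 1.23 / V_M."""
    if vm <= 0:
        raise ValueError("V_M must be positive")
    return 1.0 - 1.23 / vm


# Made-up crystal: 240,000 cubic angstrom cell, 25 kDa protein, 4 copies per cell.
vm = matthews_coefficient(240_000.0, 25_000.0, 4)
solvent = solvent_fraction(vm)
print(f"V_M = {vm:.2f} A^3/Da, solvent content = {solvent:.2%}")
```

For this example, V_M is 2.4 ų/Da and the solvent content is about 48.75%, inside the typical range quoted above.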

⚠️ Important Notes

  • Validation Required: Always validate predictions experimentally
  • Research Tool: For research and educational purposes
  • Starting Point: Use predictions as a guide, not absolute truth
  • Protein-Specific: Results may vary based on your specific protein

🀝 Contributing

This is a research project. Feedback and suggestions are welcome!

📄 License

MIT License. Free to use, including for research and educational purposes.

🙏 Acknowledgments

  • Training data derived from Protein Data Bank (PDB)
  • Built with Streamlit and ensemble ML models
  • Inspired by advances in computational structural biology

📞 Contact & Support

For questions or issues, please open an issue on the repository.


Note: This tool provides predictions based on historical data. Always conduct proper experimental validation. Crystallization is a complex process influenced by many factors not captured by these models alone.