---
language:
- en
tags:
- chemistry
- fuel
- engines
- YSI
---

# YSI Predictor: Yield Sooting Index Model

## 📌 Overview

This repository contains a machine learning model that predicts the **Yield Sooting Index (YSI)** of single-component fuel molecules directly from their **SMILES** representation.

**YSI is a soot formation metric** used in combustion science.

- **Lower YSI → cleaner combustion**
- Highly relevant for **diesel replacement fuels**, **biofuels**, and **oxygenated fuels**.

This model supports:

- molecular design and optimization,
- genetic algorithms (e.g., CREM),
- Pareto optimization (cetane number, CN, vs. YSI),
- rapid candidate screening.

---

## 🧠 How It Works

The prediction pipeline uses:

- **RDKit**: molecule parsing
- **Mordred**: 2D/3D molecular descriptors
- **FeatureSelector**: dimensionality reduction
- **Tree-based regression model**: trained on experimental YSI values

**Prediction flow:**

1. Input SMILES → RDKit molecule
2. Mordred descriptors are generated
3. Feature selection is applied
4. YSI is predicted by the trained regressor

Two model artifacts are included:

- `model.joblib`: trained regressor
- `selector.joblib`: feature selector used during training

---

## 🧬 Training Data

The model was trained on a curated dataset of **experimentally measured YSI values** covering a diverse set of fuel molecule structures. The dataset includes:

- linear alkanes
- branched alkanes
- cyclic hydrocarbons
- aromatics
- oxygenated species (ethers, esters)

YSI range in the dataset: **approximately 3 to 80**

---

## 📊 Performance

Performance was evaluated on both the training set and a **held-out test** set.

### ⭐ Training Performance

| Metric | Score |
|--------|-------|
| RMSE | **6.9661** |
| MAE | **4.0581** |
| R² | **0.9309** |

---

### 🧭 Test Performance

| Metric | Score |
|--------|-------|
| RMSE | **5.9667** |
| MAE | **3.8324** |
| R² | **0.9440** |
| MAPE | **18.38%** |

The **test R² of 0.9440** indicates strong predictive accuracy on held-out molecules, although the MAPE of 18.38% shows that relative errors are still appreciable.

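
The evaluation code is not part of this README, so the following is only a minimal sketch of how RMSE, MAE, R² and MAPE are conventionally computed, assuming the standard textbook definitions. The function name `regression_metrics` and the toy numbers are illustrative assumptions and are not taken from the YSI dataset.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, R² and MAPE (in percent) for paired values.

    Hypothetical helper using the standard definitions; it is not the
    evaluation code used to produce the tables in this README.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is zero")

    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    if ss_tot == 0:
        raise ValueError("R² is undefined when all true values are identical")

    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
        "mape": 100.0 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n,
    }


# Toy values for illustration only (not real YSI measurements).
toy_metrics = regression_metrics([10.0, 20.0, 40.0], [12.0, 18.0, 40.0])
for name, value in toy_metrics.items():
    print(f"{name}: {value:.4f}")
```

Because MAPE divides each error by the true value, the same absolute error weighs more heavily for low-YSI molecules than for high-YSI ones, which is why it is reported alongside RMSE and MAE.
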

---

### 📉 Generalization Check

| Metric | Value |
|--------|-------|
| Train RMSE | **6.9661** |
| Test RMSE | **5.9667** |
| Δ (Test − Train) | **−0.9994** |

➡️ The negative Δ (test RMSE below training RMSE) gives **no indication of overfitting**. The slightly **better test performance** is attributed to a more stable distribution of YSI values in the test split.

---

## 🚀 Usage

Below is a minimal example showing how to use the model in Python.

> The feature calculation must match the training pipeline.

```python
import joblib
from rdkit import Chem

# FeatureSelector is imported so that joblib can unpickle selector.joblib.
from shared_features import featurize_df, FeatureSelector  # noqa: F401

# Load model & selector
model = joblib.load("model.joblib")
selector = joblib.load("selector.joblib")


def predict_ysi(smiles: str) -> float:
    # RDKit returns None for SMILES it cannot parse, so fail early.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles!r}")

    df = featurize_df([smiles])   # Mordred descriptors
    X = selector.transform(df)    # keep only the features used in training
    y = model.predict(X)
    return float(y[0])


print(predict_ysi("CCCCCCC"))  # n-heptane
```
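
For rapid candidate screening, a predictor such as `predict_ysi` can be wrapped in a small ranking loop. The sketch below is an assumption about how this might be done, not part of the repository: `rank_by_ysi` is a hypothetical helper, and the `stub_predictor` only stands in for the real model so that the snippet runs without the model files. Its return values are arbitrary and are not YSI predictions.

```python
from typing import Callable, Iterable, List, Tuple


def rank_by_ysi(
    smiles_list: Iterable[str],
    predict: Callable[[str], float],
) -> Tuple[List[Tuple[str, float]], List[str]]:
    """Rank candidates by predicted YSI, lowest (cleanest) first.

    Hypothetical helper: any exception raised by `predict` marks the
    candidate as failed instead of aborting the whole screening run.
    """
    ranked: List[Tuple[str, float]] = []
    failed: List[str] = []
    for smiles in smiles_list:
        try:
            ranked.append((smiles, float(predict(smiles))))
        except Exception:
            failed.append(smiles)
    ranked.sort(key=lambda pair: pair[1])
    return ranked, failed


def stub_predictor(smiles: str) -> float:
    """Placeholder for predict_ysi; returns an arbitrary number, not a YSI."""
    if not smiles:
        raise ValueError("empty SMILES")
    return float(len(smiles))


ranked, failed = rank_by_ysi(["CCCCCCC", "CCO", ""], stub_predictor)
print("ranked:", ranked)
print("failed:", failed)
```

In real use, `stub_predictor` would be replaced by `predict_ysi`, so every candidate passes through the same featurization and feature selection as the training pipeline.
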