SalZa2004 / YSI_Predictor


	---
	language:
	- en
	tags:
	- chemistry
	- fuel
	- engines
	- YSI
	---

	# YSI Predictor — Yield Sooting Index Model

	## 📌 Overview

	This repository contains a machine learning model for predicting the Yield Sooting Index (YSI) of single-component fuel molecules directly from their SMILES representation.

	YSI quantifies a fuel's tendency to form soot during combustion.
	- Lower YSI → lower sooting tendency (cleaner combustion)
	- Highly relevant for diesel replacement fuels, biofuels, and oxygenated fuels.

	This model supports:
	- molecular design and optimization,
	- genetic algorithms (e.g., with CReM-generated mutations),
	- Pareto optimization (cetane number, CN, vs. YSI),
	- rapid candidate screening.
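
The Pareto optimization use case listed above can be sketched as follows. This is a minimal illustration, assuming each candidate carries a predicted CN (to be maximized) and a predicted YSI (to be minimized). The `Candidate` class, the `dominates` and `pareto_front` helpers, and all numeric values are hypothetical and not part of this repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str   # placeholder label; in practice this could be a SMILES string
    cn: float   # predicted cetane number (higher is better)
    ysi: float  # predicted YSI (lower is better)


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is no worse than `b` on both objectives and strictly better on one."""
    no_worse = a.cn >= b.cn and a.ysi <= b.ysi
    strictly_better = a.cn > b.cn or a.ysi < b.ysi
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Return the candidates that no other candidate dominates."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates)
    ]


# Placeholder values for illustration only; not model outputs or measurements.
pool = [
    Candidate("A", cn=60.0, ysi=40.0),
    Candidate("B", cn=55.0, ysi=20.0),
    Candidate("C", cn=50.0, ysi=30.0),  # dominated by B (lower CN, higher YSI)
    Candidate("D", cn=70.0, ysi=60.0),
]
front = pareto_front(pool)
print([c.name for c in front])
```

The pairwise check is quadratic in the number of candidates, which is usually acceptable for screening pools of a few thousand molecules.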

	---

	## 🧠 How It Works

	The prediction pipeline uses:
	- RDKit — molecule parsing
	- Mordred — 2D/3D molecular descriptors
	- FeatureSelector — dimensionality reduction
	- Tree-based regression model trained on experimental YSI values

	Prediction flow:
	1. Input SMILES → RDKit Molecule
	2. Mordred descriptors generated
	3. Feature selection applied
	4. YSI predicted using trained regressor

	Two model artifacts are included:

	- `model.joblib`: trained regressor
	- `selector.joblib`: feature selector used during training



	---

	## 🧬 Training Data

	The model was trained on a curated dataset of experimentally measured YSI values covering a diverse set of fuel molecule structures, including:
	- linear alkanes
	- branched alkanes
	- cyclic hydrocarbons
	- aromatics
	- oxygenated species (ethers, esters)

	YSI range in dataset: ≈ 3 → 80

	---

	## 📊 Performance

	Performance was evaluated on both training and held-out test sets.

	### ⭐ Training Performance

	| Metric | Score |
	|--------|--------|
	| RMSE | 6.9661 |
	| MAE | 4.0581 |
	| R² | 0.9309 |

	---

	### 🧭 Test Performance

	| Metric | Score |
	|--------|--------|
	| RMSE | 5.9667 |
	| MAE | 3.8324 |
	| R² | 0.9440 |
	| MAPE | 18.38% |

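	A test R² of 0.9440 means the model explains about 94% of the variance in held-out YSI values. The 18.38% MAPE is comparatively high, likely because relative error is inflated for low-YSI compounds.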
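
For reference, the four reported metrics can be computed as in the sketch below. It assumes the standard textbook definitions, which this card does not state explicitly. The `regression_metrics` helper and the sample numbers are hypothetical and are not taken from the model's test set.

```python
import math


def regression_metrics(y_true: list[float], y_pred: list[float]) -> dict[str, float]:
    """Compute RMSE, MAE, R² and MAPE (in percent) using their standard definitions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
        # MAPE divides by the true value, so it is undefined if any true value is 0
        "mape": 100.0 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n,
    }


# Placeholder numbers for illustration only.
y_true = [10.0, 20.0, 40.0, 80.0]
y_pred = [12.0, 18.0, 44.0, 76.0]
metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy data an absolute error of 2 on a true value of 10 is a 20% relative error, while an error of 4 on a true value of 80 is only 5%, which is why MAPE weighs low-YSI molecules more heavily than RMSE or MAE do.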

	---

	### 📉 Generalization Check

	| Metric | Value |
	|--------------|--------|
	| Train RMSE | 6.9661 |
	| Test RMSE | 5.9667 |
	| Δ (Test − Train) | −0.9994 |

	➡️ The negative Δ gives no sign of overfitting. A test RMSE below the training RMSE more likely reflects an easier or narrower test split than genuinely better accuracy on unseen molecules.
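
The gap in the table can be reproduced with a few lines of arithmetic. The sketch below uses only the two RMSE values reported above. The relative gap is an extra figure added here for context and is not reported elsewhere in this card.

```python
train_rmse = 6.9661  # training RMSE from the table above
test_rmse = 5.9667   # test RMSE from the table above

gap = test_rmse - train_rmse     # absolute gap, Δ (Test − Train)
relative_gap = gap / train_rmse  # gap as a fraction of the training error

print(f"Δ = {gap:.4f}")
print(f"relative Δ = {relative_gap:.1%}")
```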

	---

	## 🚀 Usage

	Below is a minimal example showing how to use the model in Python.

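	> The feature calculation must match the training pipeline. The `shared_features` module (providing `featurize_df` and `FeatureSelector`) must be importable, since `joblib.load` needs the class definition to restore a pickled `FeatureSelector`.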

	```python
	import joblib
	from rdkit import Chem
	from shared_features import featurize_df, FeatureSelector

	# Load model & selector
	model = joblib.load("model.joblib")
	selector = joblib.load("selector.joblib")

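	def predict_ysi(smiles: str) -> float:
	    """Predict the YSI of a single molecule from its SMILES string."""
	    # Reject strings RDKit cannot parse before featurizing
	    if Chem.MolFromSmiles(smiles) is None:
	        raise ValueError(f"Invalid SMILES: {smiles!r}")
	    df = featurize_df([smiles])  # descriptors, computed as in training
	    X = selector.transform(df)   # keep only the selected features
	    y = model.predict(X)
	    return float(y[0])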

	print(predict_ysi("CCCCCCC"))  # n-heptane
	```