Spaces:

Younup
/

TrainGreenmorrowMLModel


dr-wann committed on Apr 20

Commit

d17a0b5

1 Parent(s): 397c74e

Uploading notebooks and data


Files changed (5)

1_explore_and_prepare_data.ipynb +250 -0
2_build_and_train_model.ipynb +414 -0
README.md +1 -1
login.html +2 -2
younup_greenmorrow_data.parquet +3 -0

1_explore_and_prepare_data.ipynb ADDED

	@@ -0,0 +1,250 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "804cef31-4c45-4456-bb2a-6a27680f7c93",
+   "metadata": {},
+   "source": [
+    "# Explore and prepare data for Machine Learning"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8f83ebbe-33dc-4f11-a3a5-414d7f00dd09",
+   "metadata": {},
+   "source": [
+    "## Introduction\n",
+    "This file aims to present the **preliminary process for building and training a machine learning model**, that is analysing and preparing the training data.\n",
+    "\n",
+    "This file is a ***notebook***: it is a format in which you can write code, rich text (Markdown), formulae, and add multimedia content. To learn more, visit [the Jupyter project website](https://jupyter.org/).\\\n",
+    "Each piece of content is placed in a *cell*, which can be executed on the fly without having to rerun the entire notebook."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1f93155a-5ede-407f-9731-825b9752899e",
+   "metadata": {},
+   "source": [
+    "### Model objective\n",
+    "\n",
+    "Here, the objective of the trained model is to **predict a plant’s seasonal stage based on the date, location and weather conditions**. This stage is called a *phenological stage*, which may correspond to flowering, fruit ripening, or leaf fall. The phenological stage is generally numbered from 0 to 9, with number 6 corresponding, for example, to flowering:\n",
+    "\n",
+    "![BBCH scale](https://appgeodb.nancy.inrae.fr/biljou/images/referentiel_bbch_gb.png \"The BBCH scale. According to “Phenological stages of monocotyledonous and dicotyledonous crops”. U. Meier. Blackwell Wissenschafts-Verlag Berlin. 2001.\")\n",
+    "\n",
+    "*The BBCH scale. According to “Phenological stages of monocotyledonous and dicotyledonous crops”. U. Meier. Blackwell Wissenschafts-Verlag Berlin. 2001.*\n",
+    "\n",
+    "A stage can be seen as a category (0, 1, …, 8, 9): therefore, the model to be trained must be based on **a classification algorithm**, supervised by training data. A classification algorithm can be used to address many needs, such as diagnosing a patient (are they affected by a given disease based on their blood test results, yes/no?) or identifying an animal in a photo (cat or dog?)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a3d8a559-f71c-429a-9fc9-d82748aa7911",
+   "metadata": {},
+   "source": [
+    "## Python package imports"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fad0ce31-8a2f-4e4e-a99e-32104e288f6b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import matplotlib.pyplot as plt\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "pd.set_option('display.max_columns', None)  # Display all columns of a dataframe"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5f0fb927-4f09-4da9-8533-d09bdd9b8bda",
+   "metadata": {},
+   "source": [
+    "## Data loading and preparation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9d6f7826-ac8c-4a6b-9b30-9031b67dd0fe",
+   "metadata": {},
+   "source": [
+    "The first step in building a machine learning model is to **collect data, analyse it to understand it, and prepare it** for model training. Here, we have [a dataset](https://doi.org/10.5281/zenodo.15593446) in Parquet format called `younup_greenmorrow_data.parquet`, which we load using the `pandas` package."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "482b8bd2-e30c-4e2b-a7df-b9e041b6c7b7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data = pd.read_parquet('younup_greenmorrow_data.parquet')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "00a0b6d7-6dab-4882-8966-c58c901fc365",
+   "metadata": {},
+   "source": [
+    "### Basic analysis"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bf827940-a7ac-4808-aa48-cc55f6542a9f",
+   "metadata": {},
+   "source": [
+    "Some useful `pandas` functions to get started:\n",
+    "- `head()` to display the first rows of the dataset,\n",
+    "- `info()` to obtain information about each column, such as the number of non-null entries or its type (numeric, categorical, etc.),\n",
+    "- `describe()` to view simple descriptive statistics for each column (mean, minimum and maximum values, etc.)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0e971258-17dc-4a20-b67b-4b7771091327",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "154c8e55-d1f9-43dc-a7aa-d8879920fbaa",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data.info()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "65080519-94c8-4219-8606-8864d1ac1e5c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data.describe()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d8738e32-6898-41bb-b53e-3d55cba7861f",
+   "metadata": {},
+   "source": [
+    "We can see that this dataset contains 27,308 entries (rows) and 62 variables (columns). Using the `head()` function, we understand that **each row corresponds to an observation of a plant’s stage at a given date and location**, for which weather conditions are provided across many columns. The `info()` function tells us that no column contains missing values (\"27308 non-null\"), while the `describe()` function informs us (in the `year` column) about the temporal range of the data, from 1958 to 2024."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8d4c1034-ff6d-4bcc-844f-285237511a83",
+   "metadata": {},
+   "source": [
+    "### In-depth analysis"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ff6fc931-2d32-46a0-8662-c0b8c18f4c99",
+   "metadata": {},
+   "source": [
+    "To go further in understanding the dataset, it is possible to explore it graphically or using more advanced statistical tools. One such tool is `ydata-profiling` ([official website](https://docs.profiling.ydata.ai/latest/)), which automatically generates a detailed and interactive analysis report, including a wide range of statistics and visualisations. The report can be viewed directly in the notebook or exported as HTML."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f7f167e4-49e3-4006-9152-d4b8b944f8a3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from ydata_profiling import ProfileReport\n",
+    "\n",
+    "profile = ProfileReport(data, title='Profiling Report')\n",
+    "profile.to_notebook_iframe()  # Report in the notebook\n",
+    "# profile.to_file(\"test_report.html\")  # Export report as HTML"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8ff73e8c-2e31-45df-9500-501f171a8014",
+   "metadata": {},
+   "source": [
+    "In the case of our dataset, the report generated with `ydata-profiling` highlights warnings about strong correlations between certain columns (for example, `fog` and `mist`), as well as columns that mainly contain zeros, which is not an issue in this case."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7951facc-2835-4b7f-9540-365c33592257",
+   "metadata": {},
+   "source": [
+    "### Selecting columns to retain"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "14c6f9b6-dc29-4e6e-9fe4-91487d907c6d",
+   "metadata": {},
+   "source": [
+    "Since our dataset has already been cleaned as part of Greenmorrow’s R&D work, we only need to select the relevant columns for training the machine learning model. Without going into detail, the `year`, `plant_name`, `plant_genus`, and `plant_species` columns can be discarded initially, as they are more likely to be used as filters rather than as features for predicting a plant’s phenological stage (for example, to refine predictions by period or species). \n",
+    "\n",
+    "In addition, the `pheno_stage_sec` column, which represents the secondary phenological stage (two digits, from 00 to 99), can also be excluded here, as we aim to predict a primary phenological stage (single digit, from 0 to 9), represented by the `pheno_stage_prim` column."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d2eb2ab3-60da-47c2-9a1f-00cd83a46ad7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data = data.drop(columns=['year', 'plant_name', 'plant_genus', 'plant_species', 'pheno_stage_sec'])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fc4d48f6-d721-4e02-a24b-00be4bda804c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6fc3b9ce-7f9f-4a99-a72f-fa1409c32cc4",
+   "metadata": {},
+   "source": [
+    "With the dataset ready, we can move on to building and training the model!"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
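
Note on `1_explore_and_prepare_data.ipynb`: the basic checks and the column selection described in this notebook can be tried on a small stand-in frame before loading the real Parquet file. The sketch below is illustrative only. The column names `day_of_year` and `temperature` and all values are assumptions made up for the example; only the five dropped column names and `pheno_stage_prim` come from the notebook.

```python
import pandas as pd

# Hypothetical miniature of the dataset: values are invented for illustration,
# and 'day_of_year' / 'temperature' are assumed feature names.
toy = pd.DataFrame({
    'year': [1958, 1990, 2024],
    'plant_name': ['oak', 'birch', 'oak'],
    'plant_genus': ['Quercus', 'Betula', 'Quercus'],
    'plant_species': ['robur', 'pendula', 'robur'],
    'pheno_stage_sec': [60, 65, 11],
    'pheno_stage_prim': [6, 6, 1],
    'day_of_year': [120, 118, 95],
    'temperature': [12.5, 13.1, 8.4],
})

# Same information as info(): number of missing values per column
missing_per_column = toy.isna().sum()

# Same information as describe() on the 'year' column: temporal range
year_range = (int(toy['year'].min()), int(toy['year'].max()))

# Column selection as done in the notebook
dropped = ['year', 'plant_name', 'plant_genus', 'plant_species', 'pheno_stage_sec']
prepared = toy.drop(columns=dropped)

print(missing_per_column)
print(year_range)
print(list(prepared.columns))
```

Keeping the list of dropped columns in a single variable makes it easier to reuse the same selection in the second notebook, where the same `drop()` call is repeated.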

2_build_and_train_model.ipynb ADDED

	@@ -0,0 +1,414 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "804cef31-4c45-4456-bb2a-6a27680f7c93",
+   "metadata": {},
+   "source": [
+    "# Build and train a Machine Learning model"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8f83ebbe-33dc-4f11-a3a5-414d7f00dd09",
+   "metadata": {},
+   "source": [
+    "## Introduction\n",
+    "This file aims to present **the usual process of building and training a machine learning model**.\n",
+    "\n",
+    "This file is a ***notebook***: it is a format in which you can write code, rich text (Markdown), formulas, and add multimedia content. To learn more, see [the website](https://jupyter.org/) of the Jupyter project.\\\n",
+    "Each piece of content is placed in a *cell* that can be executed on the fly, without having to rerun the entire notebook."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1f93155a-5ede-407f-9731-825b9752899e",
+   "metadata": {},
+   "source": [
+    "### Model objective\n",
+    "\n",
+    "Here, the objective of the trained model is to **predict a plant’s seasonal stage based on the date, location and weather conditions**. This stage is called a *phenological stage*, which may correspond to flowering, fruit ripening, or leaf fall. The phenological stage is generally numbered from 0 to 9, with number 6 corresponding, for example, to flowering:\n",
+    "\n",
+    "![BBCH scale](https://appgeodb.nancy.inrae.fr/biljou/images/referentiel_bbch_gb.png \"The BBCH scale. According to “Phenological stages of monocotyledonous and dicotyledonous crops”. U. Meier. Blackwell Wissenschafts-Verlag Berlin. 2001.\")\n",
+    "\n",
+    "*The BBCH scale. According to “Phenological stages of monocotyledonous and dicotyledonous crops”. U. Meier. Blackwell Wissenschafts-Verlag Berlin. 2001.*\n",
+    "\n",
+    "A stage can be seen as a category (0, 1, …, 8, 9): therefore, the model to be trained must be based on **a classification algorithm**, supervised by training data. A classification algorithm can be used to address many needs, such as diagnosing a patient (are they affected by a given disease based on their blood test results, yes/no?) or identifying an animal in a photo (cat or dog?)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a3d8a559-f71c-429a-9fc9-d82748aa7911",
+   "metadata": {},
+   "source": [
+    "## Python package imports"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fad0ce31-8a2f-4e4e-a99e-32104e288f6b",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import matplotlib.pyplot as plt\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "from sklearn.compose import ColumnTransformer\n",
+    "from sklearn.linear_model import LogisticRegression\n",
+    "from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplay\n",
+    "from sklearn.model_selection import GridSearchCV, train_test_split\n",
+    "from sklearn.pipeline import Pipeline\n",
+    "from sklearn.preprocessing import StandardScaler\n",
+    "\n",
+    "pd.set_option('display.max_columns', None)  # Display all columns of a dataframe"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5f0fb927-4f09-4da9-8533-d09bdd9b8bda",
+   "metadata": {},
+   "source": [
+    "## Data loading and preparation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9d6f7826-ac8c-4a6b-9b30-9031b67dd0fe",
+   "metadata": {},
+   "source": [
+    "The first step in building a machine learning model is to **collect data, analyse it to understand it, and prepare it** for training the model. Here, we have [a dataset](https://doi.org/10.5281/zenodo.15593446) in Parquet format called `younup_greenmorrow_data.parquet`, which we load using the `pandas` package."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "482b8bd2-e30c-4e2b-a7df-b9e041b6c7b7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data = pd.read_parquet('younup_greenmorrow_data.parquet')\n",
+    "data = data.drop(columns=['year', 'plant_name', 'plant_genus', 'plant_species', 'pheno_stage_sec'])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "42f43d9f-9b1b-498d-8523-dc57621afe10",
+   "metadata": {},
+   "source": [
+    "## Model building"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e8abe49e-de88-4398-bc64-407efc0a2eca",
+   "metadata": {},
+   "source": [
+    "In the case of a supervised classification problem, we want our model to be **trained on data for which we know both the observation conditions (date, location, weather) and the target variable (phenological stage)**. To reuse the medical analogy, the training data would consist, for example, of blood measurements (white blood cell count, cholesterol level, etc.) as observation conditions, along with a column indicating \"yes\" or \"no\" depending on whether the person has the corresponding disease."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d344fe7b-c01e-4f6f-b2dd-76e8a13b614b",
+   "metadata": {},
+   "source": [
+    "### Splitting the variables"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "838d943d-6ee8-4088-a5a2-76d3e0e01f0b",
+   "metadata": {},
+   "source": [
+    "Before training the model *stricto sensu*, we therefore need to separate the input variables (`X`) from the target variable (`y`)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7d8b4b53-b37c-4b50-bb0f-6a63801f934e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "X = data.drop(columns=['pheno_stage_prim'])\n",
+    "y = data['pheno_stage_prim']"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1e5f550f-dfd9-42a7-a8bf-9ff2fc63acc6",
+   "metadata": {},
+   "source": [
+    "### Splitting the training and test data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "601d9d22-d780-4f79-a234-27be26384c21",
+   "metadata": {},
+   "source": [
+    "Next, we need to set aside part of the original data in order to evaluate the model we will have trained. There are several ways to do this, some quite sophisticated, but the simplest is to **randomly split the dataset into a training set and a test set**, keeping the majority of the data for training (typically 80%)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "07e75d34-a167-4656-9336-ad287a7721a6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "97a194f9-d87e-4d5c-9317-433973d84777",
+   "metadata": {},
+   "source": [
+    "A quick check shows that `X_train` indeed contains 80% of the original data (`X`), and `X_test` 20%:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4cd5ad66-ef6f-40df-b986-0a18239a8077",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "len(X), len(X_train), len(X_test)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9d71456a-8016-413d-b468-e6353df4e429",
+   "metadata": {},
+   "source": [
+    "### Naive model training"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e19f5c9c-cf4d-4c94-a46c-2e7277151d63",
+   "metadata": {},
+   "source": [
+    "The data has been cleaned, selected and properly split: it is now time to train the model! However, several variable transformation steps may be required:\n",
+    "- scaling numerical variables to improve training performance;\n",
+    "- one-hot encoding (binary 0/1 columns) of categorical variables that are not already encoded, as the model cannot be trained on string values;\n",
+    "- resampling the data, in cases where one output class (here, a phenological stage) is much more represented in the dataset than another.\n",
+    "\n",
+    "In this case, **we choose to apply only feature scaling to the numerical variables** using the `StandardScaler()` class from `scikit-learn`, which subtracts the mean of each variable and divides the result by its standard deviation: each numerical variable then has a mean of 0 and a standard deviation of 1, which puts all of them on a comparable scale (the values are not confined to a fixed interval such as $[-1, 1]$). As good practice, we include this step in a `Pipeline()`, which allows all data transformation and model training steps to be handled through a single object.\n",
+    "\n",
+    "For this example, we choose to train a model based on the logistic regression algorithm (the `LogisticRegression()` class in `scikit-learn`), which is lightweight and easy to interpret, as shown on [this page](https://www.geeksforgeeks.org/machine-learning/understanding-logistic-regression/)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "38ed5cd5-76c1-4905-be97-42df13053ef9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Selection of numerical and categorical variables\n",
+    "num_cols = X_train.select_dtypes(include=['int64', 'float64']).columns\n",
+    "cat_cols = X_train.select_dtypes(include=['object']).columns\n",
+    "\n",
+    "# Definition of preprocessing steps\n",
+    "preprocessor = ColumnTransformer(\n",
+    "    transformers=[\n",
+    "        ('num', StandardScaler(), num_cols),  # Scaling of numerical variables\n",
+    "        ('cat', 'passthrough', cat_cols)  # No transformation of categorical variables (there are none in the dataset)\n",
+    "    ]\n",
+    ")\n",
+    "\n",
+    "# Pipeline definition\n",
+    "pipeline = Pipeline(\n",
+    "    steps=[\n",
+    "        ('preprocessor', preprocessor),  # Applying preprocessing steps\n",
+    "        ('classifier', LogisticRegression())  # Calling the model to be trained\n",
+    "    ]\n",
+    ")\n",
+    "\n",
+    "# Model training!\n",
+    "pipeline.fit(X_train, y_train)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d4f4bef6-7575-4ef3-b29d-dd40554f4a7a",
+   "metadata": {},
+   "source": [
+    "After training the model using the `fit()` method, the pipeline structure is displayed, making it easier to understand the sequence of steps we have discussed. However, a `ConvergenceWarning` may appear, informing us that the solver reached its maximum number of iterations before converging, so the fitted coefficients may not be optimal. This highlights the limitations of a “naive” training approach, which does not include any tuning and simply relies on the default parameters of the algorithm (here, `LogisticRegression()`).\n",
+    "\n",
+    "Nevertheless, we can compute the accuracy (between 0 and 1), which measures the model’s ability to correctly predict a phenological stage on the test data (`X_test`, `y_test`):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1b0752ee-3098-41de-801a-5669924b65e9",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "y_pred = pipeline.predict(X_test)\n",
+    "accuracy_score(y_test, y_pred)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2d2052df-4f1d-41b0-842c-d795348b281d",
+   "metadata": {},
+   "source": [
+    "We can see that the accuracy is only around 66% (the exact value varies from run to run, since the train/test split is random): this indicates **a model that is able to correctly predict the phenological stage in only about two-thirds of cases**. Let’s see how the model can be improved."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "26d12dc1-0f1e-4cd2-8460-f274e834285e",
+   "metadata": {},
+   "source": [
+    "## Model optimisation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f36750a5-9722-4ece-9c73-10b39ff046c9",
+   "metadata": {},
+   "source": [
+    "To improve a model, it is generally necessary to adjust its *hyperparameters*, that is, the settings chosen before training (as opposed to the coefficients learned during training). For `LogisticRegression()`, there are several of these, as shown in the `scikit-learn` documentation. Here, we choose to vary `C`, `penalty`, `solver`, and `max_iter`.\n",
+    "\n",
+    "To avoid having to test many hyperparameter combinations manually, `scikit-learn` provides automated methods, the most explicit of which is `GridSearchCV()`: this involves **defining a grid of values to test for the chosen hyperparameters**, and specifying the performance metric that `GridSearchCV()` should use to determine which hyperparameter values produce the best model. Naturally, the larger the grid of values to test, the longer the training will take. In addition, warning or error messages may appear when certain tested values are incompatible."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1702100f-35d9-46f5-b590-adbecbc1d25e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Definition of the hyperparameter grid\n",
+    "param_grid = {\n",
+    "    'classifier__C': [0.1, 1, 10],\n",
+    "    'classifier__penalty': ['l2'],\n",
+    "    'classifier__solver': ['lbfgs', 'saga'], \n",
+    "    'classifier__max_iter': [1000, 1500]\n",
+    "}\n",
+    "\n",
+    "# Definition of grid search\n",
+    "grid_search = GridSearchCV(\n",
+    "    pipeline,  # Previously defined pipeline\n",
+    "    param_grid,  # Hyperparameter grid to test\n",
+    "    n_jobs=-1,  # Run jobs in parallel on all available CPU cores\n",
+    "    scoring='accuracy'  # Performance metric used to select the best model\n",
+    ")\n",
+    "\n",
+    "# Training the model with hyperparameter search (⚠️ this may take around ten minutes)\n",
+    "grid_search.fit(X_train, y_train)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6b56b57b-b8ca-4d7f-82a9-5cf0e8bc599d",
+   "metadata": {},
+   "source": [
+    "Similarly to before, we can compute the accuracy (between 0 and 1) using the test data (`X_test`, `y_test`):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8d2c62ee-1da7-4f48-94bf-758e2417e51c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "best_model = grid_search.best_estimator_  # Best model obtained from GridSearchCV()\n",
+    "\n",
+    "y_pred = best_model.predict(X_test)\n",
+    "accuracy_score(y_test, y_pred)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3e8587bf-b3cd-4c62-afa0-5b2385c7d728",
+   "metadata": {},
+   "source": [
+    "In this specific case, hyperparameter optimisation does not appear to have improved the model: this may be due to the simplicity of the chosen algorithm relative to the training data, the limited hyperparameter grid defined for this example, or the imbalance in the dataset (phenological stage 6, corresponding to flowering, is over-represented).\n",
+    "\n",
+    "It is nevertheless possible to evaluate the model using other indicators, particularly graphical ones."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "62eeb09f-d0be-4af4-b3fa-9f58e267ce31",
+   "metadata": {},
+   "source": [
+    "## Model evaluation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ecc7c246-a518-4516-ad1e-4e59270d79eb",
+   "metadata": {},
+   "source": [
+    "To gain a better understanding of the model’s performance, the `scikit-learn` **classification report** provides various metrics, both for each predictable class (the phenological stages) and for the overall test set (as averages)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2274eb5a-9e18-40fa-aaba-ac5e2abeb2e3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "print(classification_report(y_test, y_pred))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "24cb2130-9d00-42e6-aa61-92e31bf34d7e",
+   "metadata": {},
+   "source": [
+    "To go further, we can display a **confusion matrix**, which shows the true values of the target variable compared to the values predicted by the trained model, again using the test data. We can see here that stages 8 and 9 are generally well predicted, but that stage 1 is often confused with stage 6: this information may lead to a deeper analysis of the data in order to make potential corrections."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ec92ea7f-e3ad-4382-a71a-5f6f21df055d",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "_ = ConfusionMatrixDisplay.from_predictions(y_test, y_pred)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ec84a379-0216-4294-8685-b5a3de7bee1f",
+   "metadata": {},
+   "source": [
+    "Other indicators can also be considered to analyse the model’s performance, such as ROC curves, or the importance of each variable in the prediction, among others."
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.12.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
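
Note on `2_build_and_train_model.ipynb`: the scaling step described in the "Naive model training" cell can be illustrated without `scikit-learn`. The sketch below applies the same formula, `(x - mean) / std` with the population standard deviation, to a made-up list of values. The helper name `standardise` and the sample values are assumptions for illustration, not part of the notebook.

```python
from statistics import fmean, pstdev


def standardise(values):
    """Return (x - mean) / std for each x, using the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # Constant column: every centred value is 0, so avoid dividing by zero
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]


# Hypothetical temperatures with one unusually high value
temperatures = [2.0, 4.0, 4.0, 5.0, 30.0]
scaled = standardise(temperatures)

print(scaled)
print(fmean(scaled), pstdev(scaled))  # close to 0 and 1 respectively
print(max(scaled))  # the unusually high value stays well above 1 after scaling
```

The scaled values have a mean of 0 and a standard deviation of 1, but they are not limited to a fixed interval: an unusually large observation remains large after scaling.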

README.md CHANGED

@@ -1,6 +1,6 @@
 ---
 title: TrainGreenmorrowMLModel
-emoji: 💻🐳
+emoji: 🌱🌲
 colorFrom: gray
 colorTo: green
 sdk: docker

login.html CHANGED

@@ -8,10 +8,10 @@
 <div id="jupyter-main-app" class="container">
-    <img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" alt="Hugging Face Logo">
+    <img src="https://cdn.prod.website-files.com/625aea37d2019e1dd7782064/62617385c686c2d8d5820bc9_Logo%20Dark.svg" alt="Younup Logo">
     <h4>Welcome to JupyterLab</h4>
-    <h5>The default token is <span style="color:orange;">huggingface</span></h5>
+    <h5>The default token is <span style="color:rgb(251, 190, 1);">huggingface</span></h5>
     {% if login_available %}
     {# login_available means password-login is allowed. Show the form. #}

younup_greenmorrow_data.parquet ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:bc36880cfc100f28ec3546b664b42e1863668489234d0b8a579363b7d13b34e0
+size 857655