{ "cells": [ { "cell_type": "markdown", "id": "d63d91d0", "metadata": {}, "source": [ "# Data Drift Analysis\n", "\n", "As part of deploying the credit scoring model, we set up\n", "a drift analysis between:\n", "\n", "- a **reference dataset** (stable / historical period)\n", "- a **current dataset** (recent period, used as a proxy for “production”)\n", "\n", "Here, `ref_data.csv` and `prod_data.csv` are derived from the **prediction logs**\n", "(see the generation scripts in `monitoring/`).\n", "\n", "Objective:\n", "detect statistical drift that could degrade the model's performance." ] }, { "cell_type": "markdown", "id": "095ccab3", "metadata": {}, "source": [ "## Methodology\n", "\n", "We compare the distribution of the model's output variable\n", "(`probability_default`) between:\n", "\n", "- a **reference** dataset (start of the history)\n", "- a **current** dataset (more recent data)\n", "\n", "The statistical test used is the two-sample **Kolmogorov–Smirnov (KS) test**:\n", "\n", "- $H_0$: both samples are drawn from the same distribution\n", "- $H_1$: the samples are drawn from different distributions\n", "\n", "Decision threshold: **p-value < 0.05** → drift detected.\n", "\n", "Note: only the **model output** is tested here (prediction drift).\n", "In production, this is usually complemented by a drift analysis of the input features." 
] }, { "cell_type": "markdown", "id": "1d326430", "metadata": {}, "source": [ "## Imports" ] }, { "cell_type": "code", "execution_count": 6, "id": "7e01244e", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from scipy.stats import ks_2samp" ] }, { "cell_type": "markdown", "id": "af52d879", "metadata": {}, "source": [ "## Loading the data" ] }, { "cell_type": "code", "execution_count": 7, "id": "d445c3ab", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "( probability_default\n", " 0 0.0405\n", " 1 0.0405\n", " 2 0.0229\n", " 3 0.0202,\n", " probability_default\n", " 0 0.2363\n", " 1 0.0422\n", " 2 0.2363\n", " 3 0.2504\n", " 4 0.2391)" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Paths to the reference and current (production proxy) datasets\n", "ref_path = \"./ref_data.csv\"\n", "prod_path = \"./prod_data.csv\"\n", "\n", "ref_df = pd.read_csv(ref_path)\n", "prod_df = pd.read_csv(prod_path)\n", "\n", "# Preview the first rows of each dataset\n", "ref_df.head(), prod_df.head()\n" ] }, { "cell_type": "markdown", "id": "bd8ab615", "metadata": {}, "source": [ "## Selecting the variable to analyze" ] }, { "cell_type": "code", "execution_count": 8, "id": "19846395", "metadata": {}, "outputs": [], "source": [ "feature = \"probability_default\"\n", "\n", "# Extract the model output column from each dataset\n", "ref_values = ref_df[feature]\n", "prod_values = prod_df[feature]\n" ] }, { "cell_type": "markdown", "id": "e737c5e1", "metadata": {}, "source": [ "## Drift test (KS)" ] }, { "cell_type": "code", "execution_count": 9, "id": "d6038d6d", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "📊 KS test results\n", "KS statistic: 1.0000\n", "P-value: 0.015873\n", "⚠️ Drift detected\n" ] } ], "source": [ "# Two-sample KS test comparing the reference and current distributions\n", "ks_stat, p_value = ks_2samp(ref_values, prod_values)\n", "\n", "print(\"📊 KS test results\")\n", "print(f\"KS statistic: {ks_stat:.4f}\")\n", "print(f\"P-value: {p_value:.6f}\")\n", "\n", "# Flag drift when the p-value falls below the 0.05 threshold\n", "if p_value < 0.05:\n", "    print(\"⚠️ Drift detected\")\n", "else:\n", "    print(\"✅ No drift detected\")\n" ] }, { 
"cell_type": "markdown", "id": "bdb19396", "metadata": {}, "source": [ "## Descriptive statistics" ] }, { "cell_type": "code", "execution_count": 10, "id": "d8df5ed4", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
| | Référence | Production |\n",
"|---|---|---|\n",
"| count | 4.000000 | 5.000000 |\n",
"| mean | 0.031025 | 0.200860 |\n",
"| std | 0.010996 | 0.088884 |\n",
"| min | 0.020200 | 0.042200 |\n",
"| 25% | 0.022225 | 0.236300 |\n",
"| 50% | 0.031700 | 0.236300 |\n",
"| 75% | 0.040500 | 0.239100 |\n",
"| max | 0.040500 | 0.250400 |\n", "