{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "d63d91d0",
   "metadata": {},
   "source": [
    "# Data Drift Analysis\n",
    "\n",
    "As part of deploying the credit scoring model, we set up\n",
    "a drift analysis between:\n",
    "\n",
    "- a **reference dataset** (stable / historical period)\n",
    "- a **current dataset** (recent period, used as a proxy for production)\n",
    "\n",
    "Here, `ref_data.csv` and `prod_data.csv` are built from the **prediction logs**\n",
    "(see the generation scripts in `monitoring/`).\n",
    "\n",
    "Goal:\n",
    "detect statistical drift that could affect the model's performance."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "095ccab3",
   "metadata": {},
   "source": [
    "## Methodology\n",
    "\n",
    "We compare the distribution of the model's output variable\n",
    "(`probability_default`) between:\n",
    "\n",
    "- a **reference** dataset (start of the history)\n",
    "- a **current** dataset (more recent data)\n",
    "\n",
    "The statistical test used is the two-sample **Kolmogorov-Smirnov (KS) test**:\n",
    "\n",
    "- $H_0$: both samples are drawn from the same distribution\n",
    "- $H_1$: the samples are drawn from different distributions\n",
    "\n",
    "Decision rule: **p-value < 0.05** → drift detected.\n",
    "\n",
    "Note: only the **model output** is tested here (prediction drift).\n",
    "In production, this is usually complemented by a drift analysis of the input features."
   ]
  },
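  {
   "cell_type": "markdown",
   "id": "a1f4c2d7",
   "metadata": {},
   "source": [
    "For reference, the two-sample KS statistic behind this test can be written as below. The symbols\n",
    "$F_n$, $G_m$, $n$ and $m$ are our own notation, not names taken from the project code.\n",
    "\n",
    "$$D_{n,m} = \\sup_x \\left| F_n(x) - G_m(x) \\right|$$\n",
    "\n",
    "Here $F_n$ is the empirical cumulative distribution function of the $n$ reference values and $G_m$\n",
    "that of the $m$ current values. $D_{n,m}$ lies between 0 and 1: values near 0 mean the two empirical\n",
    "distributions nearly coincide, and 1 means the two samples do not overlap."
   ]
  },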
  {
   "cell_type": "markdown",
   "id": "1d326430",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "7e01244e",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from scipy.stats import ks_2samp"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "af52d879",
   "metadata": {},
   "source": [
    "## Loading the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "d445c3ab",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(   probability_default\n",
       " 0               0.0405\n",
       " 1               0.0405\n",
       " 2               0.0229\n",
       " 3               0.0202,\n",
       "    probability_default\n",
       " 0               0.2363\n",
       " 1               0.0422\n",
       " 2               0.2363\n",
       " 3               0.2504\n",
       " 4               0.2391)"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "ref_path = \"./ref_data.csv\"\n",
    "prod_path = \"./prod_data.csv\"\n",
    "\n",
    "ref_df = pd.read_csv(ref_path)\n",
    "prod_df = pd.read_csv(prod_path)\n",
    "\n",
    "ref_df.head(), prod_df.head()\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bd8ab615",
   "metadata": {},
   "source": [
    "## Selecting the variable to analyze"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "19846395",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature = \"probability_default\"\n",
    "\n",
    "ref_values = ref_df[feature]\n",
    "prod_values = prod_df[feature]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e737c5e1",
   "metadata": {},
   "source": [
    "## Drift test (KS)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "d6038d6d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "📊 KS test results\n",
      "KS statistic : 1.0000\n",
      "P-value      : 0.015873\n",
      "⚠️ Drift detected\n"
     ]
    }
   ],
   "source": [
    "# Two-sample KS test on the predicted default probabilities\n",
    "ks_stat, p_value = ks_2samp(ref_values, prod_values)\n",
    "\n",
    "print(\"📊 KS test results\")\n",
    "print(f\"KS statistic : {ks_stat:.4f}\")\n",
    "print(f\"P-value      : {p_value:.6f}\")\n",
    "\n",
    "# Decision threshold: alpha = 0.05\n",
    "if p_value < 0.05:\n",
    "    print(\"⚠️ Drift detected\")\n",
    "else:\n",
    "    print(\"✅ No drift detected\")\n"
   ]
  },
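  {
   "cell_type": "markdown",
   "id": "b7e2d913",
   "metadata": {},
   "source": [
    "A quick check of the p-value printed above, as a worked calculation. It assumes SciPy used its exact\n",
    "method (the default for small samples) and ignores ties; $n$ and $m$ are our own notation for the\n",
    "reference and current sample sizes (4 and 5, per the `count` row of the descriptive statistics).\n",
    "\n",
    "A KS statistic of 1 means the two samples are completely separated (here, every reference value lies\n",
    "below every current value). Under $H_0$, each of the $\\binom{n+m}{n}$ ways of placing the reference\n",
    "values among the pooled, sorted observations is equally likely, and only 2 of them separate the\n",
    "samples completely:\n",
    "\n",
    "$$p = \\frac{2}{\\binom{n+m}{n}} = \\frac{2}{\\binom{9}{4}} = \\frac{2}{126} \\approx 0.015873$$\n",
    "\n",
    "No smaller p-value is possible with these sample sizes, which illustrates how little resolution\n",
    "the test has on so few rows."
   ]
  },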
  {
   "cell_type": "markdown",
   "id": "bdb19396",
   "metadata": {},
   "source": [
    "## Descriptive statistics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "d8df5ed4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Référence</th>\n",
       "      <th>Production</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>4.000000</td>\n",
       "      <td>5.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>0.031025</td>\n",
       "      <td>0.200860</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>0.010996</td>\n",
       "      <td>0.088884</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>0.020200</td>\n",
       "      <td>0.042200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>0.022225</td>\n",
       "      <td>0.236300</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>0.031700</td>\n",
       "      <td>0.236300</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>0.040500</td>\n",
       "      <td>0.239100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>0.040500</td>\n",
       "      <td>0.250400</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "       Référence  Production\n",
       "count   4.000000    5.000000\n",
       "mean    0.031025    0.200860\n",
       "std     0.010996    0.088884\n",
       "min     0.020200    0.042200\n",
       "25%     0.022225    0.236300\n",
       "50%     0.031700    0.236300\n",
       "75%     0.040500    0.239100\n",
       "max     0.040500    0.250400"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "summary = pd.DataFrame({\n",
    "    \"Référence\": ref_values.describe(),\n",
    "    \"Production\": prod_values.describe()\n",
    "})\n",
    "\n",
    "summary\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5937fcb",
   "metadata": {},
   "source": [
    "## Interpreting the results\n",
    "\n",
    "The KS test indicates a **statistically significant drift** between the reference and current data:\n",
    "the p-value (0.0159) is **below 0.05**, so we reject $H_0$ and conclude that the distributions differ.\n",
    "A KS statistic of 1.0 means the two samples do not overlap at all.\n",
    "\n",
    "The descriptive statistics confirm a shift in level: on this sample, the mean in production (about 0.201)\n",
    "is much higher than the reference mean (about 0.031).\n",
    "\n",
    "Conclusion (prototype):\n",
    "- Drift detected → **to be investigated** (data quality, change in population, change in process).\n",
    "- Typical actions: analyze feature drift, monitor over time, and consider retraining\n",
    "  if the drift persists and leads to a drop in performance.\n",
    "\n",
    "Important note: the sample sizes are very small here (4 and 5 rows), so the interpretation must\n",
    "remain cautious. The main goal is to demonstrate the monitoring pipeline end to end."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9ea29e10",
   "metadata": {},
   "source": [
    "## Limitations and next steps\n",
    "\n",
    "Limitations:\n",
    "- The production data is a proxy built from the logs, not a real-time stream.\n",
    "- The analysis covers only the **model output** (not the features).\n",
    "- The sample is small: the tests can be unstable and must be confirmed on more data.\n",
    "\n",
    "Next steps:\n",
    "- Tracking over time (sliding windows, comparing week N with week N-1).\n",
    "- Automatic alert thresholds and report generation (e.g. Evidently).\n",
    "- Multi-feature analysis (input drift and data quality).\n",
    "- Conditional retraining when drift comes with a performance drop."
   ]
  },
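  {
   "cell_type": "markdown",
   "id": "c9d3a6f1",
   "metadata": {},
   "source": [
    "The temporal follow-up listed above could start from something like the cell below. This is a minimal\n",
    "sketch, not the project's monitoring code. It cuts the current scores into consecutive, non-overlapping\n",
    "windows (a simplification of true sliding windows) and runs `ks_2samp` against the reference for each one.\n",
    "The helper name `windowed_ks` and its parameters `window` and `alpha` are hypothetical, and the current\n",
    "scores are assumed to be sorted chronologically."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e5b7d9a4",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from scipy.stats import ks_2samp\n",
    "\n",
    "\n",
    "def windowed_ks(ref, cur, window, alpha=0.05):\n",
    "    \"\"\"KS test of each consecutive, non-overlapping window of `cur` against `ref`.\n",
    "\n",
    "    Hypothetical helper: `cur` is assumed to be a pandas Series sorted by time.\n",
    "    \"\"\"\n",
    "    rows = []\n",
    "    # Step by `window` so that windows do not overlap; an incomplete tail is skipped\n",
    "    for start in range(0, len(cur) - window + 1, window):\n",
    "        chunk = cur.iloc[start:start + window]\n",
    "        stat, p_value = ks_2samp(ref, chunk)\n",
    "        rows.append({\n",
    "            \"start\": start,\n",
    "            \"ks_stat\": stat,\n",
    "            \"p_value\": p_value,\n",
    "            \"drift\": p_value < alpha,\n",
    "        })\n",
    "    return pd.DataFrame(rows)\n",
    "\n",
    "\n",
    "# Usage with the series defined earlier in this notebook; window=5 is arbitrary\n",
    "# and far too small for a reliable test on real traffic.\n",
    "windowed_ks(ref_values, prod_values, window=5)\n"
   ]
  },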
  {
   "cell_type": "markdown",
   "id": "467869c1",
   "metadata": {},
   "source": [
    "## Evidently report (HTML)\n",
    "\n",
    "An Evidently report can be generated with `monitoring/generate_evidently_report.py`\n",
    "and saved to `monitoring/reports/drift_report.html`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "8daaad77",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "\n",
       "        <iframe\n",
       "            width=\"100%\"\n",
       "            height=\"800\"\n",
       "            src=\"reports\\drift_report.html\"\n",
       "            frameborder=\"0\"\n",
       "            allowfullscreen\n",
       "            \n",
       "        ></iframe>\n",
       "        "
      ],
      "text/plain": [
       "<IPython.lib.display.IFrame at 0x281cc01af90>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "from pathlib import Path\n",
    "from IPython.display import IFrame, display\n",
    "\n",
    "# Embed the report if it has already been generated\n",
    "report_path = Path(\"./reports/drift_report.html\")\n",
    "if report_path.exists():\n",
    "    # as_posix() keeps forward slashes in the iframe URL, including on Windows\n",
    "    display(IFrame(src=report_path.as_posix(), width=\"100%\", height=800))\n",
    "else:\n",
    "    print(\"Report not found. Generate it with: python monitoring/generate_evidently_report.py\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": ".venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}