Spaces: ASI-Engineer/oc_p5


ASI-Engineer committed 22 days ago

Commit 1d67466 (verified) · 1 parent: 814e2ba

Upload folder using huggingface_hub


Files changed (6)

.flake8 +8 -3
.gitignore +7 -0
README.md +38 -48
app.py +174 -5
main.py +176 -0
requirements.txt +16 -38

.flake8 CHANGED

@@ -1,5 +1,6 @@
 [flake8]
 # Exclude dirs to skip third-party libs and noise (venv, git, etc.)
+ignore = W503, E501
 exclude =
     .venv,
     .git,
@@ -9,8 +10,12 @@ exclude =
     .cache,
     .eggs,
     build,
-    dist
+    dist,
+    mlruns
 # Max line length for Black compatibility (default 88 vs PEP 8's 79)
 max-line-length = 88
-# Ignore E501 if too strict (optional; remove if you want to fix long lines)
-ignore = E501
+# Ignore certain warnings for example scripts (non-critical)
+per-file-ignores =
+    examples/*.py:F541,E722,F841
+    tests/test_mlflow_*.py:F401,E402,F811,F541

.gitignore CHANGED

@@ -35,3 +35,10 @@ Thumbs.db
 secrets.json
 data/raw/  # For large data science datasets (OC_P5)
 notebooks/*.ipynb_checkpoints/
+# MLflow (logs only; keep DB and runs for HF deployment)
+mlflow.db-shm
+mlflow.db-wal
+mlflow_ui.log
+mlflow_comparison.png
+nohup.out

README.md CHANGED

@@ -1,67 +1,57 @@
 ---
 title: OC P5 - API ML Déployée
-emoji: 🚀
+emoji: 🎯
 colorFrom: blue
-colorTo: purple
-sdk: static
+colorTo: green
+sdk: gradio
+sdk_version: 5.9.1
 app_file: app.py
 pinned: false
+license: mit
 ---
-# ML Deployment Project
-Deployment of an ML model for Futurisys: FastAPI API, PostgreSQL, Pytest tests, CI/CD.
-## Overview
-POC exposing an ML model through a performant API, with database traceability and DevOps best practices.
-## Installation
-1. Clone the repo: `git clone https://github.com/ton-username/ml-deployment-project.git`
-2. Install Poetry (if not already installed): `curl -sSL https://install.python-poetry.org | python3 -`
-3. Dependencies: `poetry install` (creates/locks .venv with deps)
-4. Activate the env: `poetry shell`
-## Usage
-- Dev: `poetry run uvicorn src.main:app --reload` (Step 3 for the API).
-- DB: `poetry run python scripts/create_db.py` (Step 4).
-- Tests: `poetry run pytest` (Step 5).
-## Project Structure
-- `src/`: Core code (API, ML model).
-- `tests/`: Unit/functional tests (Pytest).
-- `docs/`: UML diagrams, API docs.
-- `scripts/`: Init utilities (DB, data loading).
-- `data/`: Datasets (ignored for privacy).
-## CI/CD Optimization
-- Pipelines configured to run in <10 min (e.g. lint ~1 min, tests ~3 min, deploy ~2 min). If >10 min, optimize via Poetry caching or parallel jobs. Observed times are based on GitHub Actions runs.
-## CI/CD Details
-- Pipeline: GitHub Actions for lint (Flake8/Black), tests (Pytest), HF deploy.
-- Environments: Dev (dev branch/local tests), Prod (main branch/HF oc_p5).
-- Secrets: HF_TOKEN secured via GitHub Secrets.
-- Standards: See [docs/standards.md](./docs/standards.md).
-## CI/CD Environments
-- Dev: Branch "dev" -> HF space oc_p5-dev for iterative testing and validation.
-- Prod: Branch "main" -> HF space oc_p5 for stable deployment.
-- Secrets: Shared HF_TOKEN (secured via GitHub Secrets) for dev/prod.
-## Branches & Conventions
-- `main`: Stable (merges via PR).
-- `main`: for development and testing
-- `feature/etapeX`: Features (kebab-case, e.g. `feature/etape3-api`).
-- Commits: Conventional (e.g. `feat: Add endpoint`).
-## Deployment & Security
-- Auth/Sec: Coming soon (JWT for the API, secrets in an ignored .env).
-- Versions: Semver tags (e.g. v1.0.0 for Step 1).
-## HF Spaces
-- Prod: https://huggingface.co/spaces/ASI-Engineer/oc_p5 (dev branch, for iterative testing).
-- Automatic sync via GitHub Actions (a push triggers a ~2 min rebuild, with HF_TOKEN secured).
-## Documentation
-- Code/ML standards
-## License
-MIT (or adapt for Futurisys).
+# 🎯 Employee Turnover Prediction - DEV Environment
+Gradio interface for testing the employee turnover (attrition) prediction model.
+## 🚀 ML Model
+- **Algorithm**: XGBoost tuned with RandomizedSearchCV
+- **Balancing**: SMOTE to handle class imbalance (5:1 ratio)
+- **Tracking**: MLflow for versioning and reproducibility
+- **Metrics**: Optimized F1-Score (0.51), Accuracy 79%
+- **Storage**: [Hugging Face Hub](https://huggingface.co/ASI-Engineer/employee-turnover-model)
+## 📊 Features
+- **Status Checker**: Check the model's status and metrics
+- **Simple API**: Gradio interface for quick tests
+- **Automatic loading**: Model downloaded from HF Hub at startup
+## 🔧 Architecture
+```python
+# Load the model from HF Hub
+model_path = hf_hub_download(
+    repo_id="ASI-Engineer/employee-turnover-model",
+    filename="model/model.pkl"
+)
+model = mlflow.sklearn.load_model(str(Path(model_path).parent))
+```
+## 📈 Metrics
+- **F1-Score**: 0.5136
+- **Accuracy**: 79%
+- **Data**: 1470 samples, 50 features
+- **Classes**: {0: 1233, 1: 237} - Ratio 5.20:1
+## 🔗 Links
+- **Model**: [employee-turnover-model](https://huggingface.co/ASI-Engineer/employee-turnover-model)
+- **GitHub**: [OC_P5](https://github.com/chaton59/OC_P5)
+- **CI/CD**: GitHub Actions with automatic deployment
+This Space is automatically synchronized via CI/CD from the `dev` branch of the GitHub repository.
+**Repository**: [chaton59/OC_P5](https://github.com/chaton59/OC_P5)
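
The metrics quoted in the README can be put in context with a little arithmetic. The sketch below uses only the class counts reported there ({0: 1233, 1: 237}); the variable names are illustrative and not part of the repository. It recomputes the imbalance ratio and the accuracy that a trivial classifier always predicting the majority class would reach.

```python
# Class counts exactly as reported in the README.
class_counts = {0: 1233, 1: 237}

n_samples = sum(class_counts.values())
imbalance_ratio = class_counts[0] / class_counts[1]

# Accuracy of a baseline that always predicts the majority class.
majority_baseline_accuracy = max(class_counts.values()) / n_samples

print(f"Samples: {n_samples}")
print(f"Imbalance ratio: {imbalance_ratio:.2f}:1")
print(f"Majority-class baseline accuracy: {majority_baseline_accuracy:.1%}")
```

With these counts the baseline sits near 84%, above the reported 79% accuracy (assuming the evaluation split has a similar class mix). That is one reason accuracy alone says little here and F1 on the minority class is the more informative figure.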

app.py CHANGED

@@ -1,8 +1,177 @@
-from fastapi import FastAPI
-app = FastAPI()
-@app.get("/")
-def root():
-    return {"status": "Hello World"}

+#!/usr/bin/env python3
+"""
+Gradio interface for testing the Employee Turnover model in production.
+Deployed on Hugging Face Spaces for quick tests.
+Demo version - full interface under development.
+"""
+import gradio as gr
+import mlflow
+import mlflow.pyfunc
+from huggingface_hub import hf_hub_download
+# Configuration
+HF_MODEL_REPO = "ASI-Engineer/employee-turnover-model"
+FALLBACK_RUN_ID = "40e43c8e425345bab3d19f27eb8fe5d8"
+def load_model():
+    """
+    Load the model from Hugging Face Hub (prod) or local MLflow (dev).
+    Priority order:
+    1. HF Hub with direct pickle loading (model deployed in production)
+    2. Local MLflow (local development)
+    """
+    # Try HF Hub first (production): load the pickle directly
+    try:
+        import joblib
+        # Download model pickle from HF Hub
+        model_path = hf_hub_download(
+            repo_id=HF_MODEL_REPO, filename="model/model.pkl", repo_type="model"
+        )
+        model = joblib.load(model_path)
+        print(f"✅ Model loaded from HF Hub: {HF_MODEL_REPO}")
+        return model, "HF Hub"
+    except Exception as e:
+        print(f"⚠️ HF Hub unavailable: {e}")
+    # Fallback: local MLflow (development only)
+    try:
+        mlflow.set_tracking_uri("sqlite:///mlflow.db")
+        # Try the Model Registry first
+        model = mlflow.pyfunc.load_model("models:/XGBoost_Employee_Turnover/latest")  # type: ignore[attr-defined]
+        print("✅ Model loaded from MLflow Model Registry")
+        return model, "MLflow Registry"
+    except Exception:
+        try:
+            # Fall back to the run ID
+            model = mlflow.pyfunc.load_model(f"runs:/{FALLBACK_RUN_ID}/model")  # type: ignore[attr-defined]
+            print(f"✅ Model loaded from MLflow run: {FALLBACK_RUN_ID}")
+            return model, "MLflow Local"
+        except Exception as e2:
+            print(f"❌ MLflow loading error: {e2}")
+            return None, "Error"
+# Load the model at startup
+try:
+    model, model_source = load_model()
+    MODEL_LOADED = model is not None
+except Exception as e:
+    print(f"❌ Error while loading the model: {e}")
+    MODEL_LOADED = False
+    model = None
+    model_source = "Error"
+def get_model_info():
+    """Return information about the model."""
+    if not MODEL_LOADED:
+        return {
+            "status": "❌ Model unavailable",
+            "error": "The model could not be loaded",
+            "solution": "Check that the model is published on HF Hub or trained locally",
+        }
+    try:
+        info = {
+            "status": "✅ Model loaded successfully",
+            "source": model_source,
+            "model_type": type(model).__name__,
+            "features": "~50 features (after preprocessing)",
+            "algorithme": "XGBoost + SMOTE",
+            "hf_hub_repo": HF_MODEL_REPO if model_source == "HF Hub" else "N/A",
+        }
+        # If the model came from local MLflow, add the metrics
+        if model_source == "MLflow Local":
+            mlflow.set_tracking_uri("sqlite:///mlflow.db")
+            client = mlflow.MlflowClient()
+            runs = client.search_runs(
+                experiment_ids=["1"], order_by=["start_time DESC"], max_results=1
+            )
+            if runs:
+                run = runs[0]
+                metrics = run.data.metrics
+                info.update(
+                    {
+                        "run_id": run.info.run_id[:8],
+                        "f1_score": f"{metrics.get('f1_score', 0):.4f}",
+                        "accuracy": f"{metrics.get('accuracy', 0):.4f}",
+                    }
+                )
+        info["info"] = "Prediction interface under development - FastAPI API coming soon"
+        return info
+    except Exception as e:
+        return {"status": "✅ Model loaded (limited info)", "error": str(e)}
+# Gradio interface
+with gr.Blocks(  # type: ignore[attr-defined]
+    title="Employee Turnover Prediction - DEV", theme=gr.themes.Soft()  # type: ignore[attr-defined]
+) as demo:
+    gr.Markdown("# 🎯 Turnover Prediction - Employee Attrition")  # type: ignore[attr-defined]
+    gr.Markdown("## DEV Environment - CI/CD deployment test")  # type: ignore[attr-defined]
+    gr.Markdown(  # type: ignore[attr-defined]
+        """
+    ### 📊 Project status
+    This Space is automatically synchronized from GitHub (`dev` branch).
+    **Currently available:**
+    - ✅ Complete MLflow training pipeline (`main.py`)
+    - ✅ Automatic CI/CD deployment (GitHub Actions → HF Spaces)
+    - ✅ Automated unit tests and linting
+    **In development:**
+    - 🚧 Interactive prediction interface
+    - 🚧 FastAPI API with prediction endpoints
+    - 🚧 PostgreSQL integration for prediction tracking
+    """
+    )
+    with gr.Row():  # type: ignore[attr-defined]
+        with gr.Column():  # type: ignore[attr-defined]
+            gr.Markdown("### 🔍 Model information")  # type: ignore[attr-defined]
+            check_btn = gr.Button("📊 Check model status", variant="primary")  # type: ignore[attr-defined]
+        with gr.Column():  # type: ignore[attr-defined]
+            model_output = gr.JSON(label="Status")  # type: ignore[attr-defined]
+    check_btn.click(fn=get_model_info, inputs=[], outputs=model_output)
+    gr.Markdown("---")  # type: ignore[attr-defined]
+    gr.Markdown(  # type: ignore[attr-defined]
+        """
+    ### 🛠️ Next steps (per etapes.txt)
+    1. **Step 3**: FastAPI API development
+       - Prediction endpoints with Pydantic validation
+       - Dynamic loading of preprocessing artifacts (scaler, encoders)
+       - Automatic Swagger/OpenAPI documentation
+    2. **Step 4**: PostgreSQL integration
+       - Storage of prediction inputs/outputs
+       - Full request traceability
+    3. **Step 5**: Unit and functional tests
+       - API endpoint tests
+       - Load and performance tests
+       - Code coverage with pytest-cov
+    ### 📚 Documentation
+    - **GitHub repository**: [chaton59/OC_P5](https://github.com/chaton59/OC_P5)
+    - **MLflow Tracking**: Available locally (`./scripts/start_mlflow.sh`)
+    - **Metrics**: Optimized F1-Score, imbalanced class handling (SMOTE)
+    """
+    )
+if __name__ == "__main__":
+    demo.launch(server_name="0.0.0.0", server_port=7860, share=False)
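
The `load_model` function above tries several sources in a fixed priority order and reports which one succeeded. The same idea can be sketched generically, without any dependency on `huggingface_hub` or `mlflow`. Everything below (`load_with_fallback`, the two dummy loaders) is a hypothetical illustration, not code from this commit.

```python
from typing import Any, Callable, Optional

# A loader takes no arguments and either returns a model object or raises.
Loader = Callable[[], Any]


def load_with_fallback(
    loaders: list[tuple[str, Loader]],
) -> tuple[Optional[Any], str]:
    """Return (model, source_label) from the first loader that succeeds."""
    for label, loader in loaders:
        try:
            return loader(), label
        except Exception as exc:  # broad on purpose: any failure means "try next"
            print(f"{label} unavailable: {exc}")
    # Mirrors app.py's convention of returning (None, "Error") when all fail.
    return None, "Error"


def _failing_hub_loader() -> Any:
    raise ConnectionError("hub offline")  # simulated remote failure


def _local_loader() -> Any:
    return {"kind": "dummy-model"}  # stand-in for a real model object


model, source = load_with_fallback(
    [("HF Hub", _failing_hub_loader), ("MLflow Local", _local_loader)]
)
print(source)
```

Keeping the sources in an ordered list makes the priority explicit and avoids nesting one `try` inside another as more fallbacks are added.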

main.py ADDED

@@ -0,0 +1,176 @@

+#!/usr/bin/env python3
+"""
+Main training pipeline for the Employee Turnover model.
+This script chains:
+1. Data loading and preprocessing
+2. Training the XGBoost model with RandomizedSearchCV and SMOTE
+3. Logging results to MLflow (params, metrics, artifacts, model)
+4. Saving the encoders and scaler for future use
+Usage:
+    python main.py
+The model and artifacts are recorded in MLflow for:
+- Experiment tracking
+- Reproducibility
+- Deployment via Model Registry
+"""
+from pathlib import Path
+import joblib
+import mlflow
+import mlflow.sklearn
+from ml_model.preprocess import preprocess_data
+from ml_model.train_model import train_model
+def main():
+    """Main training pipeline."""
+    print("=" * 80)
+    print("🚀 TRAINING PIPELINE - Employee Turnover Prediction")
+    print("=" * 80)
+    print()
+    # MLflow configuration
+    mlflow.set_tracking_uri("sqlite:///mlflow.db")
+    mlflow.set_experiment("Employee_Turnover_Training")
+    print("📊 MLflow configuration:")
+    print(f"   Tracking URI: {mlflow.get_tracking_uri()}")
+    print("   Experiment: Employee_Turnover_Training")
+    print()
+    # Data file paths
+    data_paths = {
+        "sondage_path": "data/extrait_sondage.csv",
+        "eval_path": "data/extrait_eval.csv",
+        "sirh_path": "data/extrait_sirh.csv",
+    }
+    # Check that the files exist
+    for name, path in data_paths.items():
+        if not Path(path).exists():
+            raise FileNotFoundError(f"❌ Missing file: {path}")
+    print("✅ Data files found")
+    print()
+    # ========================================================================
+    # STEP 1: Preprocessing
+    # ========================================================================
+    print("1️⃣  PREPROCESSING")
+    print("-" * 80)
+    X, y, scaler, onehot_encoder, ordinal_encoder = preprocess_data(data_paths)
+    print(f"   X shape: {X.shape}")
+    print(f"   y shape: {y.shape}")
+    print(f"   Classes: {y.value_counts().to_dict()}")
+    print(f"   Imbalance ratio: {(y == 0).sum() / (y == 1).sum():.2f}:1")
+    print()
+    # ========================================================================
+    # STEP 2: Training with MLflow tracking
+    # ========================================================================
+    print("2️⃣  TRAINING")
+    print("-" * 80)
+    # Training (MLflow tracking already happens inside train_model.py)
+    model, best_params, cv_f1 = train_model(X, y)
+    print("   ✅ Model trained")
+    print(f"   🏆 Best CV F1: {cv_f1:.4f}")
+    print()
+    # Get the active run so the artifacts can be saved to it
+    active_run = mlflow.active_run()
+    if active_run is None:
+        # If train_model closed its run, open a new one
+        active_run = mlflow.start_run()
+        run_id = active_run.info.run_id
+        should_end_run = True
+    else:
+        run_id = active_run.info.run_id
+        should_end_run = False
+    # Log dataset info
+    mlflow.log_param("n_samples", len(X))
+    mlflow.log_param("n_features", X.shape[1])
+    mlflow.log_param("class_ratio", f"{(y == 0).sum()}:{(y == 1).sum()}")
+    # ========================================================================
+    # STEP 3: Saving the artifacts (encoders, scaler)
+    # ========================================================================
+    print("3️⃣  SAVING ARTIFACTS")
+    print("-" * 80)
+    # Create a temporary directory for the artifacts
+    artifacts_dir = Path("artifacts_temp")
+    artifacts_dir.mkdir(exist_ok=True)
+    # Save the scaler
+    scaler_path = artifacts_dir / "scaler.joblib"
+    joblib.dump(scaler, scaler_path)
+    mlflow.log_artifact(str(scaler_path), artifact_path="preprocessing")
+    print("   ✅ Scaler saved")
+    # Save the encoders (one-hot and ordinal)
+    onehot_path = artifacts_dir / "onehot_encoder.joblib"
+    joblib.dump(onehot_encoder, onehot_path)
+    mlflow.log_artifact(str(onehot_path), artifact_path="preprocessing")
+    ordinal_path = artifacts_dir / "ordinal_encoder.joblib"
+    joblib.dump(ordinal_encoder, ordinal_path)
+    mlflow.log_artifact(str(ordinal_path), artifact_path="preprocessing")
+    print("   ✅ Encoders saved (OneHot + Ordinal)")
+    # Log the git commit if available
+    try:
+        import subprocess
+        git_commit = (
+            subprocess.check_output(["git", "rev-parse", "HEAD"])
+            .strip()
+            .decode("utf-8")
+        )
+        mlflow.set_tag("git_commit", git_commit[:8])
+        print(f"   ✅ Git commit: {git_commit[:8]}")
+    except Exception:
+        pass
+    # Clean up the temporary artifacts
+    scaler_path.unlink()
+    onehot_path.unlink()
+    ordinal_path.unlink()
+    artifacts_dir.rmdir()
+    print()
+    # Close the run if we opened it
+    if should_end_run:
+        mlflow.end_run()
+    # ========================================================================
+    # SUMMARY
+    # ========================================================================
+    print("=" * 80)
+    print("✅ TRAINING COMPLETE")
+    print("=" * 80)
+    print()
+    print(f"📊 Run ID: {run_id}")
+    print(f"🎯 F1 Score (CV): {cv_f1:.4f}")
+    print("📦 Artifacts saved to MLflow")
+    print()
+    print("🌐 To view the results:")
+    print("   ./scripts/start_mlflow.sh")
+    print("   or: mlflow ui --backend-store-uri sqlite:///mlflow.db")
+    print()
+    print("📝 To load the model:")
+    print(f"   model = mlflow.sklearn.load_model('runs:/{run_id}/model')")
+    print()
+if __name__ == "__main__":
+    main()
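
Step 3 of `main.py` writes each preprocessing object to `artifacts_temp/`, logs it, then deletes the files and directory by hand. If an exception occurs between the dump and the cleanup, the directory is left behind. One possible alternative, sketched here with the standard library only, is to let `tempfile.TemporaryDirectory` own the cleanup. The function name and the `log_artifact` callback are hypothetical; in the real pipeline that role is played by `mlflow.log_artifact`, and `joblib.dump` is used instead of `pickle`.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable


def log_objects_as_artifacts(
    objects: dict[str, Any],
    log_artifact: Callable[[str], None],
) -> None:
    """Serialize each object to a temp file and pass its path to log_artifact."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        for name, obj in objects.items():
            path = Path(tmp_dir) / f"{name}.pkl"
            with path.open("wb") as fh:
                pickle.dump(obj, fh)
            log_artifact(str(path))
    # On leaving the with-block the directory and its files are removed,
    # even if log_artifact raised.


logged_names: list[str] = []
seen_paths: list[Path] = []


def fake_log_artifact(path: str) -> None:
    # Stand-in for a tracking call: record the file while it still exists.
    assert Path(path).exists()
    logged_names.append(Path(path).name)
    seen_paths.append(Path(path))


log_objects_as_artifacts(
    {"scaler": [0.0, 1.0], "onehot_encoder": ["a", "b"]},
    fake_log_artifact,
)
print(logged_names)
```

The callback only sees each file while the temporary directory is alive, so any real logging call must copy or upload the file before returning.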

requirements.txt CHANGED

@@ -1,38 +1,16 @@
-annotated-doc==0.0.4 ; python_version >= "3.12"
-annotated-types==0.7.0 ; python_version >= "3.12"
-anyio==4.12.0 ; python_version >= "3.12"
-black==25.11.0 ; python_version >= "3.12"
-click==8.3.1 ; python_version >= "3.12"
-colorama==0.4.6 ; (platform_system == "Windows" or sys_platform == "win32") and python_version >= "3.12"
-coverage==7.12.0 ; python_version >= "3.12"
-fastapi==0.123.4 ; python_version >= "3.12"
-flake8==7.3.0 ; python_version >= "3.12"
-greenlet==3.2.4 ; python_version >= "3.12" and (platform_machine == "aarch64" or platform_machine == "ppc64le" or platform_machine == "x86_64" or platform_machine == "amd64" or platform_machine == "AMD64" or platform_machine == "win32" or platform_machine == "WIN32")
-h11==0.16.0 ; python_version >= "3.12"
-httptools==0.7.1 ; python_version >= "3.12"
-idna==3.11 ; python_version >= "3.12"
-iniconfig==2.3.0 ; python_version >= "3.12"
-mccabe==0.7.0 ; python_version >= "3.12"
-mypy-extensions==1.1.0 ; python_version >= "3.12"
-packaging==25.0 ; python_version >= "3.12"
-pathspec==0.12.1 ; python_version >= "3.12"
-platformdirs==4.5.0 ; python_version >= "3.12"
-pluggy==1.6.0 ; python_version >= "3.12"
-pycodestyle==2.14.0 ; python_version >= "3.12"
-pydantic-core==2.41.5 ; python_version >= "3.12"
-pydantic==2.12.5 ; python_version >= "3.12"
-pyflakes==3.4.0 ; python_version >= "3.12"
-pygments==2.19.2 ; python_version >= "3.12"
-pytest-cov==7.0.0 ; python_version >= "3.12"
-pytest==9.0.1 ; python_version >= "3.12"
-python-dotenv==1.2.1 ; python_version >= "3.12"
-pytokens==0.3.0 ; python_version >= "3.12"
-pyyaml==6.0.3 ; python_version >= "3.12"
-sqlalchemy==2.0.44 ; python_version >= "3.12"
-starlette==0.50.0 ; python_version >= "3.12"
-typing-extensions==4.15.0 ; python_version >= "3.12"
-typing-inspection==0.4.2 ; python_version >= "3.12"
-uvicorn==0.38.0 ; python_version >= "3.12"
-uvloop==0.22.1 ; sys_platform != "win32" and sys_platform != "cygwin" and platform_python_implementation != "PyPy" and python_version >= "3.12"
-watchfiles==1.1.1 ; python_version >= "3.12"
-websockets==15.0.1 ; python_version >= "3.12"

+# Core dependencies
+black==25.11.0
+flake8==7.3.0
+pytest==9.0.1
+pytest-cov==7.0.0
+# ML dependencies
+scikit-learn==1.6.1
+xgboost==2.1.4
+imbalanced-learn==0.13.0
+scipy==1.14.1
+numpy==2.0.2
+pandas==2.2.3
+joblib==1.4.2
+mlflow==3.8.0
+gradio==5.9.1