p2ov/streamlit_app

p2ov committed on Jul 8, 2025

Commit 12161ea (parent: 3f55c93)

init project

Files changed (18)

Dockerfile +17 -0
README.md +82 -0
aws/s3_credentials.csv +2 -0
data/2024_nox.csv +0 -0
data/2024_o3.csv +0 -0
data/2024_pm10.csv +0 -0
data/2024_pm25.csv +0 -0
data/input_data.csv +7 -0
data/input_data.csv:Zone.Identifier +3 -0
data/sample_output_data.csv +0 -0
data/sample_output_data.csv:Zone.Identifier +3 -0
etl_process.py +100 -0
jenkins/Jenkinsfile +69 -0
jenkins/Jenkinsfile:Zone.Identifier +3 -0
requirements.txt +2 -0
tests/requirements.txt +4 -0
tests/test_etl.py +52 -0
tests/upload_s3.py +19 -0

Dockerfile ADDED

	@@ -0,0 +1,17 @@

+# Use an official Python runtime as a parent image
+FROM python:3.9-slim
+# Set the working directory inside the container
+WORKDIR /app
+# Copy the current directory contents into the container
+COPY . /app
+# Install the packages listed in requirements.txt
+RUN pip install --no-cache-dir -r requirements.txt
+# Declare the data directory (input/output CSV files) as a volume mount point
+VOLUME ["/app/data"]
+# Run the application (by default, run the main ETL process)
+CMD ["python", "etl_process.py"]

README.md ADDED

	@@ -0,0 +1,82 @@

+# Jenkins Configuration and CI/CD Pipeline Guide
+## 📌 Introduction
+This project implements a CI/CD pipeline in Jenkins that runs an ETL process in a Docker container. The pipeline includes unit tests and uploads the test results to AWS S3.
+## 📂 Project structure
+```bash
+├── data/
+│   ├── input_data.csv              # Input file for the ETL
+│   ├── sample_output_data.csv      # Expected output of the ETL process (sample)
+│
+├── jenkins/
+│   └── Jenkinsfile                 # Jenkins pipeline for CI/CD
+│
+├── tests/
+│   ├── Dockerfile                  # Dockerfile used to run the tests
+│   ├── requirements.txt            # Test-specific dependencies
+│   ├── test_etl.py                 # Unit tests for the ETL pipeline
+│   └── upload_s3.py                # Script that uploads test results to S3
+│
+├── Dockerfile                      # Containerization of the main project
+├── etl_process.py                  # Main script of the ETL process
+├── requirements.txt                # Project dependencies
+└── README.md                       # Project documentation
+```
+---
+## 🚀 Configuration steps
+### 1️⃣ Jenkins setup
+Make sure Jenkins is installed and running.
+### 2️⃣ Adding the environment variables
+The AWS variables used for the S3 upload must be added to Jenkins.
+1. **Go to Jenkins** → **Manage Jenkins** → **Manage Credentials**
+2. Select **(global)** → **Add Credentials**
+3. Add the AWS keys as **Secret Text**:
+   - **AWS_ACCESS_KEY_ID**
+   - **AWS_SECRET_ACCESS_KEY**
+Alternatively, add them directly in Jenkins:
+1. **Go to Jenkins** → **Manage Jenkins** → **Configure System**
+2. Under **Global Properties** → **Environment Variables**, add:
+   - `AWS_ACCESS_KEY_ID = YOUR_KEY`
+   - `AWS_SECRET_ACCESS_KEY = YOUR_SECRET_KEY`
+### 3️⃣ Configuring the Jenkins job
+1. **Create a new Jenkins job** → **Pipeline**
+2. **Select Pipeline Script from SCM**
+3. **Add the GitHub repository** that contains the `Jenkinsfile`
+4. **Save and run**
+---
+## 🏗️ How the pipeline works
+### 🛠️ Pipeline stages:
+1. **Clone the repository**: fetches the code from GitHub.
+2. **Run the tests**:
+   - Runs the tests in a Docker container.
+   - Saves the results as XML.
+   - Uploads the results to S3.
+3. **Build the ETL**:
+   - Builds the Docker image for the ETL.
+4. **Run the ETL**:
+   - Mounts the CSV files and runs the processing.
+   - Saves the output to `data/output_data.csv`.
+---
+This CI/CD pipeline automates the integration and deployment of the ETL process using Jenkins and Docker.
+🔥 Feel free to adapt the configuration to your environment!
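
The README requires two AWS variables to be available to the pipeline. A minimal sketch of a pre-flight check follows; the helper name `missing_aws_vars` is hypothetical and not part of the repository, and the variable list is taken only from the README.

```python
import os
from typing import List, Mapping

# Names taken from the README. AWS_DEFAULT_REGION is left out because the
# README does not list it as required.
REQUIRED_AWS_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED_AWS_VARS if not env.get(name)]
```

A script such as the S3 upload could call this first and stop with a clear message when the returned list is not empty, instead of failing later inside the AWS client.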

aws/s3_credentials.csv ADDED

	@@ -0,0 +1,2 @@


+User name,Password,Console sign-in URL
+aws_quality_air_project,[REDACTED],https://020895663224.signin.aws.amazon.com/console

data/2024_nox.csv ADDED

File without changes

data/2024_o3.csv ADDED

File without changes

data/2024_pm10.csv ADDED

File without changes

data/2024_pm25.csv ADDED

File without changes

data/input_data.csv ADDED

	@@ -0,0 +1,7 @@

+id,name,age,city,salary
+1,John Doe,28,New York,70000
+2,Jane Smith,34,Los Angeles,80000
+3,Bob Johnson,45,Chicago,90000
+4,Alice Williams,29,San Francisco,85000
+5,Charlie Brown,NaN,Houston,65000
+6,Eve Davis,38,Boston,95000
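
To make the effect of the transform step on this sample concrete, here is a small sketch that uses only the standard library `csv` module. It is not the repository's pandas code; it mimics the same rules (drop rows with a missing value, flat 10% tax, net salary) under the assumption that an empty field or the literal `NaN` counts as missing. The name `transform_rows` is hypothetical.

```python
import csv
import io

SAMPLE = """id,name,age,city,salary
1,John Doe,28,New York,70000
2,Jane Smith,34,Los Angeles,80000
3,Bob Johnson,45,Chicago,90000
4,Alice Williams,29,San Francisco,85000
5,Charlie Brown,NaN,Houston,65000
6,Eve Davis,38,Boston,95000
"""

# Assumption: these two spellings are treated as missing values.
MISSING = {"", "NaN"}


def transform_rows(text):
    """Drop rows with a missing field, then add tax and net_salary."""
    out = []
    for row in csv.DictReader(io.StringIO(text)):
        if any(value in MISSING for value in row.values()):
            continue  # same effect as dropna() on this sample
        salary = float(row["salary"])
        tax = salary * 0.1  # flat 10% rate, as in etl_process.py
        out.append({**row, "tax": tax, "net_salary": salary - tax})
    return out


rows = transform_rows(SAMPLE)
print(len(rows))  # 5: the row with the missing age is dropped
```

Of the six input rows, only Charlie Brown's is removed, because its `age` field is `NaN`.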

data/input_data.csv:Zone.Identifier ADDED

	@@ -0,0 +1,3 @@

+[ZoneTransfer]
+ZoneId=3
+ReferrerUrl=C:\Users\poove\Downloads\paycare.zip

data/sample_output_data.csv ADDED

File without changes

data/sample_output_data.csv:Zone.Identifier ADDED

	@@ -0,0 +1,3 @@

+[ZoneTransfer]
+ZoneId=3
+ReferrerUrl=C:\Users\poove\Downloads\paycare.zip

etl_process.py ADDED

	@@ -0,0 +1,100 @@

+import pandas as pd
+import os
+# Step 1: Extract
+def extract_data(file_path):
+    """Extracts data from a CSV file."""
+    try:
+        data = pd.read_csv(file_path)
+        print("Data extraction successful.")
+        return data
+    except Exception as e:
+        print(f"Error in data extraction: {e}")
+        return None
+# Step 2: Transform
+def transform_data(data):
+    """Transforms the data by cleaning and adding new features."""
+    try:
+        # Drop rows with missing values
+        data_cleaned = data.dropna().copy()
+        # Add a new column for Tax (assuming a flat 10% tax rate on salary)
+        # data_cleaned["tax"] = data_cleaned["salary"] * 0.1
+        data_cleaned.loc[:, "tax"] = data_cleaned["salary"] * 0.1
+        # Calculate net salary after tax
+        # data_cleaned["net_salary"] = data_cleaned["salary"] - data_cleaned["tax"]
+        data_cleaned.loc[:, "net_salary"] = data_cleaned["salary"] - data_cleaned["tax"]
+        # data_cleaned["net_salary"] = model.predict(X)
+        print("Data transformation successful.")
+        return data_cleaned
+    except Exception as e:
+        print(f"Error in data transformation: {e}")
+        return None
+# # Step 3: Load
+# def load_data(data, output_file_path):
+#     """Loads the transformed data into a new CSV file."""
+#     try:
+#         data.to_csv(output_file_path, index=False)
+#         print(f"Data loaded successfully to {output_file_path}.")
+#     except Exception as e:
+#         print(f"Error in data loading: {e}")
+# # Main ETL function
+# def etl_process(input_file, output_file):
+#     data = extract_data(input_file)
+#     if data is not None:
+#         transformed_data = transform_data(data)
+#         if transformed_data is not None:
+#             load_data(transformed_data, output_file)
+# if __name__ == "__main__":
+#     input_file = "input_data.csv"
+#     output_file = "output_data.csv"
+#     etl_process(input_file, output_file)
+# Step 3: Load
+def load_data(data, output_file_path):
+    """Loads the transformed data into a new CSV file."""
+    try:
+        # Make sure the output directory exists (dirname is "" for a bare filename)
+        output_dir = os.path.dirname(output_file_path)
+        if output_dir and not os.path.exists(output_dir):
+            os.makedirs(output_dir)
+            print(f"📂 Created missing directory: {output_dir}")
+        # Save the file
+        data.to_csv(output_file_path, index=False)
+        print(f"✅ Data loaded successfully to {output_file_path}.")
+    except Exception as e:
+        print(f"❌ Error in data loading: {e}")
+# Main ETL function
+def etl_process(input_file, output_file):
+    print("🚀 Starting ETL Process...")
+    data = extract_data(input_file)
+    if data is None:
+        print("❌ ETL Process aborted: extraction failed.")
+        return
+    transformed_data = transform_data(data)
+    if transformed_data is None:
+        print("❌ ETL Process aborted: transformation failed.")
+        return
+    load_data(transformed_data, output_file)
+    print("✅ ETL Process Completed!")
+if __name__ == "__main__":
+    input_file = "data/input_data.csv"  # Make sure the file is present
+    output_file = "data/output_data.csv"  # Written into `data/`
+    etl_process(input_file, output_file)
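
One edge case in the load step is worth spelling out: `os.path.dirname` returns an empty string for a bare filename such as `output_data.csv`, and `os.makedirs("")` raises `FileNotFoundError`. A minimal sketch of a guard follows; the helper name `ensure_parent_dir` is hypothetical and not part of the repository.

```python
import os
import tempfile


def ensure_parent_dir(path: str) -> str:
    """Create the parent directory of path if needed and return it."""
    parent = os.path.dirname(path)
    if parent:  # "" means the current directory, which already exists
        os.makedirs(parent, exist_ok=True)
    return parent


# Nested path: the missing "data" directory is created.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "data", "output_data.csv")
    ensure_parent_dir(target)
    created = os.path.isdir(os.path.join(tmp, "data"))

# Bare filename: nothing to create, and no exception is raised.
bare = ensure_parent_dir("output_data.csv")
```

Using `exist_ok=True` also removes the separate `os.path.exists` check, which avoids a race if two processes create the directory at the same time.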

jenkins/Jenkinsfile ADDED

	@@ -0,0 +1,69 @@

+pipeline {
+    agent any
+    environment {
+        TEST_IMAGE = 'paycare-tests'
+        ETL_IMAGE = 'paycare-etl'
+        AWS_ACCESS_KEY_ID = credentials('AWS_ACCESS_KEY_ID')
+        AWS_SECRET_ACCESS_KEY = credentials('AWS_SECRET_ACCESS_KEY')
+        AWS_DEFAULT_REGION = 'eu-north-1'
+    }
+    stages {
+        stage('Clone Repository') {
+            steps {
+                git branch: 'main', url: 'https://github.com/semarmehdi/paycare.git'
+            }
+        }
+        stage('Build Test Container') {
+            steps {
+                sh 'docker build -t ${TEST_IMAGE} -f tests/Dockerfile .'
+            }
+        }
+        stage('Run Unit Tests') {
+            steps {
+                sh '''
+                    docker run --rm \
+                        --env AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID} \
+                        --env AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY} \
+                        --env AWS_DEFAULT_REGION=${AWS_DEFAULT_REGION} \
+                        ${TEST_IMAGE}
+                '''
+            }
+        }
+        stage('Build ETL Container') {
+            steps {
+                sh 'docker build -t ${ETL_IMAGE} .'
+            }
+        }
+        stage('Run ETL in Docker') {
+            steps {
+                script {
+                    sh '''
+                        docker run --rm \
+                        -v ${WORKSPACE}/data:/app/data \
+                        ${ETL_IMAGE}
+                    '''
+                    sh 'ls -l ${WORKSPACE}/data'
+                }
+            }
+        }
+    }
+    post {
+        success {
+            echo '✅ ETL Pipeline completed successfully!'
+            archiveArtifacts artifacts: 'data/output_data.csv', fingerprint: true
+        }
+        failure {
+            echo '❌ ETL Pipeline failed.'
+        }
+    }
+}
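
The 'Run ETL in Docker' stage bind-mounts the workspace `data` directory onto `/app/data` so that the output CSV written inside the container lands in the Jenkins workspace. The same command can be sketched as an argument list in Python for local experiments; the function name `etl_run_command` and the example workspace path are hypothetical.

```python
from typing import List


def etl_run_command(workspace: str, image: str = "paycare-etl") -> List[str]:
    """Build the argv equivalent to the 'Run ETL in Docker' stage."""
    return [
        "docker", "run", "--rm",
        # Host data directory mounted where etl_process.py reads and writes.
        "-v", f"{workspace}/data:/app/data",
        image,
    ]


# Hypothetical workspace path, for illustration only.
cmd = etl_run_command("/var/jenkins_home/workspace/paycare")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where Docker is installed; building it as a list rather than a shell string avoids quoting problems when the workspace path contains spaces.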

jenkins/Jenkinsfile:Zone.Identifier ADDED

	@@ -0,0 +1,3 @@

+[ZoneTransfer]
+ZoneId=3
+ReferrerUrl=C:\Users\poove\Downloads\paycare.zip

requirements.txt ADDED

	@@ -0,0 +1,2 @@


+pandas
+pytest

tests/requirements.txt ADDED

	@@ -0,0 +1,4 @@

+pandas==2.0.3
+numpy==1.24.4
+pytest
+boto3

tests/test_etl.py ADDED

	@@ -0,0 +1,52 @@

+import pytest
+import pandas as pd
+from io import StringIO
+from etl_process import extract_data, transform_data, load_data
+# Test for data extraction
+def test_extract_data():
+    csv_data = StringIO(
+        "employee_id,employee_name,salary\n101,Alice,5000\n102,Bob,6000"
+    )
+    data = extract_data(csv_data)  # exercise the function under test, not pandas directly
+    assert data is not None
+    assert len(data) == 2
+# Test for data transformation
+def test_transform_data():
+    data = pd.DataFrame(
+        {
+            "employee_id": [101, 102],
+            "employee_name": ["Alice", "Bob"],
+            "salary": [5000, 6000],
+        }
+    )
+    transformed_data = transform_data(data)
+    assert "tax" in transformed_data.columns
+    assert "net_salary" in transformed_data.columns
+    assert transformed_data["tax"][0] == 500  # 10% of 5000
+    assert transformed_data["net_salary"][0] == 4500  # 5000 - 500
+# Test for data loading
+def test_load_data(tmpdir):
+    data = pd.DataFrame(
+        {
+            "employee_id": [101],
+            "employee_name": ["Alice"],
+            "salary": [5000],
+            "tax": [500],
+            "net_salary": [4500],
+        }
+    )
+    output_file = tmpdir.join("output_data.csv")
+    load_data(data, str(output_file))
+    loaded_data = pd.read_csv(output_file)
+    assert len(loaded_data) == 1
+    assert loaded_data["employee_name"][0] == "Alice"
+    assert loaded_data["net_salary"][0] == 4500

tests/upload_s3.py ADDED

	@@ -0,0 +1,19 @@

+import boto3
+# Initialize the S3 client
+s3_client = boto3.client("s3")
+# Define the parameters
+# file_name = "unit-tests.xml"  # Name of the XML file generated by pytest
+file_name = "results.xml"
+bucket_name = "jedhamehdi"
+s3_key = "test-results/unit-tests.xml"  # Path inside the bucket
+# Upload the file
+try:
+    s3_client.upload_file(file_name, bucket_name, s3_key)
+    print(
+        f"✅ File '{file_name}' uploaded successfully to 's3://{bucket_name}/{s3_key}'"
+    )
+except Exception as e:
+    print(f"❌ Error while uploading to S3: {e}")
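
The script above only prints on failure, so a failed upload does not change the exit status. One possible restructuring, sketched here under assumptions, wraps the upload in a function that takes the client as a parameter and returns a success flag. `upload_results` and `FakeClient` are hypothetical names; the only client call used is `upload_file(Filename, Bucket, Key)`, which the original script already relies on.

```python
from typing import Any


def upload_results(client: Any, file_name: str, bucket: str, key: str) -> bool:
    """Upload file_name to s3://bucket/key and return True on success."""
    try:
        client.upload_file(file_name, bucket, key)
    except Exception as exc:  # the original script also catches broadly
        print(f"Upload failed: {exc}")
        return False
    return True


class FakeClient:
    """Hypothetical stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def upload_file(self, file_name, bucket, key):
        self.calls.append((file_name, bucket, key))


class BrokenClient:
    """Hypothetical stand-in whose upload always fails."""

    def upload_file(self, file_name, bucket, key):
        raise RuntimeError("simulated failure")


fake = FakeClient()
ok = upload_results(fake, "results.xml", "jedhamehdi", "test-results/unit-tests.xml")
failed = upload_results(BrokenClient(), "results.xml", "jedhamehdi", "k")
```

In CI the real client would come from `boto3.client("s3")`, and the caller could exit with a non-zero status when the function returns `False` so that the Jenkins stage fails visibly.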