Spaces: Titembaye/phiBert


Titembaye committed on Nov 11, 2025

Commit

960046a

1 Parent(s): 6a06412

Initial deployment: BERT adversarial training model

Files changed (10)

README.md +37 -5
app.py +241 -0
models/config.json +26 -0
models/model.safetensors +3 -0
models/special_tokens_map.json +7 -0
models/tokenizer.json +0 -0
models/tokenizer_config.json +56 -0
models/training_args.bin +3 -0
models/vocab.txt +0 -0
requirements.txt +3 -0

README.md CHANGED

@@ -1,13 +1,45 @@
 ---
-title: PhiBert
-emoji: 🏢
 colorFrom: red
-colorTo: yellow
 sdk: gradio
 sdk_version: 5.49.1
 app_file: app.py
 pinned: false
-license: mit
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
+title: AI Phishing Detector
+emoji: 🛡️
 colorFrom: red
+colorTo: blue
 sdk: gradio
 sdk_version: 5.49.1
 app_file: app.py
 pinned: false
 ---
+# 🛡️ AI-Powered Phishing Detector
+A phishing detection application using **BERT fine-tuned with adversarial training**.
+## 🎯 Project Goal
+This application is part of a research project on:
+- 🎯 **Adversarial robustness**: resistance to AI-generated phishing attacks
+- 🌐 **Cross-lingual generalization**: the ability to detect phishing in French and English
+## 📊 Training Data
+The model was trained on:
+- **Enron Email Dataset** (500k legitimate emails)
+- **SMS Spam Collection** (5,574 SMS messages)
+- **Phishing Email Dataset** (18,650 phishing emails)
+- **Adversarial phishing emails** generated with Ollama + Gemma3:1b (1,968 samples)
+## 🤖 Model
+- **Architecture:** BERT-base-uncased (110M parameters)
+- **Fine-tuning:** Adversarial training (50% baseline + 50% adversarial)
+- **Performance:** F1-score ~95% on adversarial phishing
+## 🚀 Usage
+1. Paste an email into the text box
+2. Click "🔍 Analyze"
+3. Read the verdict and the probabilities
+## ⚠️ Disclaimer
+This application is provided **for educational and research purposes only**.
+Do not use it as your sole protection against phishing.
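
Note on the reported metric: the F1-score quoted in the README is the harmonic mean of precision and recall, presumably computed on the phishing class. A minimal sketch of that formula follows; the function name `f1_score` and the confusion counts are hypothetical and are not taken from the project's evaluation.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, computed from confusion counts."""
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up counts, for illustration only (not the project's results).
example_f1 = f1_score(tp=95, fp=5, fn=5)
print(f"F1 = {example_f1:.2f}")
```

Because F1 ignores true negatives, it is a reasonable choice when legitimate emails vastly outnumber phishing ones, as the dataset sizes listed in the README suggest.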

app.py ADDED

	@@ -0,0 +1,241 @@

+"""
+Gradio application - Phishing Detector
+Model: BERT fine-tuned with adversarial training
+"""
+import gradio as gr
+import torch
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+import os
+# Configuration
+MODEL_PATH = "models"  # this commit adds the model files directly under models/
+MAX_LENGTH = 256
+DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+print("="*60)
+print("🚀 Initializing the Phishing Detector")
+print("="*60)
+# Check that the model exists
+if not os.path.exists(MODEL_PATH):
+    raise FileNotFoundError(
+        f"❌ Model not found: {MODEL_PATH}\n"
+        f"   Make sure the directory exists and contains the model files."
+    )
+# Load the tokenizer and the model
+print(f"📥 Loading the tokenizer from {MODEL_PATH}...")
+tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
+print(f"📥 Loading the model from {MODEL_PATH}...")
+model = AutoModelForSequenceClassification.from_pretrained(MODEL_PATH)
+model.to(DEVICE)
+model.eval()
+print("✅ Model loaded successfully!")
+print(f"🖥️  Device: {DEVICE}")
+print("="*60 + "\n")
+def predict_phishing(email_text):
+    """
+    Predict whether an email is phishing or legitimate.
+    Args:
+        email_text (str): Text of the email to analyze
+    Returns:
+        tuple: (verdict, probabilities, detailed analysis)
+    """
+    if not email_text or not email_text.strip():
+        return "⚠️ Please enter an email", {}, ""
+    # Tokenization
+    inputs = tokenizer(
+        email_text,
+        max_length=MAX_LENGTH,
+        padding='max_length',
+        truncation=True,
+        return_tensors='pt'
+    )
+    # Move the tensors to the right device
+    inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
+    # Prediction
+    with torch.no_grad():
+        outputs = model(**inputs)
+        logits = outputs.logits
+        probabilities = torch.softmax(logits, dim=1)[0]
+        predicted_class = torch.argmax(probabilities).item()
+        confidence = probabilities[predicted_class].item()
+    # Results
+    label = "🚨 Phishing Detected" if predicted_class == 1 else "✅ Legitimate Email"
+    prob_dict = {
+        "Legitimate": float(probabilities[0]),
+        "Phishing": float(probabilities[1])
+    }
+    # Detailed analysis
+    analysis = f"""
+### 📊 Analysis results
+**Verdict:** {label}
+**Confidence:** {confidence * 100:.1f}%
+### 🔍 Probability details
+- **Legitimate:** {probabilities[0] * 100:.2f}%
+- **Phishing:** {probabilities[1] * 100:.2f}%
+### 📝 Information
+- **Model:** BERT-base-uncased (adversarial training)
+- **Text length:** {len(email_text)} characters
+- **Tokens:** {len(tokenizer.encode(email_text))} tokens
+### ⚠️ Warning
+This analysis is provided for educational purposes only. If in doubt about a real email,
+contact your IT department or the purported sender through a secure channel.
+"""
+    return label, prob_dict, analysis
+# Example emails for the demo
+examples = [
+    ["""Dear valued customer,
+Your account has been suspended due to unusual activity.
+Please verify your identity immediately by clicking the link below:
+http://secure-verify-account.com/login
+You have 24 hours to verify or your account will be permanently closed.
+Best regards,
+Security Team"""],
+    ["""Hi team,
+Just a reminder that our weekly meeting is scheduled for tomorrow at 2 PM in Conference Room B.
+Please bring your project updates.
+Thanks,
+John"""],
+    ["""URGENT: You have won $1,000,000 in the international lottery!
+To claim your prize, send us your bank details and a processing fee of $500.
+Contact us immediately: winner@lottery-prize.com
+Congratulations!"""],
+    ["""Hello,
+Your package delivery failed.
+Track your package here: https://trackpackage.com/xyz123
+Delivery company will retry tomorrow between 9 AM - 5 PM.
+Tracking ID: XYZ123456"""]
+]
+# Gradio interface
+with gr.Blocks(theme=gr.themes.Soft(), title="Phishing Detector") as demo:
+    gr.Markdown("""
+    # 🛡️ AI-Powered Phishing Detector
+    This application uses a **BERT model fine-tuned with adversarial training**
+    to detect phishing emails.
+    **Evaluation axes:**
+    - 🎯 Robustness against AI-generated adversarial attacks
+    - 🌐 Cross-lingual generalization (EN/FR)
+    ---
+    """)
+    with gr.Row():
+        with gr.Column(scale=2):
+            email_input = gr.Textbox(
+                label="📧 Paste your email here",
+                placeholder="Enter the content of the email to analyze...",
+                lines=10,
+                max_lines=20
+            )
+            with gr.Row():
+                analyze_btn = gr.Button("🔍 Analyze", variant="primary", size="lg")
+                clear_btn = gr.ClearButton([email_input], value="🗑️ Clear")
+        with gr.Column(scale=1):
+            verdict_output = gr.Textbox(
+                label="🎯 Verdict",
+                interactive=False,
+                lines=2
+            )
+            prob_output = gr.Label(
+                label="📊 Probabilities",
+                num_top_classes=2
+            )
+    with gr.Row():
+        analysis_output = gr.Markdown(label="📈 Detailed Analysis")
+    # Examples
+    gr.Markdown("### 💡 Examples to try")
+    gr.Examples(
+        examples=examples,
+        inputs=email_input,
+        label="Click an example to try it"
+    )
+    # Footer
+    gr.Markdown("""
+    ---
+    ### 📚 About
+    **Project:** AI Phishing Detection - Adversarial Robustness and Cross-Lingual Generalization
+    **Datasets used:**
+    - Enron Email Dataset (500k emails)
+    - SMS Spam Collection (5,574 SMS messages)
+    - Phishing Email Dataset (18,650 emails)
+    - Adversarial phishing emails generated with Ollama + Gemma3:1b
+    **Model:**
+    - BERT-base-uncased (110M parameters)
+    - Fine-tuned with adversarial training (50% baseline + 50% adversarial)
+    ⚠️ **Disclaimer:** This application is provided for educational and research purposes only.
+    """)
+    # Actions
+    analyze_btn.click(
+        fn=predict_phishing,
+        inputs=email_input,
+        outputs=[verdict_output, prob_output, analysis_output]
+    )
+if __name__ == "__main__":
+    print("\n" + "="*60)
+    print("🚀 Launching the Gradio application")
+    print("="*60)
+    print(f"📱 Device: {DEVICE}")
+    print(f"🤖 Model: {MODEL_PATH}")
+    print("="*60 + "\n")
+    demo.launch(
+        server_name="0.0.0.0",  # Bind to all interfaces so the Space's container can expose the app
+        server_port=7860,
+        share=False,  # Set to True to get a temporary public link
+        show_error=True
+    )
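
Note on the prediction step: `predict_phishing` reduces to a softmax over two logits followed by an argmax. The sketch below reproduces that arithmetic in plain Python, without torch. The helper name `verdict_from_logits` and the logit values are assumptions chosen for illustration, and class index 1 is taken to mean phishing, as in app.py.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def verdict_from_logits(logits):
    """Return (predicted_class, confidence, probabilities) for raw logits."""
    probs = softmax(logits)
    predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
    return predicted, probs[predicted], probs


# Hypothetical logits in [legitimate, phishing] order, not real model output.
predicted, confidence, probs = verdict_from_logits([-1.2, 2.3])
label = "Phishing" if predicted == 1 else "Legitimate"
print(label, f"{confidence * 100:.1f}%")
```

With only two classes, the reported confidence is always at least 50%, so a value near that floor means the model is effectively undecided.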

models/config.json ADDED

	@@ -0,0 +1,26 @@

+{
+  "architectures": [
+    "BertForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "dtype": "float32",
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "problem_type": "single_label_classification",
+  "transformers_version": "4.57.0",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 30522
+}
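
Note on model size: the values in this config are enough to estimate the parameter count by hand. The sketch below does that under two assumptions that are not stated in the file: the classifier has two output labels (app.py reads exactly two probabilities), and every linear layer has a bias term, as in the standard BERT layout.

```python
# Values copied from models/config.json
hidden = 768
layers = 12
intermediate = 3072
vocab = 30522
max_positions = 512
type_vocab = 2
num_labels = 2  # assumption: not stated in config.json; app.py reads two classes


def linear(n_in, n_out):
    """Weights plus bias of a dense layer."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale and shift vectors

# Word, position and token-type embeddings, followed by one LayerNorm
embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm
# Query, key, value and output projections, followed by one LayerNorm
attention = 4 * linear(hidden, hidden) + layer_norm
# Two dense layers (expand, then project back), followed by one LayerNorm
feed_forward = linear(hidden, intermediate) + linear(intermediate, hidden) + layer_norm
encoder = layers * (attention + feed_forward)
pooler = linear(hidden, hidden)
classifier = linear(hidden, num_labels)

total_params = embeddings + encoder + pooler + classifier
float32_bytes = total_params * 4  # "dtype": "float32" means 4 bytes per parameter
print(total_params, float32_bytes)
```

Under these assumptions the count comes to roughly 109.5M parameters, in line with the README's 110M figure. The float32 payload lands slightly below the 437958648 bytes recorded in the model.safetensors LFS pointer; the remainder is presumably the file header.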

models/model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f5339a24a01173741b1b9eef5bdd19022b0dfaa1a915a25c8144bc9324b7f710
+size 437958648

models/special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

models/tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

models/tokenizer_config.json ADDED

	@@ -0,0 +1,56 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "[CLS]",
+  "do_lower_case": true,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
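
Note on French input: `do_lower_case: true` combined with `strip_accents: null` has a consequence worth spelling out. To my understanding, BERT's basic tokenization treats an unset `strip_accents` as following `do_lower_case`, so accents are removed before WordPiece splitting. The sketch below approximates only that normalization step with the standard library; `normalize_uncased` is a hypothetical helper, not the tokenizer's API.

```python
import unicodedata


def normalize_uncased(text: str) -> str:
    """Lowercase, then drop combining accent marks (approximation of uncased BERT)."""
    lowered = text.lower()
    # NFD splits an accented letter into base letter + combining mark
    decomposed = unicodedata.normalize("NFD", lowered)
    # Category "Mn" (nonspacing mark) covers the combining accents
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(normalize_uncased("Détecteur de Phishing"))
```

If this holds for the deployed tokenizer, French words reach the model without diacritics, which is relevant to the cross-lingual claim in the README.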

models/training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f0fb555b3f5b186429fd7368c45115a9c45d03d7ddc5e663fdbb4dd4a46eaf9e
+size 6033

models/vocab.txt ADDED

The diff for this file is too large to render. See raw diff

requirements.txt ADDED

	@@ -0,0 +1,3 @@

+transformers==4.57.1
+torch==2.5.1
+gradio==5.49.1