moraeslucas committed on
Commit
fdff15a
·
verified ·
1 Parent(s): 05524f6

First Commit with 25 files

.gitignore ADDED
@@ -0,0 +1,23 @@
+ # Virtual environments
+ venv/
+ env/
+
+ # Temporary and compiled files
+ __pycache__/
+ *.py[cod]
+ *.log
+
+ # Operating system files
+ .DS_Store
+ Thumbs.db
+
+ # IDE files
+ .vscode/
+ .idea/
+
+ # Temporary test or example files
+ *.tmp
+ *.bak
+
+ # Private keys or configs
+ *.env
.gradio/flagged/dataset1.csv ADDED
@@ -0,0 +1,55 @@
1
+ email_text,output 0,Explicação,timestamp
2
+ "'Subject: Congratulations! You've Won a Gift Card 🎁
3
+
4
+ Hi there,
5
+
6
+ You’ve been selected as a lucky winner of a $100 Amazon gift card! To claim your prize, simply fill in your details at the link below:
7
+
8
+ http://amaz0n-prize-now.ru
9
+
10
+ Hurry — this offer expires in 24 hours!
11
+
12
+ Best of luck,
13
+ Rewards Center","{""label"": ""Phishing \ud83d\udea8"", ""confidences"": null}","🧠 Modelo BERT: Phishing (99.98%)
14
+ 🔎 Heurística: Score = 4.8
15
+ • [prize] Matched 'gift.?card' (global, weight=0.9)
16
+ • [prize] Matched 'congratulations' (en, weight=0.7)
17
+ • [prize] Matched 'winner' (en, weight=0.8)
18
+ • [prize] Matched 'prize' (en, weight=0.8)
19
+ • [urgency] Matched '24 hours' (global, weight=0.6)
20
+ • Contains suspicious link(s): http://amaz0n-prize-now.ru
21
+
22
+ 🗝️ Palavras-chave: Amazon gift card, Gift Card, Congratulations, Amazon gift, Subject",2025-07-21 15:33:49.523444
23
+ "'Subject: Congratulations! You've Won a Gift Card 🎁
24
+
25
+ Hi there,
26
+
27
+ You’ve been selected as a lucky winner of a $100 Amazon gift card! To claim your prize, simply fill in your details at the link below:
28
+
29
+ http://amaz0n-prize-now.ru
30
+
31
+ Hurry — this offer expires in 24 hours!
32
+
33
+ Best of luck,
34
+ Rewards Center","{""label"": ""Phishing \ud83d\udea8"", ""confidences"": null}","🧠 Modelo BERT: Phishing (99.98%)
35
+ 🔎 Heurística: Score = 4.8
36
+ • [prize] Matched 'gift.?card' (global, weight=0.9)
37
+ • [prize] Matched 'congratulations' (en, weight=0.7)
38
+ • [prize] Matched 'winner' (en, weight=0.8)
39
+ • [prize] Matched 'prize' (en, weight=0.8)
40
+ • [urgency] Matched '24 hours' (global, weight=0.6)
41
+ • Contains suspicious link(s): http://amaz0n-prize-now.ru
42
+
43
+ 🗝️ Palavras-chave: Amazon gift card, Gift Card, Congratulations, Amazon gift, Subject",2025-07-21 15:33:53.036101
44
+ "'Subject: Meeting Rescheduled
45
+
46
+ Hi John,
47
+
48
+ The client meeting has been rescheduled to Thursday at 3 PM. Please let me know if that works for you.
49
+
50
+ Best,
51
+ Maria","{""label"": ""Leg\u00edtimo \u2705"", ""confidences"": null}","🧠 Modelo BERT: Legitimate (99.99%)
52
+ 🔎 Heurística: Score = 0.0
53
+
54
+
55
+ 🗝️ Palavras-chave: Rescheduled Hi John, Meeting Rescheduled, rescheduled to Thursday, client meeting, John",2025-07-21 15:34:16.300979
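The flagged log stores each email in a quoted field, so one record spans several physical lines and cannot be split on newlines. The following is a minimal sketch of reading it with the standard `csv` module, assuming the four-column layout shown in the header row. `read_flagged` and the sample record are hypothetical and not part of the repository.

```python
import csv
import io
import json

# Hypothetical one-record sample that mirrors the column layout of the flagged log.
SAMPLE = (
    'email_text,output 0,Explicação,timestamp\n'
    '"Subject: Hello\n\nHi John","{""label"": ""Leg\\u00edtimo \\u2705"", ""confidences"": null}",'
    '"Heurística: Score = 0.0",2025-07-21 15:34:16.300979\n'
)

def read_flagged(stream):
    """Yield (email_text, label, timestamp) for each flagged record."""
    reader = csv.DictReader(stream)
    for row in reader:
        # The "output 0" column holds a JSON object with the predicted label.
        label = json.loads(row["output 0"])["label"]
        yield row["email_text"], label, row["timestamp"]

# newline="" lets the csv module handle line breaks inside quoted fields itself.
rows = list(read_flagged(io.StringIO(SAMPLE, newline="")))
```

For a file on disk, the same function can be given `open(path, newline="", encoding="utf-8")` instead of the in-memory stream.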
README.md CHANGED
@@ -1,13 +1,61 @@
  ---
- title: PhisHunter
- emoji: 👁
- colorFrom: blue
- colorTo: gray
- sdk: gradio
- sdk_version: 5.38.2
- app_file: app.py
- pinned: false
- license: mit
- ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # PhishHunter
+
+ PhishHunter is an open-source, NLP-based email classification tool that detects phishing attempts and explains why an email might be suspicious.
+
+ ## 🔧 Technologies Used
+
+ - [Hugging Face Transformers](https://huggingface.co/)
+ - [Gradio](https://gradio.app/)
+ - [NLTK](https://www.nltk.org/)
+ - [YAKE](https://github.com/LIAAD/yake)
+ - [LangDetect](https://pypi.org/project/langdetect/)
+ - [extract-msg](https://pypi.org/project/extract-msg/)
+ - Python 3.8+
+
+ > **Note:** Although `spaCy` is listed in the requirements, it is not actively used in the codebase.
+
+ ## 🚀 Getting Started
+
+ 1. Clone the repository or download the project folder:
+
+    ```bash
+    git clone https://github.com/SEU_UTILIZADOR/phishhunter.git
+    cd phishhunter
+    ```
+
+ 2. Create and activate a virtual environment:
+
+    ```bash
+    python -m venv venv
+    venv\Scripts\activate # Windows
+    # or
+    source venv/bin/activate # Linux/macOS
+    ```
+
+ 3. Install dependencies:
+
+    ```bash
+    pip install -r requirements.txt
+    ```
+
+ 4. Run the app:
+
+    ```bash
+    python app.py
+    ```
+
+ ## 📦 Features
+
+ - Classify email text using a fine-tuned BERT model
+ - Heuristic-based rules per language (via `rules.yaml`)
+ - Language detection for multilingual support
+ - Keyword extraction
+ - URL verification via the VirusTotal API
+ - Gradio-based interface for easy use
+
  ---

+
+ ## 📜 License
+
+ This project is licensed under the MIT License.
app.py ADDED
@@ -0,0 +1,121 @@
+
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import gradio as gr
+ from utils.heuristics import load_rules, explain_email, extract_keywords
+ from langdetect import detect
+ from pathlib import Path
+ import re
+ from utils.virustotal import check_url_virustotal
+ import extract_msg
+
+ # Model
+ model_name = "ElSlay/BERT-Phishing-Email-Model"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name)
+
+ # Heuristic rules
+ rules = load_rules("rules.yaml")
+
+ # Classification with BERT
+ def classify_email(email_text):
+     inputs = tokenizer(email_text, return_tensors="pt", truncation=True, padding=True)
+     # Inference only, so no gradients are needed
+     with torch.no_grad():
+         outputs = model(**inputs)
+     probs = torch.softmax(outputs.logits, dim=1).numpy()[0]
+     # Assumes label index 0 is legitimate and index 1 is phishing
+     labels = ["Legitimate", "Phishing"]
+     prediction = labels[probs.argmax()]
+     confidence = float(probs.max())
+     return prediction, confidence
+
+ # Full email analysis
+ def analyze_email(file_input=None, text_input=None):
+     email_text = None
+
+     if text_input:
+         email_text = text_input
+     elif file_input:
+         path = file_input.name if hasattr(file_input, "name") else file_input
+         if path.lower().endswith(".txt"):
+             with open(path, "r", encoding="utf-8") as f:
+                 email_text = f.read()
+         elif path.lower().endswith(".msg"):
+             msg = extract_msg.Message(path)
+             email_text = f"{msg.subject or ''}\n{msg.body or ''}"
+         else:
+             return "Unsupported file type."
+     else:
+         return "No input provided."
+
+     if not email_text or not email_text.strip():
+         return "Could not extract text from file."
+
+     # Classification
+     label, confidence = classify_email(email_text)
+
+     # Heuristics
+     explanations, score = explain_email(email_text, rules)
+
+     # Keywords (fall back to English if language detection fails)
+     try:
+         lang = detect(email_text)
+     except Exception:
+         lang = "en"
+     keywords = extract_keywords(email_text, lang)
+     keywords_text = "Top keywords: " + ", ".join(keywords)
+
+     # Heuristic explanation
+     if explanations:
+         explanation_text = "📌 Explanation:\n• " + "\n• ".join(explanations)
+     else:
+         explanation_text = "📌 Explanation: no heuristic rules matched."
+
+     # VirusTotal check
+     urls = re.findall(r"https?://\S+", email_text)
+     vt_results = []
+     for url in urls:
+         stats = check_url_virustotal(url)
+         if "error" in stats:
+             vt_results.append(f"URL: {url} | VT: {stats['error']}")
+         else:
+             vt_results.append(f"URL: {url} | Malicious: {stats.get('malicious', 0)}, Suspicious: {stats.get('suspicious', 0)}, Harmless: {stats.get('harmless', 0)}")
+     vt_text = "\n".join(vt_results) if vt_results else "No URLs found."
+
+     return f"Classification: {label} ({confidence:.2%})\n\n{explanation_text}\n\nScore: {score:.1f}\n\n{keywords_text}\n\nVirusTotal Results:\n{vt_text}"
+
+ # Load examples
+ def update_text_from_example(example_name):
+     # The dropdown value is None when the selection is cleared
+     return example_emails.get(example_name, "")
+
+ def load_example_files():
+     examples_path = Path("examples")
+     files = sorted(examples_path.glob("*.txt"))
+     return {file.name: file.read_text(encoding="utf-8") for file in files}
+
+ example_emails = load_example_files()
+
+ # Gradio interface
+ with gr.Blocks() as demo:
+     gr.Markdown("## 🛡️ PhishHunter – Email Phishing Detector")
+
+     with gr.Row():
+         dropdown = gr.Dropdown(
+             choices=list(example_emails.keys()),
+             label="Load Example Email",
+             info="Select a sample email to test",
+         )
+         textbox = gr.Textbox(lines=15, label="Paste or load email content")
+         filebox = gr.File(label="Upload email (.txt or .msg)", file_types=[".txt", ".msg"])
+
+     dropdown.change(fn=update_text_from_example, inputs=dropdown, outputs=textbox)
+
+     output = gr.Textbox(label="Classification & Explanation")
+
+     btn = gr.Button("Analyze")
+     btn.click(fn=analyze_email, inputs=[filebox, textbox], outputs=output)
+
+ if __name__ == "__main__":
+     demo.launch()
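The URL pattern used in `app.py` matches everything up to the next whitespace, so a link at the end of a sentence keeps its trailing full stop or bracket, and a repeated link is sent to VirusTotal more than once. The sketch below shows one way to trim and de-duplicate the matches. `extract_urls` and `TRAILING` are hypothetical names, and the set of trimmed characters is an assumption rather than a complete URL grammar.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Characters that commonly follow a link in prose but are rarely its last character.
TRAILING = ".,;:!?)>]'\""

def extract_urls(text):
    """Return the URLs in order of first appearance, without trailing punctuation."""
    urls = []
    for match in URL_RE.findall(text):
        url = match.rstrip(TRAILING)
        if url and url not in urls:
            urls.append(url)
    return urls

found = extract_urls("See http://amaz0n-prize-now.ru. Again: http://amaz0n-prize-now.ru")
```

Here `found` holds a single entry, so the loop over URLs would issue one lookup instead of two.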
examples/Alerta Novas mensagens de Self-service.msg ADDED
Binary file (80.4 kB)
examples/Email_Phis_URL_mal.txt ADDED
@@ -0,0 +1,7 @@
1
+ Hi,
2
+
3
+ it's not importante, but click in link ...
4
+
5
+ http://www.eicar.org/download/eicar.com
6
+
7
+
examples/Leg_email_1.txt ADDED
@@ -0,0 +1,8 @@
1
+ Subject: Meeting Rescheduled
2
+
3
+ Hi John,
4
+
5
+ The client meeting has been rescheduled to Thursday at 3 PM. Please let me know if that works for you.
6
+
7
+ Best,
8
+ Maria
examples/Leg_email_2.txt ADDED
@@ -0,0 +1,8 @@
1
+ Subject: Invoice for Your Purchase
2
+
3
+ Dear Customer,
4
+
5
+ Thank you for your recent order. Attached is the invoice for your purchase. If you have any questions, feel free to contact our support team.
6
+
7
+ Best regards,
8
+ Online Store Team
examples/email_5_real.txt ADDED
@@ -0,0 +1,10 @@
1
+ Subject: [Internal] Firewall Maintenance Notification
2
+
3
+ Dear team,
4
+
5
+ Please note that the firewall will be updated on Saturday at 02:00 AM. Expect short-term service interruptions during the maintenance window.
6
+
7
+ If you experience any issues, contact the IT team.
8
+
9
+ Regards,
10
+ IT Operations
examples/email_5_urg_leg.txt ADDED
@@ -0,0 +1,8 @@
1
+ Subject: Action Required – Submit Report by EOD
2
+
3
+ Hi Ana,
4
+
5
+ Just a quick reminder to submit the monthly expense report by the end of the day. Let me know if you need any help.
6
+
7
+ Thanks,
8
+ Carla
examples/email_6_phs_congrats.txt ADDED
@@ -0,0 +1,12 @@
1
+ Subject: Congratulations! You've Won a Gift Card 🎁
2
+
3
+ Hi there,
4
+
5
+ You’ve been selected as a lucky winner of a $100 Amazon gift card! To claim your prize, simply fill in your details at the link below:
6
+
7
+ http://amaz0n-prize-now.ru
8
+
9
+ Hurry — this offer expires in 24 hours!
10
+
11
+ Best of luck,
12
+ Rewards Center
examples/email_7_ph_microst.txt ADDED
@@ -0,0 +1,11 @@
1
+ Subject: Microsoft Security Alert
2
+
3
+ Dear user,
4
+
5
+ We have noticed a new sign-in to your Microsoft account from an unrecognized device. If this was not you, please secure your account immediately:
6
+
7
+ https://account.microsoft-security-check.com
8
+
9
+ Ignoring this alert may result in temporary suspension for your protection.
10
+
11
+ Microsoft Account Team
examples/phishing_email_1.txt ADDED
File without changes
examples/phishing_email_2.txt ADDED
@@ -0,0 +1,12 @@
1
+ Subject: Urgent Account Verification Required
2
+
3
+ Dear user,
4
+
5
+ We have detected unusual activity on your account. To secure your data, please verify your identity immediately by clicking the link below:
6
+
7
+ http://secure-account-verify.com/login
8
+
9
+ Failure to comply within 24 hours will result in permanent suspension of your account.
10
+
11
+ Sincerely,
12
+ Security Team
examples/phishing_email_3.txt ADDED
@@ -0,0 +1,6 @@
1
+ Subject: Important Security Update
2
+
3
+ Your email account needs to be updated urgently. Please download the attached file and follow the instructions to avoid service interruption.
4
+
5
+ Best regards,
6
+ IT Support
init.bat ADDED
@@ -0,0 +1,3 @@
+ REM init.bat
+ pip install -r requirements.txt
+ python -m spacy download en_core_web_sm
requirements.txt ADDED
@@ -0,0 +1,112 @@
1
+ aiofiles==24.1.0
2
+ annotated-types==0.7.0
3
+ anyio==4.9.0
4
+ audioop-lts==0.2.1
5
+ beautifulsoup4==4.13.4
6
+ blis==1.3.0
7
+ Brotli==1.1.0
8
+ catalogue==2.0.10
9
+ certifi==2025.7.14
10
+ cffi==1.17.1
11
+ charset-normalizer==3.4.2
12
+ click==8.2.1
13
+ cloudpathlib==0.21.1
14
+ colorama==0.4.6
15
+ colorclass==2.2.2
16
+ compressed-rtf==1.0.7
17
+ confection==0.1.5
18
+ cryptography==45.0.5
19
+ cymem==2.0.11
20
+ easygui==0.98.3
21
+ ebcdic==1.1.1
22
+ en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl#sha256=1932429db727d4bff3deed6b34cfc05df17794f4a52eeb26cf8928f7c1a0fb85
23
+ extract-msg==0.54.1
24
+ fastapi==0.116.1
25
+ ffmpy==0.6.0
26
+ filelock==3.18.0
27
+ fsspec==2025.7.0
28
+ gradio==5.38.0
29
+ gradio_client==1.11.0
30
+ groovy==0.1.2
31
+ h11==0.16.0
32
+ httpcore==1.0.9
33
+ httpx==0.28.1
34
+ huggingface-hub==0.33.4
35
+ idna==3.10
36
+ jellyfish==1.2.0
37
+ Jinja2==3.1.6
38
+ joblib==1.5.1
39
+ langcodes==3.5.0
40
+ langdetect==1.0.9
41
+ language_data==1.3.0
42
+ lark==1.1.9
43
+ marisa-trie==1.2.1
44
+ markdown-it-py==3.0.0
45
+ MarkupSafe==3.0.2
46
+ mdurl==0.1.2
47
+ mpmath==1.3.0
48
+ msoffcrypto-tool==5.4.2
49
+ murmurhash==1.0.13
50
+ networkx==3.5
51
+ nltk==3.9.1
52
+ numpy==2.3.1
53
+ olefile==0.47
54
+ oletools==0.60.2
55
+ orjson==3.11.0
56
+ packaging==25.0
57
+ pandas==2.3.1
58
+ pcodedmp==1.2.6
59
+ pillow==11.3.0
60
+ preshed==3.0.10
61
+ pycparser==2.22
62
+ pydantic==2.11.7
63
+ pydantic_core==2.33.2
64
+ pydub==0.25.1
65
+ Pygments==2.19.2
66
+ pyparsing==3.2.3
67
+ python-dateutil==2.9.0.post0
68
+ python-multipart==0.0.20
69
+ pytz==2025.2
70
+ PyYAML==6.0.2
71
+ red-black-tree-mod==1.22
72
+ regex==2024.11.6
73
+ requests==2.32.4
74
+ rich==14.0.0
75
+ RTFDE==0.1.2.1
76
+ ruff==0.12.4
77
+ safehttpx==0.1.6
78
+ safetensors==0.5.3
79
+ segtok==1.5.11
80
+ semantic-version==2.10.0
81
+ setuptools==80.9.0
82
+ shellingham==1.5.4
83
+ six==1.17.0
84
+ smart_open==7.3.0.post1
85
+ sniffio==1.3.1
86
+ soupsieve==2.7
87
+ spacy==3.8.7
88
+ spacy-legacy==3.0.12
89
+ spacy-loggers==1.0.5
90
+ srsly==2.5.1
91
+ starlette==0.47.1
92
+ sympy==1.14.0
93
+ tabulate==0.9.0
94
+ thinc==8.3.6
95
+ tokenizers==0.21.2
96
+ tomlkit==0.13.3
97
+ torch==2.7.1
98
+ tqdm==4.67.1
99
+ transformers==4.53.2
100
+ typer==0.16.0
101
+ typing-inspection==0.4.1
102
+ typing_extensions==4.14.1
103
+ tzdata==2025.2
104
+ tzlocal==5.3.1
105
+ urllib3==2.5.0
106
+ uvicorn==0.35.0
107
+ wasabi==1.1.3
108
+ weasel==0.4.1
109
+ websockets==15.0.1
110
+ win_unicode_console==0.5
111
+ wrapt==1.17.2
112
+ yake==0.6.0
rules.yaml ADDED
@@ -0,0 +1,47 @@
+ keywords:
+   impersonation:
+     en:
+       - { term: "Microsoft", weight: 0.5 }
+       - { term: "DocuSign", weight: 0.7 }
+       - { term: "PayPal", weight: 0.8 }
+     pt:
+       - { term: "DocuSign", weight: 0.7 }
+       - { term: "Caixa", weight: 0.6 }
+       - { term: "Segurança Social", weight: 0.8 }
+
+   prize:
+     en:
+       - { term: "congratulations", weight: 0.7 }
+       - { term: "winner", weight: 0.8 }
+       - { term: "prize", weight: 0.8 }
+     pt:
+       - { term: "par(á|a)b(é|e)ns", weight: 0.9 }
+       - { term: "pr(é|e)mio", weight: 0.8 }
+       - { term: "ganhador", weight: 0.7 }
+     global:
+       - { term: "gift.?card", weight: 0.9 }
+
+   sensitive_info:
+     en:
+       - { term: "password", weight: 0.9 }
+       - { term: "login", weight: 0.8 }
+       - { term: "verify your account", weight: 1.0 }
+       - { term: "credentials", weight: 0.9 }
+     pt:
+       - { term: "senha", weight: 0.9 }
+       - { term: "palavra.?passe", weight: 0.8 }
+       - { term: "credenciais", weight: 0.9 }
+       - { term: "verifique sua conta", weight: 1.0 }
+     global: []
+
+   urgency:
+     en:
+       - { term: "urgent", weight: 0.9 }
+       - { term: "action required", weight: 0.8 }
+     pt:
+       - { term: "urgente", weight: 0.9 }
+       - { term: "ação imediata", weight: 0.8 }
+     global:
+       - { term: "immediately", weight: 0.7 }
+       - { term: "24 hours", weight: 0.6 }
+
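Each `term` in `rules.yaml` is passed to `re.search`, so a malformed pattern or a non-numeric weight only surfaces when an email is analyzed. A small check at load time could catch these earlier. The sketch below assumes the rules have already been loaded into a dict with the layout shown above; `validate_rules` is a hypothetical helper, not part of the repository.

```python
import re

def validate_rules(rules):
    """Return a list of human-readable problems found in a rules mapping."""
    problems = []
    for category, by_lang in rules.get("keywords", {}).items():
        for lang, entries in by_lang.items():
            for entry in entries or []:
                term = entry.get("term")
                weight = entry.get("weight", 1.0)
                where = f"{category}/{lang}/{term!r}"
                if not isinstance(term, str) or not term:
                    problems.append(f"{where}: missing term")
                    continue
                try:
                    re.compile(term)
                except re.error as exc:
                    problems.append(f"{where}: invalid regex ({exc})")
                # bool is a subclass of int, so reject it explicitly.
                if isinstance(weight, bool) or not isinstance(weight, (int, float)):
                    problems.append(f"{where}: weight must be a number")
    return problems

good = {"keywords": {"prize": {"global": [{"term": "gift.?card", "weight": 0.9}]}}}
bad = {"keywords": {"prize": {"en": [{"term": "gift(card", "weight": "high"}]}}}
good_problems = validate_rules(good)
bad_problems = validate_rules(bad)
```

With these inputs, `good_problems` is empty and `bad_problems` reports both the unbalanced parenthesis and the non-numeric weight.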
utils/__init__.py ADDED
@@ -0,0 +1 @@
+
utils/heuristics.py ADDED
@@ -0,0 +1,70 @@
+
+ import re
+ import yaml
+ import yake
+ from langdetect import detect
+
+ # Load weighted heuristic rules
+ def load_rules(filepath="rules.yaml"):
+     with open(filepath, "r", encoding="utf-8") as f:
+         return yaml.safe_load(f)
+
+ # Apply rules based on the detected language and compute a score
+ def apply_heuristics(email_text, rules):
+     reasons = []
+     total_score = 0.0
+     lower = email_text.lower()
+     # langdetect raises on text without usable features; fall back to global rules only
+     try:
+         lang = detect(lower)
+     except Exception:
+         lang = "unknown"
+
+     # Negation phrases (Portuguese) that reduce the score
+     negations = [
+         "não é urgente",
+         "sem urgência",
+         "não necessita ação",
+         "não requer ação imediata",
+         "sem necessidade imediata"
+     ]
+     for neg in negations:
+         if neg in lower:
+             reasons.append(f"Found negation: '{neg}' (reduces score)")
+             total_score -= 0.5
+
+     for category, keywords in rules.get("keywords", {}).items():
+         # Global keywords
+         for entry in keywords.get("global", []):
+             pattern = entry["term"]
+             weight = entry.get("weight", 1.0)
+             if re.search(pattern, lower, re.IGNORECASE):
+                 reasons.append(f"[{category}] Matched '{pattern}' (global, weight={weight})")
+                 total_score += weight
+
+         # Language-specific keywords
+         for entry in keywords.get(lang, []):
+             pattern = entry["term"]
+             weight = entry.get("weight", 1.0)
+             if re.search(pattern, lower, re.IGNORECASE):
+                 reasons.append(f"[{category}] Matched '{pattern}' ({lang}, weight={weight})")
+                 total_score += weight
+
+     # Link heuristic
+     urls = re.findall(r"http[s]?://\S+", email_text)
+     if urls:
+         reasons.append(f"Contains suspicious link(s): {', '.join(urls)}")
+         total_score += 1.0
+
+     return reasons, total_score, lang
+
+ # Keyword extraction with YAKE
+ def extract_keywords(email_text, lang="en"):
+     extractor = yake.KeywordExtractor(lan=lang, top=5)
+     keywords = extractor.extract_keywords(email_text)
+     return [kw for kw, score in keywords]
+
+ # Combined explanation
+ def explain_email(email_text, rules):
+     reasons, score, lang = apply_heuristics(email_text, rules)
+     return reasons, score
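The scoring in `apply_heuristics` is a plain sum: every matching term adds its weight, and the presence of any link adds a fixed 1.0. The sketch below reproduces that arithmetic for the gift-card example, using a subset of `rules.yaml` written as the dict that `yaml.safe_load` would return. It assumes the language has already been detected as `en`. `score_email`, `RULES` and `EMAIL` are illustrative names, and the negation step is left out.

```python
import re

# Subset of rules.yaml, as the dict that yaml.safe_load would return.
RULES = {
    "keywords": {
        "prize": {
            "en": [
                {"term": "congratulations", "weight": 0.7},
                {"term": "winner", "weight": 0.8},
                {"term": "prize", "weight": 0.8},
            ],
            "global": [{"term": "gift.?card", "weight": 0.9}],
        },
        "urgency": {
            "en": [{"term": "urgent", "weight": 0.9}],
            "global": [{"term": "24 hours", "weight": 0.6}],
        },
    }
}

EMAIL = (
    "Subject: Congratulations! You've Won a Gift Card\n"
    "You've been selected as a lucky winner of a $100 Amazon gift card! "
    "To claim your prize, fill in your details at http://amaz0n-prize-now.ru\n"
    "Hurry, this offer expires in 24 hours!"
)

def score_email(text, rules, lang):
    """Sum the weights of matching global and language-specific terms, plus 1.0 for any link."""
    total = 0.0
    for by_lang in rules["keywords"].values():
        for entry in by_lang.get("global", []) + by_lang.get(lang, []):
            if re.search(entry["term"], text, re.IGNORECASE):
                total += entry.get("weight", 1.0)
    if re.search(r"https?://\S+", text):
        total += 1.0
    # Round to one decimal to hide floating-point noise in the sum.
    return round(total, 1)

score = score_email(EMAIL, RULES, "en")
```

The matches are 0.9 + 0.7 + 0.8 + 0.8 + 0.6 for the terms and 1.0 for the link, so `score` is 4.8, the same value recorded for this email in the flagged log.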
utils/heuristics_old1.py ADDED
@@ -0,0 +1,47 @@
+ import re
+ import yaml
+ import spacy
+ import yake
+ from langdetect import detect
+
+ # Load heuristic rules from a YAML file
+ def load_rules(filepath="rules.yaml"):
+     with open(filepath, "r", encoding="utf-8") as f:
+         return yaml.safe_load(f)
+
+ # Apply rules based on the detected language
+ def apply_heuristics(email_text, rules):
+     reasons = []
+     lower = email_text.lower()
+     lang = detect(lower)
+
+     for category, keywords in rules.get("keywords", {}).items():
+         # Apply global (language-independent) rules
+         for kw in keywords.get("global", []):
+             if kw in lower:
+                 reasons.append(f"Contains keyword '{kw}' related to {category} (global)")
+                 break
+
+         # Apply language-specific rules
+         for kw in keywords.get(lang, []):
+             if kw in lower:
+                 reasons.append(f"Contains keyword '{kw}' related to {category} ({lang})")
+                 break
+
+     # Link heuristic
+     urls = re.findall(r"http[s]?://\S+", email_text)
+     if urls:
+         reasons.append(f"Contains suspicious link(s): {', '.join(urls)}")
+
+     return reasons, lang
+
+ # Extract keywords with YAKE (as supplementary information)
+ def extract_keywords(email_text, lang="en"):
+     extractor = yake.KeywordExtractor(lan=lang, top=5)
+     keywords = extractor.extract_keywords(email_text)
+     return [kw for kw, score in keywords]
+
+ # Combined explanation
+ def explain_email(email_text, rules):
+     reasons, lang = apply_heuristics(email_text, rules)
+     return reasons
utils/heuristics_old2.py ADDED
@@ -0,0 +1,54 @@
+
+ import re
+ import yaml
+ import yake
+ import spacy
+ from langdetect import detect
+
+ # Load weighted heuristic rules
+ def load_rules(filepath="rules_weighted.yaml"):
+     with open(filepath, "r", encoding="utf-8") as f:
+         return yaml.safe_load(f)
+
+ # Apply rules based on the detected language and compute a score
+ def apply_heuristics(email_text, rules):
+     reasons = []
+     total_score = 0.0
+     lower = email_text.lower()
+     lang = detect(lower)
+
+     for category, keywords in rules.get("keywords", {}).items():
+         # Global keywords
+         for entry in keywords.get("global", []):
+             pattern = entry["term"]
+             weight = entry.get("weight", 1.0)
+             if re.search(pattern, lower, re.IGNORECASE):
+                 reasons.append(f"[{category}] Matched '{pattern}' (global, weight={weight})")
+                 total_score += weight
+
+         # Language-specific keywords
+         for entry in keywords.get(lang, []):
+             pattern = entry["term"]
+             weight = entry.get("weight", 1.0)
+             if re.search(pattern, lower, re.IGNORECASE):
+                 reasons.append(f"[{category}] Matched '{pattern}' ({lang}, weight={weight})")
+                 total_score += weight
+
+     # Link heuristic
+     urls = re.findall(r"http[s]?://\S+", email_text)
+     if urls:
+         reasons.append(f"Contains suspicious link(s): {', '.join(urls)}")
+         total_score += 1.0  # fixed weight for the presence of links
+
+     return reasons, total_score, lang
+
+ # Keyword extraction with YAKE
+ def extract_keywords(email_text, lang="en"):
+     extractor = yake.KeywordExtractor(lan=lang, top=5)
+     keywords = extractor.extract_keywords(email_text)
+     return [kw for kw, score in keywords]
+
+ # Combined explanation
+ def explain_email(email_text, rules):
+     reasons, score, lang = apply_heuristics(email_text, rules)
+     return reasons, score
utils/keyword_extractor.py ADDED
@@ -0,0 +1,10 @@
+ import yake
+
+ # Extract keywords with YAKE
+ def extract_keywords(text, lang="en", max_keywords=5):
+     try:
+         extractor = yake.KeywordExtractor(lan=lang, top=max_keywords)
+         keywords = extractor.extract_keywords(text)
+         return [kw for kw, _ in keywords]
+     except Exception:
+         return []
utils/lang_detect.py ADDED
@@ -0,0 +1,11 @@
+ from langdetect import detect, DetectorFactory
+
+ # Fix the seed so that detection results are consistent between runs
+ DetectorFactory.seed = 0
+
+ def detect_language(text):
+     try:
+         lang = detect(text)
+         return lang
+     except Exception:
+         return "unknown"
utils/virustotal.py ADDED
@@ -0,0 +1,26 @@
+ import requests
+ import os
+
+ # Set the VT_API_KEY environment variable to your VirusTotal API key (never commit the key)
+ VT_API_KEY = os.getenv("VT_API_KEY", "")
+
+ VT_BASE_URL = "https://www.virustotal.com/api/v3"
+
+ def check_url_virustotal(url):
+     if not VT_API_KEY:
+         return {"error": "VT_API_KEY is not set"}
+     headers = {"x-apikey": VT_API_KEY}
+     try:
+         # First, submit the URL for analysis
+         response = requests.post(f"{VT_BASE_URL}/urls", headers=headers, data={"url": url}, timeout=30)
+         if response.status_code != 200:
+             return {"error": f"Error submitting URL: {response.status_code}"}
+         analysis_id = response.json()["data"]["id"]
+         # Fetch the analysis result (the analysis may still be queued at this point)
+         analysis_response = requests.get(f"{VT_BASE_URL}/analyses/{analysis_id}", headers=headers, timeout=30)
+     except requests.RequestException as exc:
+         return {"error": f"Request failed: {exc}"}
+     if analysis_response.status_code != 200:
+         return {"error": f"Error fetching analysis: {analysis_response.status_code}"}
+     # Example: {'harmless': 70, 'malicious': 1, 'suspicious': 0, ...}
+     return analysis_response.json()["data"]["attributes"]["stats"]
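Submitting a URL and reading the analysis straight away can return results before the scan has finished. An alternative is to look up the existing report for the URL. To my understanding, the VirusTotal v3 API addresses such a report by an identifier that is the URL-safe base64 encoding of the URL with the padding removed; treat that as an assumption to confirm against the official documentation. The helper names `vt_url_id` and `vt_report_endpoint` are hypothetical.

```python
import base64

VT_BASE_URL = "https://www.virustotal.com/api/v3"

def vt_url_id(url):
    """Encode a URL as unpadded URL-safe base64 (assumed VirusTotal v3 URL identifier)."""
    encoded = base64.urlsafe_b64encode(url.encode("utf-8")).decode("ascii")
    return encoded.rstrip("=")

def vt_report_endpoint(url):
    """Build the endpoint that would be queried for an existing URL report."""
    return f"{VT_BASE_URL}/urls/{vt_url_id(url)}"

endpoint = vt_report_endpoint("http://secure-account-verify.com/login")
```

A `GET` on `endpoint` with the same `x-apikey` header would then replace the submit-and-poll pair of requests when a report already exists.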