rb1337 committed on
Commit
2cc7f91
·
verified ·
1 Parent(s): 539c8ef

Upload 50 files

Browse files
Files changed (50) hide show
  1. README.md +115 -12
  2. requirements.txt +43 -43
  3. scripts/__pycache__/extract_combined_features.cpython-313.pyc +0 -0
  4. scripts/data_collection/crawl_tranco_subpages.py +199 -0
  5. scripts/data_collection/download_html.py +637 -0
  6. scripts/data_collection/download_legitimate_html.py +286 -0
  7. scripts/feature_extraction/__pycache__/extract_combined_features.cpython-313.pyc +0 -0
  8. scripts/feature_extraction/__pycache__/html_features.cpython-313.pyc +0 -0
  9. scripts/feature_extraction/__pycache__/url_features.cpython-313.pyc +0 -0
  10. scripts/feature_extraction/__pycache__/url_features_optimized.cpython-313.pyc +0 -0
  11. scripts/feature_extraction/__pycache__/url_features_v2.cpython-313.pyc +0 -0
  12. scripts/feature_extraction/__pycache__/url_features_v3.cpython-313.pyc +0 -0
  13. scripts/feature_extraction/extract_combined_features.py +347 -0
  14. scripts/feature_extraction/html/__pycache__/feature_engineering.cpython-313.pyc +0 -0
  15. scripts/feature_extraction/html/__pycache__/html_feature_extractor.cpython-313.pyc +0 -0
  16. scripts/feature_extraction/html/extract_features.py +322 -0
  17. scripts/feature_extraction/html/feature_engineering.py +127 -0
  18. scripts/feature_extraction/html/html_feature_extractor.py +510 -0
  19. scripts/feature_extraction/html/v1/__pycache__/html_features.cpython-313.pyc +0 -0
  20. scripts/feature_extraction/html/v1/extract_html_features_simple.py +305 -0
  21. scripts/feature_extraction/html/v1/html_features.py +382 -0
  22. scripts/feature_extraction/url/__pycache__/url_features_v3.cpython-313.pyc +0 -0
  23. scripts/feature_extraction/url/url_features_diagnostic.py +51 -0
  24. scripts/feature_extraction/url/url_features_v1.py +626 -0
  25. scripts/feature_extraction/url/url_features_v2.py +1396 -0
  26. scripts/feature_extraction/url/url_features_v3.py +866 -0
  27. scripts/phishing_analysis/analysis.py +144 -0
  28. scripts/phishing_analysis/phishing_analysis.py +85 -0
  29. scripts/phishing_analysis/phishing_type_analysis.csv +0 -0
  30. scripts/predict_combined.py +274 -0
  31. scripts/predict_html.py +303 -0
  32. scripts/predict_url.py +367 -0
  33. scripts/predict_url_cnn.py +332 -0
  34. scripts/testing/data_leakage_test.py +291 -0
  35. scripts/testing/test_feature_alignment.py +123 -0
  36. scripts/testing/test_normalization.py +105 -0
  37. scripts/testing/test_server.py +255 -0
  38. scripts/utils/analyze_dataset.py +42 -0
  39. scripts/utils/balance_dataset.py +30 -0
  40. scripts/utils/clean_urls.py +49 -0
  41. scripts/utils/merge_datasets.py +19 -0
  42. scripts/utils/remove_duplicates.py +23 -0
  43. server/__pycache__/app.cpython-313.pyc +0 -0
  44. server/app.py +819 -0
  45. server/static/index.html +50 -0
  46. server/static/models.html +1130 -0
  47. server/static/script.js +509 -0
  48. server/static/style.css +1325 -0
  49. start_server.bat +37 -0
  50. start_server.sh +35 -0
README.md CHANGED
@@ -1,12 +1,115 @@
1
- ---
2
- title: Phishing Detection System
3
- emoji: 🚀
4
- colorFrom: pink
5
- colorTo: purple
6
- sdk: docker
7
- pinned: false
8
- license: mit
9
- short_description: Phishing-Detection-System
10
- ---
11
-
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Phishing Detection System
2
+
3
+ Machine learning system for detecting phishing websites using URL features and classical ML algorithms.
4
+
5
+ ## Features
6
+
7
+ - **URL Feature Extraction**: Fast analysis of URL structure, lexical patterns, and security indicators
8
+ - **Multiple ML Models**: Logistic Regression, Random Forest, XGBoost
9
+ - **Interactive Prediction**: Test any URL with trained models
10
+ - **Data Collection**: Scripts for downloading phishing and legitimate URL datasets
11
+
12
+ ## Quick Start
13
+
14
+ ### 1. Clone the Repository
15
+
16
+ ```bash
17
+ git clone <your-repo-url>
18
+ cd src
19
+ ```
20
+
21
+ ### 2. Create Virtual Environment (Windows)
22
+
23
+ ```powershell
24
+ python -m venv venv
25
+ .\venv\Scripts\Activate.ps1
26
+ ```
27
+
28
+ ### 3. Install Dependencies
29
+
30
+ ```bash
31
+ pip install --upgrade pip
32
+ pip install -r requirements.txt
33
+ ```
34
+
35
+ Or install in development mode:
36
+ ```bash
37
+ pip install -e .
38
+ ```
39
+
40
+ ## Usage
41
+ Windows
42
+ ```bash
43
+ ./start_server.bat
44
+ ```
45
+ Linux/Mac
46
+ ```bash
47
+ chmod +x start_server.sh
48
+ ./start_server.sh
49
+ ```
50
+
51
+ Start ngrok to expose local server:
52
+ ```bash
53
+ ngrok http 8000
54
+ ```
55
+
56
+ ### Extract URL Features
57
+
58
+ ```bash
59
+ python scripts/feature_extraction/url_features.py
60
+ ```
61
+
62
+ ### Extract HTML Features
63
+
64
+ ```bash
65
+ python scripts/extract_html_features_simple.py
66
+ ```
67
+
68
+ ### Train URL Models (not required, files are pre-trained)
69
+
70
+ **Logistic Regression (Baseline):**
71
+ ```bash
72
+ python models/baseline/logistic_regression.py
73
+ ```
74
+
75
+ **Random Forest:**
76
+ ```bash
77
+ python models/classical/random_forest.py
78
+ ```
79
+
80
+ **XGBoost:**
81
+ ```bash
82
+ python models/classical/xgboost.py
83
+ ```
84
+
85
+ ### Train HTML Models
86
+
87
+ **XGBoost:**
88
+ ```bash
89
+ python models/html_enhanced/xgboost_html.py
90
+ ```
91
+
92
+ **Random Forest:**
93
+ ```bash
94
+ python models/html_enhanced/random_forest_html_optimalized.py
95
+ ```
96
+
97
+ ### Predict URLs with Trained Models
98
+
99
+ ```bash
100
+ ./start_server.bat
101
+ ```
102
+
103
+ ## Models Performance
104
+
105
+ Results are saved in `results/reports/`
106
+
107
+ ## Dataset Sources
108
+
109
+ - **PhishTank**: Verified phishing URLs database
110
+ - **Majestic Million**: Top 1M websites (legitimate)
111
+ - **Kaggle**: Phishing datasets
112
+
113
+ ## Author
114
+
115
+ Robert Smrek
requirements.txt CHANGED
@@ -1,43 +1,43 @@
1
- # Core Data Science Libraries
2
- numpy>=1.24.0
3
- pandas>=2.0.0
4
- scipy>=1.10.0
5
-
6
- # Machine Learning
7
- scikit-learn>=1.3.0
8
- xgboost>=2.0.0
9
- optuna
10
- tensorflow
11
-
12
- # Web Scraping & URL Analysis
13
- beautifulsoup4>=4.12.0
14
- lxml>=4.9.0
15
- requests>=2.31.0
16
- urllib3>=2.0.0
17
- tldextract>=3.4.0
18
-
19
- # Data Visualization
20
- matplotlib>=3.7.0
21
- seaborn>=0.12.0
22
-
23
- # Progress & Utilities
24
- tqdm>=4.65.0
25
- joblib>=1.3.0
26
- colorama>=0.4.6
27
-
28
- # Jupyter & Notebooks (optional)
29
- jupyter>=1.0.0
30
- ipykernel>=6.23.0
31
- notebook>=6.5.0
32
-
33
- # Testing (optional)
34
- pytest>=7.4.0
35
- pytest-cov>=4.1.0
36
-
37
- # Web Framework
38
- fastapi==0.109.0
39
- uvicorn[standard]==0.27.0
40
- python-multipart==0.0.6
41
-
42
- # CORS
43
- python-dotenv==1.0.0
 
1
+ # Core Data Science Libraries
2
+ numpy>=1.24.0
3
+ pandas>=2.0.0
4
+ scipy>=1.10.0
5
+
6
+ # Machine Learning
7
+ scikit-learn>=1.3.0
8
+ xgboost>=2.0.0
9
+ optuna
10
+ tensorflow
11
+
12
+ # Web Scraping & URL Analysis
13
+ beautifulsoup4>=4.12.0
14
+ lxml>=4.9.0
15
+ requests>=2.31.0
16
+ urllib3>=2.0.0
17
+ tldextract>=3.4.0
18
+
19
+ # Data Visualization
20
+ matplotlib>=3.7.0
21
+ seaborn>=0.12.0
22
+
23
+ # Progress & Utilities
24
+ tqdm>=4.65.0
25
+ joblib>=1.3.0
26
+ colorama>=0.4.6
27
+
28
+ # Jupyter & Notebooks (optional)
29
+ jupyter>=1.0.0
30
+ ipykernel>=6.23.0
31
+ notebook>=6.5.0
32
+
33
+ # Testing (optional)
34
+ pytest>=7.4.0
35
+ pytest-cov>=4.1.0
36
+
37
+ # Web Framework
38
+ fastapi==0.109.0
39
+ uvicorn[standard]==0.27.0
40
+ python-multipart==0.0.6
41
+
42
+ # Environment variables
43
+ python-dotenv==1.0.0
scripts/__pycache__/extract_combined_features.cpython-313.pyc ADDED
Binary file (17 kB). View file
 
scripts/data_collection/crawl_tranco_subpages.py ADDED
@@ -0,0 +1,199 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ #!/usr/bin/env python3
2
+ """
3
+ Script to crawl subpages from Tranco URLs:
4
+ - Reads URLs from tranco_processed.csv
5
+ - Crawls each domain to find up to 10 subpages
6
+ - Creates new dataset with subpage URLs and label 0
7
+ """
8
+
9
+ import pandas as pd
10
+ import requests
11
+ from bs4 import BeautifulSoup
12
+ from urllib.parse import urljoin, urlparse
13
+ import time
14
+ import os
15
+ from tqdm import tqdm
16
+ import logging
17
+ from concurrent.futures import ThreadPoolExecutor, as_completed
18
+ import threading
19
+
20
+ # Setup logging
21
+ logging.basicConfig(
22
+ level=logging.INFO,
23
+ format='%(asctime)s - %(levelname)s - %(message)s'
24
+ )
25
+ logger = logging.getLogger(__name__)
26
+
27
def get_domain(url):
    """Return the origin (scheme + network location) of *url*."""
    parts = urlparse(url)
    return "{}://{}".format(parts.scheme, parts.netloc)
31
+
32
def is_same_domain(url, base_url):
    """Return True when *url* and *base_url* share the same network location."""
    candidate_host = urlparse(url).netloc
    base_host = urlparse(base_url).netloc
    return candidate_host == base_host
35
+
36
def crawl_subpages(base_url, max_subpages=10, timeout=10):
    """
    Collect up to *max_subpages* same-domain links found on *base_url*.

    Args:
        base_url: Page whose anchor tags are harvested
        max_subpages: Upper bound on the number of URLs returned
        timeout: Per-request timeout in seconds

    Returns:
        List of absolute, fragment-stripped subpage URLs; empty list on any
        download or parsing failure (errors are logged, never raised).
    """
    found = set()
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    try:
        # Fetch the landing page, following redirects.
        response = requests.get(base_url, headers=headers, timeout=timeout, allow_redirects=True)
        response.raise_for_status()

        soup = BeautifulSoup(response.content, 'html.parser')

        for anchor in soup.find_all('a', href=True):
            if len(found) >= max_subpages:
                break

            # Resolve relative links against the base URL.
            absolute = urljoin(base_url, str(anchor['href']))

            # Keep only links that stay on the same host.
            if not is_same_domain(absolute, base_url):
                continue

            # Strip the fragment but preserve the query string.
            parts = urlparse(absolute)
            cleaned = f"{parts.scheme}://{parts.netloc}{parts.path}"
            if parts.query:
                cleaned += f"?{parts.query}"

            # Skip the page itself and anything already collected.
            if cleaned != base_url and cleaned not in found:
                found.add(cleaned)

        return list(found)[:max_subpages]

    except requests.exceptions.Timeout:
        logger.warning(f"Timeout while crawling {base_url}")
        return []
    except requests.exceptions.RequestException as e:
        logger.warning(f"Error crawling {base_url}: {str(e)}")
        return []
    except Exception as e:
        logger.warning(f"Unexpected error crawling {base_url}: {str(e)}")
        return []
96
+
97
def crawl_dataset(input_file, output_file, max_subpages_per_url=10, max_urls=None, delay=1, num_threads=10):
    """
    Crawl all URLs in a dataset to find same-domain subpages.

    Args:
        input_file: Path to input CSV file (must contain a 'url' column)
        output_file: Path to output CSV file
        max_subpages_per_url: Maximum subpages to collect per URL
        max_urls: Maximum number of URLs to process (None for all)
        delay: Delay between requests in seconds (politeness)
        num_threads: Number of concurrent threads for crawling

    Returns:
        DataFrame with one row per discovered subpage: 'url' and 'label' (0 = legitimate).
    """
    # Read input file
    logger.info(f"Reading {input_file}...")
    df = pd.read_csv(input_file)

    if max_urls:
        df = df.head(max_urls)
        logger.info(f"Processing first {max_urls} URLs")

    logger.info(f"Dataset contains {len(df)} URLs")
    logger.info(f"Using {num_threads} threads for concurrent crawling")

    # Collect all subpages
    all_subpages = []
    lock = threading.Lock()

    def process_url(row):
        """Crawl one base URL and return its result rows; sleeps to be polite."""
        base_url = row['url']
        logger.info(f"Crawling {base_url}...")

        subpages = crawl_subpages(base_url, max_subpages=max_subpages_per_url)

        results = []
        if subpages:
            logger.info(f"Found {len(subpages)} subpages for {base_url}")
            for subpage in subpages:
                results.append({
                    'url': subpage,
                    'label': 0,  # Legitimate
                })
        else:
            logger.warning(f"No subpages found for {base_url}")

        # Delay to be respectful to servers
        time.sleep(delay)
        return results

    # Use ThreadPoolExecutor for concurrent crawling
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        # Submit all tasks
        future_to_url = {executor.submit(process_url, row): row['url']
                         for _, row in df.iterrows()}

        # Process completed tasks with progress bar
        with tqdm(total=len(df), desc="Crawling URLs") as pbar:
            for future in as_completed(future_to_url):
                try:
                    results = future.result()
                    with lock:
                        all_subpages.extend(results)
                except Exception as e:
                    url = future_to_url[future]
                    logger.error(f"Error processing {url}: {str(e)}")
                finally:
                    pbar.update(1)

    # Pin the schema explicitly so the output CSV always has the expected
    # headers, even when no subpages were collected at all.
    result_df = pd.DataFrame(all_subpages, columns=['url', 'label'])

    logger.info(f"\nTotal subpages collected: {len(result_df)}")
    logger.info(f"Saving to {output_file}...")

    # Save to CSV
    result_df.to_csv(output_file, index=False)

    logger.info("Crawling complete!")
    logger.info(f"\nFirst few rows:\n{result_df.head(10)}")
    logger.info(f"\nDataset statistics:")
    logger.info(f"Total URLs: {len(result_df)}")
    # BUG FIX: the original unconditionally indexed result_df['source_url'],
    # but that key is commented out of the result rows above, so every run
    # ended with a KeyError here. Only report it when the column exists.
    if 'source_url' in result_df.columns:
        logger.info(f"Unique source domains: {result_df['source_url'].nunique()}")

    return result_df
182
+
183
if __name__ == "__main__":
    # Define paths
    # NOTE(review): this script lives in scripts/data_collection/, so
    # project_root resolves to the scripts/ directory — confirm that
    # data/raw/ actually sits under it, otherwise these paths are wrong.
    script_dir = os.path.dirname(os.path.abspath(__file__))
    project_root = os.path.dirname(script_dir)
    input_file = os.path.join(project_root, 'data', 'raw', 'tranco_processed2.csv')
    output_file = os.path.join(project_root, 'data', 'raw', 'tranco_subpages2.csv')

    # Crawl dataset
    # Process first 100 URLs for testing (remove max_urls=100 to process all)
    crawl_dataset(
        input_file=input_file,
        output_file=output_file,
        max_subpages_per_url=10,
        # max_urls=100,
        delay=1,
        num_threads=10  # Adjust based on your needs (10-20 is usually good)
    )
scripts/data_collection/download_html.py ADDED
@@ -0,0 +1,637 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Download HTML Content from Verified Online Phishing URLs
3
+
4
+ This script downloads HTML content from phishing URLs that are verified and online.
5
+ Saves HTML files for later feature extraction.
6
+ """
7
+
8
+ import pandas as pd
9
+ import requests
10
+ from requests.adapters import HTTPAdapter
11
+ from urllib3.util.retry import Retry
12
+ from pathlib import Path
13
+ from concurrent.futures import ThreadPoolExecutor, as_completed
14
+ from tqdm import tqdm
15
+ import time
16
+ import hashlib
17
+ import logging
18
+ from datetime import datetime
19
+ from bs4 import BeautifulSoup
20
+ import re
21
+ import urllib3
22
+ import random
23
+ from collections import defaultdict
24
+ from threading import Lock
25
+ import json
26
+
27
+ # Disable SSL warnings (expected when downloading phishing sites with invalid certificates)
28
+ urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
29
+
30
+ # Setup logging
31
+ logging.basicConfig(
32
+ level=logging.INFO,
33
+ format='%(asctime)s - %(levelname)s - %(message)s',
34
+ datefmt='%H:%M:%S'
35
+ )
36
+ logger = logging.getLogger("html_downloader")
37
+
38
+
39
class HTMLDownloader:
    """Optimized HTML downloader with retry, checkpointing, and rate limiting."""

    def __init__(self, output_dir='data/html', max_workers=20, timeout=8, checkpoint_interval=100):
        """
        Initialize optimized HTML downloader.

        Creates the output directory tree on disk and loads any existing
        checkpoint file so interrupted runs can resume.

        Args:
            output_dir: Base directory to save HTML files
            max_workers: Number of parallel download threads (increased to 20)
            timeout: Request timeout in seconds (reduced to 8s for faster failure)
            checkpoint_interval: Save progress every N URLs
        """
        self.output_dir = Path(output_dir)
        # Downloads are split by label: 0 -> legitimate/, 1 -> phishing/
        self.legit_dir = self.output_dir / 'legitimate'
        self.phishing_dir = self.output_dir / 'phishing'
        self.legit_dir.mkdir(parents=True, exist_ok=True)
        self.phishing_dir.mkdir(parents=True, exist_ok=True)
        self.max_workers = max_workers
        self.timeout = timeout
        self.checkpoint_interval = checkpoint_interval

        # Stats
        # NOTE(review): these counters are incremented from worker threads
        # without a lock; CPython makes that mostly benign but counts may
        # drift slightly under contention — confirm acceptable.
        self.stats = {
            'total': 0,
            'success': 0,
            'failed': 0,
            'timeout': 0,
            'error': 0,
            'retried': 0,
            'http_fallback': 0
        }

        # User agents rotation (avoid blocks)
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:122.0) Gecko/20100101 Firefox/122.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Edge/120.0.0.0',
        ]

        # Domain rate limiting (delay per domain)
        self.domain_last_access = defaultdict(float)
        self.domain_lock = Lock()
        self.min_domain_delay = 0.5  # 500ms between requests to same domain

        # Session pool for connection reuse
        # One session per worker; download_batch assigns them round-robin.
        self.sessions = []
        for _ in range(max_workers):
            session = self._create_session()
            self.sessions.append(session)

        # Checkpoint file
        self.checkpoint_file = self.output_dir / 'download_checkpoint.json'
        self.completed_urls = self._load_checkpoint()
95
+
96
    def _create_session(self):
        """Create optimized requests session with retry and compression.

        Returns:
            A requests.Session with an HTTPAdapter mounted on both schemes,
            automatic retries on transient HTTP errors, and default headers
            advertising compression support.
        """
        session = requests.Session()

        # Retry strategy: 3 retries with exponential backoff
        retry_strategy = Retry(
            total=3,
            backoff_factor=0.5,  # 0.5s, 1s, 2s
            status_forcelist=[429, 500, 502, 503, 504],
            allowed_methods=["GET", "HEAD"]
        )

        # Large connection pool: shared across many target hosts.
        adapter = HTTPAdapter(
            max_retries=retry_strategy,
            pool_connections=100,
            pool_maxsize=100,
            pool_block=False
        )

        session.mount("http://", adapter)
        session.mount("https://", adapter)

        # Enable compression
        session.headers.update({
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.9',
            'Connection': 'keep-alive',
        })

        return session
127
+
128
+ def _get_random_user_agent(self):
129
+ """Get random user agent to avoid detection."""
130
+ return random.choice(self.user_agents)
131
+
132
+ def _load_checkpoint(self):
133
+ """Load checkpoint of already downloaded URLs."""
134
+ if self.checkpoint_file.exists():
135
+ try:
136
+ with open(self.checkpoint_file, 'r') as f:
137
+ data = json.load(f)
138
+ completed = set(data.get('completed_urls', []))
139
+ logger.info(f"Loaded checkpoint: {len(completed):,} URLs already downloaded")
140
+ return completed
141
+ except Exception as e:
142
+ logger.warning(f"Failed to load checkpoint: {e}")
143
+ return set()
144
+
145
+ def _save_checkpoint(self, results):
146
+ """Save checkpoint of completed URLs."""
147
+ try:
148
+ completed = [r['url'] for r in results if r['status'] == 'success']
149
+ self.completed_urls.update(completed)
150
+
151
+ with open(self.checkpoint_file, 'w') as f:
152
+ json.dump({
153
+ 'completed_urls': list(self.completed_urls),
154
+ 'timestamp': datetime.now().isoformat(),
155
+ 'total_completed': len(self.completed_urls)
156
+ }, f)
157
+ except Exception as e:
158
+ logger.warning(f"Failed to save checkpoint: {e}")
159
+
160
+ def _rate_limit_domain(self, url):
161
+ """Apply per-domain rate limiting."""
162
+ try:
163
+ from urllib.parse import urlparse
164
+ domain = urlparse(url).netloc
165
+
166
+ with self.domain_lock:
167
+ last_access = self.domain_last_access[domain]
168
+ now = time.time()
169
+ time_since_last = now - last_access
170
+
171
+ if time_since_last < self.min_domain_delay:
172
+ sleep_time = self.min_domain_delay - time_since_last
173
+ time.sleep(sleep_time)
174
+
175
+ self.domain_last_access[domain] = time.time()
176
+ except:
177
+ pass # If rate limiting fails, continue anyway
178
+
179
+ def _url_to_filename(self, url):
180
+ """Convert URL to safe filename using hash."""
181
+ url_hash = hashlib.md5(url.encode()).hexdigest()
182
+ return f"{url_hash}.html"
183
+
184
    def _optimize_html(self, html_content):
        """
        Aggressively optimize HTML for feature extraction.

        Removes unnecessary content while preserving structure:
        - Comments, excessive whitespace
        - Inline styles (keeps style tags for counting)
        - Large script/style content (keeps tags for counting)
        - Base64 embedded images (huge size, not needed for features)

        Args:
            html_content: Raw HTML content

        Returns:
            Optimized HTML string (typically 60-80% smaller)
        """
        try:
            # Quick regex cleanup before parsing (faster than BeautifulSoup for some tasks)
            # Remove HTML comments
            html_content = re.sub(r'<!--.*?-->', '', html_content, flags=re.DOTALL)

            # Remove base64 embedded images (can be huge, not needed for features)
            html_content = re.sub(r'data:image/[^;]+;base64,[A-Za-z0-9+/=]+', 'data:image', html_content)

            # Parse HTML (use lxml parser if available, it's faster)
            try:
                soup = BeautifulSoup(html_content, 'lxml')
            except:
                soup = BeautifulSoup(html_content, 'html.parser')

            # Remove inline styles (but keep style tags for counting)
            for tag in soup.find_all(style=True):
                del tag['style']

            # Truncate large script/style content (keep tags for counting, trim content)
            for script in soup.find_all('script'):
                if script.string and len(script.string) > 500:
                    script.string = script.string[:500] + '...'

            for style in soup.find_all('style'):
                if style.string and len(style.string) > 500:
                    style.string = style.string[:500] + '...'

            # Normalize whitespace in text nodes
            # NOTE(review): whitespace-only text nodes are left in place
            # (replace_with is skipped when `normalized` is empty) — confirm
            # intended, since they still count toward document size.
            for text in soup.find_all(string=True):
                if text.parent.name not in ['script', 'style']:  # type: ignore
                    normalized = re.sub(r'\s+', ' ', str(text).strip())
                    if normalized:
                        text.replace_with(normalized)

            # Convert back to string
            optimized = str(soup)

            # Final cleanup: remove excessive blank lines
            optimized = re.sub(r'\n\s*\n+', '\n', optimized)

            return optimized

        except Exception as e:
            logger.warning(f"HTML optimization failed: {e}, returning original")
            # Fallback: at least remove comments and excessive whitespace
            html_content = re.sub(r'<!--.*?-->', '', html_content, flags=re.DOTALL)
            html_content = re.sub(r'\n\s*\n+', '\n', html_content)
            return html_content
248
+
249
    def download_single_url(self, url, label, url_id=None, session=None):
        """
        Download HTML with retry logic and HTTP fallback.

        Tries the URL over HTTPS first (adding the scheme if missing), then
        falls back to plain HTTP. On success, the optimized HTML is written
        to the legitimate/ or phishing/ directory based on *label*.

        Args:
            url: URL to download
            label: Label (0=legitimate, 1=phishing)
            url_id: Optional ID from dataset
            session: Requests session (for connection pooling)

        Returns:
            Dictionary with download result; 'status' is one of
            'success', 'skipped', 'failed', 'timeout', or 'error'.
        """
        result = {
            'url': url,
            'label': label,
            'url_id': url_id,
            'status': 'failed',
            'error': None,
            'filename': None,
            'size': 0,
            'original_size': 0
        }

        # Skip if already downloaded
        if url in self.completed_urls:
            result['status'] = 'skipped'
            result['error'] = 'Already downloaded'
            return result

        # Apply rate limiting
        self._rate_limit_domain(url)

        # Use provided session or create temporary one
        if session is None:
            session = self._create_session()

        # Add scheme if missing (default HTTPS)
        # original_url is kept so the saved filename hashes the dataset's
        # exact URL string, independent of scheme rewriting below.
        original_url = url
        if not url.startswith(('http://', 'https://')):
            url = 'https://' + url

        attempts = [url]

        # If HTTPS, also try HTTP as fallback
        if url.startswith('https://'):
            http_url = url.replace('https://', 'http://', 1)
            attempts.append(http_url)

        # Try each URL variant
        for attempt_num, attempt_url in enumerate(attempts):
            try:
                # Random user agent for each attempt
                headers = {'User-Agent': self._get_random_user_agent()}

                # Download with timeout and retries (handled by session)
                response = session.get(
                    attempt_url,
                    headers=headers,
                    timeout=(3, self.timeout),  # (connect timeout, read timeout)
                    allow_redirects=True,
                    verify=False,  # Phishing sites often have invalid SSL
                    stream=False  # We need full content
                )

                # Check if successful
                if response.status_code == 200:
                    # Check content type (skip if not HTML)
                    # NOTE(review): when the *last* attempt fails these two
                    # content checks, no stats counter is incremented even
                    # though status stays 'failed' — confirm intended.
                    content_type = response.headers.get('Content-Type', '')
                    if 'text/html' not in content_type.lower() and 'application/xhtml' not in content_type.lower():
                        result['status'] = 'failed'
                        result['error'] = f'Non-HTML content: {content_type}'
                        continue

                    # Get HTML content
                    html_content = response.text
                    result['original_size'] = len(html_content)

                    # Skip if too small (likely error page)
                    if len(html_content) < 200:
                        result['status'] = 'failed'
                        result['error'] = 'HTML too small (< 200 bytes)'
                        continue

                    # Optimize HTML for feature extraction
                    optimized_html = self._optimize_html(html_content)

                    # Save to appropriate directory
                    filename = self._url_to_filename(original_url)
                    target_dir = self.legit_dir if label == 0 else self.phishing_dir
                    filepath = target_dir / filename

                    with open(filepath, 'w', encoding='utf-8', errors='ignore') as f:
                        f.write(optimized_html)

                    result['status'] = 'success'
                    result['filename'] = filename
                    result['size'] = len(optimized_html)
                    result['target_dir'] = str(target_dir.name)
                    result['compression_ratio'] = f"{(1 - len(optimized_html) / max(result['original_size'], 1)) * 100:.1f}%"

                    if attempt_num > 0:
                        result['http_fallback'] = True
                        self.stats['http_fallback'] += 1

                    self.stats['success'] += 1
                    return result  # Success!

                else:
                    result['error'] = f"HTTP {response.status_code}"
                    if attempt_num == len(attempts) - 1:  # Last attempt
                        result['status'] = 'failed'
                        self.stats['failed'] += 1

            except requests.Timeout:
                result['error'] = 'Timeout'
                if attempt_num == len(attempts) - 1:
                    result['status'] = 'timeout'
                    self.stats['timeout'] += 1

            except requests.RequestException as e:
                result['error'] = f"{type(e).__name__}: {str(e)[:80]}"
                if attempt_num == len(attempts) - 1:
                    result['status'] = 'error'
                    self.stats['error'] += 1

            except Exception as e:
                result['error'] = f"Unknown: {str(e)[:80]}"
                if attempt_num == len(attempts) - 1:
                    result['status'] = 'error'
                    self.stats['error'] += 1

        return result
382
+
383
    def download_batch(self, urls_df, label_column='label', id_column=None, resume=True):
        """
        Download HTML with checkpointing and session pooling.

        URLs already recorded in the checkpoint are filtered out up front
        when *resume* is True; progress is checkpointed every
        self.checkpoint_interval completed downloads.

        Args:
            urls_df: DataFrame with URLs (column 'url' or 'URL')
            label_column: Column name for labels (rows default to 1/phishing
                when the column is absent)
            id_column: Optional column name for IDs (falls back to the row index)
            resume: Resume from checkpoint if available

        Returns:
            DataFrame with one result dict per attempted URL.
        """
        self.stats['total'] = len(urls_df)

        # Filter already downloaded URLs if resuming
        if resume and self.completed_urls:
            url_column = 'url' if 'url' in urls_df.columns else 'URL'
            urls_df = urls_df[~urls_df[url_column].isin(self.completed_urls)].copy()
            skipped = self.stats['total'] - len(urls_df)
            if skipped > 0:
                logger.info(f"Resuming: {skipped:,} URLs already downloaded, {len(urls_df):,} remaining")

        logger.info(f"Starting optimized download of {len(urls_df):,} URLs...")
        logger.info(f"Workers: {self.max_workers} | Timeout: {self.timeout}s | Checkpoint: every {self.checkpoint_interval} URLs")
        logger.info(f"Output: {self.output_dir.absolute()}")
        logger.info(f"Features: Session pooling, retry logic, HTTP fallback, rate limiting, compression")

        results = []
        session_idx = 0

        # Prepare tasks
        tasks = []
        for idx, row in urls_df.iterrows():
            url = row['url'] if 'url' in row else row['URL']
            label = row[label_column] if label_column in row else 1
            url_id = row[id_column] if id_column and id_column in row else idx
            tasks.append((url, label, url_id))

        # Download in parallel with progress bar and checkpointing
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            # Submit tasks with session pooling
            future_to_task = {}
            for url, label, url_id in tasks:
                # Round-robin session assignment
                session = self.sessions[session_idx % len(self.sessions)]
                session_idx += 1

                future = executor.submit(self.download_single_url, url, label, url_id, session)
                future_to_task[future] = (url, label, url_id)

            # Process completed tasks with progress bar
            # Note: only this (main) thread appends to `results`, so no lock
            # is needed around it.
            with tqdm(total=len(tasks), desc="Downloading", unit="url") as pbar:
                checkpoint_counter = 0

                for future in as_completed(future_to_task):
                    result = future.result()
                    results.append(result)
                    pbar.update(1)

                    checkpoint_counter += 1

                    # Save checkpoint periodically
                    if checkpoint_counter >= self.checkpoint_interval:
                        self._save_checkpoint(results)
                        checkpoint_counter = 0

                    # Update progress bar with detailed stats
                    pbar.set_postfix({
                        'OK': self.stats['success'],
                        'Fail': self.stats['failed'],
                        'Timeout': self.stats['timeout'],
                        'HTTP↓': self.stats['http_fallback']
                    })

        # Final checkpoint save
        self._save_checkpoint(results)

        # Create results DataFrame
        results_df = pd.DataFrame(results)

        # Print summary
        self._print_summary(results_df)

        return results_df
468
+
469
+ def _print_summary(self, results_df):
470
+ """Print detailed download summary with optimization metrics."""
471
+ logger.info("\n" + "="*80)
472
+ logger.info("DOWNLOAD SUMMARY")
473
+ logger.info("="*80)
474
+
475
+ total = self.stats['total']
476
+ success = self.stats['success']
477
+
478
+ logger.info(f"\nTotal URLs processed: {total:,}")
479
+ logger.info(f" ✓ Successful: {success:,} ({success/max(total,1)*100:.1f}%)")
480
+ logger.info(f" ✗ Failed: {self.stats['failed']:,}")
481
+ logger.info(f" ⏱ Timeout: {self.stats['timeout']:,}")
482
+ logger.info(f" ⚠ Error: {self.stats['error']:,}")
483
+ logger.info(f" ↓ HTTP Fallback: {self.stats['http_fallback']:,}")
484
+
485
+ # Detailed stats if we have results
486
+ if not results_df.empty and 'status' in results_df.columns:
487
+ # Success by label
488
+ if 'label' in results_df.columns:
489
+ success_by_label = results_df[results_df['status'] == 'success'].groupby('label').size()
490
+ if not success_by_label.empty:
491
+ logger.info(f"\nSuccessful downloads by type:")
492
+ for label, count in success_by_label.items():
493
+ label_name = 'Phishing' if label == 1 else 'Legitimate'
494
+ logger.info(f" {label_name}: {count:,}")
495
+
496
+ # Size statistics
497
+ successful = results_df[results_df['status'] == 'success']
498
+ if not successful.empty and 'size' in successful.columns:
499
+ total_optimized = successful['size'].sum()
500
+ total_original = successful.get('original_size', successful['size']).sum()
501
+
502
+ logger.info(f"\nStorage statistics:")
503
+ logger.info(f" Original size: {total_original/1024/1024:.2f} MB")
504
+ logger.info(f" Optimized size: {total_optimized/1024/1024:.2f} MB")
505
+ if total_original > 0:
506
+ saved = (1 - total_optimized / total_original) * 100
507
+ logger.info(f" Space saved: {saved:.1f}%")
508
+
509
+ # Error breakdown
510
+ failed = results_df[results_df['status'] != 'success']
511
+ if not failed.empty and 'error' in failed.columns:
512
+ error_counts = failed['error'].value_counts().head(5)
513
+ if not error_counts.empty:
514
+ logger.info(f"\nTop failure reasons:")
515
+ for error, count in error_counts.items():
516
+ logger.info(f" {error}: {count:,}")
517
+
518
+ logger.info("="*80)
519
+
520
+
521
def main():
    """CLI entry point: download HTML content for labeled URLs.

    Loads a CSV of URLs (requires a url/URL column and a label column),
    optionally balances the classes and/or limits the row count, runs the
    optimized batch downloader with checkpointing, then writes a per-URL
    results CSV plus a url→filename metadata CSV to the output directory.
    """
    import argparse

    parser = argparse.ArgumentParser(description='Download HTML content from URLs and organize by label')
    parser.add_argument('--input', type=str, default='data/processed/clean_dataset.csv',
                        help='Input CSV file with URLs (must have url,label,type columns)')
    parser.add_argument('--output', type=str, default='data/html',
                        help='Base output directory (will create legitimate/ and phishing/ subdirectories)')
    parser.add_argument('--workers', type=int, default=20,
                        help='Number of parallel download workers (default: 20)')
    parser.add_argument('--timeout', type=int, default=8,
                        help='Request timeout in seconds (default: 8s)')
    parser.add_argument('--checkpoint', type=int, default=100,
                        help='Save progress every N URLs (default: 100)')
    parser.add_argument('--resume', action='store_true', default=True,
                        help='Resume from checkpoint (default: True)')
    parser.add_argument('--no-resume', dest='resume', action='store_false',
                        help='Start fresh, ignore checkpoint')
    parser.add_argument('--limit', type=int, default=None,
                        help='Limit number of URLs to download (for testing)')
    parser.add_argument('--balance', action='store_true',
                        help='Download equal number of legitimate and phishing URLs')

    args = parser.parse_args()

    logger.info("="*80)
    logger.info("HTML CONTENT DOWNLOADER - Phishing Detection")
    logger.info("="*80)

    # Resolve input relative to the project root (this script lives three
    # directory levels below it: scripts/data_collection/<file>.py)
    script_dir = Path(__file__).parent.parent.parent
    input_path = (script_dir / args.input).resolve()

    logger.info(f"\nLoading URLs from: {input_path}")
    df = pd.read_csv(input_path)
    logger.info(f"Loaded: {len(df):,} URLs")

    # Show columns
    logger.info(f"Columns: {list(df.columns)}")

    # Verify required columns before doing any work
    if 'url' not in df.columns and 'URL' not in df.columns:
        logger.error("No 'url' or 'URL' column found in dataset!")
        return

    if 'label' not in df.columns:
        logger.error("No 'label' column found in dataset!")
        return

    # Show label distribution
    logger.info(f"\nLabel distribution in dataset:")
    label_counts = df['label'].value_counts()
    for label, count in label_counts.items():
        label_name = 'Legitimate' if label == 0 else 'Phishing'
        logger.info(f" {label_name} (label={label}): {count:,}")

    # Balance dataset if requested: undersample both classes to the minority size
    if args.balance:
        min_count = label_counts.min()
        df_balanced = pd.concat([
            df[df['label'] == 0].sample(n=min(min_count, len(df[df['label'] == 0])), random_state=42),
            df[df['label'] == 1].sample(n=min(min_count, len(df[df['label'] == 1])), random_state=42)
        ]).sample(frac=1, random_state=42).reset_index(drop=True)
        df = df_balanced
        logger.info(f"\nBalanced dataset to {min_count:,} samples per class")
        logger.info(f"Total URLs after balancing: {len(df):,}")

    # Limit for testing
    if args.limit:
        df = df.head(args.limit)
        logger.info(f"Limited to first {args.limit:,} URLs for testing")

    # Initialize optimized downloader
    output_dir = (script_dir / args.output).resolve()
    downloader = HTMLDownloader(
        output_dir=output_dir,
        max_workers=args.workers,
        timeout=args.timeout,
        checkpoint_interval=args.checkpoint
    )

    # Download HTML content with checkpointing
    results_df = downloader.download_batch(
        df,
        label_column='label' if 'label' in df.columns else None,  # type: ignore
        id_column='phish_id' if 'phish_id' in df.columns else None,  # type: ignore
        resume=args.resume
    )

    # Save results
    results_file = output_dir / f'download_results_{datetime.now().strftime("%Y%m%d_%H%M%S")}.csv'
    results_df.to_csv(results_file, index=False)
    logger.info(f"\n✓ Results saved to: {results_file}")

    # Save metadata mapping (URL to filename).
    # BUGFIX: when no download succeeded, results_df can lack the 'status' /
    # 'filename' columns entirely and the old unconditional indexing raised
    # KeyError. Guard the metadata write the same way the sibling
    # download_legitimate_html.py script does.
    if not results_df.empty and 'status' in results_df.columns:
        successful = results_df[results_df['status'] == 'success']
        if not successful.empty:
            metadata = successful[['url', 'label', 'filename', 'url_id']]
            metadata_file = output_dir / 'html_metadata.csv'
            metadata.to_csv(metadata_file, index=False)
            logger.info(f"✓ Metadata saved to: {metadata_file}")
        else:
            logger.warning("No successful downloads - metadata file not written")
    else:
        logger.warning("No successful downloads - metadata file not written")

    logger.info("\n" + "="*80)
    logger.info("✓ HTML DOWNLOAD COMPLETE!")
    logger.info("="*80)
    logger.info(f"\nFiles saved to:")
    logger.info(f" Legitimate: {output_dir / 'legitimate'}")
    logger.info(f" Phishing: {output_dir / 'phishing'}")
    logger.info(f"\nHTML files have been optimized for feature extraction:")
    logger.info(f" - Comments removed")
    logger.info(f" - Whitespace normalized")
    logger.info(f" - Inline styles removed")
    logger.info(f" - Structure preserved for feature extraction")
    logger.info("="*80)


if __name__ == "__main__":
    main()
scripts/data_collection/download_legitimate_html.py ADDED
@@ -0,0 +1,286 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Download HTML Content from Legitimate URLs
3
+ Downloads HTML from top-1m legitimate URLs for training
4
+ """
5
+ import pandas as pd
6
+ from pathlib import Path
7
+ import requests
8
+ import hashlib
9
+ import logging
10
+ from concurrent.futures import ThreadPoolExecutor, as_completed
11
+ from tqdm import tqdm
12
+ from datetime import datetime
13
+ import warnings
14
+
15
# Disable SSL warnings: pages are fetched with verify=False further below,
# so urllib3 would otherwise spam one warning per request
warnings.filterwarnings('ignore', message='Unverified HTTPS request')

# Setup logging: timestamped INFO-level output for progress/summary messages
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%H:%M:%S'
)
logger = logging.getLogger(__name__)
25
+
26
+
27
class LegitimateHTMLDownloader:
    """Download HTML content from legitimate URLs.

    Each page is saved as <md5(url)>.html under *output_dir*. Running
    counters in ``self.stats`` are updated from worker threads, so all
    increments go through a lock (a plain ``dict[key] += 1`` is a
    read-modify-write and is not atomic across threads).
    """

    def __init__(self, output_dir='data/html_legitimate', max_workers=10, timeout=10):
        """
        Args:
            output_dir: Directory that will receive the .html files (created
                if missing).
            max_workers: Number of parallel download threads.
            timeout: Per-request timeout in seconds.
        """
        import threading  # local import: only needed for the stats lock

        self.output_dir = Path(output_dir)
        self.output_dir.mkdir(parents=True, exist_ok=True)
        self.max_workers = max_workers
        self.timeout = timeout

        # Running counters, shared across worker threads
        self.stats = {
            'total': 0,
            'success': 0,
            'failed': 0,
            'timeout': 0,
            'error': 0
        }
        # BUGFIX: download_single_url runs concurrently in a thread pool;
        # unsynchronized `+=` on the stats dict could lose increments.
        self._stats_lock = threading.Lock()

        # Headers to mimic browser
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
        }

    def _bump(self, key):
        """Thread-safely increment one stats counter."""
        with self._stats_lock:
            self.stats[key] += 1

    def _url_to_filename(self, url):
        """Convert URL to safe filename using MD5 hash."""
        url_hash = hashlib.md5(url.encode()).hexdigest()
        return f"{url_hash}.html"

    def download_single_url(self, url, url_id=None):
        """
        Download HTML from a single URL.

        Args:
            url: URL to download (scheme is prepended if missing)
            url_id: Optional ID from dataset

        Returns:
            Dictionary with download result (status/filename/size/error)
        """
        result = {
            'url': url,
            'url_id': url_id,
            'status': 'failed',
            'filename': None,
            'size': 0,
            'error': None
        }

        self._bump('total')

        try:
            # Add https:// if missing (input may be bare domains)
            if not url.startswith(('http://', 'https://')):
                url = 'https://' + url

            # Download with timeout; verify=False tolerates bad certificates
            response = requests.get(
                url,
                headers=self.headers,
                timeout=self.timeout,
                allow_redirects=True,
                verify=False
            )

            # Check if successful
            if response.status_code == 200:
                # Save HTML content under a hash-derived filename
                filename = self._url_to_filename(url)
                filepath = self.output_dir / filename

                with open(filepath, 'w', encoding='utf-8', errors='ignore') as f:
                    f.write(response.text)

                result['status'] = 'success'
                result['filename'] = filename
                result['size'] = len(response.text)
                self._bump('success')

            else:
                result['status'] = 'failed'
                result['error'] = f"HTTP {response.status_code}"
                self._bump('failed')

        except requests.Timeout:
            result['status'] = 'timeout'
            result['error'] = 'Timeout'
            self._bump('timeout')

        except requests.RequestException as e:
            result['status'] = 'error'
            result['error'] = f"Request error: {str(e)[:100]}"
            self._bump('error')

        except Exception as e:
            result['status'] = 'error'
            result['error'] = f"Unknown error: {str(e)[:100]}"
            self._bump('error')

        return result

    def download_batch(self, urls_df, id_column=None):
        """
        Download HTML content from multiple URLs in parallel.

        Args:
            urls_df: DataFrame with URLs (accepts 'URL', 'url' or 'domain'
                columns; otherwise the second column is assumed to be the URL)
            id_column: Optional column name for ID

        Returns:
            DataFrame with one result row per URL
        """
        logger.info(f"Starting download of {len(urls_df):,} URLs...")
        logger.info(f"Using {self.max_workers} parallel workers")
        logger.info(f"Timeout: {self.timeout}s per URL")
        logger.info(f"Output directory: {self.output_dir.absolute()}")

        results = []

        # Prepare tasks, tolerating several column layouts
        tasks = []
        for idx, row in urls_df.iterrows():
            # Handle different column names
            if 'URL' in row:
                url = row['URL']
            elif 'url' in row:
                url = row['url']
            elif 'domain' in row:
                url = row['domain']
            else:
                # Assume second column is URL/domain (rank,domain layout)
                url = row.iloc[1]

            url_id = row[id_column] if id_column and id_column in row else idx
            tasks.append((url, url_id))

        # Download in parallel with progress bar
        with ThreadPoolExecutor(max_workers=self.max_workers) as executor:
            # Submit all tasks
            future_to_task = {
                executor.submit(self.download_single_url, url, url_id): (url, url_id)
                for url, url_id in tasks
            }

            # Process completed tasks with progress bar
            with tqdm(total=len(tasks), desc="Downloading HTML", unit="url") as pbar:
                for future in as_completed(future_to_task):
                    result = future.result()
                    results.append(result)
                    pbar.update(1)

                    # Update progress bar description with stats
                    pbar.set_postfix({
                        'Success': self.stats['success'],
                        'Failed': self.stats['failed'] + self.stats['timeout'] + self.stats['error']
                    })

        # Create results DataFrame
        results_df = pd.DataFrame(results)

        # Print summary
        self._print_summary(results_df)

        return results_df

    def _print_summary(self, results_df):
        """Print download summary statistics."""
        logger.info("\n" + "="*80)
        logger.info("DOWNLOAD SUMMARY")
        logger.info("="*80)

        logger.info(f"\nTotal URLs processed: {self.stats['total']:,}")
        logger.info(f" ✓ Successful: {self.stats['success']:,} ({self.stats['success']/max(self.stats['total'],1)*100:.1f}%)")
        logger.info(f" ✗ Failed: {self.stats['failed']:,}")
        logger.info(f" ⏱ Timeout: {self.stats['timeout']:,}")
        logger.info(f" ⚠ Error: {self.stats['error']:,}")

        # Only show detailed stats if we have results
        if not results_df.empty and 'status' in results_df.columns:
            # Total size downloaded
            successful_downloads = results_df[results_df['status'] == 'success']
            if not successful_downloads.empty:
                total_size = successful_downloads['size'].sum()
                logger.info(f"\nTotal HTML downloaded: {total_size/1024/1024:.2f} MB")

        logger.info("="*80)
212
+
213
+
214
def main():
    """CLI entry point: download HTML from legitimate (Tranco-style) URLs.

    Loads a CSV of domains/URLs, optionally limits the count, downloads each
    page in parallel, and writes a results CSV plus a url→filename metadata
    CSV to the output directory.
    """
    import argparse

    parser = argparse.ArgumentParser(description='Download HTML content from legitimate URLs')
    parser.add_argument('--input', type=str, default='data/raw/legitimate.csv',
                        help='Input CSV file with legitimate URLs (default: top-1m.csv with 1M URLs)')
    parser.add_argument('--output', type=str, default='data/html_legitimate',
                        help='Output directory for HTML files')
    parser.add_argument('--limit', type=int, default=50000,
                        help='Number of URLs to download (default: 50000)')
    parser.add_argument('--workers', type=int, default=20,
                        help='Number of parallel workers (default: 20)')
    parser.add_argument('--timeout', type=int, default=10,
                        help='Timeout per URL in seconds (default: 10)')

    args = parser.parse_args()

    # Print header
    logger.info("="*80)
    logger.info("LEGITIMATE HTML DOWNLOADER - Phishing Detection")
    logger.info("="*80)

    # BUGFIX: this script lives in scripts/data_collection/, so the project
    # root is THREE levels up. The previous two-level hop resolved to
    # scripts/, breaking the default 'data/...' paths (the sibling
    # download_html.py already uses three parents).
    script_dir = Path(__file__).parent.parent.parent
    input_path = (script_dir / args.input).resolve()

    logger.info(f"\nLoading URLs from: {input_path}")
    df = pd.read_csv(input_path)
    logger.info(f"Loaded: {len(df):,} URLs")

    # Show columns
    logger.info(f"Columns: {list(df.columns)}")

    # Limit number of URLs
    if args.limit:
        df = df.head(args.limit)
        logger.info(f"Limited to first {args.limit:,} URLs")

    # Initialize downloader
    output_dir = (script_dir / args.output).resolve()
    downloader = LegitimateHTMLDownloader(
        output_dir=output_dir,
        max_workers=args.workers,
        timeout=args.timeout
    )

    # Download
    results_df = downloader.download_batch(
        df,
        id_column='id' if 'id' in df.columns else None
    )

    # Save results
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    results_path = output_dir / f'download_results_{timestamp}.csv'
    results_df.to_csv(results_path, index=False)
    logger.info(f"\n✓ Results saved to: {results_path}")

    # Save successful downloads metadata (guard against an empty results
    # frame, where the 'status' column would not exist at all)
    if not results_df.empty and 'status' in results_df.columns:
        successful = results_df[results_df['status'] == 'success']
        if len(successful) > 0:
            metadata_path = output_dir / 'html_metadata.csv'
            successful[['url', 'filename', 'size']].to_csv(metadata_path, index=False)
            logger.info(f"✓ Metadata saved to: {metadata_path}")

    logger.info("\n" + "="*80)
    logger.info("✓ LEGITIMATE HTML DOWNLOAD COMPLETE!")
    logger.info("="*80)


if __name__ == '__main__':
    main()
scripts/feature_extraction/__pycache__/extract_combined_features.cpython-313.pyc ADDED
Binary file (17 kB). View file
 
scripts/feature_extraction/__pycache__/html_features.cpython-313.pyc ADDED
Binary file (21.8 kB). View file
 
scripts/feature_extraction/__pycache__/url_features.cpython-313.pyc ADDED
Binary file (61.2 kB). View file
 
scripts/feature_extraction/__pycache__/url_features_optimized.cpython-313.pyc ADDED
Binary file (51 kB). View file
 
scripts/feature_extraction/__pycache__/url_features_v2.cpython-313.pyc ADDED
Binary file (61.3 kB). View file
 
scripts/feature_extraction/__pycache__/url_features_v3.cpython-313.pyc ADDED
Binary file (50.8 kB). View file
 
scripts/feature_extraction/extract_combined_features.py ADDED
@@ -0,0 +1,347 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Combined URL + HTML Feature Extraction from clean_dataset.csv
3
+
4
+ Reads URLs from clean_dataset.csv, extracts URL features and downloads HTML
5
+ to extract HTML features, combines them into a single feature dataset.
6
+ Produces a balanced combined_features.csv.
7
+
8
+ Usage:
9
+ python scripts/feature_extraction/extract_combined_features.py
10
+ python scripts/feature_extraction/extract_combined_features.py --workers 20 --timeout 15
11
+ python scripts/feature_extraction/extract_combined_features.py --limit 1000 --no-balance
12
+ """
13
+ import argparse
14
+ import logging
15
+ import random
16
+ import sys
17
+ import time
18
+ import warnings
19
+ from concurrent.futures import ThreadPoolExecutor, as_completed
20
+ from pathlib import Path
21
+ from threading import Lock
22
+
23
+ import numpy as np
24
+ import pandas as pd
25
+ import requests
26
+ import urllib3
27
+ from tqdm import tqdm
28
+
29
+ # Suppress SSL warnings (phishing sites often have invalid certs)
30
+ urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
31
+ warnings.filterwarnings('ignore', message='.*Unverified HTTPS.*')
32
+
33
+ # ---------------------------------------------------------------------------
34
+ # Project setup
35
+ # ---------------------------------------------------------------------------
36
+ PROJECT_ROOT = Path(__file__).resolve().parents[2] # src/
37
+ sys.path.insert(0, str(PROJECT_ROOT))
38
+
39
+ from scripts.feature_extraction.url.url_features_v3 import URLFeatureExtractorOptimized
40
+ from scripts.feature_extraction.html.html_feature_extractor import HTMLFeatureExtractor
41
+ from scripts.feature_extraction.html.feature_engineering import engineer_features
42
+
43
+ # ---------------------------------------------------------------------------
44
+ # Logging
45
+ # ---------------------------------------------------------------------------
46
+ logging.basicConfig(
47
+ level=logging.INFO,
48
+ format='%(asctime)s - %(levelname)s - %(message)s',
49
+ datefmt='%H:%M:%S',
50
+ )
51
+ logger = logging.getLogger('extract_combined')
52
+
53
+ # ---------------------------------------------------------------------------
54
+ # Constants
55
+ # ---------------------------------------------------------------------------
56
+ HEADERS = {
57
+ 'User-Agent': (
58
+ 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
59
+ 'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36'
60
+ ),
61
+ 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
62
+ 'Accept-Language': 'en-US,en;q=0.5',
63
+ }
64
+
65
+ CHECKPOINT_FILE = PROJECT_ROOT / 'data' / 'features' / '_combined_checkpoint.csv'
66
+
67
+
68
+ # ---------------------------------------------------------------------------
69
+ # Feature extraction for a single URL (runs in thread)
70
+ # ---------------------------------------------------------------------------
71
def extract_single(
    url: str,
    label: int,
    url_extractor: URLFeatureExtractorOptimized,
    html_extractor: HTMLFeatureExtractor,
    timeout: int = 10,
) -> dict | None:
    """
    Build the combined URL + HTML feature row for one URL.

    URL features are computed locally; HTML features require downloading the
    page. When the download or parse fails, every HTML feature column is
    zero-filled so the row keeps a uniform schema.

    Returns:
        Dict with 'url', 'label' and prefixed feature columns,
        or None when even URL feature extraction fails.
    """
    row = {'url': url, 'label': label}

    # --- 1. URL-only features: no network needed; failure aborts the row ---
    try:
        for name, value in url_extractor.extract_features(url).items():
            row[f'url_{name}'] = value
    except Exception as exc:
        logger.debug(f"URL feature error for {url}: {exc}")
        return None

    # --- 2. Fetch the page and derive engineered HTML features ---
    got_html = False
    try:
        resp = requests.get(
            url, timeout=timeout, verify=False, headers=HEADERS,
            allow_redirects=True,
        )
        # Require a real page: 200 OK and a non-trivial body
        if resp.status_code == 200 and len(resp.text) > 200:
            engineered = engineer_features(
                pd.DataFrame([html_extractor.extract_features(resp.text)])
            )
            for name, value in engineered.iloc[0].to_dict().items():
                row[f'html_{name}'] = value
            got_html = True
    except Exception:
        pass  # network/parse failures fall through to zero-fill below

    if not got_html:
        # Run an empty-document extraction just to learn the engineered
        # column names, then zero-fill every HTML feature.
        empty_cols = engineer_features(
            pd.DataFrame([html_extractor.extract_features('')])
        ).columns
        for name in empty_cols:
            row[f'html_{name}'] = 0

    return row
124
+
125
+
126
+ # ---------------------------------------------------------------------------
127
+ # Batch extraction with threading + checkpointing
128
+ # ---------------------------------------------------------------------------
129
def extract_all(
    df: pd.DataFrame,
    max_workers: int = 10,
    timeout: int = 10,
    checkpoint_every: int = 500,
) -> pd.DataFrame:
    """
    Extract combined features for all URLs using thread pool.

    Supports resuming: previously processed URLs are loaded from the
    checkpoint CSV and skipped.

    Args:
        df: DataFrame with 'url' and 'label' columns.
        max_workers: Parallel download threads.
        timeout: HTTP timeout per URL (seconds).
        checkpoint_every: Save intermediate results every N rows.

    Returns:
        DataFrame with combined features (one row per successfully
        processed URL; total failures are dropped entirely).
    """
    url_extractor = URLFeatureExtractorOptimized()
    html_extractor = HTMLFeatureExtractor()

    urls = df['url'].tolist()
    labels = df['label'].tolist()
    total = len(urls)

    # --- Load checkpoint if exists (carries over prior results + done set) ---
    done_urls = set()
    results = []
    if CHECKPOINT_FILE.exists():
        ckpt = pd.read_csv(CHECKPOINT_FILE)
        done_urls = set(ckpt['url'].tolist())
        results = ckpt.to_dict('records')
        logger.info(f"Resuming from checkpoint: {len(done_urls):,} URLs already done")

    # Skip URLs already present in the checkpoint
    remaining = [(u, l) for u, l in zip(urls, labels) if u not in done_urls]
    logger.info(f"Remaining URLs to process: {len(remaining):,} / {total:,}")

    if not remaining:
        logger.info("All URLs already processed!")
        return pd.DataFrame(results)

    # Counters for the run summary; the lock serializes result handling
    # (results list append + counter updates + checkpoint write)
    lock = Lock()
    n_success = 0
    n_html_fail = 0
    n_fail = 0
    t_start = time.perf_counter()

    def _worker(url_label):
        # Thread worker: unpack one (url, label) task and extract its row
        u, l = url_label
        return extract_single(u, l, url_extractor, html_extractor, timeout)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(_worker, item): item for item in remaining}

        with tqdm(total=len(remaining), desc='Extracting', unit='url') as pbar:
            for future in as_completed(futures):
                pbar.update(1)
                result = future.result()

                with lock:
                    if result is not None:
                        results.append(result)
                        n_success += 1

                        # Check if HTML was zero-filled
                        # NOTE(review): this relies on extract_single writing
                        # html_num_tags == 0 when the download failed —
                        # confirm that column name stays in sync.
                        if result.get('html_num_tags', 0) == 0:
                            n_html_fail += 1
                    else:
                        n_fail += 1

                    # Checkpoint (based on accumulated row count, so with
                    # skipped None results it may not fire on every Nth
                    # success — acceptable for a best-effort checkpoint)
                    if len(results) % checkpoint_every == 0:
                        _save_checkpoint(results)

    elapsed = time.perf_counter() - t_start
    speed = len(remaining) / elapsed if elapsed > 0 else 0

    logger.info(f"\nExtraction complete in {elapsed:.1f}s ({speed:.0f} URLs/sec)")
    logger.info(f"  Successful: {n_success:,}")
    logger.info(f"  HTML download failed (zero-filled): {n_html_fail:,}")
    logger.info(f"  Total failures (skipped): {n_fail:,}")

    # Final checkpoint
    _save_checkpoint(results)

    return pd.DataFrame(results)
215
+
216
+
217
def _save_checkpoint(results: list):
    """Persist intermediate results so an interrupted run can resume."""
    CHECKPOINT_FILE.parent.mkdir(parents=True, exist_ok=True)
    snapshot = pd.DataFrame(results)
    snapshot.to_csv(CHECKPOINT_FILE, index=False)
221
+
222
+
223
+ # ---------------------------------------------------------------------------
224
+ # Balance dataset
225
+ # ---------------------------------------------------------------------------
226
def balance_dataset(df: pd.DataFrame, random_state: int = 42) -> pd.DataFrame:
    """Undersample the majority class so both labels have equal counts.

    Args:
        df: Feature frame with a 'label' column.
        random_state: Seed for reproducible sampling and shuffling.

    Returns:
        Shuffled DataFrame with min-class-count rows per label.
    """
    counts = df['label'].value_counts()
    min_count = counts.min()
    logger.info(f"Balancing: {counts.to_dict()} → {min_count:,} per class")

    # GroupBy.sample (pandas >= 1.1) replaces the groupby().apply(lambda g:
    # g.sample(...)) pattern, which triggers deprecation warnings about
    # applying to grouping columns in recent pandas versions.
    balanced = df.groupby('label', group_keys=False).sample(
        n=min_count, random_state=random_state
    )
    # Shuffle so classes are interleaved rather than blocked
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)
238
+
239
+
240
+ # ---------------------------------------------------------------------------
241
+ # Main
242
+ # ---------------------------------------------------------------------------
243
def main():
    """CLI entry point: load the dataset, extract combined URL + HTML
    features (with checkpoint/resume), optionally balance classes, and
    write the final feature CSV."""
    parser = argparse.ArgumentParser(
        description='Extract combined URL + HTML features from clean_dataset.csv')
    parser.add_argument('--input', type=str,
                        default='data/processed/clean_dataset.csv',
                        help='Input CSV with url,label columns')
    parser.add_argument('--output', type=str,
                        default='data/features/combined_features.csv',
                        help='Output CSV path')
    parser.add_argument('--workers', type=int, default=10,
                        help='Parallel download threads (default: 10)')
    parser.add_argument('--timeout', type=int, default=10,
                        help='HTTP timeout in seconds (default: 10)')
    parser.add_argument('--limit', type=int, default=None,
                        help='Limit total URLs (for testing)')
    parser.add_argument('--checkpoint-every', type=int, default=500,
                        help='Save checkpoint every N URLs (default: 500)')
    parser.add_argument('--no-balance', action='store_true',
                        help='Do not balance the output dataset')
    args = parser.parse_args()

    # Paths are resolved against the project root, not the cwd
    input_path = (PROJECT_ROOT / args.input).resolve()
    output_path = (PROJECT_ROOT / args.output).resolve()

    logger.info("=" * 70)
    logger.info("COMBINED URL + HTML FEATURE EXTRACTION")
    logger.info("=" * 70)
    logger.info(f"  Input:    {input_path}")
    logger.info(f"  Output:   {output_path}")
    logger.info(f"  Workers:  {args.workers}")
    logger.info(f"  Timeout:  {args.timeout}s")
    logger.info(f"  Balance:  {'YES' if not args.no_balance else 'NO'}")

    # --- Load dataset ---
    df = pd.read_csv(input_path)
    logger.info(f"\nLoaded {len(df):,} URLs")
    logger.info(f"  Label distribution: {df['label'].value_counts().to_dict()}")

    if args.limit:
        # Stratified limit (half the budget per class — assumes the two
        # labels 0/1; classes smaller than the per-class budget are taken
        # in full)
        per_class = args.limit // 2
        df = (
            df.groupby('label', group_keys=False)
            .apply(lambda g: g.sample(n=min(per_class, len(g)), random_state=42))
        )
        df = df.reset_index(drop=True)
        logger.info(f"  Limited to: {len(df):,} URLs")

    # --- Extract features (threaded, checkpointed) ---
    features_df = extract_all(
        df,
        max_workers=args.workers,
        timeout=args.timeout,
        checkpoint_every=args.checkpoint_every,
    )

    if features_df.empty:
        logger.error("No features extracted!")
        sys.exit(1)

    logger.info(f"\nExtracted features: {features_df.shape}")
    logger.info(f"  Label distribution: {features_df['label'].value_counts().to_dict()}")

    # --- Balance ---
    if not args.no_balance:
        features_df = balance_dataset(features_df)
        logger.info(f"  After balancing: {features_df.shape}")
        logger.info(f"  Label dist: {features_df['label'].value_counts().to_dict()}")

    # --- Reorder columns: url, label first, then sorted features ---
    meta_cols = ['url', 'label']
    feature_cols = sorted([c for c in features_df.columns if c not in meta_cols])
    features_df = features_df[meta_cols + feature_cols]

    # --- Clean up infinities / NaNs (models choke on both) ---
    features_df = features_df.replace([np.inf, -np.inf], 0)
    features_df = features_df.fillna(0)

    # --- Save ---
    output_path.parent.mkdir(parents=True, exist_ok=True)
    features_df.to_csv(output_path, index=False)

    # --- Cleanup checkpoint (run finished; no resume data needed) ---
    if CHECKPOINT_FILE.exists():
        CHECKPOINT_FILE.unlink()
        logger.info("Checkpoint file cleaned up")

    # --- Summary ---
    logger.info("\n" + "=" * 70)
    logger.info("EXTRACTION COMPLETE")
    logger.info("=" * 70)
    logger.info(f"  Total samples:   {len(features_df):,}")
    logger.info(f"  Legitimate:      {(features_df['label'] == 0).sum():,}")
    logger.info(f"  Phishing:        {(features_df['label'] == 1).sum():,}")
    logger.info(f"  Total features:  {len(feature_cols)}")
    url_feats = [c for c in feature_cols if c.startswith('url_')]
    html_feats = [c for c in feature_cols if c.startswith('html_')]
    logger.info(f"    URL features:  {len(url_feats)}")
    logger.info(f"    HTML features: {len(html_feats)}")
    logger.info(f"  Output: {output_path}")
    logger.info("=" * 70)


if __name__ == '__main__':
    main()
scripts/feature_extraction/html/__pycache__/feature_engineering.cpython-313.pyc ADDED
Binary file (5.64 kB). View file
 
scripts/feature_extraction/html/__pycache__/html_feature_extractor.cpython-313.pyc ADDED
Binary file (25.4 kB). View file
 
scripts/feature_extraction/html/extract_features.py ADDED
@@ -0,0 +1,322 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Parallel HTML Feature Extraction Pipeline
3
+
4
+ Processes ~80k HTML files using multiprocessing for CPU-bound parsing.
5
+ Integrates quality filtering INTO the same parse pass (no double-parsing).
6
+ Includes checkpointing, progress tracking, and balanced output.
7
+
8
+ Usage:
9
+ python scripts/feature_extraction/html/extract_features.py
10
+ python scripts/feature_extraction/html/extract_features.py --no-filter
11
+ python scripts/feature_extraction/html/extract_features.py --workers 8
12
+ """
13
+ import argparse
14
+ import json
15
+ import logging
16
+ import sys
17
+ import time
18
+ from concurrent.futures import ProcessPoolExecutor, as_completed
19
+ from pathlib import Path
20
+
21
+ import pandas as pd
22
+ from tqdm import tqdm
23
+
24
+ # ---------------------------------------------------------------------------
25
+ # Resolve project root so imports work regardless of cwd
26
+ # ---------------------------------------------------------------------------
27
+ PROJECT_ROOT = Path(__file__).resolve().parents[3] # src/
28
+ sys.path.insert(0, str(PROJECT_ROOT))
29
+
30
+ from scripts.feature_extraction.html.html_feature_extractor import HTMLFeatureExtractor
31
+
32
+ # ---------------------------------------------------------------------------
33
+ # Logging
34
+ # ---------------------------------------------------------------------------
35
+ logging.basicConfig(
36
+ level=logging.INFO,
37
+ format='%(asctime)s - %(levelname)s - %(message)s',
38
+ datefmt='%H:%M:%S',
39
+ )
40
+ logger = logging.getLogger('extract_features')
41
+
42
+ # ---------------------------------------------------------------------------
43
+ # Quality filter constants
44
+ # ---------------------------------------------------------------------------
45
+ MIN_FILE_SIZE = 800 # bytes
46
+ MIN_TAGS = 8
47
+ MIN_WORDS = 30
48
+ ERROR_PATTERNS = [
49
+ 'page not found', '404 not found', '403 forbidden',
50
+ 'access denied', 'server error', 'not available',
51
+ 'domain for sale', 'website expired', 'coming soon',
52
+ 'under construction', 'parked domain', 'buy this domain',
53
+ 'domain has expired', 'this site can',
54
+ ]
55
+
56
+
57
+ # ---------------------------------------------------------------------------
58
+ # Worker function – runs in a subprocess
59
+ # ---------------------------------------------------------------------------
60
def _process_file(args: tuple) -> dict | None:
    """
    Worker: read one HTML file, optionally quality-filter it, extract features.

    Runs inside a ProcessPoolExecutor, so it must stay a picklable top-level
    function. Every failure mode is reported the same way: None (skip file).

    Args:
        args: 3-tuple (file_path_str, label, apply_filter).

    Returns:
        Feature dict augmented with 'filename' and 'label', or None when the
        file is filtered out or any error occurs.
    """
    file_path_str, label, apply_filter = args

    try:
        path = Path(file_path_str)
        raw = path.read_text(encoding='utf-8', errors='ignore')

        # Cheap size check before paying for a full parse.
        if apply_filter and len(raw) < MIN_FILE_SIZE:
            return None

        # Parse once: lxml is fastest, html.parser is the safe fallback.
        from bs4 import BeautifulSoup
        try:
            soup = BeautifulSoup(raw, 'lxml')
        except Exception:
            soup = BeautifulSoup(raw, 'html.parser')

        # Quality filter operates on the soup we already built.
        if apply_filter:
            if not soup.find('body'):
                return None

            if len(soup.find_all()) < MIN_TAGS:
                return None

            text = soup.get_text(separator=' ', strip=True).lower()
            if len(text.split()) < MIN_WORDS:
                return None

            # Error/parked pages announce themselves early on (first 2000 chars).
            head = text[:2000]
            if any(pattern in head for pattern in ERROR_PATTERNS):
                return None

            # Require at least some real content elements.
            has_content = (
                len(soup.find_all('a')) > 0 or
                len(soup.find_all('form')) > 0 or
                len(soup.find_all('input')) > 0 or
                len(soup.find_all('img')) > 0 or
                len(soup.find_all('div')) > 3
            )
            if not has_content:
                return None

        # Feature extraction re-parses internally (with its own cache).
        result = HTMLFeatureExtractor().extract_features(raw)
        result['filename'] = path.name
        result['label'] = label
        return result

    except Exception:
        # Best-effort worker: any unexpected failure just skips the file.
        return None
132
+
133
+
134
+ # ---------------------------------------------------------------------------
135
+ # Directory processor
136
+ # ---------------------------------------------------------------------------
137
def extract_from_directory(
    html_dir: Path,
    label: int,
    apply_filter: bool = True,
    max_workers: int = 6,
    limit: int | None = None,
) -> list[dict]:
    """
    Extract features from all .html files in a directory using multiprocessing.

    Args:
        html_dir: Directory with .html files
        label: 0 = legitimate, 1 = phishing
        apply_filter: Apply quality filter
        max_workers: Number of parallel workers
        limit: Max files to return (None = all)

    Returns:
        List of feature dictionaries (one per kept file)
    """
    html_files = sorted(html_dir.glob('*.html'))
    total = len(html_files)
    label_name = 'Phishing' if label == 1 else 'Legitimate'

    logger.info(f"\n{'='*60}")
    logger.info(f"Processing {label_name}: {total:,} files")
    logger.info(f"  Directory: {html_dir}")
    logger.info(f"  Quality filter: {'ON' if apply_filter else 'OFF'}")
    logger.info(f"  Workers: {max_workers}")
    logger.info(f"{'='*60}")

    # Build task list
    tasks = [(str(f), label, apply_filter) for f in html_files]

    results: list[dict] = []
    n_filtered = 0
    n_done = 0  # futures actually completed (needed for an honest rate)
    t0 = time.perf_counter()

    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # The submitted task args are never needed afterwards, so a plain
        # list of futures suffices (previously a dict with unused values).
        futures = [pool.submit(_process_file, t) for t in tasks]

        with tqdm(total=total, desc=f'{label_name}', unit='file') as pbar:
            for future in as_completed(futures):
                pbar.update(1)
                n_done += 1
                result = future.result()
                if result is None:
                    n_filtered += 1
                else:
                    results.append(result)
                    if limit and len(results) >= limit:
                        # Cancel futures that have not started yet; already
                        # running ones finish during pool shutdown.
                        for f in futures:
                            f.cancel()
                        break

    elapsed = time.perf_counter() - t0
    # BUGFIX: the rate was previously computed over `total`, which overstated
    # throughput whenever `limit` stopped the loop early. Use the number of
    # files actually processed instead.
    speed = n_done / elapsed if elapsed > 0 else 0

    logger.info(f"  Extracted: {len(results):,} quality samples")
    logger.info(f"  Filtered out: {n_filtered:,} ({n_filtered/max(total,1)*100:.1f}%)")
    logger.info(f"  Time: {elapsed:.1f}s ({speed:.0f} files/sec)")

    return results
200
+
201
+
202
+ # ---------------------------------------------------------------------------
203
+ # Main
204
+ # ---------------------------------------------------------------------------
205
def main():
    """CLI entry point: extract HTML features for both classes and save a CSV."""
    parser = argparse.ArgumentParser(
        description='Extract HTML features for phishing detection (parallel)')
    parser.add_argument('--phishing-dir', type=str, nargs='+',
                        default=['data/html/phishing', 'data/html/phishing_v1'],
                        help='Directories with phishing HTML files')
    parser.add_argument('--legit-dir', type=str, nargs='+',
                        default=['data/html/legitimate', 'data/html/legitimate_v1'],
                        help='Directories with legitimate HTML files')
    parser.add_argument('--output', type=str, default='data/features/html_features.csv',
                        help='Output CSV path')
    parser.add_argument('--workers', type=int, default=6,
                        help='Number of parallel workers (default: 6)')
    parser.add_argument('--no-filter', action='store_true',
                        help='Disable quality filtering')
    parser.add_argument('--limit', type=int, default=None,
                        help='Limit samples per class (for testing)')
    parser.add_argument('--no-balance', action='store_true',
                        help='Do not balance classes')
    args = parser.parse_args()

    apply_filter = not args.no_filter

    # Resolve paths relative to project root
    phishing_dirs = [(PROJECT_ROOT / d).resolve() for d in args.phishing_dir]
    legit_dirs = [(PROJECT_ROOT / d).resolve() for d in args.legit_dir]
    output_path = (PROJECT_ROOT / args.output).resolve()

    logger.info("=" * 70)
    logger.info("HTML FEATURE EXTRACTION PIPELINE")
    logger.info("=" * 70)
    for d in phishing_dirs:
        logger.info(f"  Phishing dir: {d}")
    for d in legit_dirs:
        logger.info(f"  Legitimate dir: {d}")
    logger.info(f"  Output: {output_path}")
    logger.info(f"  Workers: {args.workers}")
    logger.info(f"  Quality filter: {'ON' if apply_filter else 'OFF'}")

    # Validate directories (missing ones are skipped, not fatal)
    for d in phishing_dirs:
        if not d.exists():
            logger.warning(f"Phishing directory not found (skipping): {d}")
    for d in legit_dirs:
        if not d.exists():
            logger.warning(f"Legitimate directory not found (skipping): {d}")

    # ---- Extract features ----
    t_start = time.perf_counter()

    def _collect(dirs: list, label: int) -> list[dict]:
        """Extract from each existing directory while honouring --limit per class."""
        collected: list[dict] = []
        for d in dirs:
            if not d.exists():
                continue
            # BUGFIX: --limit is documented as "per class", but it used to be
            # passed unchanged to every directory, so N directories could yield
            # up to N * limit samples. Pass only the remaining budget instead.
            remaining = None
            if args.limit is not None:
                remaining = args.limit - len(collected)
                if remaining <= 0:
                    break
            collected.extend(extract_from_directory(
                d, label=label, apply_filter=apply_filter,
                max_workers=args.workers, limit=remaining))
        return collected

    phishing_features = _collect(phishing_dirs, label=1)
    legit_features = _collect(legit_dirs, label=0)

    # ---- Balance (downsample the majority class to the minority size) ----
    if not args.no_balance:
        min_count = min(len(phishing_features), len(legit_features))
        logger.info(f"\nBalancing to {min_count:,} per class")
        # Shuffle before truncating to get a random sample
        import random
        random.seed(42)
        random.shuffle(phishing_features)
        random.shuffle(legit_features)
        phishing_features = phishing_features[:min_count]
        legit_features = legit_features[:min_count]

    # ---- Build DataFrame ----
    all_features = phishing_features + legit_features
    if not all_features:
        logger.error("No features extracted!")
        sys.exit(1)

    df = pd.DataFrame(all_features)

    # Reorder columns: metadata first, then sorted features
    meta_cols = ['filename', 'label']
    feature_cols = sorted([c for c in df.columns if c not in meta_cols])
    df = df[meta_cols + feature_cols]

    # Shuffle rows so classes are interleaved
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)

    # ---- Save ----
    output_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(output_path, index=False)

    elapsed = time.perf_counter() - t_start

    # ---- Summary ----
    logger.info("\n" + "=" * 70)
    logger.info("EXTRACTION COMPLETE")
    logger.info("=" * 70)
    logger.info(f"  Total samples: {len(df):,}")
    logger.info(f"  Phishing: {(df['label']==1).sum():,}")
    logger.info(f"  Legitimate: {(df['label']==0).sum():,}")
    logger.info(f"  Features: {len(feature_cols)}")
    logger.info(f"  Total time: {elapsed:.1f}s")
    logger.info(f"  Output: {output_path}")
    logger.info("=" * 70)

    # Quick sanity check of feature distributions
    numeric = df[feature_cols].describe().T[['mean', 'std', 'min', 'max']]
    logger.info(f"\nFeature statistics (sample):")
    logger.info(numeric.head(15).to_string())
319
+
320
+
321
# Standard script entry point guard (keeps import side-effect free,
# which also matters for the multiprocessing workers spawned above).
if __name__ == '__main__':
    main()
scripts/feature_extraction/html/feature_engineering.py ADDED
@@ -0,0 +1,127 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Shared Feature Engineering for HTML-based Phishing Detection
3
+
4
+ Creates derived features from raw HTML features to improve model performance.
5
+ Used by both XGBoost and Random Forest training pipelines.
6
+ """
7
+ import numpy as np
8
+ import pandas as pd
9
+ import logging
10
+
11
+ logger = logging.getLogger(__name__)
12
+
13
+
14
def engineer_features(X: pd.DataFrame) -> pd.DataFrame:
    """
    Derive ratio, interaction and risk-score features from raw HTML features.

    Adds phishing-specific derived columns on top of the raw feature set.

    Args:
        X: DataFrame with raw feature columns (no 'label'/'filename')

    Returns:
        Copy of X extended with the engineered columns (inf/NaN replaced by 0)
    """
    X = X.copy()

    def opt(col: str):
        # Optional raw feature: fall back to scalar 0 when the column is absent.
        return X.get(col, 0)

    # ---- Ratio features (denominator offset by +1 to avoid division by zero) ----
    ratio_specs = {
        'forms_to_inputs_ratio': ('num_forms', 'num_input_fields'),
        'external_to_total_links': ('num_external_links', 'num_links'),
        'scripts_to_tags_ratio': ('num_scripts', 'num_tags'),
        'hidden_to_visible_inputs': ('num_hidden_fields', 'num_input_fields'),
        'password_to_inputs_ratio': ('num_password_fields', 'num_input_fields'),
        'empty_to_total_links': ('num_empty_links', 'num_links'),
        'images_to_tags_ratio': ('num_images', 'num_tags'),
        'iframes_to_tags_ratio': ('num_iframes', 'num_tags'),
    }
    for name, (numerator, denominator) in ratio_specs.items():
        X[name] = X[numerator] / (X[denominator] + 1)

    # ---- Interaction features (products of suspicious signals) ----
    product_specs = {
        'forms_with_passwords': ('num_forms', 'num_password_fields'),
        'external_scripts_links': ('num_external_links', 'num_external_scripts'),
        'urgency_with_forms': ('num_urgency_keywords', 'num_forms'),
        'brand_with_forms': ('num_brand_mentions', 'num_forms'),
        'iframes_with_scripts': ('num_iframes', 'num_scripts'),
        'hidden_with_external': ('num_hidden_fields', 'num_external_form_actions'),
    }
    for name, (left, right) in product_specs.items():
        X[name] = X[left] * X[right]

    # ---- Content density features ----
    X['content_density'] = (X['text_length'] + 1) / (X['num_divs'] + X['num_spans'] + 1)
    X['form_density'] = X['num_forms'] / (X['num_divs'] + 1)
    X['scripts_per_form'] = X['num_scripts'] / (X['num_forms'] + 1)
    X['links_per_word'] = X['num_links'] / (X['num_words'] + 1)

    # ---- Weighted risk scores ----
    X['phishing_risk_score'] = (
        2 * X['num_urgency_keywords']
        + 2 * X['num_brand_mentions']
        + 3 * X['num_password_fields']
        + 2 * X['num_iframes']
        + 4 * opt('num_hidden_iframes')
        + 3 * opt('num_anchor_text_mismatch')
        + 2 * opt('num_suspicious_tld_links')
        + 3 * opt('has_login_form')
    )

    X['form_risk_score'] = (
        3 * X['num_password_fields']
        + 2 * X['num_external_form_actions']
        + X['num_empty_form_actions']
        + X['num_hidden_fields']
    )

    X['obfuscation_score'] = (
        X['has_eval']
        + X['has_unescape']
        + X['has_escape']
        + X['has_document_write']
        + opt('has_base64')
        + opt('has_atob')
        + opt('has_fromcharcode')
    )

    X['legitimacy_score'] = (
        X['has_title']
        + opt('has_description')
        + opt('has_viewport')
        + opt('has_favicon')
        + opt('has_copyright')
        + opt('has_author')
        + (X['num_meta_tags'] > 3).astype(int)
        + (X['num_css_files'] > 0).astype(int)
    )

    # ---- Boolean aggregation of cloaking/redirect signals ----
    X['has_suspicious_elements'] = (
        (opt('has_meta_refresh') == 1)
        | (X['num_iframes'] > 0)
        | (X['num_hidden_fields'] > 3)
        | (opt('has_location_replace') == 1)
    ).astype(int)

    # ---- Clean up ----
    return X.replace([np.inf, -np.inf], 0).fillna(0)
106
+
107
+
108
def get_engineered_feature_names() -> list[str]:
    """Return the names of the columns that engineer_features() adds."""
    ratio_names = [
        'forms_to_inputs_ratio', 'external_to_total_links',
        'scripts_to_tags_ratio', 'hidden_to_visible_inputs',
        'password_to_inputs_ratio', 'empty_to_total_links',
        'images_to_tags_ratio', 'iframes_to_tags_ratio',
    ]
    interaction_names = [
        'forms_with_passwords', 'external_scripts_links',
        'urgency_with_forms', 'brand_with_forms',
        'iframes_with_scripts', 'hidden_with_external',
    ]
    density_names = [
        'content_density', 'form_density', 'scripts_per_form', 'links_per_word',
    ]
    score_names = [
        'phishing_risk_score', 'form_risk_score',
        'obfuscation_score', 'legitimacy_score',
    ]
    boolean_names = ['has_suspicious_elements']
    # Order mirrors the sections of engineer_features().
    return (ratio_names + interaction_names + density_names
            + score_names + boolean_names)
scripts/feature_extraction/html/html_feature_extractor.py ADDED
@@ -0,0 +1,510 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Optimized HTML Feature Extractor for Phishing Detection
3
+
4
+ Extracts 67 features from HTML content with single-parse efficiency.
5
+ Uses cached tag lookups to avoid redundant find_all() calls.
6
+ """
7
+ import re
8
+ from urllib.parse import urlparse
9
+ from bs4 import BeautifulSoup
10
+ import logging
11
+
12
+ logger = logging.getLogger(__name__)
13
+
14
+ # Suspicious TLDs commonly used in phishing
15
+ SUSPICIOUS_TLDS = {
16
+ '.tk', '.ml', '.ga', '.cf', '.gq', '.top', '.xyz', '.buzz',
17
+ '.club', '.online', '.site', '.icu', '.work', '.click', '.link',
18
+ '.info', '.pw', '.cc', '.ws', '.bid', '.stream', '.racing',
19
+ }
20
+
21
+ # Brand keywords phishers commonly impersonate
22
+ BRAND_KEYWORDS = [
23
+ 'paypal', 'amazon', 'google', 'microsoft', 'apple', 'facebook',
24
+ 'netflix', 'ebay', 'instagram', 'twitter', 'linkedin', 'yahoo',
25
+ 'bank', 'visa', 'mastercard', 'americanexpress', 'chase', 'wells',
26
+ 'citibank', 'dhl', 'fedex', 'ups', 'usps', 'dropbox', 'adobe',
27
+ 'spotify', 'whatsapp', 'telegram', 'steam', 'coinbase', 'binance',
28
+ ]
29
+
30
+ # Urgency / social engineering keywords
31
+ URGENCY_KEYWORDS = [
32
+ 'urgent', 'verify', 'suspended', 'locked', 'confirm',
33
+ 'security', 'alert', 'warning', 'expire', 'limited',
34
+ 'immediately', 'click here', 'act now', 'unusual activity',
35
+ 'unauthorized', 'restricted', 'risk', 'compromised',
36
+ 'your account', 'update your', 'verify your', 'confirm your',
37
+ 'within 24', 'within 48', 'action required',
38
+ ]
39
+
40
+
41
+ class HTMLFeatureExtractor:
42
+ """
43
+ High-performance HTML feature extractor.
44
+
45
+ Parses HTML once and caches all tag lookups for efficiency.
46
+ Designed for batch processing of 40k+ files.
47
+ """
48
+
49
+ def extract_features(self, html_content: str, url: str | None = None) -> dict:
50
+ """
51
+ Extract all features from HTML content in a single pass.
52
+
53
+ Args:
54
+ html_content: Raw HTML string
55
+ url: Optional source URL for context
56
+
57
+ Returns:
58
+ Dictionary with 67 numeric features
59
+ """
60
+ try:
61
+ # --- Single parse with fast parser ---
62
+ try:
63
+ soup = BeautifulSoup(html_content, 'lxml')
64
+ except Exception:
65
+ soup = BeautifulSoup(html_content, 'html.parser')
66
+
67
+ # --- Cache tag lookups (done ONCE) ---
68
+ cache = self._build_cache(soup)
69
+
70
+ features = {}
71
+ features.update(self._structure_features(soup, cache, html_content))
72
+ features.update(self._form_features(cache))
73
+ features.update(self._link_features(cache))
74
+ features.update(self._script_features(cache))
75
+ features.update(self._text_features(soup, cache))
76
+ features.update(self._meta_features(soup, cache))
77
+ features.update(self._resource_features(cache))
78
+ features.update(self._advanced_features(soup, cache))
79
+ return features
80
+
81
+ except Exception as e:
82
+ logger.debug(f"Feature extraction error: {e}")
83
+ return self._default_features()
84
+
85
+ # ------------------------------------------------------------------
86
+ # Cache builder – avoids redundant find_all() across feature groups
87
+ # ------------------------------------------------------------------
88
+ @staticmethod
89
+ def _build_cache(soup) -> dict:
90
+ """Build a lookup cache of all tags we need. Called once per document."""
91
+ all_tags = soup.find_all()
92
+
93
+ # Classify tags by name in a single pass
94
+ by_name: dict[str, list] = {}
95
+ for tag in all_tags:
96
+ by_name.setdefault(tag.name, []).append(tag)
97
+
98
+ # Convenience lists used by multiple feature groups
99
+ links_a = by_name.get('a', [])
100
+ forms = by_name.get('form', [])
101
+ inputs = by_name.get('input', [])
102
+ scripts = by_name.get('script', [])
103
+ images = by_name.get('img', [])
104
+ iframes = by_name.get('iframe', [])
105
+ meta_tags = by_name.get('meta', [])
106
+ style_tags = by_name.get('style', [])
107
+ css_links = [t for t in by_name.get('link', [])
108
+ if t.get('rel') and 'stylesheet' in t.get('rel', [])]
109
+ all_link_tags = by_name.get('link', [])
110
+
111
+ # Pre-extract hrefs and input types (used in several groups)
112
+ hrefs = [a.get('href', '') or '' for a in links_a]
113
+ input_types = [(inp, (inp.get('type', '') or '').lower()) for inp in inputs]
114
+
115
+ return {
116
+ 'all_tags': all_tags,
117
+ 'by_name': by_name,
118
+ 'links_a': links_a,
119
+ 'hrefs': hrefs,
120
+ 'forms': forms,
121
+ 'inputs': inputs,
122
+ 'input_types': input_types,
123
+ 'scripts': scripts,
124
+ 'images': images,
125
+ 'iframes': iframes,
126
+ 'meta_tags': meta_tags,
127
+ 'style_tags': style_tags,
128
+ 'css_links': css_links,
129
+ 'all_link_tags': all_link_tags,
130
+ }
131
+
132
+ # ------------------------------------------------------------------
133
+ # 1. Structure features (12)
134
+ # ------------------------------------------------------------------
135
    @staticmethod
    def _structure_features(soup, c: dict, raw_html: str) -> dict:
        """Structural features (12): tag counts, DOM depth, title presence.

        Args:
            soup: Parsed BeautifulSoup document.
            c: Lookup cache produced by _build_cache().
            raw_html: Original HTML string (used only for total length).

        Returns:
            Dict of 12 integer features.
        """
        bn = c['by_name']

        # DOM depth – walk just the <body>.
        # Iterative DFS with an explicit stack; only element nodes (children
        # with a truthy .name, i.e. not text/comment nodes) increase depth.
        body = soup.find('body')
        max_depth = 0
        if body:
            stack = [(body, 0)]
            while stack:
                node, depth = stack.pop()
                if depth > max_depth:
                    max_depth = depth
                # getattr guards nodes that have no .children attribute
                for child in getattr(node, 'children', []):
                    if hasattr(child, 'name') and child.name:
                        stack.append((child, depth + 1))

        return {
            'html_length': len(raw_html),
            'num_tags': len(c['all_tags']),
            'num_divs': len(bn.get('div', [])),
            'num_spans': len(bn.get('span', [])),
            'num_paragraphs': len(bn.get('p', [])),
            'num_headings': sum(len(bn.get(h, []))
                                for h in ('h1', 'h2', 'h3', 'h4', 'h5', 'h6')),
            'num_lists': len(bn.get('ul', [])) + len(bn.get('ol', [])),
            'num_images': len(c['images']),
            'num_iframes': len(c['iframes']),
            'num_tables': len(bn.get('table', [])),
            'has_title': 1 if soup.find('title') else 0,
            'dom_depth': max_depth,
        }
167
+
168
+ # ------------------------------------------------------------------
169
+ # 2. Form features (11)
170
+ # ------------------------------------------------------------------
171
+ @staticmethod
172
+ def _form_features(c: dict) -> dict:
173
+ forms = c['forms']
174
+ input_types = c['input_types']
175
+
176
+ n_password = sum(1 for _, t in input_types if t == 'password')
177
+ n_email = sum(1 for _, t in input_types if t == 'email')
178
+ n_text = sum(1 for _, t in input_types if t == 'text')
179
+ n_hidden = sum(1 for _, t in input_types if t == 'hidden')
180
+ n_submit = sum(1 for _, t in input_types if t == 'submit')
181
+ # Also count <button type="submit">
182
+ n_submit += sum(1 for btn in c['by_name'].get('button', [])
183
+ if (btn.get('type', '') or '').lower() == 'submit')
184
+
185
+ form_actions = [f.get('action', '') or '' for f in forms]
186
+ n_ext_action = sum(1 for a in form_actions if a.startswith('http'))
187
+ n_empty_action = sum(1 for a in form_actions if not a or a == '#')
188
+
189
+ return {
190
+ 'num_forms': len(forms),
191
+ 'num_input_fields': len(c['inputs']),
192
+ 'num_password_fields': n_password,
193
+ 'num_email_fields': n_email,
194
+ 'num_text_fields': n_text,
195
+ 'num_submit_buttons': n_submit,
196
+ 'num_hidden_fields': n_hidden,
197
+ 'has_login_form': 1 if (n_password > 0 and (n_email > 0 or n_text > 0)) else 0,
198
+ 'has_form': 1 if forms else 0,
199
+ 'num_external_form_actions': n_ext_action,
200
+ 'num_empty_form_actions': n_empty_action,
201
+ }
202
+
203
+ # ------------------------------------------------------------------
204
+ # 3. Link features (10)
205
+ # ------------------------------------------------------------------
206
+ @staticmethod
207
+ def _link_features(c: dict) -> dict:
208
+ hrefs = c['hrefs']
209
+ links_a = c['links_a']
210
+ n_links = len(links_a)
211
+
212
+ n_external = sum(1 for h in hrefs if h.startswith('http'))
213
+ n_internal = sum(1 for h in hrefs if h.startswith('/') or h.startswith('#'))
214
+ n_empty = sum(1 for h in hrefs if not h or h == '#')
215
+ n_mailto = sum(1 for h in hrefs if h.startswith('mailto:'))
216
+ n_js = sum(1 for h in hrefs if 'javascript:' in h.lower())
217
+ n_ip = sum(1 for h in hrefs
218
+ if re.search(r'https?://\d+\.\d+\.\d+\.\d+', h))
219
+
220
+ # Count links pointing to suspicious TLDs
221
+ n_suspicious_tld = 0
222
+ for h in hrefs:
223
+ if h.startswith('http'):
224
+ try:
225
+ netloc = urlparse(h).netloc.lower()
226
+ for tld in SUSPICIOUS_TLDS:
227
+ if netloc.endswith(tld):
228
+ n_suspicious_tld += 1
229
+ break
230
+ except Exception:
231
+ pass
232
+
233
+ ratio_ext = n_external / n_links if n_links > 0 else 0.0
234
+
235
+ return {
236
+ 'num_links': n_links,
237
+ 'num_external_links': n_external,
238
+ 'num_internal_links': n_internal,
239
+ 'num_empty_links': n_empty,
240
+ 'num_mailto_links': n_mailto,
241
+ 'num_javascript_links': n_js,
242
+ 'ratio_external_links': ratio_ext,
243
+ 'num_ip_based_links': n_ip,
244
+ 'num_suspicious_tld_links': n_suspicious_tld,
245
+ 'num_anchor_text_mismatch': HTMLFeatureExtractor._anchor_mismatch(links_a),
246
+ }
247
+
248
+ @staticmethod
249
+ def _anchor_mismatch(links_a: list) -> int:
250
+ """Count links where visible text shows a domain different from href."""
251
+ count = 0
252
+ url_pattern = re.compile(r'https?://[^\s<>"\']+')
253
+ for a in links_a:
254
+ href = a.get('href', '') or ''
255
+ text = a.get_text(strip=True)
256
+ if not href.startswith('http') or not text:
257
+ continue
258
+ text_urls = url_pattern.findall(text)
259
+ if text_urls:
260
+ try:
261
+ href_domain = urlparse(href).netloc.lower()
262
+ text_domain = urlparse(text_urls[0]).netloc.lower()
263
+ if href_domain and text_domain and href_domain != text_domain:
264
+ count += 1
265
+ except Exception:
266
+ pass
267
+ return count
268
+
269
+ # ------------------------------------------------------------------
270
+ # 4. Script features (7)
271
+ # ------------------------------------------------------------------
272
+ @staticmethod
273
+ def _script_features(c: dict) -> dict:
274
+ scripts = c['scripts']
275
+ n_inline = 0
276
+ n_external = 0
277
+ script_text_parts = []
278
+
279
+ for s in scripts:
280
+ if s.get('src'):
281
+ n_external += 1
282
+ if s.string:
283
+ n_inline += 1
284
+ script_text_parts.append(s.string)
285
+
286
+ script_content = ' '.join(script_text_parts)
287
+
288
+ return {
289
+ 'num_scripts': len(scripts),
290
+ 'num_inline_scripts': n_inline,
291
+ 'num_external_scripts': n_external,
292
+ 'has_eval': 1 if 'eval(' in script_content else 0,
293
+ 'has_unescape': 1 if 'unescape(' in script_content else 0,
294
+ 'has_escape': 1 if 'escape(' in script_content else 0,
295
+ 'has_document_write': 1 if 'document.write' in script_content else 0,
296
+ }
297
+
298
+ # ------------------------------------------------------------------
299
+ # 5. Text content features (8)
300
+ # ------------------------------------------------------------------
301
+ @staticmethod
302
+ def _text_features(soup, c: dict) -> dict:
303
+ text = soup.get_text(separator=' ', strip=True).lower()
304
+ words = text.split()
305
+ n_words = len(words)
306
+ html_len = len(str(soup))
307
+
308
+ return {
309
+ 'text_length': len(text),
310
+ 'num_words': n_words,
311
+ 'text_to_html_ratio': len(text) / html_len if html_len > 0 else 0.0,
312
+ 'num_brand_mentions': sum(1 for kw in BRAND_KEYWORDS if kw in text),
313
+ 'num_urgency_keywords': sum(1 for kw in URGENCY_KEYWORDS if kw in text),
314
+ 'has_copyright': 1 if ('©' in text or 'copyright' in text) else 0,
315
+ 'has_phone_number': 1 if re.search(
316
+ r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', text) else 0,
317
+ 'has_email_address': 1 if re.search(
318
+ r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}', text) else 0,
319
+ }
320
+
321
+ # ------------------------------------------------------------------
322
+ # 6. Meta tag features (6)
323
+ # ------------------------------------------------------------------
324
+ @staticmethod
325
+ def _meta_features(soup, c: dict) -> dict:
326
+ meta_tags = c['meta_tags']
327
+
328
+ has_refresh = 0
329
+ has_desc = 0
330
+ has_keywords = 0
331
+ has_author = 0
332
+ has_viewport = 0
333
+ for m in meta_tags:
334
+ name_attr = (m.get('name') or '').lower()
335
+ http_equiv = (m.get('http-equiv') or '').lower()
336
+ if name_attr == 'description':
337
+ has_desc = 1
338
+ elif name_attr == 'keywords':
339
+ has_keywords = 1
340
+ elif name_attr == 'author':
341
+ has_author = 1
342
+ elif name_attr == 'viewport':
343
+ has_viewport = 1
344
+ if http_equiv == 'refresh':
345
+ has_refresh = 1
346
+
347
+ return {
348
+ 'num_meta_tags': len(meta_tags),
349
+ 'has_description': has_desc,
350
+ 'has_keywords': has_keywords,
351
+ 'has_author': has_author,
352
+ 'has_viewport': has_viewport,
353
+ 'has_meta_refresh': has_refresh,
354
+ }
355
+
356
+ # ------------------------------------------------------------------
357
+ # 7. Resource features (7)
358
+ # ------------------------------------------------------------------
359
+ @staticmethod
360
+ def _resource_features(c: dict) -> dict:
361
+ css_links = c['css_links']
362
+ images = c['images']
363
+ style_tags = c['style_tags']
364
+
365
+ img_srcs = [img.get('src', '') or '' for img in images]
366
+ css_content = ''.join(tag.string or '' for tag in style_tags)
367
+
368
+ has_favicon = 0
369
+ for lt in c['all_link_tags']:
370
+ rel = lt.get('rel', [])
371
+ if 'icon' in rel or 'shortcut' in rel:
372
+ has_favicon = 1
373
+ break
374
+
375
+ return {
376
+ 'num_css_files': len(css_links),
377
+ 'num_external_css': sum(1 for lk in css_links
378
+ if (lk.get('href', '') or '').startswith('http')),
379
+ 'num_external_images': sum(1 for s in img_srcs if s.startswith('http')),
380
+ 'num_data_uri_images': sum(1 for s in img_srcs if s.startswith('data:')),
381
+ 'num_inline_styles': len(style_tags),
382
+ 'inline_css_length': len(css_content),
383
+ 'has_favicon': has_favicon,
384
+ }
385
+
386
+ # ------------------------------------------------------------------
387
+ # 8. Advanced phishing indicators (16)
388
+ # ------------------------------------------------------------------
389
+ @staticmethod
390
+ def _advanced_features(soup, c: dict) -> dict:
391
+ forms = c['forms']
392
+ input_types = c['input_types']
393
+ hrefs = c['hrefs']
394
+ all_text_lower = str(soup).lower()
395
+
396
+ # Password + external action combo
397
+ has_password = any(t == 'password' for _, t in input_types)
398
+ has_ext_action = any(
399
+ (f.get('action', '') or '').startswith('http') for f in forms)
400
+
401
+ # Count unique external domains from links
402
+ ext_domains = set()
403
+ for h in hrefs:
404
+ if h.startswith('http'):
405
+ try:
406
+ d = urlparse(h).netloc
407
+ if d:
408
+ ext_domains.add(d.lower())
409
+ except Exception:
410
+ pass
411
+
412
+ # Forms without labels
413
+ n_forms_no_label = sum(
414
+ 1 for f in forms
415
+ if not f.find_all('label') and f.find_all('input')
416
+ )
417
+
418
+ # Event handlers – single pass over all tags
419
+ n_onload = 0
420
+ n_onerror = 0
421
+ n_onclick = 0
422
+ for tag in c['all_tags']:
423
+ attrs = tag.attrs
424
+ if 'onload' in attrs:
425
+ n_onload += 1
426
+ if 'onerror' in attrs:
427
+ n_onerror += 1
428
+ if 'onclick' in attrs:
429
+ n_onclick += 1
430
+
431
+ # Iframe with small/zero dimensions (common cloaking)
432
+ n_hidden_iframes = 0
433
+ for iframe in c['iframes']:
434
+ w = iframe.get('width', '')
435
+ h = iframe.get('height', '')
436
+ style = (iframe.get('style', '') or '').lower()
437
+ if w in ('0', '1') or h in ('0', '1') or 'display:none' in style or 'visibility:hidden' in style:
438
+ n_hidden_iframes += 1
439
+
440
+ return {
441
+ 'password_with_external_action': 1 if (has_password and has_ext_action) else 0,
442
+ 'has_base64': 1 if 'base64' in all_text_lower else 0,
443
+ 'has_atob': 1 if 'atob(' in all_text_lower else 0,
444
+ 'has_fromcharcode': 1 if 'fromcharcode' in all_text_lower else 0,
445
+ 'num_onload_events': n_onload,
446
+ 'num_onerror_events': n_onerror,
447
+ 'num_onclick_events': n_onclick,
448
+ 'num_unique_external_domains': len(ext_domains),
449
+ 'num_forms_without_labels': n_forms_no_label,
450
+ 'has_display_none': 1 if ('display:none' in all_text_lower or
451
+ 'display: none' in all_text_lower) else 0,
452
+ 'has_visibility_hidden': 1 if ('visibility:hidden' in all_text_lower or
453
+ 'visibility: hidden' in all_text_lower) else 0,
454
+ 'has_window_open': 1 if 'window.open' in all_text_lower else 0,
455
+ 'has_location_replace': 1 if ('location.replace' in all_text_lower or
456
+ 'location.href' in all_text_lower) else 0,
457
+ 'num_hidden_iframes': n_hidden_iframes,
458
+ 'has_right_click_disabled': 1 if ('oncontextmenu' in all_text_lower and
459
+ 'return false' in all_text_lower) else 0,
460
+ 'has_status_bar_customization': 1 if ('window.status' in all_text_lower or
461
+ 'onmouseover' in all_text_lower) else 0,
462
+ }
463
+
464
+ # ------------------------------------------------------------------
465
+ # Default features (all zeros) – used on parse failure
466
+ # ------------------------------------------------------------------
467
+ def _default_features(self) -> dict:
468
+ return {k: 0 for k in self.get_feature_names()}
469
+
470
+ @staticmethod
471
+ def get_feature_names() -> list[str]:
472
+ """Return ordered list of all 67 feature names."""
473
+ return [
474
+ # Structure (12)
475
+ 'html_length', 'num_tags', 'num_divs', 'num_spans',
476
+ 'num_paragraphs', 'num_headings', 'num_lists', 'num_images',
477
+ 'num_iframes', 'num_tables', 'has_title', 'dom_depth',
478
+ # Form (11)
479
+ 'num_forms', 'num_input_fields', 'num_password_fields',
480
+ 'num_email_fields', 'num_text_fields', 'num_submit_buttons',
481
+ 'num_hidden_fields', 'has_login_form', 'has_form',
482
+ 'num_external_form_actions', 'num_empty_form_actions',
483
+ # Link (10)
484
+ 'num_links', 'num_external_links', 'num_internal_links',
485
+ 'num_empty_links', 'num_mailto_links', 'num_javascript_links',
486
+ 'ratio_external_links', 'num_ip_based_links',
487
+ 'num_suspicious_tld_links', 'num_anchor_text_mismatch',
488
+ # Script (7)
489
+ 'num_scripts', 'num_inline_scripts', 'num_external_scripts',
490
+ 'has_eval', 'has_unescape', 'has_escape', 'has_document_write',
491
+ # Text (8)
492
+ 'text_length', 'num_words', 'text_to_html_ratio',
493
+ 'num_brand_mentions', 'num_urgency_keywords',
494
+ 'has_copyright', 'has_phone_number', 'has_email_address',
495
+ # Meta (6)
496
+ 'num_meta_tags', 'has_description', 'has_keywords',
497
+ 'has_author', 'has_viewport', 'has_meta_refresh',
498
+ # Resource (7)
499
+ 'num_css_files', 'num_external_css', 'num_external_images',
500
+ 'num_data_uri_images', 'num_inline_styles',
501
+ 'inline_css_length', 'has_favicon',
502
+ # Advanced (16)
503
+ 'password_with_external_action', 'has_base64', 'has_atob',
504
+ 'has_fromcharcode', 'num_onload_events', 'num_onerror_events',
505
+ 'num_onclick_events', 'num_unique_external_domains',
506
+ 'num_forms_without_labels', 'has_display_none',
507
+ 'has_visibility_hidden', 'has_window_open',
508
+ 'has_location_replace', 'num_hidden_iframes',
509
+ 'has_right_click_disabled', 'has_status_bar_customization',
510
+ ]
scripts/feature_extraction/html/v1/__pycache__/html_features.cpython-313.pyc ADDED
Binary file (21.8 kB). View file
 
scripts/feature_extraction/html/v1/extract_html_features_simple.py ADDED
@@ -0,0 +1,305 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Extract HTML Features - Direct from Files (No Metadata Needed)
3
+ Simplified version that scans directories directly
4
+ WITH QUALITY FILTERING to remove low-quality HTML files
5
+ """
6
+ import pandas as pd
7
+ from pathlib import Path
8
+ import logging
9
+ from tqdm import tqdm
10
+ import sys
11
+ import re
12
+ from bs4 import BeautifulSoup
13
+
14
+ # Add scripts directory to path
15
+ sys.path.append(str(Path(__file__).parent))
16
+
17
+ from html_features import HTMLFeatureExtractor
18
+
19
+ # Setup logging
20
+ logging.basicConfig(
21
+ level=logging.INFO,
22
+ format='%(asctime)s - %(levelname)s - %(message)s',
23
+ datefmt='%H:%M:%S'
24
+ )
25
+ logger = logging.getLogger(__name__)
26
+
27
# Quality filter constants (consumed by is_quality_html below)
MIN_FILE_SIZE = 1000  # Minimum 1KB
MIN_WORDS = 50  # Minimum 50 words of text content
MIN_TAGS = 10  # Minimum 10 HTML tags
# Substrings that mark error/parked/placeholder pages. Matched against the
# first 2000 characters of the lower-cased page text, so short markers such
# as '404' can also match incidental content — a deliberate, aggressive bias.
ERROR_PATTERNS = [
    'page not found', '404', '403', 'forbidden', 'access denied',
    'error occurred', 'server error', 'not available', 'suspended',
    'domain for sale', 'this site can', 'website expired',
    'coming soon', 'under construction', 'parked domain',
    'buy this domain', 'this domain', 'domain has expired'
]
38
+
39
+
40
def is_quality_html(html_content, filename=""):
    """
    Decide whether an HTML document is substantial enough to keep.

    Args:
        html_content: Raw HTML string.
        filename: Unused; kept for interface compatibility with callers.

    Returns:
        tuple: (is_valid, reason) — reason explains any rejection.
    """
    # Cheap size gate before paying for a full parse.
    if len(html_content) < MIN_FILE_SIZE:
        return False, f"Too small ({len(html_content)} bytes)"

    try:
        soup = BeautifulSoup(html_content, 'html.parser')

        # Basic HTML structure: a <body> must exist.
        if not soup.find('body'):
            return False, "No body tag"

        # Enough tags to look like a real page.
        num_tags = len(soup.find_all())
        if num_tags < MIN_TAGS:
            return False, f"Too few tags ({num_tags})"

        # Enough visible text.
        text = soup.get_text(separator=' ', strip=True).lower()
        words = text.split()
        if len(words) < MIN_WORDS:
            return False, f"Too few words ({len(words)})"

        # Error/parked page markers — scan only the leading text.
        text_lower = text[:2000]
        for pattern in ERROR_PATTERNS:
            if pattern in text_lower:
                return False, f"Error page pattern: '{pattern}'"

        # Require at least one interactive element or some layout depth.
        interactive = (
            bool(soup.find_all('a'))
            or bool(soup.find_all('form'))
            or bool(soup.find_all('input'))
            or bool(soup.find_all('img'))
            or len(soup.find_all('div')) > 3
        )
        if not interactive:
            return False, "No interactive elements"

        # Pages that are almost entirely script are hard to analyse.
        script_content = ''.join(s.string or '' for s in soup.find_all('script'))
        if len(script_content) > len(text) * 3 and len(text) < 200:
            return False, "Mostly JavaScript, little content"

        return True, "OK"

    except Exception as e:
        return False, f"Parse error: {str(e)[:50]}"
94
+
95
+
96
def extract_features_from_directory(html_dir, label, limit=None, apply_filter=True):
    """
    Extract features from all HTML files in a directory.

    Args:
        html_dir: Directory containing HTML files
        label: Label for these files (0=legitimate, 1=phishing)
        limit: Maximum number of files to process (None = all)
        apply_filter: Apply quality filter to remove bad HTML files

    Returns:
        List of feature dictionaries (one per accepted file), each augmented
        with 'filename' and 'label' keys.
    """
    html_dir = Path(html_dir)
    logger.info(f"\nProcessing: {html_dir}")
    logger.info(f" Label: {'Phishing' if label == 1 else 'Legitimate'}")
    logger.info(f" Quality filter: {'ENABLED' if apply_filter else 'DISABLED'}")

    # Get all HTML files (sorted for deterministic ordering across runs)
    html_files = sorted(html_dir.glob('*.html'))
    total_files = len(html_files)
    logger.info(f" Found {total_files:,} HTML files")

    # Initialize extractor
    extractor = HTMLFeatureExtractor()

    results = []
    errors = 0
    filtered_out = 0
    filter_reasons = {}  # rejection reason -> count

    # Process each HTML file
    for html_path in tqdm(html_files,
                          desc=f"Extracting {'Phishing' if label == 1 else 'Legitimate'} features"):
        try:
            # Read HTML content
            with open(html_path, 'r', encoding='utf-8', errors='ignore') as f:
                html_content = f.read()

            # Apply quality filter if enabled
            if apply_filter:
                is_valid, reason = is_quality_html(html_content, html_path.name)
                if not is_valid:
                    filtered_out += 1
                    filter_reasons[reason] = filter_reasons.get(reason, 0) + 1
                    continue

            # Extract features
            features = extractor.extract_features(html_content, url=None)

            # Add metadata
            features['filename'] = html_path.name  # type: ignore
            features['label'] = label

            results.append(features)

            # Check if we reached the limit
            if limit and len(results) >= limit:
                logger.info(f" Reached limit of {limit:,} quality files")
                break

        except Exception as e:
            errors += 1
            # FIX: was `errors < 10`, which only showed the first 9 errors.
            if errors <= 10:  # Show first 10 errors
                logger.warning(f" Error processing {html_path.name}: {e}")

    logger.info(f" Quality files extracted: {len(results):,}")
    # FIX: guard against an empty directory — the unconditional percentage
    # previously raised ZeroDivisionError when total_files == 0.
    if total_files > 0:
        logger.info(f" Filtered out (low quality): {filtered_out:,} ({filtered_out/total_files*100:.1f}%)")
    else:
        logger.info(f" Filtered out (low quality): {filtered_out:,}")

    if filter_reasons and apply_filter:
        logger.info(f" Filter reasons (top 5):")
        for reason, count in sorted(filter_reasons.items(), key=lambda x: -x[1])[:5]:
            logger.info(f" - {reason}: {count:,}")

    if errors > 0:
        logger.warning(f" Errors: {errors:,}")

    return results
174
+
175
+
176
def main():
    """Run the full quality-filtered HTML feature extraction pipeline.

    Steps: count HTML files for both classes, extract features with the
    quality filter, balance the classes by truncation, shuffle, and write
    a single CSV with 'filename' and 'label' as the leading columns.
    """
    logger.info("="*80)
    logger.info("BALANCED HTML FEATURES EXTRACTION (WITH QUALITY FILTER)")
    logger.info("="*80)

    # Quality filter info
    logger.info("\nQuality Filter Criteria:")
    logger.info(f" - Minimum file size: {MIN_FILE_SIZE} bytes")
    logger.info(f" - Minimum word count: {MIN_WORDS} words")
    logger.info(f" - Minimum HTML tags: {MIN_TAGS}")
    logger.info(f" - Must have body tag")
    logger.info(f" - Not an error/parked page")
    logger.info(f" - Has interactive elements (links/forms/images)")

    # Paths (relative to the repository root the script is run from)
    phishing_html_dir = Path('data/html/phishing_v1')
    legit_html_dir = Path('data/html/legitimate_v1')
    output_path = Path('data/features/html_features_old.csv')

    # Check directories exist
    if not phishing_html_dir.exists():
        logger.error(f"Phishing directory not found: {phishing_html_dir}")
        return

    if not legit_html_dir.exists():
        logger.error(f"Legitimate directory not found: {legit_html_dir}")
        return

    # Count files
    logger.info("\n1. Checking available HTML files...")
    phishing_files = list(phishing_html_dir.glob('*.html'))
    legit_files = list(legit_html_dir.glob('*.html'))

    phishing_count = len(phishing_files)
    legit_count = len(legit_files)

    logger.info(f" Phishing HTML files: {phishing_count:,}")
    logger.info(f" Legitimate HTML files: {legit_count:,}")

    # Extract phishing features (with quality filter)
    logger.info("\n2. Extracting PHISHING HTML features (with quality filter)...")
    phishing_features = extract_features_from_directory(
        phishing_html_dir,
        label=1,  # Phishing
        limit=None,  # Get all quality files first
        apply_filter=True
    )

    # Extract legitimate features (with quality filter)
    logger.info("\n3. Extracting LEGITIMATE HTML features (with quality filter)...")
    legit_features = extract_features_from_directory(
        legit_html_dir,
        label=0,  # Legitimate
        limit=None,  # Get all quality files first
        apply_filter=True
    )

    # Balance the dataset by truncating the larger class
    logger.info("\n4. Balancing dataset...")
    min_count = min(len(phishing_features), len(legit_features))
    logger.info(f" Quality phishing samples: {len(phishing_features):,}")
    logger.info(f" Quality legitimate samples: {len(legit_features):,}")
    logger.info(f" Balancing to: {min_count:,} per class")

    # Truncate to balanced size
    phishing_features = phishing_features[:min_count]
    legit_features = legit_features[:min_count]

    # Combine results
    logger.info("\n5. Combining datasets...")
    all_features = phishing_features + legit_features

    if len(all_features) == 0:
        logger.error("No features extracted! Check error messages above.")
        return

    # Create DataFrame
    logger.info("\n6. Creating features DataFrame...")
    features_df = pd.DataFrame(all_features)

    # Reorder columns (filename and label first, then features)
    feature_cols = [col for col in features_df.columns if col not in ['filename', 'label']]
    features_df = features_df[['filename', 'label'] + feature_cols]

    # Shuffle dataset (fixed seed for reproducibility)
    features_df = features_df.sample(frac=1, random_state=42).reset_index(drop=True)

    logger.info(f" Shape: {features_df.shape}")
    logger.info(f" Features: {len(feature_cols)}")

    # Show label distribution
    logger.info(f"\n Label distribution:")
    label_counts = features_df['label'].value_counts()
    for label, count in label_counts.items():
        label_name = 'Phishing' if label == 1 else 'Legitimate'
        logger.info(f" {label_name}: {count:,} ({count/len(features_df)*100:.1f}%)")

    # Save to CSV
    logger.info(f"\n7. Saving features to: {output_path}")
    output_path.parent.mkdir(parents=True, exist_ok=True)
    features_df.to_csv(output_path, index=False)
    logger.info(f" ✓ Saved!")

    # Show statistics
    logger.info("\n" + "="*80)
    logger.info("EXTRACTION SUMMARY")
    logger.info("="*80)
    logger.info(f"\nTotal samples: {len(features_df):,}")
    logger.info(f" Phishing: {len(phishing_features):,}")
    logger.info(f" Legitimate: {len(legit_features):,}")
    logger.info(f"\nFeatures extracted: {len(feature_cols)}")
    # NOTE(review): computed after truncation, so this is always 100.0%;
    # also raises KeyError if either class ended up empty — confirm intent.
    logger.info(f"Dataset balance: {(label_counts[0]/label_counts[1])*100:.1f}%")

    # Show sample statistics
    logger.info(f"\nFeature statistics (first 10 features):")
    numeric_cols = features_df.select_dtypes(include=['int64', 'float64']).columns[:10]
    stats = features_df[numeric_cols].describe()
    logger.info(f"\n{stats.to_string()}")

    logger.info("\n" + "="*80)
    logger.info("✓ QUALITY-FILTERED HTML FEATURES EXTRACTION COMPLETE!")
    logger.info("="*80)
    logger.info(f"\nOutput file: {output_path}")
    logger.info(f"Shape: {features_df.shape}")
    logger.info(f"Quality filter removed low-quality HTML files")
    logger.info("="*80)
302
+
303
+
304
# Script entry point: run the extraction pipeline when executed directly.
if __name__ == '__main__':
    main()
scripts/feature_extraction/html/v1/html_features.py ADDED
@@ -0,0 +1,382 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ HTML Feature Extractor for Phishing Detection
3
+ Extracts ~50 features from HTML content including forms, links, scripts, etc.
4
+ """
5
+ import re
6
+ from pathlib import Path
7
+ from bs4 import BeautifulSoup
8
+ from urllib.parse import urlparse
9
+ import pandas as pd
10
+ import numpy as np
11
+
12
+
13
class HTMLFeatureExtractor:
    """Extract features from HTML content for phishing detection.

    All features are numeric (counts, lengths, ratios, or 0/1 flags) so the
    output dict can be loaded straight into a DataFrame.
    """

    def __init__(self):
        # Common legitimate brand keywords (matched in lower-cased page text)
        self.brand_keywords = [
            'paypal', 'amazon', 'google', 'microsoft', 'apple', 'facebook',
            'netflix', 'ebay', 'instagram', 'twitter', 'linkedin', 'yahoo',
            'bank', 'visa', 'mastercard', 'americanexpress', 'chase', 'wells',
            'citibank', 'dhl', 'fedex', 'ups', 'usps'
        ]

        # Urgency/phishing keywords
        self.urgency_keywords = [
            'urgent', 'verify', 'account', 'suspended', 'locked', 'confirm',
            'update', 'security', 'alert', 'warning', 'expire', 'limited',
            'immediately', 'click here', 'act now', 'suspended', 'unusual',
            'unauthorized', 'restricted'
        ]

    def extract_features(self, html_content, url=None):
        """
        Extract all HTML features from content.

        Args:
            html_content: HTML string content
            url: Optional URL for additional context

        Returns:
            Dictionary of features; on any parse/extraction error the
            all-zeros default dictionary is returned instead.
        """
        features = {}

        try:
            soup = BeautifulSoup(html_content, 'html.parser')

            # Basic structure features
            features.update(self._extract_structure_features(soup))

            # Form features
            features.update(self._extract_form_features(soup))

            # Link features
            features.update(self._extract_link_features(soup, url))

            # Script features
            features.update(self._extract_script_features(soup))

            # Text content features
            features.update(self._extract_text_features(soup))

            # Meta tag features
            features.update(self._extract_meta_features(soup))

            # External resource features
            features.update(self._extract_resource_features(soup, url))

            # Advanced phishing indicators
            features.update(self._extract_advanced_features(soup))

        except Exception as e:
            print(f"Error extracting features: {e}")
            # Return default features on error
            features = self._get_default_features()

        return features

    def _extract_structure_features(self, soup):
        """Extract basic HTML structure features (tag counts, title flag)."""
        return {
            'html_length': len(str(soup)),
            'num_tags': len(soup.find_all()),
            'num_divs': len(soup.find_all('div')),
            'num_spans': len(soup.find_all('span')),
            'num_paragraphs': len(soup.find_all('p')),
            'num_headings': len(soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])),
            'num_lists': len(soup.find_all(['ul', 'ol'])),
            'num_images': len(soup.find_all('img')),
            'num_iframes': len(soup.find_all('iframe')),
            'num_tables': len(soup.find_all('table')),
            'has_title': 1 if soup.find('title') else 0,
        }

    def _extract_form_features(self, soup):
        """Extract form-related features, including action destinations."""
        forms = soup.find_all('form')

        features = {
            'num_forms': len(forms),
            'num_input_fields': len(soup.find_all('input')),
            'num_password_fields': len(soup.find_all('input', {'type': 'password'})),
            'num_email_fields': len(soup.find_all('input', {'type': 'email'})),
            'num_text_fields': len(soup.find_all('input', {'type': 'text'})),
            'num_submit_buttons': len(soup.find_all(['input', 'button'], {'type': 'submit'})),
            'num_hidden_fields': len(soup.find_all('input', {'type': 'hidden'})),
            'has_form': 1 if forms else 0,
        }

        # Check form actions: absolute URLs (off-site posts) vs empty/self.
        if forms:
            form_actions = [form.get('action', '') for form in forms]
            features['num_external_form_actions'] = sum(1 for action in form_actions
                                                        if action.startswith('http'))
            features['num_empty_form_actions'] = sum(1 for action in form_actions
                                                     if not action or action == '#')
        else:
            features['num_external_form_actions'] = 0
            features['num_empty_form_actions'] = 0

        return features

    def _extract_link_features(self, soup, url=None):
        """Extract link-related features (counts, external ratio, IP links)."""
        links = soup.find_all('a')
        hrefs = [link.get('href', '') for link in links]

        features = {
            'num_links': len(links),
            'num_external_links': sum(1 for href in hrefs if href.startswith('http')),
            'num_internal_links': sum(1 for href in hrefs if href.startswith('/') or href.startswith('#')),
            'num_empty_links': sum(1 for href in hrefs if not href or href == '#'),
            'num_mailto_links': sum(1 for href in hrefs if href.startswith('mailto:')),
            'num_javascript_links': sum(1 for href in hrefs if 'javascript:' in href.lower()),
        }

        # Calculate ratio of external links
        if features['num_links'] > 0:
            features['ratio_external_links'] = features['num_external_links'] / features['num_links']  # type: ignore
        else:
            features['ratio_external_links'] = 0

        # Links addressed by raw IPv4 address — a strong phishing signal.
        features['num_ip_based_links'] = sum(1 for href in hrefs
                                             if re.search(r'http://\d+\.\d+\.\d+\.\d+', href))

        return features

    def _extract_script_features(self, soup):
        """Extract JavaScript/script features, including obfuscation calls."""
        scripts = soup.find_all('script')

        features = {
            'num_scripts': len(scripts),
            'num_inline_scripts': sum(1 for script in scripts if script.string),
            'num_external_scripts': sum(1 for script in scripts if script.get('src')),
        }

        # Substring scan over all inline script bodies.
        script_content = ' '.join([script.string for script in scripts if script.string])
        features['has_eval'] = 1 if 'eval(' in script_content else 0
        features['has_unescape'] = 1 if 'unescape(' in script_content else 0
        features['has_escape'] = 1 if 'escape(' in script_content else 0
        features['has_document_write'] = 1 if 'document.write' in script_content else 0

        return features

    def _extract_text_features(self, soup):
        """Extract visible-text features (brand/urgency keywords, contacts)."""
        # Get all visible text
        text = soup.get_text(separator=' ', strip=True).lower()

        features = {
            'text_length': len(text),
            'num_words': len(text.split()),
        }

        # Check for brand mentions
        features['num_brand_mentions'] = sum(1 for brand in self.brand_keywords
                                             if brand in text)

        # Check for urgency keywords
        features['num_urgency_keywords'] = sum(1 for keyword in self.urgency_keywords
                                               if keyword in text)

        # Check for specific patterns
        features['has_copyright'] = 1 if '©' in text or 'copyright' in text else 0
        features['has_phone_number'] = 1 if re.search(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', text) else 0
        features['has_email'] = 1 if re.search(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Z|a-z]{2,}\b', text) else 0

        return features

    def _extract_meta_features(self, soup):
        """Extract meta tag features."""
        meta_tags = soup.find_all('meta')

        features = {
            'num_meta_tags': len(meta_tags),
            'has_description': 1 if soup.find('meta', {'name': 'description'}) else 0,
            'has_keywords': 1 if soup.find('meta', {'name': 'keywords'}) else 0,
            'has_author': 1 if soup.find('meta', {'name': 'author'}) else 0,
            'has_viewport': 1 if soup.find('meta', {'name': 'viewport'}) else 0,
        }

        # Check for refresh meta tag (often used in phishing)
        refresh_meta = soup.find('meta', {'http-equiv': 'refresh'})
        features['has_meta_refresh'] = 1 if refresh_meta else 0

        return features

    def _extract_resource_features(self, soup, url=None):
        """Extract external resource features (CSS, images, favicon)."""
        # CSS links
        css_links = soup.find_all('link', {'rel': 'stylesheet'})

        # Images
        images = soup.find_all('img')
        img_srcs = [img.get('src', '') for img in images]

        # Inline styles
        inline_style_tags = soup.find_all('style')
        inline_style_content = ''.join([tag.string or '' for tag in inline_style_tags])

        features = {
            'num_css_files': len(css_links),
            'num_external_css': sum(1 for link in css_links
                                    if link.get('href', '').startswith('http')),
            'num_external_images': sum(1 for src in img_srcs if src.startswith('http')),
            'num_data_uri_images': sum(1 for src in img_srcs if src.startswith('data:')),
            'num_inline_styles': len(inline_style_tags),
            'inline_css_length': len(inline_style_content),
            'has_favicon': 1 if soup.find('link', {'rel': 'icon'}) or soup.find('link', {'rel': 'shortcut icon'}) else 0,
        }

        return features

    def _extract_advanced_features(self, soup):
        """Extract advanced phishing indicators (obfuscation, hiding, redirects)."""
        features = {}

        # Suspicious element combinations: password field + off-site action.
        has_password = len(soup.find_all('input', {'type': 'password'})) > 0
        has_external_action = any(
            form.get('action', '').startswith('http')
            for form in soup.find_all('form')
        )
        features['password_with_external_action'] = 1 if (has_password and has_external_action) else 0

        # Obfuscation indicators — scanned over the raw markup.
        all_text = str(soup).lower()
        features['has_base64'] = 1 if 'base64' in all_text else 0
        features['has_atob'] = 1 if 'atob(' in all_text else 0
        features['has_fromcharcode'] = 1 if 'fromcharcode' in all_text else 0

        # Suspicious inline event handlers
        features['num_onload_events'] = len(soup.find_all(attrs={'onload': True}))
        features['num_onerror_events'] = len(soup.find_all(attrs={'onerror': True}))
        features['num_onclick_events'] = len(soup.find_all(attrs={'onclick': True}))

        # Domain analysis from links
        external_domains = set()
        for link in soup.find_all('a', href=True):
            href = link['href']
            if href.startswith('http'):
                try:
                    domain = urlparse(href).netloc
                    if domain:
                        external_domains.add(domain)
                except Exception:
                    # FIX: was a bare `except:` which also swallowed
                    # SystemExit/KeyboardInterrupt.
                    pass
        features['num_unique_external_domains'] = len(external_domains)

        # Suspicious patterns in forms
        forms = soup.find_all('form')
        features['num_forms_without_labels'] = sum(
            1 for form in forms
            if len(form.find_all('label')) == 0 and len(form.find_all('input')) > 0
        )

        # CSS visibility hiding (phishing technique)
        features['has_display_none'] = 1 if 'display:none' in all_text or 'display: none' in all_text else 0
        features['has_visibility_hidden'] = 1 if 'visibility:hidden' in all_text or 'visibility: hidden' in all_text else 0

        # Popup/redirect indicators
        features['has_window_open'] = 1 if 'window.open' in all_text else 0
        features['has_location_replace'] = 1 if 'location.replace' in all_text or 'location.href' in all_text else 0

        return features

    def _get_default_features(self):
        """Return dictionary with all features set to 0.

        FIX: the key set now mirrors exactly what extract_features()
        produces — the previous version omitted the five has_* meta flags
        (has_description, has_keywords, has_author, has_viewport,
        has_meta_refresh) and listed four resource keys twice (duplicate
        dict keys collapse silently), so error rows had fewer columns than
        successfully parsed rows.
        """
        feature_names = [
            # Structure features (11)
            'html_length', 'num_tags', 'num_divs', 'num_spans',
            'num_paragraphs', 'num_headings', 'num_lists', 'num_images',
            'num_iframes', 'num_tables', 'has_title',

            # Form features (10)
            'num_forms', 'num_input_fields', 'num_password_fields',
            'num_email_fields', 'num_text_fields', 'num_submit_buttons',
            'num_hidden_fields', 'has_form', 'num_external_form_actions',
            'num_empty_form_actions',

            # Link features (8)
            'num_links', 'num_external_links', 'num_internal_links',
            'num_empty_links', 'num_mailto_links', 'num_javascript_links',
            'ratio_external_links', 'num_ip_based_links',

            # Script features (7)
            'num_scripts', 'num_inline_scripts', 'num_external_scripts',
            'has_eval', 'has_unescape', 'has_escape', 'has_document_write',

            # Text features (7)
            'text_length', 'num_words', 'num_brand_mentions',
            'num_urgency_keywords', 'has_copyright', 'has_phone_number',
            'has_email',

            # Meta features (6)
            'num_meta_tags', 'has_description', 'has_keywords',
            'has_author', 'has_viewport', 'has_meta_refresh',

            # Resource features (7)
            'num_css_files', 'num_external_css', 'num_external_images',
            'num_data_uri_images', 'num_inline_styles', 'inline_css_length',
            'has_favicon',

            # Advanced phishing indicators (13)
            'password_with_external_action', 'has_base64', 'has_atob',
            'has_fromcharcode', 'num_onload_events', 'num_onerror_events',
            'num_onclick_events', 'num_unique_external_domains',
            'num_forms_without_labels', 'has_display_none',
            'has_visibility_hidden', 'has_window_open', 'has_location_replace',
        ]
        return {name: 0 for name in feature_names}

    def get_feature_names(self):
        """Return list of all feature names."""
        return list(self._get_default_features().keys())
+
340
+
341
def extract_features_from_file(html_file_path, url=None):
    """
    Extract features from a single HTML file.

    Args:
        html_file_path: Path to HTML file
        url: Optional URL for context

    Returns:
        Dictionary of features; the all-zeros defaults if the file
        cannot be read or processed.
    """
    extractor = HTMLFeatureExtractor()

    try:
        with open(html_file_path, 'r', encoding='utf-8', errors='ignore') as handle:
            markup = handle.read()
        return extractor.extract_features(markup, url)
    except Exception as e:
        print(f"Error reading file {html_file_path}: {e}")
        return extractor._get_default_features()
362
+
363
+
364
if __name__ == '__main__':
    # Manual smoke test: pass an HTML file path to dump its features,
    # or run without arguments to list the available feature names.
    import sys

    if len(sys.argv) <= 1:
        print("Usage: python html_features.py <html_file_path>")
        print("\nAvailable features:")
        extractor = HTMLFeatureExtractor()
        for i, feature in enumerate(extractor.get_feature_names(), 1):
            print(f"{i:2d}. {feature}")
        print(f"\nTotal: {len(extractor.get_feature_names())} features")
    else:
        html_file = sys.argv[1]
        features = extract_features_from_file(html_file)

        print(f"\nExtracted {len(features)} features from {html_file}:")
        print("-" * 80)
        for feature, value in features.items():
            print(f"{feature:30s}: {value}")
scripts/feature_extraction/url/__pycache__/url_features_v3.cpython-313.pyc ADDED
Binary file (50.9 kB). View file
 
scripts/feature_extraction/url/url_features_diagnostic.py ADDED
@@ -0,0 +1,51 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
import pandas as pd
from collections import Counter
from urllib.parse import urlparse

# Load the extracted feature table and split it by class label.
df = pd.read_csv('data/features/url_features.csv')
phish_df = df.loc[df['label'] == 1].copy()  # Assuming 1 = phishing
legit_df = df.loc[df['label'] == 0].copy()  # Assuming 0 = legitimate

print("=== FREE PLATFORM DETECTION ANALYSIS ===\n")

# 1. Check how often the is_free_platform feature fires in each class.
print(f"Total phishing: {len(phish_df)}")
print(f"Phishing on free platforms: {phish_df['is_free_platform'].sum()} ({phish_df['is_free_platform'].mean()*100:.1f}%)")
print(f"\nTotal legitimate: {len(legit_df)}")
print(f"Legitimate on free platforms: {legit_df['is_free_platform'].sum()} ({legit_df['is_free_platform'].mean()*100:.1f}%)")

# 2. Load original URLs
urls_df = pd.read_csv('data/processed/clean_dataset.csv')
phish_urls = urls_df.loc[urls_df['label'] == 1, 'url'].tolist()  # Adjust column names
legit_urls = urls_df.loc[urls_df['label'] == 0, 'url'].tolist()
23
# 3. Extract domains from phishing URLs
def extract_domain(url):
    """Return the lowercased network location of *url*, or '' on failure.

    URLs without a scheme get an 'http://' prefix so urlparse places the
    host in netloc instead of path.
    """
    try:
        parsed = urlparse(url if url.startswith('http') else 'http://' + url)
        return parsed.netloc.lower()
    # Was a bare `except:`: that also swallowed KeyboardInterrupt/SystemExit.
    except Exception:
        return ''
30
+
31
phish_domains = list(map(extract_domain, phish_urls))

# 4. Find common domain patterns
print("\n=== TOP 50 PHISHING DOMAINS (by frequency) ===")
domain_counts = Counter(phish_domains)
for domain, count in domain_counts.most_common(50):
    print(f"{domain:50s}: {count:5d}")

# 5. Find common suffixes (platforms)
print("\n=== COMMON DOMAIN SUFFIXES (platforms) ===")
suffixes = [
    '.'.join(domain.split('.')[-2:])  # Last 2 parts (e.g. weebly.com)
    for domain in phish_domains
    if len(domain.split('.')) >= 2
]

suffix_counts = Counter(suffixes)
print("\nTop 30 suffixes:")
for suffix, count in suffix_counts.most_common(30):
    print(f"{suffix:30s}: {count:5d} ({count/len(phish_domains)*100:.1f}%)")
scripts/feature_extraction/url/url_features_v1.py ADDED
@@ -0,0 +1,626 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ URL Feature Extraction v1 - URL-Only Features for Stage 1 Model
3
+
4
+ This extractor focuses ONLY on URL structure and lexical features.
5
+ NO HTTP requests, NO external services, NO HTML parsing.
6
+
7
+ Features:
8
+ - Lexical (length, characters, entropy)
9
+ - Structural (domain parts, path segments, TLD)
10
+ - Statistical (entropy, n-grams, patterns)
11
+ - Security indicators (from URL only)
12
+ - Brand/phishing patterns
13
+
14
+ Designed for:
15
+ - Fast inference (< 1ms per URL)
16
+ - No network dependencies
17
+ - Production deployment
18
+ """
19
+
20
+ import pandas as pd
21
+ import numpy as np
22
+ from urllib.parse import urlparse, parse_qs, unquote
23
+ import re
24
+ import math
25
+ import socket
26
+ from pathlib import Path
27
+ from collections import Counter
28
+ import sys
29
+ import logging
30
+
31
+ # Setup logging
32
+ logging.basicConfig(
33
+ level=logging.INFO,
34
+ format='%(asctime)s - %(levelname)s - %(message)s',
35
+ datefmt='%H:%M:%S'
36
+ )
37
+ logger = logging.getLogger("url_features_v2")
38
+
39
+
40
class URLFeatureExtractorV2:
    """
    Fast URL-only feature extractor for Stage 1 phishing detection.

    No HTTP requests, no external API calls - pure URL analysis.
    """

    def __init__(self):
        """Initialize feature extractor with keyword lists."""

        # Phishing-related keywords
        self.phishing_keywords = [
            'login', 'signin', 'sign-in', 'log-in', 'logon', 'signon',
            'account', 'accounts', 'update', 'verify', 'verification',
            'secure', 'security', 'banking', 'bank', 'confirm', 'password',
            'passwd', 'credential', 'suspended', 'locked', 'unusual',
            'authenticate', 'auth', 'wallet', 'invoice', 'payment',
            'billing', 'expire', 'expired', 'limited', 'restrict',
            'urgent', 'immediately', 'alert', 'warning', 'resolve',
            'recover', 'restore', 'reactivate', 'unlock', 'validate'
        ]

        # Brand names commonly targeted
        self.brand_names = [
            'paypal', 'ebay', 'amazon', 'apple', 'microsoft', 'google',
            'facebook', 'instagram', 'twitter', 'netflix', 'linkedin',
            'dropbox', 'chase', 'wellsfargo', 'bankofamerica', 'citibank',
            'americanexpress', 'amex', 'visa', 'mastercard', 'outlook',
            'office365', 'office', 'yahoo', 'aol', 'icloud', 'adobe',
            'spotify', 'steam', 'dhl', 'fedex', 'ups', 'usps',
            'coinbase', 'binance', 'blockchain', 'metamask', 'whatsapp',
            'telegram', 'discord', 'zoom', 'docusign', 'wetransfer',
            'hsbc', 'barclays', 'santander', 'ing', 'revolut'
        ]

        # URL shorteners (matched as substrings of the host)
        self.shorteners = [
            'bit.ly', 'bitly.com', 'goo.gl', 'tinyurl.com', 't.co', 'ow.ly',
            'is.gd', 'buff.ly', 'adf.ly', 'bit.do', 'short.to', 'tiny.cc',
            'j.mp', 'surl.li', 'rb.gy', 'cutt.ly', 'qrco.de', 'v.gd',
            'shorturl.at', 'rebrand.ly', 'clck.ru', 's.id', 'shrtco.de'
        ]

        # Suspicious TLDs
        self.suspicious_tlds = {
            'tk', 'ml', 'ga', 'cf', 'gq',  # Free domains
            'xyz', 'top', 'club', 'work', 'date', 'racing', 'win',
            'loan', 'download', 'stream', 'click', 'link', 'bid',
            'review', 'party', 'trade', 'webcam', 'science',
            'accountant', 'faith', 'cricket', 'zip', 'mov'
        }

        # Trusted TLDs
        self.trusted_tlds = {
            'com', 'org', 'net', 'edu', 'gov', 'mil',
            'uk', 'us', 'ca', 'de', 'fr', 'jp', 'au',
            'nl', 'be', 'ch', 'it', 'es', 'se', 'no'
        }

        # Free hosting services
        self.free_hosting = [
            'weebly.com', 'wix.com', 'wordpress.com', 'blogspot.com',
            'tumblr.com', 'jimdo.com', 'github.io', 'gitlab.io',
            'netlify.app', 'vercel.app', 'herokuapp.com', 'firebaseapp.com',
            'web.app', 'pages.dev', 'godaddysites.com', 'square.site',
            '000webhostapp.com', 'sites.google.com', 'carrd.co'
        ]

        # Lazily-populated cache for get_feature_names().
        self._feature_names = None

    def extract_features(self, url: str) -> dict:
        """
        Extract all URL-only features from a single URL.

        Args:
            url: URL string

        Returns:
            Dictionary of features (all-zero defaults if extraction fails)
        """
        try:
            # Ensure URL has scheme
            if not url.startswith(('http://', 'https://')):
                url = 'http://' + url

            parsed = urlparse(url)
            domain = parsed.netloc.lower()
            domain_no_port = domain.split(':')[0]
            path = parsed.path
            query = parsed.query

            features = {}

            # 1. Length features
            features.update(self._length_features(url, domain_no_port, path, query))

            # 2. Character count features
            features.update(self._char_count_features(url, domain_no_port, path))

            # 3. Ratio features
            features.update(self._ratio_features(url, domain_no_port))

            # 4. Domain structure features
            features.update(self._domain_features(domain_no_port, parsed))

            # 5. Path features
            features.update(self._path_features(path))

            # 6. Query features
            features.update(self._query_features(query))

            # 7. Statistical features (entropy, patterns)
            features.update(self._statistical_features(url, domain_no_port, path))

            # 8. Security indicator features
            features.update(self._security_features(url, parsed, domain_no_port))

            # 9. Keyword/brand features
            features.update(self._keyword_features(url, domain_no_port, path))

            # 10. Encoding features
            features.update(self._encoding_features(url, domain_no_port))

            return features

        except Exception as e:
            logger.error(f"Error extracting features from URL: {url[:50]}... Error: {e}")
            return self._get_default_features()

    def _length_features(self, url: str, domain: str, path: str, query: str) -> dict:
        """Length-based features."""
        return {
            'url_length': len(url),
            'domain_length': len(domain),
            'path_length': len(path),
            'query_length': len(query),

            # Binary indicators
            'url_length_gt_75': 1 if len(url) > 75 else 0,
            'url_length_gt_100': 1 if len(url) > 100 else 0,
            'url_length_gt_150': 1 if len(url) > 150 else 0,
            'domain_length_gt_25': 1 if len(domain) > 25 else 0,
        }

    def _char_count_features(self, url: str, domain: str, path: str) -> dict:
        """Character count features."""
        return {
            # URL character counts
            'num_dots': url.count('.'),
            'num_hyphens': url.count('-'),
            'num_underscores': url.count('_'),
            'num_slashes': url.count('/'),
            'num_question_marks': url.count('?'),
            'num_ampersands': url.count('&'),
            'num_equals': url.count('='),
            'num_at': url.count('@'),
            'num_percent': url.count('%'),
            'num_digits_url': sum(c.isdigit() for c in url),
            'num_letters_url': sum(c.isalpha() for c in url),

            # Domain character counts
            'domain_dots': domain.count('.'),
            'domain_hyphens': domain.count('-'),
            'domain_digits': sum(c.isdigit() for c in domain),

            # Path character counts
            'path_slashes': path.count('/'),
            'path_dots': path.count('.'),
            'path_digits': sum(c.isdigit() for c in path),
        }

    def _ratio_features(self, url: str, domain: str) -> dict:
        """Ratio-based features."""
        url_len = max(len(url), 1)
        domain_len = max(len(domain), 1)

        return {
            'digit_ratio_url': sum(c.isdigit() for c in url) / url_len,
            'letter_ratio_url': sum(c.isalpha() for c in url) / url_len,
            'special_char_ratio': sum(not c.isalnum() for c in url) / url_len,
            'digit_ratio_domain': sum(c.isdigit() for c in domain) / domain_len,
            'symbol_ratio_domain': sum(c in '-_.' for c in domain) / domain_len,
        }

    def _domain_features(self, domain: str, parsed) -> dict:
        """Domain structure features."""
        parts = domain.split('.')
        tld = parts[-1] if parts else ''

        # Get SLD (second level domain)
        sld = parts[-2] if len(parts) > 1 else ''

        # Count subdomains (parts minus domain and TLD)
        num_subdomains = max(0, len(parts) - 2)

        # FIX: parsed.port raises ValueError for malformed ports
        # (e.g. 'http://x:abc/'); previously that exception propagated to
        # extract_features() and wiped EVERY feature of the URL to defaults.
        # Treat a malformed port as "no port" instead.
        try:
            port = parsed.port
        except ValueError:
            port = None

        return {
            'num_subdomains': num_subdomains,
            'num_domain_parts': len(parts),
            'tld_length': len(tld),
            'sld_length': len(sld),
            'longest_domain_part': max((len(p) for p in parts), default=0),
            'avg_domain_part_len': sum(len(p) for p in parts) / max(len(parts), 1),

            # TLD indicators
            'has_suspicious_tld': 1 if tld in self.suspicious_tlds else 0,
            'has_trusted_tld': 1 if tld in self.trusted_tlds else 0,

            # Port
            'has_port': 1 if port else 0,
            'has_non_std_port': 1 if port and port not in [80, 443] else 0,
        }

    def _path_features(self, path: str) -> dict:
        """Path structure features."""
        segments = [s for s in path.split('/') if s]

        # Get file extension if present
        extension = ''
        if '.' in path:
            potential_ext = path.rsplit('.', 1)[-1].split('?')[0].lower()
            if len(potential_ext) <= 10:
                extension = potential_ext

        return {
            'path_depth': len(segments),
            'max_path_segment_len': max((len(s) for s in segments), default=0),
            'avg_path_segment_len': sum(len(s) for s in segments) / max(len(segments), 1),

            # Extension features
            'has_extension': 1 if extension else 0,
            'has_php': 1 if extension == 'php' else 0,
            'has_html': 1 if extension in ['html', 'htm'] else 0,
            'has_exe': 1 if extension in ['exe', 'bat', 'cmd', 'msi'] else 0,

            # Suspicious path patterns
            'has_double_slash': 1 if '//' in path else 0,
        }

    def _query_features(self, query: str) -> dict:
        """Query string features."""
        params = parse_qs(query)

        return {
            'num_params': len(params),
            'has_query': 1 if query else 0,
            'query_value_length': sum(len(''.join(v)) for v in params.values()),
            'max_param_len': max((len(k) + len(''.join(v)) for k, v in params.items()), default=0),
        }

    def _statistical_features(self, url: str, domain: str, path: str) -> dict:
        """Statistical and entropy features."""
        return {
            # Entropy
            'url_entropy': self._entropy(url),
            'domain_entropy': self._entropy(domain),
            'path_entropy': self._entropy(path) if path else 0,

            # Consecutive character patterns
            'max_consecutive_digits': self._max_consecutive(url, str.isdigit),
            'max_consecutive_chars': self._max_consecutive(url, str.isalpha),
            'max_consecutive_consonants': self._max_consecutive_consonants(domain),

            # Character variance
            'char_repeat_rate': self._repeat_rate(url),

            # N-gram uniqueness
            'unique_bigram_ratio': self._unique_ngram_ratio(url, 2),
            'unique_trigram_ratio': self._unique_ngram_ratio(url, 3),

            # Vowel/consonant ratio in domain
            'vowel_ratio_domain': self._vowel_ratio(domain),
        }

    def _security_features(self, url: str, parsed, domain: str) -> dict:
        """Security indicator features (URL-based only)."""
        return {
            # Protocol
            'is_https': 1 if parsed.scheme == 'https' else 0,
            'is_http': 1 if parsed.scheme == 'http' else 0,

            # IP address
            'has_ip_address': 1 if self._is_ip(domain) else 0,

            # Suspicious patterns
            'has_at_symbol': 1 if '@' in url else 0,
            'has_redirect': 1 if 'redirect' in url.lower() or 'url=' in url.lower() else 0,

            # URL shortener
            'is_shortened': 1 if any(s in domain for s in self.shorteners) else 0,

            # Free hosting
            'is_free_hosting': 1 if any(h in domain for h in self.free_hosting) else 0,

            # www presence
            'has_www': 1 if domain.startswith('www.') else 0,
            'www_in_middle': 1 if 'www' in domain and not domain.startswith('www') else 0,
        }

    def _keyword_features(self, url: str, domain: str, path: str) -> dict:
        """Keyword and brand detection features."""
        url_lower = url.lower()
        domain_lower = domain.lower()
        path_lower = path.lower()

        # Count phishing keywords
        phishing_in_url = sum(1 for k in self.phishing_keywords if k in url_lower)
        phishing_in_domain = sum(1 for k in self.phishing_keywords if k in domain_lower)
        phishing_in_path = sum(1 for k in self.phishing_keywords if k in path_lower)

        # Count brand names
        brands_in_url = sum(1 for b in self.brand_names if b in url_lower)
        brands_in_domain = sum(1 for b in self.brand_names if b in domain_lower)
        brands_in_path = sum(1 for b in self.brand_names if b in path_lower)

        # Brand impersonation: brand in path but not in domain
        brand_impersonation = 1 if brands_in_path > 0 and brands_in_domain == 0 else 0

        return {
            'num_phishing_keywords': phishing_in_url,
            'phishing_in_domain': phishing_in_domain,
            'phishing_in_path': phishing_in_path,

            'num_brands': brands_in_url,
            'brand_in_domain': 1 if brands_in_domain > 0 else 0,
            'brand_in_path': 1 if brands_in_path > 0 else 0,
            'brand_impersonation': brand_impersonation,

            # Specific high-value keywords
            'has_login': 1 if 'login' in url_lower or 'signin' in url_lower else 0,
            'has_account': 1 if 'account' in url_lower else 0,
            'has_verify': 1 if 'verify' in url_lower or 'confirm' in url_lower else 0,
            'has_secure': 1 if 'secure' in url_lower or 'security' in url_lower else 0,
            'has_update': 1 if 'update' in url_lower else 0,
            'has_bank': 1 if 'bank' in url_lower else 0,
            'has_password': 1 if 'password' in url_lower or 'passwd' in url_lower else 0,
            'has_suspend': 1 if 'suspend' in url_lower or 'locked' in url_lower else 0,

            # Suspicious patterns
            'has_webscr': 1 if 'webscr' in url_lower else 0,
            'has_cmd': 1 if 'cmd=' in url_lower else 0,
            'has_cgi': 1 if 'cgi-bin' in url_lower or 'cgi_bin' in url_lower else 0,
        }

    def _encoding_features(self, url: str, domain: str) -> dict:
        """Encoding-related features."""
        # Check for punycode
        has_punycode = 'xn--' in domain

        # Decode and check difference
        try:
            decoded = unquote(url)
            encoding_diff = len(decoded) - len(url)
        except Exception:  # was a bare except
            encoding_diff = 0

        # Safe regex checks (wrap in try-except for malformed URLs)
        try:
            has_hex = 1 if re.search(r'[0-9a-f]{20,}', url.lower()) else 0
        except Exception:  # was a bare except
            has_hex = 0

        try:
            has_base64 = 1 if re.search(r'[A-Za-z0-9+/]{30,}={0,2}', url) else 0
        except Exception:  # was a bare except
            has_base64 = 0

        try:
            has_unicode = 1 if any(ord(c) > 127 for c in url) else 0
        except Exception:  # was a bare except
            has_unicode = 0

        return {
            'has_url_encoding': 1 if '%' in url else 0,
            'encoding_count': url.count('%'),
            'encoding_diff': abs(encoding_diff),
            'has_punycode': 1 if has_punycode else 0,
            'has_unicode': has_unicode,
            'has_hex_string': has_hex,
            'has_base64': has_base64,
        }

    # Helper methods
    def _entropy(self, text: str) -> float:
        """Calculate Shannon entropy."""
        if not text:
            return 0.0
        freq = Counter(text)
        length = len(text)
        return -sum((c / length) * math.log2(c / length) for c in freq.values())

    def _max_consecutive(self, text: str, condition) -> int:
        """Max consecutive characters matching condition."""
        max_count = count = 0
        for char in text:
            if condition(char):
                count += 1
                max_count = max(max_count, count)
            else:
                count = 0
        return max_count

    def _max_consecutive_consonants(self, text: str) -> int:
        """Max consecutive consonants."""
        consonants = set('bcdfghjklmnpqrstvwxyz')
        max_count = count = 0
        for char in text.lower():
            if char in consonants:
                count += 1
                max_count = max(max_count, count)
            else:
                count = 0
        return max_count

    def _repeat_rate(self, text: str) -> float:
        """Rate of repeated adjacent characters."""
        if len(text) < 2:
            return 0.0
        repeats = sum(1 for i in range(len(text) - 1) if text[i] == text[i + 1])
        return repeats / (len(text) - 1)

    def _unique_ngram_ratio(self, text: str, n: int) -> float:
        """Ratio of unique n-grams to total n-grams."""
        if len(text) < n:
            return 0.0
        ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
        return len(set(ngrams)) / len(ngrams)

    def _vowel_ratio(self, text: str) -> float:
        """Ratio of vowels in text."""
        if not text:
            return 0.0
        vowels = sum(1 for c in text.lower() if c in 'aeiou')
        letters = sum(1 for c in text if c.isalpha())
        return vowels / max(letters, 1)

    def _is_ip(self, domain: str) -> bool:
        """Check if domain is IP address."""
        # IPv4
        if re.match(r'^(\d{1,3}\.){3}\d{1,3}$', domain):
            return True
        # IPv6
        try:
            socket.inet_pton(socket.AF_INET6, domain.strip('[]'))
            return True
        # FIX: narrowed from a bare except; inet_pton raises OSError for
        # invalid addresses (ValueError kept for safety on odd inputs).
        except (OSError, ValueError):
            return False

    def _get_default_features(self) -> dict:
        """Default feature values for error cases."""
        return {name: 0 for name in self.get_feature_names()}

    def get_feature_names(self) -> list:
        """Get list of all feature names.

        FIX: the names are now derived from a real extraction run (and
        cached), so the list can never drift out of sync with
        extract_features() the way the previous hardcoded copy could.
        Name order matches the extraction order, which matches the order
        of the old hardcoded list.
        """
        if self._feature_names is None:
            # Recursion guard: if the dummy extraction ever failed it would
            # fall back to _get_default_features(), which calls back here.
            self._feature_names = []
            self._feature_names = list(
                self.extract_features('http://example.com/index.html?a=1').keys()
            )
        return list(self._feature_names)

    def extract_batch(self, urls: list, show_progress: bool = True) -> pd.DataFrame:
        """
        Extract features from multiple URLs.

        Args:
            urls: List of URL strings
            show_progress: Show progress messages

        Returns:
            DataFrame with features (one row per URL)
        """
        if show_progress:
            logger.info(f"Extracting URL features from {len(urls):,} URLs...")

        features_list = []
        progress_interval = 50000

        for i, url in enumerate(urls):
            if show_progress and i > 0 and i % progress_interval == 0:
                logger.info(f" Processed {i:,} / {len(urls):,} ({100 * i / len(urls):.1f}%)")

            features = self.extract_features(url)
            features_list.append(features)

        df = pd.DataFrame(features_list)

        if show_progress:
            logger.info(f"✓ Extracted {len(df.columns)} features from {len(df):,} URLs")

        return df
565
+
566
+
567
def main():
    """Extract URL-only features from dataset.

    Reads data/processed/clean_dataset.csv (relative to this script),
    runs the Stage-1 URL-only extractor over every URL, and writes the
    feature matrix plus label column to data/features/.
    """
    import argparse

    cli = argparse.ArgumentParser(description='URL-Only Feature Extraction (Stage 1)')
    cli.add_argument('--sample', type=int, default=None, help='Sample N URLs')
    cli.add_argument('--output', type=str, default=None, help='Output filename')
    args = cli.parse_args()

    banner = "=" * 70
    logger.info(banner)
    logger.info("URL-Only Feature Extraction v1")
    logger.info(banner)
    logger.info("")
    logger.info("Features: URL structure, lexical, statistical")
    logger.info("NO HTTP requests, NO external APIs")
    logger.info("")

    # Load dataset (path resolved relative to this script's location)
    script_dir = Path(__file__).parent
    data_file = (script_dir / '../../data/processed/clean_dataset.csv').resolve()

    logger.info(f"Loading: {data_file.name}")
    df = pd.read_csv(data_file)
    logger.info(f"Loaded: {len(df):,} URLs")

    if args.sample and args.sample < len(df):
        df = df.sample(n=args.sample, random_state=42)
        logger.info(f"Sampled: {len(df):,} URLs")

    # Extract features
    extractor = URLFeatureExtractorV2()
    features_df = extractor.extract_batch(df['url'].tolist())
    features_df['label'] = df['label'].values

    # Save
    output_dir = (script_dir / '../../data/features').resolve()
    output_dir.mkdir(parents=True, exist_ok=True)

    suffix = f'_sample{args.sample}' if args.sample else ''
    output_file = output_dir / (args.output or f'url_features{suffix}.csv')

    features_df.to_csv(output_file, index=False)

    logger.info("")
    logger.info(banner)
    logger.info(f"✓ Saved: {output_file}")
    logger.info(f" Shape: {features_df.shape}")
    logger.info(f" Features: {len(features_df.columns) - 1}")
    logger.info(banner)

    # Show stats
    print("\nFeature Statistics (sample):")
    print(features_df.describe().T.head(20))


if __name__ == "__main__":
    main()
scripts/feature_extraction/url/url_features_v2.py ADDED
@@ -0,0 +1,1396 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ URL Feature Extraction v2 - IMPROVED VERSION
3
+
4
+ Improvements:
5
+ - Fixed free hosting detection (exact/suffix match instead of substring)
6
+ - Added free platform detection (Google Sites, Weebly, Firebase, etc.)
7
+ - Added UUID subdomain detection (Replit, Firebase patterns)
8
+ - Added platform subdomain length feature
9
+ - Added longest_part thresholds (gt_20, gt_30, gt_40)
10
+ - Expanded brand list with regional brands
11
+ - Improved extension categorization (added archive, image categories)
12
+ - Fixed get_feature_names() to be dynamic
13
+ - Better URL shortener detection
14
+
15
+ Key Features:
16
+ - Lexical (length, characters, entropy)
17
+ - Structural (domain parts, path segments, TLD)
18
+ - Statistical (entropy, n-grams, patterns)
19
+ - Security indicators (from URL only)
20
+ - Brand/phishing patterns
21
+ - FREE PLATFORM ABUSE DETECTION (NEW!)
22
+
23
+ Designed for:
24
+ - Fast inference (< 1ms per URL)
25
+ - No network dependencies
26
+ - Production deployment
27
+ """
28
+
29
+ import pandas as pd
30
+ import numpy as np
31
+ from urllib.parse import urlparse, parse_qs, unquote
32
+ import re
33
+ import math
34
+ import socket
35
+ import unicodedata
36
+ from pathlib import Path
37
+ from collections import Counter
38
+ import sys
39
+ import logging
40
+
41
+ # Setup logging
42
+ logging.basicConfig(
43
+ level=logging.INFO,
44
+ format='%(asctime)s - %(levelname)s - %(message)s',
45
+ datefmt='%H:%M:%S'
46
+ )
47
+ logger = logging.getLogger("url_features_v2")
48
+
49
+
50
+ class URLFeatureExtractorV2:
51
+ """
52
+ Fast URL-only feature extractor for Stage 1 phishing detection.
53
+
54
+ IMPROVED VERSION with better free platform detection.
55
+ """
56
+
57
+ def __init__(self):
58
+ """Initialize feature extractor with keyword lists."""
59
+
60
+ # Phishing-related keywords
61
+ self.phishing_keywords = [
62
+ 'login', 'signin', 'sign-in', 'log-in', 'logon', 'signon',
63
+ 'account', 'accounts', 'update', 'verify', 'verification',
64
+ 'secure', 'security', 'banking', 'bank', 'confirm', 'password',
65
+ 'passwd', 'credential', 'suspended', 'locked', 'unusual',
66
+ 'authenticate', 'auth', 'wallet', 'invoice', 'payment',
67
+ 'billing', 'expire', 'expired', 'limited', 'restrict',
68
+ 'urgent', 'immediately', 'alert', 'warning', 'resolve',
69
+ 'recover', 'restore', 'reactivate', 'unlock', 'validate'
70
+ ]
71
+
72
+ # Brand names - EXPANDED with regional brands
73
+ self.brand_names = [
74
+ # US Tech Giants
75
+ 'paypal', 'ebay', 'amazon', 'apple', 'microsoft', 'google',
76
+ 'facebook', 'instagram', 'twitter', 'netflix', 'linkedin',
77
+ 'dropbox', 'adobe', 'spotify', 'steam', 'zoom', 'docusign',
78
+ 'salesforce', 'shopify', 'square', 'venmo', 'cashapp', 'zelle',
79
+
80
+ # US Banks
81
+ 'chase', 'wellsfargo', 'bankofamerica', 'citibank', 'citi',
82
+ 'americanexpress', 'amex', 'visa', 'mastercard',
83
+ 'capitalone', 'usbank', 'pnc', 'truist',
84
+
85
+ # Email/Communication
86
+ 'outlook', 'office365', 'office', 'yahoo', 'aol', 'icloud',
87
+ 'gmail', 'protonmail', 'whatsapp', 'telegram', 'discord',
88
+ 'signal', 'skype', 'teams',
89
+
90
+ # Shipping/Logistics
91
+ 'dhl', 'fedex', 'ups', 'usps', 'amazon', 'alibaba',
92
+
93
+ # Crypto/Finance
94
+ 'coinbase', 'binance', 'blockchain', 'metamask', 'kraken',
95
+ 'gemini', 'robinhood', 'etrade', 'fidelity', 'schwab',
96
+ 'payoneer', 'stripe', 'wise', 'revolut',
97
+
98
+ # Social/Entertainment
99
+ 'tiktok', 'snapchat', 'twitch', 'roblox', 'epic', 'epicgames',
100
+ 'playstation', 'xbox', 'nintendo', 'blizzard', 'riot',
101
+
102
+ # REGIONAL BRANDS (from analysis)
103
+ # Europe
104
+ 'allegro', 'allegrolokalnie', # Poland
105
+ 'olx', # Europe/LatAm
106
+ 'bol', 'marktplaats', # Netherlands
107
+ 'leboncoin', # France
108
+ 'idealo', 'otto', # Germany
109
+ 'hsbc', 'barclays', 'santander', 'ing', 'revolut', # European banks
110
+
111
+ # Asia
112
+ 'rakuten', # Japan
113
+ 'lazada', 'shopee', # Southeast Asia
114
+ 'baidu', 'taobao', 'alipay', 'wechat', 'weibo', # China
115
+ 'paytm', 'phonepe', # India
116
+
117
+ # Latin America
118
+ 'mercadolibre', 'mercadopago', # LatAm
119
+
120
+ # Russia
121
+ 'yandex', 'vk', 'mailru',
122
+
123
+ # Other
124
+ 'uber', 'lyft', 'airbnb', 'booking', 'expedia',
125
+ 'wetransfer', 'mediafire', 'mega',
126
+ ]
127
+
128
+ # URL shorteners - EXACT MATCH ONLY
129
+ self.shorteners = {
130
+ # Original
131
+ 'bit.ly', 'bitly.com', 'goo.gl', 'tinyurl.com', 't.co', 'ow.ly',
132
+ 'is.gd', 'buff.ly', 'adf.ly', 'bit.do', 'short.to', 'tiny.cc',
133
+ 'j.mp', 'surl.li', 'rb.gy', 'cutt.ly', 'qrco.de', 'v.gd',
134
+ 'shorturl.at', 'rebrand.ly', 'clck.ru', 's.id', 'shrtco.de',
135
+
136
+ # NEW from analysis (CRITICAL!)
137
+ 'qrco.de', # 3,824 occurrences!
138
+ 'q-r.to', # 2,974
139
+ 'l.ead.me', # 2,907
140
+ 'ead.me', # Base domain
141
+ 'urlz.fr',
142
+ 'hotm.art',
143
+ 'reurl.cc',
144
+ 'did.li',
145
+ 'zpr.io',
146
+ 'linkin.bio',
147
+ 'linqapp.com',
148
+ 'linktr.ee',
149
+ 'flow.page',
150
+ 'campsite.bio',
151
+ 'qr-codes.io',
152
+ 'scanned.page',
153
+ 'l.wl.co',
154
+ 'wl.co',
155
+ 'hm.ru',
156
+ 'flowcode.com',
157
+ }
158
+
159
+ # Suspicious TLDs
160
+ self.suspicious_tlds = {
161
+ 'tk', 'ml', 'ga', 'cf', 'gq', # Free domains
162
+ 'xyz', 'top', 'club', 'work', 'date', 'racing', 'win',
163
+ 'loan', 'download', 'stream', 'click', 'link', 'bid',
164
+ 'review', 'party', 'trade', 'webcam', 'science',
165
+ 'accountant', 'faith', 'cricket', 'zip', 'mov',
166
+ 'icu', 'buzz', 'space', 'online', 'site', 'website',
167
+ 'tech', 'store', 'rest', 'cfd', 'monster', 'sbs'
168
+ }
169
+
170
+ # Trusted TLDs
171
+ self.trusted_tlds = {
172
+ 'com', 'org', 'net', 'edu', 'gov', 'mil',
173
+ 'uk', 'us', 'ca', 'de', 'fr', 'jp', 'au',
174
+ 'nl', 'be', 'ch', 'it', 'es', 'se', 'no',
175
+ 'pl', 'br', 'in', 'mx', 'kr', 'ru', 'cn'
176
+ }
177
+
178
+ # FREE PLATFORMS - EXACT/SUFFIX MATCH (from your PhishTank analysis!)
179
+ self.free_platforms = {
180
+ # Website Builders
181
+ 'weebly.com', 'wixsite.com', 'wix.com', 'webflow.io',
182
+ 'framer.website', 'carrd.co', 'notion.site', 'webwave.me',
183
+ 'godaddysites.com', 'square.site', 'sites.google.com',
184
+
185
+ # Google Platforms (HIGH PHISHING RATE from analysis)
186
+ 'firebaseapp.com', 'web.app', 'appspot.com',
187
+ 'firebase.app', 'page.link',
188
+
189
+ # Developer Platforms (from analysis: Replit, Vercel, etc.)
190
+ 'github.io', 'gitlab.io', 'pages.github.com',
191
+ 'vercel.app', 'netlify.app', 'netlify.com',
192
+ 'replit.dev', 'repl.co', 'replit.co',
193
+ 'glitch.me', 'glitch.com',
194
+ 'pages.dev', 'workers.dev', # Cloudflare
195
+ 'herokuapp.com', 'heroku.com',
196
+ 'onrender.com', 'railway.app', 'fly.dev',
197
+ 'amplifyapp.com', # AWS Amplify
198
+ 'surge.sh', 'now.sh',
199
+
200
+ # Blogging/CMS
201
+ 'wordpress.com', 'blogspot.com', 'blogger.com',
202
+ 'tumblr.com', 'medium.com', 'ghost.io',
203
+ 'substack.com', 'beehiiv.com',
204
+
205
+ # Adobe/Creative
206
+ 'adobesites.com', 'myportfolio.com', 'behance.net',
207
+ 'adobe.com', 'framer.app',
208
+
209
+ # Forms/Surveys (from analysis: jotform, hsforms)
210
+ 'jotform.com', 'typeform.com', 'forms.gle',
211
+ 'hsforms.com', 'hubspot.com', 'surveymonkey.com',
212
+ 'formstack.com', 'cognito.com',
213
+
214
+ # File Sharing
215
+ 'dropboxusercontent.com', 'dl.dropboxusercontent.com',
216
+ 'sharepoint.com', '1drv.ms', 'onedrive.live.com',
217
+ 'box.com', 'wetransfer.com', 'we.tl',
218
+
219
+ # Free Hosting
220
+ '000webhostapp.com', 'freehosting.com', 'freehostia.com',
221
+ '5gbfree.com', 'x10hosting.com', 'awardspace.com',
222
+ 'byet.host', 'infinityfree.com',
223
+
224
+ # Education/Sandbox
225
+ 'repl.it', 'codepen.io', 'jsfiddle.net', 'codesandbox.io',
226
+ 'stackblitz.com', 'observablehq.com',
227
+
228
+ # Other (from analysis)
229
+ 'webcindario.com', 'gitbook.io', 'tinyurl.com',
230
+ 'start.page', 'my.site', 'site123.com'
231
+ }
232
+
233
+ # Common English words for dictionary check
234
+ self.common_words = {
235
+ 'about', 'account', 'after', 'again', 'all', 'also', 'america', 'american',
236
+ 'another', 'answer', 'any', 'app', 'apple', 'area', 'back', 'bank', 'best',
237
+ 'between', 'book', 'business', 'call', 'can', 'card', 'care', 'case', 'center',
238
+ 'central', 'change', 'check', 'city', 'class', 'cloud', 'come', 'company',
239
+ 'contact', 'control', 'country', 'course', 'credit', 'data', 'day', 'dept',
240
+ 'department', 'different', 'digital', 'doctor', 'down', 'east', 'easy', 'end',
241
+ 'energy', 'even', 'event', 'every', 'express', 'fact', 'family', 'feel',
242
+ 'field', 'file', 'find', 'first', 'food', 'form', 'free', 'friend', 'from',
243
+ 'game', 'general', 'get', 'give', 'global', 'good', 'government', 'great',
244
+ 'group', 'hand', 'have', 'head', 'health', 'help', 'here', 'high', 'home',
245
+ 'house', 'how', 'image', 'info', 'information', 'insurance', 'international',
246
+ 'into', 'just', 'keep', 'kind', 'know', 'large', 'last', 'late', 'leave',
247
+ 'left', 'legal', 'life', 'like', 'line', 'little', 'local', 'long', 'look',
248
+ 'love', 'mail', 'main', 'make', 'management', 'manager', 'many', 'map', 'market',
249
+ 'marketing', 'media', 'medical', 'member', 'message', 'money', 'month', 'more',
250
+ 'most', 'move', 'music', 'name', 'national', 'need', 'network', 'never', 'new',
251
+ 'news', 'next', 'north', 'not', 'note', 'number', 'office', 'official', 'old',
252
+ 'online', 'only', 'open', 'order', 'other', 'over', 'page', 'part', 'party',
253
+ 'people', 'person', 'personal', 'photo', 'place', 'plan', 'play', 'plus', 'point',
254
+ 'policy', 'portal', 'post', 'power', 'press', 'price', 'private', 'product',
255
+ 'program', 'project', 'property', 'public', 'quality', 'question', 'quick', 'rate',
256
+ 'read', 'real', 'record', 'report', 'research', 'resource', 'result', 'right',
257
+ 'room', 'sale', 'sales', 'save', 'school', 'search', 'second', 'section',
258
+ 'security', 'see', 'senior', 'service', 'services', 'set', 'shop', 'show',
259
+ 'side', 'sign', 'site', 'small', 'social', 'software', 'solution', 'solutions',
260
+ 'some', 'south', 'space', 'special', 'staff', 'start', 'state', 'store', 'story',
261
+ 'student', 'study', 'support', 'sure', 'system', 'systems', 'take', 'team', 'tech',
262
+ 'technology', 'test', 'text', 'than', 'that', 'their', 'them', 'then', 'there',
263
+ 'these', 'they', 'thing', 'think', 'this', 'those', 'through', 'time', 'today',
264
+ 'together', 'total', 'trade', 'training', 'travel', 'trust', 'type', 'under',
265
+ 'university', 'until', 'update', 'upon', 'user', 'value', 'very', 'video',
266
+ 'view', 'want', 'water', 'website', 'week', 'well', 'west', 'what', 'when',
267
+ 'where', 'which', 'while', 'white', 'will', 'with', 'within', 'without', 'woman',
268
+ 'women', 'word', 'work', 'world', 'would', 'write', 'year', 'york', 'young', 'your'
269
+ }
270
+
271
+ # Keyboard patterns
272
+ self.keyboard_patterns = [
273
+ 'qwerty', 'asdfgh', 'zxcvbn', '12345', '123456', '1234567', '12345678',
274
+ 'qwertyuiop', 'asdfghjkl', 'zxcvbnm'
275
+ ]
276
+
277
+ # Lookalike character mappings
278
+ self.lookalike_chars = {
279
+ '0': 'o', 'o': '0',
280
+ '1': 'l', 'l': '1', 'i': '1',
281
+ 'rn': 'm', 'vv': 'w', 'cl': 'd'
282
+ }
283
+
284
+ self.microsoft_services = {
285
+ 'forms.office.com',
286
+ 'sharepoint.com',
287
+ 'onedrive.live.com',
288
+ '1drv.ms',
289
+ }
290
+
291
+ self.zoom_services = {
292
+ 'docs.zoom.us',
293
+ 'zoom.us',
294
+ }
295
+
296
+ self.adobe_services = {
297
+ 'express.adobe.com',
298
+ 'new.express.adobe.com', # Multi-level!
299
+ 'spark.adobe.com',
300
+ 'portfolio.adobe.com',
301
+ }
302
+
303
+ self.google_services = {
304
+ 'docs.google.com',
305
+ 'sites.google.com',
306
+ 'drive.google.com',
307
+ 'script.google.com',
308
+ 'storage.googleapis.com',
309
+ 'storage.cloud.google.com',
310
+ 'forms.google.com',
311
+ 'calendar.google.com',
312
+ 'meet.google.com',
313
+ }
314
+
315
+
316
+ def extract_features(self, url: str) -> dict:
317
+ """
318
+ Extract all URL-only features from a single URL.
319
+
320
+ Args:
321
+ url: URL string
322
+
323
+ Returns:
324
+ Dictionary of features
325
+ """
326
+ try:
327
+ # Ensure URL has scheme
328
+ if not url.startswith(('http://', 'https://')):
329
+ url = 'http://' + url
330
+
331
+ parsed = urlparse(url)
332
+ domain = parsed.netloc.lower()
333
+ domain_no_port = domain.split(':')[0]
334
+ path = parsed.path
335
+ query = parsed.query
336
+
337
+ features = {}
338
+
339
+ # 1. Length features
340
+ features.update(self._length_features(url, domain_no_port, path, query))
341
+
342
+ # 2. Character count features
343
+ features.update(self._char_count_features(url, domain_no_port, path))
344
+
345
+ # 3. Ratio features
346
+ features.update(self._ratio_features(url, domain_no_port))
347
+
348
+ # 4. Domain structure features
349
+ features.update(self._domain_features(domain_no_port, parsed))
350
+
351
+ # 5. Path features
352
+ features.update(self._path_features(path, domain_no_port))
353
+
354
+ # 6. Query features
355
+ features.update(self._query_features(query))
356
+
357
+ # 7. Statistical features (entropy, patterns)
358
+ features.update(self._statistical_features(url, domain_no_port, path))
359
+
360
+ # 8. Security indicator features
361
+ features.update(self._security_features(url, parsed, domain_no_port))
362
+
363
+ # 9. Keyword/brand features
364
+ features.update(self._keyword_features(url, domain_no_port, path, parsed))
365
+
366
+ # 10. Encoding features
367
+ features.update(self._encoding_features(url, domain_no_port))
368
+
369
+ return features
370
+
371
+ except Exception as e:
372
+ logger.error(f"Error extracting features from URL: {url[:50]}... Error: {e}")
373
+ return self._get_default_features()
374
+
375
+ def _length_features(self, url: str, domain: str, path: str, query: str) -> dict:
376
+ """Length-based features."""
377
+ return {
378
+ 'url_length': len(url),
379
+ 'domain_length': len(domain),
380
+ 'path_length': len(path),
381
+ 'query_length': len(query),
382
+
383
+ # Categorical length encoding
384
+ 'url_length_category': self._categorize_length(len(url), [30, 75, 150]),
385
+ 'domain_length_category': self._categorize_length(len(domain), [10, 20, 30]),
386
+ }
387
+
388
+ def _char_count_features(self, url: str, domain: str, path: str) -> dict:
389
+ """Character count features."""
390
+ return {
391
+ # URL character counts
392
+ 'num_dots': url.count('.'),
393
+ 'num_hyphens': url.count('-'),
394
+ 'num_underscores': url.count('_'),
395
+ 'num_slashes': url.count('/'),
396
+ 'num_question_marks': url.count('?'),
397
+ 'num_ampersands': url.count('&'),
398
+ 'num_equals': url.count('='),
399
+ 'num_at': url.count('@'),
400
+ 'num_percent': url.count('%'),
401
+ 'num_digits_url': sum(c.isdigit() for c in url),
402
+ 'num_letters_url': sum(c.isalpha() for c in url),
403
+
404
+ # Domain character counts
405
+ 'domain_dots': domain.count('.'),
406
+ 'domain_hyphens': domain.count('-'),
407
+ 'domain_digits': sum(c.isdigit() for c in domain),
408
+
409
+ # Path character counts
410
+ 'path_slashes': path.count('/'),
411
+ 'path_dots': path.count('.'),
412
+ 'path_digits': sum(c.isdigit() for c in path),
413
+ }
414
+
415
+ def _ratio_features(self, url: str, domain: str) -> dict:
416
+ """Ratio-based features."""
417
+ url_len = max(len(url), 1)
418
+ domain_len = max(len(domain), 1)
419
+
420
+ return {
421
+ 'digit_ratio_url': sum(c.isdigit() for c in url) / url_len,
422
+ 'letter_ratio_url': sum(c.isalpha() for c in url) / url_len,
423
+ 'special_char_ratio': sum(not c.isalnum() for c in url) / url_len,
424
+ 'digit_ratio_domain': sum(c.isdigit() for c in domain) / domain_len,
425
+ 'symbol_ratio_domain': sum(c in '-_.' for c in domain) / domain_len,
426
+ }
427
+
428
+ def _domain_features(self, domain: str, parsed) -> dict:
429
+ """Domain structure features."""
430
+ parts = domain.split('.')
431
+ tld = parts[-1] if parts else ''
432
+ sld = parts[-2] if len(parts) > 1 else ''
433
+ num_subdomains = max(0, len(parts) - 2)
434
+ longest_part = max((len(p) for p in parts), default=0)
435
+
436
+ return {
437
+ 'num_subdomains': num_subdomains,
438
+ 'num_domain_parts': len(parts),
439
+ 'tld_length': len(tld),
440
+ 'sld_length': len(sld),
441
+ 'longest_domain_part': longest_part,
442
+ 'avg_domain_part_len': sum(len(p) for p in parts) / max(len(parts), 1),
443
+
444
+ # NEW: Longest part thresholds (from analysis!)
445
+ 'longest_part_gt_20': 1 if longest_part > 20 else 0,
446
+ 'longest_part_gt_30': 1 if longest_part > 30 else 0,
447
+ 'longest_part_gt_40': 1 if longest_part > 40 else 0,
448
+
449
+ # TLD indicators
450
+ 'has_suspicious_tld': 1 if tld in self.suspicious_tlds else 0,
451
+ 'has_trusted_tld': 1 if tld in self.trusted_tlds else 0,
452
+
453
+ # Port
454
+ 'has_port': 1 if parsed.port else 0,
455
+ 'has_non_std_port': 1 if parsed.port and parsed.port not in [80, 443] else 0,
456
+
457
+ # Domain randomness features
458
+ 'domain_randomness_score': self._calculate_domain_randomness(sld),
459
+ 'sld_consonant_cluster_score': self._consonant_clustering_score(sld),
460
+ 'sld_keyboard_pattern': self._keyboard_pattern_score(sld),
461
+ 'sld_has_dictionary_word': self._contains_dictionary_word(sld),
462
+ 'sld_pronounceability_score': self._pronounceability_score(sld),
463
+ 'domain_digit_position_suspicious': self._suspicious_digit_position(sld),
464
+ }
465
+
466
+ def _path_features(self, path: str, domain: str) -> dict:
467
+ """Path structure features."""
468
+ segments = [s for s in path.split('/') if s]
469
+
470
+ # Get file extension if present
471
+ extension = ''
472
+ if '.' in path:
473
+ potential_ext = path.rsplit('.', 1)[-1].split('?')[0].lower()
474
+ if len(potential_ext) <= 10:
475
+ extension = potential_ext
476
+
477
+ return {
478
+ 'path_depth': len(segments),
479
+ 'max_path_segment_len': max((len(s) for s in segments), default=0),
480
+ 'avg_path_segment_len': sum(len(s) for s in segments) / max(len(segments), 1),
481
+
482
+ # Extension features
483
+ 'has_extension': 1 if extension else 0,
484
+ 'extension_category': self._categorize_extension(extension),
485
+ 'has_suspicious_extension': 1 if extension in ['zip', 'exe', 'apk', 'scr', 'bat', 'cmd'] else 0,
486
+ 'has_exe': 1 if extension in ['exe', 'bat', 'cmd', 'msi'] else 0,
487
+
488
+ # Suspicious path patterns
489
+ 'has_double_slash': 1 if '//' in path else 0,
490
+ 'path_has_brand_not_domain': self._brand_in_path_only(path, domain),
491
+ 'path_has_ip_pattern': 1 if re.search(r'\d{1,3}[._-]\d{1,3}[._-]\d{1,3}', path) else 0,
492
+ 'suspicious_path_extension_combo': self._suspicious_extension_pattern(path),
493
+ }
494
+
495
+ def _query_features(self, query: str) -> dict:
496
+ """Query string features."""
497
+ params = parse_qs(query)
498
+
499
+ return {
500
+ 'num_params': len(params),
501
+ 'has_query': 1 if query else 0,
502
+ 'query_value_length': sum(len(''.join(v)) for v in params.values()),
503
+ 'max_param_len': max((len(k) + len(''.join(v)) for k, v in params.items()), default=0),
504
+ 'query_has_url': 1 if re.search(r'https?%3A%2F%2F|http%3A//', query.lower()) else 0,
505
+ }
506
+
507
+ def _statistical_features(self, url: str, domain: str, path: str) -> dict:
508
+ """Statistical and entropy features."""
509
+ parts = domain.split('.')
510
+ sld = parts[-2] if len(parts) > 1 else domain
511
+
512
+ return {
513
+ # Entropy
514
+ 'url_entropy': self._entropy(url),
515
+ 'domain_entropy': self._entropy(domain),
516
+ 'path_entropy': self._entropy(path) if path else 0,
517
+
518
+ # Consecutive character patterns
519
+ 'max_consecutive_digits': self._max_consecutive(url, str.isdigit),
520
+ 'max_consecutive_chars': self._max_consecutive(url, str.isalpha),
521
+ 'max_consecutive_consonants': self._max_consecutive_consonants(domain),
522
+
523
+ # Character variance
524
+ 'char_repeat_rate': self._repeat_rate(url),
525
+
526
+ # N-gram uniqueness
527
+ 'unique_bigram_ratio': self._unique_ngram_ratio(url, 2),
528
+ 'unique_trigram_ratio': self._unique_ngram_ratio(url, 3),
529
+
530
+ # Improved statistical features
531
+ 'sld_letter_diversity': self._character_diversity(sld),
532
+ 'domain_has_numbers_letters': 1 if any(c.isdigit() for c in domain) and any(c.isalpha() for c in domain) else 0,
533
+ 'url_complexity_score': self._calculate_url_complexity(url),
534
+ }
535
+
536
+ def _security_features(self, url: str, parsed, domain: str) -> dict:
537
+ """Security indicator features (URL-based only)."""
538
+ parts = domain.split('.')
539
+
540
+ return {
541
+ # IP address
542
+ 'has_ip_address': 1 if self._is_ip(domain) else 0,
543
+
544
+ # Suspicious patterns
545
+ 'has_at_symbol': 1 if '@' in url else 0,
546
+ 'has_redirect': 1 if 'redirect' in url.lower() or 'url=' in url.lower() else 0,
547
+
548
+ # URL shortener - FIXED: exact match only
549
+ 'is_shortened': self._is_url_shortener(domain),
550
+
551
+ # Free hosting - DEPRECATED (use is_free_platform instead)
552
+ 'is_free_hosting': self._is_free_platform(domain),
553
+
554
+ # NEW: Free platform detection (CRITICAL for your dataset!)
555
+ 'is_free_platform': self._is_free_platform(domain),
556
+ 'platform_subdomain_length': self._get_platform_subdomain_length(domain),
557
+ 'has_uuid_subdomain': self._detect_uuid_pattern(domain),
558
+ }
559
+
560
+ def _keyword_features(self, url: str, domain: str, path: str, parsed) -> dict:
561
+ """Keyword and brand detection features."""
562
+ url_lower = url.lower()
563
+ domain_lower = domain.lower()
564
+ path_lower = path.lower()
565
+
566
+ # Count phishing keywords
567
+ phishing_in_url = sum(1 for k in self.phishing_keywords if k in url_lower)
568
+ phishing_in_domain = sum(1 for k in self.phishing_keywords if k in domain_lower)
569
+ phishing_in_path = sum(1 for k in self.phishing_keywords if k in path_lower)
570
+
571
+ # Count brand names
572
+ brands_in_url = sum(1 for b in self.brand_names if b in url_lower)
573
+ brands_in_domain = sum(1 for b in self.brand_names if b in domain_lower)
574
+ brands_in_path = sum(1 for b in self.brand_names if b in path_lower)
575
+
576
+ # Brand impersonation
577
+ brand_impersonation = 1 if brands_in_path > 0 and brands_in_domain == 0 else 0
578
+
579
+ return {
580
+ 'num_phishing_keywords': phishing_in_url,
581
+ 'phishing_in_domain': phishing_in_domain,
582
+ 'phishing_in_path': phishing_in_path,
583
+
584
+ 'num_brands': brands_in_url,
585
+ 'brand_in_domain': 1 if brands_in_domain > 0 else 0,
586
+ 'brand_in_path': 1 if brands_in_path > 0 else 0,
587
+ 'brand_impersonation': brand_impersonation,
588
+
589
+ # Specific high-value keywords
590
+ 'has_login': 1 if 'login' in url_lower or 'signin' in url_lower else 0,
591
+ 'has_account': 1 if 'account' in url_lower else 0,
592
+ 'has_verify': 1 if 'verify' in url_lower or 'confirm' in url_lower else 0,
593
+ 'has_secure': 1 if 'secure' in url_lower or 'security' in url_lower else 0,
594
+ 'has_update': 1 if 'update' in url_lower else 0,
595
+ 'has_bank': 1 if 'bank' in url_lower else 0,
596
+ 'has_password': 1 if 'password' in url_lower or 'passwd' in url_lower else 0,
597
+ 'has_suspend': 1 if 'suspend' in url_lower or 'locked' in url_lower else 0,
598
+
599
+ # Suspicious patterns
600
+ 'has_webscr': 1 if 'webscr' in url_lower else 0,
601
+ 'has_cmd': 1 if 'cmd=' in url_lower else 0,
602
+ 'has_cgi': 1 if 'cgi-bin' in url_lower or 'cgi_bin' in url_lower else 0,
603
+
604
+ # Advanced brand spoofing features
605
+ 'brand_in_subdomain_not_domain': self._brand_subdomain_spoofing(parsed),
606
+ 'multiple_brands_in_url': 1 if brands_in_url >= 2 else 0,
607
+ 'brand_with_hyphen': self._brand_with_hyphen(domain_lower),
608
+ 'suspicious_brand_tld': self._suspicious_brand_tld(domain),
609
+ 'brand_keyword_combo': self._brand_phishing_keyword_combo(url_lower),
610
+ }
611
+
612
+ def _encoding_features(self, url: str, domain: str) -> dict:
613
+ """Encoding-related features."""
614
+ has_punycode = 'xn--' in domain
615
+
616
+ try:
617
+ decoded = unquote(url)
618
+ encoding_diff = len(decoded) - len(url)
619
+ except:
620
+ encoding_diff = 0
621
+
622
+ try:
623
+ has_hex = 1 if re.search(r'[0-9a-f]{20,}', url.lower()) else 0
624
+ except:
625
+ has_hex = 0
626
+
627
+ try:
628
+ has_base64 = 1 if re.search(r'[A-Za-z0-9+/]{30,}={0,2}', url) else 0
629
+ except:
630
+ has_base64 = 0
631
+
632
+ try:
633
+ has_unicode = 1 if any(ord(c) > 127 for c in url) else 0
634
+ except:
635
+ has_unicode = 0
636
+
637
+ return {
638
+ 'has_url_encoding': 1 if '%' in url else 0,
639
+ 'encoding_count': url.count('%'),
640
+ 'encoding_diff': abs(encoding_diff),
641
+ 'has_punycode': 1 if has_punycode else 0,
642
+ 'has_unicode': has_unicode,
643
+ 'has_hex_string': has_hex,
644
+ 'has_base64': has_base64,
645
+
646
+ # Homograph & encoding detection
647
+ 'has_lookalike_chars': self._detect_lookalike_chars(domain),
648
+ 'mixed_script_score': self._mixed_script_detection(domain),
649
+ 'homograph_brand_risk': self._homograph_brand_check(domain),
650
+ 'suspected_idn_homograph': self._idn_homograph_score(url),
651
+ 'double_encoding': self._detect_double_encoding(url),
652
+ 'encoding_in_domain': 1 if '%' in domain else 0,
653
+ 'suspicious_unicode_category': self._suspicious_unicode_chars(url),
654
+ }
655
+
656
+ # ============================================================
657
+ # HELPER METHODS
658
+ # ============================================================
659
+
660
+ def _entropy(self, text: str) -> float:
661
+ """Calculate Shannon entropy."""
662
+ if not text:
663
+ return 0.0
664
+ freq = Counter(text)
665
+ length = len(text)
666
+ return -sum((c / length) * math.log2(c / length) for c in freq.values())
667
+
668
+ def _max_consecutive(self, text: str, condition) -> int:
669
+ """Max consecutive characters matching condition."""
670
+ max_count = count = 0
671
+ for char in text:
672
+ if condition(char):
673
+ count += 1
674
+ max_count = max(max_count, count)
675
+ else:
676
+ count = 0
677
+ return max_count
678
+
679
+ def _max_consecutive_consonants(self, text: str) -> int:
680
+ """Max consecutive consonants."""
681
+ consonants = set('bcdfghjklmnpqrstvwxyz')
682
+ max_count = count = 0
683
+ for char in text.lower():
684
+ if char in consonants:
685
+ count += 1
686
+ max_count = max(max_count, count)
687
+ else:
688
+ count = 0
689
+ return max_count
690
+
691
+ def _repeat_rate(self, text: str) -> float:
692
+ """Rate of repeated adjacent characters."""
693
+ if len(text) < 2:
694
+ return 0.0
695
+ repeats = sum(1 for i in range(len(text) - 1) if text[i] == text[i + 1])
696
+ return repeats / (len(text) - 1)
697
+
698
+ def _unique_ngram_ratio(self, text: str, n: int) -> float:
699
+ """Ratio of unique n-grams to total n-grams."""
700
+ if len(text) < n:
701
+ return 0.0
702
+ ngrams = [text[i:i + n] for i in range(len(text) - n + 1)]
703
+ return len(set(ngrams)) / len(ngrams)
704
+
705
+ def _is_ip(self, domain: str) -> bool:
706
+ """Check if domain is IP address."""
707
+ # IPv4
708
+ if re.match(r'^(\d{1,3}\.){3}\d{1,3}$', domain):
709
+ return True
710
+ # IPv6
711
+ try:
712
+ socket.inet_pton(socket.AF_INET6, domain.strip('[]'))
713
+ return True
714
+ except:
715
+ return False
716
+
717
+ # ============================================================
718
+ # NEW/IMPROVED METHODS
719
+ # ============================================================
720
+
721
+ def _is_url_shortener(self, domain: str) -> int:
722
+ """
723
+ URL shortener detection - EXACT match.
724
+ """
725
+ domain_lower = domain.lower()
726
+ return 1 if domain_lower in self.shorteners else 0
727
+
728
+ def _is_free_platform(self, domain: str) -> int:
729
+ """
730
+ Detect if hosted on free platform.
731
+ CRITICAL FIX: Exact or suffix match (not substring!).
732
+
733
+ Examples:
734
+ - 'mysite.weebly.com' → 1 (suffix match)
735
+ - 'weebly.com' → 1 (exact match)
736
+ - 'weebly-alternative.com' → 0 (NOT a match!)
737
+ """
738
+ domain_lower = domain.lower()
739
+
740
+ # Exact match
741
+ if domain_lower in self.free_platforms:
742
+ return 1
743
+
744
+ if domain_lower in self.google_services:
745
+ return 1
746
+
747
+ if domain_lower in self.adobe_services:
748
+ return 1
749
+
750
+ if domain_lower in self.microsoft_services:
751
+ return 1
752
+
753
+ if domain_lower in self.zoom_services:
754
+ return 1
755
+
756
+ # Suffix match (subdomain.platform.com)
757
+ for platform in self.free_platforms:
758
+ if domain_lower.endswith('.' + platform):
759
+ return 1
760
+
761
+ return 0
762
+
763
+ def _get_platform_subdomain_length(self, domain: str) -> int:
764
+ """
765
+ IMPROVED: Handle multi-level subdomains.
766
+
767
+ Examples:
768
+ - docs.google.com → subdomain = 'docs' (4 chars)
769
+ - new.express.adobe.com → subdomain = 'new.express' (11 chars)
770
+ - storage.cloud.google.com → subdomain = 'storage.cloud' (13 chars)
771
+ """
772
+ domain_lower = domain.lower()
773
+
774
+ # Check Google
775
+ if '.google.com' in domain_lower:
776
+ subdomain = domain_lower.replace('.google.com', '')
777
+ return len(subdomain)
778
+
779
+ # Check Adobe
780
+ if '.adobe.com' in domain_lower:
781
+ subdomain = domain_lower.replace('.adobe.com', '')
782
+ return len(subdomain)
783
+
784
+ # Check Microsoft
785
+ if '.office.com' in domain_lower:
786
+ subdomain = domain_lower.replace('.office.com', '')
787
+ return len(subdomain)
788
+
789
+ # Check free platforms (existing logic)
790
+ for platform in self.free_platforms:
791
+ if domain_lower.endswith('.' + platform):
792
+ subdomain = domain_lower[:-len('.' + platform)]
793
+ return len(subdomain)
794
+
795
+ return 0
796
+
797
+ def _detect_uuid_pattern(self, domain: str) -> int:
798
+ """
799
+ Detect UUID patterns in subdomain (Replit, Firebase, etc.).
800
+
801
+ Example:
802
+ 'b82dba2b-fde4-4477-b6d5-8b17144e1bee.replit.dev' → 1
803
+ """
804
+ # UUID pattern: 8-4-4-4-12 hex characters
805
+ uuid_pattern = r'[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}'
806
+
807
+ return 1 if re.search(uuid_pattern, domain.lower()) else 0
808
+
809
+ # ============================================================
810
+ # DOMAIN RANDOMNESS HELPERS
811
+ # ============================================================
812
+
813
+ def _calculate_domain_randomness(self, domain: str) -> float:
814
+ """Calculate randomness score for domain (0-1)."""
815
+ if not domain or len(domain) < 4:
816
+ return 0.5
817
+
818
+ domain_lower = domain.lower()
819
+ scores = []
820
+
821
+ # 1. Vowel distribution
822
+ vowels = 'aeiou'
823
+ vowel_positions = [i for i, c in enumerate(domain_lower) if c in vowels]
824
+ if len(vowel_positions) >= 2:
825
+ avg_gap = sum(vowel_positions[i+1] - vowel_positions[i]
826
+ for i in range(len(vowel_positions)-1)) / (len(vowel_positions)-1)
827
+ vowel_irregularity = min(abs(avg_gap - 2.5) / 5, 1.0)
828
+ scores.append(vowel_irregularity)
829
+
830
+ # 2. Character frequency
831
+ char_freq = Counter(domain_lower)
832
+ common_letters = 'etaoinshr'
833
+ common_count = sum(char_freq.get(c, 0) for c in common_letters)
834
+ uncommon_ratio = 1 - (common_count / max(len(domain_lower), 1))
835
+ scores.append(uncommon_ratio)
836
+
837
+ # 3. Repeated characters
838
+ unique_ratio = len(set(domain_lower)) / max(len(domain_lower), 1)
839
+ if unique_ratio > 0.75:
840
+ scores.append((unique_ratio - 0.75) / 0.25)
841
+ else:
842
+ scores.append(0)
843
+
844
+ return min(sum(scores) / max(len(scores), 1), 1.0)
845
+
846
+ def _consonant_clustering_score(self, text: str) -> float:
847
+ """Detect unnatural consonant clusters."""
848
+ if not text:
849
+ return 0
850
+
851
+ text_lower = text.lower()
852
+ consonants = 'bcdfghjklmnpqrstvwxyz'
853
+
854
+ max_cluster = 0
855
+ current_cluster = 0
856
+
857
+ for char in text_lower:
858
+ if char in consonants:
859
+ current_cluster += 1
860
+ max_cluster = max(max_cluster, current_cluster)
861
+ else:
862
+ current_cluster = 0
863
+
864
+ if max_cluster >= 5:
865
+ return 1.0
866
+ elif max_cluster >= 4:
867
+ return 0.7
868
+ elif max_cluster >= 3:
869
+ return 0.4
870
+ else:
871
+ return 0.0
872
+
873
+ def _keyboard_pattern_score(self, text: str) -> int:
874
+ """Detect keyboard walking patterns."""
875
+ if not text:
876
+ return 0
877
+
878
+ text_lower = text.lower()
879
+ count = 0
880
+
881
+ for pattern in self.keyboard_patterns:
882
+ if pattern in text_lower:
883
+ count += 1
884
+
885
+ return count
886
+
887
+ def _contains_dictionary_word(self, text: str) -> int:
888
+ """Check if text contains any common English word."""
889
+ if not text or len(text) < 4:
890
+ return 0
891
+
892
+ text_lower = text.lower()
893
+
894
+ if text_lower in self.common_words:
895
+ return 1
896
+
897
+ for word in self.common_words:
898
+ if len(word) >= 4 and word in text_lower:
899
+ return 1
900
+
901
+ return 0
902
+
903
+ def _pronounceability_score(self, text: str) -> float:
904
+ """Score based on bigram frequencies in English."""
905
+ if not text or len(text) < 2:
906
+ return 0.5
907
+
908
+ text_lower = text.lower()
909
+
910
+ common_bigrams = {
911
+ 'th', 'he', 'in', 'er', 'an', 're', 'on', 'at', 'en', 'nd',
912
+ 'ti', 'es', 'or', 'te', 'of', 'ed', 'is', 'it', 'al', 'ar',
913
+ 'st', 'to', 'nt', 'ng', 'se', 'ha', 'as', 'ou', 'io', 've'
914
+ }
915
+
916
+ bigrams = [text_lower[i:i+2] for i in range(len(text_lower)-1)]
917
+
918
+ if not bigrams:
919
+ return 0.5
920
+
921
+ common_count = sum(1 for bg in bigrams if bg in common_bigrams)
922
+ score = common_count / len(bigrams)
923
+
924
+ return score
925
+
926
+ def _suspicious_digit_position(self, text: str) -> int:
927
+ """Detect suspicious digit positions."""
928
+ if not text:
929
+ return 0
930
+
931
+ if text and text[0].isdigit():
932
+ return 1
933
+
934
+ if len(text) >= 2 and text[-1].isdigit() and text[-2].isdigit():
935
+ return 1
936
+
937
+ return 0
938
+
939
+ # ============================================================
940
+ # BRAND SPOOFING HELPERS
941
+ # ============================================================
942
+
943
+ def _brand_subdomain_spoofing(self, parsed) -> int:
944
+ """Detect brand in subdomain but not main domain."""
945
+ try:
946
+ parts = parsed.netloc.split('.')
947
+ if len(parts) < 3:
948
+ return 0
949
+
950
+ subdomains = '.'.join(parts[:-2]).lower()
951
+ main_domain = '.'.join(parts[-2:]).lower()
952
+
953
+ for brand in self.brand_names:
954
+ if brand in subdomains and brand not in main_domain:
955
+ return 1
956
+
957
+ return 0
958
+ except:
959
+ return 0
960
+
961
+ def _brand_with_hyphen(self, domain: str) -> int:
962
+ """Detect hyphenated brand names."""
963
+ if not domain:
964
+ return 0
965
+
966
+ domain_lower = domain.lower()
967
+
968
+ for brand in self.brand_names:
969
+ if len(brand) >= 4:
970
+ for i in range(1, len(brand)):
971
+ hyphenated = brand[:i] + '-' + brand[i:]
972
+ if hyphenated in domain_lower:
973
+ return 1
974
+
975
+ return 0
976
+
977
+ def _suspicious_brand_tld(self, domain: str) -> int:
978
+ """Detect brand name with suspicious TLD."""
979
+ if not domain:
980
+ return 0
981
+
982
+ domain_lower = domain.lower()
983
+ parts = domain_lower.split('.')
984
+
985
+ if len(parts) < 2:
986
+ return 0
987
+
988
+ tld = parts[-1]
989
+ domain_without_tld = '.'.join(parts[:-1])
990
+
991
+ if tld in self.suspicious_tlds:
992
+ for brand in self.brand_names:
993
+ if brand in domain_without_tld:
994
+ return 1
995
+
996
+ return 0
997
+
998
+ def _brand_phishing_keyword_combo(self, url: str) -> int:
999
+ """Detect brand + phishing keyword combination."""
1000
+ if not url:
1001
+ return 0
1002
+
1003
+ url_lower = url.lower()
1004
+
1005
+ has_brand = any(brand in url_lower for brand in self.brand_names)
1006
+
1007
+ if has_brand:
1008
+ phishing_combo_keywords = [
1009
+ 'verify', 'security', 'secure', 'account', 'update',
1010
+ 'login', 'confirm', 'suspended', 'locked'
1011
+ ]
1012
+ for keyword in phishing_combo_keywords:
1013
+ if keyword in url_lower:
1014
+ return 1
1015
+
1016
+ return 0
1017
+
1018
+ # ============================================================
1019
+ # PATH & QUERY HELPERS
1020
+ # ============================================================
1021
+
1022
+ def _brand_in_path_only(self, path: str, domain: str) -> int:
1023
+ """Detect brand in path but not in domain."""
1024
+ if not path or not domain:
1025
+ return 0
1026
+
1027
+ path_lower = path.lower()
1028
+ domain_lower = domain.lower()
1029
+
1030
+ for brand in self.brand_names:
1031
+ if brand in path_lower and brand not in domain_lower:
1032
+ return 1
1033
+
1034
+ return 0
1035
+
1036
+ def _suspicious_extension_pattern(self, path: str) -> int:
1037
+ """Detect suspicious extension patterns."""
1038
+ if not path:
1039
+ return 0
1040
+
1041
+ path_lower = path.lower()
1042
+
1043
+ suspicious_patterns = [
1044
+ '.php.exe', '.html.exe', '.pdf.exe', '.doc.exe',
1045
+ '.zip.exe', '.rar.exe', '.html.zip', '.pdf.scr'
1046
+ ]
1047
+
1048
+ for pattern in suspicious_patterns:
1049
+ if pattern in path_lower:
1050
+ return 1
1051
+
1052
+ parts = path_lower.split('.')
1053
+ if len(parts) >= 3:
1054
+ ext1 = parts[-2]
1055
+ ext2 = parts[-1]
1056
+
1057
+ doc_exts = ['pdf', 'doc', 'docx', 'xls', 'xlsx', 'html', 'htm']
1058
+ exec_exts = ['exe', 'scr', 'bat', 'cmd', 'com', 'pif']
1059
+
1060
+ if ext1 in doc_exts and ext2 in exec_exts:
1061
+ return 1
1062
+
1063
+ return 0
1064
+
1065
+ # ============================================================
1066
+ # ENCODING HELPERS
1067
+ # ============================================================
1068
+
1069
+ def _detect_lookalike_chars(self, domain: str) -> int:
1070
+ """Detect lookalike characters."""
1071
+ if not domain:
1072
+ return 0
1073
+
1074
+ domain_lower = domain.lower()
1075
+
1076
+ suspicious_patterns = [
1077
+ ('rn', 'm'),
1078
+ ('vv', 'w'),
1079
+ ('cl', 'd'),
1080
+ ]
1081
+
1082
+ for pattern, _ in suspicious_patterns:
1083
+ if pattern in domain_lower:
1084
+ return 1
1085
+
1086
+ if any(c in domain_lower for c in ['0', '1']):
1087
+ has_letters = any(c.isalpha() for c in domain_lower)
1088
+ if has_letters:
1089
+ for lookalike_char in self.lookalike_chars:
1090
+ if lookalike_char in domain_lower:
1091
+ return 1
1092
+
1093
+ return 0
1094
+
1095
+ def _mixed_script_detection(self, domain: str) -> int:
1096
+ """Detect mixing of scripts."""
1097
+ if not domain:
1098
+ return 0
1099
+
1100
+ scripts = set()
1101
+
1102
+ for char in domain:
1103
+ if char.isalpha():
1104
+ try:
1105
+ script = unicodedata.name(char).split()[0]
1106
+ if script in ['LATIN', 'CYRILLIC', 'GREEK']:
1107
+ scripts.add(script)
1108
+ except:
1109
+ pass
1110
+
1111
+ return len(scripts) if len(scripts) > 1 else 0
1112
+
1113
+ def _homograph_brand_check(self, domain: str) -> int:
1114
+ """Check for homograph attacks on brands."""
1115
+ if not domain:
1116
+ return 0
1117
+
1118
+ domain_lower = domain.lower()
1119
+ top_brands = ['paypal', 'apple', 'amazon', 'google', 'microsoft', 'facebook']
1120
+
1121
+ for brand in top_brands:
1122
+ if len(domain_lower) < len(brand) - 2 or len(domain_lower) > len(brand) + 2:
1123
+ continue
1124
+
1125
+ differences = 0
1126
+ for i in range(min(len(domain_lower), len(brand))):
1127
+ if i < len(domain_lower) and i < len(brand):
1128
+ if domain_lower[i] != brand[i]:
1129
+ if (domain_lower[i] in '01' and brand[i] in 'ol') or \
1130
+ (domain_lower[i] in 'ol' and brand[i] in '01'):
1131
+ differences += 1
1132
+ else:
1133
+ differences += 1
1134
+
1135
+ if differences <= 2 and differences > 0:
1136
+ return 1
1137
+
1138
+ return 0
1139
+
1140
+ def _idn_homograph_score(self, url: str) -> float:
1141
+ """Combined IDN homograph attack score."""
1142
+ score = 0.0
1143
+ count = 0
1144
+
1145
+ if 'xn--' in url.lower():
1146
+ score += 0.5
1147
+ count += 1
1148
+
1149
+ non_ascii = sum(1 for c in url if ord(c) > 127)
1150
+ if non_ascii > 0:
1151
+ score += min(non_ascii / 10, 0.3)
1152
+ count += 1
1153
+
1154
+ return score / max(count, 1) if count > 0 else 0.0
1155
+
1156
+ def _detect_double_encoding(self, url: str) -> int:
1157
+ """Detect double URL encoding."""
1158
+ if not url:
1159
+ return 0
1160
+
1161
+ double_encoded_patterns = ['%25', '%2520', '%252e', '%252f']
1162
+ count = sum(url.lower().count(pattern) for pattern in double_encoded_patterns)
1163
+
1164
+ return count
1165
+
1166
+ def _suspicious_unicode_chars(self, url: str) -> int:
1167
+ """Detect uncommon Unicode categories."""
1168
+ if not url:
1169
+ return 0
1170
+
1171
+ suspicious_count = 0
1172
+
1173
+ for char in url:
1174
+ try:
1175
+ category = unicodedata.category(char)
1176
+ if category in ['Mn', 'Mc', 'Me', 'Zl', 'Zp',
1177
+ 'Cc', 'Cf', 'Sm', 'Sc', 'Sk', 'So']:
1178
+ suspicious_count += 1
1179
+ except:
1180
+ pass
1181
+
1182
+ return suspicious_count
1183
+
1184
+ # ============================================================
1185
+ # FEATURE REFINEMENT HELPERS
1186
+ # ============================================================
1187
+
1188
+ def _categorize_length(self, length: int, thresholds: list) -> int:
1189
+ """Multi-category encoding for length features."""
1190
+ for i, threshold in enumerate(thresholds):
1191
+ if length <= threshold:
1192
+ return i
1193
+ return len(thresholds)
1194
+
1195
+ def _categorize_extension(self, extension: str) -> int:
1196
+ """
1197
+ Categorize file extension:
1198
+ 0 = none
1199
+ 1 = document
1200
+ 2 = web/script
1201
+ 3 = executable
1202
+ 4 = archive
1203
+ 5 = image
1204
+ 6 = other
1205
+ """
1206
+ if not extension:
1207
+ return 0
1208
+
1209
+ ext_lower = extension.lower()
1210
+
1211
+ if ext_lower in ['pdf', 'doc', 'docx', 'xls', 'xlsx', 'ppt', 'pptx', 'txt', 'rtf']:
1212
+ return 1
1213
+
1214
+ if ext_lower in ['html', 'htm', 'php', 'asp', 'aspx', 'jsp', 'js', 'css']:
1215
+ return 2
1216
+
1217
+ if ext_lower in ['exe', 'bat', 'cmd', 'scr', 'msi', 'com', 'pif', 'app', 'apk']:
1218
+ return 3
1219
+
1220
+ if ext_lower in ['zip', 'rar', '7z', 'tar', 'gz', 'bz2']:
1221
+ return 4
1222
+
1223
+ if ext_lower in ['jpg', 'jpeg', 'png', 'gif', 'svg', 'ico', 'webp']:
1224
+ return 5
1225
+
1226
+ return 6
1227
+
1228
+ def _character_diversity(self, text: str) -> float:
1229
+ """Shannon diversity index for characters."""
1230
+ if not text:
1231
+ return 0.0
1232
+
1233
+ unique_chars = len(set(text))
1234
+ return min(unique_chars / max(len(text), 1), 1.0)
1235
+
1236
+ def _calculate_url_complexity(self, url: str) -> float:
1237
+ """Combined URL complexity score."""
1238
+ if not url:
1239
+ return 0.0
1240
+
1241
+ special_chars = sum(1 for c in url if not c.isalnum() and c not in [':', '/', '.'])
1242
+ special_ratio = special_chars / max(len(url), 1)
1243
+
1244
+ length_score = min(len(url) / 200, 1.0)
1245
+
1246
+ encoding_score = min(url.count('%') / 10, 1.0)
1247
+
1248
+ complexity = (special_ratio * 0.4 + length_score * 0.3 + encoding_score * 0.3)
1249
+
1250
+ return min(complexity, 1.0)
1251
+
1252
+ # ============================================================
1253
+ # UTILITY METHODS
1254
+ # ============================================================
1255
+
1256
+ def _get_default_features(self) -> dict:
1257
+ """Default feature values for error cases."""
1258
+ # Get feature names dynamically
1259
+ dummy_url = "http://example.com"
1260
+ try:
1261
+ return self.extract_features(dummy_url)
1262
+ except:
1263
+ return {}
1264
+
1265
+ def get_feature_names(self) -> list:
1266
+ """
1267
+ Get list of all feature names DYNAMICALLY.
1268
+ FIXED: No longer hardcoded!
1269
+ """
1270
+ dummy_url = "http://example.com/test"
1271
+ dummy_features = self.extract_features(dummy_url)
1272
+
1273
+ # Remove 'label' if present
1274
+ feature_names = [k for k in dummy_features.keys() if k != 'label']
1275
+
1276
+ return sorted(feature_names)
1277
+
1278
+ def extract_batch(self, urls: list, show_progress: bool = True) -> pd.DataFrame:
1279
+ """
1280
+ Extract features from multiple URLs.
1281
+
1282
+ Args:
1283
+ urls: List of URL strings
1284
+ show_progress: Show progress messages
1285
+
1286
+ Returns:
1287
+ DataFrame with features
1288
+ """
1289
+ if show_progress:
1290
+ logger.info(f"Extracting URL features from {len(urls):,} URLs...")
1291
+
1292
+ features_list = []
1293
+ progress_interval = 50000
1294
+
1295
+ for i, url in enumerate(urls):
1296
+ if show_progress and i > 0 and i % progress_interval == 0:
1297
+ logger.info(f" Processed {i:,} / {len(urls):,} ({100 * i / len(urls):.1f}%)")
1298
+
1299
+ features = self.extract_features(url)
1300
+ features_list.append(features)
1301
+
1302
+ df = pd.DataFrame(features_list)
1303
+
1304
+ if show_progress:
1305
+ logger.info(f"✓ Extracted {len(df.columns)} features from {len(df):,} URLs")
1306
+
1307
+ return df
1308
+
1309
+
1310
def main():
    """CLI entry point: extract URL-only features from the cleaned dataset.

    Reads data/processed/clean_dataset.csv (relative to this script),
    optionally samples it, extracts features with URLFeatureExtractorV2,
    and writes a CSV to data/features/.

    Command-line flags:
        --sample N   Randomly sample N URLs (seeded for reproducibility).
        --output F   Output filename (defaults to url_features_v2[_sampleN].csv).
    """
    import argparse

    parser = argparse.ArgumentParser(description='URL-Only Feature Extraction v2.1 (IMPROVED)')
    parser.add_argument('--sample', type=int, default=None, help='Sample N URLs')
    parser.add_argument('--output', type=str, default=None, help='Output filename')
    args = parser.parse_args()

    # Banner summarizing what changed in this extractor version.
    logger.info("=" * 70)
    logger.info("URL-Only Feature Extraction v2")
    logger.info("=" * 70)
    logger.info("")
    logger.info("NEW Features:")
    logger.info("  - Fixed free platform detection (exact/suffix match)")
    logger.info("  - Added platform_subdomain_length")
    logger.info("  - Added has_uuid_subdomain")
    logger.info("  - Added longest_part thresholds (gt_20, gt_30, gt_40)")
    logger.info("  - Expanded brand list with regional brands")
    logger.info("  - Improved extension categorization")
    logger.info("")

    # Load dataset (path resolved relative to this script's location)
    script_dir = Path(__file__).parent
    data_file = (script_dir / '../../data/processed/clean_dataset.csv').resolve()

    logger.info(f"Loading: {data_file.name}")
    df = pd.read_csv(data_file)
    logger.info(f"Loaded: {len(df):,} URLs")

    # Optional sampling; fixed seed keeps runs reproducible.
    if args.sample and args.sample < len(df):
        df = df.sample(n=args.sample, random_state=42)
        logger.info(f"Sampled: {len(df):,} URLs")

    # Extract features
    extractor = URLFeatureExtractorV2()
    features_df = extractor.extract_batch(df['url'].tolist())
    # Carry the supervised label through alongside the features.
    features_df['label'] = df['label'].values

    # Save
    output_dir = (script_dir / '../../data/features').resolve()
    output_dir.mkdir(parents=True, exist_ok=True)

    if args.output:
        output_file = output_dir / args.output
    else:
        suffix = f'_sample{args.sample}' if args.sample else ''
        output_file = output_dir / f'url_features_v2{suffix}.csv'

    features_df.to_csv(output_file, index=False)

    logger.info("")
    logger.info("=" * 70)
    logger.info(f"✓ Saved: {output_file}")
    logger.info(f"  Shape: {features_df.shape}")
    logger.info(f"  Features: {len(features_df.columns) - 1}")
    logger.info("=" * 70)

    # Show feature names (derived dynamically by the extractor)
    print("\nAll Features:")
    feature_names = extractor.get_feature_names()
    for i, name in enumerate(feature_names, 1):
        print(f"{i:3d}. {name}")

    # Show stats
    print("\n\nFeature Statistics (first 30):")
    print(features_df.describe().T.head(30))

    # Show new features stats (sanity check that the new signals fire)
    print("\n\nNEW FEATURES Statistics:")
    new_features = [
        'is_free_platform', 'platform_subdomain_length', 'has_uuid_subdomain',
        'longest_part_gt_20', 'longest_part_gt_30', 'longest_part_gt_40'
    ]
    for feat in new_features:
        if feat in features_df.columns:
            if feat == 'platform_subdomain_length':
                print(f"\n{feat}:")
                print(f"  Mean: {features_df[feat].mean():.2f}")
                print(f"  Max: {features_df[feat].max()}")
                print(f"  Non-zero: {(features_df[feat] > 0).sum()} ({(features_df[feat] > 0).sum() / len(features_df) * 100:.1f}%)")
            else:
                print(f"\n{feat}: {features_df[feat].sum()} / {len(features_df)} ({features_df[feat].mean() * 100:.1f}%)")
1394
+
1395
# Script entry point: run the feature-extraction CLI.
if __name__ == "__main__":
    main()
scripts/feature_extraction/url/url_features_v3.py ADDED
@@ -0,0 +1,866 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ URL Feature Extraction v2.2 - OPTIMIZED & NORMALIZED
3
+
4
+ KEY IMPROVEMENTS:
5
+ 1. ✅ URL Normalization - www.github.com & github.com produce identical features
6
+ 2. ✅ Scheme normalization - http/https handled consistently
7
+ 3. ✅ Removed redundant features (www_in_middle, www_subdomain_only, etc.)
8
+ 4. ✅ Separated is_free_hosting from is_free_platform (both are important!)
9
+ 5. ✅ Focus on TOP 20 most important features from model analysis
10
+ 6. ✅ Optimized for production use (< 1ms per URL)
11
+
12
+ TOP FEATURES (from your model analysis):
13
+ - domain_dots (XGB: 27.7%, RF: 4.4%)
14
+ - is_shortened (XGB: 24.7%, RF: 6.2%)
15
+ - is_free_hosting (XGB: 10.6%, RF: 4.0%)
16
+ - is_free_platform (XGB: 9.2%, RF: 4.1%)
17
+ - num_subdomains (XGB: 4.2%, RF: 5.2%)
18
+ - domain_length (XGB: 0.8%, RF: 5.0%)
19
+ - domain_entropy, url_entropy, path features...
20
+ """
21
+ import re
22
+ import math
23
+ import argparse
24
+ import logging
25
+ import pandas as pd
26
+ from urllib.parse import urlparse, parse_qs, unquote
27
+ from collections import Counter
28
+ from pathlib import Path
29
+
30
+ # Setup logging
31
+ logging.basicConfig(
32
+ level=logging.INFO,
33
+ format='%(asctime)s - %(levelname)s - %(message)s',
34
+ datefmt='%H:%M:%S'
35
+ )
36
+ logger = logging.getLogger("url_features_optimized")
37
+
38
+
39
+ class URLFeatureExtractorOptimized:
40
+ """
41
+ Optimized URL feature extractor with normalization.
42
+
43
+ KEY: Normalizes www/http variants for consistent features!
44
+ """
45
+
46
+ def __init__(self):
47
+ """Initialize with keyword lists - OPTIMIZED"""
48
+
49
+ # Phishing keywords (top indicators)
50
+ self.phishing_keywords = [
51
+ 'login', 'signin', 'account', 'update', 'verify', 'secure',
52
+ 'banking', 'confirm', 'password', 'suspended', 'authenticate',
53
+ 'wallet', 'payment', 'billing', 'expire', 'urgent', 'alert'
54
+ ]
55
+
56
+ # Brand names (expanded with regional)
57
+ self.brand_names = [
58
+ 'paypal', 'ebay', 'amazon', 'apple', 'microsoft', 'google',
59
+ 'facebook', 'instagram', 'twitter', 'x', 'netflix', 'linkedin',
60
+ 'dropbox', 'adobe', 'spotify', 'steam', 'zoom', 'docusign',
61
+ 'chase', 'wellsfargo', 'bankofamerica', 'citibank', 'citi',
62
+ 'visa', 'mastercard', 'amex', 'capitalone',
63
+ 'outlook', 'office365', 'gmail', 'yahoo', 'icloud', 'whatsapp',
64
+ 'dhl', 'fedex', 'ups', 'usps', 'alibaba',
65
+ 'coinbase', 'binance', 'blockchain', 'metamask', 'stripe',
66
+ 'tiktok', 'snapchat', 'roblox'
67
+ ]
68
+
69
+ # URL shorteners - EXACT match only
70
+ self.shorteners = {
71
+ 'bit.ly', 'bitly.com', 'goo.gl', 'tinyurl.com', 't.co', 'ow.ly',
72
+ 'is.gd', 'buff.ly', 'adf.ly', 'short.to', 'tiny.cc', 'rb.gy',
73
+ 'cutt.ly', 'qrco.de', 'linktr.ee', 'linkin.bio'
74
+ }
75
+
76
+ # Suspicious TLDs
77
+ self.suspicious_tlds = {
78
+ 'tk', 'ml', 'ga', 'cf', 'gq', 'xyz', 'top', 'club', 'work',
79
+ 'date', 'loan', 'download', 'click', 'link', 'zip', 'mov'
80
+ }
81
+
82
+ # Trusted TLDs
83
+ self.trusted_tlds = {
84
+ 'com', 'org', 'net', 'edu', 'gov', 'mil', 'uk', 'us', 'ca',
85
+ 'de', 'fr', 'jp', 'au', 'nl', 'it', 'es'
86
+ }
87
+
88
+ # FREE HOSTING - separate from platforms!
89
+ self.free_hosting = {
90
+ '000webhostapp.com', 'freehosting.com', 'freehostia.com',
91
+ '5gbfree.com', 'x10hosting.com', 'awardspace.com',
92
+ 'byet.host', 'infinityfree.com', 'webcindario.com'
93
+ }
94
+
95
+ # FREE PLATFORMS - frequently abused for phishing
96
+ self.free_platforms = {
97
+ # Website builders
98
+ 'weebly.com', 'wixsite.com', 'wix.com', 'webflow.io',
99
+ 'carrd.co', 'notion.site', 'webwave.me', 'godaddysites.com',
100
+ 'square.site', 'sites.google.com',
101
+ # Cloud platforms
102
+ 'firebaseapp.com', 'web.app', 'appspot.com',
103
+ 'github.io', 'gitlab.io', 'vercel.app', 'netlify.app',
104
+ 'replit.dev', 'repl.co', 'glitch.me', 'herokuapp.com',
105
+ 'onrender.com', 'railway.app', 'fly.dev', 'pages.dev',
106
+ # Blogging
107
+ 'wordpress.com', 'blogspot.com', 'blogger.com', 'tumblr.com',
108
+ # Forms/docs
109
+ 'jotform.com', 'typeform.com', 'forms.gle',
110
+ # File sharing
111
+ 'dropboxusercontent.com', 'sharepoint.com', '1drv.ms'
112
+ }
113
+
114
+ # Common English words
115
+ self.common_words = {
116
+ 'about', 'account', 'after', 'all', 'also', 'app', 'apple', 'area',
117
+ 'back', 'bank', 'best', 'book', 'business', 'call', 'can', 'card',
118
+ 'center', 'check', 'city', 'cloud', 'come', 'company', 'contact',
119
+ 'data', 'day', 'digital', 'email', 'file', 'find', 'first', 'free',
120
+ 'from', 'game', 'get', 'global', 'good', 'group', 'help', 'home',
121
+ 'info', 'just', 'keep', 'like', 'link', 'login', 'mail', 'main',
122
+ 'make', 'media', 'money', 'more', 'name', 'need', 'network', 'new',
123
+ 'news', 'next', 'office', 'online', 'only', 'open', 'page', 'pay',
124
+ 'people', 'phone', 'place', 'post', 'product', 'read', 'real',
125
+ 'search', 'secure', 'service', 'services', 'shop', 'sign', 'site',
126
+ 'start', 'support', 'system', 'tech', 'time', 'today', 'update',
127
+ 'user', 'verify', 'view', 'web', 'website', 'work', 'world'
128
+ }
129
+
130
+ # Keyboard patterns
131
+ self.keyboard_patterns = [
132
+ 'qwerty', 'asdfgh', 'zxcvbn', '12345', '123456', 'qwertyuiop'
133
+ ]
134
+
135
+ def normalize_url(self, url: str) -> tuple:
136
+ """
137
+ Normalize URL for consistent feature extraction.
138
+
139
+ CRITICAL: www.github.com and github.com should have same features!
140
+
141
+ Returns:
142
+ (normalized_url, original_domain, normalized_domain, is_http)
143
+ """
144
+ # Ensure scheme
145
+ if not url.startswith(('http://', 'https://')):
146
+ url = 'https://' + url
147
+
148
+ parsed = urlparse(url.lower())
149
+ original_domain = parsed.netloc.split(':')[0] # Remove port
150
+
151
+ # Normalize domain (remove www)
152
+ has_www = original_domain.startswith('www.')
153
+ normalized_domain = original_domain[4:] if has_www else original_domain
154
+
155
+ # Track if originally HTTP (security feature)
156
+ is_http = parsed.scheme == 'http'
157
+
158
+ # Rebuild URL with normalized domain and https
159
+ normalized_url = f"https://{normalized_domain}{parsed.path}"
160
+ if parsed.query:
161
+ normalized_url += f"?{parsed.query}"
162
+
163
+ return normalized_url, original_domain, normalized_domain, is_http
164
+
165
+ def extract_features(self, url: str) -> dict:
166
+ """
167
+ Extract features with URL normalization.
168
+
169
+ www.github.com and github.com produce IDENTICAL features!
170
+ """
171
+ try:
172
+ # Normalize URL
173
+ norm_url, orig_domain, norm_domain, is_http = self.normalize_url(url)
174
+
175
+ # Parse normalized URL
176
+ parsed = urlparse(norm_url)
177
+ domain = norm_domain
178
+ path = parsed.path
179
+ query = parsed.query
180
+
181
+ if not domain:
182
+ return self._get_default_features()
183
+
184
+ features = {}
185
+
186
+ # Extract all features using NORMALIZED URL/domain
187
+ features.update(self._length_features(norm_url, domain, path, query))
188
+ features.update(self._char_count_features(norm_url, domain, path))
189
+ features.update(self._ratio_features(norm_url, domain))
190
+ features.update(self._domain_features(domain, parsed))
191
+ features.update(self._path_features(path, domain))
192
+ features.update(self._query_features(query))
193
+ features.update(self._statistical_features(norm_url, domain, path))
194
+ features.update(self._security_features(norm_url, parsed, domain, is_http))
195
+ features.update(self._keyword_features(norm_url, domain, path))
196
+ features.update(self._encoding_features(norm_url, domain))
197
+
198
+ return features
199
+
200
+ except Exception as e:
201
+ logger.error(f"Error extracting features: {url[:50]}... Error: {e}")
202
+ return self._get_default_features()
203
+
204
+ # ============================================================
205
+ # FEATURE EXTRACTION METHODS
206
+ # ============================================================
207
+
208
+ def _length_features(self, url: str, domain: str, path: str, query: str) -> dict:
209
+ """Length-based features."""
210
+ url_len = len(url)
211
+ domain_len = len(domain)
212
+
213
+ return {
214
+ 'url_length': url_len,
215
+ 'domain_length': domain_len,
216
+ 'path_length': len(path),
217
+ 'query_length': len(query),
218
+ # Categorized lengths (0=short, 1=medium, 2=long, 3=very_long)
219
+ 'url_length_category': 0 if url_len < 40 else 1 if url_len < 75 else 2 if url_len < 120 else 3,
220
+ 'domain_length_category': 0 if domain_len < 10 else 1 if domain_len < 20 else 2 if domain_len < 30 else 3,
221
+ }
222
+
223
+ def _char_count_features(self, url: str, domain: str, path: str) -> dict:
224
+ """Character count features."""
225
+ return {
226
+ 'num_dots': url.count('.'),
227
+ 'num_hyphens': url.count('-'),
228
+ 'num_underscores': url.count('_'),
229
+ 'num_slashes': url.count('/'),
230
+ 'num_question_marks': url.count('?'),
231
+ 'num_ampersands': url.count('&'),
232
+ 'num_equals': url.count('='),
233
+ 'num_at': url.count('@'),
234
+ 'num_percent': url.count('%'),
235
+ 'num_digits_url': sum(c.isdigit() for c in url),
236
+ 'num_letters_url': sum(c.isalpha() for c in url),
237
+ # Domain-specific
238
+ 'domain_dots': domain.count('.'),
239
+ 'domain_hyphens': domain.count('-'),
240
+ 'domain_digits': sum(c.isdigit() for c in domain),
241
+ # Path-specific
242
+ 'path_slashes': path.count('/'),
243
+ 'path_dots': path.count('.'),
244
+ 'path_digits': sum(c.isdigit() for c in path),
245
+ }
246
+
247
+ def _ratio_features(self, url: str, domain: str) -> dict:
248
+ """Character ratio features."""
249
+ url_len = max(len(url), 1)
250
+ domain_len = max(len(domain), 1)
251
+
252
+ digit_count = sum(c.isdigit() for c in url)
253
+ letter_count = sum(c.isalpha() for c in url)
254
+ special_count = url_len - digit_count - letter_count
255
+
256
+ return {
257
+ 'digit_ratio_url': digit_count / url_len,
258
+ 'letter_ratio_url': letter_count / url_len,
259
+ 'special_char_ratio': special_count / url_len,
260
+ 'digit_ratio_domain': sum(c.isdigit() for c in domain) / domain_len,
261
+ 'symbol_ratio_domain': sum(c in '-_.' for c in domain) / domain_len,
262
+ }
263
+
264
+ def _domain_features(self, domain: str, parsed) -> dict:
265
+ """Domain structure features."""
266
+ parts = domain.split('.')
267
+ num_parts = len(parts)
268
+
269
+ # Subdomain count (e.g., sub.example.com = 1 subdomain)
270
+ num_subdomains = max(0, num_parts - 2) if num_parts >= 2 else 0
271
+
272
+ # TLD and SLD
273
+ tld = parts[-1] if parts else ''
274
+ sld = parts[-2] if len(parts) >= 2 else ''
275
+
276
+ # Domain part lengths
277
+ part_lens = [len(p) for p in parts]
278
+ longest_part = max(part_lens) if part_lens else 0
279
+ avg_part_len = sum(part_lens) / len(part_lens) if part_lens else 0
280
+
281
+ return {
282
+ 'num_subdomains': num_subdomains,
283
+ 'num_domain_parts': num_parts,
284
+ 'tld_length': len(tld),
285
+ 'sld_length': len(sld),
286
+ 'longest_domain_part': longest_part,
287
+ 'avg_domain_part_len': avg_part_len,
288
+ # Threshold flags
289
+ 'longest_part_gt_20': 1 if longest_part > 20 else 0,
290
+ 'longest_part_gt_30': 1 if longest_part > 30 else 0,
291
+ 'longest_part_gt_40': 1 if longest_part > 40 else 0,
292
+ # TLD checks
293
+ 'has_suspicious_tld': 1 if tld in self.suspicious_tlds else 0,
294
+ 'has_trusted_tld': 1 if tld in self.trusted_tlds else 0,
295
+ # Port
296
+ 'has_port': 1 if ':' in parsed.netloc else 0,
297
+ 'has_non_std_port': 1 if ':' in parsed.netloc and not parsed.netloc.endswith((':80', ':443')) else 0,
298
+ # Randomness
299
+ 'domain_randomness_score': self._calculate_domain_randomness(domain),
300
+ 'sld_consonant_cluster_score': self._consonant_clustering_score(sld),
301
+ 'sld_keyboard_pattern': self._keyboard_pattern_score(sld),
302
+ 'sld_has_dictionary_word': 1 if self._contains_dictionary_word(sld) else 0,
303
+ 'sld_pronounceability_score': self._pronounceability_score(sld),
304
+ 'domain_digit_position_suspicious': 1 if self._suspicious_digit_position(sld) else 0,
305
+ }
306
+
307
+ def _path_features(self, path: str, domain: str) -> dict:
308
+ """Path structure features."""
309
+ if not path or path == '/':
310
+ return {
311
+ 'path_depth': 0,
312
+ 'max_path_segment_len': 0,
313
+ 'avg_path_segment_len': 0.0,
314
+ 'has_extension': 0,
315
+ 'extension_category': 0,
316
+ 'has_suspicious_extension': 0,
317
+ 'has_exe': 0,
318
+ 'has_double_slash': 0,
319
+ 'path_has_brand_not_domain': 0,
320
+ 'path_has_ip_pattern': 0,
321
+ 'suspicious_path_extension_combo': 0,
322
+ }
323
+
324
+ segments = [s for s in path.split('/') if s]
325
+ depth = len(segments)
326
+
327
+ # Extension
328
+ has_ext = '.' in segments[-1] if segments else False
329
+ ext = segments[-1].split('.')[-1].lower() if has_ext else ''
330
+
331
+ # Check for suspicious extensions
332
+ exec_exts = {'exe', 'bat', 'cmd', 'scr', 'vbs', 'ps1'}
333
+ doc_exts = {'pdf', 'doc', 'docx', 'xls', 'xlsx'}
334
+
335
+ # Brand in path but not domain
336
+ path_brands = sum(1 for b in self.brand_names if b in path.lower())
337
+ domain_brands = sum(1 for b in self.brand_names if b in domain.lower())
338
+
339
+ return {
340
+ 'path_depth': depth,
341
+ 'max_path_segment_len': max((len(s) for s in segments), default=0),
342
+ 'avg_path_segment_len': sum(len(s) for s in segments) / depth if depth > 0 else 0,
343
+ 'has_extension': 1 if has_ext else 0,
344
+ 'extension_category': self._categorize_extension(ext),
345
+ 'has_suspicious_extension': 1 if ext in exec_exts else 0,
346
+ 'has_exe': 1 if ext == 'exe' else 0,
347
+ 'has_double_slash': 1 if '//' in path else 0,
348
+ 'path_has_brand_not_domain': 1 if path_brands > 0 and domain_brands == 0 else 0,
349
+ 'path_has_ip_pattern': 1 if re.search(r'\\d{1,3}[._-]\\d{1,3}[._-]\\d{1,3}[._-]\\d{1,3}', path) else 0,
350
+ 'suspicious_path_extension_combo': 1 if (ext in doc_exts and 'download' in path.lower()) else 0,
351
+ }
352
+
353
+ def _query_features(self, query: str) -> dict:
354
+ """Query string features."""
355
+ if not query:
356
+ return {
357
+ 'num_params': 0,
358
+ 'has_query': 0,
359
+ 'query_value_length': 0,
360
+ 'max_param_len': 0,
361
+ 'query_has_url': 0,
362
+ }
363
+
364
+ params = query.split('&')
365
+ param_values = [p.split('=')[1] if '=' in p else '' for p in params]
366
+
367
+ return {
368
+ 'num_params': len(params),
369
+ 'has_query': 1,
370
+ 'query_value_length': sum(len(v) for v in param_values),
371
+ 'max_param_len': max((len(p) for p in params), default=0),
372
+ 'query_has_url': 1 if any(v.startswith(('http', 'www')) for v in param_values) else 0,
373
+ }
374
+
375
+ def _statistical_features(self, url: str, domain: str, path: str) -> dict:
376
+ """Statistical features (entropy, patterns)."""
377
+ return {
378
+ 'url_entropy': self._entropy(url),
379
+ 'domain_entropy': self._entropy(domain),
380
+ 'path_entropy': self._entropy(path) if path else 0,
381
+ 'max_consecutive_digits': self._max_consecutive(url, str.isdigit),
382
+ 'max_consecutive_chars': self._max_consecutive(url, str.isalpha),
383
+ 'max_consecutive_consonants': self._max_consecutive_consonants(domain),
384
+ 'char_repeat_rate': self._repeat_rate(url),
385
+ 'unique_bigram_ratio': self._unique_ngram_ratio(url, 2),
386
+ 'unique_trigram_ratio': self._unique_ngram_ratio(url, 3),
387
+ 'sld_letter_diversity': self._character_diversity(domain.split('.')[-2] if '.' in domain else domain),
388
+ 'domain_has_numbers_letters': 1 if any(c.isdigit() for c in domain) and any(c.isalpha() for c in domain) else 0,
389
+ 'url_complexity_score': self._calculate_url_complexity(url),
390
+ }
391
+
392
+ def _security_features(self, url: str, parsed, domain: str, is_http: bool) -> dict:
393
+ """Security indicator features."""
394
+ return {
395
+ 'has_ip_address': 1 if self._is_ip(domain) else 0,
396
+ 'has_at_symbol': 1 if '@' in url else 0,
397
+ 'has_redirect': 1 if '//' in parsed.path else 0,
398
+
399
+ # CRITICAL FEATURES (from your top 20)
400
+ 'is_shortened': self._is_url_shortener(domain),
401
+ 'is_free_hosting': self._is_free_hosting(domain),
402
+ 'is_free_platform': self._is_free_platform(domain),
403
+ 'platform_subdomain_length': self._get_platform_subdomain_length(domain),
404
+ 'has_uuid_subdomain': self._detect_uuid_pattern(domain),
405
+
406
+ # HTTP vs HTTPS (from ORIGINAL URL)
407
+ 'is_http': 1 if is_http else 0,
408
+ }
409
+
410
    def _keyword_features(self, url: str, domain: str, path: str) -> dict:
        """Keyword and brand detection features.

        Scans the lowercased URL, domain and path for known phishing keywords
        and brand names, and computes several brand-spoofing indicators.
        """
        url_lower = url.lower()
        domain_lower = domain.lower()
        path_lower = path.lower()

        # Phishing keywords
        phishing_count = sum(1 for k in self.phishing_keywords if k in url_lower)

        # Brand mentions
        brands_in_url = [b for b in self.brand_names if b in url_lower]

        return {
            'num_phishing_keywords': phishing_count,
            'phishing_in_domain': 1 if any(k in domain_lower for k in self.phishing_keywords) else 0,
            'phishing_in_path': 1 if any(k in path_lower for k in self.phishing_keywords) else 0,
            'num_brands': len(brands_in_url),
            'brand_in_domain': 1 if any(b in domain_lower for b in self.brand_names) else 0,
            'brand_in_path': 1 if any(b in path_lower for b in self.brand_names) else 0,
            'brand_impersonation': self._brand_impersonation_score(domain, brands_in_url),
            # Specific phishing keywords (substring match on the whole URL)
            'has_login': 1 if 'login' in url_lower else 0,
            'has_account': 1 if 'account' in url_lower else 0,
            'has_verify': 1 if 'verify' in url_lower else 0,
            'has_secure': 1 if 'secure' in url_lower else 0,
            'has_update': 1 if 'update' in url_lower else 0,
            'has_bank': 1 if 'bank' in url_lower else 0,
            'has_password': 1 if 'password' in url_lower or 'passwd' in url_lower else 0,
            'has_suspend': 1 if 'suspend' in url_lower else 0,
            'has_webscr': 1 if 'webscr' in url_lower else 0,
            'has_cmd': 1 if 'cmd=' in url_lower or '/cmd/' in url_lower else 0,
            'has_cgi': 1 if 'cgi-bin' in url_lower or '.cgi' in url_lower else 0,
            # Brand spoofing patterns
            'brand_in_subdomain_not_domain': self._brand_subdomain_spoofing(domain, brands_in_url),
            'multiple_brands_in_url': 1 if len(brands_in_url) > 1 else 0,
            'brand_with_hyphen': self._brand_with_hyphen(domain),
            'suspicious_brand_tld': self._suspicious_brand_tld(domain),
            'brand_keyword_combo': self._brand_phishing_keyword_combo(url),
        }
449
+
450
    def _encoding_features(self, url: str, domain: str) -> dict:
        """Encoding and obfuscation features (percent-encoding, punycode, homographs)."""
        return {
            'has_url_encoding': 1 if '%' in url else 0,
            'encoding_count': url.count('%'),
            # How many characters the URL shrinks by when percent-decoded.
            'encoding_diff': len(url) - len(unquote(url)),
            'has_punycode': 1 if 'xn--' in domain else 0,
            'has_unicode': 1 if any(ord(c) > 127 for c in url) else 0,
            'has_hex_string': 1 if re.search(r'0x[0-9a-f]{6,}', url.lower()) else 0,
            # NOTE: loose heuristic — any run of 20+ base64-alphabet chars matches.
            'has_base64': 1 if re.search(r'[A-Za-z0-9+/]{20,}={0,2}', url) else 0,
            'has_lookalike_chars': self._detect_lookalike_chars(domain),
            'mixed_script_score': self._mixed_script_detection(domain),
            'homograph_brand_risk': self._homograph_brand_check(domain),
            'suspected_idn_homograph': 1 if self._idn_homograph_score(url) > 0.5 else 0,
            'double_encoding': 1 if self._detect_double_encoding(url) else 0,
            'encoding_in_domain': 1 if '%' in domain else 0,
            'suspicious_unicode_category': self._suspicious_unicode_chars(url),
        }
468
+
469
+ # ============================================================
470
+ # HELPER METHODS
471
+ # ============================================================
472
+
473
+ def _entropy(self, text: str) -> float:
474
+ """Calculate Shannon entropy."""
475
+ if not text:
476
+ return 0.0
477
+ counts = Counter(text)
478
+ probs = [count / len(text) for count in counts.values()]
479
+ return -sum(p * math.log2(p) for p in probs)
480
+
481
+ def _max_consecutive(self, text: str, condition) -> int:
482
+ """Max consecutive characters matching condition."""
483
+ if not text:
484
+ return 0
485
+ max_count = current = 0
486
+ for char in text:
487
+ if condition(char):
488
+ current += 1
489
+ max_count = max(max_count, current)
490
+ else:
491
+ current = 0
492
+ return max_count
493
+
494
+ def _max_consecutive_consonants(self, text: str) -> int:
495
+ """Max consecutive consonants."""
496
+ vowels = set('aeiou')
497
+ max_count = current = 0
498
+ for char in text.lower():
499
+ if char.isalpha() and char not in vowels:
500
+ current += 1
501
+ max_count = max(max_count, current)
502
+ else:
503
+ current = 0
504
+ return max_count
505
+
506
+ def _repeat_rate(self, text: str) -> float:
507
+ """Character repetition rate."""
508
+ if len(text) < 2:
509
+ return 0.0
510
+ repeats = sum(1 for i in range(len(text) - 1) if text[i] == text[i + 1])
511
+ return repeats / (len(text) - 1)
512
+
513
+ def _unique_ngram_ratio(self, text: str, n: int) -> float:
514
+ """Unique n-gram ratio."""
515
+ if len(text) < n:
516
+ return 1.0
517
+ ngrams = [text[i:i+n] for i in range(len(text) - n + 1)]
518
+ return len(set(ngrams)) / len(ngrams) if ngrams else 1.0
519
+
520
+ def _is_ip(self, domain: str) -> bool:
521
+ """Check if domain is an IP address."""
522
+ # IPv4
523
+ ipv4_pattern = r'^\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}$'
524
+ if re.match(ipv4_pattern, domain):
525
+ return True
526
+ # IPv6 (simplified)
527
+ if ':' in domain and domain.count(':') >= 2:
528
+ return True
529
+ return False
530
+
531
+ def _is_url_shortener(self, domain: str) -> int:
532
+ """Check if domain is a URL shortener (EXACT match only)."""
533
+ return 1 if domain in self.shorteners else 0
534
+
535
+ def _is_free_hosting(self, domain: str) -> int:
536
+ """Check if domain uses FREE HOSTING service."""
537
+ if domain in self.free_hosting:
538
+ return 1
539
+ for host in self.free_hosting:
540
+ if domain.endswith('.' + host):
541
+ return 1
542
+ return 0
543
+
544
+ def _is_free_platform(self, domain: str) -> int:
545
+ """Check if domain uses FREE PLATFORM (distinct from free hosting!)."""
546
+ if domain in self.free_platforms:
547
+ return 1
548
+ for platform in self.free_platforms:
549
+ if domain.endswith('.' + platform):
550
+ return 1
551
+ return 0
552
+
553
+ def _get_platform_subdomain_length(self, domain: str) -> int:
554
+ """Get subdomain length for free platforms."""
555
+ for platform in self.free_platforms:
556
+ if domain.endswith('.' + platform) or domain == platform:
557
+ if '.' in domain:
558
+ subdomain = domain.split('.')[0]
559
+ return len(subdomain)
560
+ return 0
561
+
562
+ def _detect_uuid_pattern(self, domain: str) -> int:
563
+ """Detect UUID-like subdomains (Replit, Firebase patterns)."""
564
+ parts = domain.split('.')
565
+ if len(parts) >= 2:
566
+ subdomain = parts[0]
567
+ # UUID-like: long alphanumeric with hyphens
568
+ if len(subdomain) >= 20 and '-' in subdomain:
569
+ return 1
570
+ return 0
571
+
572
    def _calculate_domain_randomness(self, domain: str) -> float:
        """Heuristic randomness score for the second-level domain label.

        Combines normalized entropy, consonant clustering and the absence of
        dictionary words into a weighted sum.
        """
        if not domain:
            return 0.0

        # Second-level label ('example' in 'www.example.com'); whole string when no dot.
        sld = domain.split('.')[-2] if domain.count('.') >= 1 else domain

        # Factors: entropy, consonant clusters, no dictionary words
        entropy_score = min(1.0, self._entropy(sld) / 4.5)
        consonant_score = min(1.0, self._consonant_clustering_score(sld) / 3.0)
        # NOTE(review): no_dict_word is set to 0.3 here and then multiplied by
        # the 0.2 weight below, so its maximum contribution is only 0.06 and
        # the three weights do not sum to 1 — confirm whether the 0.3 value or
        # the 0.2 weight is the intended scaling.
        no_dict_word = 0.3 if not self._contains_dictionary_word(sld) else 0.0

        return (entropy_score * 0.5 + consonant_score * 0.3 + no_dict_word * 0.2)
585
+
586
    def _consonant_clustering_score(self, text: str) -> float:
        """Consonant clustering score in [0, 3]: longest consonant run / 2, capped."""
        max_consonants = self._max_consecutive_consonants(text)
        return min(3.0, max_consonants / 2.0)
590
+
591
+ def _keyboard_pattern_score(self, text: str) -> int:
592
+ """Check for keyboard patterns."""
593
+ text_lower = text.lower()
594
+ for pattern in self.keyboard_patterns:
595
+ if pattern in text_lower:
596
+ return 1
597
+ return 0
598
+
599
+ def _contains_dictionary_word(self, text: str) -> int:
600
+ """Check if text contains common English word."""
601
+ text_lower = text.lower()
602
+ for word in self.common_words:
603
+ if len(word) >= 4 and word in text_lower:
604
+ return 1
605
+ return 0
606
+
607
+ def _pronounceability_score(self, text: str) -> float:
608
+ """Pronounceability score based on vowel/consonant alternation."""
609
+ if len(text) < 3:
610
+ return 0.5
611
+
612
+ vowels = set('aeiou')
613
+ alternations = 0
614
+
615
+ for i in range(len(text) - 1):
616
+ c1, c2 = text[i].lower(), text[i + 1].lower()
617
+ if c1.isalpha() and c2.isalpha():
618
+ if (c1 in vowels) != (c2 in vowels):
619
+ alternations += 1
620
+
621
+ return min(1.0, alternations / (len(text) - 1))
622
+
623
+ def _suspicious_digit_position(self, text: str) -> int:
624
+ """Check for suspicious digit positions (digits at start/end)."""
625
+ if not text:
626
+ return 0
627
+ if text[0].isdigit() or text[-1].isdigit():
628
+ return 1
629
+ return 0
630
+
631
+ def _brand_impersonation_score(self, domain: str, brands_in_url: list) -> int:
632
+ """Check if brand appears in suspicious way."""
633
+ if not brands_in_url:
634
+ return 0
635
+
636
+ domain_lower = domain.lower()
637
+ for brand in brands_in_url:
638
+ # Brand in subdomain or with separator
639
+ if f"{brand}-" in domain_lower or f"{brand}." in domain_lower:
640
+ # But not as main domain
641
+ if not domain_lower.endswith(f".{brand}.com"):
642
+ return 1
643
+ return 0
644
+
645
+ def _brand_subdomain_spoofing(self, domain: str, brands: list) -> int:
646
+ """Brand in subdomain but not main domain."""
647
+ if not brands:
648
+ return 0
649
+
650
+ parts = domain.split('.')
651
+ if len(parts) >= 3:
652
+ subdomain = '.'.join(parts[:-2])
653
+ main_domain = '.'.join(parts[-2:])
654
+
655
+ for brand in self.brand_names:
656
+ if brand in subdomain.lower() and brand not in main_domain.lower():
657
+ return 1
658
+ return 0
659
+
660
+ def _brand_with_hyphen(self, domain: str) -> int:
661
+ """Brand name with hyphen (spoofing technique)."""
662
+ domain_lower = domain.lower()
663
+ for brand in self.brand_names:
664
+ if f"{brand}-" in domain_lower or f"-{brand}" in domain_lower:
665
+ return 1
666
+ return 0
667
+
668
+ def _suspicious_brand_tld(self, domain: str) -> int:
669
+ """Brand in domain with suspicious TLD."""
670
+ parts = domain.split('.')
671
+ if len(parts) >= 2:
672
+ sld = parts[-2].lower()
673
+ tld = parts[-1].lower()
674
+
675
+ if sld in self.brand_names and tld in self.suspicious_tlds:
676
+ return 1
677
+ return 0
678
+
679
+ def _brand_phishing_keyword_combo(self, url: str) -> int:
680
+ """Brand + phishing keyword combination."""
681
+ url_lower = url.lower()
682
+ has_brand = any(b in url_lower for b in self.brand_names)
683
+ has_phishing = any(k in url_lower for k in self.phishing_keywords)
684
+ return 1 if has_brand and has_phishing else 0
685
+
686
+ def _categorize_extension(self, ext: str) -> int:
687
+ """Categorize file extension (0=none, 1=doc, 2=media, 3=exec, 4=web, 5=other)."""
688
+ if not ext:
689
+ return 0
690
+
691
+ doc_exts = {'pdf', 'doc', 'docx', 'xls', 'xlsx', 'txt'}
692
+ media_exts = {'jpg', 'jpeg', 'png', 'gif', 'mp4', 'mp3'}
693
+ exec_exts = {'exe', 'bat', 'cmd', 'scr', 'vbs', 'ps1'}
694
+ web_exts = {'html', 'htm', 'php', 'asp', 'jsp'}
695
+
696
+ if ext in doc_exts:
697
+ return 1
698
+ elif ext in media_exts:
699
+ return 2
700
+ elif ext in exec_exts:
701
+ return 3
702
+ elif ext in web_exts:
703
+ return 4
704
+ return 5
705
+
706
+ def _character_diversity(self, text: str) -> float:
707
+ """Character diversity (unique chars / total chars)."""
708
+ if not text:
709
+ return 0.0
710
+ return len(set(text)) / len(text)
711
+
712
+ def _calculate_url_complexity(self, url: str) -> float:
713
+ """Overall URL complexity score."""
714
+ complexity = 0.0
715
+ complexity += min(1.0, len(url) / 100) # Length factor
716
+ complexity += min(1.0, url.count('.') / 5) # Dots
717
+ complexity += min(1.0, url.count('-') / 3) # Hyphens
718
+ complexity += min(1.0, url.count('/') / 5) # Paths
719
+ complexity += min(1.0, self._entropy(url) / 5) # Entropy
720
+ return complexity / 5
721
+
722
+ def _detect_lookalike_chars(self, domain: str) -> int:
723
+ """Detect lookalike character substitutions."""
724
+ suspicious_patterns = ['rn', 'vv', 'cl', '0', '1']
725
+ domain_lower = domain.lower()
726
+ for pattern in suspicious_patterns:
727
+ if pattern in domain_lower:
728
+ return 1
729
+ return 0
730
+
731
+ def _mixed_script_detection(self, domain: str) -> int:
732
+ """Detect mixed scripts (Cyrillic + Latin)."""
733
+ latin = sum(1 for c in domain if ord(c) < 128 and c.isalpha())
734
+ non_latin = sum(1 for c in domain if ord(c) >= 128 and c.isalpha())
735
+
736
+ if latin > 0 and non_latin > 0:
737
+ return min(3, non_latin)
738
+ return 0
739
+
740
+ def _homograph_brand_check(self, domain: str) -> int:
741
+ """Check for homograph attack on brand names."""
742
+ # Simplified: check if domain looks like a brand but has non-ASCII
743
+ has_non_ascii = any(ord(c) > 127 for c in domain)
744
+ looks_like_brand = any(brand in domain.lower() for brand in self.brand_names[:10])
745
+
746
+ return 1 if has_non_ascii and looks_like_brand else 0
747
+
748
+ def _idn_homograph_score(self, url: str) -> float:
749
+ """Calculate IDN homograph attack score."""
750
+ if 'xn--' not in url:
751
+ return 0.0
752
+
753
+ # Punycode detected - potential IDN homograph
754
+ non_ascii_count = sum(1 for c in url if ord(c) > 127)
755
+ return min(1.0, non_ascii_count / 10)
756
+
757
+ def _detect_double_encoding(self, url: str) -> int:
758
+ """Detect double URL encoding (%%25)."""
759
+ if '%%' in url or '%25' in url:
760
+ return 1
761
+ return 0
762
+
763
+ def _suspicious_unicode_chars(self, url: str) -> int:
764
+ """Count suspicious Unicode characters."""
765
+ # Check for right-to-left override, zero-width, etc.
766
+ suspicious = sum(1 for c in url if ord(c) in [0x202E, 0x200B, 0x200C, 0x200D, 0xFEFF])
767
+ return min(5, suspicious)
768
+
769
    def _get_default_features(self) -> dict:
        """Return an all-zero feature dict, used as a fallback for invalid URLs."""
        feature_names = self.get_feature_names()
        return {name: 0 for name in feature_names}
773
+
774
    def get_feature_names(self) -> list:
        """Return all feature names in extraction order.

        Derived by running a full extraction on a dummy URL, so the order is
        always consistent with whatever extract_features() currently emits.
        """
        dummy_features = self.extract_features("https://example.com/test")
        return list(dummy_features.keys())
778
+
779
    def extract_batch(self, urls: list, show_progress: bool = True) -> pd.DataFrame:
        """Extract features for every URL in *urls* (one DataFrame row per URL).

        Args:
            urls: list of URL strings.
            show_progress: when True, wrap the loop in a tqdm progress bar.
        """
        if show_progress:
            # Local import: tqdm is only needed when a progress bar is requested.
            from tqdm import tqdm
            features = [self.extract_features(url) for url in tqdm(urls, desc="Extracting")]
        else:
            features = [self.extract_features(url) for url in urls]

        return pd.DataFrame(features)
788
+
789
+
790
def main():
    """Extract URL features from the balanced dataset and save them as CSV.

    Command-line flags:
        --sample N   process only a seeded random sample of N rows
        --output F   output CSV name inside data/features/ (default url_features_790k.csv)

    Side effects: reads data/processed/url_dataset_balanced.csv, writes the
    feature CSV, and logs a small URL-normalization self-test.

    Fix: removed the unused local `import sys`.
    """
    parser = argparse.ArgumentParser(description='URL Feature Extraction v2.2 OPTIMIZED')
    parser.add_argument('--sample', type=int, default=None, help='Sample N URLs')
    parser.add_argument('--output', type=str, default=None, help='Output filename')
    args = parser.parse_args()

    logger.info("=" * 70)
    logger.info("URL Feature Extraction v3 - OPTIMIZED & NORMALIZED")
    logger.info("=" * 70)
    logger.info("")
    logger.info("KEY IMPROVEMENTS:")
    logger.info(" ✅ URL Normalization (www/non-www consistent)")
    logger.info(" ✅ Scheme normalization (http/https handling)")
    logger.info(" ✅ Separated is_free_hosting from is_free_platform")
    logger.info(" ✅ Removed redundant www features")
    logger.info(" ✅ Optimized for production")
    logger.info("")

    # Load dataset (path is resolved relative to this script's location)
    script_dir = Path(__file__).parent
    data_file = (script_dir / '../../../data/processed/url_dataset_balanced.csv').resolve()

    logger.info(f"Loading: {data_file.name}")
    df = pd.read_csv(data_file)
    logger.info(f"Loaded: {len(df):,} URLs")

    if args.sample and args.sample < len(df):
        # Fixed seed keeps sampled runs reproducible.
        df = df.sample(n=args.sample, random_state=42)
        logger.info(f"Sampled: {len(df):,} URLs")

    # Extract features
    extractor = URLFeatureExtractorOptimized()
    features_df = extractor.extract_batch(df['url'].tolist())
    features_df['label'] = df['label'].values

    # Save
    output_dir = (script_dir / '../../../data/features').resolve()
    output_dir.mkdir(parents=True, exist_ok=True)

    if args.output:
        output_file = output_dir / args.output
    else:
        output_file = output_dir / 'url_features_790k.csv'

    features_df.to_csv(output_file, index=False)

    logger.info("")
    logger.info("=" * 70)
    logger.info(f"✓ Saved: {output_file}")
    logger.info(f" Shape: {features_df.shape}")
    logger.info(f" Features: {len(features_df.columns) - 1}")
    logger.info("=" * 70)

    # Smoke-test normalization: every variant should map to the same domain.
    logger.info("")
    logger.info("NORMALIZATION TEST:")
    test_urls = [
        "https://github.com/user/repo",
        "http://www.github.com/user/repo",
        "www.github.com/user/repo",
        "github.com/user/repo"
    ]

    for url in test_urls:
        norm_url, orig, norm_domain, is_http = extractor.normalize_url(url)
        logger.info(f" {url}")
        logger.info(f" → {norm_domain} (http={is_http})")

    logger.info("")
    logger.info("All normalized URLs should have identical domain: 'github.com'")


if __name__ == "__main__":
    main()
scripts/phishing_analysis/analysis.py ADDED
@@ -0,0 +1,144 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
import pandas as pd
from urllib.parse import urlparse
import re

# Load phishing URLs (expects a CSV with one row per phishing URL)
phish_df = pd.read_csv('phishing_urls.csv')

print("=== PHISHING DATASET INFO ===")
print(f"Total phishing URLs: {len(phish_df)}")
print(f"Columns: {phish_df.columns.tolist()}\n")

# Assume URL column is 'url' (adjust if different)
url_column = 'url'  # Change to your actual column name

print("=== PHISHING TYPE ANALYSIS (from raw URLs) ===\n")
16
+
17
def analyze_phishing_type(url):
    """Classify a phishing URL into a coarse type from its raw text only.

    Returns a dict with keys 'url', 'domain', 'path', 'type' (plus 'brand' or
    'keyword_count' for some types). Types, checked in priority order:
    ip_based, brand_impersonation, generic_phishing, suspicious_tld,
    compromised_site, other.

    Fixes vs. the original:
      * TLD checks now use endswith() instead of substring matching, which
        also fired on hosts such as 'x.tkx.com' or 'evil.community.xyz'.
      * Removed the dead local `brand_found`.
    """
    url = str(url).lower()
    parsed = urlparse(url)
    domain = parsed.netloc
    path = parsed.path

    result = {
        'url': url,
        'domain': domain,
        'path': path,
        'type': 'unknown'
    }

    # 1. IP-based phishing: dotted quad anywhere in the host
    ip_pattern = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
    if re.search(ip_pattern, domain):
        result['type'] = 'ip_based'
        return result

    # 2. Brand impersonation (brand mentioned outside its own registrable domain)
    brands = [
        'paypal', 'amazon', 'apple', 'google', 'microsoft', 'facebook',
        'netflix', 'ebay', 'instagram', 'twitter', 'linkedin', 'bank',
        'chase', 'wellsfargo', 'citi', 'americanexpress', 'visa', 'mastercard',
        'dhl', 'fedex', 'ups', 'usps', 'alibaba', 'walmart', 'adobe',
        'dropbox', 'office365', 'outlook', 'yahoo', 'aol', 'whatsapp'
    ]

    url_full = domain + path
    # Second-level label ('example' in 'www.example.com')
    sld = domain.split('.')[-2] if len(domain.split('.')) >= 2 else domain

    for brand in brands:
        if brand in url_full:
            if brand == sld or sld.startswith(brand):
                # Brand is (or prefixes) the registrable domain: legitimate usage,
                # stop scanning further brands (matches original behavior).
                break
            result['type'] = 'brand_impersonation'
            result['brand'] = brand
            return result

    # 3. Generic phishing: two or more phishing keywords anywhere in the URL
    phishing_keywords = [
        'login', 'signin', 'verify', 'account', 'update', 'secure',
        'confirm', 'suspended', 'locked', 'alert', 'urgent', 'validate',
        'banking', 'credential', 'auth', 'password', 'restore', 'recover'
    ]

    keyword_count = sum(1 for kw in phishing_keywords if kw in url_full)
    if keyword_count >= 2:
        result['type'] = 'generic_phishing'
        result['keyword_count'] = keyword_count
        return result

    # 4. Suspicious TLD (endswith, so only the real TLD matches)
    suspicious_tlds = ['.tk', '.ml', '.ga', '.cf', '.gq', '.xyz', '.top', '.work', '.click']
    if domain.endswith(tuple(suspicious_tlds)):
        result['type'] = 'suspicious_tld'
        return result

    # 5. Compromised site: trusted TLD but a phishing keyword in the path
    trusted_tlds = ['.com', '.org', '.net', '.edu', '.gov']
    if domain.endswith(tuple(trusted_tlds)):
        if any(kw in path for kw in phishing_keywords):
            result['type'] = 'compromised_site'
            return result

    # Default
    result['type'] = 'other'
    return result
94
+
95
# Analyze all URLs: classify each row and collect per-URL results.
print("Analyzing URLs... (this may take a minute)")
results = []
for url in phish_df[url_column]:
    results.append(analyze_phishing_type(url))

results_df = pd.DataFrame(results)

# Count types
type_counts = results_df['type'].value_counts()

print("\n=== PHISHING TYPE DISTRIBUTION ===")
for ptype, count in type_counts.items():
    percentage = (count / len(phish_df)) * 100
    print(f"{ptype:20s}: {count:6d} / {len(phish_df)} ({percentage:5.1f}%)")

# Domain characteristics
print("\n=== DOMAIN CHARACTERISTICS ===")

# Domain lengths
domain_lengths = results_df['domain'].apply(len)
print(f"Avg domain length: {domain_lengths.mean():.1f} chars")
print(f"Median domain length: {domain_lengths.median():.1f} chars")

# Number of dot-separated domain parts (labels)
num_parts = results_df['domain'].apply(lambda d: len(d.split('.')))
print(f"Avg domain parts: {num_parts.mean():.1f}")
print(f"Median domain parts: {num_parts.median():.1f}")

# Number of subdomains
num_subdomains = num_parts - 2  # Subtract SLD and TLD
print(f"Avg subdomains: {num_subdomains.mean():.1f}")

# Path characteristics
print("\n=== PATH CHARACTERISTICS ===")
path_lengths = results_df['path'].apply(len)
print(f"Avg path length: {path_lengths.mean():.1f} chars")
# A path of length > 1 means more than the bare '/'.
print(f"URLs with paths: {(path_lengths > 1).sum()} / {len(phish_df)} ({(path_lengths > 1).sum()/len(phish_df)*100:.1f}%)")

# Show up to three example URLs for each of the five most common types.
print("\n=== EXAMPLES BY TYPE ===")
for ptype in type_counts.index[:5]:
    examples = results_df[results_df['type'] == ptype]['url'].head(3)
    print(f"\n{ptype.upper()}:")
    for i, ex in enumerate(examples, 1):
        print(f" {i}. {ex[:100]}...")

# Save detailed results
results_df.to_csv('phishing_type_analysis.csv', index=False)
print("\n✅ Detailed results saved to: phishing_type_analysis.csv")
scripts/phishing_analysis/phishing_analysis.py ADDED
@@ -0,0 +1,85 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
import pandas as pd
from urllib.parse import urlparse
from collections import Counter
import re

# Load detailed results
# NOTE(review): this reads the raw de-duplicated dataset, not the
# phishing_type_analysis.csv produced by analysis.py, yet the code below
# requires 'type', 'domain' and 'url' columns — confirm the intended input.
results_df = pd.read_csv('data/raw/clean_dataset_no_duplicates.csv')

print("=== DETAILED 'OTHER' CATEGORY ANALYSIS ===\n")

# Filter only 'other' type
other_df = results_df[results_df['type'] == 'other']

# 1. TLD distribution
print("TOP 20 TLDs in 'OTHER' category:")
tlds = other_df['domain'].apply(lambda d: '.' + d.split('.')[-1] if '.' in d else '')
tld_counts = Counter(tlds)
for tld, count in tld_counts.most_common(20):
    pct = (count / len(other_df)) * 100
    print(f" {tld:10s}: {count:5d} ({pct:4.1f}%)")

# 2. Domain length distribution (quartiles)
print("\n=== DOMAIN LENGTH DISTRIBUTION (OTHER) ===")
lengths = other_df['domain'].str.len()
print(f"Min: {lengths.min()}")
print(f"25%: {lengths.quantile(0.25):.0f}")
print(f"50%: {lengths.median():.0f}")
print(f"75%: {lengths.quantile(0.75):.0f}")
print(f"Max: {lengths.max()}")

# 3. Check for non-English brands/keywords
print("\n=== POTENTIAL NON-ENGLISH BRANDS/KEYWORDS ===")

# Common patterns in 'other'
# NOTE(review): all_domains is built here but never used afterwards.
all_domains = ' '.join(other_df['domain'].tolist()).lower()

# Find common substrings
from collections import defaultdict
substring_counts = defaultdict(int)

for domain in other_df['domain']:
    domain = domain.lower()
    # Extract words (split by dots, hyphens, underscores)
    parts = re.split(r'[\.\-_]', domain)
    for part in parts:
        if len(part) >= 5:  # Min 5 chars
            substring_counts[part] += 1

# Top recurring words
print("Top 30 recurring words in domains:")
for word, count in sorted(substring_counts.items(), key=lambda x: x[1], reverse=True)[:30]:
    if count >= 10:  # Appears at least 10 times
        print(f" {word:30s}: {count:4d} occurrences")

# 4. Digit patterns
print("\n=== DIGIT PATTERNS ===")
has_digits = other_df['domain'].str.contains(r'\d')
print(f"Domains with digits: {has_digits.sum()} / {len(other_df)} ({has_digits.sum()/len(other_df)*100:.1f}%)")

# 5. Length of longest dot-separated label per domain
print("\n=== LONGEST DOMAIN PART ===")
longest_parts = other_df['domain'].apply(lambda d: max(d.split('.'), key=len))
longest_part_lens = longest_parts.str.len()
print(f"Avg longest part: {longest_part_lens.mean():.1f} chars")
print(f"Median longest part: {longest_part_lens.median():.1f} chars")

# Show some examples of long domains
print("\nExamples of domains with longest part > 30 chars:")
long_domains = other_df[longest_part_lens > 30]['url'].head(10)
for url in long_domains:
    print(f" {url[:100]}...")

# 6. Hyphen analysis
print("\n=== HYPHEN ANALYSIS ===")
hyphen_counts = other_df['domain'].str.count('-')
print(f"Avg hyphens per domain: {hyphen_counts.mean():.2f}")
print(f"Domains with 3+ hyphens: {(hyphen_counts >= 3).sum()} ({(hyphen_counts >= 3).sum()/len(other_df)*100:.1f}%)")

# 7. Subdomain analysis (labels minus SLD and TLD)
print("\n=== SUBDOMAIN ANALYSIS ===")
num_parts = other_df['domain'].str.count(r'\.') + 1
num_subdomains = num_parts - 2
print(f"Domains with 2+ subdomains: {(num_subdomains >= 2).sum()} ({(num_subdomains >= 2).sum()/len(other_df)*100:.1f}%)")

print("\n✅ Analysis complete!")
scripts/phishing_analysis/phishing_type_analysis.csv ADDED
The diff for this file is too large to render. See raw diff
 
scripts/predict_combined.py ADDED
@@ -0,0 +1,274 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Combined URL+HTML Phishing Detector - Interactive Demo
3
+
4
+ Downloads HTML from URL, extracts both URL and HTML features,
5
+ and predicts using the combined model (XGBoost + Random Forest).
6
+
7
+ Usage:
8
+ python scripts/predict_combined.py
9
+ python scripts/predict_combined.py https://example.com
10
+ """
11
+ import sys
12
+ import logging
13
+ import warnings
14
+ from pathlib import Path
15
+
16
+ import joblib
17
+ import numpy as np
18
+ import pandas as pd
19
+ import requests
20
+ from colorama import init, Fore, Style
21
+
22
+ warnings.filterwarnings('ignore', message='.*Unverified HTTPS.*')
23
+ import urllib3
24
+ urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
25
+
26
+ init(autoreset=True)
27
+
28
+ logging.basicConfig(
29
+ level=logging.INFO,
30
+ format='%(asctime)s - %(levelname)s - %(message)s',
31
+ datefmt='%H:%M:%S',
32
+ )
33
+ logger = logging.getLogger('predict_combined')
34
+
35
+ # Project imports
36
+ PROJECT_ROOT = Path(__file__).resolve().parents[1]
37
+ sys.path.insert(0, str(PROJECT_ROOT))
38
+
39
+ from scripts.feature_extraction.url.url_features_v3 import URLFeatureExtractorOptimized
40
+ from scripts.feature_extraction.html.html_feature_extractor import HTMLFeatureExtractor
41
+ from scripts.feature_extraction.html.feature_engineering import engineer_features
42
+
43
+
44
+ class CombinedPhishingDetector:
45
+ """Detect phishing using combined URL + HTML features."""
46
+
47
+ HEADERS = {
48
+ 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
49
+ 'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
50
+ }
51
+
52
    def __init__(self):
        """Build feature extractors and load trained combined models.

        Looks for the XGBoost and Random Forest combined models under
        <project>/saved_models; each is optional, but at least one must exist.

        Raises:
            FileNotFoundError: when no combined model file is found on disk.
        """
        models_dir = PROJECT_ROOT / 'saved_models'

        # Feature extractors
        self.url_extractor = URLFeatureExtractorOptimized()
        self.html_extractor = HTMLFeatureExtractor()

        # Load combined models (missing files are skipped by _load_model)
        self.models = {}
        self._load_model(models_dir, 'XGBoost Combined',
                         'xgboost_combined.joblib',
                         'xgboost_combined_feature_names.joblib')
        self._load_model(models_dir, 'Random Forest Combined',
                         'random_forest_combined.joblib',
                         'random_forest_combined_feature_names.joblib')

        if not self.models:
            raise FileNotFoundError(
                "No combined models found! Train first:\n"
                " python scripts/merge_url_html_features.py --balance\n"
                " python models/train_combined_models.py")
73
+
74
    def _load_model(self, models_dir: Path, name: str,
                    model_file: str, features_file: str):
        """Load one model (and its feature-name list, if present) into self.models.

        Silently skips the model when its .joblib file does not exist; the
        'features' entry is None when the feature-name file is missing.
        """
        model_path = models_dir / model_file
        feat_path = models_dir / features_file
        if model_path.exists():
            self.models[name] = {
                'model': joblib.load(model_path),
                'features': joblib.load(feat_path) if feat_path.exists() else None,
            }
            n = len(self.models[name]['features']) if self.models[name]['features'] else '?'
            logger.info(f"Loaded {name} ({n} features)")
85
+
86
+ def predict(self, url: str) -> dict:
87
+ """Download HTML, extract features, predict."""
88
+ # 1. Extract URL features
89
+ url_features = self.url_extractor.extract_features(url)
90
+ url_df = pd.DataFrame([url_features])
91
+ url_df = url_df.rename(columns={c: f'url_{c}' for c in url_df.columns})
92
+
93
+ # 2. Download + extract HTML features
94
+ html_features = None
95
+ html_error = None
96
+ try:
97
+ resp = requests.get(url, timeout=10, verify=False, headers=self.HEADERS)
98
+ raw_html_features = self.html_extractor.extract_features(resp.text)
99
+ raw_df = pd.DataFrame([raw_html_features])
100
+ eng_df = engineer_features(raw_df)
101
+ eng_df = eng_df.rename(columns={c: f'html_{c}' for c in eng_df.columns})
102
+ html_features = raw_html_features
103
+ except Exception as e:
104
+ html_error = str(e)
105
+ logger.warning(f"Could not download HTML: {e}")
106
+ # Create zero-filled HTML features
107
+ eng_df = pd.DataFrame()
108
+
109
+ # 3. Combine
110
+ combined_df = pd.concat([url_df, eng_df], axis=1)
111
+
112
+ # 4. Predict with each model
113
+ predictions = []
114
+ for name, data in self.models.items():
115
+ model = data['model']
116
+ expected = data['features']
117
+
118
+ if expected:
119
+ aligned = pd.DataFrame(columns=expected)
120
+ for f in expected:
121
+ aligned[f] = combined_df[f].values if f in combined_df.columns else 0
122
+ X = aligned.values
123
+ else:
124
+ X = combined_df.values
125
+
126
+ proba = model.predict_proba(X)[0]
127
+ pred = 1 if proba[1] > 0.5 else 0
128
+
129
+ predictions.append({
130
+ 'model_name': name,
131
+ 'prediction': 'PHISHING' if pred else 'LEGITIMATE',
132
+ 'confidence': float(proba[pred] * 100),
133
+ 'phishing_probability': float(proba[1] * 100),
134
+ 'legitimate_probability': float(proba[0] * 100),
135
+ })
136
+
137
+ # Consensus
138
+ phishing_votes = sum(1 for p in predictions if p['prediction'] == 'PHISHING')
139
+ total = len(predictions)
140
+ is_phishing = phishing_votes > total / 2
141
+
142
+ if phishing_votes == total:
143
+ consensus = "ALL MODELS AGREE: PHISHING"
144
+ elif phishing_votes == 0:
145
+ consensus = "ALL MODELS AGREE: LEGITIMATE"
146
+ else:
147
+ consensus = f"MIXED: {phishing_votes}/{total} models say PHISHING"
148
+
149
+ return {
150
+ 'url': url,
151
+ 'is_phishing': is_phishing,
152
+ 'consensus': consensus,
153
+ 'predictions': predictions,
154
+ 'url_features': url_features,
155
+ 'html_features': html_features,
156
+ 'html_error': html_error,
157
+ }
158
+
159
    def print_results(self, result: dict):
        """Pretty-print results.

        Expects the dict produced by ``predict()``: 'url', 'predictions',
        'is_phishing', 'consensus', 'url_features', 'html_features',
        'html_error'. Output only -- returns None.
        """
        print("\n" + "=" * 80)
        print(f"{Fore.CYAN}{Style.BRIGHT}COMBINED URL+HTML PHISHING DETECTION{Style.RESET_ALL}")
        print("=" * 80)
        print(f"\n{Fore.YELLOW}URL:{Style.RESET_ALL} {result['url']}")

        # Warn when the page could not be fetched (URL-features-only verdict).
        if result.get('html_error'):
            print(f"{Fore.RED}HTML download failed: {result['html_error']}{Style.RESET_ALL}")
            print(f"{Fore.YELLOW}Using URL features only (HTML features zeroed){Style.RESET_ALL}")

        # Model predictions
        print(f"\n{Fore.CYAN}{Style.BRIGHT}MODEL PREDICTIONS:{Style.RESET_ALL}")
        print("-" * 80)

        for pred in result['predictions']:
            is_safe = pred['prediction'] == 'LEGITIMATE'
            color = Fore.GREEN if is_safe else Fore.RED
            icon = "✓" if is_safe else "⚠"

            print(f"\n{Style.BRIGHT}{pred['model_name']}:{Style.RESET_ALL}")
            print(f" {icon} Prediction: {color}{Style.BRIGHT}{pred['prediction']}{Style.RESET_ALL}")
            print(f" Confidence: {pred['confidence']:.1f}%")
            print(f" Phishing: {Fore.RED}{pred['phishing_probability']:6.2f}%{Style.RESET_ALL}")
            print(f" Legitimate: {Fore.GREEN}{pred['legitimate_probability']:6.2f}%{Style.RESET_ALL}")

        # Consensus
        print(f"\n{Fore.CYAN}{Style.BRIGHT}CONSENSUS:{Style.RESET_ALL}")
        print("-" * 80)

        if result['is_phishing']:
            print(f"🚨 {Fore.RED}{Style.BRIGHT}{result['consensus']}{Style.RESET_ALL}")
        else:
            print(f"✅ {Fore.GREEN}{Style.BRIGHT}{result['consensus']}{Style.RESET_ALL}")

        # Key features (empty dicts when features were unavailable)
        url_feat = result.get('url_features', {})
        html_feat = result.get('html_features', {})

        print(f"\n{Fore.CYAN}{Style.BRIGHT}KEY URL FEATURES:{Style.RESET_ALL}")
        print("-" * 80)
        url_keys = [
            ('Domain Length', url_feat.get('domain_length', 0)),
            ('Num Subdomains', url_feat.get('num_subdomains', 0)),
            ('Domain Dots', url_feat.get('domain_dots', 0)),
            ('Is Shortened', 'Yes' if url_feat.get('is_shortened') else 'No'),
            ('Is Free Platform', 'Yes' if url_feat.get('is_free_platform') else 'No'),
            ('Is HTTP', 'Yes' if url_feat.get('is_http') else 'No'),
            ('Has @ Symbol', 'Yes' if url_feat.get('has_at_symbol') else 'No'),
        ]
        for name, val in url_keys:
            print(f" {name:25s}: {val}")

        # HTML section is skipped entirely when the download failed
        # (html_features is None in that case).
        if html_feat:
            print(f"\n{Fore.CYAN}{Style.BRIGHT}KEY HTML FEATURES:{Style.RESET_ALL}")
            print("-" * 80)
            html_keys = [
                ('Text Length', html_feat.get('text_length', 0)),
                ('Num Links', html_feat.get('num_links', 0)),
                ('Num Forms', html_feat.get('num_forms', 0)),
                ('Password Fields', html_feat.get('num_password_fields', 0)),
                ('Has Login Form', 'Yes' if html_feat.get('has_login_form') else 'No'),
                ('Has Meta Refresh', 'Yes' if html_feat.get('has_meta_refresh') else 'No'),
                ('Has atob()', 'Yes' if html_feat.get('has_atob') else 'No'),
                ('External Links', html_feat.get('num_external_links', 0)),
            ]
            for name, val in html_keys:
                print(f" {name:25s}: {val}")

        print("\n" + "=" * 80 + "\n")
229
+
230
+
231
def main():
    """CLI entry point: one-shot mode when a URL is given on argv, else a REPL."""

    def normalize(raw: str) -> str:
        # Default to https:// when the user omits the scheme.
        return raw if raw.startswith(('http://', 'https://')) else 'https://' + raw

    print(f"\n{Fore.CYAN}{Style.BRIGHT}")
    print("╔══════════════════════════════════════════════════════════════╗")
    print("║ COMBINED URL+HTML PHISHING DETECTOR ║")
    print("╚══════════════════════════════════════════════════════════════╝")
    print(f"{Style.RESET_ALL}")

    print(f"{Fore.YELLOW}Loading models...{Style.RESET_ALL}")
    detector = CombinedPhishingDetector()
    print(f"{Fore.GREEN}✓ Models loaded!{Style.RESET_ALL}\n")

    # One-shot mode: a single URL supplied on the command line.
    if len(sys.argv) > 1:
        detector.print_results(detector.predict(normalize(sys.argv[1])))
        return

    # Interactive REPL until the user asks to leave.
    while True:
        print(f"{Fore.CYAN}{'─' * 80}{Style.RESET_ALL}")
        entry = input(f"{Fore.YELLOW}Enter URL (or 'quit'):{Style.RESET_ALL} ").strip()

        if entry.lower() in ('quit', 'exit', 'q'):
            print(f"\n{Fore.GREEN}Goodbye!{Style.RESET_ALL}\n")
            break

        if not entry:
            continue

        try:
            detector.print_results(detector.predict(normalize(entry)))
        except Exception as e:
            print(f"\n{Fore.RED}Error: {e}{Style.RESET_ALL}\n")


if __name__ == '__main__':
    main()
scripts/predict_html.py ADDED
@@ -0,0 +1,303 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
"""
HTML Phishing Detection - Interactive Prediction
Predicts if HTML file/URL is phishing using trained model
"""
import sys
from pathlib import Path
import joblib
import pandas as pd
from colorama import init, Fore, Style
import requests

# Add project root to path so the package-style imports below resolve when
# this file is executed directly as a script.
sys.path.append(str(Path(__file__).parent.parent))

from scripts.feature_extraction.html.html_feature_extractor import HTMLFeatureExtractor
from scripts.feature_extraction.html.feature_engineering import engineer_features

# Initialize colorama (autoreset restores default colors after each print)
init(autoreset=True)
20
+
21
+
22
class HTMLPhishingPredictor:
    """Predict phishing from HTML content using trained models.

    Loads the Random Forest and/or XGBoost HTML models (whichever exist
    under ``saved_models``) with their saved feature-name lists, and offers
    prediction entry points for a local file, a live URL, or a raw HTML
    string. When both models are present an averaged ensemble verdict is
    also reported.
    """

    def __init__(self):
        """Initialize predictor with all trained models.

        Raises:
            FileNotFoundError: if neither model file exists.
        """
        # NOTE(review): this path is relative to the current working
        # directory, not the script location -- run from the project root.
        models_dir = Path('saved_models')

        # Load Random Forest model and its feature names
        rf_model_path = models_dir / 'random_forest_html.joblib'
        rf_features_path = models_dir / 'random_forest_html_feature_names.joblib'
        if rf_model_path.exists():
            print(f"Loading Random Forest model: {rf_model_path}")
            self.rf_model = joblib.load(rf_model_path)
            self.has_rf = True
            # Load RF feature names
            if rf_features_path.exists():
                self.rf_feature_names = joblib.load(rf_features_path)
                print(f"Loaded {len(self.rf_feature_names)} Random Forest feature names")
            else:
                self.rf_feature_names = None
        else:
            print(f"{Fore.YELLOW}Random Forest model not found{Style.RESET_ALL}")
            self.rf_model = None
            self.has_rf = False
            self.rf_feature_names = None

        # Load XGBoost model and its feature names
        xgb_model_path = models_dir / 'xgboost_html.joblib'
        xgb_features_path = models_dir / 'xgboost_html_feature_names.joblib'
        if xgb_model_path.exists():
            print(f"Loading XGBoost model: {xgb_model_path}")
            self.xgb_model = joblib.load(xgb_model_path)
            self.has_xgb = True
            # Load XGBoost feature names
            if xgb_features_path.exists():
                self.xgb_feature_names = joblib.load(xgb_features_path)
                print(f"Loaded {len(self.xgb_feature_names)} XGBoost feature names")
            else:
                self.xgb_feature_names = None
        else:
            print(f"{Fore.YELLOW}XGBoost model not found{Style.RESET_ALL}")
            self.xgb_model = None
            self.has_xgb = False
            self.xgb_feature_names = None

        # At least one model is required to operate.
        if not self.has_rf and not self.has_xgb:
            raise FileNotFoundError("No trained models found! Train models first.")

        self.extractor = HTMLFeatureExtractor()

    def predict_from_file(self, html_file_path):
        """Predict from HTML file."""
        # Read HTML content; errors='ignore' tolerates unknown/mixed encodings.
        with open(html_file_path, 'r', encoding='utf-8', errors='ignore') as f:
            html_content = f.read()

        return self.predict_from_html(html_content, source=str(html_file_path))

    def predict_from_url(self, url):
        """Download HTML from URL and predict.

        Returns None when the download fails.
        """
        print(f"\nDownloading HTML from: {url}")

        try:
            # Download HTML
            # NOTE(review): verify=False disables TLS certificate checking --
            # deliberate for fetching suspect pages, but insecure in general.
            response = requests.get(url, timeout=10, verify=False)
            html_content = response.text

            return self.predict_from_html(html_content, source=url)

        except Exception as e:
            print(f"{Fore.RED}Error downloading URL: {e}")
            return None

    def predict_from_html(self, html_content, source=""):
        """Predict from HTML content using all available models.

        Returns a dict with per-model 'predictions' plus the raw extracted
        'features'; also prints a colorized report as a side effect.
        """
        # Extract raw features
        features = self.extractor.extract_features(html_content)

        # Apply feature engineering (same as training)
        raw_df = pd.DataFrame([features])
        eng_df = engineer_features(raw_df)

        # Get predictions from all models
        predictions = {}

        if self.has_rf:
            # NOTE(review): truthiness assumes the saved names load as a
            # list; a numpy array here would raise -- confirm the saved type.
            if self.rf_feature_names:
                # Align columns to the training order: engineered columns
                # win, otherwise fall back to the raw feature (or 0).
                feature_values = [eng_df[fn].iloc[0] if fn in eng_df.columns
                                  else features.get(fn, 0)
                                  for fn in self.rf_feature_names]
                X_rf = pd.DataFrame([dict(zip(self.rf_feature_names, feature_values))])
            else:
                X_rf = eng_df

            rf_pred = self.rf_model.predict(X_rf)[0]  # type: ignore
            rf_proba = self.rf_model.predict_proba(X_rf)[0]  # type: ignore
            predictions['Random Forest'] = {
                'prediction': rf_pred,
                'probability': rf_proba
            }

        if self.has_xgb:
            if self.xgb_feature_names:
                feature_values = [eng_df[fn].iloc[0] if fn in eng_df.columns
                                  else features.get(fn, 0)
                                  for fn in self.xgb_feature_names]
                X_xgb = pd.DataFrame([dict(zip(self.xgb_feature_names, feature_values))])
            else:
                X_xgb = eng_df

            xgb_pred = self.xgb_model.predict(X_xgb)[0]  # type: ignore
            xgb_proba = self.xgb_model.predict_proba(X_xgb)[0]  # type: ignore
            predictions['XGBoost'] = {
                'prediction': xgb_pred,
                'probability': xgb_proba
            }

        # Ensemble prediction (average probabilities across both models)
        if len(predictions) > 1:
            avg_proba = sum([p['probability'] for p in predictions.values()]) / len(predictions)
            ensemble_pred = 1 if avg_proba[1] > 0.5 else 0  # type: ignore
            predictions['Ensemble'] = {
                'prediction': ensemble_pred,
                'probability': avg_proba
            }

        # Display results
        self._display_prediction(predictions, features, source)

        return {
            'predictions': predictions,
            'features': features
        }

    def _display_prediction(self, predictions, features, source):
        """Display prediction results with colors. Output only."""
        print("\n" + "="*80)
        if source:
            print(f"Source: {source}")
        print("="*80)

        # Get ensemble or single prediction for final verdict
        if 'Ensemble' in predictions:
            final_pred = predictions['Ensemble']['prediction']
            final_proba = predictions['Ensemble']['probability']
        else:
            # Use the only available model
            model_name = list(predictions.keys())[0]
            final_pred = predictions[model_name]['prediction']
            final_proba = predictions[model_name]['probability']

        # Final Verdict (label convention: 1 = phishing, 0 = legitimate)
        if final_pred == 1:
            print(f"\n{Fore.RED}{'⚠ PHISHING DETECTED ⚠':^80}")
            print(f"{Fore.RED}Confidence: {final_proba[1]*100:.2f}%")
        else:
            print(f"\n{Fore.GREEN}{'✓ LEGITIMATE WEBSITE ✓':^80}")
            print(f"{Fore.GREEN}Confidence: {final_proba[0]*100:.2f}%")

        # Model breakdown
        print("\n" + "-"*80)
        print("Model Predictions:")
        print("-"*80)

        for model_name, result in predictions.items():
            pred = result['prediction']
            proba = result['probability']

            pred_text = 'PHISHING' if pred == 1 else 'LEGITIMATE'
            color = Fore.RED if pred == 1 else Fore.GREEN
            icon = "⚠" if pred == 1 else "✓"

            print(f" {icon} {model_name:15s}: {color}{pred_text:12s}{Style.RESET_ALL} "
                  f"(Legit: {proba[0]*100:5.1f}%, Phish: {proba[1]*100:5.1f}%)")

        # Show key features
        print("\n" + "-"*80)
        print("Key HTML Features:")
        print("-"*80)

        important_features = [
            ('num_forms', 'Number of forms'),
            ('num_password_fields', 'Password fields'),
            ('num_external_links', 'External links'),
            ('num_scripts', 'Scripts'),
            ('num_urgency_keywords', 'Urgency keywords'),
            ('num_brand_mentions', 'Brand mentions'),
            ('has_meta_refresh', 'Meta refresh redirect'),
            ('num_iframes', 'Iframes'),
        ]

        for feat, desc in important_features:
            if feat in features:
                value = features[feat]
                print(f" {desc:25s}: {value}")

        print("="*80)
219
+
220
+
221
def interactive_mode():
    """REPL for analysing HTML files or live URLs with the trained models."""
    print("\n" + "="*80)
    print(f"{Fore.CYAN}{'HTML PHISHING DETECTOR - INTERACTIVE MODE':^80}")
    print("="*80)

    # Abort early when no model can be loaded.
    try:
        predictor = HTMLPhishingPredictor()
    except Exception as e:
        print(f"{Fore.RED}Error loading model: {e}")
        print("\nTrain a model first using:")
        print(" python models/html_enhanced/random_forest_html.py")
        return

    print("\nCommands:")
    print(" file <path> - Analyze HTML file")
    print(" url <url> - Download and analyze URL")
    print(" quit - Exit")
    print("-"*80)

    while True:
        try:
            line = input(f"\n{Fore.CYAN}Enter command: {Style.RESET_ALL}").strip()
        except KeyboardInterrupt:
            print("\n\nGoodbye!")
            break

        if not line:
            continue
        if line.lower() in ['quit', 'exit', 'q']:
            print("\nGoodbye!")
            break

        try:
            # Split into verb + argument; the argument may contain spaces.
            command, _, argument = line.partition(' ')
            command = command.lower()
            argument = argument.strip()

            if command == 'file' and argument:
                if Path(argument).exists():
                    predictor.predict_from_file(argument)
                else:
                    print(f"{Fore.RED}File not found: {argument}")
            elif command == 'url' and argument:
                predictor.predict_from_url(argument)
            else:
                print(f"{Fore.YELLOW}Invalid command. Use: file <path> or url <url>")
        except KeyboardInterrupt:
            print("\n\nGoodbye!")
            break
        except Exception as e:
            print(f"{Fore.RED}Error: {e}")
275
+
276
+
277
def main():
    """Dispatch: argv -> one-shot analysis, no args -> interactive REPL."""
    if len(sys.argv) <= 1:
        # No argument given: drop into the REPL.
        interactive_mode()
        return

    # Command line mode: single file path or URL.
    predictor = HTMLPhishingPredictor()
    arg = sys.argv[1]

    if Path(arg).exists():
        # Existing path on disk wins over URL interpretation.
        predictor.predict_from_file(arg)
    elif arg.startswith('http'):
        predictor.predict_from_url(arg)
    else:
        print(f"Invalid input: {arg}")
        print("\nUsage:")
        print(" python scripts/predict_html.py <html_file>")
        print(" python scripts/predict_html.py <url>")
        print(" python scripts/predict_html.py (interactive mode)")


if __name__ == '__main__':
    main()
scripts/predict_url.py ADDED
@@ -0,0 +1,367 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
"""
URL Phishing Detector - Interactive Demo

Test any URL with all trained models and see predictions with confidence scores.
"""

import sys
import pandas as pd
import joblib
from pathlib import Path
from colorama import init, Fore, Style

# Initialize colorama for colored output
init(autoreset=True)
import logging

# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%H:%M:%S'
)
logger = logging.getLogger("url_predictor")

# Make the project root importable when this file runs as a script.
sys.path.append(str(Path(__file__).parent.parent))
from scripts.feature_extraction.url.url_features_v2 import URLFeatureExtractorV2
27
+
28
+
29
class URLPhishingDetector:
    """Detect phishing URLs using trained models.

    Loads the Logistic Regression / Random Forest / XGBoost URL models found
    under ``saved_models``, extracts lexical features with
    URLFeatureExtractorV2, and reports per-model predictions with a
    trusted-domain whitelist override.
    """

    def __init__(self):
        """Initialize detector with all models."""
        self.script_dir = Path(__file__).parent.parent
        self.models_dir = (self.script_dir / 'saved_models').resolve()

        # Whitelist of trusted domains. A URL whose host equals one of these
        # (or is a true subdomain of one) is always reported LEGITIMATE.
        self.trusted_domains = {
            # Tech giants
            'youtube.com', 'facebook.com', 'twitter.com', 'x.com',
            'linkedin.com', 'microsoft.com', 'apple.com', 'amazon.com',
            # Development
            'github.com', 'gitlab.com', 'stackoverflow.com', 'npmjs.com',
            # AI Services
            'claude.ai', 'anthropic.com', 'openai.com', 'chatgpt.com',
            # Education & Info
            'wikipedia.org', 'reddit.com', 'quora.com', 'medium.com',
            # Cloud & Services
            'aws.amazon.com', 'azure.microsoft.com', 'cloud.google.com',
            'vercel.com', 'netlify.com', 'heroku.com',
            # Communication
            'slack.com', 'discord.com', 'zoom.us', 'teams.microsoft.com',
            # Finance (major)
            'paypal.com', 'stripe.com', 'visa.com', 'mastercard.com',
            # E-commerce
            'ebay.com', 'shopify.com', 'etsy.com', 'walmart.com',
        }

        # Custom thresholds for each model (reduce false positives)
        self.thresholds = {
            'Logistic Regression': 0.5,  # Standard threshold
            'Random Forest': 0.5,        # Standard threshold
            'XGBoost': 0.5               # Standard threshold
        }

        # Load feature extractor
        self.extractor = URLFeatureExtractorV2()

        # Load scaler (only needed for Logistic Regression)
        scaler_path = self.models_dir / 'scaler.joblib'
        if scaler_path.exists():
            self.scaler = joblib.load(scaler_path)
            logger.info("✓ Loaded scaler")
        else:
            self.scaler = None
            logger.warning("✗ Scaler not found (only needed for Logistic Regression)")

        # Load all models
        self.models = {}
        self.feature_names = {}
        self._load_models()

    def _load_models(self):
        """Load all trained models and remember each one's feature order."""
        model_files = {
            'Logistic Regression': 'logistic_regression.joblib',
            'Random Forest': 'random_forest.joblib',
            'XGBoost': 'xgboost.joblib'
        }

        for name, filename in model_files.items():
            model_path = self.models_dir / filename
            if not model_path.exists():
                # BUGFIX: previously logged a constant placeholder instead of
                # the path of the missing model file.
                logger.warning(f"✗ Model not found: {model_path}")
                continue

            model = joblib.load(model_path)
            self.models[name] = model

            # Store expected feature names from the model itself, or fall
            # back to the scaler's (models fit on bare numpy arrays, such as
            # Logistic Regression, carry no feature_names_in_).
            if hasattr(model, 'feature_names_in_'):
                self.feature_names[name] = list(model.feature_names_in_)
                logger.info(f"✓ Loaded {name} ({len(self.feature_names[name])} features)")
            elif self.scaler is not None and hasattr(self.scaler, 'feature_names_in_'):
                self.feature_names[name] = list(self.scaler.feature_names_in_)
                logger.info(f"✓ Loaded {name} (using scaler features: {len(self.feature_names[name])} features)")
            else:
                logger.info(f"✓ Loaded {name}")

    def _is_whitelisted(self, url: str) -> bool:
        """Return True when the URL's host is a trusted domain or a subdomain of one."""
        from urllib.parse import urlparse
        host = urlparse(url).netloc.lower()
        # NOTE(review): an explicit port (host:443) is kept and defeats the
        # match -- confirm whether ports should be stripped here.
        if host.startswith('www.'):
            # BUGFIX: replace('www.', '') removed the substring anywhere in
            # the host; only the leading label should be dropped.
            host = host[4:]
        # BUGFIX: a bare endswith() let "evil-paypal.com" match "paypal.com",
        # bypassing the models entirely. Require an exact match or a real
        # dot-separated subdomain boundary.
        return any(
            host == trusted or host.endswith('.' + trusted)
            for trusted in self.trusted_domains
        )

    def _align_features(self, features_df, expected_features):
        """Return a numpy row with columns ordered as *expected_features*.

        Features the extractor did not produce are filled with 0. Built from
        a dict so the frame always has exactly one row (assigning a scalar
        column into an empty DataFrame would leave it with zero rows).
        """
        aligned = pd.DataFrame({
            feat: (features_df[feat].values if feat in features_df.columns else [0])
            for feat in expected_features
        })
        # numpy array avoids sklearn's feature-name validation
        return aligned.values

    def predict_url(self, url: str) -> tuple:
        """
        Predict if URL is phishing or legitimate.

        Args:
            url: URL string to analyze

        Returns:
            Tuple of (results, features):
            - results: dict mapping model name -> prediction details
            - features: dict of extracted URL features
        """
        # Whitelisted hosts bypass the models entirely.
        is_whitelisted = self._is_whitelisted(url)

        # Extract features
        features_dict = self.extractor.extract_features(url)

        # Convert to DataFrame (excluding the label column if present)
        features_df = pd.DataFrame([features_dict])
        if 'label' in features_df.columns:
            features_df = features_df.drop('label', axis=1)

        # Get predictions from all models
        results = {}
        for model_name, model in self.models.items():
            # Override for whitelisted domains
            if is_whitelisted:
                results[model_name] = {
                    'prediction': 'LEGITIMATE',
                    'prediction_code': 0,
                    'confidence': 99.99,
                    'phishing_probability': 0.01,
                    'legitimate_probability': 99.99,
                    'whitelisted': True
                }
                continue

            # Align features with the order the model was trained on;
            # fall back to any other model's order, then to raw columns.
            if model_name in self.feature_names:
                features_to_predict = self._align_features(
                    features_df, self.feature_names[model_name])
            elif self.feature_names:
                features_to_predict = self._align_features(
                    features_df, next(iter(self.feature_names.values())))
            else:
                features_to_predict = features_df.values

            # Scale features only for Logistic Regression (trained on
            # scaled inputs).
            if model_name == 'Logistic Regression' and self.scaler is not None:
                features_to_use = self.scaler.transform(features_to_predict)
            else:
                features_to_use = features_to_predict

            # Get probability/confidence (features are already numpy arrays)
            if hasattr(model, 'predict_proba'):
                probabilities = model.predict_proba(features_to_use)[0]
                phishing_prob = probabilities[1] * 100
                legitimate_prob = probabilities[0] * 100

                # Apply custom threshold
                threshold = self.thresholds.get(model_name, 0.5)
                prediction = 1 if probabilities[1] > threshold else 0
                confidence = probabilities[prediction] * 100
            else:
                # For models without predict_proba (fallback)
                prediction = model.predict(features_to_use)[0]
                confidence = 100.0
                phishing_prob = 100.0 if prediction == 1 else 0.0
                legitimate_prob = 0.0 if prediction == 1 else 100.0

            results[model_name] = {
                'prediction': 'PHISHING' if prediction == 1 else 'LEGITIMATE',
                'prediction_code': int(prediction),
                'confidence': confidence,
                'phishing_probability': phishing_prob,
                'legitimate_probability': legitimate_prob,
                'whitelisted': False,
                'threshold': self.thresholds.get(model_name, 0.5)
            }

        return results, features_dict

    def print_results(self, url: str, results: dict, features: dict):
        """Print formatted results. Output only -- returns None."""
        print("\n" + "=" * 80)
        print(f"{Fore.CYAN}{Style.BRIGHT}URL PHISHING DETECTION RESULTS{Style.RESET_ALL}")
        print("=" * 80)

        # Print URL
        print(f"\n{Fore.YELLOW}URL:{Style.RESET_ALL} {url}")

        # Print model predictions
        print(f"\n{Fore.CYAN}{Style.BRIGHT}MODEL PREDICTIONS:{Style.RESET_ALL}")
        print("-" * 80)

        for model_name, result in results.items():
            prediction = result['prediction']
            confidence = result['confidence']
            phishing_prob = result['phishing_probability']
            legitimate_prob = result['legitimate_probability']
            threshold = result.get('threshold', 0.5)

            # Color based on prediction
            if prediction == 'PHISHING':
                color = Fore.RED
                icon = "⚠️"
            else:
                color = Fore.GREEN
                icon = "✓"

            print(f"\n{Style.BRIGHT}{model_name}:{Style.RESET_ALL}")
            print(f" {icon} Prediction: {color}{Style.BRIGHT}{prediction}{Style.RESET_ALL}")

            # Show if whitelisted
            if result.get('whitelisted', False):
                print(f" {Fore.CYAN}ℹ️ Trusted domain (whitelisted){Style.RESET_ALL}")
            else:
                print(f" Decision Threshold: {threshold*100:.0f}%")

            print(f" Confidence: {confidence:.2f}%")
            print(f" Probabilities:")
            print(f" • Phishing: {Fore.RED}{phishing_prob:6.2f}%{Style.RESET_ALL}")
            print(f" • Legitimate: {Fore.GREEN}{legitimate_prob:6.2f}%{Style.RESET_ALL}")

        # Consensus
        print(f"\n{Fore.CYAN}{Style.BRIGHT}CONSENSUS:{Style.RESET_ALL}")
        print("-" * 80)

        phishing_votes = sum(1 for r in results.values() if r['prediction'] == 'PHISHING')
        total_models = len(results)

        if phishing_votes == total_models:
            consensus_color = Fore.RED
            consensus_icon = "🚨"
            consensus_text = "ALL MODELS AGREE: PHISHING"
        elif phishing_votes == 0:
            consensus_color = Fore.GREEN
            consensus_icon = "✅"
            consensus_text = "ALL MODELS AGREE: LEGITIMATE"
        else:
            consensus_color = Fore.YELLOW
            consensus_icon = "⚠️"
            consensus_text = f"MIXED RESULTS: {phishing_votes}/{total_models} models say PHISHING"

        print(f"{consensus_icon} {consensus_color}{Style.BRIGHT}{consensus_text}{Style.RESET_ALL}")

        # Key features (based on top features from models)
        print(f"\n{Fore.CYAN}{Style.BRIGHT}TOP FEATURES (Model Importance):{Style.RESET_ALL}")
        print("-" * 80)

        # Top features from Random Forest and XGBoost analysis; the third
        # tuple element marks binary risk flags for red/green coloring.
        top_features = [
            ('Num Domain Parts', features.get('num_domain_parts', 0), None),
            ('Domain Dots', features.get('domain_dots', 0), None),
            ('URL Shortener', '✓ Yes' if features.get('is_shortened', 0) == 1 else '✗ No',
             features.get('is_shortened', 0)),
            ('Num Subdomains', features.get('num_subdomains', 0), None),
            ('Domain Hyphens', features.get('domain_hyphens', 0), None),
            ('Free Platform', '✓ Yes' if features.get('is_free_platform', 0) == 1 else '✗ No',
             features.get('is_free_platform', 0)),
            ('Free Hosting', '✓ Yes' if features.get('is_free_hosting', 0) == 1 else '✗ No',
             features.get('is_free_hosting', 0)),
            ('Platform Subdomain Len', features.get('platform_subdomain_length', 0), None),
            ('Avg Domain Part Len', f"{features.get('avg_domain_part_len', 0):.2f}", None),
            ('Domain Length Category', features.get('domain_length_category', 0), None),
            ('Path Digits', features.get('path_digits', 0), None),
            ('Is HTTP', '✓ Yes' if features.get('is_http', 0) == 1 else '✗ No',
             features.get('is_http', 0)),
            ('Multiple Brands in URL', '✓ Yes' if features.get('multiple_brands_in_url', 0) == 1 else '✗ No',
             features.get('multiple_brands_in_url', 0)),
            ('Brand in Path', '✓ Yes' if features.get('brand_in_path', 0) == 1 else '✗ No',
             features.get('brand_in_path', 0)),
            ('Path Slashes', features.get('path_slashes', 0), None),
            ('Encoding Diff', f"{features.get('encoding_diff', 0):.3f}", None),
            ('Symbol Ratio (Domain)', f"{features.get('symbol_ratio_domain', 0):.3f}", None),
            ('Domain Length', features.get('domain_length', 0), None),
            ('Has @ Symbol', '✓ Yes' if features.get('has_at_symbol', 0) == 1 else '✗ No',
             features.get('has_at_symbol', 0)),
            ('TLD Length', features.get('tld_length', 0), None),
        ]

        for feature_name, value, risk_flag in top_features:
            # Color code risky features
            if risk_flag is not None:
                if risk_flag == 1:  # Risky feature is present
                    value_display = f"{Fore.RED}{value}{Style.RESET_ALL}"
                else:
                    value_display = f"{Fore.GREEN}{value}{Style.RESET_ALL}"
            else:
                value_display = str(value)

            print(f" • {feature_name:25s}: {value_display}")

        print("\n" + "=" * 80 + "\n")
324
+
325
+
326
def main():
    """Main interactive function.

    Loads all models once, then loops reading URLs from stdin until the
    user types quit/exit/q. Analysis errors are reported and the loop
    continues.
    """
    print(f"\n{Fore.CYAN}{Style.BRIGHT}╔══════════════════════════════════════════════════════════════╗")
    print(f"║ URL PHISHING DETECTOR - INTERACTIVE DEMO ║")
    print(f"╚══════════════════════════════════════════════════════════════╝{Style.RESET_ALL}\n")

    # Initialize detector
    print(f"{Fore.YELLOW}Loading models...{Style.RESET_ALL}")
    detector = URLPhishingDetector()
    print(f"{Fore.GREEN}✓ All models loaded successfully!{Style.RESET_ALL}\n")

    # Interactive loop
    while True:
        print(f"{Fore.CYAN}{'─' * 80}{Style.RESET_ALL}")
        url = input(f"{Fore.YELLOW}Enter URL to test (or 'quit' to exit):{Style.RESET_ALL} ").strip()

        if url.lower() in ['quit', 'exit', 'q']:
            print(f"\n{Fore.GREEN}Thank you for using URL Phishing Detector!{Style.RESET_ALL}\n")
            break

        if not url:
            print(f"{Fore.RED}Please enter a valid URL{Style.RESET_ALL}\n")
            continue

        # Add http:// if no scheme
        # NOTE(review): defaults to http:// here while the combined script
        # defaults to https:// -- confirm which behavior is intended.
        if not url.startswith(('http://', 'https://')):
            url = 'http://' + url

        try:
            # Get predictions
            results, features = detector.predict_url(url)

            # Print results
            detector.print_results(url, results, features)

        except Exception as e:
            print(f"\n{Fore.RED}Error analyzing URL: {str(e)}{Style.RESET_ALL}\n")
            logger.error(f"Error: {str(e)}")


if __name__ == "__main__":
    main()
scripts/predict_url_cnn.py ADDED
@@ -0,0 +1,332 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ CNN Phishing Detector - Interactive Demo
3
+
4
+ Test any URL with both character-level CNN models:
5
+ 1. CNN URL — analyzes the URL string itself
6
+ 2. CNN HTML — fetches the page and analyzes its HTML source
7
+
8
+ Usage:
9
+ python scripts/predict_url_cnn.py
10
+ """
11
+
12
+ import sys
13
+ import json
14
+ import logging
15
+ import warnings
16
+ from pathlib import Path
17
+
18
+ import os
19
+ os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
20
+ os.environ['TF_ENABLE_ONEDNN_OPTS'] = '0'
21
+
22
+ import numpy as np
23
+ from colorama import init, Fore, Style
24
+
25
+ init(autoreset=True)
26
+ warnings.filterwarnings('ignore')
27
+
28
+ logging.basicConfig(
29
+ level=logging.INFO,
30
+ format='%(asctime)s - %(levelname)s - %(message)s',
31
+ datefmt='%H:%M:%S',
32
+ )
33
+ logger = logging.getLogger('cnn_predictor')
34
+
35
+ # ---------------------------------------------------------------------------
36
+ # Project paths
37
+ # ---------------------------------------------------------------------------
38
+ PROJECT_ROOT = Path(__file__).resolve().parents[1] # src/
39
+ MODELS_DIR = PROJECT_ROOT / 'saved_models'
40
+
41
+ # URL CNN
42
+ URL_MODEL_PATH = MODELS_DIR / 'cnn_url_model.keras'
43
+ URL_VOCAB_PATH = MODELS_DIR / 'cnn_url_vocab.json'
44
+
45
+ # HTML CNN
46
+ HTML_MODEL_PATH = MODELS_DIR / 'cnn_html_model.keras'
47
+ HTML_VOCAB_PATH = MODELS_DIR / 'cnn_html_vocab.json'
48
+
49
+
50
+ class CNNPhishingDetector:
51
+ """Detect phishing URLs using both character-level CNN models."""
52
+
53
+ def __init__(self):
54
+ self.url_model = None
55
+ self.html_model = None
56
+ self.url_vocab = None
57
+ self.html_vocab = None
58
+
59
+ self._load_url_model()
60
+ self._load_html_model()
61
+
62
+ # ── Loading ────────────────────────────────────────────────────
63
+
64
+ def _load_url_model(self):
65
+ """Load URL CNN model and vocabulary."""
66
+ if not URL_VOCAB_PATH.exists() or not URL_MODEL_PATH.exists():
67
+ logger.warning("URL CNN model not found — skipping")
68
+ return
69
+
70
+ with open(URL_VOCAB_PATH, 'r') as f:
71
+ self.url_vocab = json.load(f)
72
+
73
+ import tensorflow as tf
74
+ self.url_model = tf.keras.models.load_model(str(URL_MODEL_PATH))
75
+ logger.info(f"✓ URL CNN loaded (vocab={self.url_vocab['vocab_size']}, "
76
+ f"max_len={self.url_vocab['max_len']})")
77
+
78
+ def _load_html_model(self):
79
+ """Load HTML CNN model and vocabulary."""
80
+ if not HTML_VOCAB_PATH.exists() or not HTML_MODEL_PATH.exists():
81
+ logger.warning("HTML CNN model not found — skipping")
82
+ return
83
+
84
+ with open(HTML_VOCAB_PATH, 'r') as f:
85
+ self.html_vocab = json.load(f)
86
+
87
+ import tensorflow as tf
88
+ self.html_model = tf.keras.models.load_model(str(HTML_MODEL_PATH))
89
+ logger.info(f"✓ HTML CNN loaded (vocab={self.html_vocab['vocab_size']}, "
90
+ f"max_len={self.html_vocab['max_len']})")
91
+
92
+ # ── Encoding ───────────────────────────────────────────────────
93
+
94
+ def _encode_url(self, url: str) -> np.ndarray:
95
+ """Encode a URL string for the URL CNN."""
96
+ char_to_idx = self.url_vocab['char_to_idx']
97
+ max_len = self.url_vocab['max_len']
98
+ encoded = [char_to_idx.get(c, 1) for c in url[:max_len]]
99
+ encoded += [0] * (max_len - len(encoded))
100
+ return np.array([encoded], dtype=np.int32)
101
+
102
+ def _encode_html(self, html: str) -> np.ndarray:
103
+ """Encode an HTML string for the HTML CNN."""
104
+ char_to_idx = self.html_vocab['char_to_idx']
105
+ max_len = self.html_vocab['max_len']
106
+ encoded = [char_to_idx.get(c, 1) for c in html[:max_len]]
107
+ encoded += [0] * (max_len - len(encoded))
108
+ return np.array([encoded], dtype=np.int32)
109
+
110
+ # ── HTML fetching ──────────────────────────────────────────────
111
+
112
+ @staticmethod
113
+ def _fetch_html(url: str, timeout: int = 10) -> str | None:
114
+ """Fetch HTML content from a URL. Returns None on failure."""
115
+ try:
116
+ import requests
117
+ headers = {
118
+ 'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
119
+ 'AppleWebKit/537.36 (KHTML, like Gecko) '
120
+ 'Chrome/120.0.0.0 Safari/537.36'),
121
+ }
122
+ resp = requests.get(url, headers=headers, timeout=timeout,
123
+ verify=False, allow_redirects=True)
124
+ resp.raise_for_status()
125
+ return resp.text
126
+ except Exception as e:
127
+ logger.warning(f" Could not fetch HTML: {e}")
128
+ return None
129
+
130
+ # ── Prediction ─────────────────────────────���───────────────────
131
+
132
+ def predict_url(self, url: str, threshold: float = 0.5) -> dict | None:
133
+ """Predict using the URL CNN model."""
134
+ if self.url_model is None:
135
+ return None
136
+
137
+ X = self._encode_url(url)
138
+ phishing_prob = float(self.url_model.predict(X, verbose=0)[0][0])
139
+ legitimate_prob = 1.0 - phishing_prob
140
+ is_phishing = phishing_prob >= threshold
141
+
142
+ return {
143
+ 'model_name': 'CNN URL (Char-level)',
144
+ 'prediction': 'PHISHING' if is_phishing else 'LEGITIMATE',
145
+ 'prediction_code': int(is_phishing),
146
+ 'confidence': (phishing_prob if is_phishing else legitimate_prob) * 100,
147
+ 'phishing_probability': phishing_prob * 100,
148
+ 'legitimate_probability': legitimate_prob * 100,
149
+ 'threshold': threshold,
150
+ }
151
+
152
+ def predict_html(self, html: str, threshold: float = 0.5) -> dict | None:
153
+ """Predict using the HTML CNN model."""
154
+ if self.html_model is None:
155
+ return None
156
+
157
+ X = self._encode_html(html)
158
+ phishing_prob = float(self.html_model.predict(X, verbose=0)[0][0])
159
+ legitimate_prob = 1.0 - phishing_prob
160
+ is_phishing = phishing_prob >= threshold
161
+
162
+ return {
163
+ 'model_name': 'CNN HTML (Char-level)',
164
+ 'prediction': 'PHISHING' if is_phishing else 'LEGITIMATE',
165
+ 'prediction_code': int(is_phishing),
166
+ 'confidence': (phishing_prob if is_phishing else legitimate_prob) * 100,
167
+ 'phishing_probability': phishing_prob * 100,
168
+ 'legitimate_probability': legitimate_prob * 100,
169
+ 'threshold': threshold,
170
+ 'html_length': len(html),
171
+ }
172
+
173
+ def predict_full(self, url: str, threshold: float = 0.5) -> dict:
174
+ """
175
+ Run both CNN models on a URL.
176
+
177
+ Returns dict with url_result, html_result, and combined verdict.
178
+ """
179
+ # URL CNN
180
+ url_result = self.predict_url(url, threshold)
181
+
182
+ # HTML CNN — fetch page first
183
+ html_result = None
184
+ html_content = None
185
+ if self.html_model is not None:
186
+ html_content = self._fetch_html(url)
187
+ if html_content and len(html_content) >= 100:
188
+ html_result = self.predict_html(html_content, threshold)
189
+
190
+ # Combined verdict
191
+ results = [r for r in [url_result, html_result] if r is not None]
192
+ if len(results) == 2:
193
+ avg_phish = (url_result['phishing_probability'] +
194
+ html_result['phishing_probability']) / 2
195
+ combined_is_phishing = avg_phish >= (threshold * 100)
196
+ combined = {
197
+ 'prediction': 'PHISHING' if combined_is_phishing else 'LEGITIMATE',
198
+ 'phishing_probability': avg_phish,
199
+ 'legitimate_probability': 100 - avg_phish,
200
+ 'confidence': avg_phish if combined_is_phishing else 100 - avg_phish,
201
+ 'agree': url_result['prediction'] == html_result['prediction'],
202
+ }
203
+ elif len(results) == 1:
204
+ r = results[0]
205
+ combined = {
206
+ 'prediction': r['prediction'],
207
+ 'phishing_probability': r['phishing_probability'],
208
+ 'legitimate_probability': r['legitimate_probability'],
209
+ 'confidence': r['confidence'],
210
+ 'agree': True,
211
+ }
212
+ else:
213
+ combined = None
214
+
215
+ return {
216
+ 'url_result': url_result,
217
+ 'html_result': html_result,
218
+ 'html_fetched': html_content is not None,
219
+ 'html_length': len(html_content) if html_content else 0,
220
+ 'combined': combined,
221
+ }
222
+
223
+ # ── Pretty print ───────────────────────────────────────────────
224
+
225
+ def print_results(self, url: str, full: dict):
226
+ """Print formatted prediction results from both models."""
227
+ print("\n" + "=" * 80)
228
+ print(f"{Fore.CYAN}{Style.BRIGHT}CNN PHISHING DETECTION RESULTS{Style.RESET_ALL}")
229
+ print("=" * 80)
230
+ print(f"\n{Fore.YELLOW}URL:{Style.RESET_ALL} {url}")
231
+
232
+ # ── URL CNN ──
233
+ url_r = full['url_result']
234
+ if url_r:
235
+ pred = url_r['prediction']
236
+ color = Fore.RED if pred == 'PHISHING' else Fore.GREEN
237
+ icon = "⚠️" if pred == 'PHISHING' else "✓"
238
+ print(f"\n{Style.BRIGHT}1. CNN URL (Character-level):{Style.RESET_ALL}")
239
+ print(f" {icon} Prediction: {color}{Style.BRIGHT}{pred}{Style.RESET_ALL}")
240
+ print(f" Confidence: {url_r['confidence']:.2f}%")
241
+ print(f" Phishing: {Fore.RED}{url_r['phishing_probability']:6.2f}%{Style.RESET_ALL}")
242
+ print(f" Legitimate: {Fore.GREEN}{url_r['legitimate_probability']:6.2f}%{Style.RESET_ALL}")
243
+ else:
244
+ print(f"\n{Style.BRIGHT}1. CNN URL:{Style.RESET_ALL} {Fore.YELLOW}Not available{Style.RESET_ALL}")
245
+
246
+ # ── HTML CNN ──
247
+ html_r = full['html_result']
248
+ if html_r:
249
+ pred = html_r['prediction']
250
+ color = Fore.RED if pred == 'PHISHING' else Fore.GREEN
251
+ icon = "⚠️" if pred == 'PHISHING' else "✓"
252
+ print(f"\n{Style.BRIGHT}2. CNN HTML (Character-level):{Style.RESET_ALL}")
253
+ print(f" {icon} Prediction: {color}{Style.BRIGHT}{pred}{Style.RESET_ALL}")
254
+ print(f" Confidence: {html_r['confidence']:.2f}%")
255
+ print(f" Phishing: {Fore.RED}{html_r['phishing_probability']:6.2f}%{Style.RESET_ALL}")
256
+ print(f" Legitimate: {Fore.GREEN}{html_r['legitimate_probability']:6.2f}%{Style.RESET_ALL}")
257
+ print(f" HTML length: {html_r['html_length']:,} chars")
258
+ elif full['html_fetched']:
259
+ print(f"\n{Style.BRIGHT}2. CNN HTML:{Style.RESET_ALL} "
260
+ f"{Fore.YELLOW}HTML too short for analysis{Style.RESET_ALL}")
261
+ else:
262
+ print(f"\n{Style.BRIGHT}2. CNN HTML:{Style.RESET_ALL} "
263
+ f"{Fore.YELLOW}Could not fetch page HTML{Style.RESET_ALL}")
264
+
265
+ # ── Combined verdict ──
266
+ combined = full['combined']
267
+ if combined:
268
+ pred = combined['prediction']
269
+ color = Fore.RED if pred == 'PHISHING' else Fore.GREEN
270
+ icon = "⚠️" if pred == 'PHISHING' else "✓"
271
+ agree_str = (f"{Fore.GREEN}YES{Style.RESET_ALL}" if combined['agree']
272
+ else f"{Fore.YELLOW}NO{Style.RESET_ALL}")
273
+
274
+ print(f"\n{'─' * 80}")
275
+ print(f"{Style.BRIGHT}COMBINED VERDICT:{Style.RESET_ALL}")
276
+ print(f" {icon} {color}{Style.BRIGHT}{pred}{Style.RESET_ALL} "
277
+ f"(confidence: {combined['confidence']:.2f}%)")
278
+ print(f" Phishing: {Fore.RED}{combined['phishing_probability']:6.2f}%{Style.RESET_ALL}")
279
+ print(f" Legitimate: {Fore.GREEN}{combined['legitimate_probability']:6.2f}%{Style.RESET_ALL}")
280
+ if url_r and html_r:
281
+ print(f" Models agree: {agree_str}")
282
+
283
+ print("\n" + "=" * 80 + "\n")
284
+
285
+
286
def main():
    """Interactive prediction loop.

    Loads both CNN models, reports which are available, then repeatedly
    prompts for a URL, runs the dual analysis, and prints the verdict
    until the user quits.
    """
    # Banner
    print(f"\n{Fore.CYAN}{Style.BRIGHT}╔══════════════════════════════════════════════════════════════╗")
    print(f"║ CNN PHISHING DETECTOR - INTERACTIVE DEMO ║")
    print(f"║ URL CNN + HTML CNN (Dual Analysis) ║")
    print(f"╚══════════════════════════════════════════════════════════════╝{Style.RESET_ALL}\n")

    print(f"{Fore.YELLOW}Loading CNN models...{Style.RESET_ALL}")
    detector = CNNPhishingDetector()

    # Collect the names of whichever models actually loaded
    available = [label for label, model in (("URL CNN", detector.url_model),
                                            ("HTML CNN", detector.html_model))
                 if model is not None]

    if not available:
        print(f"{Fore.RED}No CNN models found! Train models first.{Style.RESET_ALL}")
        sys.exit(1)

    print(f"{Fore.GREEN}✓ Models loaded: {', '.join(available)}{Style.RESET_ALL}\n")

    while True:
        print(f"{Fore.CYAN}{'─' * 80}{Style.RESET_ALL}")
        target = input(f"{Fore.YELLOW}Enter URL to test (or 'quit' to exit):{Style.RESET_ALL} ").strip()

        if target.lower() in ('quit', 'exit', 'q'):
            print(f"\n{Fore.GREEN}Goodbye!{Style.RESET_ALL}\n")
            return

        if not target:
            print(f"{Fore.RED}Please enter a valid URL.{Style.RESET_ALL}\n")
            continue

        # Default to http:// when the scheme was omitted
        if not target.startswith(('http://', 'https://')):
            target = 'http://' + target

        try:
            detector.print_results(target, detector.predict_full(target))
        except Exception as exc:
            print(f"\n{Fore.RED}Error: {exc}{Style.RESET_ALL}\n")
            logger.error(str(exc))


if __name__ == '__main__':
    main()
scripts/testing/data_leakage_test.py ADDED
@@ -0,0 +1,291 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Data Leakage Detection Script
3
+
4
+ Checks for common data leakage issues:
5
+ 1. Duplicate URLs in train/test split
6
+ 2. Feature extraction timing (done before split - CORRECT)
7
+ 3. Scaler fitting (only on train data - CORRECT)
8
+ 4. Feature contamination checks
9
+ """
10
+
11
+ import pandas as pd
12
+ import numpy as np
13
+ from pathlib import Path
14
+ from sklearn.model_selection import train_test_split
15
+ from sklearn.preprocessing import StandardScaler
16
+ import logging
17
+
18
+ logging.basicConfig(
19
+ level=logging.INFO,
20
+ format='%(asctime)s - %(levelname)s - %(message)s',
21
+ datefmt='%H:%M:%S'
22
+ )
23
+ logger = logging.getLogger("data_leakage_check")
24
+
25
+
26
def check_1_duplicate_urls_in_splits():
    """Check if same URLs appear in both train and test sets.

    Also reports any duplicate URLs already present in the source dataset.
    Returns True when train/test share no URLs, False otherwise.
    """
    logger.info("\n" + "="*80)
    logger.info("CHECK 1: DUPLICATE URLs IN TRAIN/TEST SPLITS")
    logger.info("="*80)

    # Load original dataset with URLs
    df = pd.read_csv(Path('data/processed') / 'clean_dataset_no_duplicates.csv')
    logger.info(f"\nOriginal dataset: {len(df):,} URLs")

    # Duplicates inside the source dataset itself
    dup_count = df['url'].duplicated().sum()
    logger.info(f"Duplicates in original dataset: {dup_count}")

    if dup_count > 0:
        logger.warning(f"⚠️ Found {dup_count} duplicate URLs in original dataset!")
        dup_urls = df[df['url'].duplicated(keep=False)]['url'].value_counts()
        logger.info(f"Top duplicated URLs:\n{dup_urls.head(10)}")
    else:
        logger.info("✓ No duplicates in original dataset")

    # Reproduce the exact split used during training
    labels = df['label']
    X_train, X_test, _, _ = train_test_split(
        df['url'], labels, test_size=0.2, random_state=42, stratify=labels
    )

    logger.info(f"\nTrain set: {len(X_train):,} URLs")
    logger.info(f"Test set: {len(X_test):,} URLs")

    # Any URL present on both sides of the split is leakage
    overlap = set(X_train) & set(X_test)
    logger.info(f"\nOverlapping URLs between train/test: {len(overlap)}")

    if overlap:
        logger.error(f"❌ DATA LEAKAGE DETECTED! {len(overlap)} URLs in both train and test!")
        logger.info(f"Sample overlapping URLs:\n{list(overlap)[:5]}")
        return False

    logger.info("✓ No URL overlap between train and test sets")
    return True
76
def check_2_feature_extraction_timing():
    """Check if features were extracted before split (CORRECT) or after (WRONG)."""
    logger.info("\n" + "="*80)
    logger.info("CHECK 2: FEATURE EXTRACTION TIMING")
    logger.info("="*80)

    # Load feature dataset
    features_df = pd.read_csv('data/features/url_features.csv')

    logger.info(f"\nFeature dataset: {len(features_df):,} rows")
    # Column count minus one for the 'label' column
    logger.info(f"Features: {len(features_df.columns) - 1}")

    # Load original dataset
    # NOTE(review): check_1 reads 'clean_dataset_no_duplicates.csv' while this
    # check compares against 'clean_dataset.csv' — confirm which file feature
    # extraction actually ran on; a mismatch here can make this check fail
    # spuriously even when the pipeline is fine.
    original_df = pd.read_csv('data/processed/clean_dataset.csv')

    logger.info(f"Original dataset: {len(original_df):,} rows")

    # Matching row counts are used as a proxy for "features were extracted
    # once over the whole dataset, before any train/test split".
    if len(features_df) == len(original_df):
        logger.info("✓ Feature extraction done on ENTIRE dataset (before split)")
        logger.info(" This is CORRECT - prevents data leakage")
        return True
    else:
        logger.warning("⚠️ Dataset sizes don't match - check extraction process")
        logger.info(f" Difference: {abs(len(features_df) - len(original_df))}")
        return False
104
def check_3_scaler_fitting():
    """Compare a train-only-fitted scaler against an all-data-fitted one.

    Only the fitted statistics (mean_/scale_) are compared; the transformed
    arrays produced by the original version were never used, and the
    positional slices of the shuffled data were meaningless, so both have
    been removed. Returns True when the two fits barely differ.
    """
    logger.info("\n" + "="*80)
    logger.info("CHECK 3: SCALER FITTING (Logistic Regression only)")
    logger.info("="*80)

    # Load features
    features_df = pd.read_csv('data/features/url_features.csv')

    X = features_df.drop('label', axis=1)
    y = features_df['label']

    # Same split parameters as the training pipeline
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # CORRECT way: fit on train only
    scaler_correct = StandardScaler().fit(X_train)

    # WRONG way: fit on all data (the common leakage mistake)
    scaler_wrong = StandardScaler().fit(X)

    # Compare statistics
    logger.info("\nScaler statistics comparison:")
    logger.info("\nCORRECT (fitted on train only):")
    logger.info(f" Train mean: {scaler_correct.mean_[:5]}")
    logger.info(f" Train std: {scaler_correct.scale_[:5]}")

    logger.info("\nWRONG (fitted on all data):")
    logger.info(f" All mean: {scaler_wrong.mean_[:5]}")
    logger.info(f" All std: {scaler_wrong.scale_[:5]}")

    # Average absolute difference across all features
    mean_diff = np.abs(scaler_correct.mean_ - scaler_wrong.mean_).mean()
    std_diff = np.abs(scaler_correct.scale_ - scaler_wrong.scale_).mean()

    logger.info(f"\nAverage difference:")
    logger.info(f" Mean: {mean_diff:.6f}")
    logger.info(f" Std: {std_diff:.6f}")

    if mean_diff < 0.01 and std_diff < 0.01:
        logger.info("✓ Minimal difference - scaler likely fitted correctly on train only")
        return True
    else:
        logger.warning("⚠️ Significant difference detected - review scaler fitting")
        return False
158
def check_4_feature_contamination():
    """Check for features that could leak information.

    Flags any feature whose absolute correlation with the label exceeds
    0.9. Returns True when nothing suspicious is found.
    """
    logger.info("\n" + "="*80)
    logger.info("CHECK 4: FEATURE CONTAMINATION")
    logger.info("="*80)

    df = pd.read_csv('data/features/url_features.csv')

    logger.info("\nChecking for suspiciously perfect correlations with label...")

    labels = df['label']
    corr = df.drop('label', axis=1).corrwith(labels).abs().sort_values(ascending=False)

    logger.info("\nTop 10 features correlated with label:")
    for feat, value in corr.head(10).items():
        logger.info(f" {feat:30s}: {value:.4f}")

    # Anything above 0.9 is a likely leak
    suspicious = corr[corr > 0.9]

    if suspicious.empty:
        logger.info("✓ No suspiciously high correlations detected")
        return True

    logger.warning(f"⚠️ Found {len(suspicious)} features with >0.9 correlation!")
    logger.warning(f" These might be leaking information:\n{suspicious}")
    return False
190
def check_5_train_test_distribution():
    """Check if train/test have similar distributions.

    Compares the phishing ratio on both sides of the (stratified) split.
    Returns True when the ratios differ by less than 1 percentage point.
    """
    logger.info("\n" + "="*80)
    logger.info("CHECK 5: TRAIN/TEST DISTRIBUTION SIMILARITY")
    logger.info("="*80)

    df = pd.read_csv('data/features/url_features.csv')
    labels = df['label']

    # Only the label halves of the split are needed here
    _, _, y_train, y_test = train_test_split(
        df.drop('label', axis=1), labels,
        test_size=0.2, random_state=42, stratify=labels
    )

    logger.info("\nLabel distribution:")
    logger.info(f" Train: {y_train.value_counts().to_dict()}")
    logger.info(f" Test: {y_test.value_counts().to_dict()}")

    # Fraction of phishing samples on each side
    train_ratio = (y_train == 1).sum() / len(y_train)
    test_ratio = (y_test == 1).sum() / len(y_test)
    gap = abs(train_ratio - test_ratio)

    logger.info(f"\nPhishing ratio:")
    logger.info(f" Train: {train_ratio:.4f}")
    logger.info(f" Test: {test_ratio:.4f}")
    logger.info(f" Difference: {gap:.4f}")

    if gap < 0.01:
        logger.info("✓ Train/test distributions are well balanced")
        return True

    logger.warning("⚠️ Train/test distributions differ significantly")
    return False
226
def main():
    """Run all data leakage checks and log a pass/fail summary.

    Each check yields True (pass), False (fail), or None (errored);
    the summary tallies all three outcomes.
    """
    logger.info("="*80)
    logger.info("DATA LEAKAGE DETECTION")
    logger.info("="*80)

    # (result-dict key, label used in error messages, check function)
    checks = (
        ('duplicates', 'duplicate check', check_1_duplicate_urls_in_splits),
        ('extraction_timing', 'extraction timing check', check_2_feature_extraction_timing),
        ('scaler', 'scaler check', check_3_scaler_fitting),
        ('contamination', 'contamination check', check_4_feature_contamination),
        ('distribution', 'distribution check', check_5_train_test_distribution),
    )

    results = {}
    for key, label, run_check in checks:
        try:
            results[key] = run_check()
        except Exception as e:
            logger.error(f"Error in {label}: {e}")
            results[key] = None

    # Final summary
    logger.info("\n" + "="*80)
    logger.info("SUMMARY")
    logger.info("="*80)

    outcomes = list(results.values())
    passed = outcomes.count(True)
    failed = outcomes.count(False)
    errors = outcomes.count(None)

    logger.info(f"\nChecks passed: {passed}")
    logger.info(f"Checks failed: {failed}")
    logger.info(f"Checks errored: {errors}")

    for check, result in results.items():
        if result is True:
            status = "✓ PASS"
        elif result is False:
            status = "❌ FAIL"
        else:
            status = "⚠️ ERROR"
        logger.info(f" {check:20s}: {status}")

    if failed == 0 and errors == 0:
        logger.info("\n🎉 ALL CHECKS PASSED - No data leakage detected!")
        logger.info("Your results are LEGITIMATE!")
    elif failed > 0:
        logger.warning(f"\n⚠️ {failed} checks failed - review your pipeline!")

    logger.info("\n" + "="*80)


if __name__ == "__main__":
    main()
scripts/testing/test_feature_alignment.py ADDED
@@ -0,0 +1,123 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Test feature alignment between extractor and models
3
+ """
4
+ import sys
5
+ from pathlib import Path
6
+ import joblib
7
+ import pandas as pd
8
+
9
+ sys.path.append(str(Path(__file__).parent))
10
+ from scripts.feature_extraction.url_features_v2 import URLFeatureExtractorV2
11
+
12
def test_feature_alignment():
    """Test that feature extraction produces features in the correct order for models.

    For each saved model: determine its expected feature names (from the
    model itself, from the scaler, or from a previously inspected model as a
    fallback), compare them against freshly extracted features, then run a
    prediction on the aligned frame.
    """

    # NOTE(review): this file sits in scripts/testing/, so Path(__file__).parent
    # points at scripts/testing/ — confirm 'saved_models' resolves from here
    # (it may need parents[2] if the models live at the project root).
    models_dir = Path(__file__).parent / 'saved_models'

    model_files = {
        'Logistic Regression': 'logistic_regression.joblib',
        'Random Forest': 'random_forest.joblib',
        'XGBoost': 'xgboost.joblib'
    }

    # Load scaler (used to scale LR input and as a feature-name source)
    scaler_path = models_dir / 'scaler.joblib'
    scaler = None
    if scaler_path.exists():
        scaler = joblib.load(scaler_path)
        print(f"✓ Loaded scaler")
        if hasattr(scaler, 'feature_names_in_'):
            print(f" Scaler has {len(scaler.feature_names_in_)} feature names\n")

    # Initialize extractor
    extractor = URLFeatureExtractorV2()

    # Test URL
    test_url = "https://github.com/user/repo"

    print("Testing feature alignment...\n")
    print(f"Test URL: {test_url}\n")

    # Extract features into a single-row frame (drop label if present)
    features_dict = extractor.extract_features(test_url)
    features_df = pd.DataFrame([features_dict])
    if 'label' in features_df.columns:
        features_df = features_df.drop('label', axis=1)

    print(f"Extracted {len(features_df.columns)} features\n")

    # Remember feature names per model so later models lacking metadata can
    # fall back to an earlier model's ordering.
    feature_names_store = {}

    # Check each model
    for name, filename in model_files.items():
        model_path = models_dir / filename
        if not model_path.exists():
            print(f"❌ {name}: Model file not found")
            continue

        model = joblib.load(model_path)

        # Determine expected features: model metadata > scaler > fallback.
        # hasattr(None, ...) is False, so a missing scaler is handled safely.
        expected_features = None
        source = None

        if hasattr(model, 'feature_names_in_'):
            expected_features = list(model.feature_names_in_)
            source = "model"
        elif hasattr(scaler, 'feature_names_in_'):
            expected_features = list(scaler.feature_names_in_)
            source = "scaler"
        elif feature_names_store:
            expected_features = list(feature_names_store.values())[0]
            source = "fallback"

        if expected_features:
            feature_names_store[name] = expected_features
            print(f"✓ {name}:")
            # BUGFIX: this line was accidentally printed twice in the original.
            print(f" Expected features: {len(expected_features)} (from {source})")

            # Check missing/extra features
            missing = set(expected_features) - set(features_df.columns)
            extra = set(features_df.columns) - set(expected_features)

            if missing:
                print(f" ⚠ Missing features: {len(missing)}")
                print(f" {list(missing)[:5]}...")

            if extra:
                print(f" ⚠ Extra features: {len(extra)}")
                print(f" {list(extra)[:5]}...")

            if not missing and not extra:
                print(f" ✓ Perfect match!")

            # Rebuild the frame in the model's expected column order,
            # filling any missing column with 0.
            features_aligned = pd.DataFrame(columns=expected_features)
            for feat in expected_features:
                if feat in features_df.columns:
                    features_aligned[feat] = features_df[feat].values
                else:
                    features_aligned[feat] = 0

            # Scale only for Logistic Regression (tree models are unscaled)
            if name == 'Logistic Regression' and scaler is not None:
                features_to_use = scaler.transform(features_aligned)
            else:
                features_to_use = features_aligned

            try:
                pred = model.predict(features_to_use)[0]
                proba = model.predict_proba(features_to_use)[0]
                print(f" ✓ Prediction successful: {'PHISHING' if pred == 1 else 'LEGITIMATE'} ({proba[pred]*100:.1f}%)")
            except Exception as e:
                print(f" ❌ Prediction failed: {e}")
        else:
            print(f"⚠ {name}: No feature names available")

        print()


if __name__ == "__main__":
    test_feature_alignment()
scripts/testing/test_normalization.py ADDED
@@ -0,0 +1,105 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Test URL normalization - verify www/http variants produce same features
3
+ """
4
+ import sys
5
+ from pathlib import Path
6
+ import pandas as pd
7
+
8
+ sys.path.append(str(Path(__file__).parent))
9
+ from scripts.feature_extraction.url_features_v3 import URLFeatureExtractorOptimized
10
+
11
def test_normalization():
    """Test that www/http URL variants produce identical features.

    Feeds groups of equivalent URLs (scheme and 'www.' variants) through
    the extractor and verifies that all key features match across a group,
    while the is_http flag is allowed to vary with the input scheme.
    Unused locals from the original (first_features, norm_url, orig) have
    been removed.
    """

    extractor = URLFeatureExtractorOptimized()

    print("=" * 80)
    print("URL NORMALIZATION TEST")
    print("=" * 80)
    print()

    # Each inner list is one group of URLs expected to normalize to the
    # same domain and yield identical features (except is_http).
    test_cases = [
        [
            "https://github.com/user/repo",
            "http://github.com/user/repo",
            "https://www.github.com/user/repo",
            "http://www.github.com/user/repo",
            "www.github.com/user/repo",
            "github.com/user/repo"
        ],
        [
            "https://example.com/login?user=test",
            "www.example.com/login?user=test",
            "http://www.example.com/login?user=test"
        ]
    ]

    for i, urls in enumerate(test_cases, 1):
        print(f"Test Case {i}: {urls[0].split('/')[2]}")
        print("-" * 80)

        features_list = []
        for url in urls:
            features = extractor.extract_features(url)
            features_list.append(features)

            # Show how the extractor normalized this variant
            _, _, norm_domain, is_http = extractor.normalize_url(url)
            print(f" {url:45s} → {norm_domain:20s} http={is_http}")

        # Features that must be identical across all variants in the group
        key_features = [
            'domain_length', 'domain_dots', 'num_subdomains', 'domain_entropy',
            'path_length', 'url_entropy', 'is_shortened', 'is_free_platform',
            'has_suspicious_tld', 'num_phishing_keywords'
        ]

        print("\n Key Features Comparison:")
        print(" " + "-" * 76)

        all_identical = True

        for feat in key_features:
            values = [f[feat] for f in features_list]

            if len(set(values)) == 1:
                status = "✓"
            else:
                status = "✗"
                all_identical = False

            print(f" {status} {feat:30s}: {values[0]}")

        # is_http is the one feature expected to differ between variants
        print("\n HTTP Flag (should vary based on input):")
        print(" " + "-" * 76)
        for j, url in enumerate(urls):
            http_flag = features_list[j]['is_http']
            print(f" {url:45s} → http={http_flag}")

        print()
        if all_identical:
            print(f" ✅ TEST PASSED: All key features identical!")
        else:
            print(f" ❌ TEST FAILED: Features differ!")

        print("\n")

    print("=" * 80)
    print("FEATURE COUNT")
    print("=" * 80)

    # NOTE(review): get_feature_names() is not visible from here — confirm the
    # optimized extractor really exposes it (the original call carried a
    # pyright ignore, suggesting the type checker could not find it either).
    feature_names = extractor.get_feature_names()
    print(f"Total features: {len(feature_names)}")
    print()
    print("Top 30 features:")
    for idx, name in enumerate(feature_names[:30], 1):
        print(f" {idx:2d}. {name}")


if __name__ == "__main__":
    test_normalization()
scripts/testing/test_server.py ADDED
@@ -0,0 +1,255 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Test Server Predictions Against Dataset
3
+ Validates server API predictions vs actual labels
4
+ """
5
+ import pandas as pd
6
+ import requests
7
+ from pathlib import Path
8
+ import logging
9
+ from tqdm import tqdm
10
+ import time
11
+ from sklearn.metrics import (
12
+ accuracy_score, precision_score, recall_score, f1_score,
13
+ confusion_matrix, classification_report
14
+ )
15
+
16
# Setup logging — timestamped INFO-level console output shared by the module.
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    datefmt='%H:%M:%S'
)
logger = logging.getLogger(__name__)
23
+
24
+
25
class ServerTester:
    """Test phishing detection server against dataset.

    Replays a labelled URL CSV against the running HTTP prediction API,
    collects per-URL verdicts, and reports accuracy/precision/recall/F1.
    """

    def __init__(self, server_url='http://localhost:8000', batch_size=100):
        # Base URL of the prediction server (no trailing slash expected).
        self.server_url = server_url
        # NOTE(review): batch_size is stored but never used below — requests
        # are issued one at a time; confirm whether batching was intended.
        self.batch_size = batch_size
        # Accumulated per-URL outcome dicts (populated by test_dataset).
        self.results = []

    def check_server_health(self):
        """Check if server is running; return True on a 200 from /api/health."""
        try:
            response = requests.get(f"{self.server_url}/api/health", timeout=5)
            if response.status_code == 200:
                health = response.json()
                logger.info(f"✓ Server is healthy")
                logger.info(f" URL models: {health.get('url_models', 0)}")
                logger.info(f" HTML models: {health.get('html_models', 0)}")
                return True
            else:
                logger.error(f"Server health check failed: {response.status_code}")
                return False
        except Exception as e:
            # Connection refused / timeout etc. — treat as "not running".
            logger.error(f"Cannot connect to server: {e}")
            logger.error(f"Make sure server is running: python server/app.py")
            return False

    def predict_url(self, url):
        """Get prediction from server for a URL.

        Returns:
            dict with 'predicted' (1 = phishing, 0 = legitimate),
            'consensus' and the per-model 'predictions' list,
            or None on any request/server error.
        """
        try:
            response = requests.post(
                f"{self.server_url}/api/predict/url",
                json={"url": url},
                timeout=10
            )

            if response.status_code == 200:
                result = response.json()
                return {
                    'predicted': 1 if result['is_phishing'] else 0,
                    'consensus': result['consensus'],
                    'predictions': result['predictions']
                }
            else:
                logger.warning(f"Server error for {url}: {response.status_code}")
                return None

        except Exception as e:
            # Network failures are logged and surfaced as None so the caller
            # can count them instead of aborting the whole run.
            logger.warning(f"Request error for {url}: {e}")
            return None

    def test_dataset(self, dataset_path, limit=None, sample_frac=None):
        """
        Test server predictions against dataset.

        Args:
            dataset_path: Path to CSV with 'url' and 'label' columns
            limit: Maximum number of URLs to test (None = all)
            sample_frac: Random sample fraction (e.g., 0.1 = 10%)

        Returns:
            dict with 'y_true', 'y_pred' and per-URL 'results',
            or None when the server health check fails.
        """
        logger.info("="*80)
        logger.info("SERVER PREDICTION TESTING")
        logger.info("="*80)

        # Load dataset
        logger.info(f"\n1. Loading dataset: {dataset_path}")
        df = pd.read_csv(dataset_path)

        # Ensure we have required columns
        if 'label' not in df.columns:
            # Assume first column is URL, second is label
            # NOTE(review): this assumes the CSV has exactly two columns;
            # pandas raises if it has more — confirm upstream file format.
            df.columns = ['url', 'label']

        logger.info(f" Total URLs: {len(df):,}")
        logger.info(f" Phishing: {(df['label']==1).sum():,}")
        logger.info(f" Legitimate: {(df['label']==0).sum():,}")

        # Sample if requested (fixed seed so repeated runs test the same subset)
        if sample_frac:
            df = df.sample(frac=sample_frac, random_state=42)
            logger.info(f"\n Sampled {sample_frac*100:.1f}%: {len(df):,} URLs")

        # Limit if requested
        if limit and limit < len(df):
            df = df.head(limit)
            logger.info(f" Limited to: {limit:,} URLs")

        # Check server
        logger.info("\n2. Checking server health...")
        if not self.check_server_health():
            return None

        # Test predictions
        logger.info("\n3. Testing predictions...")
        y_true = []
        y_pred = []
        errors = 0

        for idx, row in tqdm(df.iterrows(), total=len(df), desc="Testing URLs"):
            url = row['url'] if 'url' in row else row.iloc[0]
            true_label = int(row['label']) if 'label' in row else int(row.iloc[1])

            # Get prediction
            result = self.predict_url(url)

            if result:
                y_true.append(true_label)
                y_pred.append(result['predicted'])

                self.results.append({
                    'url': url,
                    'true_label': true_label,
                    'predicted_label': result['predicted'],
                    'consensus': result['consensus'],
                    'correct': true_label == result['predicted']
                })
            else:
                errors += 1

            # Rate limiting
            time.sleep(0.01)  # 10ms delay between requests

        logger.info(f"\n Processed: {len(y_pred):,} URLs")
        if errors > 0:
            logger.warning(f" Errors: {errors:,}")

        # Calculate metrics
        self._display_results(y_true, y_pred)

        return {
            'y_true': y_true,
            'y_pred': y_pred,
            'results': self.results
        }

    def _display_results(self, y_true, y_pred):
        """Display test results and metrics via the module logger."""
        logger.info("\n" + "="*80)
        logger.info("TEST RESULTS")
        logger.info("="*80)

        # Calculate metrics
        # NOTE(review): precision/recall are undefined when a class is absent
        # from the sample — sklearn will warn and return 0 in that case.
        accuracy = accuracy_score(y_true, y_pred)
        precision = precision_score(y_true, y_pred)
        recall = recall_score(y_true, y_pred)
        f1 = f1_score(y_true, y_pred)

        logger.info(f"\nOverall Metrics:")
        logger.info(f" Accuracy: {accuracy:.4f} ({accuracy*100:.2f}%)")
        logger.info(f" Precision: {precision:.4f} ({precision*100:.2f}%)")
        logger.info(f" Recall: {recall:.4f} ({recall*100:.2f}%)")
        logger.info(f" F1-Score: {f1:.4f} ({f1*100:.2f}%)")

        # Confusion matrix
        # NOTE(review): ravel() into four values assumes both classes occur;
        # a single-class sample raises here — confirm inputs.
        cm = confusion_matrix(y_true, y_pred)
        tn, fp, fn, tp = cm.ravel()

        logger.info(f"\nConfusion Matrix:")
        logger.info(f" Predicted")
        logger.info(f" Legit Phish")
        logger.info(f"Actual Legit {tn:6,} {fp:6,}")
        logger.info(f" Phish {fn:6,} {tp:6,}")

        logger.info(f"\nError Analysis:")
        logger.info(f" True Negatives: {tn:,} (correctly identified legitimate)")
        logger.info(f" True Positives: {tp:,} (correctly identified phishing)")
        # NOTE(review): the two rate lines divide by (tn+fp) / (tp+fn) and
        # would raise ZeroDivisionError if a class is missing.
        logger.info(f" False Positives: {fp:,} ({fp/(tn+fp)*100:.2f}% of legitimate marked as phishing)")
        logger.info(f" False Negatives: {fn:,} ({fn/(tp+fn)*100:.2f}% of phishing marked as legitimate) ⚠️")

        # Classification report
        logger.info(f"\nDetailed Classification Report:")
        logger.info(classification_report(
            y_true, y_pred,
            target_names=['Legitimate', 'Phishing'],
            digits=4
        ))

    def save_results(self, output_path):
        """Save test results to CSV; no-op (with a warning) when empty."""
        if not self.results:
            logger.warning("No results to save")
            return

        df = pd.DataFrame(self.results)
        df.to_csv(output_path, index=False)
        logger.info(f"\n✓ Results saved: {output_path}")
        logger.info(f" Total: {len(df):,} predictions")
        logger.info(f" Correct: {df['correct'].sum():,} ({df['correct'].mean()*100:.2f}%)")
        logger.info(f" Incorrect: {(~df['correct']).sum():,}")
213
+
214
+
215
def main():
    """Entry point: run a capped validation pass against the local API server."""
    dataset_csv = Path('data/processed/mega_dataset_full_912357.csv')
    results_csv = Path('results/server_test_results.csv')
    results_csv.parent.mkdir(parents=True, exist_ok=True)

    # Bail out early (listing alternatives) when the dataset is missing.
    if not dataset_csv.exists():
        logger.error(f"Dataset not found: {dataset_csv}")
        logger.info("Available datasets:")
        for csv_file in Path('data/processed').glob('*.csv'):
            logger.info(f" - {csv_file}")
        return

    tester = ServerTester(server_url='http://localhost:8000')

    logger.info("\nTesting with 10% sample for quick validation...")
    logger.info("(Use sample_frac=1.0 or remove it to test full dataset)")

    # `limit` caps the run at an exact URL count; pass sample_frac=0.1 for a
    # random 10% (~91k URLs) or 1.0 for the full dataset instead.
    outcome = tester.test_dataset(dataset_csv, limit=1000)

    if not outcome:
        return

    tester.save_results(results_csv)

    logger.info("\n" + "="*80)
    logger.info("✓ SERVER TESTING COMPLETE!")
    logger.info("="*80)
    logger.info(f"\nResults saved to: {results_csv}")
    logger.info("\nTo test full dataset, change sample_frac=1.0")


if __name__ == '__main__':
    main()
scripts/utils/analyze_dataset.py ADDED
@@ -0,0 +1,42 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
import pandas as pd
import sys

# Cleaned URL dataset to summarise.
csv_path = 'data/processed/url_dataset_cleaned.csv'

try:
    data = pd.read_csv(csv_path)
    total = len(data)

    banner = "=" * 50
    rule = "-" * 50

    print(banner)
    print("ANALÝZA DATASETU")
    print(banner)
    print(f"\nCelkový počet záznamov: {total}")
    print(f"\nRozdělenie labelov:")
    print(rule)

    counts = data['label'].value_counts().sort_index()

    # Per-label count and share of the whole dataset.
    for lbl, cnt in counts.items():
        pct = (cnt / total) * 100
        print(f"Label {lbl}: {cnt} záznamov ({pct:.2f}%)")

    print(rule)
    print(f"\nPomer label 0 / label 1: {counts.get(0, 0) / counts.get(1, 1):.2f}")

    # Flag missing labels when present.
    n_missing = data['label'].isna().sum()
    if n_missing > 0:
        print(f"\nChýbajúce labely: {n_missing}")

    print("\n" + banner)

except FileNotFoundError:
    print(f"Súbor '{csv_path}' nebol nájdený")
    print(f"Aktuálny adresár: {sys.path[0]}")
except KeyError:
    print("Stĺpec 'label' neexistuje v datasete")
    print(f"Dostupné stĺpce: {list(data.columns)}")  # type: ignore
except Exception as e:
    print(f"Chyba: {e}")
scripts/utils/balance_dataset.py ADDED
@@ -0,0 +1,30 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
import pandas as pd

# Undersample the majority class so both labels end up with equal counts.
src = "data/processed/url_dataset_cleaned.csv"
dst = "data/processed/url_dataset_balanced.csv"

print("Loading dataset...")
data = pd.read_csv(src)
print(f"Total rows: {len(data):,}")

counts = data["label"].value_counts()
print(f"Label 0: {counts[0]:,} | Label 1: {counts[1]:,}")

target = counts.min()
rare = counts.idxmin()
common = counts.idxmax()

print(f"\nBalancing to {target:,} per label (matching label {rare})...")

# Keep every minority row; randomly draw the same number of majority rows.
kept_rare = data[data["label"] == rare]
kept_common = data[data["label"] == common].sample(n=target, random_state=42)

# Shuffle the combined frame with a fixed seed for reproducibility.
balanced = (
    pd.concat([kept_rare, kept_common])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

final_counts = balanced["label"].value_counts().sort_index()
print(f"\nBalanced dataset:")
for lbl, cnt in final_counts.items():
    print(f" Label {lbl}: {cnt:,}")

balanced.to_csv(dst, index=False)
print(f"\nSaved {len(balanced):,} rows to {dst}")
scripts/utils/clean_urls.py ADDED
@@ -0,0 +1,49 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ import pandas as pd
2
+ import sys
3
+ import os
4
+
5
def clean_url(url):
    """Remove www. and ensure http(s):// prefix.

    Strips a leading ``www.`` (with or without an explicit scheme) and
    prepends ``http://`` when no scheme is present.
    """
    url = str(url).strip()

    # Drop 'www.' that directly follows an explicit scheme; the scheme is
    # preserved, so nothing more needs to be done in that case.
    for scheme in ("https://", "http://"):
        prefixed = scheme + "www."
        if url.startswith(prefixed):
            return scheme + url[len(prefixed):]

    # Bare 'www.' without a scheme.
    if url.startswith("www."):
        url = url[len("www."):]

    # Guarantee a scheme on everything else.
    if not (url.startswith("http://") or url.startswith("https://")):
        url = "http://" + url

    return url
22
+
23
+
24
def main():
    """CLI driver: clean every URL in a CSV and write a deduplicated copy.

    argv[1] overrides the input path, argv[2] the output path; the default
    output name appends ``_cleaned`` before the extension.
    """
    argv = sys.argv
    input_path = argv[1] if len(argv) > 1 else "data/raw/top-1m.csv"

    base, ext = os.path.splitext(input_path)
    output_path = argv[2] if len(argv) > 2 else f"{base}_cleaned{ext}"

    print(f"Reading {input_path}...")
    df = pd.read_csv(input_path)
    print(f"Loaded {len(df):,} rows")

    print("Cleaning URLs...")
    df["url"] = df["url"].apply(clean_url)

    # Cleaning can collapse distinct rows (e.g. www./non-www pairs); keep the
    # first occurrence of each cleaned URL.
    before = len(df)
    df.drop_duplicates(subset=["url"], keep="first", inplace=True)
    dropped = before - len(df)
    if dropped:
        print(f"Removed {dropped:,} duplicates after cleaning")

    df.to_csv(output_path, index=False)
    print(f"Saved {len(df):,} rows to {output_path}")


if __name__ == "__main__":
    main()
scripts/utils/merge_datasets.py ADDED
@@ -0,0 +1,19 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
import pandas as pd

# Merge the phishing and legitimate URL lists into one labelled dataset.
phishing = pd.read_csv('data/raw/phishing.csv')
legitimate = pd.read_csv('data/raw/legitimate.csv')

# Normalise header casing of the second file before concatenation.
legitimate.columns = legitimate.columns.str.lower()

merged = pd.concat([phishing, legitimate], ignore_index=True).drop_duplicates()

merged.to_csv('data/processed/clean_dataset.csv', index=False)

print(f"Datasety boli úspešne spojené")
print(f"Počet záznamov v prvom súbore: {len(phishing)}")
print(f"Počet záznamov v druhom súbore: {len(legitimate)}")
print(f"Celkový počet záznamov: {len(merged)}")
print(f"\nPrvých 5 riadkov spojeného datasetu:")
print(merged.head())
scripts/utils/remove_duplicates.py ADDED
@@ -0,0 +1,23 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Remove duplicates from clean_dataset.csv
3
+ """
4
+ import pandas as pd
5
+
6
+ # Load dataset
7
+ df = pd.read_csv('data/processed/clean_dataset.csv')
8
+ print(f"Original: {len(df):,} URLs")
9
+
10
+ # Check duplicates
11
+ print(f"Duplicates: {df.duplicated(subset='url').sum():,}")
12
+
13
+ # Keep first occurrence of each URL
14
+ df_clean = df.drop_duplicates(subset='url', keep='first')
15
+ print(f"After removing duplicates: {len(df_clean):,} URLs")
16
+
17
+ # Check label distribution
18
+ print(f"\nLabel distribution:")
19
+ print(df_clean['label'].value_counts())
20
+
21
+ # Save
22
+ df_clean.to_csv('data/processed/clean_dataset_no_duplicates.csv', index=False)
23
+ print(f"\n✓ Saved to: data/processed/clean_dataset_no_duplicates.csv")
server/__pycache__/app.cpython-313.pyc ADDED
Binary file (40 kB). View file
 
server/app.py ADDED
@@ -0,0 +1,819 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Phishing Detection API Server
3
+ FastAPI server combining URL and HTML phishing detection
4
+ """
5
+ import os
6
+ os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
7
+ os.environ['TF_ENABLE_ONEDNN_OPTS'] = '0'
8
+
9
+ import sys
10
+ from pathlib import Path
11
+ from typing import Optional
12
+ import warnings
13
+
14
+ # Suppress warnings before importing other libraries
15
+ warnings.filterwarnings('ignore', category=UserWarning)
16
+ warnings.filterwarnings('ignore', message='.*XGBoost.*')
17
+ warnings.filterwarnings('ignore', message='.*Unverified HTTPS.*')
18
+
19
+ from fastapi import FastAPI, HTTPException
20
+ from fastapi.staticfiles import StaticFiles
21
+ from fastapi.responses import HTMLResponse, JSONResponse
22
+ from fastapi.middleware.cors import CORSMiddleware
23
+ from pydantic import BaseModel
24
+ import json
25
+ import joblib
26
+ import pandas as pd
27
+ import numpy as np
28
+ import requests
29
+ from urllib.parse import urlparse
30
+ import logging
31
+ import urllib3
32
+ urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
33
+
34
+ # Add parent directory to path
35
+ sys.path.append(str(Path(__file__).parent.parent))
36
+
37
+ # Use OPTIMIZED URL feature extractor with normalization
38
+ from scripts.feature_extraction.url.url_features_v3 import URLFeatureExtractorOptimized
39
+ from scripts.feature_extraction.html.html_feature_extractor import HTMLFeatureExtractor
40
+ from scripts.feature_extraction.html.feature_engineering import engineer_features
41
+
42
# Setup logging — default INFO-level root configuration for the server module.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
45
+
46
+
47
+ # Helper function to convert numpy/pandas types to Python native types
48
def convert_to_json_serializable(obj):
    """Recursively coerce numpy/pandas values into plain Python types.

    Dicts and lists are walked recursively; numpy scalars become int/float/
    bool, ndarrays become lists, and pandas Series/DataFrames are converted
    via their dict representation. Anything else passes through untouched.
    """
    if isinstance(obj, dict):
        return {k: convert_to_json_serializable(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [convert_to_json_serializable(item) for item in obj]
    if isinstance(obj, (np.integer, np.int64, np.int32)):  # type: ignore
        return int(obj)
    if isinstance(obj, (np.floating, np.float64, np.float32)):  # type: ignore
        return float(obj)
    if isinstance(obj, np.ndarray):
        return convert_to_json_serializable(obj.tolist())
    if isinstance(obj, (pd.Series, pd.DataFrame)):
        return convert_to_json_serializable(obj.to_dict())
    if isinstance(obj, np.bool_):
        return bool(obj)
    return obj
66
+
67
+
68
# Initialize FastAPI app
app = FastAPI(
    title="Phishing Detection API",
    description="API for detecting phishing URLs and HTML content",
    version="1.0.0"
)

# CORS middleware
# NOTE(review): wildcard origins combined with allow_credentials=True is very
# permissive — confirm this is intended outside local development.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Mount static files — serves the bundled web UI from server/static at /static.
static_dir = Path(__file__).parent / 'static'
app.mount("/static", StaticFiles(directory=str(static_dir)), name="static")
87
+
88
+
89
+ # Request models
90
class URLRequest(BaseModel):
    """Request body for URL-based prediction endpoints."""
    url: str  # the URL to classify
92
+
93
class HTMLRequest(BaseModel):
    """Request body for raw-HTML prediction endpoints."""
    html_content: str  # full HTML document to classify
    url: Optional[str] = None  # originating URL, when known
96
+
97
+
98
+ # Response models
99
class PredictionResult(BaseModel):
    """One model's verdict; probability/confidence fields are percentages (0-100)."""
    model_name: str
    prediction: str  # 'PHISHING' or 'LEGITIMATE'
    confidence: float
    phishing_probability: float
    legitimate_probability: float
105
+
106
+
107
class URLPredictionResponse(BaseModel):
    """Aggregate response for a URL prediction: overall verdict plus per-model detail."""
    url: str
    is_phishing: bool  # majority vote across models
    consensus: str  # human-readable vote summary
    predictions: list[PredictionResult]
    features: dict  # raw extracted URL features
113
+
114
+
115
class HTMLPredictionResponse(BaseModel):
    """Aggregate response for an HTML prediction: overall verdict plus per-model detail."""
    source: str  # URL or 'HTML Content' label for where the HTML came from
    is_phishing: bool
    consensus: str
    predictions: list[PredictionResult]
    features: dict  # raw extracted HTML features
121
+
122
+
123
class PhishingDetectorService:
    """Singleton service for phishing detection with pre-loaded models."""

    # Singleton bookkeeping: one shared instance, models loaded only once.
    _instance = None
    _initialized = False

    # Domains whose URLs short-circuit to a canned LEGITIMATE verdict.
    # Matching semantics live in predict_url, not here.
    TRUSTED_DOMAINS = frozenset({
        'youtube.com', 'facebook.com', 'twitter.com', 'x.com',
        'linkedin.com', 'microsoft.com', 'apple.com', 'amazon.com',
        'github.com', 'gitlab.com', 'stackoverflow.com',
        'claude.ai', 'anthropic.com', 'openai.com', 'chatgpt.com',
        'wikipedia.org', 'reddit.com', 'instagram.com', 'whatsapp.com',
    })

    # Decision threshold applied to each model's phishing probability.
    DEFAULT_THRESHOLD = 0.5

    # Browser-like User-Agent used when downloading a page's HTML.
    HTML_DOWNLOAD_HEADERS = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'
    }
142
+
143
    def __new__(cls):
        # Classic singleton: create the instance exactly once, then reuse it.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance
147
+
148
    def __init__(self):
        """Load all extractors and models once; subsequent calls are no-ops."""
        # __init__ runs on every PhishingDetectorService() call because of the
        # singleton __new__; the flag prevents re-loading the models.
        if self._initialized:
            return

        logger.info("Initializing Phishing Detector Service...")

        self.models_dir = Path(__file__).parent.parent / 'saved_models'

        # Initialize extractors
        self.url_extractor = URLFeatureExtractorOptimized()
        self.html_extractor = HTMLFeatureExtractor()

        # Load models (each loader skips files that are absent on disk)
        self.url_models = {}
        self.url_feature_names = {}
        self.scaler = None
        self._load_url_models()

        self.html_models = {}
        self._load_html_models()

        self.combined_models = {}
        self._load_combined_models()

        # CNN models — left as None when the model/vocab files are missing
        self.cnn_url_model = None
        self.cnn_url_vocab = None
        self.cnn_html_model = None
        self.cnn_html_vocab = None
        self._load_cnn_url_model()
        self._load_cnn_html_model()

        self._initialized = True
        logger.info("✓ Service initialized successfully")
182
+
183
    def _load_url_models(self):
        """Load URL prediction models and remember each model's feature order."""
        # Load scaler (needed at predict time for Logistic Regression inputs)
        scaler_path = self.models_dir / 'scaler.joblib'
        if scaler_path.exists():
            self.scaler = joblib.load(scaler_path)
            logger.info("✓ Loaded scaler for URL models")

        # Load models — absent files are silently skipped
        url_model_files = {
            'Logistic Regression': 'logistic_regression.joblib',
            'Random Forest': 'random_forest.joblib',
            'XGBoost': 'xgboost.joblib'
        }

        for name, filename in url_model_files.items():
            model_path = self.models_dir / filename
            if model_path.exists():
                model = joblib.load(model_path)
                self.url_models[name] = model

                # Store expected feature names from model
                if hasattr(model, 'feature_names_in_'):
                    self.url_feature_names[name] = list(model.feature_names_in_)
                    logger.info(f"✓ Loaded URL model: {name} ({len(self.url_feature_names[name])} features)")
                elif self.scaler and hasattr(self.scaler, 'feature_names_in_'):
                    # Use scaler's feature names for models without them (like Logistic Regression)
                    self.url_feature_names[name] = list(self.scaler.feature_names_in_)
                    logger.info(f"✓ Loaded URL model: {name} (using scaler features: {len(self.url_feature_names[name])} features)")
                else:
                    logger.info(f"✓ Loaded URL model: {name}")
214
+
215
    def _load_html_models(self):
        """Load HTML prediction models.

        Each loaded entry pairs the fitted model with its saved feature-name
        list ('features' is None when the names file is absent).
        """
        html_model_files = {
            'Random Forest': ('random_forest_html.joblib', 'random_forest_html_feature_names.joblib'),
            'XGBoost': ('xgboost_html.joblib', 'xgboost_html_feature_names.joblib'),
        }

        for name, (model_file, features_file) in html_model_files.items():
            model_path = self.models_dir / model_file
            features_path = self.models_dir / features_file
            if model_path.exists():
                self.html_models[name] = {
                    'model': joblib.load(model_path),
                    'features': joblib.load(features_path) if features_path.exists() else None,
                }
                logger.info(f"✓ Loaded HTML model: {name}")
231
+
232
    def _load_combined_models(self):
        """Load combined URL+HTML prediction models.

        Same layout as the HTML models: a fitted model plus an optional
        feature-name list per entry.
        """
        combined_model_files = {
            'Random Forest Combined': ('random_forest_combined.joblib', 'random_forest_combined_feature_names.joblib'),
            'XGBoost Combined': ('xgboost_combined.joblib', 'xgboost_combined_feature_names.joblib'),
        }

        for name, (model_file, features_file) in combined_model_files.items():
            model_path = self.models_dir / model_file
            features_path = self.models_dir / features_file
            if model_path.exists():
                self.combined_models[name] = {
                    'model': joblib.load(model_path),
                    'features': joblib.load(features_path) if features_path.exists() else None,
                }
                # '?' when the feature-name file was missing
                n = len(self.combined_models[name]['features']) if self.combined_models[name]['features'] else '?'
                logger.info(f"✓ Loaded combined model: {name} ({n} features)")
249
+
250
    def _load_cnn_url_model(self):
        """Load character-level CNN URL model and vocabulary.

        Both the .keras model and the vocab JSON must exist; on any failure
        the attributes stay/become None and a warning is logged.
        """
        model_path = self.models_dir / 'cnn_url_model.keras'
        vocab_path = self.models_dir / 'cnn_url_vocab.json'

        if not model_path.exists():
            logger.warning(f"✗ CNN URL model not found: {model_path}")
            return
        if not vocab_path.exists():
            logger.warning(f"✗ CNN URL vocabulary not found: {vocab_path}")
            return

        try:
            # Imported lazily so the server still starts without TensorFlow.
            import tensorflow as tf
            self.cnn_url_model = tf.keras.models.load_model(str(model_path))

            with open(vocab_path, 'r') as f:
                self.cnn_url_vocab = json.load(f)

            logger.info(f"✓ Loaded CNN URL model (vocab_size={self.cnn_url_vocab['vocab_size']}, max_len={self.cnn_url_vocab['max_len']})")
        except Exception as e:
            logger.warning(f"✗ Failed to load CNN URL model: {e}")
            self.cnn_url_model = None
            self.cnn_url_vocab = None
274
+
275
    def _load_cnn_html_model(self):
        """Load character-level CNN HTML model and vocabulary.

        Mirrors _load_cnn_url_model: missing files or load errors leave the
        attributes as None with a logged warning.
        """
        model_path = self.models_dir / 'cnn_html_model.keras'
        vocab_path = self.models_dir / 'cnn_html_vocab.json'

        if not model_path.exists():
            logger.warning(f"✗ CNN HTML model not found: {model_path}")
            return
        if not vocab_path.exists():
            logger.warning(f"✗ CNN HTML vocabulary not found: {vocab_path}")
            return

        try:
            # Imported lazily so the server still starts without TensorFlow.
            import tensorflow as tf
            self.cnn_html_model = tf.keras.models.load_model(str(model_path))

            with open(vocab_path, 'r') as f:
                self.cnn_html_vocab = json.load(f)

            logger.info(f"✓ Loaded CNN HTML model (vocab_size={self.cnn_html_vocab['vocab_size']}, max_len={self.cnn_html_vocab['max_len']})")
        except Exception as e:
            logger.warning(f"✗ Failed to load CNN HTML model: {e}")
            self.cnn_html_model = None
            self.cnn_html_vocab = None
299
+
300
+ def _encode_for_cnn(self, text: str, vocab: dict) -> np.ndarray:
301
+ """Encode text to a padded integer sequence for a CNN model."""
302
+ char_to_idx = vocab['char_to_idx']
303
+ max_len = vocab['max_len']
304
+ PAD_IDX = 0
305
+ UNK_IDX = 1
306
+ encoded = [char_to_idx.get(c, UNK_IDX) for c in text[:max_len]]
307
+ encoded += [PAD_IDX] * (max_len - len(encoded))
308
+ return np.array([encoded], dtype=np.int32)
309
+
310
+ # ── Shared helpers ─────────────────────────────────────────────
311
+
312
+ @staticmethod
313
+ def _calculate_consensus(predictions: list[dict]) -> tuple[bool, str]:
314
+ """Return (is_phishing, consensus_text) from a list of prediction dicts."""
315
+ total = len(predictions)
316
+ phishing_votes = sum(1 for p in predictions if p['prediction'] == 'PHISHING')
317
+ is_phishing = phishing_votes > total / 2
318
+
319
+ if phishing_votes == total:
320
+ consensus = "ALL MODELS AGREE: PHISHING"
321
+ elif phishing_votes == 0:
322
+ consensus = "ALL MODELS AGREE: LEGITIMATE"
323
+ else:
324
+ consensus = f"MIXED: {phishing_votes}/{total} models say PHISHING"
325
+
326
+ return is_phishing, consensus
327
+
328
+ def _align_features(self, features_df: pd.DataFrame, model_name: str) -> np.ndarray:
329
+ """Align extracted features to a model's expected feature order."""
330
+ expected = self.url_feature_names.get(model_name)
331
+ if expected is None and self.url_feature_names:
332
+ expected = next(iter(self.url_feature_names.values()))
333
+
334
+ if expected is not None:
335
+ aligned = pd.DataFrame(columns=expected)
336
+ for feat in expected:
337
+ aligned[feat] = features_df[feat].values if feat in features_df.columns else 0
338
+ return aligned.values
339
+
340
+ return features_df.values
341
+
342
+ @staticmethod
343
+ def _build_prediction(model_name: str, model, features: np.ndarray, threshold: float = 0.5) -> dict:
344
+ """Run a single model and return a standardised prediction dict."""
345
+ if hasattr(model, 'predict_proba'):
346
+ probabilities = model.predict_proba(features)[0]
347
+ pred = 1 if probabilities[1] > threshold else 0
348
+ confidence = probabilities[pred] * 100
349
+ phishing_prob = probabilities[1] * 100
350
+ legitimate_prob = probabilities[0] * 100
351
+ else:
352
+ pred = model.predict(features)[0]
353
+ confidence = 100.0
354
+ phishing_prob = 100.0 if pred == 1 else 0.0
355
+ legitimate_prob = 0.0 if pred == 1 else 100.0
356
+
357
+ return {
358
+ 'model_name': model_name,
359
+ 'prediction': 'PHISHING' if pred == 1 else 'LEGITIMATE',
360
+ 'confidence': confidence,
361
+ 'phishing_probability': phishing_prob,
362
+ 'legitimate_probability': legitimate_prob,
363
+ }
364
+
365
+ @staticmethod
366
+ def _whitelisted_prediction(model_name: str) -> dict:
367
+ """Return a pre-built LEGITIMATE prediction for whitelisted domains."""
368
+ return {
369
+ 'model_name': model_name,
370
+ 'prediction': 'LEGITIMATE',
371
+ 'confidence': 99.99,
372
+ 'phishing_probability': 0.01,
373
+ 'legitimate_probability': 99.99,
374
+ }
375
+
376
+ # ── URL prediction ────────────────────────────────────────────
377
+
378
+ def predict_url(self, url: str) -> dict:
379
+ """Predict if a URL is phishing using all URL models."""
380
+ parsed = urlparse(url)
381
+ domain = parsed.netloc.lower().replace('www.', '')
382
+ is_whitelisted = any(domain.endswith(d) for d in self.TRUSTED_DOMAINS)
383
+
384
+ # Extract features
385
+ features_dict = self.url_extractor.extract_features(url)
386
+ features_df = pd.DataFrame([features_dict]).drop(columns=['label'], errors='ignore')
387
+
388
+ # Get predictions from each URL model
389
+ predictions = []
390
+ for model_name, model in self.url_models.items():
391
+ if is_whitelisted:
392
+ predictions.append(self._whitelisted_prediction(model_name))
393
+ continue
394
+
395
+ aligned = self._align_features(features_df, model_name)
396
+ if model_name == 'Logistic Regression' and self.scaler:
397
+ aligned = self.scaler.transform(aligned)
398
+
399
+ predictions.append(
400
+ self._build_prediction(model_name, model, aligned, self.DEFAULT_THRESHOLD)
401
+ )
402
+
403
+ is_phishing, consensus = self._calculate_consensus(predictions)
404
+
405
+ return {
406
+ 'url': url,
407
+ 'is_phishing': is_phishing,
408
+ 'consensus': consensus,
409
+ 'predictions': predictions,
410
+ 'features': features_dict,
411
+ }
412
+
413
+ # ── HTML prediction ───────────────────────────────────────────
414
+
415
    def predict_html(self, html_content: str, source: str = "") -> dict:
        """Predict if HTML content is phishing using all HTML models.

        Args:
            html_content: Raw HTML document text.
            source: Label for where the HTML came from (falls back to
                'HTML Content' in the response).

        Returns:
            dict with source, majority verdict, consensus text, per-model
            predictions, and the raw extracted HTML features.
        """
        features = self.html_extractor.extract_features(html_content)
        # NOTE(review): engineer_features presumably adds derived columns on
        # top of the raw extraction — confirm against feature_engineering.py.
        engineered_df = engineer_features(pd.DataFrame([features]))

        predictions = []
        for model_name, model_data in self.html_models.items():
            model = model_data['model']
            feature_names = model_data['features']

            if feature_names:
                # Build the input row in the model's saved feature order,
                # falling back to the raw features dict, then 0.
                feature_list = list(feature_names)
                feature_values = [
                    engineered_df[f].iloc[0] if f in engineered_df.columns else features.get(f, 0)
                    for f in feature_list
                ]
                X = np.array([feature_values])
            else:
                X = engineered_df.values

            predictions.append(self._build_prediction(model_name, model, X))

        is_phishing, consensus = self._calculate_consensus(predictions)

        return {
            'source': source or 'HTML Content',
            'is_phishing': is_phishing,
            'consensus': consensus,
            'predictions': predictions,
            'features': features,
        }
446
+
447
+ # ── Full scan (URL + HTML) ─────────────────────────────────────
448
+
449
+ def predict_from_url(self, url: str) -> dict:
450
+ """Download HTML from URL and analyse both URL and HTML."""
451
+ url_result = self.predict_url(url)
452
+
453
+ try:
454
+ resp = requests.get(url, timeout=10, verify=False, headers=self.HTML_DOWNLOAD_HEADERS)
455
+ html_result = self.predict_html(resp.text, source=url)
456
+
457
+ all_predictions = url_result['predictions'] + html_result['predictions']
458
+ is_phishing, consensus = self._calculate_consensus(all_predictions)
459
+
460
+ return {
461
+ 'url': url,
462
+ 'is_phishing': is_phishing,
463
+ 'url_analysis': url_result,
464
+ 'html_analysis': html_result,
465
+ 'combined_consensus': consensus,
466
+ }
467
+ except Exception as e:
468
+ logger.warning(f"Could not download HTML: {e}")
469
+ return {
470
+ 'url': url,
471
+ 'is_phishing': url_result['is_phishing'],
472
+ 'url_analysis': url_result,
473
+ 'html_analysis': None,
474
+ 'error': str(e),
475
+ }
476
+
477
+
478
+ # ── CNN prediction ─────────────────────────────────────────────
479
+
480
+ def predict_cnn(self, url: str, html_content: str | None = None) -> dict:
481
+ """Predict using both character-level CNN models (URL + HTML)."""
482
+ parsed = urlparse(url)
483
+ domain = parsed.netloc.lower().replace('www.', '')
484
+ is_whitelisted = any(domain.endswith(d) for d in self.TRUSTED_DOMAINS)
485
+
486
+ predictions = []
487
+
488
+ # CNN URL model
489
+ if self.cnn_url_model is not None and self.cnn_url_vocab is not None:
490
+ if is_whitelisted:
491
+ predictions.append(self._whitelisted_prediction('CNN URL (Char-level)'))
492
+ else:
493
+ X = self._encode_for_cnn(url, self.cnn_url_vocab)
494
+ phishing_prob = float(self.cnn_url_model.predict(X, verbose=0)[0][0])
495
+ legitimate_prob = 1.0 - phishing_prob
496
+ is_phishing_pred = phishing_prob >= self.DEFAULT_THRESHOLD
497
+ confidence = (phishing_prob if is_phishing_pred else legitimate_prob) * 100
498
+ predictions.append({
499
+ 'model_name': 'CNN URL (Char-level)',
500
+ 'prediction': 'PHISHING' if is_phishing_pred else 'LEGITIMATE',
501
+ 'confidence': confidence,
502
+ 'phishing_probability': phishing_prob * 100,
503
+ 'legitimate_probability': legitimate_prob * 100,
504
+ })
505
+
506
+ # CNN HTML model
507
+ if self.cnn_html_model is not None and self.cnn_html_vocab is not None and html_content:
508
+ if is_whitelisted:
509
+ predictions.append(self._whitelisted_prediction('CNN HTML (Char-level)'))
510
+ else:
511
+ X = self._encode_for_cnn(html_content, self.cnn_html_vocab)
512
+ phishing_prob = float(self.cnn_html_model.predict(X, verbose=0)[0][0])
513
+ legitimate_prob = 1.0 - phishing_prob
514
+ is_phishing_pred = phishing_prob >= self.DEFAULT_THRESHOLD
515
+ confidence = (phishing_prob if is_phishing_pred else legitimate_prob) * 100
516
+ predictions.append({
517
+ 'model_name': 'CNN HTML (Char-level)',
518
+ 'prediction': 'PHISHING' if is_phishing_pred else 'LEGITIMATE',
519
+ 'confidence': confidence,
520
+ 'phishing_probability': phishing_prob * 100,
521
+ 'legitimate_probability': legitimate_prob * 100,
522
+ })
523
+
524
+ if not predictions:
525
+ raise RuntimeError("No CNN models are loaded")
526
+
527
+ is_phishing, consensus = self._calculate_consensus(predictions)
528
+
529
+ return {
530
+ 'url': url,
531
+ 'is_phishing': is_phishing,
532
+ 'consensus': consensus,
533
+ 'predictions': predictions,
534
+ 'features': {},
535
+ }
536
+
537
+ # ── Combined prediction ────────────────────────────────────────
538
+
539
+ def predict_combined(self, url: str) -> dict:
540
+ """Predict using combined URL+HTML models (single ensemble)."""
541
+ if not self.combined_models:
542
+ raise RuntimeError("No combined models loaded")
543
+
544
+ parsed = urlparse(url)
545
+ domain = parsed.netloc.lower().replace('www.', '')
546
+ is_whitelisted = any(domain.endswith(d) for d in self.TRUSTED_DOMAINS)
547
+
548
+ # Extract URL features
549
+ url_features = self.url_extractor.extract_features(url)
550
+ url_df = pd.DataFrame([url_features]).drop(columns=['label'], errors='ignore')
551
+ url_df = url_df.rename(columns={c: f'url_{c}' for c in url_df.columns})
552
+
553
+ # Download + extract HTML features
554
+ html_features = {}
555
+ html_error = None
556
+ eng_df = pd.DataFrame()
557
+ try:
558
+ resp = requests.get(url, timeout=10, verify=False, headers=self.HTML_DOWNLOAD_HEADERS)
559
+ html_features = self.html_extractor.extract_features(resp.text)
560
+ raw_df = pd.DataFrame([html_features])
561
+ eng_df = engineer_features(raw_df)
562
+ eng_df = eng_df.rename(columns={c: f'html_{c}' for c in eng_df.columns})
563
+ except Exception as e:
564
+ html_error = str(e)
565
+ logger.warning(f"Combined: could not download HTML: {e}")
566
+
567
+ # Combine features
568
+ combined_df = pd.concat([url_df, eng_df], axis=1)
569
+
570
+ # Predict
571
+ predictions = []
572
+ for model_name, model_data in self.combined_models.items():
573
+ if is_whitelisted:
574
+ predictions.append(self._whitelisted_prediction(model_name))
575
+ continue
576
+
577
+ model = model_data['model']
578
+ expected = model_data['features']
579
+
580
+ if expected:
581
+ feature_list = list(expected)
582
+ aligned = pd.DataFrame(columns=feature_list)
583
+ for f in feature_list:
584
+ aligned[f] = combined_df[f].values if f in combined_df.columns else 0
585
+ X = aligned.values
586
+ else:
587
+ X = combined_df.values
588
+
589
+ predictions.append(
590
+ self._build_prediction(model_name, model, X, self.DEFAULT_THRESHOLD)
591
+ )
592
+
593
+ is_phishing, consensus = self._calculate_consensus(predictions)
594
+
595
+ return {
596
+ 'url': url,
597
+ 'is_phishing': is_phishing,
598
+ 'consensus': consensus,
599
+ 'predictions': predictions,
600
+ 'url_features': url_features,
601
+ 'html_features': html_features,
602
+ 'html_error': html_error,
603
+ }
604
+
605
+ # ── Unified all-models prediction ──────────────────────────────
606
+
607
+ def predict_all(self, url: str) -> dict:
608
+ """Run ALL models on a URL and return categorised results."""
609
+ parsed = urlparse(url)
610
+ domain = parsed.netloc.lower().replace('www.', '')
611
+ is_whitelisted = any(domain.endswith(d) for d in self.TRUSTED_DOMAINS)
612
+
613
+ # ── 1. URL feature-based models ───────────────────────────
614
+ url_result = self.predict_url(url)
615
+
616
+ # ── 2. Download HTML (shared across HTML/combined/CNN-HTML) ─
617
+ html_content = None
618
+ html_error = None
619
+ try:
620
+ resp = requests.get(url, timeout=10, verify=False, headers=self.HTML_DOWNLOAD_HEADERS)
621
+ html_content = resp.text
622
+ except Exception as e:
623
+ html_error = str(e)
624
+ logger.warning(f"predict_all: could not download HTML: {e}")
625
+
626
+ # ── 3. HTML feature-based models ─────────────────────────
627
+ html_result = None
628
+ if html_content and self.html_models:
629
+ html_result = self.predict_html(html_content, source=url)
630
+
631
+ # ── 4. Combined URL+HTML feature-based models ────────────
632
+ combined_result = None
633
+ if self.combined_models:
634
+ try:
635
+ combined_result = self._predict_combined_with_html(url, html_content, is_whitelisted)
636
+ except Exception as e:
637
+ logger.warning(f"predict_all: combined prediction failed: {e}")
638
+
639
+ # ── 5. CNN models (URL + HTML) ───────────────────────────
640
+ cnn_result = None
641
+ if self.cnn_url_model is not None or self.cnn_html_model is not None:
642
+ try:
643
+ cnn_result = self.predict_cnn(url, html_content)
644
+ except Exception as e:
645
+ logger.warning(f"predict_all: CNN prediction failed: {e}")
646
+
647
+ # ── Aggregate consensus ──────────────────────────────────
648
+ all_predictions = []
649
+ if url_result:
650
+ all_predictions.extend(url_result.get('predictions', []))
651
+ if html_result:
652
+ all_predictions.extend(html_result.get('predictions', []))
653
+ if combined_result:
654
+ all_predictions.extend(combined_result.get('predictions', []))
655
+ if cnn_result:
656
+ all_predictions.extend(cnn_result.get('predictions', []))
657
+
658
+ is_phishing, consensus = self._calculate_consensus(all_predictions) if all_predictions else (False, "No models available")
659
+
660
+ return {
661
+ 'url': url,
662
+ 'is_phishing': is_phishing,
663
+ 'overall_consensus': consensus,
664
+ 'url_models': url_result,
665
+ 'html_models': html_result,
666
+ 'combined_models': combined_result,
667
+ 'cnn_models': cnn_result,
668
+ 'html_error': html_error,
669
+ }
670
+
671
+ def _predict_combined_with_html(self, url: str, html_content: str | None, is_whitelisted: bool) -> dict:
672
+ """Predict using combined models, optionally with pre-fetched HTML."""
673
+ # Extract URL features
674
+ url_features = self.url_extractor.extract_features(url)
675
+ url_df = pd.DataFrame([url_features]).drop(columns=['label'], errors='ignore')
676
+ url_df = url_df.rename(columns={c: f'url_{c}' for c in url_df.columns})
677
+
678
+ # HTML features
679
+ html_features = {}
680
+ html_error = None
681
+ eng_df = pd.DataFrame()
682
+ if html_content:
683
+ try:
684
+ html_features = self.html_extractor.extract_features(html_content)
685
+ raw_df = pd.DataFrame([html_features])
686
+ eng_df = engineer_features(raw_df)
687
+ eng_df = eng_df.rename(columns={c: f'html_{c}' for c in eng_df.columns})
688
+ except Exception as e:
689
+ html_error = str(e)
690
+
691
+ # Combine
692
+ combined_df = pd.concat([url_df, eng_df], axis=1)
693
+
694
+ # Predict
695
+ predictions = []
696
+ for model_name, model_data in self.combined_models.items():
697
+ if is_whitelisted:
698
+ predictions.append(self._whitelisted_prediction(model_name))
699
+ continue
700
+
701
+ model = model_data['model']
702
+ expected = model_data['features']
703
+
704
+ if expected:
705
+ feature_list = list(expected)
706
+ aligned = pd.DataFrame(columns=feature_list)
707
+ for f in feature_list:
708
+ aligned[f] = combined_df[f].values if f in combined_df.columns else 0
709
+ X = aligned.values
710
+ else:
711
+ X = combined_df.values
712
+
713
+ predictions.append(
714
+ self._build_prediction(model_name, model, X, self.DEFAULT_THRESHOLD)
715
+ )
716
+
717
+ is_phishing, consensus_text = self._calculate_consensus(predictions)
718
+
719
+ return {
720
+ 'url': url,
721
+ 'is_phishing': is_phishing,
722
+ 'consensus': consensus_text,
723
+ 'predictions': predictions,
724
+ 'url_features': url_features,
725
+ 'html_features': html_features,
726
+ 'html_error': html_error,
727
+ }
728
+
729
+
730
# Initialize service (singleton) — created once at import time so every
# request handler shares the same loaded models.
detector = PhishingDetectorService()
732
+
733
+
734
# ── Helpers ───────────────────────────────────────────────────────

def _serve_static_html(filename: str, cache: bool = False) -> HTMLResponse:
    """Return an HTMLResponse for a file inside static/, or 404."""
    static_file = Path(__file__).parent / 'static' / filename
    if not static_file.exists():
        return HTMLResponse(content="<h1>Page not found</h1>", status_code=404)
    # Optionally mark the page as cacheable for one day.
    if cache:
        headers = {"Cache-Control": "public, max-age=86400"}
    else:
        headers = None
    return HTMLResponse(content=static_file.read_text(encoding='utf-8'), headers=headers)
743
+
744
+
745
# ── API Endpoints ─────────────────────────────────────────────────

@app.get("/", response_class=HTMLResponse)
async def root():
    """Serve the main web interface (static/index.html)."""
    return _serve_static_html('index.html')
751
+
752
+
753
@app.get("/models", response_class=HTMLResponse)
async def models_page():
    """Serve the model details page (static/models.html, cacheable)."""
    return _serve_static_html('models.html', cache=True)
757
+
758
+
759
async def _safe_predict(label: str, fn, *args) -> JSONResponse:
    """Run a prediction function with uniform error handling.

    HTTPExceptions raised inside *fn* propagate unchanged (preserving
    their status code); any other failure is logged under *label* and
    surfaced as a 500 with the exception message as detail.
    """
    try:
        return JSONResponse(content=convert_to_json_serializable(fn(*args)))
    except HTTPException:
        # Don't re-wrap deliberate HTTP errors as generic 500s.
        raise
    except Exception as e:
        logger.error(f"Error in {label}: {e}")
        # Chain the cause so tracebacks show the original failure.
        raise HTTPException(status_code=500, detail=str(e)) from e
766
+
767
+
768
@app.post("/api/predict/url", response_model=URLPredictionResponse)
async def predict_url(request: URLRequest):
    """Predict if URL is phishing (URL feature models only).

    Thin wrapper: delegates to the shared detector singleton via
    _safe_predict, which serialises the result and maps failures to 500.
    """
    return await _safe_predict("predict_url", detector.predict_url, request.url)
772
+
773
+
774
@app.post("/api/predict/html")
async def predict_html(request: HTMLRequest):
    """Predict if HTML content is phishing (HTML feature models only).

    request.url is optional and used purely as a display label.
    """
    return await _safe_predict("predict_html", detector.predict_html, request.html_content, request.url or "")
778
+
779
+
780
@app.post("/api/predict/full")
async def predict_full(request: URLRequest):
    """Analyse URL and download HTML for complete analysis (URL + HTML models)."""
    return await _safe_predict("predict_full", detector.predict_from_url, request.url)
784
+
785
+
786
@app.post("/api/predict/combined")
async def predict_combined(request: URLRequest):
    """Predict using combined URL+HTML model (single ensemble over joint features)."""
    return await _safe_predict("predict_combined", detector.predict_combined, request.url)
790
+
791
+
792
@app.post("/api/predict/cnn")
async def predict_cnn(request: URLRequest):
    """Predict using character-level CNN models.

    No HTML is passed here, so only the CNN URL model runs.
    """
    return await _safe_predict("predict_cnn", detector.predict_cnn, request.url, None)
796
+
797
+
798
@app.post("/api/predict/all")
async def predict_all(request: URLRequest):
    """Run ALL models on a URL — unified endpoint (URL, HTML, combined, CNN)."""
    return await _safe_predict("predict_all", detector.predict_all, request.url)
802
+
803
+
804
@app.get("/api/health")
async def health():
    """Health check: report which model groups are currently loaded."""
    report = {"status": "healthy"}
    report["url_models"] = len(detector.url_models)
    report["html_models"] = len(detector.html_models)
    report["combined_models"] = len(detector.combined_models)
    # The CNN models are single objects, so report presence, not count.
    report["cnn_url_model"] = detector.cnn_url_model is not None
    report["cnn_html_model"] = detector.cnn_html_model is not None
    return report
815
+
816
+
817
if __name__ == "__main__":
    import uvicorn
    # Bind on all interfaces; 7860 is presumably chosen for Hugging Face
    # Spaces (its conventional app port) — confirm before changing.
    uvicorn.run(app, host="0.0.0.0", port=7860)
server/static/index.html ADDED
@@ -0,0 +1,50 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>PHISHING DETECTION</title>
    <link rel="stylesheet" href="/static/style.css">
    <!-- Prefetch the model-details page so the "Learn More" link loads instantly -->
    <link rel="prefetch" href="/models">
</head>
<body>
    <div class="container">
        <header>
            <div class="logo">Phishing Detection System</div>
            <div class="tagline">Multi-Model URL + HTML Analysis</div>
        </header>

        <!-- URL input form; analyzeAll()/clearResults() live in script.js -->
        <section class="input-section">
            <label class="input-label" for="urlInput">Enter URL to analyze</label>
            <div class="input-wrapper">
                <input
                    type="text"
                    id="urlInput"
                    placeholder="https://example.com"
                    value="https://github.com"
                />
                <button class="btn" onclick="analyzeAll()">Analyze</button>
            </div>
            <div class="btn-group">
                <button class="btn btn-secondary" onclick="clearResults()">Clear</button>
            </div>
        </section>

        <!-- Progress indicator shown while the /api/predict/all request runs -->
        <div class="loading" id="loading">
            <div class="loading-bar"></div>
            <div class="loading-text">Analyzing with all models</div>
        </div>

        <div class="results" id="results">
            <!-- Results injected here -->
        </div>

        <footer>
            <div class="footer-text">Machine Learning Phishing Detection</div>
            <a href="/models" class="learn-more-btn">Learn More</a>
        </footer>
    </div>

    <!-- ?v=4 is a cache-busting query string; bump it when script.js changes -->
    <script src="/static/script.js?v=4"></script>
</body>
</html>
server/static/models.html ADDED
@@ -0,0 +1,1130 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ <!DOCTYPE html>
2
+ <html lang="en">
3
+ <head>
4
+ <meta charset="UTF-8">
5
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
+ <title>Model Details — Phishing Detection</title>
7
+ <link rel="stylesheet" href="/static/style.css">
8
+ </head>
9
+ <body class="models-page">
10
+ <div class="container">
11
+ <header>
12
+ <div class="header-left">
13
+ <div class="logo"><a href="/">Phishing Detection System</a></div>
14
+ <div class="tagline">Model Performance Details</div>
15
+ </div>
16
+ <a href="/" class="back-link">&larr; Back</a>
17
+ </header>
18
+
19
+ <section class="page-title-section">
20
+ <h1 class="page-title">Model Details</h1>
21
+ <p class="page-description">
22
+ Performance metrics, feature importance, and configuration details for all 9 machine learning models
23
+ used in the phishing detection pipeline. Models span URL features (125), RAW HTML features (77) + engineered (23),
24
+ combined features (225), and character-level CNN approaches.
25
+ </p>
26
+ </section>
27
+
28
+ <!-- DETECTION PIPELINE -->
29
+ <section class="section">
30
+ <div class="section-title">Detection Pipeline</div>
31
+ <div class="pipeline">
32
+ <div class="pipeline-step"><span class="step-number">1</span>URL Input</div>
33
+ <div class="pipeline-step"><span class="step-number">2</span>Feature Extraction</div>
34
+ <div class="pipeline-step"><span class="step-number">3</span>3 URL Models</div>
35
+ <div class="pipeline-step"><span class="step-number">4</span>HTML Download</div>
36
+ <div class="pipeline-step"><span class="step-number">5</span>2 HTML + 2 Combined</div>
37
+ <div class="pipeline-step"><span class="step-number">6</span>2 CNN Models</div>
38
+ <div class="pipeline-step"><span class="step-number">7</span>9-Model Consensus</div>
39
+ </div>
40
+ </section>
41
+
42
+ <!-- URL FEATURES -->
43
+ <section class="section">
44
+ <div class="section-title collapsible-toggle" onclick="toggleFeatures(this)">
45
+ URL Features <span class="feature-count">125 features</span>
46
+ <span class="toggle-icon">+</span>
47
+ </div>
48
+ <div class="collapsible-content">
49
+ <div class="section-subtitle">All features extracted from the URL string. Hover over any feature to see its description.</div>
50
+
51
+ <div class="feature-category-label">Length &amp; Structure</div>
52
+ <div class="feature-grid">
53
+ <div class="feature-chip" data-tip="Total character count of the full URL">url_length</div>
54
+ <div class="feature-chip" data-tip="Character count of the domain name only">domain_length</div>
55
+ <div class="feature-chip" data-tip="Character count of the URL path component">path_length</div>
56
+ <div class="feature-chip" data-tip="Character count of the query string">query_length</div>
57
+ <div class="feature-chip" data-tip="URL length bucket: 0=short (<40), 1=medium, 2=long, 3=very long (>120)">url_length_category</div>
58
+ <div class="feature-chip" data-tip="Domain length bucket: 0=short (<10), 1=medium, 2=long, 3=very long (>30)">domain_length_category</div>
59
+ </div>
60
+
61
+ <div class="feature-category-label">Character Counts</div>
62
+ <div class="feature-grid">
63
+ <div class="feature-chip" data-tip="Number of dots (.) in the full URL">num_dots</div>
64
+ <div class="feature-chip" data-tip="Number of hyphens (-) in the full URL">num_hyphens</div>
65
+ <div class="feature-chip" data-tip="Number of underscores (_) in the full URL">num_underscores</div>
66
+ <div class="feature-chip" data-tip="Number of forward slashes (/) in the URL">num_slashes</div>
67
+ <div class="feature-chip" data-tip="Number of question marks (?) in the URL">num_question_marks</div>
68
+ <div class="feature-chip" data-tip="Number of ampersands (&amp;) in the URL">num_ampersands</div>
69
+ <div class="feature-chip" data-tip="Number of equals signs (=) in the URL">num_equals</div>
70
+ <div class="feature-chip" data-tip="Number of @ symbols — often used to obscure the real destination">num_at</div>
71
+ <div class="feature-chip" data-tip="Number of percent (%) characters indicating URL encoding">num_percent</div>
72
+ <div class="feature-chip" data-tip="Total digit characters in the full URL">num_digits_url</div>
73
+ <div class="feature-chip" data-tip="Total letter characters in the full URL">num_letters_url</div>
74
+ <div class="feature-chip" data-tip="Number of dots (.) in the domain only">domain_dots</div>
75
+ <div class="feature-chip" data-tip="Number of hyphens (-) in the domain only">domain_hyphens</div>
76
+ <div class="feature-chip" data-tip="Number of digit characters in the domain">domain_digits</div>
77
+ <div class="feature-chip" data-tip="Number of slashes (/) in the path component">path_slashes</div>
78
+ <div class="feature-chip" data-tip="Number of dots (.) in the path component">path_dots</div>
79
+ <div class="feature-chip" data-tip="Number of digit characters in the path">path_digits</div>
80
+ </div>
81
+
82
+ <div class="feature-category-label">Character Ratios</div>
83
+ <div class="feature-grid">
84
+ <div class="feature-chip" data-tip="Proportion of digit characters to total URL length">digit_ratio_url</div>
85
+ <div class="feature-chip" data-tip="Proportion of letter characters to total URL length">letter_ratio_url</div>
86
+ <div class="feature-chip" data-tip="Proportion of special (non-alphanumeric) characters in URL">special_char_ratio</div>
87
+ <div class="feature-chip" data-tip="Proportion of digits in the domain name">digit_ratio_domain</div>
88
+ <div class="feature-chip" data-tip="Proportion of symbols (hyphens, underscores, dots) in domain">symbol_ratio_domain</div>
89
+ </div>
90
+
91
+ <div class="feature-category-label">Domain Structure</div>
92
+ <div class="feature-grid">
93
+ <div class="feature-chip" data-tip="Number of subdomains (e.g. sub.example.com = 1)">num_subdomains</div>
94
+ <div class="feature-chip" data-tip="Total number of dot-separated domain parts">num_domain_parts</div>
95
+ <div class="feature-chip" data-tip="Character length of the top-level domain (e.g. com=3)">tld_length</div>
96
+ <div class="feature-chip" data-tip="Character length of the second-level domain">sld_length</div>
97
+ <div class="feature-chip" data-tip="Length of the longest dot-separated domain part">longest_domain_part</div>
98
+ <div class="feature-chip" data-tip="Average length of all domain parts">avg_domain_part_len</div>
99
+ <div class="feature-chip" data-tip="1 if any domain part exceeds 20 characters">longest_part_gt_20</div>
100
+ <div class="feature-chip" data-tip="1 if any domain part exceeds 30 characters">longest_part_gt_30</div>
101
+ <div class="feature-chip" data-tip="1 if any domain part exceeds 40 characters">longest_part_gt_40</div>
102
+ <div class="feature-chip" data-tip="1 if TLD is suspicious (.tk, .ml, .xyz, .top, .zip, etc.)">has_suspicious_tld</div>
103
+ <div class="feature-chip" data-tip="1 if TLD is well-known and trusted (.com, .org, .edu, etc.)">has_trusted_tld</div>
104
+ <div class="feature-chip" data-tip="1 if URL contains a port number">has_port</div>
105
+ <div class="feature-chip" data-tip="1 if URL uses a non-standard port (not 80 or 443)">has_non_std_port</div>
106
+ <div class="feature-chip" data-tip="Composite randomness score of the domain (0-1)">domain_randomness_score</div>
107
+ <div class="feature-chip" data-tip="Consonant clustering score of the SLD — random strings have high clusters">sld_consonant_cluster_score</div>
108
+ <div class="feature-chip" data-tip="1 if SLD contains keyboard walk patterns (qwerty, asdfgh)">sld_keyboard_pattern</div>
109
+ <div class="feature-chip" data-tip="1 if SLD contains a common English word (4+ characters)">sld_has_dictionary_word</div>
110
+ <div class="feature-chip" data-tip="Score based on vowel/consonant alternation — real words are more pronounceable">sld_pronounceability_score</div>
111
+ <div class="feature-chip" data-tip="1 if digits appear at suspicious positions in the SLD (start or end)">domain_digit_position_suspicious</div>
112
+ </div>
113
+
114
+ <div class="feature-category-label">Path Analysis</div>
115
+ <div class="feature-grid">
116
+ <div class="feature-chip" data-tip="Directory depth of the URL path (number of segments)">path_depth</div>
117
+ <div class="feature-chip" data-tip="Length of the longest path segment between slashes">max_path_segment_len</div>
118
+ <div class="feature-chip" data-tip="Average length of path segments">avg_path_segment_len</div>
119
+ <div class="feature-chip" data-tip="1 if the URL path has a file extension">has_extension</div>
120
+ <div class="feature-chip" data-tip="Category of file extension: 0=none, 1=document, 2=media, 3=executable, 4=web, 5=other">extension_category</div>
121
+ <div class="feature-chip" data-tip="1 if extension is suspicious (.exe, .bat, .cmd, .scr, .vbs, .ps1)">has_suspicious_extension</div>
122
+ <div class="feature-chip" data-tip="1 if extension is specifically .exe">has_exe</div>
123
+ <div class="feature-chip" data-tip="1 if path contains double slash (//) — possible redirect trick">has_double_slash</div>
124
+ <div class="feature-chip" data-tip="1 if a brand name appears in path but not in domain — impersonation signal">path_has_brand_not_domain</div>
125
+ <div class="feature-chip" data-tip="1 if path contains an IP address pattern">path_has_ip_pattern</div>
126
+ <div class="feature-chip" data-tip="1 if a document extension + 'download' keyword appear together">suspicious_path_extension_combo</div>
127
+ </div>
128
+
129
+ <div class="feature-category-label">Query String</div>
130
+ <div class="feature-grid">
131
+ <div class="feature-chip" data-tip="Number of query parameters">num_params</div>
132
+ <div class="feature-chip" data-tip="1 if URL has a query string">has_query</div>
133
+ <div class="feature-chip" data-tip="Total character length of all query parameter values">query_value_length</div>
134
+ <div class="feature-chip" data-tip="Length of the longest query parameter">max_param_len</div>
135
+ <div class="feature-chip" data-tip="1 if any query parameter value looks like a URL — possible redirect">query_has_url</div>
136
+ </div>
137
+
138
+ <div class="feature-category-label">Statistical &amp; Entropy</div>
139
+ <div class="feature-grid">
140
+ <div class="feature-chip" data-tip="Shannon entropy of the full URL — random/phishing URLs have higher entropy">url_entropy</div>
141
+ <div class="feature-chip" data-tip="Shannon entropy of the domain name">domain_entropy</div>
142
+ <div class="feature-chip" data-tip="Shannon entropy of the path component">path_entropy</div>
143
+ <div class="feature-chip" data-tip="Longest run of consecutive digit characters">max_consecutive_digits</div>
144
+ <div class="feature-chip" data-tip="Longest run of consecutive letter characters">max_consecutive_chars</div>
145
+ <div class="feature-chip" data-tip="Longest run of consecutive consonants in domain">max_consecutive_consonants</div>
146
+ <div class="feature-chip" data-tip="Rate of adjacent character repetitions (aa, bb, etc.)">char_repeat_rate</div>
147
+ <div class="feature-chip" data-tip="Ratio of unique bigrams — lower = more repetitive URL">unique_bigram_ratio</div>
148
+ <div class="feature-chip" data-tip="Ratio of unique trigrams — lower = more repetitive URL">unique_trigram_ratio</div>
149
+ <div class="feature-chip" data-tip="Ratio of unique characters to total characters in the SLD (0–1)">sld_letter_diversity</div>
150
+ <div class="feature-chip" data-tip="1 if domain contains both digits and letters">domain_has_numbers_letters</div>
151
+ <div class="feature-chip" data-tip="Composite score (0–1) combining URL length, dots, hyphens, slashes, and entropy">url_complexity_score</div>
152
+ </div>
153
+
154
+ <div class="feature-category-label">Security Indicators</div>
155
+ <div class="feature-grid">
156
+ <div class="feature-chip" data-tip="1 if domain is an IP address instead of a hostname">has_ip_address</div>
157
+ <div class="feature-chip" data-tip="1 if URL contains @ symbol — can trick browsers into treating text before @ as user info">has_at_symbol</div>
158
+ <div class="feature-chip" data-tip="1 if double-slash redirect pattern found in path">has_redirect</div>
159
+ <div class="feature-chip" data-tip="1 if domain is a known URL shortener (bit.ly, t.co, etc.)">is_shortened</div>
160
+ <div class="feature-chip" data-tip="1 if hosted on a free hosting service (000webhostapp, freehosting, etc.)">is_free_hosting</div>
161
+ <div class="feature-chip" data-tip="1 if hosted on a free platform (github.io, vercel.app, netlify.app, etc.)">is_free_platform</div>
162
+ <div class="feature-chip" data-tip="Length of subdomain on free platforms — long random subdomains are suspicious">platform_subdomain_length</div>
163
+ <div class="feature-chip" data-tip="1 if subdomain matches UUID-like pattern (common on Replit, Firebase)">has_uuid_subdomain</div>
164
+ <div class="feature-chip" data-tip="1 if URL uses HTTP instead of HTTPS">is_http</div>
165
+ </div>
166
+
167
+ <div class="feature-category-label">Keywords &amp; Brand Detection</div>
168
+ <div class="feature-grid">
169
+ <div class="feature-chip" data-tip="Count of phishing keywords (login, verify, secure, etc.) in URL">num_phishing_keywords</div>
170
+ <div class="feature-chip" data-tip="1 if phishing keywords found in the domain">phishing_in_domain</div>
171
+ <div class="feature-chip" data-tip="1 if phishing keywords found in the path">phishing_in_path</div>
172
+ <div class="feature-chip" data-tip="Number of recognized brand names in the URL">num_brands</div>
173
+ <div class="feature-chip" data-tip="1 if a brand name appears in the domain">brand_in_domain</div>
174
+ <div class="feature-chip" data-tip="1 if a brand name appears in the path">brand_in_path</div>
175
+ <div class="feature-chip" data-tip="Score for brand impersonation — brand in URL but not the real domain">brand_impersonation</div>
176
+ <div class="feature-chip" data-tip="1 if 'login' appears in URL">has_login</div>
177
+ <div class="feature-chip" data-tip="1 if 'account' appears in URL">has_account</div>
178
+ <div class="feature-chip" data-tip="1 if 'verify' appears in URL">has_verify</div>
179
+ <div class="feature-chip" data-tip="1 if 'secure' appears in URL">has_secure</div>
180
+ <div class="feature-chip" data-tip="1 if 'update' appears in URL">has_update</div>
181
+ <div class="feature-chip" data-tip="1 if 'bank' appears in URL">has_bank</div>
182
+ <div class="feature-chip" data-tip="1 if 'password' or 'passwd' appears in URL">has_password</div>
183
+ <div class="feature-chip" data-tip="1 if 'suspend' appears in URL">has_suspend</div>
184
+ <div class="feature-chip" data-tip="1 if 'webscr' appears in URL — common in PayPal phishing">has_webscr</div>
185
+ <div class="feature-chip" data-tip="1 if 'cmd=' or '/cmd/' appears in URL">has_cmd</div>
186
+ <div class="feature-chip" data-tip="1 if 'cgi-bin' or '.cgi' appears in URL">has_cgi</div>
187
+ <div class="feature-chip" data-tip="1 if brand name in subdomain but not main domain — spoofing pattern">brand_in_subdomain_not_domain</div>
188
+ <div class="feature-chip" data-tip="1 if multiple different brand names detected in URL">multiple_brands_in_url</div>
189
+ <div class="feature-chip" data-tip="1 if brand name combined with hyphen (e.g. paypal-login.com)">brand_with_hyphen</div>
190
+ <div class="feature-chip" data-tip="1 if brand found in domain with suspicious TLD">suspicious_brand_tld</div>
191
+ <div class="feature-chip" data-tip="1 if brand name + phishing keyword both present">brand_keyword_combo</div>
192
+ </div>
193
+
194
+ <div class="feature-category-label">Encoding &amp; Obfuscation</div>
195
+ <div class="feature-grid">
196
+ <div class="feature-chip" data-tip="1 if URL contains percent-encoded characters">has_url_encoding</div>
197
+ <div class="feature-chip" data-tip="Number of percent-encoded sequences in URL">encoding_count</div>
198
+ <div class="feature-chip" data-tip="Difference in length between encoded and decoded URL">encoding_diff</div>
199
+ <div class="feature-chip" data-tip="1 if domain contains Punycode (xn-- prefix) — internationalized domain">has_punycode</div>
200
+ <div class="feature-chip" data-tip="1 if URL contains non-ASCII Unicode characters">has_unicode</div>
201
+ <div class="feature-chip" data-tip="1 if URL contains hexadecimal string (0x...)">has_hex_string</div>
202
+ <div class="feature-chip" data-tip="1 if URL contains a Base64-like string (20+ alphanumeric chars with +/=)">has_base64</div>
203
+ <div class="feature-chip" data-tip="1 if domain contains look-alike patterns (rn, vv, cl, 0, 1) that mimic other characters">has_lookalike_chars</div>
204
+ <div class="feature-chip" data-tip="Score for mixed Unicode scripts in domain — homograph attack indicator">mixed_script_score</div>
205
+ <div class="feature-chip" data-tip="Risk score for homograph attacks targeting brand names">homograph_brand_risk</div>
206
+ <div class="feature-chip" data-tip="1 if IDN homograph score exceeds 0.5 threshold">suspected_idn_homograph</div>
207
+ <div class="feature-chip" data-tip="1 if URL contains double percent-encoding (%% or %25)">double_encoding</div>
208
+ <div class="feature-chip" data-tip="1 if percent-encoding found specifically in the domain">encoding_in_domain</div>
209
+ <div class="feature-chip" data-tip="Count of suspicious Unicode characters (RTL override, zero-width, BOM), capped at 5">suspicious_unicode_category</div>
210
+ </div>
211
+ </div>
212
+ </section>
213
+
214
+ <!-- HTML FEATURES -->
215
+ <section class="section">
216
+ <div class="section-title collapsible-toggle" onclick="toggleFeatures(this)">
217
+ RAW HTML Features <span class="feature-count">77 raw features + 23 engineered</span>
218
+ <span class="toggle-icon">+</span>
219
+ </div>
220
+ <div class="collapsible-content">
221
+ <div class="section-subtitle">All features extracted from HTML source and DOM structure. Hover over any feature to see its description.</div>
222
+
223
+ <div class="feature-category-label">Document Size &amp; Text</div>
224
+ <div class="feature-grid">
225
+ <div class="feature-chip" data-tip="Maximum nesting depth of DOM elements">dom_depth</div>
226
+ <div class="feature-chip" data-tip="Total character length of the raw HTML">html_length</div>
227
+ <div class="feature-chip" data-tip="Total length of visible extracted text">text_length</div>
228
+ <div class="feature-chip" data-tip="Number of words extracted from page text">num_words</div>
229
+ <div class="feature-chip" data-tip="Ratio of text content length to full HTML length">text_to_html_ratio</div>
230
+ <div class="feature-chip" data-tip="Total character length of inline CSS styles">inline_css_length</div>
231
+ <div class="feature-chip" data-tip="Total number of HTML tags in the document">num_tags</div>
232
+ </div>
233
+
234
+ <div class="feature-category-label">Metadata &amp; Page Identity</div>
235
+ <div class="feature-grid">
236
+ <div class="feature-chip" data-tip="1 if HTML title tag is present">has_title</div>
237
+ <div class="feature-chip" data-tip="1 if meta description tag is present">has_description</div>
238
+ <div class="feature-chip" data-tip="1 if meta keywords tag is present">has_keywords</div>
239
+ <div class="feature-chip" data-tip="1 if author metadata is present">has_author</div>
240
+ <div class="feature-chip" data-tip="1 if copyright text or metadata is detected">has_copyright</div>
241
+ <div class="feature-chip" data-tip="1 if viewport meta tag is present">has_viewport</div>
242
+ <div class="feature-chip" data-tip="1 if favicon link is declared">has_favicon</div>
243
+ <div class="feature-chip" data-tip="Number of meta tags in the page head">num_meta_tags</div>
244
+ </div>
245
+
246
+ <div class="feature-category-label">DOM Elements &amp; Layout</div>
247
+ <div class="feature-grid">
248
+ <div class="feature-chip" data-tip="Number of div elements">num_divs</div>
249
+ <div class="feature-chip" data-tip="Number of span elements">num_spans</div>
250
+ <div class="feature-chip" data-tip="Number of paragraph tags">num_paragraphs</div>
251
+ <div class="feature-chip" data-tip="Number of heading tags (h1-h6)">num_headings</div>
252
+ <div class="feature-chip" data-tip="Number of list containers (ul, ol)">num_lists</div>
253
+ <div class="feature-chip" data-tip="Number of table elements">num_tables</div>
254
+ <div class="feature-chip" data-tip="Number of image elements">num_images</div>
255
+ <div class="feature-chip" data-tip="Number of iframe elements">num_iframes</div>
256
+ <div class="feature-chip" data-tip="Number of hidden iframes">num_hidden_iframes</div>
257
+ <div class="feature-chip" data-tip="Number of images embedded as data URIs">num_data_uri_images</div>
258
+ <div class="feature-chip" data-tip="Number of linked CSS files">num_css_files</div>
259
+ <div class="feature-chip" data-tip="Number of script tags">num_scripts</div>
260
+ <div class="feature-chip" data-tip="Number of inline script blocks">num_inline_scripts</div>
261
+ <div class="feature-chip" data-tip="Number of inline style blocks or style attributes">num_inline_styles</div>
262
+ <div class="feature-chip" data-tip="Number of input fields in forms">num_input_fields</div>
263
+ </div>
264
+
265
+ <div class="feature-category-label">Link &amp; Resource Analysis</div>
266
+ <div class="feature-grid">
267
+ <div class="feature-chip" data-tip="Total number of links">num_links</div>
268
+ <div class="feature-chip" data-tip="Number of internal links pointing to same domain">num_internal_links</div>
269
+ <div class="feature-chip" data-tip="Number of external links pointing to other domains">num_external_links</div>
270
+ <div class="feature-chip" data-tip="Ratio of external links to all links">ratio_external_links</div>
271
+ <div class="feature-chip" data-tip="Count of distinct external domains referenced">num_unique_external_domains</div>
272
+ <div class="feature-chip" data-tip="Number of mailto links">num_mailto_links</div>
273
+ <div class="feature-chip" data-tip="Number of javascript: pseudo-links">num_javascript_links</div>
274
+ <div class="feature-chip" data-tip="Number of links pointing to IP-based hosts">num_ip_based_links</div>
275
+ <div class="feature-chip" data-tip="Number of links using suspicious top-level domains">num_suspicious_tld_links</div>
276
+ <div class="feature-chip" data-tip="Number of links with empty or placeholder href values">num_empty_links</div>
277
+ <div class="feature-chip" data-tip="Anchor text points to content unrelated to destination">num_anchor_text_mismatch</div>
278
+ <div class="feature-chip" data-tip="Number of external stylesheet references">num_external_css</div>
279
+ <div class="feature-chip" data-tip="Number of externally loaded images">num_external_images</div>
280
+ <div class="feature-chip" data-tip="Number of externally loaded JavaScript files">num_external_scripts</div>
281
+ </div>
282
+
283
+ <div class="feature-category-label">Forms &amp; Inputs</div>
284
+ <div class="feature-grid">
285
+ <div class="feature-chip" data-tip="1 if at least one form tag exists">has_form</div>
286
+ <div class="feature-chip" data-tip="1 if a login-style form is detected">has_login_form</div>
287
+ <div class="feature-chip" data-tip="Number of form elements">num_forms</div>
288
+ <div class="feature-chip" data-tip="Number of email-type fields">num_email_fields</div>
289
+ <div class="feature-chip" data-tip="Number of password fields">num_password_fields</div>
290
+ <div class="feature-chip" data-tip="Number of text input fields">num_text_fields</div>
291
+ <div class="feature-chip" data-tip="Number of submit buttons">num_submit_buttons</div>
292
+ <div class="feature-chip" data-tip="Number of hidden input fields">num_hidden_fields</div>
293
+ <div class="feature-chip" data-tip="Number of forms missing associated labels">num_forms_without_labels</div>
294
+ <div class="feature-chip" data-tip="Number of forms with empty action attribute">num_empty_form_actions</div>
295
+ <div class="feature-chip" data-tip="Number of forms submitting to external domains">num_external_form_actions</div>
296
+ <div class="feature-chip" data-tip="1 if password form submits to external domain">password_with_external_action</div>
297
+ </div>
298
+
299
+ <div class="feature-category-label">Scripts &amp; Dynamic Behavior</div>
300
+ <div class="feature-grid">
301
+ <div class="feature-chip" data-tip="1 if JavaScript eval() is used">has_eval</div>
302
+ <div class="feature-chip" data-tip="1 if JavaScript escape() is used">has_escape</div>
303
+ <div class="feature-chip" data-tip="1 if JavaScript unescape() is used">has_unescape</div>
304
+ <div class="feature-chip" data-tip="1 if atob() decoding function is present">has_atob</div>
305
+ <div class="feature-chip" data-tip="1 if Base64-like content or decoding usage is detected">has_base64</div>
306
+ <div class="feature-chip" data-tip="1 if String.fromCharCode usage is present">has_fromcharcode</div>
307
+ <div class="feature-chip" data-tip="1 if document.write() is used">has_document_write</div>
308
+ <div class="feature-chip" data-tip="1 if window.open() is used">has_window_open</div>
309
+ <div class="feature-chip" data-tip="1 if location.replace() redirects are used">has_location_replace</div>
310
+ <div class="feature-chip" data-tip="1 if meta refresh redirect is present">has_meta_refresh</div>
311
+ <div class="feature-chip" data-tip="Number of onclick event handlers">num_onclick_events</div>
312
+ <div class="feature-chip" data-tip="Number of onload event handlers">num_onload_events</div>
313
+ <div class="feature-chip" data-tip="Number of onerror event handlers">num_onerror_events</div>
314
+ </div>
315
+
316
+ <div class="feature-category-label">Visibility &amp; Interaction Tricks</div>
317
+ <div class="feature-grid">
318
+ <div class="feature-chip" data-tip="1 if display:none usage is detected">has_display_none</div>
319
+ <div class="feature-chip" data-tip="1 if visibility:hidden usage is detected">has_visibility_hidden</div>
320
+ <div class="feature-chip" data-tip="1 if right-click is disabled by script">has_right_click_disabled</div>
321
+ <div class="feature-chip" data-tip="1 if status bar text customization is attempted">has_status_bar_customization</div>
322
+ </div>
323
+
324
+ <div class="feature-category-label">Contact &amp; Social Engineering Signals</div>
325
+ <div class="feature-grid">
326
+ <div class="feature-chip" data-tip="1 if raw email address patterns appear in HTML">has_email_address</div>
327
+ <div class="feature-chip" data-tip="1 if phone number patterns appear in HTML">has_phone_number</div>
328
+ <div class="feature-chip" data-tip="Number of known brand name mentions in HTML text">num_brand_mentions</div>
329
+ <div class="feature-chip" data-tip="Number of urgency words (urgent, verify, immediately, etc.)">num_urgency_keywords</div>
330
+ </div>
331
+
332
+ <div class="feature-category-label">Ratios &amp; Proportions</div>
333
+ <div class="feature-grid">
334
+ <div class="feature-chip" data-tip="Number of forms divided by number of input fields — phishing sites often have few forms with many inputs">forms_to_inputs_ratio</div>
335
+ <div class="feature-chip" data-tip="Proportion of external links to total links — high ratio indicates off-site redirection">external_to_total_links</div>
336
+ <div class="feature-chip" data-tip="Number of scripts divided by total HTML tags — high ratio suggests heavy JavaScript reliance">scripts_to_tags_ratio</div>
337
+ <div class="feature-chip" data-tip="Ratio of hidden to visible input fields — hidden fields often used for tracking or obfuscation">hidden_to_visible_inputs</div>
338
+ <div class="feature-chip" data-tip="Number of password fields divided by total input fields — phishing forms maximize password collection">password_to_inputs_ratio</div>
339
+ <div class="feature-chip" data-tip="Proportion of empty/placeholder links to all links — broken or disguised navigation is suspicious">empty_to_total_links</div>
340
+ <div class="feature-chip" data-tip="Number of images divided by total HTML tags — low image ratio suggests text-heavy phishing pages">images_to_tags_ratio</div>
341
+ <div class="feature-chip" data-tip="Number of iframes divided by total HTML tags — iframes can hide malicious content or redirects">iframes_to_tags_ratio</div>
342
+ </div>
343
+
344
+ <div class="feature-category-label">Co-occurrence Interactions</div>
345
+ <div class="feature-grid">
346
+ <div class="feature-chip" data-tip="1 if forms exist AND password fields are present — core phishing indicator">forms_with_passwords</div>
347
+ <div class="feature-chip" data-tip="Count of external JavaScript files — malicious JS often hosted externally">external_scripts_links</div>
348
+ <div class="feature-chip" data-tip="1 if urgency keywords AND forms both present — common social engineering pattern">urgency_with_forms</div>
349
+ <div class="feature-chip" data-tip="1 if brand names AND forms both present — brand impersonation with credential harvesting">brand_with_forms</div>
350
+ <div class="feature-chip" data-tip="1 if iframes AND script tags both present — hidden iframes + JS often for malware">iframes_with_scripts</div>
351
+ <div class="feature-chip" data-tip="1 if hidden inputs AND external resources both present — obfuscated tracking/redirects">hidden_with_external</div>
352
+ </div>
353
+
354
+ <div class="feature-category-label">Content Density</div>
355
+ <div class="feature-grid">
356
+ <div class="feature-chip" data-tip="Ratio of visible text length to DOM depth — legitimate sites have better content distribution">content_density</div>
357
+ <div class="feature-chip" data-tip="Number of form elements divided by total words — form-heavy pages are suspicious">form_density</div>
358
+ <div class="feature-chip" data-tip="Number of script tags divided by number of forms — scripts per form indicates obfuscation level">scripts_per_form</div>
359
+ <div class="feature-chip" data-tip="Number of links divided by total words — link density compared to content volume">links_per_word</div>
360
+ </div>
361
+
362
+ <div class="feature-category-label">Composite Risk Scores</div>
363
+ <div class="feature-grid">
364
+ <div class="feature-chip" data-tip="Composite score (0–1) combining form presence, password fields, external links, and scripts">phishing_risk_score</div>
365
+ <div class="feature-chip" data-tip="Score (0–1) specifically for form-based threats: password fields, hidden inputs, external actions">form_risk_score</div>
366
+ <div class="feature-chip" data-tip="Score (0–1) measuring obfuscation techniques: hidden fields, eval usage, encoding, meta refreshes">obfuscation_score</div>
367
+ <div class="feature-chip" data-tip="Score (0–1) measuring legitimacy signals: metadata presence, proper structure, internal links">legitimacy_score</div>
368
+ </div>
369
+
370
+ <div class="feature-category-label">Boolean Flags</div>
371
+ <div class="feature-grid">
372
+ <div class="feature-chip" data-tip="1 if any combination of suspicious elements detected (hidden inputs + eval, eval + meta refresh, etc.)">has_suspicious_elements</div>
373
+ </div>
374
+ </div>
375
+
376
+ </section>
377
+
378
+ <!-- TABS: MODEL CATEGORIES -->
379
+ <section class="section" style="border-bottom:none;padding-bottom:0">
380
+ <div class="tabs">
381
+ <button class="tab active" onclick="switchTab(event,'urlModels')">URL Models <span class="tab-count">3</span></button>
382
+ <button class="tab" onclick="switchTab(event,'htmlModels')">HTML Models <span class="tab-count">2</span></button>
383
+ <button class="tab" onclick="switchTab(event,'combinedModels')">Combined <span class="tab-count">2</span></button>
384
+ <button class="tab" onclick="switchTab(event,'cnnModels')">CNN <span class="tab-count">2</span></button>
385
+ <button class="tab" onclick="switchTab(event,'overview')">Comparison</button>
386
+ </div>
387
+
388
+ <!-- URL MODELS TAB -->
389
+ <div id="urlModels" class="tab-content active">
390
+ <div class="section-subtitle">
391
+ 3 models trained on 120 URL-based features extracted from the URL string structure,
392
+ domain properties, encoding analysis, and brand impersonation detection.
393
+ </div>
394
+
395
+ <!-- Logistic Regression -->
396
+ <div class="model-detail">
397
+ <div class="model-detail-header">
398
+ <div class="model-detail-name">Logistic Regression</div>
399
+ <div class="model-detail-type">Baseline</div>
400
+ </div>
401
+ <div class="metrics-grid">
402
+ <div class="metric-card">
403
+ <div class="metric-value">93.71%</div>
404
+ <div class="metric-label">Accuracy</div>
405
+ </div>
406
+ <div class="metric-card">
407
+ <div class="metric-value">95.40%</div>
408
+ <div class="metric-label">Precision</div>
409
+ </div>
410
+ <div class="metric-card">
411
+ <div class="metric-value">91.84%</div>
412
+ <div class="metric-label">Recall</div>
413
+ </div>
414
+ <div class="metric-card">
415
+ <div class="metric-value">93.59%</div>
416
+ <div class="metric-label">F1-Score</div>
417
+ </div>
418
+ <div class="metric-card">
419
+ <div class="metric-value highlight">0.9789</div>
420
+ <div class="metric-label">ROC-AUC</div>
421
+ </div>
422
+ </div>
423
+ <div class="section-title">Confusion Matrix</div>
424
+ <div class="confusion-matrix">
425
+ <div class="cm-header"></div>
426
+ <div class="cm-header">Pred Legit</div>
427
+ <div class="cm-header">Pred Phish</div>
428
+ <div class="cm-label">Actual Legit</div>
429
+ <div class="cm-cell cm-tn">10,326</div>
430
+ <div class="cm-cell cm-fp">478</div>
431
+ <div class="cm-label">Actual Phish</div>
432
+ <div class="cm-cell cm-fn">881</div>
433
+ <div class="cm-cell cm-tp">9,922</div>
434
+ </div>
435
+ </div>
436
+
437
+ <!-- Random Forest -->
438
+ <div class="model-detail">
439
+ <div class="model-detail-header">
440
+ <div class="model-detail-name">Random Forest</div>
441
+ <div class="model-detail-type">Ensemble</div>
442
+ </div>
443
+ <div class="metrics-grid">
444
+ <div class="metric-card">
445
+ <div class="metric-value">97.63%</div>
446
+ <div class="metric-label">Accuracy</div>
447
+ </div>
448
+ <div class="metric-card">
449
+ <div class="metric-value">99.01%</div>
450
+ <div class="metric-label">Precision</div>
451
+ </div>
452
+ <div class="metric-card">
453
+ <div class="metric-value">96.22%</div>
454
+ <div class="metric-label">Recall</div>
455
+ </div>
456
+ <div class="metric-card">
457
+ <div class="metric-value">97.60%</div>
458
+ <div class="metric-label">F1-Score</div>
459
+ </div>
460
+ <div class="metric-card">
461
+ <div class="metric-value highlight">0.9958</div>
462
+ <div class="metric-label">ROC-AUC</div>
463
+ </div>
464
+ </div>
465
+ <div class="section-title">Confusion Matrix</div>
466
+ <div class="confusion-matrix">
467
+ <div class="cm-header"></div>
468
+ <div class="cm-header">Pred Legit</div>
469
+ <div class="cm-header">Pred Phish</div>
470
+ <div class="cm-label">Actual Legit</div>
471
+ <div class="cm-cell cm-tn">10,700</div>
472
+ <div class="cm-cell cm-fp">104</div>
473
+ <div class="cm-label">Actual Phish</div>
474
+ <div class="cm-cell cm-fn">408</div>
475
+ <div class="cm-cell cm-tp">10,395</div>
476
+ </div>
477
+ <div class="subsection">
478
+ <div class="section-title">Top 20 Features by Importance</div>
479
+ <div class="features-list">
480
+ <div class="feature-row"><span class="feature-rank">1</span><span class="feature-name">domain_length</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:100%"></div></div><span class="feature-importance">0.0500</span></div>
481
+ <div class="feature-row"><span class="feature-rank">2</span><span class="feature-name">num_domain_parts</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:94%"></div></div><span class="feature-importance">0.0471</span></div>
482
+ <div class="feature-row"><span class="feature-rank">3</span><span class="feature-name">domain_dots</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:91%"></div></div><span class="feature-importance">0.0453</span></div>
483
+ <div class="feature-row"><span class="feature-rank">4</span><span class="feature-name">num_subdomains</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:79%"></div></div><span class="feature-importance">0.0393</span></div>
484
+ <div class="feature-row"><span class="feature-rank">5</span><span class="feature-name">num_dots</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:67%"></div></div><span class="feature-importance">0.0337</span></div>
485
+ <div class="feature-row"><span class="feature-rank">6</span><span class="feature-name">domain_length_category</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:67%"></div></div><span class="feature-importance">0.0335</span></div>
486
+ <div class="feature-row"><span class="feature-rank">7</span><span class="feature-name">symbol_ratio_domain</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:65%"></div></div><span class="feature-importance">0.0324</span></div>
487
+ <div class="feature-row"><span class="feature-rank">8</span><span class="feature-name">digit_ratio_url</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:60%"></div></div><span class="feature-importance">0.0298</span></div>
488
+ <div class="feature-row"><span class="feature-rank">9</span><span class="feature-name">path_length</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:59%"></div></div><span class="feature-importance">0.0297</span></div>
489
+ <div class="feature-row"><span class="feature-rank">10</span><span class="feature-name">avg_domain_part_len</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:58%"></div></div><span class="feature-importance">0.0292</span></div>
490
+ <div class="feature-row"><span class="feature-rank">11</span><span class="feature-name">num_digits_url</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:57%"></div></div><span class="feature-importance">0.0283</span></div>
491
+ <div class="feature-row"><span class="feature-rank">12</span><span class="feature-name">domain_entropy</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:56%"></div></div><span class="feature-importance">0.0282</span></div>
492
+ <div class="feature-row"><span class="feature-rank">13</span><span class="feature-name">url_entropy</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:53%"></div></div><span class="feature-importance">0.0266</span></div>
493
+ <div class="feature-row"><span class="feature-rank">14</span><span class="feature-name">max_consecutive_digits</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:49%"></div></div><span class="feature-importance">0.0246</span></div>
494
+ <div class="feature-row"><span class="feature-rank">15</span><span class="feature-name">special_char_ratio</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:48%"></div></div><span class="feature-importance">0.0242</span></div>
495
+ <div class="feature-row"><span class="feature-rank">16</span><span class="feature-name">is_shortened</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:47%"></div></div><span class="feature-importance">0.0237</span></div>
496
+ <div class="feature-row"><span class="feature-rank">17</span><span class="feature-name">path_entropy</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:45%"></div></div><span class="feature-importance">0.0225</span></div>
497
+ <div class="feature-row"><span class="feature-rank">18</span><span class="feature-name">max_path_segment_len</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:43%"></div></div><span class="feature-importance">0.0215</span></div>
498
+ <div class="feature-row"><span class="feature-rank">19</span><span class="feature-name">num_letters_url</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:41%"></div></div><span class="feature-importance">0.0206</span></div>
499
+ <div class="feature-row"><span class="feature-rank">20</span><span class="feature-name">path_slashes</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:40%"></div></div><span class="feature-importance">0.0201</span></div>
500
+ </div>
501
+ </div>
502
+ </div>
503
+
504
+ <!-- XGBoost -->
505
+ <div class="model-detail">
506
+ <div class="model-detail-header">
507
+ <div class="model-detail-name">XGBoost</div>
508
+ <div class="model-detail-type">Gradient Boosting</div>
509
+ </div>
510
+ <div class="metrics-grid">
511
+ <div class="metric-card">
512
+ <div class="metric-value">97.85%</div>
513
+ <div class="metric-label">Accuracy</div>
514
+ </div>
515
+ <div class="metric-card">
516
+ <div class="metric-value">99.00%</div>
517
+ <div class="metric-label">Precision</div>
518
+ </div>
519
+ <div class="metric-card">
520
+ <div class="metric-value">96.68%</div>
521
+ <div class="metric-label">Recall</div>
522
+ </div>
523
+ <div class="metric-card">
524
+ <div class="metric-value">97.82%</div>
525
+ <div class="metric-label">F1-Score</div>
526
+ </div>
527
+ <div class="metric-card">
528
+ <div class="metric-value highlight">0.9953</div>
529
+ <div class="metric-label">ROC-AUC</div>
530
+ </div>
531
+ </div>
532
+ <div class="section-title">Confusion Matrix</div>
533
+ <div class="confusion-matrix">
534
+ <div class="cm-header"></div>
535
+ <div class="cm-header">Pred Legit</div>
536
+ <div class="cm-header">Pred Phish</div>
537
+ <div class="cm-label">Actual Legit</div>
538
+ <div class="cm-cell cm-tn">10,698</div>
539
+ <div class="cm-cell cm-fp">106</div>
540
+ <div class="cm-label">Actual Phish</div>
541
+ <div class="cm-cell cm-fn">359</div>
542
+ <div class="cm-cell cm-tp">10,444</div>
543
+ </div>
544
+ <div class="subsection">
545
+ <div class="section-title">Top 20 Features by Importance</div>
546
+ <div class="features-list">
547
+ <div class="feature-row"><span class="feature-rank">1</span><span class="feature-name">domain_dots</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:100%"></div></div><span class="feature-importance">0.2514</span></div>
548
+ <div class="feature-row"><span class="feature-rank">2</span><span class="feature-name">is_shortened</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:60%"></div></div><span class="feature-importance">0.1519</span></div>
549
+ <div class="feature-row"><span class="feature-rank">3</span><span class="feature-name">num_subdomains</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:57%"></div></div><span class="feature-importance">0.1423</span></div>
550
+ <div class="feature-row"><span class="feature-rank">4</span><span class="feature-name">num_domain_parts</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:20%"></div></div><span class="feature-importance">0.0492</span></div>
551
+ <div class="feature-row"><span class="feature-rank">5</span><span class="feature-name">multiple_brands_in_url</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:14%"></div></div><span class="feature-importance">0.0350</span></div>
552
+ <div class="feature-row"><span class="feature-rank">6</span><span class="feature-name">is_free_platform</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:11%"></div></div><span class="feature-importance">0.0281</span></div>
553
+ <div class="feature-row"><span class="feature-rank">7</span><span class="feature-name">domain_hyphens</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:10%"></div></div><span class="feature-importance">0.0252</span></div>
554
+ <div class="feature-row"><span class="feature-rank">8</span><span class="feature-name">path_digits</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:6%"></div></div><span class="feature-importance">0.0149</span></div>
555
+ <div class="feature-row"><span class="feature-rank">9</span><span class="feature-name">is_http</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:6%"></div></div><span class="feature-importance">0.0139</span></div>
556
+ <div class="feature-row"><span class="feature-rank">10</span><span class="feature-name">platform_subdomain_length</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:5%"></div></div><span class="feature-importance">0.0123</span></div>
557
+ <div class="feature-row"><span class="feature-rank">11</span><span class="feature-name">avg_domain_part_len</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:5%"></div></div><span class="feature-importance">0.0121</span></div>
558
+ <div class="feature-row"><span class="feature-rank">12</span><span class="feature-name">path_slashes</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:5%"></div></div><span class="feature-importance">0.0119</span></div>
559
+ <div class="feature-row"><span class="feature-rank">13</span><span class="feature-name">brand_in_path</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:4%"></div></div><span class="feature-importance">0.0111</span></div>
560
+ <div class="feature-row"><span class="feature-rank">14</span><span class="feature-name">domain_length_category</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:4%"></div></div><span class="feature-importance">0.0108</span></div>
561
+ <div class="feature-row"><span class="feature-rank">15</span><span class="feature-name">domain_length</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:4%"></div></div><span class="feature-importance">0.0106</span></div>
562
+ <div class="feature-row"><span class="feature-rank">16</span><span class="feature-name">symbol_ratio_domain</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:4%"></div></div><span class="feature-importance">0.0100</span></div>
563
+ <div class="feature-row"><span class="feature-rank">17</span><span class="feature-name">encoding_diff</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:3%"></div></div><span class="feature-importance">0.0087</span></div>
564
+ <div class="feature-row"><span class="feature-rank">18</span><span class="feature-name">num_brands</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:3%"></div></div><span class="feature-importance">0.0077</span></div>
565
+ <div class="feature-row"><span class="feature-rank">19</span><span class="feature-name">tld_length</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:3%"></div></div><span class="feature-importance">0.0071</span></div>
566
+ <div class="feature-row"><span class="feature-rank">20</span><span class="feature-name">digit_ratio_url</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:3%"></div></div><span class="feature-importance">0.0067</span></div>
567
+ </div>
568
+ </div>
569
+ </div>
570
+ </div>
571
+
572
+ <!-- HTML MODELS TAB -->
573
+ <div id="htmlModels" class="tab-content">
574
+ <div class="section-subtitle">
575
+ 2 models trained on 100 HTML-based features extracted from the page structure,
576
+ forms, scripts, links, and content analysis of downloaded web pages.
577
+ </div>
578
+
579
+ <!-- Random Forest HTML -->
580
+ <div class="model-detail">
581
+ <div class="model-detail-header">
582
+ <div class="model-detail-name">Random Forest HTML</div>
583
+ <div class="model-detail-type">Ensemble</div>
584
+ </div>
585
+ <div class="metrics-grid">
586
+ <div class="metric-card">
587
+ <div class="metric-value">89.65%</div>
588
+ <div class="metric-label">Accuracy</div>
589
+ </div>
590
+ <div class="metric-card">
591
+ <div class="metric-value">91.78%</div>
592
+ <div class="metric-label">Precision</div>
593
+ </div>
594
+ <div class="metric-card">
595
+ <div class="metric-value">87.11%</div>
596
+ <div class="metric-label">Recall</div>
597
+ </div>
598
+ <div class="metric-card">
599
+ <div class="metric-value">89.38%</div>
600
+ <div class="metric-label">F1-Score</div>
601
+ </div>
602
+ <div class="metric-card">
603
+ <div class="metric-value highlight">0.9617</div>
604
+ <div class="metric-label">ROC-AUC</div>
605
+ </div>
606
+ <div class="metric-card">
607
+ <div class="metric-value">89.05%</div>
608
+ <div class="metric-label">CV F1 (5-fold)</div>
609
+ </div>
610
+ </div>
611
+ <div class="section-title">Confusion Matrix</div>
612
+ <div class="confusion-matrix">
613
+ <div class="cm-header"></div>
614
+ <div class="cm-header">Pred Legit</div>
615
+ <div class="cm-header">Pred Phish</div>
616
+ <div class="cm-label">Actual Legit</div>
617
+ <div class="cm-cell cm-tn">15,012</div>
618
+ <div class="cm-cell cm-fp">1,271</div>
619
+ <div class="cm-label">Actual Phish</div>
620
+ <div class="cm-cell cm-fn">2,099</div>
621
+ <div class="cm-cell cm-tp">14,184</div>
622
+ </div>
623
+ <div class="subsection">
624
+ <div class="section-title">Hyperparameters</div>
625
+ <div class="params-grid">
626
+ <div class="param-item"><span class="param-key">n_estimators</span><span class="param-value">500</span></div>
627
+ <div class="param-item"><span class="param-key">max_depth</span><span class="param-value">35</span></div>
628
+ <div class="param-item"><span class="param-key">min_samples_split</span><span class="param-value">2</span></div>
629
+ <div class="param-item"><span class="param-key">min_samples_leaf</span><span class="param-value">1</span></div>
630
+ <div class="param-item"><span class="param-key">max_features</span><span class="param-value">sqrt</span></div>
631
+ <div class="param-item"><span class="param-key">class_weight</span><span class="param-value">balanced</span></div>
632
+ </div>
633
+ </div>
634
+ </div>
635
+
636
+ <!-- XGBoost HTML -->
637
+ <div class="model-detail">
638
+ <div class="model-detail-header">
639
+ <div class="model-detail-name">XGBoost HTML</div>
640
+ <div class="model-detail-type">Gradient Boosting</div>
641
+ </div>
642
+ <div class="metrics-grid">
643
+ <div class="metric-card">
644
+ <div class="metric-value">89.07%</div>
645
+ <div class="metric-label">Accuracy</div>
646
+ </div>
647
+ <div class="metric-card">
648
+ <div class="metric-value">90.27%</div>
649
+ <div class="metric-label">Precision</div>
650
+ </div>
651
+ <div class="metric-card">
652
+ <div class="metric-value">87.56%</div>
653
+ <div class="metric-label">Recall</div>
654
+ </div>
655
+ <div class="metric-card">
656
+ <div class="metric-value">88.90%</div>
657
+ <div class="metric-label">F1-Score</div>
658
+ </div>
659
+ <div class="metric-card">
660
+ <div class="metric-value highlight">0.9590</div>
661
+ <div class="metric-label">ROC-AUC</div>
662
+ </div>
663
+ <div class="metric-card">
664
+ <div class="metric-value">88.87%</div>
665
+ <div class="metric-label">CV F1 (5-fold)</div>
666
+ </div>
667
+ </div>
668
+ <div class="section-title">Confusion Matrix</div>
669
+ <div class="confusion-matrix">
670
+ <div class="cm-header"></div>
671
+ <div class="cm-header">Pred Legit</div>
672
+ <div class="cm-header">Pred Phish</div>
673
+ <div class="cm-label">Actual Legit</div>
674
+ <div class="cm-cell cm-tn">14,747</div>
675
+ <div class="cm-cell cm-fp">1,536</div>
676
+ <div class="cm-label">Actual Phish</div>
677
+ <div class="cm-cell cm-fn">2,025</div>
678
+ <div class="cm-cell cm-tp">14,258</div>
679
+ </div>
680
+ <div class="subsection">
681
+ <div class="section-title">Hyperparameters</div>
682
+ <div class="params-grid">
683
+ <div class="param-item"><span class="param-key">n_estimators</span><span class="param-value">600</span></div>
684
+ <div class="param-item"><span class="param-key">max_depth</span><span class="param-value">8</span></div>
685
+ <div class="param-item"><span class="param-key">learning_rate</span><span class="param-value">0.05</span></div>
686
+ <div class="param-item"><span class="param-key">subsample</span><span class="param-value">0.8</span></div>
687
+ <div class="param-item"><span class="param-key">colsample_bytree</span><span class="param-value">0.8</span></div>
688
+ <div class="param-item"><span class="param-key">min_child_weight</span><span class="param-value">3</span></div>
689
+ <div class="param-item"><span class="param-key">gamma</span><span class="param-value">0.1</span></div>
690
+ <div class="param-item"><span class="param-key">reg_alpha</span><span class="param-value">0.1</span></div>
691
+ <div class="param-item"><span class="param-key">reg_lambda</span><span class="param-value">1.0</span></div>
692
+ <div class="param-item"><span class="param-key">early_stopping</span><span class="param-value">50 rounds</span></div>
693
+ </div>
694
+ </div>
695
+ </div>
696
+ </div>
697
+
698
+ <!-- COMBINED MODELS TAB -->
699
+ <div id="combinedModels" class="tab-content">
700
+ <div class="section-subtitle">
701
+ 2 models trained on 221 combined features (121 URL + 100 HTML) for maximum detection accuracy.
702
+ </div>
703
+
704
+ <!-- Random Forest Combined -->
705
+ <div class="model-detail">
706
+ <div class="model-detail-header">
707
+ <div class="model-detail-name">Random Forest Combined</div>
708
+ <div class="model-detail-type">Ensemble</div>
709
+ </div>
710
+ <div class="metrics-grid">
711
+ <div class="metric-card">
712
+ <div class="metric-value">98.60%</div>
713
+ <div class="metric-label">Accuracy</div>
714
+ </div>
715
+ <div class="metric-card">
716
+ <div class="metric-value">99.16%</div>
717
+ <div class="metric-label">Precision</div>
718
+ </div>
719
+ <div class="metric-card">
720
+ <div class="metric-value">98.02%</div>
721
+ <div class="metric-label">Recall</div>
722
+ </div>
723
+ <div class="metric-card">
724
+ <div class="metric-value">98.59%</div>
725
+ <div class="metric-label">F1-Score</div>
726
+ </div>
727
+ <div class="metric-card">
728
+ <div class="metric-value highlight">0.9990</div>
729
+ <div class="metric-label">ROC-AUC</div>
730
+ </div>
731
+ <div class="metric-card">
732
+ <div class="metric-value">98.59%</div>
733
+ <div class="metric-label">CV F1 (5-fold)</div>
734
+ </div>
735
+ </div>
736
+ <div class="section-title">Confusion Matrix</div>
737
+ <div class="confusion-matrix">
738
+ <div class="cm-header"></div>
739
+ <div class="cm-header">Pred Legit</div>
740
+ <div class="cm-header">Pred Phish</div>
741
+ <div class="cm-label">Actual Legit</div>
742
+ <div class="cm-cell cm-tn">10,680</div>
743
+ <div class="cm-cell cm-fp">89</div>
744
+ <div class="cm-label">Actual Phish</div>
745
+ <div class="cm-cell cm-fn">213</div>
746
+ <div class="cm-cell cm-tp">10,556</div>
747
+ </div>
748
+ <div class="subsection">
749
+ <div class="section-title">Feature Importance Split</div>
750
+ <div class="feature-split">
751
+ <div class="feature-split-bar">
752
+ <div class="split-url" style="width:29.1%">URL 29.1%</div>
753
+ <div class="split-html" style="width:70.9%">HTML 70.9%</div>
754
+ </div>
755
+ </div>
756
+ </div>
757
+ <div class="subsection">
758
+ <div class="section-title">Top 15 Features by Importance</div>
759
+ <div class="features-list">
760
+ <div class="feature-row"><span class="feature-rank">1</span><span class="feature-name">html_num_links</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:100%"></div></div><span class="feature-importance">0.0640</span></div>
761
+ <div class="feature-row"><span class="feature-rank">2</span><span class="feature-name">html_text_length</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:90%"></div></div><span class="feature-importance">0.0577</span></div>
762
+ <div class="feature-row"><span class="feature-rank">3</span><span class="feature-name">html_num_tags</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:75%"></div></div><span class="feature-importance">0.0479</span></div>
763
+ <div class="feature-row"><span class="feature-rank">4</span><span class="feature-name">html_num_internal_links</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:72%"></div></div><span class="feature-importance">0.0463</span></div>
764
+ <div class="feature-row"><span class="feature-rank">5</span><span class="feature-name">html_num_words</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:66%"></div></div><span class="feature-importance">0.0422</span></div>
765
+ <div class="feature-row"><span class="feature-rank">6</span><span class="feature-name">html_external_scripts_links</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:56%"></div></div><span class="feature-importance">0.0361</span></div>
766
+ <div class="feature-row"><span class="feature-rank">7</span><span class="feature-name">html_num_divs</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:46%"></div></div><span class="feature-importance">0.0297</span></div>
767
+ <div class="feature-row"><span class="feature-rank">8</span><span class="feature-name">html_num_lists</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:45%"></div></div><span class="feature-importance">0.0291</span></div>
768
+ <div class="feature-row"><span class="feature-rank">9</span><span class="feature-name">html_num_external_links</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:43%"></div></div><span class="feature-importance">0.0276</span></div>
769
+ <div class="feature-row"><span class="feature-rank">10</span><span class="feature-name">html_has_description</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:40%"></div></div><span class="feature-importance">0.0258</span></div>
770
+ <div class="feature-row"><span class="feature-rank">11</span><span class="feature-name">html_num_unique_external_domains</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:37%"></div></div><span class="feature-importance">0.0236</span></div>
771
+ <div class="feature-row"><span class="feature-rank">12</span><span class="feature-name">html_num_images</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:36%"></div></div><span class="feature-importance">0.0231</span></div>
772
+ <div class="feature-row"><span class="feature-rank">13</span><span class="feature-name">html_num_spans</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:35%"></div></div><span class="feature-importance">0.0226</span></div>
773
+ <div class="feature-row"><span class="feature-rank">14</span><span class="feature-name">html_num_headings</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:34%"></div></div><span class="feature-importance">0.0220</span></div>
774
+ <div class="feature-row"><span class="feature-rank">15</span><span class="feature-name">html_dom_depth</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:33%"></div></div><span class="feature-importance">0.0210</span></div>
775
+ </div>
776
+ </div>
777
+ <div class="subsection">
778
+ <div class="section-title">Hyperparameters</div>
779
+ <div class="params-grid">
780
+ <div class="param-item"><span class="param-key">n_estimators</span><span class="param-value">533</span></div>
781
+ <div class="param-item"><span class="param-key">max_depth</span><span class="param-value">43</span></div>
782
+ <div class="param-item"><span class="param-key">min_samples_split</span><span class="param-value">2</span></div>
783
+ <div class="param-item"><span class="param-key">max_features</span><span class="param-value">sqrt</span></div>
784
+ <div class="param-item"><span class="param-key">class_weight</span><span class="param-value">balanced</span></div>
785
+ </div>
786
+ </div>
787
+ </div>
788
+
789
+ <!-- XGBoost Combined -->
790
+ <div class="model-detail">
791
+ <div class="model-detail-header">
792
+ <div class="model-detail-name">XGBoost Combined</div>
793
+ <div class="model-detail-type">Gradient Boosting</div>
794
+ </div>
795
+ <div class="metrics-grid">
796
+ <div class="metric-card">
797
+ <div class="metric-value">99.01%</div>
798
+ <div class="metric-label">Accuracy</div>
799
+ </div>
800
+ <div class="metric-card">
801
+ <div class="metric-value">99.35%</div>
802
+ <div class="metric-label">Precision</div>
803
+ </div>
804
+ <div class="metric-card">
805
+ <div class="metric-value">98.66%</div>
806
+ <div class="metric-label">Recall</div>
807
+ </div>
808
+ <div class="metric-card">
809
+ <div class="metric-value">99.01%</div>
810
+ <div class="metric-label">F1-Score</div>
811
+ </div>
812
+ <div class="metric-card">
813
+ <div class="metric-value highlight">0.9991</div>
814
+ <div class="metric-label">ROC-AUC</div>
815
+ </div>
816
+ <div class="metric-card">
817
+ <div class="metric-value">98.90%</div>
818
+ <div class="metric-label">CV F1 (5-fold)</div>
819
+ </div>
820
+ </div>
821
+ <div class="section-title">Confusion Matrix</div>
822
+ <div class="confusion-matrix">
823
+ <div class="cm-header"></div>
824
+ <div class="cm-header">Pred Legit</div>
825
+ <div class="cm-header">Pred Phish</div>
826
+ <div class="cm-label">Actual Legit</div>
827
+ <div class="cm-cell cm-tn">10,700</div>
828
+ <div class="cm-cell cm-fp">69</div>
829
+ <div class="cm-label">Actual Phish</div>
830
+ <div class="cm-cell cm-fn">144</div>
831
+ <div class="cm-cell cm-tp">10,625</div>
832
+ </div>
833
+ <div class="subsection">
834
+ <div class="section-title">Feature Importance Split</div>
835
+ <div class="feature-split">
836
+ <div class="feature-split-bar">
837
+ <div class="split-url" style="width:37.1%">URL 37.1%</div>
838
+ <div class="split-html" style="width:62.9%">HTML 62.9%</div>
839
+ </div>
840
+ </div>
841
+ </div>
842
+ <div class="subsection">
843
+ <div class="section-title">Top 15 Features by Importance</div>
844
+ <div class="features-list">
845
+ <div class="feature-row"><span class="feature-rank">1</span><span class="feature-name">html_num_links</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:100%"></div></div><span class="feature-importance">0.4420</span></div>
846
+ <div class="feature-row"><span class="feature-rank">2</span><span class="feature-name">url_is_shortened</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:10%"></div></div><span class="feature-importance">0.0427</span></div>
847
+ <div class="feature-row"><span class="feature-rank">3</span><span class="feature-name">url_platform_subdomain_length</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:9%"></div></div><span class="feature-importance">0.0397</span></div>
848
+ <div class="feature-row"><span class="feature-rank">4</span><span class="feature-name">url_domain_dots</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:7%"></div></div><span class="feature-importance">0.0315</span></div>
849
+ <div class="feature-row"><span class="feature-rank">5</span><span class="feature-name">html_has_fromcharcode</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:7%"></div></div><span class="feature-importance">0.0296</span></div>
850
+ <div class="feature-row"><span class="feature-rank">6</span><span class="feature-name">url_num_domain_parts</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:6%"></div></div><span class="feature-importance">0.0269</span></div>
851
+ <div class="feature-row"><span class="feature-rank">7</span><span class="feature-name">html_has_meta_refresh</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:3%"></div></div><span class="feature-importance">0.0148</span></div>
852
+ <div class="feature-row"><span class="feature-rank">8</span><span class="feature-name">url_is_http</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:3%"></div></div><span class="feature-importance">0.0126</span></div>
853
+ <div class="feature-row"><span class="feature-rank">9</span><span class="feature-name">url_encoding_diff</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:3%"></div></div><span class="feature-importance">0.0124</span></div>
854
+ <div class="feature-row"><span class="feature-rank">10</span><span class="feature-name">url_path_digits</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:3%"></div></div><span class="feature-importance">0.0116</span></div>
855
+ <div class="feature-row"><span class="feature-rank">11</span><span class="feature-name">html_text_length</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:2%"></div></div><span class="feature-importance">0.0107</span></div>
856
+ <div class="feature-row"><span class="feature-rank">12</span><span class="feature-name">url_path_slashes</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:2%"></div></div><span class="feature-importance">0.0105</span></div>
857
+ <div class="feature-row"><span class="feature-rank">13</span><span class="feature-name">url_multiple_brands_in_url</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:2%"></div></div><span class="feature-importance">0.0103</span></div>
858
+ <div class="feature-row"><span class="feature-rank">14</span><span class="feature-name">url_brand_in_path</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:2%"></div></div><span class="feature-importance">0.0102</span></div>
859
+ <div class="feature-row"><span class="feature-rank">15</span><span class="feature-name">url_domain_hyphens</span><div class="importance-bar-bg"><div class="importance-bar-fill" style="width:2%"></div></div><span class="feature-importance">0.0095</span></div>
860
+ </div>
861
+ </div>
862
+ <div class="subsection">
863
+ <div class="section-title">Hyperparameters</div>
864
+ <div class="params-grid">
865
+ <div class="param-item"><span class="param-key">n_estimators</span><span class="param-value">726</span></div>
866
+ <div class="param-item"><span class="param-key">max_depth</span><span class="param-value">6</span></div>
867
+ <div class="param-item"><span class="param-key">learning_rate</span><span class="param-value">0.137</span></div>
868
+ <div class="param-item"><span class="param-key">subsample</span><span class="param-value">0.698</span></div>
869
+ </div>
870
+ </div>
871
+ </div>
872
+ </div>
873
+
874
+ <!-- CNN MODELS TAB -->
875
+ <div id="cnnModels" class="tab-content">
876
+ <div class="section-subtitle">
877
+ 2 character-level CNN models that process raw text directly &mdash; no hand-crafted features needed.
878
+ Parallel Conv1D branches with kernel sizes (3, 5, 7) capture patterns at different scales.
879
+ </div>
880
+
881
+ <!-- CNN URL -->
882
+ <div class="model-detail">
883
+ <div class="model-detail-header">
884
+ <div class="model-detail-name">CNN URL (Char-level)</div>
885
+ <div class="model-detail-type">Deep Learning</div>
886
+ </div>
887
+ <div class="metrics-grid">
888
+ <div class="metric-card">
889
+ <div class="metric-value">98.38%</div>
890
+ <div class="metric-label">Accuracy</div>
891
+ </div>
892
+ <div class="metric-card">
893
+ <div class="metric-value">98.88%</div>
894
+ <div class="metric-label">Precision</div>
895
+ </div>
896
+ <div class="metric-card">
897
+ <div class="metric-value">97.86%</div>
898
+ <div class="metric-label">Recall</div>
899
+ </div>
900
+ <div class="metric-card">
901
+ <div class="metric-value">98.37%</div>
902
+ <div class="metric-label">F1-Score</div>
903
+ </div>
904
+ <div class="metric-card">
905
+ <div class="metric-value highlight">0.9976</div>
906
+ <div class="metric-label">ROC-AUC</div>
907
+ </div>
908
+ </div>
909
+ <div class="section-title">Confusion Matrix</div>
910
+ <div class="confusion-matrix">
911
+ <div class="cm-header"></div>
912
+ <div class="cm-header">Pred Legit</div>
913
+ <div class="cm-header">Pred Phish</div>
914
+ <div class="cm-label">Actual Legit</div>
915
+ <div class="cm-cell cm-tn">8,013</div>
916
+ <div class="cm-cell cm-fp">90</div>
917
+ <div class="cm-label">Actual Phish</div>
918
+ <div class="cm-cell cm-fn">173</div>
919
+ <div class="cm-cell cm-tp">7,930</div>
920
+ </div>
921
+ <div class="subsection">
922
+ <div class="section-title">Architecture</div>
923
+ <div class="params-grid">
924
+ <div class="param-item"><span class="param-key">Input</span><span class="param-value">Raw URL characters</span></div>
925
+ <div class="param-item"><span class="param-key">max_len</span><span class="param-value">800</span></div>
926
+ <div class="param-item"><span class="param-key">vocab_size</span><span class="param-value">87</span></div>
927
+ <div class="param-item"><span class="param-key">Conv1D kernels</span><span class="param-value">3, 5, 7</span></div>
928
+ <div class="param-item"><span class="param-key">Dataset</span><span class="param-value">108,034 URLs</span></div>
929
+ </div>
930
+ </div>
931
+ </div>
932
+
933
+ <!-- CNN HTML -->
934
+ <div class="model-detail">
935
+ <div class="model-detail-header">
936
+ <div class="model-detail-name">CNN HTML (Char-level)</div>
937
+ <div class="model-detail-type">Deep Learning</div>
938
+ </div>
939
+ <div class="metrics-grid">
940
+ <div class="metric-card">
941
+ <div class="metric-value">96.33%</div>
942
+ <div class="metric-label">Accuracy</div>
943
+ </div>
944
+ <div class="metric-card">
945
+ <div class="metric-value">98.18%</div>
946
+ <div class="metric-label">Precision</div>
947
+ </div>
948
+ <div class="metric-card">
949
+ <div class="metric-value">94.41%</div>
950
+ <div class="metric-label">Recall</div>
951
+ </div>
952
+ <div class="metric-card">
953
+ <div class="metric-value">96.26%</div>
954
+ <div class="metric-label">F1-Score</div>
955
+ </div>
956
+ <div class="metric-card">
957
+ <div class="metric-value highlight">0.9908</div>
958
+ <div class="metric-label">ROC-AUC</div>
959
+ </div>
960
+ </div>
961
+ <div class="section-title">Confusion Matrix</div>
962
+ <div class="confusion-matrix">
963
+ <div class="cm-header"></div>
964
+ <div class="cm-header">Pred Legit</div>
965
+ <div class="cm-header">Pred Phish</div>
966
+ <div class="cm-label">Actual Legit</div>
967
+ <div class="cm-cell cm-tn">5,943</div>
968
+ <div class="cm-cell cm-fp">106</div>
969
+ <div class="cm-label">Actual Phish</div>
970
+ <div class="cm-cell cm-fn">338</div>
971
+ <div class="cm-cell cm-tp">5,711</div>
972
+ </div>
973
+ <div class="subsection">
974
+ <div class="section-title">Architecture</div>
975
+ <div class="params-grid">
976
+ <div class="param-item"><span class="param-key">Input</span><span class="param-value">Raw HTML source</span></div>
977
+ <div class="param-item"><span class="param-key">max_len</span><span class="param-value">5,000</span></div>
978
+ <div class="param-item"><span class="param-key">vocab_size</span><span class="param-value">100</span></div>
979
+ <div class="param-item"><span class="param-key">Conv1D kernels</span><span class="param-value">3, 5, 7</span></div>
980
+ <div class="param-item"><span class="param-key">Dataset</span><span class="param-value">80,652 HTML pages</span></div>
981
+ </div>
982
+ </div>
983
+ </div>
984
+ </div>
985
+
986
+ <!-- COMPARISON TAB -->
987
+ <div id="overview" class="tab-content">
988
+ <div class="section-subtitle">
989
+ Side-by-side comparison of all 9 models across URL, HTML, Combined, and CNN categories.
990
+ </div>
991
+
992
+ <div class="section-title">All Models</div>
993
+ <div class="table-scroll">
994
+ <table class="comparison-table">
995
+ <thead>
996
+ <tr>
997
+ <th>Model</th>
998
+ <th>Category</th>
999
+ <th>Accuracy</th>
1000
+ <th>Precision</th>
1001
+ <th>Recall</th>
1002
+ <th>F1-Score</th>
1003
+ <th>ROC-AUC</th>
1004
+ <th>Features</th>
1005
+ </tr>
1006
+ </thead>
1007
+ <tbody>
1008
+ <tr>
1009
+ <td class="model-name-cell">Logistic Regression</td>
1010
+ <td>URL</td>
1011
+ <td>93.71%</td>
1012
+ <td>95.40%</td>
1013
+ <td>91.84%</td>
1014
+ <td>93.59%</td>
1015
+ <td>0.9789</td>
1016
+ <td>121</td>
1017
+ </tr>
1018
+ <tr>
1019
+ <td class="model-name-cell">Random Forest</td>
1020
+ <td>URL</td>
1021
+ <td>97.71%</td>
1022
+ <td>99.06%</td>
1023
+ <td>96.33%</td>
1024
+ <td>97.68%</td>
1025
+ <td>0.9958</td>
1026
+ <td>121</td>
1027
+ </tr>
1028
+ <tr>
1029
+ <td class="model-name-cell">XGBoost</td>
1030
+ <td>URL</td>
1031
+ <td>98.07%</td>
1032
+ <td>99.12%</td>
1033
+ <td>97.00%</td>
1034
+ <td>98.05%</td>
1035
+ <td>0.9963</td>
1036
+ <td>121</td>
1037
+ </tr>
1038
+ <tr>
1039
+ <td class="model-name-cell">Random Forest HTML</td>
1040
+ <td>HTML</td>
1041
+ <td>88.03%</td>
1042
+ <td>87.49%</td>
1043
+ <td>88.74%</td>
1044
+ <td>88.11%</td>
1045
+ <td>0.9561</td>
1046
+ <td>100</td>
1047
+ </tr>
1048
+ <tr>
1049
+ <td class="model-name-cell">XGBoost HTML</td>
1050
+ <td>HTML</td>
1051
+ <td>87.86%</td>
1052
+ <td>86.45%</td>
1053
+ <td>89.80%</td>
1054
+ <td>88.09%</td>
1055
+ <td>0.9557</td>
1056
+ <td>100</td>
1057
+ </tr>
1058
+ <tr>
1059
+ <td class="model-name-cell">RF Combined</td>
1060
+ <td>Combined</td>
1061
+ <td>98.60%</td>
1062
+ <td>99.16%</td>
1063
+ <td>98.02%</td>
1064
+ <td>98.59%</td>
1065
+ <td>0.9990</td>
1066
+ <td>221</td>
1067
+ </tr>
1068
+ <tr>
1069
+ <td class="model-name-cell">XGBoost Combined</td>
1070
+ <td>Combined</td>
1071
+ <td class="best">99.01%</td>
1072
+ <td class="best">99.35%</td>
1073
+ <td class="best">98.66%</td>
1074
+ <td class="best">99.01%</td>
1075
+ <td class="best">0.9991</td>
1076
+ <td>221</td>
1077
+ </tr>
1078
+ <tr>
1079
+ <td class="model-name-cell">CNN URL</td>
1080
+ <td>CNN</td>
1081
+ <td>98.38%</td>
1082
+ <td>98.88%</td>
1083
+ <td>97.86%</td>
1084
+ <td>98.37%</td>
1085
+ <td>0.9976</td>
1086
+ <td>chars</td>
1087
+ </tr>
1088
+ <tr>
1089
+ <td class="model-name-cell">CNN HTML</td>
1090
+ <td>CNN</td>
1091
+ <td>96.33%</td>
1092
+ <td>98.18%</td>
1093
+ <td>94.41%</td>
1094
+ <td>96.26%</td>
1095
+ <td>0.9908</td>
1096
+ <td>chars</td>
1097
+ </tr>
1098
+ </tbody>
1099
+ </table>
1100
+ </div>
1101
+
1102
+ <div class="section-title">Key Insights</div>
1103
+ <div class="insights-grid">
1104
+ <div class="insight-card insight-safe">
1105
+ <div class="insight-label">Best Overall</div>
1106
+ <div class="insight-title">XGBoost Combined</div>
1107
+ <div class="insight-desc">99.01% accuracy, 99.35% precision &mdash; best performance by combining 121 URL + 100 HTML features.</div>
1108
+ </div>
1109
+ <div class="insight-card insight-safe">
1110
+ <div class="insight-label">Ensemble Strength</div>
1111
+ <div class="insight-title">9-Model Consensus</div>
1112
+ <div class="insight-desc">Combining 3 URL + 2 HTML + 2 Combined + 2 CNN models via majority vote maximizes reliability.</div>
1113
+ </div>
1114
+ <div class="insight-card insight-accent">
1115
+ <div class="insight-label">Top Signal</div>
1116
+ <div class="insight-title">html_num_links</div>
1117
+ <div class="insight-desc">Number of links in HTML dominates XGBoost Combined at 44.2% importance &mdash; the single strongest feature.</div>
1118
+ </div>
1119
+ </div>
1120
+ </div>
1121
+ </section>
1122
+
1123
+ <footer>
1124
+ <div class="footer-text">Machine Learning Phishing Detection</div>
1125
+ </footer>
1126
+ </div>
1127
+
1128
+ <script src="/static/script.js?v=4"></script>
1129
+ </body>
1130
+ </html>
server/static/script.js ADDED
@@ -0,0 +1,509 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
/* ================================================================
   Phishing Detection – UI Controller (Unified All-Models)
   ================================================================ */

// Base URL for all API calls — same origin as the page serving this script.
const API_BASE = window.location.origin;

// ── DOM Refs ────────────────────────────────────────────────────
// Lazy lookup functions (not cached nodes) so references stay valid
// after innerHTML rewrites and regardless of script placement.
const $url = () => document.getElementById('urlInput');
const $loading = () => document.getElementById('loading');
const $results = () => document.getElementById('results');
12
// ── Feature-key catalogues ──────────────────────────────────────
// The 20 URL features rendered by default in the results view
// (display order; presumably chosen by model importance — confirm
// against the feature-importance output).
const TOP_URL_FEATURES = [
    'num_domain_parts', 'domain_dots', 'is_shortened', 'num_subdomains',
    'domain_hyphens', 'is_free_platform', 'platform_subdomain_length',
    'avg_domain_part_len', 'domain_length_category', 'path_digits', 'is_http',
    'multiple_brands_in_url', 'brand_in_path', 'path_slashes', 'encoding_diff',
    'symbol_ratio_domain', 'domain_length', 'has_at_symbol', 'tld_length',
    'is_free_hosting',
];

// Full ordered catalogue of URL features the API may return; the
// "show all" section renders everything here not already in
// TOP_URL_FEATURES (see renderFeatureSection).
const ALL_URL_FEATURES = [
    'url_length', 'domain_length', 'path_length', 'query_length', 'url_length_category',
    'domain_length_category', 'num_dots', 'num_hyphens', 'num_underscores', 'num_slashes',
    'num_question_marks', 'num_ampersands', 'num_equals', 'num_at', 'num_percent',
    'num_digits_url', 'num_letters_url', 'domain_dots', 'domain_hyphens', 'domain_digits',
    'path_slashes', 'path_dots', 'path_digits', 'digit_ratio_url', 'letter_ratio_url',
    'special_char_ratio', 'digit_ratio_domain', 'symbol_ratio_domain', 'num_subdomains',
    'num_domain_parts', 'tld_length', 'sld_length', 'longest_domain_part', 'avg_domain_part_len',
    'longest_part_gt_20', 'longest_part_gt_30', 'longest_part_gt_40', 'has_suspicious_tld',
    'has_trusted_tld', 'has_port', 'has_non_std_port', 'domain_randomness_score',
    'sld_consonant_cluster_score', 'sld_keyboard_pattern', 'sld_has_dictionary_word',
    'sld_pronounceability_score', 'domain_digit_position_suspicious', 'path_depth',
    'max_path_segment_len', 'avg_path_segment_len', 'has_extension', 'extension_category',
    'has_suspicious_extension', 'has_exe', 'has_double_slash', 'path_has_brand_not_domain',
    'path_has_ip_pattern', 'suspicious_path_extension_combo', 'num_params', 'has_query',
    'query_value_length', 'max_param_len', 'query_has_url', 'url_entropy', 'domain_entropy',
    'path_entropy', 'max_consecutive_digits', 'max_consecutive_chars', 'max_consecutive_consonants',
    'char_repeat_rate', 'unique_bigram_ratio', 'unique_trigram_ratio', 'sld_letter_diversity',
    'domain_has_numbers_letters', 'url_complexity_score', 'has_ip_address', 'has_at_symbol',
    'has_redirect', 'is_shortened', 'is_free_hosting', 'is_free_platform',
    'platform_subdomain_length', 'has_uuid_subdomain', 'is_http',
    'num_phishing_keywords', 'phishing_in_domain', 'phishing_in_path', 'num_brands',
    'brand_in_domain', 'brand_in_path', 'brand_impersonation', 'has_login', 'has_account',
    'has_verify', 'has_secure', 'has_update', 'has_bank', 'has_password', 'has_suspend',
    'has_webscr', 'has_cmd', 'has_cgi', 'brand_in_subdomain_not_domain', 'multiple_brands_in_url',
    'brand_with_hyphen', 'suspicious_brand_tld', 'brand_keyword_combo', 'has_url_encoding',
    'encoding_count', 'encoding_diff', 'has_punycode', 'has_unicode', 'has_hex_string',
    'has_base64', 'has_lookalike_chars', 'mixed_script_score', 'homograph_brand_risk',
    'suspected_idn_homograph', 'double_encoding', 'encoding_in_domain', 'suspicious_unicode_category',
];
52
+
53
// The 20 HTML features rendered by default (display order).
const TOP_HTML_FEATURES = [
    'has_login_form', 'num_password_fields', 'password_with_external_action',
    'num_external_form_actions', 'num_empty_form_actions', 'num_hidden_fields',
    'ratio_external_links', 'num_external_links', 'num_ip_based_links',
    'num_suspicious_tld_links', 'has_eval', 'has_base64', 'has_atob',
    'has_fromcharcode', 'has_document_write', 'has_right_click_disabled',
    'has_status_bar_customization', 'has_meta_refresh', 'has_location_replace',
    'num_hidden_iframes',
];

// Full ordered catalogue of HTML features; the "show all" section
// renders everything here not already in TOP_HTML_FEATURES.
const ALL_HTML_FEATURES = [
    'html_length', 'num_tags', 'num_divs', 'num_spans', 'num_paragraphs',
    'num_headings', 'num_lists', 'num_images', 'num_iframes', 'num_tables',
    'has_title', 'dom_depth',
    'num_forms', 'num_input_fields', 'num_password_fields', 'num_email_fields',
    'num_text_fields', 'num_submit_buttons', 'num_hidden_fields', 'has_login_form',
    'has_form', 'num_external_form_actions', 'num_empty_form_actions',
    'num_links', 'num_external_links', 'num_internal_links', 'num_empty_links',
    'num_mailto_links', 'num_javascript_links', 'ratio_external_links',
    'num_ip_based_links', 'num_suspicious_tld_links', 'num_anchor_text_mismatch',
    'num_scripts', 'num_inline_scripts', 'num_external_scripts',
    'has_eval', 'has_unescape', 'has_escape', 'has_document_write',
    'text_length', 'num_words', 'text_to_html_ratio', 'num_brand_mentions',
    'num_urgency_keywords', 'has_copyright', 'has_phone_number', 'has_email_address',
    'num_meta_tags', 'has_description', 'has_keywords', 'has_author',
    'has_viewport', 'has_meta_refresh',
    'num_css_files', 'num_external_css', 'num_external_images',
    'num_data_uri_images', 'num_inline_styles', 'inline_css_length', 'has_favicon',
    'password_with_external_action', 'has_base64', 'has_atob', 'has_fromcharcode',
    'num_onload_events', 'num_onerror_events', 'num_onclick_events',
    'num_unique_external_domains', 'num_forms_without_labels',
    'has_display_none', 'has_visibility_hidden', 'has_window_open',
    'has_location_replace', 'num_hidden_iframes', 'has_right_click_disabled',
    'has_status_bar_customization',
];
88
+
89
// ── Highlight rules ─────────────────────────────────────────────
// These tables drive the colour-coding in renderFeature().

// Boolean features where a truthy value is a GOOD (safe) sign.
const GOOD_INDICATORS = new Set([
    'has_trusted_tld', 'has_title', 'has_favicon', 'sld_has_dictionary_word',
]);

// Boolean features where a truthy value is a BAD (phishing) sign.
const BAD_INDICATORS = new Set([
    'is_shortened', 'is_free_hosting', 'is_free_platform',
    'has_ip_address', 'has_at_symbol', 'has_suspicious_tld',
    'has_meta_refresh', 'has_popup_window', 'form_action_external',
    'has_base64', 'brand_impersonation', 'has_punycode',
    'has_unicode', 'has_hex_string', 'suspected_idn_homograph',
    'is_http', 'multiple_brands_in_url', 'brand_in_path',
]);

// Numeric features flagged dangerous when value `op` threshold holds,
// expressed as [threshold, operator].
const DANGER_THRESHOLDS = {
    num_password_fields: [0, '>'],
    num_hidden_fields: [2, '>'],
    num_urgency_keywords: [0, '>'],
    num_phishing_keywords: [0, '>'],
    num_external_scripts: [10, '>'],
    platform_subdomain_length: [5, '>'],
    domain_dots: [3, '>'],
    num_subdomains: [3, '>'],
    domain_entropy: [4.5, '>'],
    symbol_ratio_domain: [0.3, '>'],
    max_consecutive_digits: [5, '>'],
    domain_hyphens: [1, '>'],
    path_digits: [5, '>'],
    encoding_diff: [0.5, '>'],
};

// Numeric features highlighted as safe, same [threshold, operator] shape.
const SAFE_THRESHOLDS = {
    domain_length: [15, '<'],
    domain_entropy: [3.5, '<'],
    num_brands: [1, '=='],
    num_domain_parts: [2, '=='],
};
127
+
128
// ── API helpers ─────────────────────────────────────────────────

/**
 * Normalize raw user input into a fetchable URL.
 * Trims whitespace, returns null for empty input, and prepends
 * "https://" when no explicit http(s) scheme is present.
 */
function normalizeUrl(raw) {
    const value = raw.trim();
    if (!value) {
        return null;
    }
    const hasScheme = value.startsWith('http://') || value.startsWith('https://');
    return hasScheme ? value : 'https://' + value;
}
137
+
138
/**
 * POST the normalized URL from the input box to an API endpoint.
 * `body` maps the normalized URL to the JSON request payload.
 * Shows the loading indicator while the request is in flight and
 * returns the parsed JSON response; on failure it alerts the user,
 * hides the loader, and returns null (or undefined for empty input).
 */
async function fetchPrediction(endpoint, body) {
    const target = normalizeUrl($url().value);
    if (!target) { alert('Please enter a URL'); return; }

    showLoading();
    try {
        const response = await fetch(`${API_BASE}${endpoint}`, {
            method: 'POST',
            headers: { 'Content-Type': 'application/json' },
            body: JSON.stringify(body(target)),
        });
        if (!response.ok) {
            throw new Error('Analysis failed');
        }
        return await response.json();
    } catch (err) {
        alert('Error: ' + err.message);
        hideLoading();
        return null;
    }
}
157
+
158
// ── Public actions ──────────────────────────────────────────────

/** Run every model family against the entered URL and render the results. */
async function analyzeAll() {
    const payload = await fetchPrediction('/api/predict/all', url => ({ url }));
    if (payload) {
        displayAllResults(payload);
    }
}
164
+
165
/** Hide the results panel and empty the URL input box. */
function clearResults() {
    const panel = $results();
    const field = $url();
    if (panel) panel.style.display = 'none';
    if (field) field.value = '';
}
171
+
172
// ── Ensemble weights (F1-score based) ───────────────────────────
// Per-model weights for the weighted-average verdict; models not
// listed here fall back to 0.90 inside computeEnsembleVerdict().
const MODEL_WEIGHTS = {
    'Logistic Regression': 0.9359,
    'Random Forest': 0.9768,
    'XGBoost': 0.9805,
    'Random Forest HTML': 0.8811,
    'XGBoost HTML': 0.8809,
    'Random Forest Combined': 0.9859,
    'XGBoost Combined': 0.9901,
    'CNN URL (Char-level)': 0.9837,
    'CNN HTML (Char-level)': 0.9626,
};

/**
 * Fold every available model prediction (URL / HTML / Combined / CNN
 * sections) into one verdict object:
 *   score         weighted mean phishing probability (percent, 1 d.p.)
 *   isPhishing    true when score >= 50
 *   totalModels   number of predictions considered
 *   phishingVotes how many models said 'PHISHING'
 * With no predictions at all, a zeroed "legitimate" verdict is returned.
 */
function computeEnsembleVerdict(data) {
    const sections = ['url_models', 'html_models', 'combined_models', 'cnn_models'];
    const preds = [];
    for (const name of sections) {
        const section = data[name];
        if (section && section.predictions) {
            preds.push(...section.predictions);
        }
    }

    if (!preds.length) {
        return { score: 0, isPhishing: false, totalModels: 0, phishingVotes: 0 };
    }

    let weightedSum = 0;
    let totalWeight = 0;
    let phishingVotes = 0;
    for (const pred of preds) {
        const weight = MODEL_WEIGHTS[pred.model_name] || 0.90;
        weightedSum += weight * (pred.phishing_probability / 100);
        totalWeight += weight;
        if (pred.prediction === 'PHISHING') phishingVotes += 1;
    }

    const score = totalWeight > 0 ? (weightedSum / totalWeight) * 100 : 0;

    return {
        score: Math.round(score * 10) / 10,
        isPhishing: score >= 50,
        totalModels: preds.length,
        phishingVotes,
    };
}
221
+
222
// ── Loading UI ──────────────────────────────────────────────────

/** Show the loading indicator, hiding any previously rendered results. */
function showLoading() {
    $results().style.display = 'none';
    $loading().style.display = 'block';
}
228
+
229
/** Hide the loading indicator. */
function hideLoading() {
    const spinner = $loading();
    spinner.style.display = 'none';
}
232
+
233
// UNIFIED RESULTS
/**
 * Render the complete results view for /api/predict/all:
 * a weighted-ensemble verdict banner, the analyzed URL, and one tab
 * per available model family (URL / HTML / Combined / CNN).
 *
 * Fix: `data.url` and `data.html_error` are user/server-influenced
 * strings that were interpolated into innerHTML unescaped, allowing
 * HTML/script injection into the page. They are now HTML-escaped.
 */
function displayAllResults(data) {
    hideLoading();
    const el = $results();
    el.style.display = 'block';

    // Minimal escaper for untrusted text placed inside innerHTML.
    const esc = s => String(s)
        .replace(/&/g, '&amp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;')
        .replace(/"/g, '&quot;');

    // Weighted ensemble verdict
    const verdict = computeEnsembleVerdict(data);
    const statusClass = verdict.isPhishing ? 'danger' : 'safe';
    const statusText = verdict.isPhishing ? 'Phishing' : 'Legitimate';
    const safeVotes = verdict.totalModels - verdict.phishingVotes;

    const banner = `
    <div class="status-banner ${statusClass}">
      <div class="status-headline">
        <div>
          <div class="status-title">${statusText}</div>
        </div>
      </div>
      <div class="ensemble-score">
        <div class="banner-score-value">${verdict.score.toFixed(1)}%</div>
        <div class="banner-score-label">Phishing risk</div>
      </div>
      <div class="ensemble-bar">
        <div class="prob-fill ${statusClass}" style="width:${verdict.score}%"></div>
      </div>
      <div class="status-votes">${verdict.phishingVotes}/${verdict.totalModels} models flagged phishing \u00b7 ${safeVotes}/${verdict.totalModels} say legitimate</div>
    </div>
    <div class="url-display">${esc(data.url)}</div>`;

    // Build tabs
    const tabs = [];
    const tabContents = [];

    // Tab 1: URL Models
    if (data.url_models) {
        tabs.push({ id: 'tabUrl', label: 'URL Models', count: data.url_models.predictions?.length || 0 });
        tabContents.push({ id: 'tabUrl', html: renderUrlModelsTab(data.url_models) });
    }

    // Tab 2: HTML Models (or the download-error notice if HTML fetch failed)
    if (data.html_models) {
        tabs.push({ id: 'tabHtml', label: 'HTML Models', count: data.html_models.predictions?.length || 0 });
        tabContents.push({ id: 'tabHtml', html: renderHtmlModelsTab(data.html_models) });
    } else if (data.html_error) {
        tabs.push({ id: 'tabHtml', label: 'HTML Models', count: 0 });
        tabContents.push({ id: 'tabHtml', html: `<div class="error-notice">HTML download failed: ${esc(data.html_error)}</div>` });
    }

    // Tab 3: Combined Models
    if (data.combined_models) {
        tabs.push({ id: 'tabCombined', label: 'Combined Models', count: data.combined_models.predictions?.length || 0 });
        tabContents.push({ id: 'tabCombined', html: renderCombinedModelsTab(data.combined_models) });
    }

    // Tab 4: CNN Models
    if (data.cnn_models) {
        tabs.push({ id: 'tabCnn', label: 'CNN Models', count: data.cnn_models.predictions?.length || 0 });
        tabContents.push({ id: 'tabCnn', html: renderCnnModelsTab(data.cnn_models) });
    }

    // First tab is active by default.
    const tabsHTML = tabs.map((t, i) => `
        <button class="tab ${i === 0 ? 'active' : ''}" onclick="switchTab(event,'${t.id}')">
            ${t.label} <span class="tab-count">${t.count}</span>
        </button>
    `).join('');

    const contentsHTML = tabContents.map((t, i) => `
        <div id="${t.id}" class="tab-content ${i === 0 ? 'active' : ''}">${t.html}</div>
    `).join('');

    el.innerHTML = `${banner}
        <div class="tabs">${tabsHTML}</div>
        ${contentsHTML}`;
}
308
+
309
// TAB RENDERERS
/**
 * Build the "URL Models" tab: one card per model prediction plus the
 * extracted URL-feature grid (tag 'url' scopes the show-more toggle).
 */
function renderUrlModelsTab(urlData) {
    const predictions = urlData.predictions || [];
    const features = urlData.features || {};

    return `
    <div class="section-title">Model Predictions</div>
    <div class="models-grid">${predictions.map(p => renderModelCard(p)).join('')}</div>
    ${renderFeatureSection(features, 'url')}
    `;
}
320
+
321
/**
 * Build the "HTML Models" tab: one card per model prediction plus the
 * extracted HTML-feature grid (tag 'html' scopes the show-more toggle).
 */
function renderHtmlModelsTab(htmlData) {
    const predictions = htmlData.predictions || [];
    const features = htmlData.features || {};

    return `
    <div class="section-title">Model Predictions</div>
    <div class="models-grid">${predictions.map(p => renderModelCard(p)).join('')}</div>
    ${renderFeatureSection(features, 'html')}
    `;
}
331
+
332
/**
 * Build the "Combined Models" tab: prediction cards plus a nested
 * sub-tab pair showing the URL features and (when available) the HTML
 * features that fed the combined models. When HTML features are
 * missing, an error notice is shown instead.
 * NOTE(review): `combinedData.html_error` is interpolated into
 * innerHTML unescaped here — confirm the server never echoes
 * attacker-controlled markup in that field.
 */
function renderCombinedModelsTab(combinedData) {
    const predictions = combinedData.predictions || [];
    const urlFeats = combinedData.url_features || {};
    const htmlFeats = combinedData.html_features || {};
    // Empty object means the HTML side could not be fetched/parsed.
    const hasHtmlF = Object.keys(htmlFeats).length > 0;

    return `
    <div class="section-title">Model Predictions</div>
    <div class="models-grid">${predictions.map(p => renderModelCard(p)).join('')}</div>
    <div class="combined-features-tabs">
        <div class="tabs">
            <button class="tab active" onclick="switchSubTab(event,'combinedUrlFeats')">URL Features</button>
            <button class="tab" onclick="switchSubTab(event,'combinedHtmlFeats')">HTML Features</button>
        </div>
        <div id="combinedUrlFeats" class="tab-content active">
            ${renderFeatureSection(urlFeats, 'combined-url')}
        </div>
        <div id="combinedHtmlFeats" class="tab-content">
            ${hasHtmlF
                ? renderFeatureSection(htmlFeats, 'combined-html')
                : `<div class="error-notice">HTML features unavailable${combinedData.html_error ? ': ' + combinedData.html_error : ''}</div>`}
        </div>
    </div>
    `;
}
357
+
358
/**
 * Build the "CNN Models" tab: prediction cards only — char-level CNNs
 * consume raw text, so there is no feature grid to display.
 */
function renderCnnModelsTab(cnnData) {
    const predictions = cnnData.predictions || [];

    return `
    <div class="section-title">Model Predictions</div>
    <div class="models-grid">${predictions.map(p => renderModelCard(p)).join('')}</div>
    `;
}
366
+
367
// MODEL CARDS & INFO
/**
 * Render one prediction card: model name, verdict chip, confidence
 * percentage, and safe/phishing probability bars. The colour scheme is
 * picked by comparing `pred.prediction` case-insensitively against
 * 'legitimate'; anything else is styled as danger.
 */
function renderModelCard(pred) {
    const isSafe = pred.prediction.toLowerCase() === 'legitimate';
    const cls = isSafe ? 'safe' : 'danger';
    return `
    <div class="model-card ${cls}">
        <div class="model-header">
            <div class="model-name">${pred.model_name}</div>
            <div class="model-prediction ${cls}">${pred.prediction}</div>
        </div>
        <div class="model-confidence">${pred.confidence.toFixed(1)}%</div>
        <div class="model-confidence-label">Confidence</div>
        <div class="prob-container">
            ${probRow('Safe', pred.legitimate_probability, 'safe')}
            ${probRow('Phishing', pred.phishing_probability, 'danger')}
        </div>
    </div>`;
}
385
+
386
/**
 * One labelled probability bar: `pct` (0–100) sets the fill width and
 * is echoed as a rounded percentage; `cls` ('safe'/'danger') colours it.
 */
function probRow(label, pct, cls) {
    return `
    <div class="prob-row">
        <span class="prob-label">${label}</span>
        <div class="prob-bar"><div class="prob-fill ${cls}" style="width:${pct}%"></div></div>
        <span class="prob-value">${pct.toFixed(0)}%</span>
    </div>`;
}
394
+
395
+
396
+
397
// FEATURE RENDERING
/**
 * Render a feature grid for one section: the top-20 catalogue entries
 * visible by default, the remainder in a hidden container toggled by a
 * "Show All Features" button. `tag` makes the toggle IDs unique per
 * section. Whether the payload is URL- or HTML-shaped is sniffed from
 * marker keys ('num_forms' / 'html_length').
 * NOTE(review): the button label counts raw feature keys, while
 * toggleAllFeatures() later counts rendered rows — the two can differ
 * when the payload contains keys absent from the catalogues.
 */
function renderFeatureSection(features, tag) {
    if (!features || Object.keys(features).length === 0) return '';

    const isHtml = 'num_forms' in features || 'html_length' in features;
    const topKeys = isHtml ? TOP_HTML_FEATURES : TOP_URL_FEATURES;
    const allKeys = isHtml ? ALL_HTML_FEATURES : ALL_URL_FEATURES;
    // Everything not in the top-20 goes into the collapsible remainder.
    const remaining = allKeys.filter(k => !topKeys.includes(k));

    const topHTML = renderFeatureList(topKeys, features);
    const remainingHTML = renderFeatureList(remaining, features);

    return `
    <div class="section-title">Extracted Features (Top 20)</div>
    <div class="features-grid">
        ${topHTML}
        <div id="hiddenFeatures-${tag}" class="features-hidden">${remainingHTML}</div>
    </div>
    <button class="show-more-btn" onclick="toggleAllFeatures('${tag}')" id="showMoreBtn-${tag}">
        Show All Features (${Object.keys(features).length})
    </button>`;
}
419
+
420
/** Render rows for the catalogue keys actually present in `features`, in order. */
function renderFeatureList(keys, features) {
    const rows = [];
    for (const key of keys) {
        if (key in features) {
            rows.push(renderFeature(key, features[key]));
        }
    }
    return rows.join('');
}
423
+
424
/**
 * Render one key/value feature row, colour-coded by heuristic rules:
 * boolean-style values (true/false and the numbers 0/1) are judged
 * against the GOOD/BAD indicator sets, then numeric values against the
 * DANGER/SAFE threshold tables (thresholds win when both apply).
 */
function renderFeature(key, value) {
    let itemClass = '';
    let valueClass = '';

    // 0/1 and true/false are all treated as boolean-style flags.
    const looksBoolean = typeof value === 'boolean' || value === 0 || value === 1;
    if (looksBoolean) {
        const truthy = value === true || value === 1;
        if (GOOD_INDICATORS.has(key)) {
            valueClass = truthy ? 'true' : 'false';
            itemClass = truthy ? 'highlight-safe' : 'highlight-danger';
        } else if (BAD_INDICATORS.has(key)) {
            valueClass = truthy ? 'false' : 'true';
            itemClass = truthy ? 'highlight-danger' : 'highlight-safe';
        }
    }

    // Numeric thresholds override any boolean colouring when they match.
    if (key in DANGER_THRESHOLDS) {
        const [limit, cmp] = DANGER_THRESHOLDS[key];
        if ((cmp === '>' && value > limit) || (cmp === '>=' && value >= limit)) {
            itemClass = 'highlight-danger';
        }
    }

    if (key in SAFE_THRESHOLDS) {
        const [limit, cmp] = SAFE_THRESHOLDS[key];
        if ((cmp === '<' && value < limit) || (cmp === '==' && value === limit)) {
            itemClass = 'highlight-safe';
        }
    }

    return `
    <div class="feature-item ${itemClass}">
        <span class="feature-label">${formatName(key)}</span>
        <span class="feature-value ${valueClass}">${formatValue(value)}</span>
    </div>`;
}
460
+
461
/**
 * Activate a top-level tab. De-activation is scoped to the tab bar's
 * parent container (falling back to document) so nested sub-tab groups
 * keep their own active state: only the parent's direct-child
 * `.tab-content` panels (`:scope >`) are reset.
 */
function switchTab(event, tabId) {
    const parent = event.currentTarget.closest('.tabs')?.parentElement ?? document;
    parent.querySelectorAll('.tabs > .tab').forEach(t => t.classList.remove('active'));
    parent.querySelectorAll(':scope > .tab-content').forEach(c => c.classList.remove('active'));
    event.currentTarget.classList.add('active');
    document.getElementById(tabId).classList.add('active');
}
468
+
469
/** Activate a sub-tab inside the combined-features panel. */
function switchSubTab(event, tabId) {
    const container = event.currentTarget.closest('.combined-features-tabs');
    for (const tab of container.querySelectorAll('.tab')) {
        tab.classList.remove('active');
    }
    for (const panel of container.querySelectorAll('.tab-content')) {
        panel.classList.remove('active');
    }
    event.currentTarget.classList.add('active');
    document.getElementById(tabId).classList.add('active');
}
476
+
477
/** Collapse/expand the panel following a header, swapping its +/− icon. */
function toggleFeatures(el) {
    const panel = el.nextElementSibling;
    const nowOpen = panel.classList.toggle('open');
    const icon = el.querySelector('.toggle-icon');
    icon.textContent = nowOpen ? '\u2212' : '+';
}
483
+
484
/**
 * Toggle visibility of the non-top-20 feature rows for a section tag,
 * updating the button label to match.
 * NOTE(review): the re-computed total counts rendered rows, which can
 * be smaller than the raw feature count used for the initial label in
 * renderFeatureSection() — confirm the discrepancy is intended.
 */
function toggleAllFeatures(type) {
    const hidden = document.getElementById('hiddenFeatures-' + type);
    const btn = document.getElementById('showMoreBtn-' + type);
    // classList.toggle returns true when the class was just ADDED,
    // i.e. the extra rows are hidden again.
    if (hidden.classList.toggle('features-hidden')) {
        const total = hidden.closest('.features-grid')?.querySelectorAll('.feature-item').length ?? 0;
        btn.textContent = `Show All Features (${total})`;
    } else {
        btn.textContent = 'Show Less';
    }
}
494
+
495
/** Display name for a feature key: "num_dots" → "Num Dots". */
function formatName(name) {
    const spaced = name.replace(/_/g, ' ');
    return spaced.replace(/\b\w/g, ch => ch.toUpperCase());
}
498
+
499
/**
 * Human-readable feature value: booleans and the numbers 0/1 become
 * Yes/No, non-integer numbers get two decimals, and everything else
 * (integers, strings) passes through unchanged.
 */
function formatValue(value) {
    if (typeof value === 'boolean') return value ? 'Yes' : 'No';
    if (value === 1) return 'Yes';
    if (value === 0) return 'No';
    if (typeof value === 'number' && value % 1 !== 0) return value.toFixed(2);
    return value;
}
505
+
506
// Wire up Enter-key submission once the DOM is ready.
document.addEventListener('DOMContentLoaded', () => {
    const field = $url();
    if (!field) return;
    field.addEventListener('keypress', evt => {
        if (evt.key === 'Enter') analyzeAll();
    });
});
server/static/style.css ADDED
@@ -0,0 +1,1325 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ * {
2
+ margin: 0;
3
+ padding: 0;
4
+ box-sizing: border-box;
5
+ }
6
+
7
+ html, body {
8
+ scrollbar-width: none;
9
+ -ms-overflow-style: none;
10
+ }
11
+
12
+ html::-webkit-scrollbar,
13
+ body::-webkit-scrollbar {
14
+ display: none;
15
+ }
16
+
17
+ :root {
18
+ --bg: #ffffff;
19
+ --bg-secondary: #f5f5f5;
20
+ --text: #000000;
21
+ --text-secondary: #666666;
22
+ --border: #e0e0e0;
23
+ --safe: #00a86b;
24
+ --safe-bg: #e8f5e9;
25
+ --danger: #dc2626;
26
+ --danger-bg: #fef2f2;
27
+ --accent: #000000;
28
+ }
29
+
30
+ /*
31
+ * DESIGN SYSTEM REFERENCE
32
+ * ───────────────────────────────────────────────
33
+ *
34
+ * Typography:
35
+ * Label: 10px / 600 / 0.15em / uppercase / --text-secondary
36
+ * Button: 11px / 600 / 0.1em / uppercase
37
+ * Body: 12-13px / 400 / --text or --text-secondary
38
+ * Mono: 'SF Mono', 'Monaco', 'Inconsolata', monospace
39
+ * Heading: 16-28px / 700
40
+ *
41
+ * Spacing: 8, 12, 16, 20, 24, 32, 40, 48px
42
+ * Borders: 1px solid var(--border)
43
+ * Accents: border-left 3-4px solid var(--safe | --danger | --accent)
44
+ *
45
+ * Color modifiers (add to element):
46
+ * .safe → green state (--safe, --safe-bg)
47
+ * .danger → red state (--danger, --danger-bg)
48
+ *
49
+ * Reusable components:
50
+ * .btn filled button
51
+ * .btn-secondary ghost button (use with .btn)
52
+ * .btn-outline standalone outline link/button
53
+ * .tabs / .tab tab navigation (nest for sub-tabs)
54
+ * .tab-count badge inside .tab
55
+ * .tab-content tab panel (add .active)
56
+ * .section-title section header label with border
57
+ * .status-banner verdict banner (add .safe/.danger)
58
+ * .model-card prediction card (add .safe/.danger)
59
+ * .prob-bar/.prob-fill 4px progress bar (add .safe/.danger to fill)
60
+ * .feature-item key-value pair row
61
+ * .features-grid auto-fill grid for feature items
62
+ * .models-grid auto-fit grid for cards
63
+ * .error-notice error message block
64
+ */
65
+
66
/* Base typography and page colours. */
body {
    font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;
    background: var(--bg);
    color: var(--text);
    min-height: 100vh;
    line-height: 1.4;
    font-weight: 400;
    -webkit-font-smoothing: antialiased;
}

/* Centered content column, capped at 960px. */
.container {
    max-width: 960px;
    margin: 0 auto;
    padding: 0 24px;
}
81
+
82
+ /* HEADER */
83
+ header {
84
+ padding: 48px 0 40px;
85
+ border-bottom: 1px solid var(--border);
86
+ }
87
+
88
+ .logo {
89
+ font-size: 11px;
90
+ font-weight: 700;
91
+ letter-spacing: 0.2em;
92
+ text-transform: uppercase;
93
+ margin-bottom: 4px;
94
+ }
95
+
96
+ .tagline {
97
+ font-size: 11px;
98
+ color: var(--text-secondary);
99
+ letter-spacing: 0.05em;
100
+ }
101
+
102
+ /* INPUT SECTION */
103
+ .input-section {
104
+ padding: 48px 0;
105
+ border-bottom: 1px solid var(--border);
106
+ }
107
+
108
+ .input-label {
109
+ font-size: 10px;
110
+ font-weight: 600;
111
+ letter-spacing: 0.15em;
112
+ text-transform: uppercase;
113
+ color: var(--text-secondary);
114
+ margin-bottom: 12px;
115
+ display: block;
116
+ }
117
+
118
+ .input-wrapper {
119
+ display: flex;
120
+ gap: 0;
121
+ margin-bottom: 16px;
122
+ }
123
+
124
+ input[type="text"] {
125
+ flex: 1;
126
+ padding: 16px 20px;
127
+ border: 1px solid var(--border);
128
+ border-right: none;
129
+ font-size: 14px;
130
+ font-family: inherit;
131
+ background: var(--bg);
132
+ color: var(--text);
133
+ outline: none;
134
+ transition: border-color 0.2s;
135
+ }
136
+
137
+ input[type="text"]:focus {
138
+ border-color: var(--text);
139
+ }
140
+
141
+ input[type="text"]::placeholder {
142
+ color: #999;
143
+ }
144
+
145
+ .btn {
146
+ padding: 16px 32px;
147
+ border: 1px solid var(--text);
148
+ background: var(--text);
149
+ color: var(--bg);
150
+ font-size: 11px;
151
+ font-weight: 600;
152
+ letter-spacing: 0.1em;
153
+ text-transform: uppercase;
154
+ cursor: pointer;
155
+ font-family: inherit;
156
+ transition: all 0.2s;
157
+ white-space: nowrap;
158
+ }
159
+
160
+ .btn:hover {
161
+ background: #333;
162
+ }
163
+
164
+ .btn:active {
165
+ transform: scale(0.98);
166
+ }
167
+
168
+ .btn-group {
169
+ display: flex;
170
+ gap: 8px;
171
+ }
172
+
173
+ .btn-secondary {
174
+ background: var(--bg);
175
+ color: var(--text);
176
+ }
177
+
178
+ .btn-secondary:hover {
179
+ background: var(--bg-secondary);
180
+ }
181
+
182
+ /* OUTLINE BUTTON — standalone outline link or button */
183
+ .btn-outline,
184
+ .learn-more-btn,
185
+ .show-more-btn,
186
+ .back-link {
187
+ display: inline-block;
188
+ padding: 10px 20px;
189
+ border: 1px solid var(--border);
190
+ background: transparent;
191
+ color: var(--text);
192
+ font-size: 11px;
193
+ font-weight: 600;
194
+ letter-spacing: 0.1em;
195
+ text-transform: uppercase;
196
+ text-decoration: none;
197
+ font-family: inherit;
198
+ cursor: pointer;
199
+ transition: all 0.2s;
200
+ }
201
+
202
+ .btn-outline:hover,
203
+ .learn-more-btn:hover,
204
+ .show-more-btn:hover,
205
+ .back-link:hover {
206
+ border-color: var(--text);
207
+ background: var(--bg-secondary);
208
+ }
209
+
210
+ /* LOADING */
211
+ .loading {
212
+ display: none;
213
+ padding: 80px 0;
214
+ text-align: center;
215
+ }
216
+
217
+ .loading-bar {
218
+ width: 48px;
219
+ height: 2px;
220
+ background: var(--border);
221
+ margin: 0 auto 16px;
222
+ position: relative;
223
+ overflow: hidden;
224
+ }
225
+
226
+ .loading-bar::after {
227
+ content: '';
228
+ position: absolute;
229
+ left: -50%;
230
+ width: 50%;
231
+ height: 100%;
232
+ background: var(--text);
233
+ animation: loading 1s ease-in-out infinite;
234
+ }
235
+
236
+ @keyframes loading {
237
+ 0% { left: -50%; }
238
+ 100% { left: 100%; }
239
+ }
240
+
241
+ .loading-text {
242
+ font-size: 10px;
243
+ letter-spacing: 0.15em;
244
+ text-transform: uppercase;
245
+ color: var(--text-secondary);
246
+ }
247
+
248
+ /* RESULTS */
249
+ .results {
250
+ display: none;
251
+ padding: 48px 0;
252
+ }
253
+
254
+ /* STATUS BANNER */
255
+ .status-banner {
256
+ padding: 32px;
257
+ margin-bottom: 32px;
258
+ text-align: center;
259
+ }
260
+
261
+ .status-banner.safe {
262
+ background: var(--safe-bg);
263
+ border-left: 4px solid var(--safe);
264
+ }
265
+
266
+ .status-banner.danger {
267
+ background: var(--danger-bg);
268
+ border-left: 4px solid var(--danger);
269
+ }
270
+
271
+ .status-icon {
272
+ font-size: 30px;
273
+ line-height: 1;
274
+ }
275
+
276
+ .status-banner.safe .status-icon {
277
+ color: var(--safe);
278
+ }
279
+
280
+ .status-banner.danger .status-icon {
281
+ color: var(--danger);
282
+ }
283
+
284
+ .status-title {
285
+ font-size: 28px;
286
+ font-weight: 700;
287
+ letter-spacing: 0.08em;
288
+ text-transform: uppercase;
289
+ }
290
+
291
+ .status-banner.safe .status-title {
292
+ color: var(--safe);
293
+ }
294
+
295
+ .status-banner.danger .status-title {
296
+ color: var(--danger);
297
+ }
298
+
299
+ .status-subtitle {
300
+ font-size: 11px;
301
+ color: var(--text-secondary);
302
+ letter-spacing: 0.08em;
303
+ text-transform: uppercase;
304
+ margin-top: 4px;
305
+ }
306
+
307
+ /* ENSEMBLE SCORE */
308
+ .ensemble-score {
309
+ display: flex;
310
+ align-items: baseline;
311
+ justify-content: center;
312
+ gap: 4px;
313
+ margin: 4px;
314
+ }
315
+
316
+ .status-banner.safe .model-confidence {
317
+ color: var(--safe);
318
+ }
319
+
320
+ .status-banner.danger .model-confidence {
321
+ color: var(--danger);
322
+ }
323
+
324
+ .ensemble-bar {
325
+ width: 100%;
326
+ max-width: 360px;
327
+ height: 4px;
328
+ background: var(--border);
329
+ margin: 10px auto 8px;
330
+ }
331
+
332
+ /* ensemble-bar uses .prob-fill for the fill element */
333
+
334
+ .status-kicker {
335
+ font-size: 10px;
336
+ font-weight: 600;
337
+ letter-spacing: 0.14em;
338
+ text-transform: uppercase;
339
+ color: var(--text-secondary);
340
+ margin-bottom: 10px;
341
+ }
342
+
343
+ .status-headline {
344
+ display: inline-flex;
345
+ align-items: center;
346
+ gap: 12px;
347
+ }
348
+
349
+ .status-headline > div:last-child {
350
+ text-align: left;
351
+ }
352
+
353
+ .banner-score-label {
354
+ font-size: 10px;
355
+ font-weight: 600;
356
+ letter-spacing: 0.1em;
357
+ text-transform: uppercase;
358
+ color: var(--text-secondary);
359
+ }
360
+
361
+ .banner-score-value {
362
+ font-size: 10px;
363
+ font-weight: 700;
364
+ letter-spacing: 0.04em;
365
+ }
366
+
367
+ .status-banner.safe .banner-score-value {
368
+ color: var(--safe);
369
+ }
370
+
371
+ .status-banner.danger .banner-score-value {
372
+ color: var(--danger);
373
+ }
374
+
375
+ .banner-score-note {
376
+ font-size: 10px;
377
+ color: var(--text-secondary);
378
+ letter-spacing: 0.08em;
379
+ text-transform: uppercase;
380
+ margin-bottom: 10px;
381
+ }
382
+
383
+ .status-votes {
384
+ font-size: 10px;
385
+ color: var(--text-secondary);
386
+ letter-spacing: 0.08em;
387
+ text-transform: uppercase;
388
+ }
389
+
390
+ /* URL DISPLAY */
391
+ .url-display {
392
+ padding: 16px 20px;
393
+ background: var(--bg-secondary);
394
+ font-family: 'SF Mono', 'Monaco', 'Inconsolata', monospace;
395
+ font-size: 13px;
396
+ word-break: break-all;
397
+ margin-bottom: 32px;
398
+ border-left: 2px solid var(--border);
399
+ }
400
+
401
+ /* SECTION TITLES */
402
+ .section-title {
403
+ font-size: 10px;
404
+ font-weight: 600;
405
+ letter-spacing: 0.15em;
406
+ text-transform: uppercase;
407
+ color: var(--text-secondary);
408
+ margin-bottom: 16px;
409
+ padding-bottom: 8px;
410
+ border-bottom: 1px solid var(--border);
411
+ }
412
+
413
+ /* TABS */
414
+ .tabs {
415
+ display: flex;
416
+ gap: 0;
417
+ margin-bottom: 32px;
418
+ border-bottom: 1px solid var(--border);
419
+ }
420
+
421
+ .tab {
422
+ padding: 12px 24px;
423
+ font-size: 11px;
424
+ font-weight: 600;
425
+ letter-spacing: 0.1em;
426
+ text-transform: uppercase;
427
+ background: none;
428
+ border: none;
429
+ border-bottom: 2px solid transparent;
430
+ cursor: pointer;
431
+ color: var(--text-secondary);
432
+ transition: all 0.2s;
433
+ font-family: inherit;
434
+ margin-bottom: -1px;
435
+ }
436
+
437
+ .tab:hover {
438
+ color: var(--text);
439
+ }
440
+
441
+ .tab.active {
442
+ color: var(--text);
443
+ border-bottom-color: var(--text);
444
+ }
445
+
446
+ .tab-content {
447
+ display: none;
448
+ }
449
+
450
+ .tab-content.active {
451
+ display: block;
452
+ }
453
+
454
+ /* MODEL CARDS */
455
+ .models-grid {
456
+ display: grid;
457
+ grid-template-columns: repeat(auto-fit, minmax(280px, 1fr));
458
+ gap: 16px;
459
+ margin-bottom: 40px;
460
+ }
461
+
462
+ .model-card {
463
+ padding: 24px;
464
+ border: 1px solid var(--border);
465
+ background: var(--bg);
466
+ }
467
+
468
+ .model-card.safe {
469
+ border-left: 3px solid var(--safe);
470
+ }
471
+
472
+ .model-card.danger {
473
+ border-left: 3px solid var(--danger);
474
+ }
475
+
476
+ .model-header {
477
+ display: flex;
478
+ justify-content: space-between;
479
+ align-items: flex-start;
480
+ margin-bottom: 16px;
481
+ }
482
+
483
+ .model-name {
484
+ font-size: 11px;
485
+ font-weight: 600;
486
+ letter-spacing: 0.1em;
487
+ text-transform: uppercase;
488
+ }
489
+
490
+ .model-prediction {
491
+ font-size: 11px;
492
+ font-weight: 700;
493
+ letter-spacing: 0.05em;
494
+ text-transform: uppercase;
495
+ padding: 4px 8px;
496
+ }
497
+
498
+ .model-prediction.safe {
499
+ color: var(--safe);
500
+ background: var(--safe-bg);
501
+ }
502
+
503
+ .model-prediction.danger {
504
+ color: var(--danger);
505
+ background: var(--danger-bg);
506
+ }
507
+
508
+ .model-confidence {
509
+ font-size: 10px;
510
+ font-weight: 700;
511
+ margin-bottom: 8px;
512
+ }
513
+
514
+ .model-confidence-label {
515
+ font-size: 10px;
516
+ color: var(--text-secondary);
517
+ letter-spacing: 0.1em;
518
+ text-transform: uppercase;
519
+ }
520
+
521
+ /* PROBABILITY BAR */
522
+ .prob-container {
523
+ margin-top: 16px;
524
+ }
525
+
526
+ .prob-row {
527
+ display: flex;
528
+ align-items: center;
529
+ gap: 12px;
530
+ margin-bottom: 8px;
531
+ }
532
+
533
+ .prob-label {
534
+ font-size: 10px;
535
+ text-transform: uppercase;
536
+ letter-spacing: 0.05em;
537
+ width: 70px;
538
+ color: var(--text-secondary);
539
+ }
540
+
541
+ .prob-bar {
542
+ flex: 1;
543
+ height: 4px;
544
+ background: var(--bg-secondary);
545
+ position: relative;
546
+ }
547
+
548
+ .prob-fill {
549
+ height: 100%;
550
+ transition: width 0.5s ease;
551
+ }
552
+
553
+ .prob-fill.safe {
554
+ background: var(--safe);
555
+ }
556
+
557
+ .prob-fill.danger {
558
+ background: var(--danger);
559
+ }
560
+
561
+ .prob-value {
562
+ font-size: 11px;
563
+ font-weight: 600;
564
+ width: 45px;
565
+ text-align: right;
566
+ }
567
+
568
+ /* FEATURES GRID */
569
+ .features-grid {
570
+ display: grid;
571
+ grid-template-columns: repeat(auto-fill, minmax(220px, 1fr));
572
+ gap: 8px;
573
+ margin-bottom: 32px;
574
+ }
575
+
576
+ .feature-item {
577
+ display: flex;
578
+ justify-content: space-between;
579
+ align-items: center;
580
+ padding: 12px 16px;
581
+ background: var(--bg-secondary);
582
+ border-left: 2px solid var(--border);
583
+ font-size: 12px;
584
+ transition: border-color 0.2s;
585
+ }
586
+
587
+ .feature-item:hover {
588
+ border-left-color: var(--text);
589
+ }
590
+
591
+ .feature-item.highlight-safe {
592
+ border-left-color: var(--safe);
593
+ background: var(--safe-bg);
594
+ }
595
+
596
+ .feature-item.highlight-danger {
597
+ border-left-color: var(--danger);
598
+ background: var(--danger-bg);
599
+ }
600
+
601
+ .feature-label {
602
+ color: var(--text-secondary);
603
+ font-size: 11px;
604
+ }
605
+
606
+ .feature-value {
607
+ font-weight: 600;
608
+ font-family: 'SF Mono', 'Monaco', monospace;
609
+ font-size: 12px;
610
+ }
611
+
612
+ .feature-value.true {
613
+ color: var(--safe);
614
+ }
615
+
616
+ .feature-value.false {
617
+ color: var(--danger);
618
+ }
619
+
620
+ .show-more-btn { width: 100%; margin-top: 16px; }
621
+
622
+ [id^="hiddenFeatures-"] {
623
+ display: contents;
624
+ }
625
+
626
+ [id^="hiddenFeatures-"].features-hidden {
627
+ display: none;
628
+ }
629
+
630
+ /* FOOTER */
631
+ footer {
632
+ padding: 32px 0;
633
+ border-top: 1px solid var(--border);
634
+ text-align: center;
635
+ }
636
+
637
+ .footer-text {
638
+ font-size: 10px;
639
+ color: var(--text-secondary);
640
+ letter-spacing: 0.1em;
641
+ text-transform: uppercase;
642
+ }
643
+
644
+ .learn-more-btn { margin-top: 16px; padding: 12px 28px; }
645
+
646
+ /* TAB COUNT BADGE */
647
+ .tab-count {
648
+ display: inline-block;
649
+ background: var(--bg-secondary);
650
+ border: 1px solid var(--border);
651
+ font-size: 10px;
652
+ font-weight: 700;
653
+ padding: 1px 6px;
654
+ margin-left: 4px;
655
+ vertical-align: middle;
656
+ }
657
+
658
+ .tab.active .tab-count {
659
+ background: var(--text);
660
+ color: var(--bg);
661
+ border-color: var(--text);
662
+ }
663
+
664
+
665
+
666
+ /* FEATURE SPLIT BAR */
667
+ .feature-split {
668
+ margin-top: 16px;
669
+ }
670
+
671
+ .feature-split-bar {
672
+ display: flex;
673
+ height: 32px;
674
+ overflow: hidden;
675
+ border: 1px solid var(--border);
676
+ }
677
+
678
+ .split-url {
679
+ background: var(--bg-secondary);
680
+ display: flex;
681
+ align-items: center;
682
+ justify-content: center;
683
+ font-size: 10px;
684
+ font-weight: 600;
685
+ letter-spacing: 0.05em;
686
+ text-transform: uppercase;
687
+ border-right: 1px solid var(--border);
688
+ }
689
+
690
+ .split-html {
691
+ background: var(--text);
692
+ color: var(--bg);
693
+ display: flex;
694
+ align-items: center;
695
+ justify-content: center;
696
+ font-size: 10px;
697
+ font-weight: 600;
698
+ letter-spacing: 0.05em;
699
+ text-transform: uppercase;
700
+ }
701
+
702
+ /* COMBINED FEATURES TABS */
703
+ .combined-features-tabs {
704
+ margin-top: 32px;
705
+ }
706
+
707
+ /* ERROR NOTICE */
708
+ .error-notice {
709
+ font-size: 12px;
710
+ color: var(--danger);
711
+ padding: 16px 20px;
712
+ background: var(--danger-bg);
713
+ border-left: 2px solid var(--danger);
714
+ margin: 16px 0;
715
+ }
716
+
717
+ /* RESPONSIVE */
718
+ @media (max-width: 640px) {
719
+ header {
720
+ padding: 32px 0;
721
+ }
722
+
723
+ .input-section {
724
+ padding: 32px 0;
725
+ }
726
+
727
+ .input-wrapper {
728
+ flex-direction: column;
729
+ }
730
+
731
+ input[type="text"] {
732
+ border-right: 1px solid var(--border);
733
+ border-bottom: none;
734
+ }
735
+
736
+ .btn {
737
+ width: 100%;
738
+ }
739
+
740
+ .btn-group {
741
+ flex-direction: column;
742
+ }
743
+
744
+ .models-grid {
745
+ grid-template-columns: 1fr;
746
+ }
747
+
748
+ .status-title {
749
+ font-size: 22px;
750
+ }
751
+
752
+ .status-headline {
753
+ gap: 10px;
754
+ }
755
+
756
+ .status-icon {
757
+ font-size: 26px;
758
+ }
759
+
760
+ .banner-score-value {
761
+ font-size: 14px;
762
+ }
763
+ }
764
+
765
+ /* =============================================
766
+ MODELS PAGE
767
+ ============================================= */
768
+
769
+ /* MODELS PAGE - HEADER */
770
+ .models-page header {
771
+ display: flex;
772
+ justify-content: space-between;
773
+ align-items: flex-end;
774
+ }
775
+
776
+ .header-left {
777
+ display: flex;
778
+ flex-direction: column;
779
+ }
780
+
781
+ .logo a {
782
+ color: var(--text);
783
+ text-decoration: none;
784
+ }
785
+
786
+ .logo a:hover {
787
+ opacity: 0.7;
788
+ }
789
+
790
+ .back-link { padding: 8px 16px; }
791
+
792
+ /* PAGE TITLE */
793
+ .page-title-section {
794
+ padding: 48px 0;
795
+ border-bottom: 1px solid var(--border);
796
+ }
797
+
798
+ .page-title {
799
+ font-size: 28px;
800
+ font-weight: 700;
801
+ letter-spacing: 0.02em;
802
+ margin-bottom: 8px;
803
+ }
804
+
805
+ .page-description {
806
+ font-size: 13px;
807
+ color: var(--text-secondary);
808
+ line-height: 1.6;
809
+ max-width: 640px;
810
+ }
811
+
812
+ /* SECTION */
813
+ .section {
814
+ padding: 40px 0;
815
+ border-bottom: 1px solid var(--border);
816
+ }
817
+
818
+ .section:last-child {
819
+ border-bottom: none;
820
+ }
821
+
822
+ .section-subtitle {
823
+ font-size: 13px;
824
+ color: var(--text-secondary);
825
+ margin-bottom: 24px;
826
+ line-height: 1.5;
827
+ }
828
+
829
+ /* COMPARISON TABLE */
830
+ .comparison-table {
831
+ width: 100%;
832
+ border-collapse: collapse;
833
+ font-size: 13px;
834
+ margin-bottom: 24px;
835
+ }
836
+
837
+ .comparison-table th {
838
+ text-align: left;
839
+ padding: 12px 16px;
840
+ font-size: 10px;
841
+ font-weight: 600;
842
+ letter-spacing: 0.12em;
843
+ text-transform: uppercase;
844
+ color: var(--text-secondary);
845
+ border-bottom: 2px solid var(--border);
846
+ white-space: nowrap;
847
+ }
848
+
849
+ .comparison-table td {
850
+ padding: 12px 16px;
851
+ border-bottom: 1px solid var(--border);
852
+ }
853
+
854
+ .comparison-table tr:hover td {
855
+ background: var(--bg-secondary);
856
+ }
857
+
858
+ .comparison-table .model-name-cell {
859
+ font-weight: 600;
860
+ font-size: 12px;
861
+ }
862
+
863
+ .comparison-table .best {
864
+ color: var(--safe);
865
+ font-weight: 700;
866
+ }
867
+
868
+ /* METRIC CARDS */
869
+ .metrics-grid {
870
+ display: grid;
871
+ grid-template-columns: repeat(auto-fit, minmax(160px, 1fr));
872
+ gap: 16px;
873
+ margin-bottom: 32px;
874
+ }
875
+
876
+ .metric-card {
877
+ padding: 20px;
878
+ border: 1px solid var(--border);
879
+ text-align: center;
880
+ }
881
+
882
+ .metric-value {
883
+ font-size: 28px;
884
+ font-weight: 700;
885
+ margin-bottom: 4px;
886
+ }
887
+
888
+ .metric-label {
889
+ font-size: 10px;
890
+ color: var(--text-secondary);
891
+ letter-spacing: 0.1em;
892
+ text-transform: uppercase;
893
+ }
894
+
895
+ .metric-value.highlight {
896
+ color: var(--safe);
897
+ }
898
+
899
+ /* MODEL DETAIL CARDS */
900
+ .model-detail {
901
+ margin-bottom: 40px;
902
+ padding: 32px;
903
+ border: 1px solid var(--border);
904
+ border-left: 3px solid var(--accent);
905
+ }
906
+
907
+ .model-detail-header {
908
+ display: flex;
909
+ justify-content: space-between;
910
+ align-items: flex-start;
911
+ margin-bottom: 24px;
912
+ flex-wrap: wrap;
913
+ gap: 12px;
914
+ }
915
+
916
+ .model-detail-name {
917
+ font-size: 16px;
918
+ font-weight: 700;
919
+ letter-spacing: 0.02em;
920
+ }
921
+
922
+ .model-detail-type {
923
+ font-size: 10px;
924
+ font-weight: 600;
925
+ letter-spacing: 0.1em;
926
+ text-transform: uppercase;
927
+ color: var(--text-secondary);
928
+ padding: 4px 10px;
929
+ border: 1px solid var(--border);
930
+ }
931
+
932
+ /* CONFUSION MATRIX */
933
+ .confusion-matrix {
934
+ display: inline-grid;
935
+ grid-template-columns: auto auto auto;
936
+ gap: 0;
937
+ margin: 16px 0;
938
+ font-size: 13px;
939
+ }
940
+
941
+ .cm-header {
942
+ padding: 8px 20px;
943
+ font-size: 10px;
944
+ font-weight: 600;
945
+ letter-spacing: 0.08em;
946
+ text-transform: uppercase;
947
+ color: var(--text-secondary);
948
+ text-align: center;
949
+ }
950
+
951
+ .cm-cell {
952
+ padding: 16px 24px;
953
+ text-align: center;
954
+ font-weight: 700;
955
+ font-size: 14px;
956
+ font-family: 'SF Mono', 'Monaco', 'Inconsolata', monospace;
957
+ border: 1px solid var(--border);
958
+ }
959
+
960
+ .cm-tp { background: var(--safe-bg); color: var(--safe); }
961
+ .cm-tn { background: var(--safe-bg); color: var(--safe); }
962
+ .cm-fp { background: var(--danger-bg); color: var(--danger); }
963
+ .cm-fn { background: var(--danger-bg); color: var(--danger); }
964
+
965
+ .cm-label {
966
+ padding: 8px 16px;
967
+ font-size: 10px;
968
+ font-weight: 600;
969
+ letter-spacing: 0.08em;
970
+ text-transform: uppercase;
971
+ color: var(--text-secondary);
972
+ display: flex;
973
+ align-items: center;
974
+ justify-content: center;
975
+ }
976
+
977
+ /* FEATURES LIST */
978
+ .features-list {
979
+ display: grid;
980
+ grid-template-columns: 1fr;
981
+ gap: 4px;
982
+ margin-top: 16px;
983
+ }
984
+
985
+ .feature-row {
986
+ display: flex;
987
+ justify-content: space-between;
988
+ align-items: center;
989
+ padding: 10px 16px;
990
+ background: var(--bg-secondary);
991
+ font-size: 12px;
992
+ }
993
+
994
+ .feature-row:nth-child(even) {
995
+ background: var(--bg);
996
+ }
997
+
998
+ .feature-rank {
999
+ font-size: 10px;
1000
+ color: var(--text-secondary);
1001
+ font-weight: 600;
1002
+ width: 24px;
1003
+ flex-shrink: 0;
1004
+ }
1005
+
1006
+ .feature-name {
1007
+ flex: 1;
1008
+ font-family: 'SF Mono', 'Monaco', 'Inconsolata', monospace;
1009
+ font-size: 11px;
1010
+ }
1011
+
1012
+ .feature-importance {
1013
+ font-weight: 700;
1014
+ font-family: 'SF Mono', 'Monaco', 'Inconsolata', monospace;
1015
+ font-size: 11px;
1016
+ text-align: right;
1017
+ width: 70px;
1018
+ flex-shrink: 0;
1019
+ }
1020
+
1021
+ .importance-bar-bg {
1022
+ flex: 1;
1023
+ max-width: 120px;
1024
+ height: 4px;
1025
+ background: var(--border);
1026
+ margin: 0 12px;
1027
+ flex-shrink: 0;
1028
+ }
1029
+
1030
+ .importance-bar-fill {
1031
+ height: 100%;
1032
+ background: var(--text);
1033
+ }
1034
+
1035
+ /* HYPERPARAMS */
1036
+ .params-grid {
1037
+ display: grid;
1038
+ grid-template-columns: repeat(auto-fill, minmax(200px, 1fr));
1039
+ gap: 8px;
1040
+ }
1041
+
1042
+ .param-item {
1043
+ display: flex;
1044
+ justify-content: space-between;
1045
+ padding: 10px 16px;
1046
+ background: var(--bg-secondary);
1047
+ font-size: 12px;
1048
+ }
1049
+
1050
+ .param-key {
1051
+ color: var(--text-secondary);
1052
+ font-size: 11px;
1053
+ }
1054
+
1055
+ .param-value {
1056
+ font-weight: 600;
1057
+ font-family: 'SF Mono', 'Monaco', 'Inconsolata', monospace;
1058
+ font-size: 11px;
1059
+ }
1060
+
1061
+ /* PIPELINE DIAGRAM */
1062
+ .pipeline {
1063
+ display: flex;
1064
+ margin: 24px 0;
1065
+ counter-reset: step;
1066
+ }
1067
+
1068
+ .pipeline-step {
1069
+ flex: 1;
1070
+ display: flex;
1071
+ align-items: center;
1072
+ gap: 10px;
1073
+ padding: 10px 12px;
1074
+ border: 1px solid var(--border);
1075
+ border-right: none;
1076
+ font-size: 11px;
1077
+ font-weight: 600;
1078
+ letter-spacing: 0.05em;
1079
+ text-transform: uppercase;
1080
+ background: var(--bg);
1081
+ position: relative;
1082
+ }
1083
+
1084
+ .pipeline-step:last-child { border-right: 1px solid var(--border); }
1085
+
1086
+ /* Arrow connector between steps */
1087
+ .pipeline-step:not(:last-child)::after {
1088
+ content: '\2192';
1089
+ position: absolute;
1090
+ right: -4px;
1091
+ top: 50%;
1092
+ transform: translate(50%, -50%);
1093
+ font-size: 11px;
1094
+ color: var(--text-secondary);
1095
+ background: var(--bg);
1096
+ z-index: 1;
1097
+ padding: 2px 0;
1098
+ line-height: 1;
1099
+ }
1100
+
1101
+ .pipeline-step .step-number {
1102
+ width: 22px;
1103
+ height: 22px;
1104
+ display: flex;
1105
+ align-items: center;
1106
+ justify-content: center;
1107
+ border-radius: 50%;
1108
+ background: var(--text);
1109
+ color: var(--bg);
1110
+ font-size: 11px;
1111
+ font-weight: 700;
1112
+ flex-shrink: 0;
1113
+ line-height: 1;
1114
+ }
1115
+
1116
+ /* COLLAPSIBLE TOGGLE */
1117
+ .collapsible-toggle {
1118
+ cursor: pointer;
1119
+ display: flex;
1120
+ align-items: center;
1121
+ user-select: none;
1122
+ }
1123
+
1124
+ .collapsible-toggle:hover {
1125
+ color: var(--text);
1126
+ }
1127
+
1128
+ .toggle-icon {
1129
+ margin-left: auto;
1130
+ font-size: 16px;
1131
+ font-weight: 400;
1132
+ color: var(--text-secondary);
1133
+ transition: transform 0.2s;
1134
+ flex-shrink: 0;
1135
+ width: 20px;
1136
+ text-align: center;
1137
+ line-height: 1;
1138
+ }
1139
+
1140
+ .collapsible-content {
1141
+ display: none;
1142
+ padding-top: 8px;
1143
+ }
1144
+
1145
+ .collapsible-content.open {
1146
+ display: block;
1147
+ }
1148
+
1149
+ /* FEATURE GRID */
1150
+ .feature-count {
1151
+ font-weight: 400;
1152
+ color: var(--text-secondary);
1153
+ letter-spacing: 0.05em;
1154
+ margin-left: 8px;
1155
+ }
1156
+
1157
+ .feature-category-label {
1158
+ font-size: 12px;
1159
+ font-weight: 700;
1160
+ letter-spacing: 0.04em;
1161
+ margin: 24px 0 10px;
1162
+ color: var(--text);
1163
+ }
1164
+
1165
+ .feature-category-label:first-of-type {
1166
+ margin-top: 0;
1167
+ }
1168
+
1169
+ .feature-grid {
1170
+ display: grid;
1171
+ grid-template-columns: repeat(auto-fill, minmax(200px, 1fr));
1172
+ gap: 6px;
1173
+ }
1174
+
1175
+ .feature-chip {
1176
+ position: relative;
1177
+ padding: 8px 12px;
1178
+ font-family: 'SF Mono', 'Monaco', 'Inconsolata', monospace;
1179
+ font-size: 11px;
1180
+ background: var(--bg-secondary);
1181
+ border: 1px solid var(--border);
1182
+ cursor: default;
1183
+ transition: all 0.15s;
1184
+ white-space: nowrap;
1185
+ overflow: hidden;
1186
+ text-overflow: ellipsis;
1187
+ }
1188
+
1189
+ .feature-chip:hover {
1190
+ border-color: var(--text);
1191
+ background: var(--bg);
1192
+ overflow: visible;
1193
+ }
1194
+
1195
+ .feature-chip:hover::after {
1196
+ content: attr(data-tip);
1197
+ position: absolute;
1198
+ bottom: calc(100% + 10px);
1199
+ left: 50%;
1200
+ transform: translateX(-50%);
1201
+ padding: 12px 16px;
1202
+ background: var(--text);
1203
+ color: var(--bg);
1204
+ font-family: 'Helvetica Neue', Helvetica, Arial, sans-serif;
1205
+ font-size: 12px;
1206
+ line-height: 1.5;
1207
+ white-space: normal;
1208
+ width: max-content;
1209
+ max-width: 300px;
1210
+ z-index: 1000;
1211
+ pointer-events: none;
1212
+ box-shadow: 0 4px 16px rgba(0, 0, 0, 0.2);
1213
+ }
1214
+
1215
+ .feature-chip:hover::before {
1216
+ content: '';
1217
+ position: absolute;
1218
+ bottom: calc(100% + 4px);
1219
+ left: 50%;
1220
+ transform: translateX(-50%);
1221
+ border: 6px solid transparent;
1222
+ border-top-color: var(--text);
1223
+ z-index: 1000;
1224
+ pointer-events: none;
1225
+ }
1226
+
1227
+ /* SUBSECTION SPACER */
1228
+ .subsection {
1229
+ margin-top: 24px;
1230
+ }
1231
+
1232
+ /* TABLE SCROLL WRAPPER */
1233
+ .table-scroll {
1234
+ overflow-x: auto;
1235
+ margin-bottom: 32px;
1236
+ }
1237
+
1238
+ /* INSIGHT CARDS (comparison tab) */
1239
+ .insights-grid {
1240
+ display: grid;
1241
+ grid-template-columns: repeat(auto-fit, minmax(280px, 1fr));
1242
+ gap: 16px;
1243
+ }
1244
+
1245
+ .insight-card {
1246
+ padding: 20px;
1247
+ border: 1px solid var(--border);
1248
+ }
1249
+
1250
+ .insight-card.insight-safe { border-left: 3px solid var(--safe); }
1251
+ .insight-card.insight-accent { border-left: 3px solid var(--accent); }
1252
+
1253
+ .insight-label {
1254
+ font-size: 11px;
1255
+ font-weight: 600;
1256
+ letter-spacing: 0.1em;
1257
+ text-transform: uppercase;
1258
+ margin-bottom: 8px;
1259
+ }
1260
+
1261
+ .insight-title {
1262
+ font-size: 16px;
1263
+ font-weight: 700;
1264
+ margin-bottom: 4px;
1265
+ }
1266
+
1267
+ .insight-desc {
1268
+ font-size: 12px;
1269
+ color: var(--text-secondary);
1270
+ }
1271
+
1272
+ /* MODELS PAGE RESPONSIVE */
1273
+ @media (max-width: 640px) {
1274
+ .models-page header {
1275
+ flex-direction: column;
1276
+ align-items: flex-start;
1277
+ gap: 16px;
1278
+ }
1279
+
1280
+ .page-title {
1281
+ font-size: 22px;
1282
+ }
1283
+
1284
+ .comparison-table {
1285
+ font-size: 11px;
1286
+ }
1287
+
1288
+ .comparison-table th,
1289
+ .comparison-table td {
1290
+ padding: 8px 10px;
1291
+ }
1292
+
1293
+ .metrics-grid {
1294
+ grid-template-columns: repeat(2, 1fr);
1295
+ }
1296
+
1297
+ .model-detail {
1298
+ padding: 20px;
1299
+ }
1300
+
1301
+ .pipeline {
1302
+ flex-wrap: wrap;
1303
+ }
1304
+
1305
+ .pipeline-step {
1306
+ flex: 1 1 40%;
1307
+ border-right: 1px solid var(--border);
1308
+ margin: -0.5px;
1309
+ }
1310
+
1311
+ .pipeline-step:not(:last-child)::after { display: none; }
1312
+
1313
+ .confusion-matrix {
1314
+ font-size: 11px;
1315
+ }
1316
+
1317
+ .cm-cell {
1318
+ padding: 12px 16px;
1319
+ font-size: 12px;
1320
+ }
1321
+
1322
+ .feature-grid {
1323
+ grid-template-columns: repeat(2, 1fr);
1324
+ }
1325
+ }
start_server.bat ADDED
@@ -0,0 +1,37 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
@echo off
REM Phishing Detection Server Startup Script
REM Starts the FastAPI server for phishing detection.
REM NOTE(review): start_server.sh installs server/requirements.txt while this
REM script installs the repo-root requirements.txt -- confirm which is intended.

REM Run relative to this script's own directory so the venv\ and server\
REM checks below work no matter where the script is invoked from.
cd /d "%~dp0"

echo ============================================
echo  Phishing Detection Server
echo ============================================
echo.

REM Check if virtual environment exists
if not exist "venv\" (
    echo ERROR: Virtual environment not found!
    echo Please run: python -m venv venv
    pause
    exit /b 1
)

REM Activate virtual environment
echo [1/3] Activating virtual environment...
call venv\Scripts\activate.bat

REM Install server dependencies if needed
echo [2/3] Checking dependencies...
pip install -q -r requirements.txt

REM Start the server
echo [3/3] Starting server...
echo.
echo ============================================
echo Server running at: http://localhost:8000
echo API Docs: http://localhost:8000/docs
echo Press Ctrl+C to stop
echo ============================================
echo.

REM Fail fast if the server directory is missing instead of launching
REM uvicorn from the wrong working directory.
if not exist "server\" (
    echo ERROR: server directory not found!
    exit /b 1
)
cd server
python -m uvicorn app:app --host 0.0.0.0 --port 8000 --reload
start_server.sh ADDED
@@ -0,0 +1,35 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
#!/bin/bash
# Phishing Detection Server Startup Script (Linux/Mac)
# NOTE(review): this installs server/requirements.txt while start_server.bat
# installs the repo-root requirements.txt -- confirm which is intended.

# Abort on any failed step (missing venv activation, failed pip install)
# instead of launching the server in a broken environment.
set -e

# Run relative to this script's own directory so the venv/ check and the
# cd into server/ work no matter where the script is invoked from.
cd "$(dirname "$0")"

echo "============================================"
echo " Phishing Detection Server"
echo "============================================"
echo ""

# Check if virtual environment exists
if [ ! -d "venv" ]; then
    echo "ERROR: Virtual environment not found!"
    echo "Please run: python -m venv venv"
    exit 1
fi

# Activate virtual environment
echo "[1/3] Activating virtual environment..."
source venv/bin/activate

# Install server dependencies if needed
echo "[2/3] Checking dependencies..."
pip install -q -r server/requirements.txt

# Start the server
echo "[3/3] Starting server..."
echo ""
echo "============================================"
echo " Server running at: http://localhost:8000"
echo " API Docs: http://localhost:8000/docs"
echo " Press Ctrl+C to stop"
echo "============================================"
echo ""

cd server
python -m uvicorn app:app --host 0.0.0.0 --port 8000 --reload