ArabovMK committed on
Commit bd3c428 · 1 Parent(s): f716d2c

Update all files

.gitignore ADDED
@@ -0,0 +1,9 @@
+ .venv/
+ .venv
+ __pycache__/
+ *__pycache__/
+ *.pyc
+ *.pyo
+ *.pyd
+ .Python
+ streamlit_results/
Dockerfile CHANGED
@@ -17,4 +17,4 @@ EXPOSE 8501

  HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health

- ENTRYPOINT ["streamlit", "run", "src/streamlit_app.py", "--server.port=8501", "--server.address=0.0.0.0"]
+ ENTRYPOINT ["streamlit", "run", "streamlit/streamlit_app.py", "--server.port=8501", "--server.address=0.0.0.0"]
README.md CHANGED
@@ -1,20 +1,268 @@
  ---
- title: TimeFlowPro
- emoji: 🚀
- colorFrom: red
- colorTo: red
+ title: TimeFlow Pro
+ emoji: 📊
+ colorFrom: blue
+ colorTo: indigo
  sdk: docker
- app_port: 8501
- tags:
-   - streamlit
- pinned: false
- short_description: TimeFlowPro
- license: mit
+ pinned: true
+ app_file: app.py
+ sdk_version: 1.52.2
  ---

- # Welcome to Streamlit!
-
- Edit `/src/streamlit_app.py` to customize this app to your heart's desire. :heart:
-
- If you have any questions, check out our [documentation](https://docs.streamlit.io) and [community forums](https://discuss.streamlit.io).
+ # 📊 TimeFlow Pro
+
+ <div align="center">
+
+ **Intelligent Time Series Data Analysis and Preprocessing Platform**
+
+ *Advanced pipeline for data preparation and feature engineering*
+
+ [![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face%20Space-blue)](https://huggingface.co/spaces/your-username/timeflow-pro)
+ [![Streamlit](https://img.shields.io/badge/Interface-Streamlit-FF4B4B)](https://streamlit.io)
+ [![Python](https://img.shields.io/badge/Python-3.9+-blue)](https://python.org)
+
+ </div>
+
+ ## 🌟 Overview
+
+ TimeFlow Pro is a comprehensive platform for time series data analysis, preprocessing, and feature engineering. Designed for data scientists and analysts, it provides an intuitive interface for transforming raw time series data into ML-ready datasets with advanced preprocessing capabilities.
+
+ ## 🚀 Key Features
+
+ ### 📈 **Data Analysis & Visualization**
+ - **Interactive Data Exploration**: Real-time preview and statistics
+ - **Missing Value Analysis**: Smart detection and handling strategies
+ - **Outlier Detection**: Multiple methods including IQR, Z-Score, Isolation Forest
+ - **Temporal Analysis**: Seasonality detection, trend analysis, decomposition
+
+ ### ⚙️ **Advanced Preprocessing Pipeline**
+ - **Feature Engineering**: Automatic lag features, rolling statistics, seasonal components
+ - **Stationarity Checking**: ADF tests and transformation suggestions
+ - **Data Scaling**: Robust, Standard, MinMax, and custom scaling methods
+ - **Feature Selection**: Correlation, variance, mutual information, RF importance
+
+ ### 🏗️ **ML-Ready Outputs**
+ - **Train/Validation/Test Splits**: Time-based or random splitting
+ - **Multiple Export Formats**: CSV, Parquet, Excel, JSON
+ - **Model Integration**: Ready-to-use datasets for scikit-learn, XGBoost, LightGBM
+ - **Visual Reports**: Comprehensive pipeline execution reports
+
+ ## 🎮 Quick Start
+
+ ### 1. **Upload Your Data**
+ - Support for CSV, Excel, Parquet formats
+ - Automatic date parsing and validation
+ - Smart column type detection
+
+ ### 2. **Configure Pipeline**
+ ```python
+ # Example configuration
+ config = {
+     'target_column': 'sales',
+     'test_size': 0.2,
+     'max_lags': 5,
+     'seasonal_period': 365,
+     'scaling_method': 'robust'
+ }
+ ```
+
+ ### 3. **Run Pipeline & Export**
+ - Execute full preprocessing pipeline
+ - Download processed data
+ - Get feature importance reports
+ - Export modeling datasets
+
+ ## 📊 Technical Architecture
+
+ ### 🔧 **Pipeline Components**
+ ```
+ Data Loading → Validation → Missing Handling → Outlier Treatment
+                              ↓
+ Feature Engineering → Stationarity Check → Correlation Analysis
+                              ↓
+ Data Splitting → Scaling → Feature Selection → Final Validation
+ ```
+
+ ### 🏆 **Core Features**
+ - **Multi-stage Validation**: Raw, processed, and final data validation
+ - **Memory Optimization**: Efficient handling of large datasets
+ - **Error Recovery**: Graceful handling of pipeline failures
+ - **Reproducible Results**: Configuration saving and logging
+
+ ## 📚 Use Cases
+
+ ### 🏢 **Business Analytics**
+ - Sales forecasting and trend analysis
+ - Inventory optimization
+ - Customer behavior prediction
+ - Financial time series analysis
+
+ ### 🏭 **Industrial Applications**
+ - Sensor data preprocessing
+ - Predictive maintenance
+ - Quality control monitoring
+ - Energy consumption forecasting
+
+ ### 🎓 **Academic Research**
+ - Time series modeling experiments
+ - Feature engineering research
+ - Algorithm comparison studies
+ - Educational tool for data science
+
+ ## 🛠️ Installation
+
+ ### Local Development
+ ```bash
+ # Clone repository
+ git clone https://huggingface.co/spaces/your-username/timeflow-pro
+ cd timeflow-pro
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Run application
+ streamlit run app.py
+ ```
+
+ ### Docker Deployment
+ ```bash
+ # Build Docker image
+ docker build -t timeflow-pro .
+
+ # Run container
+ docker run -p 8501:8501 timeflow-pro
+ ```
+
+ ## 🌐 API Usage Example
+
+ ```python
+ from timeflow_pro import TimeFlowPipeline
+ import pandas as pd
+
+ # Load your data
+ data = pd.read_csv('your_data.csv')
+
+ # Configure pipeline
+ config = {
+     'target_column': 'target',
+     'test_size': 0.2,
+     'max_lags': 7,
+     'seasonal_period': 30
+ }
+
+ # Create and run pipeline
+ pipeline = TimeFlowPipeline(config)
+ processed_data = pipeline.run(data)
+
+ # Get modeling data
+ modeling_data = pipeline.get_modeling_data()
+ X_train, y_train = modeling_data['X_train'], modeling_data['y_train']
+ ```
+
+ ## 📈 Performance Benchmarks
+
+ | Dataset Size | Processing Time | Memory Usage | Features Generated |
+ |--------------|-----------------|--------------|--------------------|
+ | 10K rows     | ~5 seconds      | <500 MB      | 50-100 features    |
+ | 100K rows    | ~30 seconds     | <1 GB        | 100-200 features   |
+ | 1M rows      | ~5 minutes      | <2 GB        | 200-500 features   |
+
+ ## 🔧 Configuration Options
+
+ ### **Data Processing**
+ - `missing_threshold`: Threshold for column removal (0.0-0.5)
+ - `outlier_method`: IQR, Z-Score, or Isolation Forest
+ - `scaling_method`: Robust, Standard, MinMax, or None
+
+ ### **Feature Engineering**
+ - `max_lags`: Maximum lag features (1-20)
+ - `seasonal_period`: Seasonal window (7, 30, 90, 365)
+ - `rolling_windows`: List of rolling windows [7, 30, 90]
+
+ ### **Model Preparation**
+ - `feature_selection_method`: Correlation, Variance, RF, Mutual Info
+ - `max_features`: Maximum features to select (5-100)
+ - `split_method`: Time-based or random splitting
+
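For illustration, a minimal sketch of how the options above might be combined, assuming the `Config` dataclass from `config/config.py` in this commit (the specific values are arbitrary):

```python
from config.config import Config

# One configuration touching each option group documented above.
config = Config(
    # Data processing
    missing_threshold=0.3,
    outlier_method='iqr',
    scaling_method='robust',
    # Feature engineering
    max_lags=7,
    seasonal_period=30,
    rolling_windows=[7, 30, 90],
    # Model preparation
    feature_selection_method='correlation',
    max_features=50,
    split_method='time',
)
```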
+ ## 📋 Requirements
+
+ ### **Core Dependencies**
+ ```txt
+ streamlit>=1.28.0
+ pandas>=2.0.0
+ numpy>=1.24.0
+ plotly>=5.17.0
+ scikit-learn>=1.3.0
+ ```
+
+ ### **Optional Dependencies**
+ ```txt
+ xgboost>=2.0.0       # For XGBoost feature importance
+ lightgbm>=4.0.0      # For LightGBM integration
+ statsmodels>=0.14.0  # For advanced time series analysis
+ ```
+
+ ## 🤝 Contributing
+
+ We welcome contributions! Here's how you can help:
+
+ ### **Areas for Contribution**
+ 1. **New Feature Engineering Methods**
+ 2. **Additional Visualization Types**
+ 3. **Export Format Support**
+ 4. **Performance Optimizations**
+ 5. **Documentation Improvements**
+
+ ### **Development Workflow**
+ ```bash
+ # 1. Fork the repository
+ # 2. Create feature branch
+ git checkout -b feature/new-feature
+
+ # 3. Make changes and test
+ # 4. Submit pull request
+ ```
+
+ ## 📜 License
+
+ This project is licensed under the **MIT License** - see the [LICENSE](LICENSE) file for details.
+
+ ## 🙏 Acknowledgments
+
+ ### **Special Thanks To:**
+ - **Streamlit Team** for the amazing framework
+ - **Hugging Face** for hosting the Space
+ - **Open Source Community** for invaluable libraries
+ - **All Contributors** who helped improve TimeFlow Pro
+
+ ### **Built With:**
+ - 🐍 Python
+ - 📊 Streamlit
+ - 🎨 Plotly
+ - 🔧 Scikit-learn
+ - 📈 Pandas & NumPy
+
+ ## 📞 Support & Contact
+
+ ### **Get Help:**
+ - 📧 **Email**: cool.araby@gmail.com
+ - 💬 **Issues**: [GitHub Issues](https://github.com/your-username/timeflow-pro/issues)
+ - 💡 **Discussions**: [Community Forum](https://github.com/your-username/timeflow-pro/discussions)
+
+ ### **Stay Updated:**
+ - ⭐ **Star** the repository
+ - 👁️ **Watch** for releases
+ - 🔔 **Enable notifications**
+
+ ---
+
+ <div align="center">
+
+ **Transform Your Time Series Data with Ease**
+
+ *TimeFlow Pro - Making Data Preparation Simple and Powerful*
+
+ [![Follow on Hugging Face](https://img.shields.io/badge/Follow%20on-🤗%20Hugging%20Face-yellow)](https://huggingface.co/your-username)
+ [![GitHub Stars](https://img.shields.io/github/stars/your-username/timeflow-pro?style=social)](https://github.com/your-username/timeflow-pro)
+
+ </div>
app.py ADDED
The diff for this file is too large to render. See raw diff
 
config/__init__.py ADDED
File without changes
config/config.py ADDED
@@ -0,0 +1,169 @@
+ # ============================================
+ # ENUMERATION CLASSES
+ # ============================================
+ from dataclasses import asdict, dataclass, field
+ from enum import Enum
+ import json
+ import logging
+ from pathlib import Path
+ from typing import Dict, List, Optional
+
+ logger = logging.getLogger(__name__)
+
+
+ class DataType(Enum):
+     """Data types"""
+     NUMERIC = "numeric"
+     CATEGORICAL = "categorical"
+     TEMPORAL = "temporal"
+     TEXT = "text"
+
+
+ class PreprocessingMethod(Enum):
+     """Data preprocessing methods"""
+     FILL_MEAN = "fill_mean"
+     FILL_MEDIAN = "fill_median"
+     FILL_INTERPOLATE = "fill_interpolate"
+     FILL_KNN = "fill_knn"
+     REMOVE = "remove"
+     CLIP = "clip"
+     WINSORIZE = "winsorize"
+     NORMALIZE = "normalize"
+     STANDARDIZE = "standardize"
+     LOG_TRANSFORM = "log_transform"
+     BOX_COX = "box_cox"
+     DIFFERENCING = "differencing"
+
+
+ class SeasonalityType(Enum):
+     """Seasonality types"""
+     DAILY = "daily"
+     WEEKLY = "weekly"
+     MONTHLY = "monthly"
+     QUARTERLY = "quarterly"
+     YEARLY = "yearly"
+     MULTIPLE = "multiple"
+
+
+ # ============================================
+ # CLASS 1: CONFIGURATION
+ # ============================================
+ @dataclass
+ class Config:
+     """Experiment configuration for data preprocessing"""
+
+     # Paths and directories
+     data_path: str = 'temp_data.csv'
+     results_dir: str = 'data_preprocessing_results'
+
+     # Temporal parameters
+     start_year: int = 1970
+     end_year: int = 1990
+     freq: str = 'D'  # Data frequency: D (daily), H (hourly), M (monthly)
+
+     # Target variable
+     target_column: str = 'raskhodvoda'
+
+     # Feature parameters
+     max_lags: int = 12
+     seasonal_period: int = 365
+     rolling_windows: List[int] = field(default_factory=lambda: [7, 30, 90, 365])
+     expanding_windows: List[int] = field(default_factory=lambda: [30, 90, 365])
+
+     # Processing parameters
+     missing_threshold: float = 0.3  # Threshold for dropping columns with missing values
+     outlier_method: str = 'iqr'  # Outlier detection method: iqr, zscore, lof
+     outlier_alpha: float = 1.5  # IQR multiplier
+     outlier_contamination: float = 0.1  # For methods like LOF
+
+     # Data splitting
+     test_size: float = 0.2
+     validation_size: float = 0.1
+     split_method: str = 'time'  # time, random, expanding_window
+
+     # Scaling
+     scaling_method: str = 'robust'  # standard, minmax, robust, none
+
+     # Feature selection
+     feature_selection_method: str = 'correlation'  # correlation, mutual_info, rf, pca
+     max_features: int = 50
+
+     # Validation
+     enable_validation: bool = True
+     validation_rules: Dict = field(default_factory=dict)
+
+     # Visualisation
+     save_plots: bool = True
+     plot_style: str = 'seaborn'
+
+     # Performance
+     use_multiprocessing: bool = False
+     n_jobs: int = -1
+     chunk_size: int = 10000
+
+     # Logging
+     log_level: str = 'INFO'
+     save_reports: bool = True
+
+     def __post_init__(self):
+         """Post-initialisation for creating directories and setting up logging"""
+         self.create_directories()
+         self.setup_logging()
+
+         # Set default validation rules
+         if not self.validation_rules:
+             self.validation_rules = {
+                 'min_rows': 100,
+                 'max_missing_percentage': 30,
+                 'min_unique_values': 2,
+                 'max_skewness': 3,
+                 'max_kurtosis': 10
+             }
+
+     def create_directories(self) -> None:
+         """Create directories for preprocessing results"""
+         dirs = [
+             self.results_dir,
+             f'{self.results_dir}/plots',
+             f'{self.results_dir}/plots/time_series',
+             f'{self.results_dir}/plots/distributions',
+             f'{self.results_dir}/plots/correlations',
+             f'{self.results_dir}/plots/features',
+             f'{self.results_dir}/tables',
+             f'{self.results_dir}/processed_data',
+             f'{self.results_dir}/models',
+             f'{self.results_dir}/reports',
+             f'{self.results_dir}/logs',
+             f'{self.results_dir}/checkpoints'
+         ]
+
+         for directory in dirs:
+             Path(directory).mkdir(parents=True, exist_ok=True)
+
+         logger.info(f"Directories created in {self.results_dir}")
+
+     def setup_logging(self) -> None:
+         """Configure logging"""
+         log_level = getattr(logging, self.log_level.upper())
+         logger.setLevel(log_level)
+
+     def to_dict(self) -> Dict:
+         """Convert configuration to dictionary"""
+         return asdict(self)
+
+     def save(self, path: Optional[str] = None) -> None:
+         """Save configuration to file"""
+         if path is None:
+             path = f'{self.results_dir}/config.json'
+
+         with open(path, 'w', encoding='utf-8') as f:
+             json.dump(self.to_dict(), f, indent=4, ensure_ascii=False)
+
+         logger.info(f"Configuration saved to {path}")
+
+     @classmethod
+     def load(cls, path: str) -> 'Config':
+         """Load configuration from file"""
+         with open(path, 'r', encoding='utf-8') as f:
+             config_dict = json.load(f)
+
+         return cls(**config_dict)
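A minimal usage sketch for `Config`, assuming the module is importable as `config.config` (the paths and values are illustrative):

```python
from config.config import Config

# Instantiating Config creates the results directories and applies the
# log level as a side effect of __post_init__.
config = Config(
    data_path='temp_data.csv',
    target_column='raskhodvoda',
    scaling_method='robust',
)

config.save()  # writes <results_dir>/config.json

# Round-trip the configuration from disk.
restored = Config.load(f'{config.results_dir}/config.json')
assert restored.target_column == config.target_column
```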
config/default_config.json ADDED
@@ -0,0 +1,78 @@
+ {
+     "data_path": "temp_data.csv",
+     "results_dir": "results",
+
+     "start_year": 1970,
+     "end_year": 1990,
+     "freq": "D",
+
+     "target_column": "raskhodvoda",
+
+     "max_lags": 12,
+     "seasonal_period": 365,
+     "rolling_windows": [7, 30, 90, 365],
+     "expanding_windows": [30, 90, 365],
+
+     "missing_threshold": 0.3,
+     "outlier_method": "iqr",
+     "outlier_alpha": 1.5,
+     "outlier_contamination": 0.1,
+
+     "test_size": 0.2,
+     "validation_size": 0.1,
+     "split_method": "time",
+
+     "scaling_method": "robust",
+
+     "feature_selection_method": "correlation",
+     "max_features": 50,
+
+     "enable_validation": true,
+     "validation_rules": {
+         "min_rows": 100,
+         "max_missing_percentage": 30,
+         "min_unique_values": 2,
+         "max_skewness": 3,
+         "max_kurtosis": 10,
+         "min_variance": 0.001,
+         "max_constant_columns": 0
+     },
+
+     "save_plots": true,
+     "plot_style": "seaborn-whitegrid",
+     "plot_dpi": 300,
+     "plot_format": "png",
+
+     "use_multiprocessing": false,
+     "n_jobs": -1,
+     "chunk_size": 10000,
+     "memory_limit_gb": 4,
+
+     "log_level": "INFO",
+     "save_reports": true,
+     "report_format": "json",
+
+     "decomposition_method": "stl",
+     "stationarity_tests": ["adf", "kpss"],
+     "correlation_threshold": 0.85,
+     "vif_threshold": 10,
+
+     "random_seed": 42,
+     "enable_profiling": false,
+     "save_intermediate": true,
+
+     "streamlit_settings": {
+         "theme": "light",
+         "sidebar_state": "expanded",
+         "page_title": "Time Series Preprocessing",
+         "page_icon": "📊",
+         "layout": "wide"
+     },
+
+     "export_options": {
+         "csv": true,
+         "parquet": false,
+         "excel": false,
+         "pickle": true
+     }
+ }
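Note that this file carries keys the `Config` dataclass does not define (e.g. `plot_dpi`, `streamlit_settings`), so passing it wholesale to `Config.load()` would raise a `TypeError`. A sketch of reading it as a plain dictionary instead (the path assumes the repo root as working directory):

```python
import json

# Load the shipped defaults as a plain dict rather than a Config instance.
with open('config/default_config.json', encoding='utf-8') as f:
    defaults = json.load(f)

print(defaults['scaling_method'])      # 'robust'
print(defaults['streamlit_settings'])  # UI options consumed by the Streamlit app
```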
config/settings.py ADDED
@@ -0,0 +1,375 @@
+ """
+ General project settings: visualisation, paths, constants
+ """
+
+ import warnings
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ from pathlib import Path
+ from typing import Dict, Any, Optional
+ import yaml
+ import json
+
+ # ============================================================================
+ # PATHS AND DIRECTORIES
+ # ============================================================================
+
+ PROJECT_ROOT = Path(__file__).parent.parent.parent
+ DATA_DIR = PROJECT_ROOT / "data"
+ RAW_DATA_DIR = DATA_DIR / "raw"
+ PROCESSED_DATA_DIR = DATA_DIR / "processed"
+ EXTERNAL_DATA_DIR = DATA_DIR / "external"
+
+ RESULTS_DIR = PROJECT_ROOT / "results"
+ PLOTS_DIR = RESULTS_DIR / "plots"
+ MODELS_DIR = RESULTS_DIR / "models"
+ REPORTS_DIR = RESULTS_DIR / "reports"
+ LOGS_DIR = RESULTS_DIR / "logs"
+
+ CONFIGS_DIR = PROJECT_ROOT / "configs"
+ NOTEBOOKS_DIR = PROJECT_ROOT / "notebooks"
+ TESTS_DIR = PROJECT_ROOT / "tests"
+
+ # Create directories on import
+ for directory in [RAW_DATA_DIR, PROCESSED_DATA_DIR, EXTERNAL_DATA_DIR,
+                   PLOTS_DIR, MODELS_DIR, REPORTS_DIR, LOGS_DIR]:
+     directory.mkdir(parents=True, exist_ok=True)
+
+ # ============================================================================
+ # VISUALISATION SETTINGS
+ # ============================================================================
+
+ def setup_visualization(
+     style: str = "seaborn-whitegrid",
+     palette: str = "husl",
+     context: str = "notebook",
+     font_scale: float = 1.0,
+     dpi: int = 150,
+     figsize: tuple = (12, 6),
+     **kwargs
+ ):
+     """
+     Configure visualisation parameters for matplotlib and seaborn
+
+     Parameters:
+     -----------
+     style : str
+         Matplotlib style: 'seaborn-whitegrid', 'ggplot', 'bmh', 'dark_background'
+     palette : str
+         Seaborn palette: 'husl', 'Set2', 'viridis', 'mako'
+     context : str
+         Seaborn context: 'paper', 'notebook', 'talk', 'poster'
+     font_scale : float
+         Font scale
+     dpi : int
+         Plot resolution
+     figsize : tuple
+         Default figure size
+     """
+     # Ignore warnings
+     warnings.filterwarnings('ignore')
+
+     # Matplotlib settings ('seaborn-*' styles were renamed to
+     # 'seaborn-v0_8-*' in matplotlib >= 3.6, hence the fallback)
+     try:
+         plt.style.use(style)
+     except OSError:
+         plt.style.use(style.replace('seaborn-', 'seaborn-v0_8-'))
+
+     # RC parameters
+     rc_params = {
+         'font.size': 10,
+         'figure.figsize': figsize,
+         'figure.dpi': dpi,
+         'savefig.dpi': 300,
+         'savefig.bbox': 'tight',
+         'savefig.format': 'png',
+         'axes.titlesize': 12,
+         'axes.labelsize': 10,
+         'xtick.labelsize': 9,
+         'ytick.labelsize': 9,
+         'legend.fontsize': 9,
+         'font.family': ['DejaVu Sans', 'Arial', 'sans-serif'],
+         'figure.titlesize': 14,
+         'axes.grid': True,
+         'grid.alpha': 0.3,
+         'lines.linewidth': 1.5,
+         'lines.markersize': 6,
+         'patch.edgecolor': 'black',
+         'patch.force_edgecolor': True,
+         'xtick.top': False,
+         'ytick.right': False,
+         'axes.spines.top': False,
+         'axes.spines.right': False
+     }
+
+     # Update additional parameters
+     rc_params.update(kwargs)
+     plt.rcParams.update(rc_params)
+
+     # Seaborn settings
+     sns.set_style(style.replace('seaborn-', ''))
+     sns.set_palette(palette)
+     sns.set_context(context, font_scale=font_scale)
+
+     print(f"✓ Visualisation settings applied: style={style}, palette={palette}")
+
+
+ def get_color_palette(name: str = "husl", n_colors: int = 8) -> list:
+     """
+     Get colour palette
+
+     Parameters:
+     -----------
+     name : str
+         Palette name
+     n_colors : int
+         Number of colours
+
+     Returns:
+     --------
+     list
+         List of colours in HEX format
+     """
+     palette_map = {
+         "husl": sns.color_palette("husl", n_colors),
+         "Set2": sns.color_palette("Set2", n_colors),
+         "Set3": sns.color_palette("Set3", n_colors),
+         "viridis": sns.color_palette("viridis", n_colors),
+         "plasma": sns.color_palette("plasma", n_colors),
+         "coolwarm": sns.color_palette("coolwarm", n_colors),
+         "RdYlBu": sns.color_palette("RdYlBu", n_colors),
+         "Spectral": sns.color_palette("Spectral", n_colors),
+         "tab10": sns.color_palette("tab10", n_colors),
+         "tab20": sns.color_palette("tab20", n_colors),
+     }
+
+     palette = palette_map.get(name, sns.color_palette("husl", n_colors))
+     return [f"#{int(r*255):02x}{int(g*255):02x}{int(b*255):02x}"
+             for r, g, b in palette]
+
+
+ # ============================================================================
+ # CONSTANTS
+ # ============================================================================
+
+ # Data types
+ DATETIME_FORMATS = [
+     "%Y-%m-%d", "%Y/%m/%d", "%d.%m.%Y", "%d/%m/%Y",
+     "%Y-%m-%d %H:%M:%S", "%Y/%m/%d %H:%M:%S",
+     "%d.%m.%Y %H:%M:%S", "%d/%m/%Y %H:%M:%S"
+ ]
+
+ # Metrics
+ METRICS = {
+     "regression": ["mse", "rmse", "mae", "mape", "r2", "explained_variance"],
+     "classification": ["accuracy", "precision", "recall", "f1", "roc_auc"]
+ }
+
+ # Statistical constants
+ STATS_CONSTANTS = {
+     "confidence_levels": [0.9, 0.95, 0.99],
+     "z_scores": {0.9: 1.645, 0.95: 1.96, 0.99: 2.576},
+     "outlier_multipliers": {"mild": 1.5, "extreme": 3.0}
+ }
+
+ # Time series parameters
+ TIME_SERIES_CONSTANTS = {
+     "frequencies": {
+         "H": "hourly",
+         "D": "daily",
+         "W": "weekly",
+         "M": "monthly",
+         "Q": "quarterly",
+         "Y": "yearly"
+     },
+     "seasonal_periods": {
+         "hourly": 24,
+         "daily": 7,
+         "weekly": 52,
+         "monthly": 12,
+         "quarterly": 4,
+         "yearly": 1
+     }
+ }
+
+ # ============================================================================
+ # CONFIGURATION UTILITIES
+ # ============================================================================
+
+ def load_config(config_path: Optional[str] = None) -> Dict[str, Any]:
+     """
+     Load configuration from file
+
+     Parameters:
+     -----------
+     config_path : str, optional
+         Path to configuration file
+
+     Returns:
+     --------
+     Dict[str, Any]
+         Configuration dictionary
+     """
+     if config_path is None:
+         config_path = CONFIGS_DIR / "default_config.json"
+
+     config_path = Path(config_path)
+
+     if not config_path.exists():
+         print(f"⚠ Configuration file not found: {config_path}")
+         return {}
+
+     # Determine file format
+     if config_path.suffix.lower() in ['.json']:
+         with open(config_path, 'r', encoding='utf-8') as f:
+             config = json.load(f)
+     elif config_path.suffix.lower() in ['.yaml', '.yml']:
+         with open(config_path, 'r', encoding='utf-8') as f:
+             config = yaml.safe_load(f)
+     else:
+         raise ValueError(f"Unsupported file format: {config_path.suffix}")
+
+     print(f"✓ Configuration loaded from: {config_path}")
+     return config
+
+
+ def save_config(config: Dict[str, Any], config_path: str) -> None:
+     """
+     Save configuration to file
+
+     Parameters:
+     -----------
+     config : Dict[str, Any]
+         Configuration to save
+     config_path : str
+         Save path
+     """
+     config_path = Path(config_path)
+     config_path.parent.mkdir(parents=True, exist_ok=True)
+
+     # Determine format
+     if config_path.suffix.lower() in ['.json']:
+         with open(config_path, 'w', encoding='utf-8') as f:
+             json.dump(config, f, indent=2, ensure_ascii=False)
+     elif config_path.suffix.lower() in ['.yaml', '.yml']:
+         with open(config_path, 'w', encoding='utf-8') as f:
+             yaml.dump(config, f, default_flow_style=False, allow_unicode=True)
+     else:
+         raise ValueError(f"Unsupported file format: {config_path.suffix}")
+
+     print(f"✓ Configuration saved to: {config_path}")
+
+
+ def merge_configs(base_config: Dict[str, Any],
+                   override_config: Dict[str, Any]) -> Dict[str, Any]:
+     """
+     Recursive configuration merging
+
+     Parameters:
+     -----------
+     base_config : Dict[str, Any]
+         Base configuration
+     override_config : Dict[str, Any]
+         Override configuration
+
+     Returns:
+     --------
+     Dict[str, Any]
+         Merged configuration
+     """
+     result = base_config.copy()
+
+     for key, value in override_config.items():
+         if (key in result and isinstance(result[key], dict)
+                 and isinstance(value, dict)):
+             result[key] = merge_configs(result[key], value)
+         else:
+             result[key] = value
+
+     return result
+
+
+ # ============================================================================
+ # ENVIRONMENT SETUP
+ # ============================================================================
+
+ def setup_environment(
+     log_level: str = "INFO",
+     random_seed: int = 42,
+     enable_warnings: bool = False,
+     memory_limit_gb: Optional[int] = None
+ ) -> None:
+     """
+     Set up environment for reproducibility
+
+     Parameters:
+     -----------
+     log_level : str
+         Logging level
+     random_seed : int
+         Seed for random generators
+     enable_warnings : bool
+         Enable warnings
+     memory_limit_gb : int, optional
+         Memory limit in GB
+     """
+     import numpy as np
+     import random
+
+     # Set seeds
+     np.random.seed(random_seed)
+     random.seed(random_seed)
+
+     # Optional deep-learning frameworks: seed them only when installed
+     try:
+         import torch
+         torch.manual_seed(random_seed)
+     except ImportError:
+         pass
+
+     try:
+         import tensorflow as tf
+         tf.random.set_seed(random_seed)
+     except ImportError:
+         pass
+
+     # Configure warnings
+     if enable_warnings:
+         warnings.filterwarnings('default')
+     else:
+         warnings.filterwarnings('ignore')
+
+     # Memory limit (if specified)
+     if memory_limit_gb:
+         import resource
+         soft, hard = resource.getrlimit(resource.RLIMIT_AS)
+         memory_limit = memory_limit_gb * 1024**3  # GB to bytes
+         resource.setrlimit(resource.RLIMIT_AS, (memory_limit, hard))
+         print(f"✓ Memory limit set: {memory_limit_gb} GB")
+
+     print(f"✓ Environment configured. Random seed: {random_seed}")
+
+
+ # ============================================================================
+ # AUTOMATIC SETUP ON IMPORT
+ # ============================================================================
+
+ # Automatically apply visualisation settings
+ setup_visualization()
+
+ # Export useful variables
+ __all__ = [
+     'setup_visualization',
+     'get_color_palette',
+     'load_config',
+     'save_config',
+     'merge_configs',
+     'setup_environment',
+     'PROJECT_ROOT',
+     'DATA_DIR',
+     'RAW_DATA_DIR',
+     'PROCESSED_DATA_DIR',
+     'RESULTS_DIR',
+     'PLOTS_DIR',
+     'DATETIME_FORMATS',
+     'METRICS',
+     'STATS_CONSTANTS',
+     'TIME_SERIES_CONSTANTS'
+ ]
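A short sketch of how these helpers might be used together (file paths are illustrative; note the module creates directories and applies visualisation settings as a side effect of import):

```python
from config.settings import load_config, merge_configs, save_config, setup_environment

# Seed numpy/random (and torch/tensorflow when installed) for reproducibility.
setup_environment(random_seed=42)

base = load_config('config/default_config.json')  # the defaults shipped in this repo
overrides = {
    'scaling_method': 'standard',
    'streamlit_settings': {'theme': 'dark'},  # nested dicts are merged recursively
}

merged = merge_configs(base, overrides)
save_config(merged, 'results/run_config.json')
```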
correlations/__init__.py ADDED
File without changes
correlations/correlation_analyzer.py ADDED
@@ -0,0 +1,687 @@
+ # ============================================
+ # CLASS 8: CORRELATION AND MULTICOLLINEARITY ANALYSIS
+ # ============================================
+ import logging
+ import os
+ import traceback
+ from typing import Any, Dict, List, Optional
+
+ import numpy as np
+ import pandas as pd
+
+ from config.config import Config
+
+ logger = logging.getLogger(__name__)
+
+
+ class CorrelationAnalyzer:
+     """Class for comprehensive correlation and multicollinearity analysis"""
+
+     def __init__(self, config: Config):
+         """
+         Initialise the analyser
+
+         Parameters:
+         -----------
+         config : Config
+             Experiment configuration
+         """
+         self.config = config
+         self.correlation_matrices = {}
+         self.high_correlation_pairs = {}
+         self.multicollinearity_info = {}
+         self.vif_scores = {}
+
+     def analyze(
+         self,
+         data: pd.DataFrame,
+         target_col: Optional[str] = None,
+         threshold: float = 0.8,
+         detailed: bool = True,
+         **kwargs
+     ) -> pd.DataFrame:
+         """
+         Analyse correlations in the data
+
+         Parameters:
+         -----------
+         data : pd.DataFrame
+             Input data
+         target_col : str, optional
+             Target variable
+         threshold : float
+             Threshold for identifying high correlations
+         detailed : bool
+             Whether to perform detailed analysis
+         **kwargs : dict
+             Additional parameters
+
+         Returns:
+         --------
+         pd.DataFrame
+             Correlation matrix
+         """
+         logger.info("\n" + "="*80)
+         logger.info("CORRELATION AND MULTICOLLINEARITY ANALYSIS")
+         logger.info("="*80)
+
+         target_col = target_col or self.config.target_column
+
+         try:
+             # 1. Calculate correlation matrix
+             corr_matrix = self._compute_correlations(data, target_col)
+
+             if corr_matrix.empty:
+                 logger.warning("Correlation matrix is empty")
+                 return pd.DataFrame()
+
+             # 2. Identify high correlations
+             high_correlations = self._detect_high_correlations(corr_matrix, threshold)
+             self.high_correlation_pairs['pearson'] = high_correlations
+
+             # 3. Analyse correlations with target variable
+             target_correlations = []
+             if target_col in corr_matrix.columns:
+                 target_correlations = self._get_target_correlations(corr_matrix, target_col)
+
+             # 4. Analyse multicollinearity (VIF)
+             vif_results = self._compute_vif_scores(data)
+
+             # 5. Detailed analysis if required
+             if detailed:
+                 self._detailed_correlation_analysis(data, corr_matrix, target_col)
+
+             # 6. Visualisation
+             if self.config.save_plots:
+                 self._plot_correlation_analysis(data, corr_matrix, target_col,
+                                                 high_correlations, vif_results)
+
+             # 7. Output results
+             self._log_analysis_results(corr_matrix, high_correlations,
+                                        target_correlations, vif_results)
+
+             return corr_matrix
+
+         except Exception as e:
+             logger.error(f"Error in correlation analysis: {e}")
+             logger.error(traceback.format_exc())
+             return pd.DataFrame()
+
+     def _compute_correlations(
+         self,
+         data: pd.DataFrame,
+         target_col: str
+     ) -> pd.DataFrame:
+         """Calculate correlation matrix"""
+         logger.info("Calculating correlation matrix...")
+
+         # Select only numeric columns
+         numeric_data = data.select_dtypes(include=[np.number])
+
+         # Remove constant columns
+         numeric_data = numeric_data.loc[:, numeric_data.nunique() > 1]
+
+         if numeric_data.shape[1] < 2:
+             logger.warning("Insufficient numeric features for analysis")
+             return pd.DataFrame()
+
+         # Remove missing values
+         numeric_data_clean = numeric_data.dropna()
+
+         if len(numeric_data_clean) < 10:
+             logger.warning("Insufficient data after cleaning")
+             return pd.DataFrame()
+
+         # Calculate Pearson correlation
+         try:
+             corr_matrix = numeric_data_clean.corr(method='pearson')
+             self.correlation_matrices['pearson'] = corr_matrix
+             logger.info(f"✓ Correlation matrix calculated: {corr_matrix.shape}")
+             return corr_matrix
+         except Exception as e:
+             logger.error(f"Error calculating correlation: {e}")
+             return pd.DataFrame()
+
+     def _detect_high_correlations(
+         self,
+         corr_matrix: pd.DataFrame,
+         threshold: float = 0.8
+     ) -> List[Dict[str, Any]]:
+         """Detect high correlations"""
+         high_correlations = []
+
+         if corr_matrix.empty:
+             return high_correlations
+
+         # Use upper triangle of matrix
+         upper_triangle = corr_matrix.where(
+             np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
+         )
+
+         # Find pairs with correlation above threshold
+         for col in upper_triangle.columns:
+             high_corr_series = upper_triangle[col][abs(upper_triangle[col]) > threshold]
+
+             for row_idx, correlation in high_corr_series.items():
+                 if not pd.isna(correlation):
+                     high_correlations.append({
+                         'feature1': row_idx,
+                         'feature2': col,
+                         'correlation': float(correlation),
+                         'abs_correlation': abs(float(correlation))
+                     })
+
+         # Sort by absolute correlation value
+         high_correlations.sort(key=lambda x: x['abs_correlation'], reverse=True)
+
+         logger.info(f"High correlations detected (> {threshold}): {len(high_correlations)}")
+         return high_correlations
+
+     def _get_target_correlations(
+         self,
+         corr_matrix: pd.DataFrame,
+         target_col: str
+     ) -> List[Dict[str, Any]]:
+         """Get correlations with target variable"""
+         target_correlations = []
+
+         if target_col not in corr_matrix.columns:
+             return target_correlations
+
+         # Extract correlations with target variable
+         target_corr_series = corr_matrix[target_col]
+
+         for feature, correlation in target_corr_series.items():
+             if feature != target_col and not pd.isna(correlation):
+                 target_correlations.append({
+                     'feature': feature,
+                     'correlation': float(correlation),
+                     'abs_correlation': abs(float(correlation)),
+                     'direction': 'positive' if correlation > 0 else 'negative'
+                 })
+
+         # Sort by absolute value
+         target_correlations.sort(key=lambda x: x['abs_correlation'], reverse=True)
+
+         logger.info(f"Correlations with target variable calculated: {len(target_correlations)}")
+         return target_correlations
+
+     def _compute_vif_scores(self, data: pd.DataFrame) -> Dict[str, Any]:
+         """Calculate VIF (Variance Inflation Factor)"""
+         logger.info("Analysing multicollinearity (VIF)...")
+
+         vif_results = {
+             'scores': {},
+             'issues': [],
+             'summary': {
+                 'critical': 0,
+                 'high': 0,
+                 'medium': 0,
+                 'low': 0
+             }
+         }
+
+         try:
+             from statsmodels.stats.outliers_influence import variance_inflation_factor
+             import statsmodels.api as sm
+
+             # Prepare data
+             numeric_data = data.select_dtypes(include=[np.number])
+             numeric_data = numeric_data.loc[:, numeric_data.nunique() > 1]
+
+             # Remove missing and infinite values
+             clean_data = numeric_data.replace([np.inf, -np.inf], np.nan).dropna()
+
+             if clean_data.shape[0] < 10 or clean_data.shape[1] < 2:
+                 logger.warning("Insufficient data for VIF analysis")
+                 return vif_results
+
+             # Add constant
+             X = sm.add_constant(clean_data, has_constant='add')
+
+             # Calculate VIF for each feature
+             vif_scores = {}
+             for i, column in enumerate(X.columns):
+                 if column == 'const':
+                     continue
+
+                 try:
+                     vif = variance_inflation_factor(X.values, i)
+
+                     # Handle extreme values
+                     if np.isinf(vif) or vif > 1e6:
+                         vif = 1e6
+
+                     vif_scores[column] = float(vif)
+
+                     # Classify by severity
+                     if vif > 100:
+                         vif_results['summary']['critical'] += 1
+                         vif_results['issues'].append({
+                             'feature': column,
+                             'vif': float(vif),
+                             'severity': 'critical',
+                             'recommendation': 'Remove feature'
+                         })
+                     elif vif > 10:
+                         vif_results['summary']['high'] += 1
+                         vif_results['issues'].append({
+                             'feature': column,
+                             'vif': float(vif),
+                             'severity': 'high',
+                             'recommendation': 'Consider removal'
+                         })
+                     elif vif > 5:
+                         vif_results['summary']['medium'] += 1
+                     else:
+                         vif_results['summary']['low'] += 1
+
+                 except Exception as e:
+                     logger.warning(f"VIF error for {column}: {e}")
+                     vif_scores[column] = np.nan
+
+             vif_results['scores'] = vif_scores
+             self.vif_scores = vif_scores
+             # Keep the aggregate counts separately for reporting (see get_report)
+             self.multicollinearity_info['vif_summary'] = vif_results['summary']
+
+             logger.info(f"✓ VIF analysis completed. Critical features: {vif_results['summary']['critical']}")
+
+         except ImportError:
+             logger.warning("statsmodels not installed, skipping VIF analysis")
+         except Exception as e:
+             logger.error(f"VIF analysis error: {e}")
+
+         return vif_results
+
+     def _detailed_correlation_analysis(
+         self,
+         data: pd.DataFrame,
+         corr_matrix: pd.DataFrame,
+         target_col: str
+     ) -> None:
+         """Detailed correlation analysis"""
+         # Analyse correlation clusters
+         if not corr_matrix.empty and corr_matrix.shape[0] > 3:
+             try:
+                 # Use clustering to group correlated features
+                 from scipy.cluster.hierarchy import linkage, fcluster
+                 from scipy.spatial.distance import squareform
+
+                 # Convert correlations to distances
+                 distance_matrix = 1 - abs(corr_matrix)
+                 np.fill_diagonal(distance_matrix.values, 0)
+
+                 # Clustering
+                 condensed_dist = squareform(distance_matrix)
+                 Z = linkage(condensed_dist, method='average')
+
+                 # Determine clusters
+                 clusters = fcluster(Z, t=0.5, criterion='distance')
+
+                 # Group features by cluster
+                 feature_clusters = {}
+                 for idx, cluster_id in enumerate(clusters):
+                     feature = corr_matrix.columns[idx]
+                     if cluster_id not in feature_clusters:
+                         feature_clusters[cluster_id] = []
+                     feature_clusters[cluster_id].append(feature)
+
+                 # Save cluster information
+                 self.multicollinearity_info['correlation_clusters'] = feature_clusters
+                 logger.info(f"Correlated feature clusters detected: {len(feature_clusters)}")
+
+             except Exception as e:
+                 logger.debug(f"Cluster analysis failed: {e}")
+
+     def _plot_correlation_analysis(
+         self,
+         data: pd.DataFrame,
+         corr_matrix: pd.DataFrame,
+         target_col: str,
+         high_correlations: List[Dict[str, Any]],
+         vif_results: Dict[str, Any]
+     ) -> None:
+         """Visualise correlation analysis"""
+         try:
+             import matplotlib.pyplot as plt
+             import seaborn as sns
+             from matplotlib import rcParams
+
+             # Style settings
+             plt.style.use('seaborn-v0_8-darkgrid')
+             rcParams.update({
+                 'figure.figsize': (12, 8),
+                 'font.size': 10,
+                 'axes.titlesize': 14,
+                 'axes.labelsize': 12
+             })
+
+             # Create directory
+             plots_dir = os.path.join(self.config.results_dir, 'plots', 'correlations')
+             os.makedirs(plots_dir, exist_ok=True)
+
+             # 1. Correlation matrix heatmap
+             if not corr_matrix.empty and corr_matrix.shape[0] > 1:
+                 fig, ax = plt.subplots(figsize=(14, 12))
+
+                 mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
+                 sns.heatmap(
+                     corr_matrix,
+                     mask=mask,
+                     annot=True,
+                     fmt='.2f',
+                     cmap='coolwarm',
+                     center=0,
+                     square=True,
+                     linewidths=0.5,
+                     cbar_kws={"shrink": 0.8},
+                     ax=ax
+                 )
+                 ax.set_title('Correlation Matrix (Pearson)', fontweight='bold')
+                 plt.tight_layout()
+                 plt.savefig(os.path.join(plots_dir, 'correlation_matrix.png'),
+                             dpi=150, bbox_inches='tight')
+                 plt.close()
+
+             # 2. Target variable correlations
+             if target_col in corr_matrix.columns:
+                 target_corrs = corr_matrix[target_col].drop(target_col, errors='ignore')
+                 if not target_corrs.empty:
+                     fig, ax = plt.subplots(figsize=(10, 8))
+
+                     top_corrs = target_corrs.abs().sort_values(ascending=True).tail(20)
+                     colors = ['red' if target_corrs[feat] < 0 else 'blue'
+                               for feat in top_corrs.index]
+
+                     ax.barh(range(len(top_corrs)), top_corrs.values, color=colors)
+                     ax.set_yticks(range(len(top_corrs)))
+                     ax.set_yticklabels(top_corrs.index)
+                     ax.set_xlabel('Absolute correlation')
+                     ax.set_title(f'Top-20 correlations with {target_col}', fontweight='bold')
+                     ax.grid(True, alpha=0.3, axis='x')
+
+                     plt.tight_layout()
+                     plt.savefig(os.path.join(plots_dir, 'target_correlations.png'),
+                                 dpi=150, bbox_inches='tight')
+                     plt.close()
+
+             # 3. VIF scores plot
+             if vif_results['scores']:
+                 valid_scores = {k: v for k, v in vif_results['scores'].items()
+                                 if not pd.isna(v)}
+                 if valid_scores:
+                     fig, ax = plt.subplots(figsize=(12, 8))
+
+                     sorted_scores = dict(sorted(valid_scores.items(),
+                                                 key=lambda x: x[1],
+                                                 reverse=True)[:25])
+
+                     colors = []
+                     for vif in sorted_scores.values():
+                         if vif > 100:
+                             colors.append('red')
+                         elif vif > 10:
+                             colors.append('orange')
+                         elif vif > 5:
+                             colors.append('yellow')
+                         else:
+                             colors.append('green')
+
+                     ax.barh(list(sorted_scores.keys()),
+                             list(sorted_scores.values()),
+                             color=colors, edgecolor='black')
+
+                     ax.set_xlabel('VIF Score')
+                     ax.set_title('VIF Scores (multicollinearity)', fontweight='bold')
+                     ax.axvline(x=5, color='yellow', linestyle='--', alpha=0.7)
+                     ax.axvline(x=10, color='orange', linestyle='--', alpha=0.7)
+                     ax.axvline(x=100, color='red', linestyle='--', alpha=0.7)
+                     ax.grid(True, alpha=0.3, axis='x')
+
+                     plt.tight_layout()
+                     plt.savefig(os.path.join(plots_dir, 'vif_scores.png'),
+                                 dpi=150, bbox_inches='tight')
+                     plt.close()
+
+             # 4. High correlations plot
+             if high_correlations:
+                 fig, ax = plt.subplots(figsize=(12, 8))
+
+                 # Limit number for display
+                 display_corrs = high_correlations[:15]
+
+                 # Create labels for feature pairs
+                 labels = [f"{corr['feature1']} ↔ {corr['feature2']}"
+                           for corr in display_corrs]
+                 values = [corr['correlation'] for corr in display_corrs]
+                 colors = ['red' if v < 0 else 'blue' for v in values]
+
+                 y_pos = np.arange(len(display_corrs))
+                 ax.barh(y_pos, values, color=colors)
+                 ax.set_yticks(y_pos)
+                 ax.set_yticklabels(labels, fontsize=9)
+                 ax.invert_yaxis()
+                 ax.set_xlabel('Correlation')
+                 ax.set_title('High correlations (> 0.8)', fontweight='bold')
+                 ax.grid(True, alpha=0.3, axis='x')
+
+                 plt.tight_layout()
+                 plt.savefig(os.path.join(plots_dir, 'high_correlations.png'),
+                             dpi=150, bbox_inches='tight')
+                 plt.close()
+
+             logger.info(f"Visualisations saved to {plots_dir}")
+
+         except Exception as e:
+             logger.warning(f"Error creating visualisations: {e}")
+
+     def _log_analysis_results(
+         self,
+         corr_matrix: pd.DataFrame,
+         high_correlations: List[Dict[str, Any]],
+         target_correlations: List[Dict[str, Any]],
+         vif_results: Dict[str, Any]
+     ) -> None:
+         """Log analysis results"""
+         logger.info("\n" + "="*80)
+         logger.info("CORRELATION AND MULTICOLLINEARITY ANALYSIS REPORT")
+         logger.info("="*80)
+
+         # General information
+         logger.info("\n📊 GENERAL INFORMATION:")
+         logger.info(f"  Correlation matrix size: {corr_matrix.shape}")
+         logger.info(f"  Total features: {len(corr_matrix.columns)}")
+
+         # High correlations
+         if high_correlations:
+             logger.info(f"\n⚠ HIGH CORRELATIONS (|r| > 0.8): {len(high_correlations)}")
+             logger.info("  " + "-" * 60)
+
+             for i, corr in enumerate(high_correlations[:10]):
+                 sign = "🟥" if corr['correlation'] < 0 else "🟩"
+                 logger.info(f"  {i+1:2d}. {sign} {corr['feature1']:25s} ↔ {corr['feature2']:25s}: {corr['correlation']:7.4f}")
+
+             if len(high_correlations) > 10:
+                 logger.info(f"  ... and {len(high_correlations) - 10} more pairs")
+
+         # Target variable correlations
+         if target_correlations:
+             logger.info("\n🎯 CORRELATIONS WITH TARGET VARIABLE:")
+             logger.info("  " + "-" * 60)
+
+             for i, corr in enumerate(target_correlations[:10]):
+                 direction = "↓" if corr['correlation'] < 0 else "↑"
+                 logger.info(f"  {i+1:2d}. {direction} {corr['feature']:35s}: {corr['correlation']:7.4f}")
+
+         # Multicollinearity analysis
+         if vif_results['scores']:
+             logger.info("\n📈 MULTICOLLINEARITY ANALYSIS (VIF):")
+             logger.info("  " + "-" * 60)
+             logger.info(f"  Critical (VIF > 100):  {vif_results['summary']['critical']}")
+             logger.info(f"  High (10 < VIF ≤ 100): {vif_results['summary']['high']}")
+             logger.info(f"  Medium (5 < VIF ≤ 10): {vif_results['summary']['medium']}")
+             logger.info(f"  Low (VIF ≤ 5):         {vif_results['summary']['low']}")
+
+             # Top problematic features
+             if vif_results['issues']:
+                 logger.info("\n🔴 PROBLEMATIC FEATURES (VIF > 10):")
+                 for issue in vif_results['issues'][:10]:
+                     logger.info(f"  • {issue['feature']:35s}: VIF = {issue['vif']:7.1f} ({issue['severity']})")
+
+         logger.info("\n" + "="*80)
+         logger.info("RECOMMENDATIONS:")
+         logger.info("="*80)
+
+         # Generate recommendations
+         recommendations = []
+
+         if len(high_correlations) > 20:
+             recommendations.append("1. Remove highly correlated features (correlation method)")
+
+         if vif_results['summary']['critical'] > 0:
+             recommendations.append("2. Remove features with critical VIF (>100)")
+
+         if vif_results['summary']['high'] > 5:
+             recommendations.append("3. Consider removing features with VIF > 10")
+
+         if not recommendations:
+             recommendations.append("1. Data in good condition, no serious issues detected")
+             recommendations.append("2. Proceed to modelling")
+
+         for rec in recommendations:
+             logger.info(f"  {rec}")
+
+         logger.info("\n" + "="*80)
+
+     def remove_highly_correlated(
+         self,
+         data: pd.DataFrame,
+         threshold: float = 0.85,
+         method: str = 'variance',
+         keep_target: bool = True,
+         keep_features: Optional[List[str]] = None
+     ) -> pd.DataFrame:
+         """
+         Remove highly correlated features
+
+         Parameters:
+         -----------
+         data : pd.DataFrame
+             Source data
+         threshold : float
+             Correlation threshold for removal
+         method : str
+             Feature selection method for removal: 'variance', 'random', 'importance'
+         keep_target : bool
+             Whether to keep target variable
+         keep_features : List[str], optional
+             Features to keep
+
+         Returns:
+         --------
+         pd.DataFrame
+             Data after removing highly correlated features
+         """
+         logger.info("\n" + "="*80)
+         logger.info("REMOVING HIGHLY CORRELATED FEATURES")
+         logger.info("="*80)
+
+         data_clean = data.copy()
+
+         if 'pearson' not in self.correlation_matrices:
+             logger.warning("Correlation matrix not calculated, run analyze() first")
+             return data_clean
+
+         corr_matrix = self.correlation_matrices['pearson']
+
+         # Features to keep
+         features_to_keep = set()
+
+         if keep_target and self.config.target_column in data_clean.columns:
+             features_to_keep.add(self.config.target_column)
+
+         if keep_features:
+             for feat in keep_features:
+                 if feat in data_clean.columns:
+                     features_to_keep.add(feat)
+
+         # Temporal features (usually important for time series)
+         temporal_patterns = ['year', 'month', 'day', 'week', 'quarter',
+                              'hour', 'minute', 'second', 'sin', 'cos']
+
+         for col in data_clean.columns:
+             if any(pattern in col.lower() for pattern in temporal_patterns):
+                 features_to_keep.add(col)
+
+         # Find highly correlated pairs
+         upper_triangle = corr_matrix.where(
+             np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
+         )
+
+         # Collect highly correlated features
+         correlated_features = set()
+         for col in upper_triangle.columns:
+             if col in features_to_keep:
+                 continue
+
+             high_corr = upper_triangle[col][abs(upper_triangle[col]) > threshold]
+             for row_idx, corr_value in high_corr.items():
+                 if not pd.isna(corr_value) and row_idx not in features_to_keep:
+                     # Select which feature to remove
+                     if method == 'variance':
+                         # Remove the one with lower variance
+                         var_col = data_clean[col].var()
+                         var_row = data_clean[row_idx].var()
+                         feature_to_remove = col if var_col < var_row else row_idx
+                     elif method == 'importance':
+                         # Remove the one with lower correlation to target variable
+                         if self.config.target_column in corr_matrix.columns:
+                             corr_col_target = abs(corr_matrix.loc[col, self.config.target_column])
+                             corr_row_target = abs(corr_matrix.loc[row_idx, self.config.target_column])
+                             feature_to_remove = col if corr_col_target < corr_row_target else row_idx
+                         else:
+                             # If no target, remove randomly
+                             feature_to_remove = np.random.choice([col, row_idx])
+                     else:
+                         # Remove randomly
+                         feature_to_remove = np.random.choice([col, row_idx])
+
+                     correlated_features.add(feature_to_remove)
+
+         # Remove features
+         features_to_remove = list(correlated_features)
+
+         if features_to_remove:
+             data_clean = data_clean.drop(columns=features_to_remove)
+
+         logger.info("\n📊 REMOVAL RESULTS:")
+         logger.info(f"  Initial feature count: {len(data.columns)}")
+         logger.info(f"  Features removed: {len(features_to_remove)}")
+         logger.info(f"  Final feature count: {len(data_clean.columns)}")
+         logger.info(f"  Retained: {len(data_clean.columns)/len(data.columns)*100:.1f}%")
+
+         if features_to_remove:
+             logger.info("\n🗑️ REMOVED FEATURES:")
+             for i, feat in enumerate(sorted(features_to_remove)[:20]):
+                 logger.info(f"  {i+1:2d}. {feat}")
+             if len(features_to_remove) > 20:
+                 logger.info(f"  ... and {len(features_to_remove) - 20} more features")
+         else:
+             logger.info("✓ No highly correlated features detected, all features retained")
+
+         logger.info("="*80)
+         return data_clean
+
+     def get_report(self) -> Dict[str, Any]:
+         """Get analysis report"""
+         report = {
+             "correlation_matrix_shape": None,
+             "high_correlation_count": 0,
+             "vif_summary": {},
+             "target_correlation_count": 0
+         }
+
+         if 'pearson' in self.correlation_matrices:
+             report["correlation_matrix_shape"] = self.correlation_matrices['pearson'].shape
+
+         if 'pearson' in self.high_correlation_pairs:
+             report["high_correlation_count"] = len(self.high_correlation_pairs['pearson'])
+
+         # self.vif_scores maps feature -> VIF, so the aggregate counts are
+         # kept in multicollinearity_info (set in _compute_vif_scores)
+         report["vif_summary"] = self.multicollinearity_info.get('vif_summary', {})
+
+         return report
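A minimal end-to-end sketch of the analyser on synthetic data (the column names are arbitrary; `save_plots=False` skips the plot files):

```python
import numpy as np
import pandas as pd

from config.config import Config
from correlations.correlation_analyzer import CorrelationAnalyzer

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({'x1': rng.normal(size=n), 'x3': rng.normal(size=n)})
df['x2'] = 0.95 * df['x1'] + rng.normal(scale=0.1, size=n)  # nearly collinear with x1
df['raskhodvoda'] = 2 * df['x1'] + df['x3'] + rng.normal(size=n)

analyzer = CorrelationAnalyzer(Config(save_plots=False))

# analyze() caches the Pearson matrix that remove_highly_correlated() needs.
corr_matrix = analyzer.analyze(df, target_col='raskhodvoda', threshold=0.8)
reduced = analyzer.remove_highly_correlated(df, threshold=0.85, method='importance')
print(reduced.columns.tolist())  # one of the x1/x2 pair should be dropped
```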
data_loader/__init__.py ADDED
File without changes
data_loader/data_loader.py ADDED
@@ -0,0 +1,487 @@
+ # ============================================
+ # CLASS 2: DATA LOADER
+ # ============================================
+ from datetime import datetime
+ import hashlib
+ import json
+ import logging
+ import traceback
+ from typing import Dict, List, Optional
+
+ import numpy as np
+ import pandas as pd
+
+ from config.config import Config, DataType
+
+ logger = logging.getLogger(__name__)
+
+
+ class DataLoader:
+     """Class for loading and initial data processing"""
+
+     def __init__(self, config: Config):
+         """
+         Initialise data loader
+
+         Parameters:
+         -----------
+         config : Config
+             Experiment configuration
+         """
+         self.config = config
+         self.data = None
+         self.metadata = {}
+         self.data_hash = None
+         self.loading_time = None
+         self.data_types = {}
+         self.original_shape = None
+
+     def load_from_csv(
+         self,
+         data_path: Optional[str] = None,
+         parse_dates: Optional[List[str]] = None,
+         date_format: Optional[str] = None,
+         dtype: Optional[Dict] = None,
+         **kwargs
+     ) -> pd.DataFrame:
+         """
+         Load data from CSV file
+
+         Parameters:
+         -----------
+         data_path : str, optional
+             Path to CSV file. If None, uses path from configuration.
+         parse_dates : List[str], optional
+             List of columns to parse as dates
+         date_format : str, optional
+             Date format
+         dtype : Dict, optional
+             Data types for columns
+         **kwargs : dict
+             Additional parameters for pd.read_csv
+
+         Returns:
+         --------
+         pd.DataFrame
+             Loaded data
+         """
+         logger.info("="*80)
+         logger.info("LOADING DATA FROM CSV")
+         logger.info("="*80)
+
+         start_time = datetime.now()
+
+         try:
+             path = data_path or self.config.data_path
+
+             if parse_dates is None:
+                 parse_dates = ['date']
+
+             # Load data
+             self.data = pd.read_csv(
+                 path,
+                 parse_dates=parse_dates,
+                 dayfirst=False,
+                 dtype=dtype,
+                 **kwargs
+             )
+
+             # Convert dates if needed
+             for date_col in parse_dates:
+                 if date_col in self.data.columns:
+                     if date_format:
+                         self.data[date_col] = pd.to_datetime(
+                             self.data[date_col],
+                             format=date_format,
+                             errors='coerce'
+                         )
+                     else:
+                         self.data[date_col] = pd.to_datetime(
+                             self.data[date_col],
+                             errors='coerce'
+                         )
+
+             # Save original shape
+             self.original_shape = self.data.shape
+
+             # Filter by years
+             if 'date' in self.data.columns:
+                 mask = (self.data['date'].dt.year >= self.config.start_year) & \
+                        (self.data['date'].dt.year <= self.config.end_year)
+                 self.data = self.data.loc[mask].copy()
+
+             # Sort by date
+             if 'date' in self.data.columns:
+                 self.data = self.data.sort_values('date').reset_index(drop=True)
+                 # Set date as index
+                 self.data.set_index('date', inplace=True)
+
+             # Calculate data hash
+             self.data_hash = self._calculate_data_hash()
+
+             # Analyse data types
+             self._analyse_data_types()
+
+             # Save metadata
+             self._save_metadata()
+
+             # Loading time
+             self.loading_time = (datetime.now() - start_time).total_seconds()
+
+             logger.info(f"✓ Loaded {len(self.data)} records, {len(self.data.columns)} columns")
+             logger.info(f"  Period: {self.data.index.min()} - {self.data.index.max()}")
+             logger.info(f"  Data types: {self.data_types}")
+             logger.info(f"  Target variable: {self.config.target_column}")
+             logger.info(f"  Loading time: {self.loading_time:.2f} sec")
+
+             return self.data
+
+         except Exception as e:
+             logger.error(f"✗ Error loading data: {e}")
+             logger.error(traceback.format_exc())
+             raise
+
+     def create_synthetic_data(
+         self,
+         n_days: int = 365*21,
+         trend_strength: float = 0.01,
+         seasonal_amplitude: Optional[List[float]] = None,
+         noise_std: float = 10,
+         include_exogenous: bool = True,
+         random_state: int = 42
+     ) -> pd.DataFrame:
+         """
+         Create synthetic data for testing
+
+         Parameters:
+         -----------
+         n_days : int
+             Number of days to generate
+         trend_strength : float
+             Trend strength
+         seasonal_amplitude : List[float], optional
+             Seasonal component amplitudes
+         noise_std : float
+             Noise standard deviation
+         include_exogenous : bool
+             Whether to include exogenous variables
+         random_state : int
+             Seed for reproducibility
+
+         Returns:
+         --------
+         pd.DataFrame
+             Synthetic data
+         """
+         logger.info("="*80)
+         logger.info("CREATING SYNTHETIC DATA")
+         logger.info("="*80)
+
+         if seasonal_amplitude is None:
+             seasonal_amplitude = [50, 30, 20]
+
+         np.random.seed(random_state)
+
+         # Generate dates
+         dates = pd.date_range(
+             start=f'{self.config.start_year}-01-01',
183
+ periods=n_days,
184
+ freq='D'
185
+ )
186
+
187
+ t = np.arange(n_days)
188
+
189
+ # Base components
190
+ trend = trend_strength * t
191
+
192
+ # Seasonal components
193
+ seasonal = 0
194
+ periods = [365, 30, 7] # yearly, monthly, weekly seasonality
195
+ for i, (period, amplitude) in enumerate(zip(periods, seasonal_amplitude)):
196
+ seasonal += amplitude * np.sin(2 * np.pi * t / period)
197
+ if i < len(seasonal_amplitude) - 1:
198
+ seasonal += 0.5 * amplitude * np.cos(4 * np.pi * t / period)
199
+
200
+ # Cyclical component (business cycles)
201
+ cycle = 20 * np.sin(2 * np.pi * t / (365*5)) # 5-year cycle
202
+
203
+ # Noise
204
+ noise = np.random.normal(0, noise_std, n_days)
205
+
206
+ # Generate target variable
207
+ raskhodvoda = 100 + trend + seasonal + cycle + noise
208
+
209
+ # Create DataFrame
210
+ self.data = pd.DataFrame(
211
+ index=dates,
212
+ data={'raskhodvoda': raskhodvoda}
213
+ )
214
+
215
+ # Generate exogenous variables
216
+ if include_exogenous:
217
+ # Temperature with seasonality
218
+ tavg = 10 + 8 * np.sin(2 * np.pi * t / 365) + np.random.normal(0, 3, n_days)
219
+ tmin = tavg - 5 + np.random.normal(0, 2, n_days)
220
+ tmax = tavg + 5 + np.random.normal(0, 2, n_days)
221
+
222
+ # Water level with trend and seasonality
223
+ urovenvoda = 200 + 0.5 * t + 20 * np.sin(2 * np.pi * t / 365) + np.random.normal(0, 5, n_days)
224
+
225
+ # Add to DataFrame
226
+ self.data['tavg'] = tavg
227
+ self.data['tmin'] = tmin
228
+ self.data['tmax'] = tmax
229
+ self.data['urovenvoda'] = urovenvoda
230
+
231
+ # Add noisy lags
232
+ for lag in [1, 7, 30]:
233
+ self.data[f'tavg_lag_{lag}'] = self.data['tavg'].shift(lag) + np.random.normal(0, 1, n_days)
234
+
235
+ # Add missing values and outliers for testing
236
+ if n_days > 100:
237
+ # Missing values (5% of data)
238
+ mask_missing = np.random.random(n_days) < 0.05
239
+ self.data.loc[mask_missing, 'tavg'] = np.nan
240
+
241
+ # Outliers (1% of data)
242
+ mask_outliers = np.random.random(n_days) < 0.01
243
+ self.data.loc[mask_outliers, 'raskhodvoda'] *= 2
244
+
245
+ # Save metadata
246
+ self.metadata.update({
247
+ 'is_synthetic': True,
248
+ 'synthetic_params': {
249
+ 'n_days': n_days,
250
+ 'trend_strength': trend_strength,
251
+ 'seasonal_amplitude': seasonal_amplitude,
252
+ 'noise_std': noise_std,
253
+ 'include_exogenous': include_exogenous,
254
+ 'random_state': random_state
255
+ }
256
+ })
257
+
258
+ logger.info(f"✓ Created {len(self.data)} synthetic records")
259
+ logger.info(f" Columns: {list(self.data.columns)}")
260
+
261
+ return self.data
262
+
263
+ def _calculate_data_hash(self) -> str:
264
+ """Calculate data hash for tracking changes"""
265
+ if self.data is None:
266
+ return None
267
+
268
+ # Hash the first 1000 rows as a lightweight fingerprint of the data
269
+ sample = self.data.head(1000).to_string().encode()
270
+ return hashlib.md5(sample).hexdigest()
271
+
272
+ def _analyse_data_types(self) -> None:
273
+ """Analyse data types in DataFrame"""
274
+ if self.data is None:
275
+ return
276
+
277
+ for col in self.data.columns:
278
+ dtype = str(self.data[col].dtype)
279
+
280
+ if 'datetime' in dtype:
281
+ self.data_types[col] = DataType.TEMPORAL.value
282
+ elif 'int' in dtype or 'float' in dtype:
283
+ self.data_types[col] = DataType.NUMERIC.value
284
+ elif 'object' in dtype or 'category' in dtype:
285
+ # Check if categorical
286
+ unique_ratio = self.data[col].nunique() / len(self.data)
287
+ if unique_ratio < 0.1: # Less than 10% unique values
288
+ self.data_types[col] = DataType.CATEGORICAL.value
289
+ else:
290
+ self.data_types[col] = DataType.TEXT.value
291
+ else:
292
+ self.data_types[col] = 'unknown'
293
+
294
+ def _save_metadata(self) -> None:
295
+ """Save data metadata"""
296
+ if self.data is None:
297
+ return
298
+
299
+ # Basic metadata
300
+ self.metadata.update({
301
+ 'original_shape': list(self.original_shape) if self.original_shape else [],
302
+ 'current_shape': list(self.data.shape),
303
+ 'columns': list(self.data.columns),
304
+ 'data_types': self.data_types,
305
+ 'date_range': {
306
+ 'min': self.data.index.min().strftime('%Y-%m-%d') if pd.notnull(self.data.index.min()) else None,
307
+ 'max': self.data.index.max().strftime('%Y-%m-%d') if pd.notnull(self.data.index.max()) else None
308
+ },
309
+ 'data_hash': self.data_hash,
310
+ 'loading_time': self.loading_time
311
+ })
312
+
313
+ # Statistics for numeric columns
314
+ numeric_cols = self.data.select_dtypes(include=[np.number]).columns
315
+ if len(numeric_cols) > 0:
316
+ stats = self.data[numeric_cols].describe().to_dict()
317
+ # Add additional statistics
318
+ for col in numeric_cols:
319
+ stats[col]['skewness'] = float(self.data[col].skew())
320
+ stats[col]['kurtosis'] = float(self.data[col].kurtosis())
321
+ stats[col]['cv'] = float(self.data[col].std() / self.data[col].mean()) if self.data[col].mean() != 0 else np.nan
322
+
323
+ self.metadata['numeric_statistics'] = stats
324
+
325
+ # Missing values information
326
+ missing_info = {
327
+ 'total_missing': int(self.data.isnull().sum().sum()),
328
+ 'missing_by_column': self.data.isnull().sum().to_dict(),
329
+ 'missing_percentage': (self.data.isnull().sum() / len(self.data) * 100).to_dict(),
330
+ 'rows_with_missing': int(self.data.isnull().any(axis=1).sum()),
331
+ 'columns_with_missing': self.data.columns[self.data.isnull().any()].tolist()
332
+ }
333
+ self.metadata['missing_info'] = missing_info
334
+
335
+ def get_data_info(self) -> Dict:
336
+ """Get information about data"""
337
+ if self.data is None:
338
+ return {}
339
+
340
+ info = {
341
+ 'shape': list(self.data.shape),
342
+ 'columns': list(self.data.columns),
343
+ 'data_types': self.data_types,
344
+ 'date_range': {
345
+ 'min': self.data.index.min().strftime('%Y-%m-%d') if pd.notnull(self.data.index.min()) else None,
346
+ 'max': self.data.index.max().strftime('%Y-%m-%d') if pd.notnull(self.data.index.max()) else None
347
+ },
348
+ 'target_column': self.config.target_column,
349
+ 'numeric_columns': self.data.select_dtypes(include=[np.number]).columns.tolist(),
350
+ 'categorical_columns': [col for col, dtype in self.data_types.items()
351
+ if dtype == DataType.CATEGORICAL.value],
352
+ 'missing_info': self.metadata.get('missing_info', {})
353
+ }
354
+
355
+ return info
356
+
357
+ def save_raw_data_info(self) -> None:
358
+ """Save raw data information"""
359
+ if self.data is None:
360
+ return
361
+
362
+ info_path = f'{self.config.results_dir}/reports/raw_data_info.json'
363
+
364
+ # Custom JSON encoder for handling numpy types
365
+ class NumpyEncoder(json.JSONEncoder):
366
+ def default(self, obj):
367
+ if isinstance(obj, (np.integer, np.floating)):
368
+ if np.isnan(obj):
369
+ return None
370
+ return float(obj)
371
+ elif isinstance(obj, np.bool_):
372
+ return bool(obj)
373
+ elif isinstance(obj, np.ndarray):
374
+ return obj.tolist()
375
+ elif isinstance(obj, pd.Timestamp):
376
+ return obj.strftime('%Y-%m-%d %H:%M:%S')
377
+ elif isinstance(obj, pd.Period):
378
+ return str(obj)
379
+ return super().default(obj)
380
+
381
+ with open(info_path, 'w', encoding='utf-8') as f:
382
+ json.dump(self.metadata, f, indent=4, ensure_ascii=False, cls=NumpyEncoder)
383
+
384
+ logger.info(f"✓ Raw data information saved: {info_path}")
385
+
386
+ def resample_data(
387
+ self,
388
+ freq: Optional[str] = None,
389
+ method: str = 'mean'
390
+ ) -> pd.DataFrame:
391
+ """
392
+ Resample time series data
393
+
394
+ Parameters:
395
+ -----------
396
+ freq : str, optional
397
+ New frequency (e.g., 'D', 'W', 'M')
398
+ method : str
399
+ Aggregation method: 'mean', 'sum', 'last', 'first'
400
+
401
+ Returns:
402
+ --------
403
+ pd.DataFrame
404
+ Resampled data
405
+ """
406
+ if self.data is None:
407
+ logger.warning("Data not loaded")
408
+ return None
409
+
410
+ freq = freq or self.config.freq
411
+
412
+ # Check if index is datetime
413
+ if not isinstance(self.data.index, pd.DatetimeIndex):
414
+ logger.error("Data index is not DatetimeIndex")
415
+ return self.data
416
+
417
+ # Aggregation methods
418
+ agg_methods = {
419
+ 'mean': 'mean',
420
+ 'sum': 'sum',
421
+ 'last': lambda x: x.iloc[-1],
422
+ 'first': lambda x: x.iloc[0],
423
+ 'min': 'min',
424
+ 'max': 'max',
425
+ 'median': 'median'
426
+ }
427
+
428
+ if method not in agg_methods:
429
+ logger.warning(f"Method {method} not supported, using mean")
430
+ method = 'mean'
431
+
432
+ # Resampling
433
+ try:
434
+ if method == 'last':
435
+ resampled_data = self.data.resample(freq).last()
436
+ elif method == 'first':
437
+ resampled_data = self.data.resample(freq).first()
438
+ else:
439
+ resampled_data = self.data.resample(freq).agg(agg_methods[method])
440
+
441
+ logger.info(f"Data resampled to frequency {freq}, method {method}")
442
+ logger.info(f"Size before: {len(self.data)}, after: {len(resampled_data)}")
443
+
444
+ self.data = resampled_data
445
+ return self.data
446
+
447
+ except Exception as e:
448
+ logger.error(f"Error during resampling: {e}")
449
+ return self.data
450
+
451
+ def detect_frequency(self) -> str:
452
+ """
453
+ Automatically detect data frequency
454
+
455
+ Returns:
456
+ --------
457
+ str
458
+ Detected data frequency
459
+ """
460
+ if self.data is None or len(self.data) < 2:
461
+ return 'unknown'
462
+
463
+ if not isinstance(self.data.index, pd.DatetimeIndex):
464
+ return 'irregular'
465
+
466
+ # Calculate differences between timestamps
467
+ diffs = pd.Series(self.data.index).diff().dropna()
468
+
469
+ if len(diffs) == 0:
470
+ return 'unknown'
471
+
472
+ # Most frequent difference
473
+ mode_diff = diffs.mode().iloc[0] if not diffs.mode().empty else diffs.iloc[0]
474
+
475
+ # Determine frequency (inclusive bounds, so an exact 1-day gap maps to 'D', not 'W')
476
+ if mode_diff <= pd.Timedelta('1 hour'):
477
+ return 'H' # Hourly
478
+ elif mode_diff <= pd.Timedelta('1 day'):
479
+ return 'D' # Daily
480
+ elif mode_diff <= pd.Timedelta('7 days'):
481
+ return 'W' # Weekly
482
+ elif mode_diff <= pd.Timedelta('31 days'):
483
+ return 'M' # Monthly
484
+ elif mode_diff <= pd.Timedelta('92 days'):
485
+ return 'Q' # Quarterly
486
+ else:
487
+ return 'Y' # Yearly
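
Taken together, `DataLoader` covers CSV ingestion, synthetic generation (trend plus sinusoidal seasonal terms, a 5-year cycle, and Gaussian noise), frequency detection, and resampling. A minimal usage sketch, assuming `Config()` can be constructed with workable defaults for `start_year`, `end_year`, `target_column`, and `freq`:

```python
from config.config import Config
from data_loader.data_loader import DataLoader

config = Config()  # assumption: Config provides usable defaults
loader = DataLoader(config)

# Either load a real CSV (a 'date' column is parsed and becomes the index) ...
# df = loader.load_from_csv("data/flow.csv")  # hypothetical path
# ... or generate the built-in synthetic series for a quick smoke test.
df = loader.create_synthetic_data(n_days=365 * 3, include_exogenous=True)

print(loader.detect_frequency())                 # 'D' for the daily synthetic data
weekly = loader.resample_data(freq="W", method="mean")
print(loader.get_data_info()["shape"])
```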
decomposition/__init__.py ADDED
File without changes
decomposition/decomposer.py ADDED
@@ -0,0 +1,690 @@
1
+ # ============================================
2
+ # CLASS 7: TIME SERIES DECOMPOSITION
3
+ # ============================================
4
+ import traceback
5
+ from typing import Dict, Optional
6
+ import logging
+ logger = logging.getLogger(__name__)
7
+
8
+ from config.config import Config
9
+
10
+ import pandas as pd
11
+ import numpy as np
12
+ import matplotlib.pyplot as plt
13
+ import statsmodels.api as sm
14
+ from scipy import stats
15
+ from statsmodels.tsa.seasonal import seasonal_decompose, STL
16
+ from statsmodels.tsa.stattools import acf
17
+ from statsmodels.graphics.tsaplots import plot_acf
18
+ from statsmodels.stats.diagnostic import acorr_ljungbox
21
+
22
+
23
+ class TimeSeriesDecomposer:
24
+ """Class for time series decomposition"""
25
+
26
+ def __init__(self, config: Config):
27
+ """
28
+ Initialise decomposer
29
+
30
+ Parameters:
31
+ -----------
32
+ config : Config
33
+ Experiment configuration
34
+ """
35
+ self.config = config
36
+ self.decomposition_results = {}
37
+ self.decomposition_models = {}
38
+ self.seasonal_periods = {}
39
+
40
+ def decompose(
41
+ self,
42
+ data: pd.DataFrame,
43
+ target_col: Optional[str] = None,
44
+ method: str = 'stl',
45
+ period: Optional[int] = None,
46
+ **kwargs
47
+ ) -> Dict:
48
+ """
49
+ Decompose time series into components
50
+
51
+ Parameters:
52
+ -----------
53
+ data : pd.DataFrame
54
+ Input data
55
+ target_col : str, optional
56
+ Target variable. If None, uses configuration value.
57
+ method : str
58
+ Decomposition model: 'stl', 'seasonal_decompose', 'mstl', 'naive'
59
+ period : int, optional
60
+ Seasonality period. If None, uses configuration value.
61
+ **kwargs : dict
62
+ Additional parameters for method
63
+
64
+ Returns:
65
+ --------
66
+ Dict
67
+ Decomposition results
68
+ """
69
+ logger.info("\n" + "="*80)
70
+ logger.info("TIME SERIES DECOMPOSITION")
71
+ logger.info("="*80)
72
+
73
+ target_col = target_col or self.config.target_column
74
+ period = period or self.config.seasonal_period
75
+
76
+ if target_col not in data.columns:
77
+ logger.error(f"Target variable '{target_col}' not found")
78
+ return {}
79
+
80
+ # Set date as index if not set
81
+ if not isinstance(data.index, pd.DatetimeIndex):
82
+ if 'date' in data.columns:
83
+ data = data.set_index('date')
84
+ else:
85
+ logger.error("DatetimeIndex required for decomposition")
86
+ return {}
87
+
88
+ series = data[target_col]
89
+
90
+ # Automatic seasonality period detection
91
+ if period is None or period == 'auto':
92
+ period = self._detect_seasonal_period(series)
93
+ logger.info(f"Automatically detected seasonality period: {period}")
94
+
95
+ try:
96
+ decomposition_result = None
97
+
98
+ if method == 'stl':
99
+ decomposition_result = self._stl_decomposition(series, period, **kwargs)
100
+ elif method == 'seasonal_decompose':
101
+ decomposition_result = self._seasonal_decompose(series, period, **kwargs)
102
+ elif method == 'mstl':
103
+ decomposition_result = self._mstl_decomposition(series, **kwargs)
104
+ elif method == 'naive':
105
+ decomposition_result = self._naive_decomposition(series, period, **kwargs)
106
+ else:
107
+ logger.warning(f"Method {method} not supported, using STL")
108
+ decomposition_result = self._stl_decomposition(series, period, **kwargs)
109
+
110
+ if decomposition_result is None:
111
+ logger.error("Decomposition failed")
112
+ return {}
113
+
114
+ # Analyse residuals
115
+ residuals_info = self._analyse_residuals(decomposition_result.get('residual', None))
116
+
117
+ # Analyse seasonality
118
+ seasonal_info = self._analyse_seasonality(
119
+ decomposition_result.get('seasonal', None),
120
+ period
121
+ )
122
+
123
+ # Save results
124
+ self.decomposition_results[target_col] = {
125
+ 'method': method,
126
+ 'period': period,
127
+ 'residuals_analysis': residuals_info,
128
+ 'seasonality_analysis': seasonal_info,
129
+ 'components_present': list(decomposition_result.keys()),
130
+ 'decomposition_stats': {
131
+ 'trend_strength': self._calculate_trend_strength(
132
+ decomposition_result.get('trend', None),
133
+ decomposition_result.get('residual', None)
134
+ ),
135
+ 'seasonal_strength': self._calculate_seasonal_strength(
136
+ decomposition_result.get('seasonal', None),
137
+ decomposition_result.get('residual', None)
138
+ )
139
+ }
140
+ }
141
+
142
+ # Visualisation
143
+ if self.config.save_plots:
144
+ self._plot_decomposition(data, target_col, decomposition_result, method, period)
145
+
146
+ # Additional visualisation
147
+ if residuals_info:
148
+ self._plot_residuals_analysis(decomposition_result.get('residual', None), target_col)
149
+
150
+ return self.decomposition_results[target_col]
151
+
152
+ except Exception as e:
153
+ logger.error(f"Error during decomposition: {e}")
154
+ logger.error(traceback.format_exc())
155
+ return {}
156
+
157
+ def _detect_seasonal_period(self, series: pd.Series) -> int:
158
+ """Automatic seasonality period detection"""
159
+ if len(series) < 100:
160
+ return self.config.seasonal_period
161
+
162
+ try:
163
+ # Use autocorrelation to determine period
164
+ acf_values = acf(series.dropna(), nlags=min(500, len(series)//2))
165
+
166
+ # Find peaks in autocorrelation
167
+ peaks = []
168
+ for i in range(1, len(acf_values)-1):
169
+ if acf_values[i] > acf_values[i-1] and acf_values[i] > acf_values[i+1]:
170
+ if acf_values[i] > 0.3: # Significance threshold
171
+ peaks.append(i)
172
+
173
+ if peaks:
174
+ # Take the first detected peak (smallest lag) as the candidate period
175
+ dominant_period = peaks[0]
176
+
177
+ # Check for multiple periods
178
+ for period in [7, 30, 90, 365]:
179
+ if abs(dominant_period - period) <= 2:
180
+ return period
181
+
182
+ return dominant_period
183
+
184
+ return self.config.seasonal_period
185
+
186
+ except Exception:
187
+ return self.config.seasonal_period
188
+
189
+ def _stl_decomposition(
190
+ self,
191
+ series: pd.Series,
192
+ period: int,
193
+ **kwargs
194
+ ) -> Optional[Dict]:
195
+ """STL decomposition"""
196
+ try:
197
+ if len(series) < 2 * period:
198
+ logger.warning(f"Insufficient data for STL decomposition with period {period}")
199
+ return self._seasonal_decompose(series, period, **kwargs)
200
+
201
+ # STL decomposition
202
+ stl = STL(
203
+ series,
204
+ period=period,
205
+ seasonal=kwargs.get('seasonal', 7),
206
+ trend=kwargs.get('trend', None),
207
+ robust=kwargs.get('robust', True),
208
+ seasonal_deg=kwargs.get('seasonal_deg', 1),
209
+ trend_deg=kwargs.get('trend_deg', 1),
210
+ low_pass_deg=kwargs.get('low_pass_deg', 1)
211
+ )
212
+
213
+ result = stl.fit()
214
+
215
+ return {
216
+ 'trend': result.trend,
217
+ 'seasonal': result.seasonal,
218
+ 'residual': result.resid,
219
+ 'observed': series
220
+ }
221
+
222
+ except Exception as e:
223
+ logger.warning(f"STL decomposition failed: {e}")
224
+ return self._seasonal_decompose(series, period, **kwargs)
225
+
226
+ def _seasonal_decompose(
227
+ self,
228
+ series: pd.Series,
229
+ period: int,
230
+ **kwargs
231
+ ) -> Optional[Dict]:
232
+ """Seasonal decomposition from statsmodels"""
233
+ try:
234
+ model = kwargs.get('model', 'additive')
235
+
236
+ if len(series) < 2 * period:
237
+ # Reduce period if insufficient data
238
+ period = max(7, len(series) // 4)
239
+
240
+ decomposition = seasonal_decompose(
241
+ series,
242
+ model=model,
243
+ period=period,
244
+ extrapolate_trend=kwargs.get('extrapolate_trend', 'freq'),
245
+ two_sided=kwargs.get('two_sided', True)
246
+ )
247
+
248
+ return {
249
+ 'trend': decomposition.trend,
250
+ 'seasonal': decomposition.seasonal,
251
+ 'residual': decomposition.resid,
252
+ 'observed': series
253
+ }
254
+
255
+ except Exception as e:
256
+ logger.warning(f"Seasonal decompose failed: {e}")
257
+ return self._naive_decomposition(series, period, **kwargs)
258
+
259
+ def _mstl_decomposition(
260
+ self,
261
+ series: pd.Series,
262
+ **kwargs
263
+ ) -> Optional[Dict]:
264
+ """Multi-seasonal decomposition (simplified)"""
265
+ try:
266
+ # Simplified MSTL version
267
+ periods = kwargs.get('periods', [7, 365])
268
+
269
+ result = {
270
+ 'observed': series,
271
+ 'trend': None,
272
+ 'seasonal': pd.Series(0, index=series.index),
273
+ 'residual': series.copy()
274
+ }
275
+
276
+ # Sequentially remove seasonal components
277
+ for period in periods:
278
+ if len(series) >= 2 * period:
279
+ try:
280
+ decomp = seasonal_decompose(
281
+ result['residual'],
282
+ model='additive',
283
+ period=period,
284
+ extrapolate_trend='freq'
285
+ )
286
+
287
+ if result['trend'] is None:
288
+ result['trend'] = decomp.trend
289
+
290
+ result['seasonal'] = result['seasonal'] + decomp.seasonal
291
+ result['residual'] = decomp.resid
292
+ except Exception:
293
+ continue
294
+
295
+ if result['trend'] is None:
296
+ result['trend'] = series.rolling(window=min(365, len(series)//4), center=True).mean()
297
+
298
+ return result
299
+
300
+ except Exception as e:
301
+ logger.warning(f"MSTL decomposition failed: {e}")
302
+ return self._seasonal_decompose(series, 365, **kwargs)
303
+
304
+ def _naive_decomposition(
305
+ self,
306
+ series: pd.Series,
307
+ period: int,
308
+ **kwargs
309
+ ) -> Optional[Dict]:
310
+ """Naive decomposition"""
311
+ try:
312
+ # Simple decomposition using moving averages
313
+ trend = series.rolling(
314
+ window=min(period, len(series)//4),
315
+ center=True,
316
+ min_periods=1
317
+ ).mean()
318
+
319
+ # Seasonal component
320
+ if period > 1:
321
+ # Average by seasons
322
+ seasonal = series.groupby(series.index.dayofyear if period == 365 else
323
+ series.index.dayofweek if period == 7 else
324
+ series.index.month).transform('mean')
325
+ seasonal = seasonal - seasonal.mean()
326
+ else:
327
+ seasonal = pd.Series(0, index=series.index)
328
+
329
+ residual = series - trend - seasonal
330
+
331
+ return {
332
+ 'trend': trend,
333
+ 'seasonal': seasonal,
334
+ 'residual': residual,
335
+ 'observed': series
336
+ }
337
+
338
+ except Exception as e:
339
+ logger.error(f"Naive decomposition failed: {e}")
340
+ return None
341
+
342
+ def _analyse_residuals(self, residuals) -> Dict:
343
+ """Analyse decomposition residuals"""
344
+ if residuals is None:
345
+ return {}
346
+
347
+ residuals_clean = residuals.dropna()
348
+
349
+ if len(residuals_clean) == 0:
350
+ return {}
351
+
352
+ stats_info = {
353
+ 'mean': float(residuals_clean.mean()),
354
+ 'std': float(residuals_clean.std()),
355
+ 'skewness': float(residuals_clean.skew()),
356
+ 'kurtosis': float(residuals_clean.kurtosis()),
357
+ 'min': float(residuals_clean.min()),
358
+ 'max': float(residuals_clean.max()),
359
+ 'mad': float((residuals_clean - residuals_clean.mean()).abs().mean()),
360
+ 'normality_tests': {},
361
+ 'autocorrelation_tests': {}
362
+ }
363
+
364
+ # Normality test
365
+ if len(residuals_clean) > 3:
366
+ try:
367
+ # Shapiro-Wilk test (capped at 5000 samples, the test's practical upper limit)
368
+ shapiro_stat, shapiro_p = stats.shapiro(residuals_clean.iloc[:5000])
369
+ stats_info['normality_tests']['shapiro_wilk'] = {
370
+ 'statistic': float(shapiro_stat),
371
+ 'pvalue': float(shapiro_p),
372
+ 'is_normal': shapiro_p > 0.05
373
+ }
374
+
375
+ # Anderson-Darling test
376
+ anderson_result = stats.anderson(residuals_clean, dist='norm')
377
+ stats_info['normality_tests']['anderson_darling'] = {
378
+ 'statistic': float(anderson_result.statistic),
379
+ 'critical_values': {str(level): float(value)
380
+ for level, value in zip(anderson_result.significance_level,
381
+ anderson_result.critical_values)},
382
+ 'is_normal': anderson_result.statistic < anderson_result.critical_values[2] # At 5% level
383
+ }
384
+ except Exception:
385
+ stats_info['normality_tests']['error'] = 'not enough data or calculation error'
386
+
387
+ # Autocorrelation test
388
+ try:
389
+ # Ljung-Box test
390
+ lb_test = acorr_ljungbox(residuals_clean, lags=[10, 20, 30], return_df=True)
391
+
392
+ autocorr_info = {}
393
+ for idx, row in lb_test.iterrows():
394
+ autocorr_info[f'lag_{int(row.name)}'] = {
395
+ 'statistic': float(row['lb_stat']),
396
+ 'pvalue': float(row['lb_pvalue']),
397
+ 'has_autocorrelation': row['lb_pvalue'] < 0.05
398
+ }
399
+
400
+ stats_info['autocorrelation_tests']['ljung_box'] = autocorr_info
401
+
402
+ # Durbin-Watson test
403
+ try:
404
+ dw_stat = sm.stats.stattools.durbin_watson(residuals_clean)
405
+ stats_info['autocorrelation_tests']['durbin_watson'] = {
406
+ 'statistic': float(dw_stat),
407
+ 'interpretation': 'no autocorrelation' if 1.5 < dw_stat < 2.5 else
408
+ 'positive autocorrelation' if dw_stat < 1.5 else
409
+ 'negative autocorrelation'
410
+ }
411
+ except Exception:
412
+ pass
413
+
414
+ except Exception:
415
+ stats_info['autocorrelation_tests']['error'] = 'calculation error'
416
+
417
+ # Heteroskedasticity test
418
+ try:
419
+ # ARCH test
420
+ from statsmodels.stats.diagnostic import het_arch
421
+ arch_test = het_arch(residuals_clean)
422
+ stats_info['heteroskedasticity_tests'] = {
423
+ 'arch': {
424
+ 'statistic': float(arch_test[0]),
425
+ 'pvalue': float(arch_test[1]),
426
+ 'is_homoskedastic': arch_test[1] > 0.05
427
+ }
428
+ }
429
+ except Exception:
430
+ pass
431
+
432
+ return stats_info
433
+
434
+ def _analyse_seasonality(self, seasonal_component, period: int) -> Dict:
435
+ """Analyse seasonal component"""
436
+ if seasonal_component is None:
437
+ return {}
438
+
439
+ seasonal_clean = seasonal_component.dropna()
440
+
441
+ if len(seasonal_clean) == 0:
442
+ return {}
443
+
444
+ analysis = {
445
+ 'period': period,
446
+ 'amplitude': float(seasonal_clean.max() - seasonal_clean.min()),
447
+ 'mean_amplitude': float(seasonal_clean.abs().mean()),
448
+ 'seasonal_strength': float(seasonal_clean.std()),
449
+ 'periodicity_check': {}
450
+ }
451
+
452
+ # Check periodicity via autocorrelation
453
+ if len(seasonal_clean) > period * 2:
454
+ try:
455
+ acf_values = acf(seasonal_clean, nlags=min(period * 3, len(seasonal_clean)//2))
456
+
457
+ # Look for peaks at expected lags
458
+ expected_lags = [period, period*2]
459
+ peaks_found = []
460
+
461
+ for lag in expected_lags:
462
+ if lag < len(acf_values):
463
+ if acf_values[lag] > 0.5: # Strong autocorrelation at period
464
+ peaks_found.append({
465
+ 'lag': lag,
466
+ 'autocorrelation': float(acf_values[lag]),
467
+ 'is_significant': True
468
+ })
469
+
470
+ analysis['periodicity_check']['autocorrelation_peaks'] = peaks_found
471
+ analysis['periodicity_check']['is_periodic'] = len(peaks_found) > 0
472
+ except Exception:
473
+ pass
474
+
475
+ # Seasonality pattern analysis
476
+ if isinstance(seasonal_clean.index, pd.DatetimeIndex):
477
+ try:
478
+ # Group by months/week days
479
+ if period == 12 or period == 365:
480
+ # Monthly seasonality
481
+ monthly_seasonal = seasonal_clean.groupby(seasonal_clean.index.month).mean()
482
+ analysis['monthly_pattern'] = monthly_seasonal.to_dict()
483
+
484
+ if period == 7 or period == 365:
485
+ # Daily seasonality
486
+ daily_seasonal = seasonal_clean.groupby(seasonal_clean.index.dayofweek).mean()
487
+ analysis['daily_pattern'] = daily_seasonal.to_dict()
488
+ except Exception:
489
+ pass
490
+
491
+ return analysis
492
+
493
+ def _calculate_trend_strength(self, trend, residual) -> float:
494
+ """Calculate trend strength"""
495
+ if trend is None or residual is None:
496
+ return 0.0
497
+
498
+ trend_clean = trend.dropna()
499
+ residual_clean = residual.dropna()
500
+
501
+ if len(trend_clean) == 0 or len(residual_clean) == 0:
502
+ return 0.0
503
+
504
+ # Trend strength = 1 - Var(residual) / Var(trend + residual)
505
+ try:
506
+ var_total = np.var(trend_clean + residual_clean)
507
+ if var_total > 0:
508
+ trend_strength = 1 - np.var(residual_clean) / var_total
509
+ return max(0.0, min(1.0, float(trend_strength)))
510
+ except Exception:
511
+ pass
512
+
513
+ return 0.0
514
+
515
+ def _calculate_seasonal_strength(self, seasonal, residual) -> float:
516
+ """Calculate seasonality strength"""
517
+ if seasonal is None or residual is None:
518
+ return 0.0
519
+
520
+ seasonal_clean = seasonal.dropna()
521
+ residual_clean = residual.dropna()
522
+
523
+ if len(seasonal_clean) == 0 or len(residual_clean) == 0:
524
+ return 0.0
525
+
526
+ # Seasonality strength = 1 - Var(residual) / Var(seasonal + residual)
527
+ try:
528
+ var_total = np.var(seasonal_clean + residual_clean)
529
+ if var_total > 0:
530
+ seasonal_strength = 1 - np.var(residual_clean) / var_total
531
+ return max(0.0, min(1.0, float(seasonal_strength)))
532
+ except Exception:
533
+ pass
534
+
535
+ return 0.0
536
+
537
+ def _plot_decomposition(
538
+ self,
539
+ data: pd.DataFrame,
540
+ target_col: str,
541
+ decomposition: Dict,
542
+ method: str,
543
+ period: int
544
+ ) -> None:
545
+ """Visualise decomposition"""
546
+ fig, axes = plt.subplots(4, 1, figsize=(14, 12))
547
+
548
+ # Original series
549
+ axes[0].plot(decomposition.get('observed', pd.Series()))
550
+ axes[0].set_ylabel('Observed')
551
+ axes[0].set_title(f'Time Series Decomposition: {target_col} ({method}, period={period})')
552
+ axes[0].grid(True, alpha=0.3)
553
+
554
+ # Trend
555
+ if 'trend' in decomposition and decomposition['trend'] is not None:
556
+ axes[1].plot(decomposition['trend'])
557
+ axes[1].set_ylabel('Trend')
558
+ axes[1].grid(True, alpha=0.3)
559
+
560
+ # Seasonality
561
+ if 'seasonal' in decomposition and decomposition['seasonal'] is not None:
562
+ axes[2].plot(decomposition['seasonal'])
563
+ axes[2].set_ylabel('Seasonality')
564
+ axes[2].grid(True, alpha=0.3)
565
+
566
+ # Residuals
567
+ if 'residual' in decomposition and decomposition['residual'] is not None:
568
+ axes[3].plot(decomposition['residual'])
569
+ axes[3].set_ylabel('Residuals')
570
+ axes[3].set_xlabel('Date')
571
+ axes[3].grid(True, alpha=0.3)
572
+
573
+ plt.tight_layout()
574
+ plt.savefig(
575
+ f'{self.config.results_dir}/plots/decomposition_{target_col}.png',
576
+ dpi=300,
577
+ bbox_inches='tight'
578
+ )
579
+ plt.close(fig)
580
+
581
+ # Additional plots
582
+ self._plot_decomposition_components(data, target_col, decomposition)
583
+
584
+ def _plot_decomposition_components(
585
+ self,
586
+ data: pd.DataFrame,
587
+ target_col: str,
588
+ decomposition: Dict
589
+ ) -> None:
590
+ """Visualise decomposition components"""
591
+ fig, axes = plt.subplots(2, 2, figsize=(14, 10))
592
+
593
+ # 1. Sum of components vs original series
594
+ if all(k in decomposition for k in ['trend', 'seasonal', 'residual']):
595
+ reconstructed = decomposition['trend'] + decomposition['seasonal'] + decomposition['residual']
596
+ axes[0, 0].plot(decomposition['observed'], alpha=0.7, label='Original')
597
+ axes[0, 0].plot(reconstructed, alpha=0.7, label='Reconstructed')
598
+ axes[0, 0].set_title('Original vs Reconstructed Series')
599
+ axes[0, 0].set_xlabel('Date')
600
+ axes[0, 0].set_ylabel(target_col)
601
+ axes[0, 0].legend()
602
+ axes[0, 0].grid(True, alpha=0.3)
603
+
604
+ # 2. Residuals distribution
605
+ if 'residual' in decomposition and decomposition['residual'] is not None:
606
+ residuals = decomposition['residual'].dropna()
607
+ axes[0, 1].hist(residuals, bins=30, edgecolor='black', alpha=0.7, density=True)
608
+
609
+ # Normal distribution for comparison
610
+ xmin, xmax = axes[0, 1].get_xlim()
611
+ x = np.linspace(xmin, xmax, 100)
612
+ p = stats.norm.pdf(x, residuals.mean(), residuals.std())
613
+ axes[0, 1].plot(x, p, 'k', linewidth=2, label='Normal distribution')
614
+
615
+ axes[0, 1].set_title('Residuals Distribution')
616
+ axes[0, 1].set_xlabel('Residuals')
617
+ axes[0, 1].set_ylabel('Density')
618
+ axes[0, 1].legend()
619
+ axes[0, 1].grid(True, alpha=0.3)
620
+
621
+ # 3. ACF of residuals
622
+ if 'residual' in decomposition and decomposition['residual'] is not None:
623
+ plot_acf(decomposition['residual'].dropna(), lags=50, ax=axes[1, 0], alpha=0.05)
624
+ axes[1, 0].set_title('Residuals ACF')
625
+ axes[1, 0].set_xlabel('Lag')
626
+ axes[1, 0].set_ylabel('Autocorrelation')
627
+ axes[1, 0].grid(True, alpha=0.3)
628
+
629
+ # 4. Seasonal pattern
630
+ if 'seasonal' in decomposition and decomposition['seasonal'] is not None:
631
+ seasonal = decomposition['seasonal']
632
+ if isinstance(seasonal.index, pd.DatetimeIndex):
633
+ # Group by months
634
+ try:
635
+ monthly_seasonal = seasonal.groupby(seasonal.index.month).mean()
636
+ axes[1, 1].bar(monthly_seasonal.index, monthly_seasonal.values)
637
+ axes[1, 1].set_title('Average Seasonal Pattern by Month')
638
+ axes[1, 1].set_xlabel('Month')
639
+ axes[1, 1].set_ylabel('Seasonality')
640
+ axes[1, 1].set_xticks(range(1, 13))
641
+ axes[1, 1].grid(True, alpha=0.3)
642
+ except Exception:
643
+ axes[1, 1].plot(seasonal.index, seasonal.values)
644
+ axes[1, 1].set_title('Seasonal Component')
645
+ axes[1, 1].grid(True, alpha=0.3)
646
+
647
+ plt.tight_layout()
648
+ plt.savefig(
649
+ f'{self.config.results_dir}/plots/decomposition_components_{target_col}.png',
650
+ dpi=300,
651
+ bbox_inches='tight'
652
+ )
653
+ plt.close(fig)
654
+
655
+ def _plot_residuals_analysis(self, residuals, target_col: str) -> None:
656
+ """Visualise residual analysis"""
657
+ if residuals is None:
658
+ return
659
+
660
+ residuals_clean = residuals.dropna()
661
+
662
+ if len(residuals_clean) == 0:
663
+ return
664
+
665
+ fig, axes = plt.subplots(1, 2, figsize=(12, 4))
666
+
667
+ # Q-Q plot
668
+ stats.probplot(residuals_clean, dist="norm", plot=axes[0])
669
+ axes[0].set_title('Residuals Q-Q plot')
670
+ axes[0].grid(True, alpha=0.3)
671
+
672
+ # Residuals over time
673
+ axes[1].plot(residuals_clean.index, residuals_clean.values, linewidth=0.5)
674
+ axes[1].axhline(y=0, color='r', linestyle='-', alpha=0.3)
675
+ axes[1].set_title('Residuals Over Time')
676
+ axes[1].set_xlabel('Date')
677
+ axes[1].set_ylabel('Residuals')
678
+ axes[1].grid(True, alpha=0.3)
679
+
680
+ plt.tight_layout()
681
+ plt.savefig(
682
+ f'{self.config.results_dir}/plots/residuals_analysis_{target_col}.png',
683
+ dpi=300,
684
+ bbox_inches='tight'
685
+ )
686
+ plt.close(fig)
687
+
688
+ def get_report(self) -> Dict:
689
+ """Get decomposition report"""
690
+ return self.decomposition_results
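
The strength scores stored in `decomposition_stats` follow the usual variance-ratio definitions, `F_trend = max(0, 1 - Var(R) / Var(T + R))` and `F_seasonal = max(0, 1 - Var(R) / Var(S + R))`, where `T`, `S`, and `R` are the trend, seasonal, and residual components. A minimal usage sketch, assuming `Config` exposes a `save_plots` flag (set off here to skip figure output) and `df` carries a DatetimeIndex plus the target column:

```python
from config.config import Config
from decomposition.decomposer import TimeSeriesDecomposer

config = Config()          # assumption: workable defaults
config.save_plots = False  # skip figure files in this sketch
decomposer = TimeSeriesDecomposer(config)

result = decomposer.decompose(df, target_col="raskhodvoda", method="stl", period=365)

strengths = result["decomposition_stats"]
print(f"trend strength:    {strengths['trend_strength']:.2f}")     # 1 - Var(R)/Var(T+R)
print(f"seasonal strength: {strengths['seasonal_strength']:.2f}")  # 1 - Var(R)/Var(S+R)
```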
feature_selection/__init__.py ADDED
File without changes
feature_selection/feature_selector.py ADDED
@@ -0,0 +1,478 @@
1
+ # ============================================
2
+ # CLASS 11: FEATURE SELECTION
3
+ # ============================================
4
+ from typing import Dict, List, Optional, Tuple
5
+ import logging
+ logger = logging.getLogger(__name__)
6
+ from config.config import Config
7
+
8
+ try:
9
+ import pandas as pd
10
+ import numpy as np
11
+ import matplotlib.pyplot as plt
12
+ import seaborn as sns
13
+ from sklearn.ensemble import RandomForestRegressor
14
+ from sklearn.decomposition import PCA
15
+ from sklearn.preprocessing import StandardScaler
16
+ logger.debug("✅ All imports working!")
17
+ except ImportError as e:
18
+ logger.error(f"❌ Import error: {e}")
+ raise
19
+
20
+ from sklearn.feature_selection import mutual_info_regression
25
+
26
+ class FeatureSelector:
27
+ """Class for selecting the most important features"""
28
+
29
+ def __init__(self, config: Config):
30
+ """
31
+ Initialise feature selector
32
+
33
+ Parameters:
34
+ -----------
35
+ config : Config
36
+ Experiment configuration
37
+ """
38
+ self.config = config
39
+ self.selected_features = []
40
+ self.feature_importances = {}
41
+ self.selection_methods = {}
42
+ self.selector_objects = {}
43
+
44
+ def select(
45
+ self,
46
+ data: pd.DataFrame,
47
+ target_col: Optional[str] = None,
48
+ method: Optional[str] = None,
49
+ n_features: Optional[int] = None,
50
+ **kwargs
51
+ ) -> pd.DataFrame:
52
+ """
53
+ Select the most important features
54
+
55
+ Parameters:
56
+ -----------
57
+ data : pd.DataFrame
58
+ Input data
59
+ target_col : str, optional
60
+ Target variable. If None, uses configuration value.
61
+ method : str, optional
62
+ Selection method. If None, uses configuration value.
63
+ n_features : int, optional
64
+ Number of features to select. If None, uses configuration value.
65
+ **kwargs : dict
66
+ Additional parameters for method
67
+
68
+ Returns:
69
+ --------
70
+ pd.DataFrame
71
+ Data with selected features
72
+ """
73
+ logger.info("\n" + "="*80)
74
+ logger.info("FEATURE SELECTION")
75
+ logger.info("="*80)
76
+
77
+ target_col = target_col or self.config.target_column
78
+ method = method or self.config.feature_selection_method
79
+ n_features = n_features or self.config.max_features
80
+
81
+ if target_col not in data.columns:
82
+ logger.error(f"Target variable '{target_col}' not found")
83
+ return data
84
+
85
+ # Prepare data
86
+ X = data.drop(columns=[target_col]).select_dtypes(include=[np.number])
87
+ y = data[target_col]
88
+
89
+ # Remove missing values
90
+ mask = X.notna().all(axis=1) & y.notna()
91
+ X_clean = X[mask]
92
+ y_clean = y[mask]
93
+
94
+ if len(X_clean) < 10 or len(X_clean.columns) < 2:
95
+ logger.warning("Insufficient data for feature selection")
96
+ return data
97
+
98
+ logger.info(f"Selection method: {method}")
99
+ logger.info(f"Target number of features: {n_features}")
100
+ logger.info(f"Initial number of features: {len(X.columns)}")
101
+ logger.info(f"Data for selection: {len(X_clean)} records")
102
+
103
+ # Apply selection method
104
+ selected_features_list = []
105
+ feature_importance_dict = {}
106
+
107
+ if method == 'correlation':
108
+ selected_features_list, feature_importance_dict = self._correlation_selection(
109
+ X_clean, y_clean, n_features, **kwargs
110
+ )
111
+
112
+ elif method == 'mutual_info':
113
+ selected_features_list, feature_importance_dict = self._mutual_info_selection(
114
+ X_clean, y_clean, n_features, **kwargs
115
+ )
116
+
117
+ elif method == 'rf':
118
+ selected_features_list, feature_importance_dict = self._random_forest_selection(
119
+ X_clean, y_clean, n_features, **kwargs
120
+ )
121
+
122
+ elif method == 'pca':
123
+ selected_features_list, feature_importance_dict = self._pca_selection(
124
+ X_clean, y_clean, n_features, **kwargs
125
+ )
126
+
127
+ elif method == 'rfe':
128
+ selected_features_list, feature_importance_dict = self._rfe_selection(
129
+ X_clean, y_clean, n_features, **kwargs
130
+ )
131
+
132
+ elif method == 'lasso':
133
+ selected_features_list, feature_importance_dict = self._lasso_selection(
134
+ X_clean, y_clean, n_features, **kwargs
135
+ )
136
+
137
+ elif method == 'hybrid':
138
+ selected_features_list, feature_importance_dict = self._hybrid_selection(
139
+ X_clean, y_clean, n_features, **kwargs
140
+ )
141
+
142
+ else:
143
+ logger.warning(f"Method {method} not supported, using correlation")
144
+ selected_features_list, feature_importance_dict = self._correlation_selection(
145
+ X_clean, y_clean, n_features, **kwargs
146
+ )
147
+
148
+ # Save selected features
149
+ self.selected_features = selected_features_list
150
+ self.feature_importances = feature_importance_dict
151
+ self.selection_methods[method] = {
152
+ 'selected_features': selected_features_list,
153
+ 'n_features': len(selected_features_list),
154
+ 'feature_importances': feature_importance_dict
155
+ }
156
+
157
+ # Form final dataset
158
+ features_to_keep = selected_features_list + [target_col]
159
+ features_to_keep = [f for f in features_to_keep if f in data.columns]
160
+
161
+ data_selected = data[features_to_keep].copy()
162
+
163
+ logger.info(f"✓ Selected {len(selected_features_list)} features")
164
+ logger.info(f" Total features kept: {len(data_selected.columns)}")
165
+
166
+ # Visualisation
167
+ if self.config.save_plots and selected_features_list:
168
+ self._plot_feature_selection(
169
+ X_clean, y_clean, selected_features_list,
170
+ feature_importance_dict, method
171
+ )
172
+
173
+ return data_selected
174
+
175
+ def _correlation_selection(
176
+ self,
177
+ X: pd.DataFrame,
178
+ y: pd.Series,
179
+ n_features: int,
180
+ **kwargs
181
+ ) -> Tuple[List[str], Dict]:
182
+ """Feature selection based on correlation"""
183
+ # Calculate correlations with target variable
184
+ correlations = X.corrwith(y).abs().sort_values(ascending=False)
185
+
186
+ # Select top-n_features
187
+ selected_features = correlations.head(n_features).index.tolist()
188
+ feature_importance = correlations.to_dict()
189
+
190
+ return selected_features, feature_importance
191
+
192
+ def _mutual_info_selection(
193
+ self,
194
+ X: pd.DataFrame,
195
+ y: pd.Series,
196
+ n_features: int,
197
+ **kwargs
198
+ ) -> Tuple[List[str], Dict]:
199
+ """Feature selection based on mutual information"""
200
+ try:
201
+ mi_scores = mutual_info_regression(X, y, random_state=kwargs.get('random_state', 42))
202
+ mi_series = pd.Series(mi_scores, index=X.columns)
203
+ mi_series = mi_series.sort_values(ascending=False)
204
+
205
+ selected_features = mi_series.head(n_features).index.tolist()
206
+ feature_importance = mi_series.to_dict()
207
+
208
+ return selected_features, feature_importance
209
+
210
+ except Exception as e:
211
+ logger.warning(f"Mutual information selection failed: {e}, using correlation")
212
+ return self._correlation_selection(X, y, n_features, **kwargs)
213
+
214
+ def _random_forest_selection(
215
+ self,
216
+ X: pd.DataFrame,
217
+ y: pd.Series,
218
+ n_features: int,
219
+ **kwargs
220
+ ) -> Tuple[List[str], Dict]:
221
+ """Feature selection based on Random Forest"""
222
+ try:
223
+ rf = RandomForestRegressor(
224
+ n_estimators=kwargs.get('n_estimators', 100),
225
+ max_depth=kwargs.get('max_depth', None),
226
+ random_state=kwargs.get('random_state', 42),
227
+ n_jobs=self.config.n_jobs if self.config.use_multiprocessing else None
228
+ )
229
+
230
+ rf.fit(X, y)
231
+ importances = pd.Series(rf.feature_importances_, index=X.columns)
232
+ importances = importances.sort_values(ascending=False)
233
+
234
+ selected_features = importances.head(n_features).index.tolist()
235
+ feature_importance = importances.to_dict()
236
+
237
+ self.selector_objects['random_forest'] = rf
238
+
239
+ return selected_features, feature_importance
240
+
241
+ except Exception as e:
242
+ logger.warning(f"Random Forest selection failed: {e}, using correlation")
243
+ return self._correlation_selection(X, y, n_features, **kwargs)
244
+
245
+ def _pca_selection(
246
+ self,
247
+ X: pd.DataFrame,
248
+ y: pd.Series,
249
+ n_features: int,
250
+ **kwargs
251
+ ) -> Tuple[List[str], Dict]:
252
+ """Feature selection based on PCA"""
253
+ try:
254
+ # Standardise the data first (StandardScaler is imported at module level)
+ 
257
+ scaler = StandardScaler()
258
+ X_scaled = scaler.fit_transform(X)
259
+
260
+ # Apply PCA
261
+ pca = PCA(n_components=min(n_features, len(X.columns)))
262
+ X_pca = pca.fit_transform(X_scaled)
263
+
264
+ # Get feature importance via absolute component values
265
+ importance = np.abs(pca.components_).sum(axis=0)
266
+ importance_series = pd.Series(importance, index=X.columns)
267
+ importance_series = importance_series.sort_values(ascending=False)
268
+
269
+ selected_features = importance_series.head(n_features).index.tolist()
270
+ feature_importance = importance_series.to_dict()
271
+
272
+ self.selector_objects['pca'] = pca
273
+ self.selector_objects['scaler'] = scaler
274
+
275
+ return selected_features, feature_importance
276
+
277
+ except Exception as e:
278
+ logger.warning(f"PCA selection failed: {e}, using correlation")
279
+ return self._correlation_selection(X, y, n_features, **kwargs)
280
+
281
+ def _rfe_selection(
282
+ self,
283
+ X: pd.DataFrame,
284
+ y: pd.Series,
285
+ n_features: int,
286
+ **kwargs
287
+ ) -> Tuple[List[str], Dict]:
288
+ """Recursive Feature Elimination"""
289
+ try:
290
+ from sklearn.feature_selection import RFE
291
+ from sklearn.linear_model import LinearRegression
292
+
293
+ estimator = LinearRegression()
294
+ rfe = RFE(
295
+ estimator=estimator,
296
+ n_features_to_select=n_features,
297
+ step=kwargs.get('step', 1)
298
+ )
299
+
300
+ rfe.fit(X, y)
301
+ selected_mask = rfe.support_
302
+ selected_features = X.columns[selected_mask].tolist()
303
+
304
+ # Feature importance via ranking
305
+ ranking = pd.Series(rfe.ranking_, index=X.columns)
306
+ feature_importance = (1 / ranking).to_dict() # Convert ranking to importance
307
+
308
+ self.selector_objects['rfe'] = rfe
309
+
310
+ return selected_features, feature_importance
311
+
312
+ except Exception as e:
313
+ logger.warning(f"RFE selection failed: {e}, using correlation")
314
+ return self._correlation_selection(X, y, n_features, **kwargs)
315
+
316
+ def _lasso_selection(
317
+ self,
318
+ X: pd.DataFrame,
319
+ y: pd.Series,
320
+ n_features: int,
321
+ **kwargs
322
+ ) -> Tuple[List[str], Dict]:
323
+ """Feature selection using Lasso"""
324
+ try:
325
+ from sklearn.linear_model import LassoCV
326
+
327
+ lasso = LassoCV(
328
+ cv=kwargs.get('cv', 5),
329
+ random_state=kwargs.get('random_state', 42),
330
+ max_iter=kwargs.get('max_iter', 1000)
331
+ )
332
+
333
+ lasso.fit(X, y)
334
+
335
+ # Features with non-zero coefficients
336
+ coefficients = pd.Series(lasso.coef_, index=X.columns)
337
+ non_zero_features = coefficients[coefficients != 0].abs().sort_values(ascending=False)
338
+
339
+ # Select top-n_features
340
+ selected_features = non_zero_features.head(n_features).index.tolist()
341
+ feature_importance = non_zero_features.to_dict()
342
+
343
+ self.selector_objects['lasso'] = lasso
344
+
345
+ return selected_features, feature_importance
346
+
347
+ except Exception as e:
348
+ logger.warning(f"Lasso selection failed: {e}, using correlation")
349
+ return self._correlation_selection(X, y, n_features, **kwargs)
350
+
351
+ def _hybrid_selection(
352
+ self,
353
+ X: pd.DataFrame,
354
+ y: pd.Series,
355
+ n_features: int,
356
+ **kwargs
357
+ ) -> Tuple[List[str], Dict]:
358
+ """Hybrid feature selection method"""
359
+ # Combine multiple methods
360
+ methods = kwargs.get('methods', ['correlation', 'mutual_info', 'rf'])
361
+ weights = kwargs.get('weights', [0.3, 0.3, 0.4])
362
+
363
+ all_importances = {}
364
+
365
+ for method, weight in zip(methods, weights):
366
+ try:
367
+ if method == 'correlation':
368
+ _, importance = self._correlation_selection(X, y, n_features, **kwargs)
369
+ elif method == 'mutual_info':
370
+ _, importance = self._mutual_info_selection(X, y, n_features, **kwargs)
371
+ elif method == 'rf':
372
+ _, importance = self._random_forest_selection(X, y, n_features, **kwargs)
373
+ else:
374
+ continue
375
+
376
+ # Normalise importances and weight them
377
+ importance_series = pd.Series(importance)
378
+ if importance_series.max() > importance_series.min():
379
+ importance_normalized = (importance_series - importance_series.min()) / \
380
+ (importance_series.max() - importance_series.min())
381
+ else:
382
+ importance_normalized = pd.Series(1, index=importance_series.index)
383
+
384
+ # Add weighted importances
385
+ for feature in importance_normalized.index:
386
+ if feature not in all_importances:
387
+ all_importances[feature] = 0
388
+ all_importances[feature] += importance_normalized[feature] * weight
389
+
390
+ except Exception as e:
391
+ logger.debug(f"Method {method} failed in hybrid selection: {e}")
392
+
393
+ # Sort by total importance
394
+ combined_importance = pd.Series(all_importances).sort_values(ascending=False)
395
+ selected_features = combined_importance.head(n_features).index.tolist()
396
+
397
+ return selected_features, combined_importance.to_dict()
398
+
399
+ def _plot_feature_selection(
400
+ self,
401
+ X: pd.DataFrame,
402
+ y: pd.Series,
403
+ selected_features: List[str],
404
+ feature_importance: Dict,
405
+ method: str
406
+ ) -> None:
407
+ """Visualise feature selection results"""
408
+ # Prepare data for visualisation
409
+ importance_series = pd.Series(feature_importance).sort_values(ascending=False)
410
+
411
+ # Limit number of features for display
412
+ display_features = importance_series.head(20)
413
+
414
+ fig, axes = plt.subplots(2, 2, figsize=(14, 10))
415
+
416
+ # 1. Feature importance
417
+ y_pos = np.arange(len(display_features))
418
+ axes[0, 0].barh(y_pos, display_features.values)
419
+ axes[0, 0].set_yticks(y_pos)
420
+ axes[0, 0].set_yticklabels(display_features.index, fontsize=9)
421
+ axes[0, 0].invert_yaxis()
422
+ axes[0, 0].set_xlabel('Importance')
423
+ axes[0, 0].set_title(f'Top-{len(display_features)} features by importance ({method})')
424
+ axes[0, 0].grid(True, alpha=0.3, axis='x')
425
+
426
+ # 2. Cumulative importance
427
+ cumulative_importance = importance_series.cumsum() / importance_series.sum()
428
+ axes[0, 1].plot(range(1, len(cumulative_importance) + 1), cumulative_importance.values)
429
+ axes[0, 1].axhline(y=0.8, color='r', linestyle='--', alpha=0.7, label='80% importance')
430
+ axes[0, 1].axhline(y=0.9, color='orange', linestyle='--', alpha=0.7, label='90% importance')
431
+ axes[0, 1].set_xlabel('Number of features')
432
+ axes[0, 1].set_ylabel('Cumulative importance')
433
+ axes[0, 1].set_title('Cumulative feature importance')
434
+ axes[0, 1].legend()
435
+ axes[0, 1].grid(True, alpha=0.3)
436
+
437
+ # 3. Correlation matrix of selected features
438
+ if len(selected_features) > 1:
439
+ selected_X = X[selected_features]
440
+ corr_matrix = selected_X.corr()
441
+
442
+ mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
443
+ sns.heatmap(
444
+ corr_matrix,
445
+ annot=True,
446
+ fmt='.2f',
447
+ cmap='coolwarm',
448
+ center=0,
449
+ square=True,
450
+ mask=mask,
451
+ cbar_kws={'shrink': 0.8},
452
+ ax=axes[1, 0]
453
+ )
454
+ axes[1, 0].set_title(f'Correlation of selected features ({len(selected_features)})')
455
+
456
+ # 4. Importance distribution
457
+ axes[1, 1].hist(importance_series.values, bins=30, edgecolor='black', alpha=0.7)
458
+ axes[1, 1].set_xlabel('Feature importance')
459
+ axes[1, 1].set_ylabel('Frequency')
460
+ axes[1, 1].set_title('Feature importance distribution')
461
+ axes[1, 1].grid(True, alpha=0.3)
462
+
463
+ plt.suptitle(f'Feature selection results using {method} method', fontsize=14)
464
+ plt.tight_layout()
465
+ plt.savefig(
466
+ f'{self.config.results_dir}/plots/feature_selection_{method}.png',
467
+ dpi=300,
468
+ bbox_inches='tight'
469
+ )
470
+ plt.close(fig)
471
+
472
+ def get_report(self) -> Dict:
473
+ """Get feature selection report"""
474
+ return {
475
+ 'selected_features': self.selected_features,
476
+ 'feature_importances': self.feature_importances,
477
+ 'selection_methods': self.selection_methods
478
+ }
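
And a minimal usage sketch for the selector, again assuming a constructible `Config` and a numeric DataFrame `df` containing the target column; the hybrid weights shown simply mirror the defaults in `_hybrid_selection`:

```python
from config.config import Config
from feature_selection.feature_selector import FeatureSelector

config = Config()          # assumption: workable defaults
config.save_plots = False  # skip the diagnostic plots in this sketch
selector = FeatureSelector(config)

df_selected = selector.select(
    df,
    target_col="raskhodvoda",
    method="hybrid",
    n_features=15,
    methods=["correlation", "mutual_info", "rf"],
    weights=[0.3, 0.3, 0.4],
)
print(selector.get_report()["selected_features"])
```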
features/__init__.py ADDED
File without changes
features/feature_engineer.py ADDED
@@ -0,0 +1,638 @@
1
+ # ============================================
2
+ # CLASS 5: FEATURE ENGINEER
3
+ # ============================================
4
+ from typing import Dict, List, Optional
5
+ import logging
+ logger = logging.getLogger(__name__)
6
+
7
+ from config.config import Config
8
+
9
+ import pandas as pd
10
+ import numpy as np
11
+
12
+
13
+ class FeatureEngineer:
14
+ """Class for creating new features for time series"""
15
+
16
+ def __init__(self, config: Config):
17
+ """
18
+ Initialise feature engineer
19
+
20
+ Parameters:
21
+ -----------
22
+ config : Config
23
+ Experiment configuration
24
+ """
25
+ self.config = config
26
+ self.created_features = []
27
+ self.feature_info = {}
28
+ self.feature_importances = {}
29
+ self.transforms_applied = {}
30
+
31
+ def create_all_features(
32
+ self,
33
+ data: pd.DataFrame,
34
+ target_col: Optional[str] = None
35
+ ) -> pd.DataFrame:
36
+ """
37
+ Create all types of features
38
+
39
+ Parameters:
40
+ -----------
41
+ data : pd.DataFrame
42
+ Input data
43
+ target_col : str, optional
44
+ Target variable. If None, uses configuration value.
45
+
46
+ Returns:
47
+ --------
48
+ pd.DataFrame
49
+ Data with all features
50
+ """
51
+ logger.info("\n" + "="*80)
52
+ logger.info("CREATING FEATURES FOR TIME SERIES")
53
+ logger.info("="*80)
54
+
55
+ target_col = target_col or self.config.target_column
56
+ initial_features = len(data.columns)
57
+ initial_rows = len(data)
58
+
59
+ # Check and save index
60
+ original_index = data.index
61
+ index_is_datetime = isinstance(original_index, pd.DatetimeIndex)
62
+
63
+ logger.info(f"Initial number of features: {initial_features}")
64
+ logger.info(f"Initial number of rows: {initial_rows}")
65
+ logger.info(f"Index is DatetimeIndex: {index_is_datetime}")
66
+
67
+ # If index not DatetimeIndex but 'date' column exists
68
+ if not index_is_datetime and 'date' in data.columns:
69
+ logger.info("Attempting to set DatetimeIndex from 'date' column")
70
+ try:
71
+ data = data.set_index('date')
72
+ if isinstance(data.index, pd.DatetimeIndex):
73
+ index_is_datetime = True
74
+ original_index = data.index
75
+ logger.info("✓ DatetimeIndex set from 'date' column")
76
+ else:
77
+ logger.warning("Failed to set DatetimeIndex")
78
+ except Exception as e:
79
+ logger.warning(f"Error setting DatetimeIndex: {e}")
80
+
81
+ # Save data copy for index restoration later
82
+ data_processed = data.copy()
83
+
84
+ # 1. Create basic temporal features (if date exists)
85
+ if index_is_datetime:
86
+ logger.info("\n1. BASIC TEMPORAL FEATURES")
87
+ data_processed = self.create_temporal_features(data_processed)
88
+ else:
89
+ logger.info("\n1. BASIC TEMPORAL FEATURES: skipped (no DatetimeIndex)")
90
+
91
+ # 2. Create statistical features
92
+ logger.info("\n2. STATISTICAL FEATURES")
93
+ data_processed = self.create_statistical_features(data_processed, target_col)
94
+
95
+ # 3. Create rolling features
96
+ logger.info("\n3. ROLLING FEATURES")
97
+ data_processed = self.create_rolling_features(data_processed, target_col)
98
+
99
+ # 4. Create lag features (limited quantity)
100
+ logger.info("\n4. LAG FEATURES")
101
+ data_processed = self.create_lag_features(data_processed, target_col)
102
+
103
+ # 5. Create interaction features
104
+ logger.info("\n5. INTERACTION FEATURES")
105
+ data_processed = self.create_interaction_features(data_processed, target_col)
106
+
107
+ # 6. Create spectral features (only if sufficient data)
108
+ logger.info("\n6. SPECTRAL FEATURES")
109
+ if len(data_processed) > 100:
110
+ data_processed = self.create_spectral_features(data_processed, target_col)
111
+ else:
112
+ logger.info(" Skipped: insufficient data")
113
+
114
+ # 7. Create decomposition features (only if sufficient data and date exists)
115
+ logger.info("\n7. DECOMPOSITION FEATURES")
116
+ if len(data_processed) > 365 and index_is_datetime:
117
+ data_processed = self.create_decomposition_features(data_processed, target_col)
118
+ else:
119
+ logger.info(" Skipped: insufficient data or no DatetimeIndex")
120
+
121
+ # Remove rows with NaN that appeared due to lags and differences
122
+ rows_before_nan = len(data_processed)
123
+ data_processed = data_processed.dropna()
124
+ rows_after_nan = len(data_processed)
125
+ removed_rows = rows_before_nan - rows_after_nan
126
+
127
+ # Remove constant features
128
+ constant_features = []
129
+ for col in data_processed.columns:
130
+ if data_processed[col].nunique() <= 1:
131
+ constant_features.append(col)
132
+
133
+ if constant_features:
134
+ logger.info(f"\nRemoving constant features: {len(constant_features)} found")
135
+ for feat in constant_features[:10]:
136
+ logger.info(f" - {feat}")
137
+ if len(constant_features) > 10:
138
+ logger.info(f" ... and {len(constant_features) - 10} more features")
139
+
140
+ data_processed = data_processed.drop(columns=constant_features)
141
+ # Update created features list
142
+ self.created_features = [f for f in self.created_features if f not in constant_features]
143
+
144
+ # Save information
145
+ self.feature_info = {
146
+ 'initial_features': initial_features,
147
+ 'final_features': len(data_processed.columns),
148
+ 'features_created': len(self.created_features),
149
+ 'initial_rows': initial_rows,
150
+ 'final_rows': len(data_processed),
151
+ 'removed_rows': removed_rows,
152
+ 'constant_features_removed': len(constant_features),
153
+ 'created_features_list': self.created_features,
154
+ 'feature_categories': self.get_feature_categories()
155
+ }
156
+
157
+ logger.info(f"\nFeature creation summary:")
158
+ logger.info(f" Initial number of features: {initial_features}")
159
+ logger.info(f" Final number of features: {len(data_processed.columns)}")
160
+ logger.info(f" New features created: {len(self.created_features)}")
161
+ logger.info(f" Initial number of rows: {initial_rows}")
162
+ logger.info(f" Final number of rows: {len(data_processed)}")
163
+ logger.info(f" Rows removed due to NaN: {removed_rows}")
164
+ logger.info(f" Constant features removed: {len(constant_features)}")
165
+
166
+ return data_processed
167
+
168
+ def create_temporal_features(self, data: pd.DataFrame) -> pd.DataFrame:
169
+ """
170
+ Create temporal features
171
+
172
+ Parameters:
173
+ -----------
174
+ data : pd.DataFrame
175
+ Input data
176
+
177
+ Returns:
178
+ --------
179
+ pd.DataFrame
180
+ Data with temporal features
181
+ """
182
+ data_processed = data.copy()
183
+
184
+ if not isinstance(data_processed.index, pd.DatetimeIndex):
185
+ logger.warning("Temporal features not created: index not DatetimeIndex")
186
+ return data_processed
187
+
188
+ try:
189
+ # Basic temporal features
190
+ data_processed['year'] = data_processed.index.year
191
+ data_processed['month'] = data_processed.index.month
192
+ data_processed['day'] = data_processed.index.day
193
+ data_processed['dayofyear'] = data_processed.index.dayofyear
194
+ data_processed['dayofweek'] = data_processed.index.dayofweek
195
+ data_processed['weekofyear'] = data_processed.index.isocalendar().week.astype(int)
196
+ data_processed['quarter'] = data_processed.index.quarter
197
+ data_processed['is_weekend'] = data_processed['dayofweek'].isin([5, 6]).astype(int)
198
+
199
+ # Cyclic features for seasonality
200
+ data_processed['month_sin'] = np.sin(2 * np.pi * data_processed['month'] / 12)
201
+ data_processed['month_cos'] = np.cos(2 * np.pi * data_processed['month'] / 12)
202
+ data_processed['dayofyear_sin'] = np.sin(2 * np.pi * data_processed['dayofyear'] / 365.25)
203
+ data_processed['dayofyear_cos'] = np.cos(2 * np.pi * data_processed['dayofyear'] / 365.25)
204
+ data_processed['dayofweek_sin'] = np.sin(2 * np.pi * data_processed['dayofweek'] / 7)
205
+ data_processed['dayofweek_cos'] = np.cos(2 * np.pi * data_processed['dayofweek'] / 7)
206
+
207
+ # Time in days from start (relative features)
208
+ min_date = data_processed.index.min()
209
+ data_processed['days_from_start'] = (data_processed.index - min_date).days
210
+
211
+ # Register created features
212
+ temporal_features = ['year', 'month', 'day', 'dayofyear', 'dayofweek',
213
+ 'weekofyear', 'quarter', 'is_weekend', 'month_sin',
214
+ 'month_cos', 'dayofyear_sin', 'dayofyear_cos',
215
+ 'dayofweek_sin', 'dayofweek_cos', 'days_from_start']
216
+
217
+ self.created_features.extend([f for f in temporal_features if f not in self.created_features])
218
+
219
+ logger.info(f"✓ Created {len(temporal_features)} temporal features")
220
+
221
+ except Exception as e:
222
+ logger.warning(f"Error creating temporal features: {e}")
223
+
224
+ return data_processed
225
+
226
+ def create_statistical_features(
227
+ self,
228
+ data: pd.DataFrame,
229
+ target_col: str
230
+ ) -> pd.DataFrame:
231
+ """
232
+ Create statistical features
233
+
234
+ Parameters:
235
+ -----------
236
+ data : pd.DataFrame
237
+ Input data
238
+ target_col : str
239
+ Target variable
240
+
241
+ Returns:
242
+ --------
243
+ pd.DataFrame
244
+ Data with statistical features
245
+ """
246
+ data_processed = data.copy()
247
+
248
+ if target_col not in data_processed.columns:
249
+ logger.warning(f"Target variable '{target_col}' not found")
250
+ return data_processed
251
+
252
+ # Only if we have year data
253
+ if 'year' in data_processed.columns:
254
+ # Yearly statistics
255
+ try:
256
+ yearly_stats = data_processed.groupby('year')[target_col].agg([
257
+ 'mean', 'std', 'min', 'max', 'median'
258
+ ])
259
+ yearly_stats.columns = [f'{target_col}_yearly_{col}' for col in yearly_stats.columns]
260
+ data_processed = data_processed.merge(yearly_stats, on='year', how='left')
261
+
262
+ # Add created features to list
263
+ for col in yearly_stats.columns:
264
+ self.created_features.append(col)
265
+ except Exception as e:
266
+ logger.debug(f"Yearly statistics not created: {e}")
267
+
268
+ # Normalised features (only if there is variation)
269
+ std_val = data_processed[target_col].std()
270
+ if std_val > 0:
271
+ data_processed[f'{target_col}_zscore'] = (data_processed[target_col] - data_processed[target_col].mean()) / std_val
272
+ self.created_features.append(f'{target_col}_zscore')
273
+
274
+ # Features based on percentiles (binary features)
275
+ try:
276
+ for p in [0.25, 0.5, 0.75]:
277
+ quantile_val = data_processed[target_col].quantile(p)
278
+ data_processed[f'{target_col}_above_p{int(p*100)}'] = (data_processed[target_col] > quantile_val).astype(int)
279
+ self.created_features.append(f'{target_col}_above_p{int(p*100)}')
280
+ except Exception as e:
281
+ logger.debug(f"Quantile features not created: {e}")
282
+
283
+ logger.info(f"✓ Statistical features created: {len([c for c in data_processed.columns if c not in data.columns])}")
284
+ return data_processed
285
+
286
+ def create_rolling_features(
287
+ self,
288
+ data: pd.DataFrame,
289
+ target_col: str
290
+ ) -> pd.DataFrame:
291
+ """
292
+ Create rolling statistics
293
+
294
+ Parameters:
295
+ -----------
296
+ data : pd.DataFrame
297
+ Input data
298
+ target_col : str
299
+ Target variable
300
+
301
+ Returns:
302
+ --------
303
+ pd.DataFrame
304
+ Data with rolling features
305
+ """
306
+ data_processed = data.copy()
307
+
308
+ if target_col not in data_processed.columns:
309
+ logger.warning(f"Target variable '{target_col}' not found")
310
+ return data_processed
311
+
312
+ # Use only main windows from configuration
313
+ windows = [w for w in self.config.rolling_windows if w < len(data_processed) // 2]
314
+
315
+ for window in windows:
316
+ try:
317
+ # Basic statistics
318
+ data_processed[f'{target_col}_rolling_mean_{window}'] = data_processed[target_col].rolling(
319
+ window=window, min_periods=max(1, window//4), center=True
320
+ ).mean()
321
+
322
+ data_processed[f'{target_col}_rolling_std_{window}'] = data_processed[target_col].rolling(
323
+ window=window, min_periods=max(1, window//4), center=True
324
+ ).std()
325
+
326
+ data_processed[f'{target_col}_rolling_min_{window}'] = data_processed[target_col].rolling(
327
+ window=window, min_periods=max(1, window//4), center=True
328
+ ).min()
329
+
330
+ data_processed[f'{target_col}_rolling_max_{window}'] = data_processed[target_col].rolling(
331
+ window=window, min_periods=max(1, window//4), center=True
332
+ ).max()
333
+
334
+ self.created_features.extend([
335
+ f'{target_col}_rolling_mean_{window}',
336
+ f'{target_col}_rolling_std_{window}',
337
+ f'{target_col}_rolling_min_{window}',
338
+ f'{target_col}_rolling_max_{window}'
339
+ ])
340
+ except Exception as e:
341
+ logger.debug(f"Rolling features for window {window} not created: {e}")
342
+ continue
343
+
344
+ logger.info(f"✓ Rolling features created: {len([c for c in data_processed.columns if 'rolling' in c and c not in data.columns])}")
345
+ return data_processed
346
+
347
+ def create_lag_features(
348
+ self,
349
+ data: pd.DataFrame,
350
+ target_col: str
351
+ ) -> pd.DataFrame:
352
+ """
353
+ Create lag features
354
+
355
+ Parameters:
356
+ -----------
357
+ data : pd.DataFrame
358
+ Input data
359
+ target_col : str
360
+ Target variable
361
+
362
+ Returns:
363
+ --------
364
+ pd.DataFrame
365
+ Data with lag features
366
+ """
367
+ data_processed = data.copy()
368
+
369
+ if target_col not in data_processed.columns:
370
+ logger.warning(f"Target variable '{target_col}' not found")
371
+ return data_processed
372
+
373
+ # Limited number of lags
374
+ max_lags = min(self.config.max_lags, 7) # Maximum 7 lags
375
+
376
+ for lag in [1, 2, 3, 7, 14, 30]:
377
+ if lag <= max_lags:
378
+ data_processed[f'{target_col}_lag_{lag}'] = data_processed[target_col].shift(lag)
379
+ self.created_features.append(f'{target_col}_lag_{lag}')
380
+
381
+ # Seasonal lags (only if sufficient data)
382
+ if len(data_processed) > 365:
383
+ try:
384
+ data_processed[f'{target_col}_seasonal_lag_365'] = data_processed[target_col].shift(365)
385
+ self.created_features.append(f'{target_col}_seasonal_lag_365')
386
+ except Exception as e:
387
+ logger.debug(f"Seasonal lag not created: {e}")
388
+
389
+ # Differences (stationarity)
390
+ data_processed[f'{target_col}_diff_1'] = data_processed[target_col].diff(1)
391
+ self.created_features.append(f'{target_col}_diff_1')
392
+
393
+ if len(data_processed) > 7:
394
+ data_processed[f'{target_col}_diff_7'] = data_processed[target_col].diff(7)
395
+ self.created_features.append(f'{target_col}_diff_7')
396
+
397
+ logger.info(f"✓ Lag features created: {len([c for c in data_processed.columns if ('lag' in c or 'diff' in c) and c not in data.columns])}")
398
+ return data_processed
399
+
400
+ def create_interaction_features(
401
+ self,
402
+ data: pd.DataFrame,
403
+ target_col: str
404
+ ) -> pd.DataFrame:
405
+ """
406
+ Create interaction features
407
+
408
+ Parameters:
409
+ -----------
410
+ data : pd.DataFrame
411
+ Input data
412
+ target_col : str
413
+ Target variable
414
+
415
+ Returns:
416
+ --------
417
+ pd.DataFrame
418
+ Data with interaction features
419
+ """
420
+ data_processed = data.copy()
421
+
422
+ if target_col not in data_processed.columns:
423
+ logger.warning(f"Target variable '{target_col}' not found")
424
+ return data_processed
425
+
426
+ # Interactions with temperature (only if data exists)
427
+ temp_cols = ['tavg', 'tmin', 'tmax']
428
+ available_temp_cols = [col for col in temp_cols if col in data_processed.columns]
429
+
430
+ for temp_col in available_temp_cols:
431
+ try:
432
+ # Avoid division by zero
433
+ temp_data = data_processed[temp_col].replace(0, np.nan)
434
+ if temp_data.notna().all() and (temp_data != 0).all():
435
+ data_processed[f'{target_col}_{temp_col}_ratio'] = data_processed[target_col] / temp_data
436
+ self.created_features.append(f'{target_col}_{temp_col}_ratio')
437
+
438
+ # Product
439
+ data_processed[f'{target_col}_{temp_col}_product'] = data_processed[target_col] * temp_data
440
+ self.created_features.append(f'{target_col}_{temp_col}_product')
441
+ except Exception as e:
442
+ logger.debug(f"Interaction feature with {temp_col} not created: {e}")
443
+
444
+ # Interaction with water level
445
+ if 'urovenvoda' in data_processed.columns:
446
+ try:
447
+ uroven_data = data_processed['urovenvoda'].replace(0, np.nan)
448
+ if uroven_data.notna().all() and (uroven_data != 0).all():
449
+ data_processed[f'{target_col}_urovenvoda_ratio'] = data_processed[target_col] / uroven_data
450
+ self.created_features.append(f'{target_col}_urovenvoda_ratio')
451
+ except Exception as e:
452
+ logger.debug(f"Interaction feature with urovenvoda not created: {e}")
453
+
454
+ logger.info(f"✓ Interaction features created: {len([c for c in data_processed.columns if ('ratio' in c or 'product' in c) and c not in data.columns])}")
455
+ return data_processed
456
+
457
+ def create_spectral_features(
458
+ self,
459
+ data: pd.DataFrame,
460
+ target_col: str
461
+ ) -> pd.DataFrame:
462
+ """
463
+ Create spectral features
464
+
465
+ Parameters:
466
+ -----------
467
+ data : pd.DataFrame
468
+ Input data
469
+ target_col : str
470
+ Target variable
471
+
472
+ Returns:
473
+ --------
474
+ pd.DataFrame
475
+ Data with spectral features
476
+ """
477
+ data_processed = data.copy()
478
+
479
+ if target_col not in data_processed.columns:
480
+ logger.warning(f"Target variable '{target_col}' not found")
481
+ return data_processed
482
+
483
+ if len(data_processed) < 100:
484
+ logger.info("Insufficient data for creating spectral features")
485
+ return data_processed
486
+
487
+ try:
488
+ # Fast Fourier Transform
489
+ series = data_processed[target_col].dropna().values
490
+
491
+ if len(series) > 50:
492
+ # Calculate periodogram
493
+ from scipy.signal import periodogram
494
+ freqs, psd = periodogram(series, fs=1.0)
495
+
496
+ # Find dominant frequencies
497
+ if len(psd) > 3:
498
+ # Top-3 frequencies by power
499
+ top_indices = np.argsort(psd)[-3:][::-1]
500
+
501
+ for i, idx in enumerate(top_indices, 1):
502
+ if idx < len(freqs):
503
+ freq = freqs[idx]
504
+ if freq > 0:
505
+ period = 1 / freq
506
+ data_processed[f'{target_col}_dominant_period_{i}'] = period
507
+ self.created_features.append(f'{target_col}_dominant_period_{i}')
508
+
509
+ except Exception as e:
510
+ logger.debug(f"Spectral features creation failed: {e}")
511
+
512
+ return data_processed
513
+
514
+ def create_decomposition_features(
515
+ self,
516
+ data: pd.DataFrame,
517
+ target_col: str
518
+ ) -> pd.DataFrame:
519
+ """
520
+ Create features based on decomposition
521
+
522
+ Parameters:
523
+ -----------
524
+ data : pd.DataFrame
525
+ Input data
526
+ target_col : str
527
+ Target variable
528
+
529
+ Returns:
530
+ --------
531
+ pd.DataFrame
532
+ Data with decomposition features
533
+ """
534
+ data_processed = data.copy()
535
+
536
+ if target_col not in data_processed.columns:
537
+ logger.warning(f"Target variable '{target_col}' not found")
538
+ return data_processed
539
+
540
+ if len(data_processed) < 365:
541
+ logger.info("Insufficient data for decomposition")
542
+ return data_processed
543
+
544
+ try:
545
+ # Check for date presence
546
+ if isinstance(data_processed.index, pd.DatetimeIndex):
547
+ # STL decomposition
548
+ if len(data_processed) > 730: # Need at least 2 years for yearly seasonality
549
+ try:
550
+ from statsmodels.tsa.seasonal import STL
551
+
552
+ # STL decomposition
553
+ stl = STL(
554
+ data_processed[target_col].fillna(method='ffill'),
555
+ period=365,
556
+ robust=True
557
+ )
558
+ result = stl.fit()
559
+
560
+ # Add components
561
+ data_processed[f'{target_col}_trend'] = result.trend
562
+ data_processed[f'{target_col}_seasonal'] = result.seasonal
563
+ data_processed[f'{target_col}_residual'] = result.resid
564
+
565
+ self.created_features.extend([
566
+ f'{target_col}_trend',
567
+ f'{target_col}_seasonal',
568
+ f'{target_col}_residual'
569
+ ])
570
+
571
+ logger.info("✓ STL decomposition successful")
572
+
573
+ except Exception as e:
574
+ logger.debug(f"STL decomposition failed: {e}")
575
+ # Simple seasonal decomposition
576
+ try:
577
+ from statsmodels.tsa.seasonal import seasonal_decompose
578
+
579
+ decomposition = seasonal_decompose(
580
+ data_processed[target_col].fillna(method='ffill'),
581
+ model='additive',
582
+ period=365,
583
+ extrapolate_trend='freq'
584
+ )
585
+
586
+ data_processed[f'{target_col}_trend'] = decomposition.trend
587
+ data_processed[f'{target_col}_seasonal'] = decomposition.seasonal
588
+
589
+ self.created_features.extend([
590
+ f'{target_col}_trend',
591
+ f'{target_col}_seasonal'
592
+ ])
593
+
594
+ logger.info("✓ Seasonal decomposition successful")
595
+ except Exception as e2:
596
+ logger.debug(f"Seasonal decomposition failed: {e2}")
597
+
598
+ except Exception as e:
599
+ logger.debug(f"Decomposition features creation failed: {e}")
600
+
601
+ return data_processed
602
+
603
+ def get_feature_categories(self) -> Dict[str, List[str]]:
604
+ """Get features by categories"""
605
+ categories = {
606
+ 'temporal': [],
607
+ 'statistical': [],
608
+ 'rolling': [],
609
+ 'lag': [],
610
+ 'interaction': [],
611
+ 'spectral': [],
612
+ 'decomposition': [],
613
+ 'binary': []
614
+ }
615
+
616
+ for feature in self.created_features:
617
+ if any(keyword in feature for keyword in ['year', 'month', 'day', 'week', 'quarter', 'sin', 'cos', 'is_weekend']):
618
+ categories['temporal'].append(feature)
619
+ elif any(keyword in feature for keyword in ['zscore', 'above_p', 'yearly_']):
620
+ if 'above_p' in feature:
621
+ categories['binary'].append(feature)
622
+ else:
623
+ categories['statistical'].append(feature)
624
+ elif 'rolling' in feature:
625
+ categories['rolling'].append(feature)
626
+ elif any(keyword in feature for keyword in ['lag', 'diff']):
627
+ categories['lag'].append(feature)
628
+ elif 'ratio' in feature or 'product' in feature:
629
+ categories['interaction'].append(feature)
630
+ elif 'dominant' in feature:
631
+ categories['spectral'].append(feature)
632
+ elif any(keyword in feature for keyword in ['trend', 'seasonal', 'residual']):
633
+ categories['decomposition'].append(feature)
634
+
635
+ # Remove empty categories
636
+ categories = {k: v for k, v in categories.items() if v}
637
+
638
+ return categories
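
A hedged sketch of driving `FeatureEngineer` end to end on synthetic daily data. The stub config is an assumption carrying only the attributes this class actually reads (`target_column`, `rolling_windows`, `max_lags`); the real `config.config.Config` may define more:

```python
from dataclasses import dataclass, field
from typing import List

import numpy as np
import pandas as pd

from features.feature_engineer import FeatureEngineer


@dataclass
class StubConfig:
    # Assumed attribute names, taken from what FeatureEngineer reads above
    target_column: str = "discharge"
    rolling_windows: List[int] = field(default_factory=lambda: [7, 30, 90])
    max_lags: int = 7


# Synthetic daily series with a yearly cycle plus noise
dates = pd.date_range("2018-01-01", periods=1200, freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"discharge": 50 + 10 * np.sin(2 * np.pi * dates.dayofyear / 365.25)
                  + rng.normal(0, 2, len(dates))},
    index=dates,
)

engineer = FeatureEngineer(StubConfig())  # duck-typed in place of Config
enriched = engineer.create_all_features(df)
print(enriched.shape, engineer.get_feature_categories())
```

Because the index is a DatetimeIndex and the series spans more than two years, every stage up to and including STL decomposition should fire; the constant dominant-period columns are then expected to be dropped again by the constant-feature filter.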
missing_values/__init__.py ADDED
File without changes
missing_values/missing_analyzer.py ADDED
@@ -0,0 +1,700 @@
+ # ============================================
+ # CLASS 3: MISSING VALUE ANALYSER
+ # ============================================
+ import logging
+ from typing import Dict, Tuple
+
+ from config.config import Config
+ from scipy.interpolate import interp1d
+ from statsmodels.tsa.seasonal import STL
+
+ import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+
+ logger = logging.getLogger(__name__)
+
+
+ class MissingValueAnalyser:
+     """Class for analysing and handling missing values"""
+
+     def __init__(self, config: Config):
+         """
+         Initialise missing value analyser
+
+         Parameters:
+         -----------
+         config : Config
+             Experiment configuration
+         """
+         self.config = config
+         self.missing_info = {}
+         self.handling_methods = {}
+         self.imputers = {}
+         self.missing_patterns = {}
+
+     def analyse(
+         self,
+         data: pd.DataFrame,
+         detailed: bool = True
+     ) -> Dict:
+         """
+         Analyse missing values in data
+
+         Parameters:
+         -----------
+         data : pd.DataFrame
+             Input data
+         detailed : bool
+             Whether to perform detailed analysis
+
+         Returns:
+         --------
+         Dict
+             Information about missing values
+         """
+         logger.info("\n" + "="*80)
+         logger.info("MISSING VALUE ANALYSIS")
+         logger.info("="*80)
+
+         # Calculate missing values
+         missing_total = data.isnull().sum()
+         missing_percent = (missing_total / len(data)) * 100
+
+         missing_df = pd.DataFrame({
+             'missing_count': missing_total,
+             'missing_percent': missing_percent,
+             'dtype': data.dtypes.astype(str)
+         })
+
+         # Detailed analysis
+         if detailed:
+             self._detailed_missing_analysis(data, missing_df)
+
+         # Save information
+         self.missing_info = {
+             'summary': {
+                 col: {
+                     'missing_count': int(missing_df.loc[col, 'missing_count']),
+                     'missing_percent': float(missing_df.loc[col, 'missing_percent']),
+                     'dtype': missing_df.loc[col, 'dtype']
+                 }
+                 for col in missing_df.index
+             },
+             'overall': {
+                 'total_missing': int(missing_total.sum()),
+                 'total_rows': int(len(data)),
+                 'total_cells': int(data.size),
+                 'overall_missing_percentage': float(missing_total.sum() / data.size * 100),
+                 'rows_with_any_missing': int(data.isnull().any(axis=1).sum()),
+                 'rows_all_missing': int(data.isnull().all(axis=1).sum()),
+                 'columns_with_missing': missing_df[missing_df['missing_count'] > 0].index.tolist(),
+                 'columns_all_missing': missing_df[missing_df['missing_count'] == len(data)].index.tolist()
+             }
+         }
+
+         # Visualisation
+         if self.config.save_plots:
+             self._plot_missing_values(data, missing_df)
+
+         # Output results
+         self._log_missing_summary(missing_df)
+
+         return self.missing_info
+
+     def _detailed_missing_analysis(
+         self,
+         data: pd.DataFrame,
+         missing_df: pd.DataFrame
+     ) -> None:
+         """Detailed missing value analysis"""
+         # Analyse missing patterns
+         missing_matrix = data.isnull()
+
+         # Row missing patterns
+         row_patterns = missing_matrix.apply(lambda x: ''.join(x.astype(int).astype(str)), axis=1)
+         row_pattern_counts = row_patterns.value_counts().head(10)
+
+         # Column missing patterns
+         col_patterns = missing_matrix.apply(lambda x: ''.join(x.astype(int).astype(str)), axis=0)
+         col_pattern_counts = col_patterns.value_counts().head(10)
+
+         # Time-based missing pattern analysis
+         time_patterns = {}
+         if isinstance(data.index, pd.DatetimeIndex):
+             # Missing values by time
+             time_missing = data.isnull().resample('M').sum()
+             time_patterns['monthly_missing'] = time_missing.sum(axis=1).to_dict()
+
+             # Missing values by day of week
+             data_with_dow = data.copy()
+             data_with_dow['dayofweek'] = data.index.dayofweek
+             dow_missing = data_with_dow.groupby('dayofweek').apply(lambda x: x.isnull().sum().sum())
+             time_patterns['dayofweek_missing'] = dow_missing.to_dict()
+
+         self.missing_patterns = {
+             'row_patterns': row_pattern_counts.to_dict(),
+             'col_patterns': col_pattern_counts.to_dict(),
+             'time_patterns': time_patterns,
+             'missing_correlation': missing_matrix.corr().to_dict()  # Missing value correlation
+         }
+
+         logger.debug(f"Found {len(row_pattern_counts)} unique row missing patterns")
+         logger.debug(f"Found {len(col_pattern_counts)} unique column missing patterns")
+
+     def _plot_missing_values(
+         self,
+         data: pd.DataFrame,
+         missing_df: pd.DataFrame
+     ) -> None:
+         """Visualise missing values"""
+         fig, axes = plt.subplots(3, 2, figsize=(16, 12))
+
+         # 1. Missing percentage histogram
+         axes[0, 0].barh(
+             missing_df.index,
+             missing_df['missing_percent']
+         )
+         axes[0, 0].axvline(self.config.missing_threshold, color='red', linestyle='--')
+         axes[0, 0].set_title('Missing Percentage by Column')
+         axes[0, 0].set_xlabel('Missing Percentage (%)')
+         axes[0, 0].set_ylabel('Columns')
+         axes[0, 0].grid(True, alpha=0.3)
+
+         # 2. Missing values heatmap (cap at the first 1000 observations so the plot stays readable)
+         missing_matrix = data.isnull()
+         matrix_to_show = missing_matrix.iloc[:1000] if len(data) > 1000 else missing_matrix
+         axes[0, 1].imshow(
+             matrix_to_show.T,
+             aspect='auto',
+             cmap='binary',
+             interpolation='none'
+         )
+         axes[0, 1].set_title('Missing Values Matrix')
+         axes[0, 1].set_xlabel('Observation Index')
+         axes[0, 1].set_ylabel('Variables')
+         axes[0, 1].set_yticks(range(len(data.columns)))
+         axes[0, 1].set_yticklabels(data.columns, fontsize=8)
+
+         # 3. Missing values over time (if time series)
+         if isinstance(data.index, pd.DatetimeIndex):
+             time_missing = data.isnull().resample('M').sum()
+
+             axes[1, 0].plot(time_missing.sum(axis=1))
+             axes[1, 0].set_title('Missing Values by Month')
+             axes[1, 0].set_xlabel('Date')
+             axes[1, 0].set_ylabel('Number of Missing Values')
+             axes[1, 0].grid(True, alpha=0.3)
+
+             # 4. Missing values by day of week
+             data_with_dow = data.copy()
+             data_with_dow['dayofweek'] = data.index.dayofweek
+             dow_missing = data_with_dow.groupby('dayofweek').apply(lambda x: x.isnull().sum().sum())
+             dow_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
+
+             axes[1, 1].bar(range(7), dow_missing)
+             axes[1, 1].set_title('Missing Values by Day of Week')
+             axes[1, 1].set_xlabel('Day of Week')
+             axes[1, 1].set_ylabel('Number of Missing Values')
+             axes[1, 1].set_xticks(range(7))
+             axes[1, 1].set_xticklabels(dow_names)
+             axes[1, 1].grid(True, alpha=0.3)
+
+         # 5. Missing value correlation
+         missing_corr = data.isnull().corr()
+         im = axes[2, 0].imshow(
+             missing_corr,
+             cmap='coolwarm',
+             vmin=-1,
+             vmax=1,
+             aspect='auto'
+         )
+         axes[2, 0].set_title('Missing Value Correlation Between Variables')
+         axes[2, 0].set_xlabel('Variables')
+         axes[2, 0].set_ylabel('Variables')
+         plt.colorbar(im, ax=axes[2, 0])
+
+         # 6. Cumulative missing sum
+         cumulative_missing = data.isnull().cumsum()
+         for col in data.columns[:5]:  # First 5 columns
+             if data[col].isnull().any():
+                 axes[2, 1].plot(
+                     cumulative_missing.index,
+                     cumulative_missing[col],
+                     label=col[:20]
+                 )
+         axes[2, 1].set_title('Cumulative Missing Values')
+         axes[2, 1].set_xlabel('Time/Index')
+         axes[2, 1].set_ylabel('Cumulative Missing')
+         axes[2, 1].legend(fontsize=8)
+         axes[2, 1].grid(True, alpha=0.3)
+
+         plt.tight_layout()
+         plt.savefig(
+             f'{self.config.results_dir}/plots/missing_values_analysis.png',
+             dpi=300,
+             bbox_inches='tight'
+         )
+         plt.show()
+
+     def _log_missing_summary(self, missing_df: pd.DataFrame) -> None:
+         """Log missing value summary"""
+         missing_columns = missing_df[missing_df['missing_count'] > 0]
+
+         if len(missing_columns) > 0:
+             logger.info("MISSING VALUES FOUND:")
+             logger.info("-" * 50)
+             logger.info(f"Total missing values: {self.missing_info['overall']['total_missing']}")
+             logger.info(f"Overall missing percentage: {self.missing_info['overall']['overall_missing_percentage']:.2f}%")
+             logger.info(f"Rows with missing values: {self.missing_info['overall']['rows_with_any_missing']}")
+             logger.info(f"Columns with missing values: {len(self.missing_info['overall']['columns_with_missing'])}")
+
+             logger.info("\nTop-10 columns by missing values:")
+             top_missing = missing_df.nlargest(10, 'missing_percent')
+             for idx, (col, row) in enumerate(top_missing.iterrows(), 1):
+                 logger.info(f"  {idx:2d}. {col}: {int(row['missing_count'])} missing ({row['missing_percent']:.2f}%)")
+         else:
+             logger.info("✓ No missing values found")
+
+     def handle(
+         self,
+         data: pd.DataFrame,
+         method: str = 'interpolate',
+         strategy: str = 'columnwise',
+         **kwargs
+     ) -> pd.DataFrame:
+         """
+         Handle missing values
+
+         Parameters:
+         -----------
+         data : pd.DataFrame
+             Input data
+         method : str
+             Handling method: 'interpolate', 'time_weighted', 'seasonal', 'ffill',
+             'bfill', 'mean', 'median', 'mode', 'knn', 'regression', 'spline', 'stl'
+         strategy : str
+             Strategy: 'columnwise', 'rowwise', 'global'
+         **kwargs : dict
+             Additional parameters for the method
+
+         Returns:
+         --------
+         pd.DataFrame
+             Data with handled missing values
+         """
+         logger.info("\n" + "="*80)
+         logger.info("HANDLING MISSING VALUES")
+         logger.info("="*80)
+
+         data_processed = data.copy()
+         methods_applied = {}
+
+         # Determine columns to process
+         if strategy == 'columnwise':
+             columns_to_process = data_processed.columns
+         elif strategy == 'rowwise':
+             # Row-wise handling (for time series)
+             data_processed = self._handle_rowwise(data_processed, method, **kwargs)
+             return data_processed
+         else:
+             columns_to_process = data_processed.select_dtypes(include=[np.number]).columns
+
+         # Process each column
+         for col in columns_to_process:
+             missing_before = data_processed[col].isnull().sum()
+             if missing_before == 0:
+                 continue  # nothing to impute for this column
+
+             # Check if the missing percentage exceeds the threshold
+             missing_percent = (missing_before / len(data_processed)) * 100
+
+             if missing_percent > self.config.missing_threshold:
+                 logger.warning(f"  {col}: {missing_before} missing ({missing_percent:.1f}%) > threshold {self.config.missing_threshold}%")
+
+                 if kwargs.get('drop_high_missing', False):
+                     data_processed = data_processed.drop(columns=[col])
+                     method_used = f"dropped (>{self.config.missing_threshold}% missing)"
+                     missing_after = 0
+                 else:
+                     # Use selected method
+                     data_processed[col], method_used = self._apply_imputation_method(
+                         data_processed[col], method, **kwargs
+                     )
+                     missing_after = data_processed[col].isnull().sum()
+             else:
+                 # Use selected method
+                 data_processed[col], method_used = self._apply_imputation_method(
+                     data_processed[col], method, **kwargs
+                 )
+                 missing_after = data_processed[col].isnull().sum()
+
+             methods_applied[col] = {
+                 'method': method_used,
+                 'missing_before': int(missing_before),
+                 'missing_after': int(missing_after),
+                 'missing_percent_before': float(missing_percent)
+             }
+
+             logger.info(f"  {col}: {missing_before} → {missing_after} missing ({method_used})")
+
+         self.handling_methods = methods_applied
+
+         # Check that all missing values are handled
+         remaining_missing = data_processed.isnull().sum().sum()
+         if remaining_missing == 0:
+             logger.info("✓ All missing values successfully handled")
+         else:
+             logger.warning(f"⚠ {remaining_missing} missing values remain")
+             # Additional handling of remaining missing values
+             data_processed = data_processed.ffill().bfill()
+             remaining_after = data_processed.isnull().sum().sum()
+             if remaining_after == 0:
+                 logger.info("✓ Remaining missing values handled with ffill/bfill combination")
+
+         return data_processed
+
+     def _apply_imputation_method(
+         self,
+         series: pd.Series,
+         method: str,
+         **kwargs
+     ) -> Tuple[pd.Series, str]:
+         """
+         Apply an imputation method to an individual series
+
+         Parameters:
+         -----------
+         series : pd.Series
+             Input series
+         method : str
+             Imputation method
+         **kwargs : dict
+             Additional parameters
+
+         Returns:
+         --------
+         Tuple[pd.Series, str]
+             Processed series and method description
+         """
+         if method == 'interpolate':
+             # Interpolation for time series
+             if isinstance(series.index, pd.DatetimeIndex):
+                 method_name = f"{kwargs.get('interpolation_method', 'linear')} interpolation"
+                 series_filled = series.interpolate(
+                     method=kwargs.get('interpolation_method', 'linear'),
+                     limit_direction=kwargs.get('limit_direction', 'both'),
+                     limit=kwargs.get('limit', None)
+                 )
+             else:
+                 method_name = 'linear interpolation'
+                 series_filled = series.interpolate(method='linear')
+
+         elif method == 'time_weighted':
+             # Time-weighted interpolation
+             method_name = 'time-weighted interpolation'
+             series_filled = self._time_weighted_interpolation(series)
+
+         elif method == 'seasonal':
+             # Seasonal interpolation
+             method_name = 'seasonal interpolation'
+             series_filled = self._seasonal_interpolation(series, **kwargs)
+
+         elif method == 'ffill':
+             # Forward fill
+             method_name = 'forward fill'
+             series_filled = series.ffill(limit=kwargs.get('limit', None))
+
+         elif method == 'bfill':
+             # Backward fill
+             method_name = 'backward fill'
+             series_filled = series.bfill(limit=kwargs.get('limit', None))
+
+         elif method == 'mean':
+             # Mean imputation
+             method_name = 'mean imputation'
+             series_filled = series.fillna(series.mean())
+
+         elif method == 'median':
+             # Median imputation
+             method_name = 'median imputation'
+             series_filled = series.fillna(series.median())
+
+         elif method == 'mode':
+             # Mode imputation
+             method_name = 'mode imputation'
+             mode_value = series.mode()
+             if not mode_value.empty:
+                 series_filled = series.fillna(mode_value.iloc[0])
+             else:
+                 series_filled = series.fillna(series.median())
+
+         elif method == 'knn':
+             # KNN imputation
+             method_name = f"KNN imputation (k={kwargs.get('k', 5)})"
+             # Simplified version using a distance-weighted neighbour mean
+             series_filled = self._knn_imputation(series, k=kwargs.get('k', 5))
+
+         elif method == 'regression':
+             # Regression imputation
+             method_name = 'regression imputation'
+             series_filled = self._regression_imputation(series, **kwargs)
+
+         elif method == 'spline':
+             # Spline interpolation
+             method_name = 'spline interpolation'
+             series_filled = series.interpolate(method='spline', order=kwargs.get('order', 3))
+
+         elif method == 'stl':
+             # STL decomposition + interpolation
+             method_name = 'STL-based imputation'
+             series_filled = self._stl_imputation(series, **kwargs)
+
+         else:
+             raise ValueError(f"Unknown method: {method}")
+
+         # If missing values remain, fill with ffill/bfill
+         if series_filled.isnull().any():
+             series_filled = series_filled.ffill().bfill()
+             method_name += " + ffill/bfill"
+
+         return series_filled, method_name
+
+     def _time_weighted_interpolation(self, series: pd.Series) -> pd.Series:
+         """Time-weighted interpolation"""
+         if not isinstance(series.index, pd.DatetimeIndex):
+             return series.interpolate()
+
+         # Create timestamps
+         time_numeric = pd.Series(range(len(series)), index=series.index)
+
+         # Interpolate timestamps for missing values
+         time_interpolated = time_numeric.interpolate()
+
+         # Interpolate values based on timestamps
+         valid_mask = series.notna()
+         if valid_mask.sum() < 2:
+             return series.ffill().bfill()
+
+         # Use linear interpolation
+         valid_times = time_numeric[valid_mask]
+         valid_values = series[valid_mask]
+
+         # Interpolation
+         interp_func = interp1d(
+             valid_times,
+             valid_values,
+             kind='linear',
+             bounds_error=False,
+             fill_value='extrapolate'
+         )
+
+         series_filled = series.copy()
+         missing_mask = series.isna()
+         series_filled[missing_mask] = interp_func(time_interpolated[missing_mask])
+
+         return series_filled
+
+     def _seasonal_interpolation(
+         self,
+         series: pd.Series,
+         **kwargs
+     ) -> pd.Series:
+         """Seasonal interpolation"""
+         if not isinstance(series.index, pd.DatetimeIndex):
+             return series.interpolate()
+
+         period = kwargs.get('period', self.config.seasonal_period)
+
+         # Create a copy of the series
+         series_filled = series.copy()
+
+         # Interpolation considering seasonality
+         for i in range(len(series)):
+             if pd.isna(series.iloc[i]):
+                 # Find values at the same seasonal position
+                 seasonal_indices = []
+                 for offset in range(1, 10):  # Look in previous/next cycles
+                     idx_back = i - offset * period
+                     idx_forward = i + offset * period
+
+                     if idx_back >= 0 and not pd.isna(series.iloc[idx_back]):
+                         seasonal_indices.append(idx_back)
+
+                     if idx_forward < len(series) and not pd.isna(series.iloc[idx_forward]):
+                         seasonal_indices.append(idx_forward)
+
+                 if seasonal_indices:
+                     # Take the mean value from the seasonal positions
+                     seasonal_values = series.iloc[seasonal_indices]
+                     series_filled.iloc[i] = seasonal_values.mean()
+
+         # Fill remaining missing values with regular interpolation
+         series_filled = series_filled.interpolate()
+
+         return series_filled
+
+     def _knn_imputation(
+         self,
+         series: pd.Series,
+         k: int = 5
+     ) -> pd.Series:
+         """KNN imputation for time series"""
+         # Simplified KNN for time series: neighbours are the nearest positions in time
+         series_filled = series.copy()
+
+         for i in range(len(series)):
+             if pd.isna(series.iloc[i]):
+                 # Find the nearest k non-missing values
+                 distances = []
+                 values = []
+
+                 for j in range(max(0, i - k * 10), min(len(series), i + k * 10)):
+                     if j != i and not pd.isna(series.iloc[j]):
+                         distance = abs(i - j)
+                         distances.append(distance)
+                         values.append(series.iloc[j])
+
+                     if len(values) >= k:
+                         break
+
+                 if values:
+                     # Distance-weighted average
+                     weights = [1 / (d + 1) for d in distances]
+                     weighted_avg = np.average(values, weights=weights)
+                     series_filled.iloc[i] = weighted_avg
+
+         return series_filled
+
+     def _regression_imputation(
+         self,
+         series: pd.Series,
+         **kwargs
+     ) -> pd.Series:
+         """Regression imputation based on neighbouring values"""
+         # Simplified regression for time series
+         series_filled = series.copy()
+
+         if series.notna().sum() < 3:
+             return series.ffill().bfill()
+
+         # Use polynomial regression
+         x = np.arange(len(series))
+         y = series.values
+
+         # Valid values mask
+         valid_mask = ~np.isnan(y)
+
+         if valid_mask.sum() < 2:
+             return series.ffill().bfill()
+
+         # Polynomial regression, degree 2
+         coeffs = np.polyfit(x[valid_mask], y[valid_mask], 2)
+         poly_func = np.poly1d(coeffs)
+
+         # Fill missing values
+         missing_mask = np.isnan(y)
+         series_filled.iloc[missing_mask] = poly_func(x[missing_mask])
+
+         return series_filled
+
+     def _stl_imputation(
+         self,
+         series: pd.Series,
+         **kwargs
+     ) -> pd.Series:
+         """STL decomposition-based imputation"""
+         try:
+             if not isinstance(series.index, pd.DatetimeIndex):
+                 return series.interpolate()
+
+             # STL decomposition
+             stl = STL(
+                 series.ffill().bfill(),  # Fill missing for STL
+                 period=kwargs.get('period', self.config.seasonal_period),
+                 robust=True
+             )
+             result = stl.fit()
+
+             # Reconstruct the series without noise
+             reconstructed = result.trend + result.seasonal
+
+             # Replace missing values with reconstructed values
+             series_filled = series.copy()
+             missing_mask = series.isna()
+             series_filled[missing_mask] = reconstructed[missing_mask]
+
+             return series_filled
+
+         except Exception as e:
+             logger.warning(f"STL imputation failed: {e}, using interpolation")
+             return series.interpolate()
+
+     def _handle_rowwise(
+         self,
+         data: pd.DataFrame,
+         method: str,
+         **kwargs
+     ) -> pd.DataFrame:
+         """Row-wise missing value handling"""
+         data_processed = data.copy()
+
+         # Remove rows with high missing counts
+         if kwargs.get('drop_rows_threshold', 0) > 0:
+             threshold = kwargs['drop_rows_threshold']
+             rows_before = len(data_processed)
+             missing_per_row = data_processed.isnull().sum(axis=1) / data_processed.shape[1] * 100
+             rows_to_drop = missing_per_row[missing_per_row > threshold].index
+             data_processed = data_processed.drop(rows_to_drop)
+             rows_after = len(data_processed)
+             logger.info(f"Rows removed: {rows_before - rows_after} (missing > {threshold}%)")
+
+         # Row-wise imputation (transpose so fillna aligns per-row statistics on what are now columns)
+         if method == 'row_mean':
+             data_processed = data_processed.T.fillna(data_processed.mean(axis=1)).T
+         elif method == 'row_median':
+             data_processed = data_processed.T.fillna(data_processed.median(axis=1)).T
+         elif method == 'row_ffill':
+             data_processed = data_processed.ffill(axis=1).bfill(axis=1)
+
+         return data_processed
+
+     def create_validation_rules(self) -> Dict:
+         """Create validation rules based on missing value analysis"""
+         rules = {}
+
+         for col, info in self.missing_info['summary'].items():
+             missing_percent = info['missing_percent']
+
+             if missing_percent > 50:
+                 rules[col] = {
+                     'action': 'drop_column',
+                     'reason': f'Missing > 50%: {missing_percent:.1f}%'
+                 }
+             elif missing_percent > 20:
+                 rules[col] = {
+                     'action': 'advanced_imputation',
+                     'reason': f'High missing: {missing_percent:.1f}%',
+                     'recommended_method': 'knn'
+                 }
+             elif missing_percent > 5:
+                 rules[col] = {
+                     'action': 'standard_imputation',
+                     'reason': f'Moderate missing: {missing_percent:.1f}%',
+                     'recommended_method': 'interpolate'
+                 }
+             elif missing_percent > 0:
+                 rules[col] = {
+                     'action': 'simple_imputation',
+                     'reason': f'Low missing: {missing_percent:.1f}%',
+                     'recommended_method': 'ffill'
+                 }
+
+         return rules
+
+     def get_report(self) -> Dict:
+         """Get missing values report"""
+         return {
+             'missing_info': self.missing_info,
+             'handling_methods': self.handling_methods,
+             'missing_patterns': self.missing_patterns,
+             'validation_rules': self.create_validation_rules()
+         }
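
Similarly, a minimal analyse-then-impute sketch. The stub config is an assumption covering only the attributes this class touches (`missing_threshold`, `seasonal_period`, `save_plots`, `results_dir`):

```python
from dataclasses import dataclass

import numpy as np
import pandas as pd

from missing_values.missing_analyzer import MissingValueAnalyser


@dataclass
class StubConfig:
    missing_threshold: float = 30.0   # percent; columns above this get flagged
    seasonal_period: int = 365
    save_plots: bool = False          # skip the matplotlib figure in this sketch
    results_dir: str = "streamlit_results"


# Two years of daily temperatures with 50 holes poked at random positions
dates = pd.date_range("2020-01-01", periods=730, freq="D")
rng = np.random.default_rng(1)
series = pd.Series(20 + rng.normal(0, 3, len(dates)), index=dates, name="tavg")
series.iloc[rng.choice(len(series), 50, replace=False)] = np.nan

analyser = MissingValueAnalyser(StubConfig())
info = analyser.analyse(series.to_frame(), detailed=False)
filled = analyser.handle(series.to_frame(), method="interpolate")
assert filled.isnull().sum().sum() == 0  # interpolate + ffill/bfill closes all gaps
```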
outliers/__init__.py ADDED
File without changes
outliers/outlier_analyzer.py ADDED
@@ -0,0 +1,857 @@
1
+ # ============================================
2
+ # CLASS 4: OUTLIER ANALYSER
3
+ # ============================================
4
+ from typing import Dict, List, Tuple
5
+ from venv import logger
6
+
7
+ from config.config import Config
8
+ import pandas as pd
9
+ import numpy as np
10
+ import matplotlib.pyplot as plt
11
+ from sklearn.neighbors import LocalOutlierFactor
12
+ from sklearn.covariance import EllipticEnvelope
13
+ from scipy import stats
14
+
15
+ class OutlierAnalyser:
16
+ """Class for analysing and handling outliers"""
17
+
18
+ def __init__(self, config: Config):
19
+ """
20
+ Initialise outlier analyser
21
+
22
+ Parameters:
23
+ -----------
24
+ config : Config
25
+ Experiment configuration
26
+ """
27
+ self.config = config
28
+ self.outlier_info = {}
29
+ self.handling_methods = {}
30
+ self.detection_methods = {}
31
+ self.outlier_models = {}
32
+
33
+ def analyse(
34
+ self,
35
+ data: pd.DataFrame,
36
+ method: str = None,
37
+ columns: List[str] = None,
38
+ **kwargs
39
+ ) -> Dict:
40
+ """
41
+ Analyse outliers in data
42
+
43
+ Parameters:
44
+ -----------
45
+ data : pd.DataFrame
46
+ Input data
47
+ method : str, optional
48
+ Detection method. If None, uses configuration value.
49
+ columns : List[str], optional
50
+ List of columns to analyse. If None, uses all numeric columns.
51
+ **kwargs : dict
52
+ Additional parameters for method
53
+
54
+ Returns:
55
+ --------
56
+ Dict
57
+ Information about outliers
58
+ """
59
+ logger.info("\n" + "="*80)
60
+ logger.info("OUTLIER ANALYSIS")
61
+ logger.info("="*80)
62
+
63
+ method = method or self.config.outlier_method
64
+ if columns is None:
65
+ columns = data.select_dtypes(include=[np.number]).columns
66
+
67
+ outliers_info = {}
68
+
69
+ # Apply various detection methods
70
+ detection_results = {}
71
+
72
+ # 1. Statistical methods
73
+ if method in ['iqr', 'zscore', 'sigma', 'all']:
74
+ detection_results.update(self._statistical_methods(data, columns, method, **kwargs))
75
+
76
+ # 2. ML methods
77
+ if method in ['lof', 'isolation_forest', 'elliptic_envelope', 'all']:
78
+ detection_results.update(self._ml_methods(data, columns, method, **kwargs))
79
+
80
+ # 3. Temporal methods
81
+ if isinstance(data.index, pd.DatetimeIndex):
82
+ detection_results.update(self._temporal_methods(data, columns, **kwargs))
83
+
84
+ # Aggregate results
85
+ for col in columns:
86
+ if col in detection_results:
87
+ # Combine results from different methods
88
+ combined_mask = self._combine_detection_methods(detection_results, col)
89
+
90
+ outliers_count = combined_mask.sum()
91
+ outliers_percent = (outliers_count / len(data)) * 100
92
+
93
+ # Detailed information
94
+ col_data = data[col].dropna()
95
+ stats = {
96
+ 'mean': float(col_data.mean()),
97
+ 'std': float(col_data.std()),
98
+ 'median': float(col_data.median()),
99
+ 'q1': float(col_data.quantile(0.25)),
100
+ 'q3': float(col_data.quantile(0.75)),
101
+ 'min': float(col_data.min()),
102
+ 'max': float(col_data.max()),
103
+ 'skewness': float(col_data.skew()),
104
+ 'kurtosis': float(col_data.kurtosis())
105
+ }
106
+
107
+ outliers_info[col] = {
108
+ 'method': method,
109
+ 'statistics': stats,
110
+ 'outliers_count': int(outliers_count),
111
+ 'outliers_percent': float(outliers_percent),
112
+ 'outlier_indices': data[combined_mask].index.tolist() if outliers_count > 0 else [],
113
+ 'outlier_values': data.loc[combined_mask, col].tolist() if outliers_count > 0 else [],
114
+ 'detection_methods': {
115
+ name: {
116
+ 'count': int(mask.sum()),
117
+ 'percent': float(mask.sum() / len(data) * 100)
118
+ }
119
+ for name, mask in detection_results[col].items()
120
+ }
121
+ }
122
+
123
+ logger.info(f"{col}: {outliers_count} outliers ({outliers_percent:.2f}%)")
124
+
125
+ self.outlier_info = outliers_info
126
+ self.detection_methods = detection_results
127
+
128
+ # Visualisation
129
+ if self.config.save_plots and len(columns) > 0:
130
+ self._plot_outlier_analysis(data, columns, outliers_info)
131
+
132
+ return outliers_info
133
+
134
+ def _statistical_methods(
135
+ self,
136
+ data: pd.DataFrame,
137
+ columns: List[str],
138
+ method: str,
139
+ **kwargs
140
+ ) -> Dict:
141
+ """Statistical outlier detection methods"""
142
+ results = {}
143
+
144
+ for col in columns:
145
+ col_results = {}
146
+ series = data[col].dropna()
147
+
148
+ if len(series) < 3:
149
+ continue
150
+
151
+ # IQR method
152
+ if method in ['iqr', 'all']:
153
+ q1 = series.quantile(0.25)
154
+ q3 = series.quantile(0.75)
155
+ iqr = q3 - q1
156
+ lower_bound = q1 - self.config.outlier_alpha * iqr
157
+ upper_bound = q3 + self.config.outlier_alpha * iqr
158
+
159
+ iqr_mask = (data[col] < lower_bound) | (data[col] > upper_bound)
160
+ col_results['iqr'] = iqr_mask
161
+
162
+ # Z-score method
163
+ if method in ['zscore', 'sigma', 'all']:
164
+ z_threshold = kwargs.get('z_threshold', 3)
165
+ z_scores = np.abs((data[col] - series.mean()) / series.std())
166
+ z_mask = z_scores > z_threshold
167
+ col_results['zscore'] = z_mask
168
+
169
+ # Modified Z-score method
170
+ if method in ['zscore', 'all']:
171
+ median = series.median()
172
+ mad = np.median(np.abs(series - median))
173
+ if mad != 0:
174
+ modified_z_scores = 0.6745 * (data[col] - median) / mad
175
+ mz_mask = np.abs(modified_z_scores) > 3.5
176
+ col_results['modified_zscore'] = mz_mask
177
+
178
+ # Tukey's fences
179
+ if method in ['iqr', 'all']:
180
+ inner_lower = q1 - 1.5 * iqr
181
+ inner_upper = q3 + 1.5 * iqr
182
+ outer_lower = q1 - 3 * iqr
183
+ outer_upper = q3 + 3 * iqr
184
+
185
+ mild_mask = ((data[col] < inner_lower) | (data[col] > inner_upper)) & \
186
+ ((data[col] >= outer_lower) & (data[col] <= outer_upper))
187
+ extreme_mask = (data[col] < outer_lower) | (data[col] > outer_upper)
188
+
189
+ col_results['tukey_mild'] = mild_mask
190
+ col_results['tukey_extreme'] = extreme_mask
191
+
192
+ results[col] = col_results
193
+
194
+ return results
195
+
196
+ def _ml_methods(
197
+ self,
198
+ data: pd.DataFrame,
199
+ columns: List[str],
200
+ method: str,
201
+ **kwargs
202
+ ) -> Dict:
203
+ """ML outlier detection methods"""
204
+ results = {}
205
+
206
+ numeric_data = data[columns].dropna()
207
+
208
+ if len(numeric_data) < 10:
209
+ return results
210
+
211
+ try:
212
+ # Local Outlier Factor
213
+ if method in ['lof', 'all']:
214
+ lof = LocalOutlierFactor(
215
+ contamination=self.config.outlier_contamination,
216
+ n_neighbors=kwargs.get('n_neighbors', 20)
217
+ )
218
+ lof_labels = lof.fit_predict(numeric_data)
219
+ lof_mask = pd.Series(lof_labels == -1, index=numeric_data.index)
220
+
221
+ for col in columns:
222
+ if col in numeric_data.columns:
223
+ if col not in results:
224
+ results[col] = {}
225
+ results[col]['lof'] = lof_mask
226
+
227
+ # Elliptic Envelope
228
+ if method in ['elliptic_envelope', 'all']:
229
+ try:
230
+ envelope = EllipticEnvelope(
231
+ contamination=self.config.outlier_contamination,
232
+ random_state=42
233
+ )
234
+ envelope_labels = envelope.fit_predict(numeric_data)
235
+ envelope_mask = pd.Series(envelope_labels == -1, index=numeric_data.index)
236
+
237
+ for col in columns:
238
+ if col in numeric_data.columns:
239
+ if col not in results:
240
+ results[col] = {}
241
+ results[col]['elliptic_envelope'] = envelope_mask
242
+ except Exception as e:
243
+ logger.warning(f"Elliptic Envelope failed: {e}")
244
+
245
+ except Exception as e:
246
+ logger.warning(f"ML outlier detection methods failed: {e}")
247
+
248
+ return results
249
+
250
+ def _temporal_methods(
251
+ self,
252
+ data: pd.DataFrame,
253
+ columns: List[str],
254
+ **kwargs
255
+ ) -> Dict:
256
+ """Outlier detection methods for time series"""
257
+ results = {}
258
+
259
+ for col in columns:
260
+ col_results = {}
261
+ series = data[col].dropna()
262
+
263
+ if len(series) < 30:
264
+ continue
265
+
266
+ # Rolling statistics method
267
+ window = kwargs.get('temporal_window', 30)
268
+ rolling_mean = series.rolling(window=window, center=True).mean()
269
+ rolling_std = series.rolling(window=window, center=True).std()
270
+
271
+ # Outliers relative to moving average
272
+ threshold = kwargs.get('temporal_threshold', 3)
273
+ temporal_mask = np.abs(series - rolling_mean) > (threshold * rolling_std)
274
+ col_results['temporal'] = temporal_mask
275
+
276
+ # Seasonal detrending + outlier detection
277
+ try:
278
+ # Simple seasonal detrending
279
+ if len(series) > 365:
280
+ seasonal_period = kwargs.get('seasonal_period', 365)
281
+ seasonal_mean = series.rolling(window=seasonal_period, center=True).mean()
282
+ detrended = series - seasonal_mean
283
+
284
+ # Outliers in detrended series
285
+ q1 = detrended.quantile(0.25)
286
+ q3 = detrended.quantile(0.75)
287
+ iqr = q3 - q1
288
+ seasonal_lower = q1 - 3 * iqr
289
+ seasonal_upper = q3 + 3 * iqr
290
+
291
+ seasonal_mask = (detrended < seasonal_lower) | (detrended > seasonal_upper)
292
+ col_results['seasonal'] = seasonal_mask
293
+ except Exception as e:
294
+ logger.debug(f"Seasonal outlier detection failed for {col}: {e}")
295
+
296
+ results[col] = col_results
297
+
298
+ return results
299
+
300
+ def _combine_detection_methods(
301
+ self,
302
+ detection_results: Dict,
303
+ column: str
304
+ ) -> pd.Series:
305
+ """Combine results from different detection methods"""
306
+ if column not in detection_results:
307
+ return pd.Series(False, index=pd.RangeIndex(0))
308
+
309
+ methods = detection_results[column]
310
+ combined_mask = None
311
+
312
+ for method_name, mask in methods.items():
313
+ if combined_mask is None:
314
+ combined_mask = mask.copy()
315
+ else:
316
+ # Combine via OR (outlier by any method)
317
+ combined_mask = combined_mask | mask
318
+
319
+ if combined_mask is None:
+ return pd.Series(False, index=pd.RangeIndex(0))
+ return combined_mask.fillna(False)
320
+
321
+ def _plot_outlier_analysis(
322
+ self,
323
+ data: pd.DataFrame,
324
+ columns: List[str],
325
+ outliers_info: Dict
326
+ ) -> None:
327
+ """Visualise outlier analysis"""
328
+ n_cols = min(len(columns), 4)
329
+ n_rows = (len(columns) + n_cols - 1) // n_cols
330
+
331
+ fig = plt.figure(figsize=(16, 4 * n_rows))
332
+ gs = fig.add_gridspec(n_rows, n_cols)
333
+
334
+ for idx, col in enumerate(columns):
335
+ if col not in outliers_info:
336
+ continue
337
+
338
+ row = idx // n_cols
339
+ col_idx = idx % n_cols
340
+
341
+ ax = fig.add_subplot(gs[row, col_idx])
342
+
343
+ # Data
344
+ series = data[col].dropna()
345
+
346
+ # 1. Box plot
347
+ bp = ax.boxplot(
348
+ series.values,
349
+ vert=True,
350
+ patch_artist=True,
351
+ widths=0.6,
352
+ showfliers=False
353
+ )
354
+
355
+ # Colours for box plot
356
+ bp['boxes'][0].set_facecolor('lightblue')
357
+ bp['medians'][0].set_color('red')
358
+ bp['whiskers'][0].set_color('black')
359
+ bp['whiskers'][1].set_color('black')
360
+ bp['caps'][0].set_color('black')
361
+ bp['caps'][1].set_color('black')
362
+
363
+ # 2. Outliers
364
+ if outliers_info[col]['outliers_count'] > 0:
365
+ outlier_indices = outliers_info[col]['outlier_indices']
366
+ outlier_values = outliers_info[col]['outlier_values']
367
+
368
+ # Convert indices to positions for box plot
369
+ jitter = np.random.normal(0, 0.05, len(outlier_values))
370
+
371
+ ax.scatter(
372
+ np.ones(len(outlier_values)) + jitter,
373
+ outlier_values,
374
+ color='red',
375
+ alpha=0.6,
376
+ s=30,
377
+ edgecolors='black',
378
+ label=f'Outliers ({outliers_info[col]["outliers_count"]})'
379
+ )
380
+
381
+ # 3. Histogram on same plot
382
+ ax2 = ax.twinx()
383
+ ax2.hist(
384
+ series.values,
385
+ bins=30,
386
+ alpha=0.3,
387
+ color='green',
388
+ density=True
389
+ )
390
+
391
+ # 4. Normal distribution for comparison
392
+ if len(series) > 10:
394
+ x = np.linspace(series.min(), series.max(), 100)
395
+ mean = series.mean()
396
+ std = series.std()
397
+
398
+ if std > 0:
399
+ p = stats.norm.pdf(x, mean, std)
400
+ ax2.plot(x, p, 'k--', linewidth=1, label='Normal distribution')
401
+
402
+ ax.set_title(f'{col}\nOutliers: {outliers_info[col]["outliers_count"]} ({outliers_info[col]["outliers_percent"]:.1f}%)')
403
+ ax.set_ylabel('Value')
404
+ ax.grid(True, alpha=0.3)
405
+
406
+ # Legend
407
+ if outliers_info[col]['outliers_count'] > 0:
408
+ ax.legend(loc='upper right', fontsize=8)
409
+ ax2.legend(loc='upper left', fontsize=8)
410
+
411
+ plt.tight_layout()
412
+ plt.savefig(
413
+ f'{self.config.results_dir}/plots/outliers_analysis.png',
414
+ dpi=300,
415
+ bbox_inches='tight'
416
+ )
417
+ plt.show()
418
+
419
+ # Additional plots for time series
420
+ if isinstance(data.index, pd.DatetimeIndex) and len(columns) > 0:
421
+ self._plot_temporal_outliers(data, columns, outliers_info)
422
+
423
+ def _plot_temporal_outliers(
424
+ self,
425
+ data: pd.DataFrame,
426
+ columns: List[str],
427
+ outliers_info: Dict
428
+ ) -> None:
429
+ """Visualise outliers over time"""
430
+ n_plots = min(len(columns), 3)
431
+
432
+ fig, axes = plt.subplots(n_plots, 1, figsize=(14, 4 * n_plots))
433
+ if n_plots == 1:
434
+ axes = [axes]
435
+
436
+ for idx, (col, ax) in enumerate(zip(columns[:n_plots], axes)):
437
+ if col not in outliers_info:
438
+ continue
439
+
440
+ # Time series
441
+ ax.plot(data.index, data[col], alpha=0.7, linewidth=1, label='Original series')
442
+
443
+ # Outliers
444
+ if outliers_info[col]['outliers_count'] > 0:
445
+ outlier_indices = outliers_info[col]['outlier_indices']
446
+ outlier_values = outliers_info[col]['outlier_values']
447
+
448
+ ax.scatter(
449
+ outlier_indices,
450
+ outlier_values,
451
+ color='red',
452
+ s=40,
453
+ edgecolors='black',
454
+ zorder=5,
455
+ label='Outliers'
456
+ )
457
+
458
+ # Moving average
459
+ if len(data) > 30:
460
+ rolling_mean = data[col].rolling(window=30, center=True).mean()
461
+ ax.plot(data.index, rolling_mean, 'orange', linewidth=2, label='Moving average (30)')
462
+
463
+ ax.set_title(f'Outliers over time: {col}')
464
+ ax.set_xlabel('Date')
465
+ ax.set_ylabel(col)
466
+ ax.legend(fontsize=8)
467
+ ax.grid(True, alpha=0.3)
468
+
469
+ plt.tight_layout()
470
+ plt.savefig(
471
+ f'{self.config.results_dir}/plots/temporal_outliers.png',
472
+ dpi=300,
473
+ bbox_inches='tight'
474
+ )
475
+ plt.show()
476
+
477
+ def handle(
478
+ self,
479
+ data: pd.DataFrame,
480
+ method: str = 'clip',
481
+ strategy: str = 'columnwise',
482
+ **kwargs
483
+ ) -> pd.DataFrame:
484
+ """
485
+ Handle outliers
486
+
487
+ Parameters:
488
+ -----------
489
+ data : pd.DataFrame
490
+ Input data
491
+ method : str
492
+ Handling method: 'clip', 'remove', 'mean', 'median', 'winsorize', 'transform', 'impute', 'adaptive'
493
+ strategy : str
494
+ Strategy: 'columnwise', 'global', 'adaptive'
495
+ **kwargs : dict
496
+ Additional parameters for method
497
+
498
+ Returns:
499
+ --------
500
+ pd.DataFrame
501
+ Data with handled outliers
502
+ """
503
+ logger.info("\n" + "="*80)
504
+ logger.info("HANDLING OUTLIERS")
505
+ logger.info("="*80)
506
+
507
+ if not self.outlier_info:
508
+ logger.warning("⚠ Perform outlier analysis first")
509
+ return data
510
+
511
+ data_processed = data.copy()
512
+ methods_applied = {}
513
+
514
+ for col, info in self.outlier_info.items():
515
+ if col not in data_processed.columns:
516
+ continue
517
+
518
+ outliers_count = info['outliers_count']
519
+
520
+ if outliers_count > 0:
521
+ # Create outlier mask
522
+ outlier_mask = pd.Series(False, index=data_processed.index)
523
+ if info['outlier_indices']:
524
+ outlier_indices = [idx for idx in info['outlier_indices'] if idx in data_processed.index]
525
+ outlier_mask.loc[outlier_indices] = True
526
+
527
+ # Determine boundaries
528
+ col_stats = info['statistics']  # local name; avoids shadowing scipy.stats
529
+ q1, q3 = col_stats['q1'], col_stats['q3']
530
+ iqr = q3 - q1
531
+ lower_bound = q1 - self.config.outlier_alpha * iqr
532
+ upper_bound = q3 + self.config.outlier_alpha * iqr
533
+
534
+ if method == 'clip':
535
+ # Clip values to boundaries
536
+ data_processed[col] = data_processed[col].clip(
537
+ lower=lower_bound,
538
+ upper=upper_bound
539
+ )
540
+ method_used = 'clipping'
541
+ affected = outliers_count
542
+
543
+ elif method == 'remove':
544
+ # Remove rows with outliers
545
+ data_processed = data_processed[~outlier_mask]
546
+ method_used = 'removal'
547
+ affected = outliers_count
548
+
549
+ elif method == 'mean':
550
+ # Replace outliers with mean value
551
+ mean_val = data_processed[col].mean()
552
+ data_processed.loc[outlier_mask, col] = mean_val
553
+ method_used = 'mean imputation'
554
+ affected = outliers_count
555
+
556
+ elif method == 'median':
557
+ # Replace outliers with median
558
+ median_val = data_processed[col].median()
559
+ data_processed.loc[outlier_mask, col] = median_val
560
+ method_used = 'median imputation'
561
+ affected = outliers_count
562
+
563
+ elif method == 'winsorize':
564
+ # Winsorisation
565
+ data_processed[col] = self._winsorize_series(
566
+ data_processed[col],
567
+ limits=kwargs.get('limits', (0.05, 0.05))
568
+ )
569
+ method_used = 'winsorization'
570
+ affected = outliers_count
571
+
572
+ elif method == 'transform':
573
+ # Transformation to reduce outlier impact
574
+ transform_method = kwargs.get('transform_method', 'log')
575
+ data_processed[col] = self._transform_series(
576
+ data_processed[col],
577
+ method=transform_method
578
+ )
579
+ method_used = f'{transform_method} transformation'
580
+ affected = 'all' # Transformation applied to all values
581
+
582
+ elif method == 'impute':
583
+ # Smart outlier imputation
584
+ impute_method = kwargs.get('impute_method', 'neighbors')
585
+ data_processed[col] = self._impute_outliers(
586
+ data_processed[col],
587
+ outlier_mask,
588
+ method=impute_method,
589
+ **kwargs
590
+ )
591
+ method_used = f'{impute_method} imputation'
592
+ affected = outliers_count
593
+
594
+ elif method == 'adaptive':
595
+ # Adaptive handling
596
+ data_processed[col] = self._adaptive_outlier_handling(
597
+ data_processed[col],
598
+ outlier_mask,
599
+ **kwargs
600
+ )
601
+ method_used = 'adaptive handling'
602
+ affected = outliers_count
603
+
604
+ else:
605
+ raise ValueError(f"Unknown method: {method}")
606
+
607
+ methods_applied[col] = {
608
+ 'method': method_used,
609
+ 'outliers_before': outliers_count,
610
+ 'affected': affected,
611
+ 'bounds': {
612
+ 'lower': float(lower_bound),
613
+ 'upper': float(upper_bound)
614
+ }
615
+ }
616
+
617
+ logger.info(f" {col}: {outliers_count} outliers handled ({method_used})")
618
+
619
+ self.handling_methods = methods_applied
620
+
621
+ # Handling statistics
622
+ total_outliers = sum(info['outliers_count'] for info in self.outlier_info.values())
623
+ total_affected = sum(method['affected'] for method in methods_applied.values()
624
+ if isinstance(method['affected'], (int, np.integer)))
625
+
626
+ logger.info(f"\n✓ {total_affected} out of {total_outliers} outliers handled")
627
+ logger.info(f" Data size before: {len(data)} rows")
628
+ logger.info(f" Data size after: {len(data_processed)} rows")
629
+
630
+ # Visualise results
631
+ if self.config.save_plots and methods_applied:
632
+ self._plot_outlier_handling_results(data, data_processed, methods_applied)
633
+
634
+ return data_processed
635
+
636
+ def _winsorize_series(
637
+ self,
638
+ series: pd.Series,
639
+ limits: Tuple[float, float] = (0.05, 0.05)
640
+ ) -> pd.Series:
641
+ """Winsorize series"""
642
+ from scipy.stats.mstats import winsorize
643
+ try:
644
+ winsorized = winsorize(series.values, limits=limits)
645
+ return pd.Series(winsorized, index=series.index)
646
+ except Exception:
647
+ return series
648
+
649
+ def _transform_series(
650
+ self,
651
+ series: pd.Series,
652
+ method: str = 'log'
653
+ ) -> pd.Series:
654
+ """Transform series to reduce outlier impact"""
655
+ series_transformed = series.copy()
656
+
657
+ if method == 'log':
658
+ # Logarithmic transformation
659
+ min_val = series.min()
660
+ if min_val <= 0:
661
+ shift = abs(min_val) + 1
662
+ series_transformed = np.log(series + shift)
663
+ else:
664
+ series_transformed = np.log(series)
665
+
666
+ elif method == 'boxcox':
667
+ # Box-Cox transformation
668
+ try:
669
+ from scipy.stats import boxcox
670
+ transformed, _ = boxcox(series - series.min() + 1)
671
+ series_transformed = pd.Series(transformed, index=series.index)
672
+ except Exception:
673
+ logger.warning("Box-Cox transformation failed, using log")
674
+ return self._transform_series(series, 'log')
675
+
676
+ elif method == 'sqrt':
677
+ # Square root
678
+ min_val = series.min()
679
+ if min_val < 0:
680
+ series_transformed = np.sqrt(series - min_val)
681
+ else:
682
+ series_transformed = np.sqrt(series)
683
+
684
+ elif method == 'yeojohnson':
685
+ # Yeo-Johnson transformation
686
+ try:
687
+ from scipy.stats import yeojohnson
688
+ transformed, _ = yeojohnson(series)
689
+ series_transformed = pd.Series(transformed, index=series.index)
690
+ except Exception:
691
+ logger.warning("Yeo-Johnson transformation failed, using log")
692
+ return self._transform_series(series, 'log')
693
+
694
+ return series_transformed
695
+
696
+ def _impute_outliers(
697
+ self,
698
+ series: pd.Series,
699
+ outlier_mask: pd.Series,
700
+ method: str = 'neighbors',
701
+ **kwargs
702
+ ) -> pd.Series:
703
+ """Smart outlier imputation"""
704
+ series_imputed = series.copy()
705
+
706
+ if method == 'neighbors':
707
+ # Replace with mean of neighbouring values
708
+ for idx in series[outlier_mask].index:
709
+ if idx in series.index:
710
+ pos = series.index.get_loc(idx)
711
+ neighbours = []
712
+
713
+ # Find nearest non-outliers
714
+ for offset in range(1, 6):
715
+ if pos - offset >= 0 and not outlier_mask.iloc[pos - offset]:
716
+ neighbours.append(series.iloc[pos - offset])
717
+ break
718
+
719
+ for offset in range(1, 6):
720
+ if pos + offset < len(series) and not outlier_mask.iloc[pos + offset]:
721
+ neighbours.append(series.iloc[pos + offset])
722
+ break
723
+
724
+ if neighbours:
725
+ series_imputed.loc[idx] = np.mean(neighbours)
726
+
727
+ elif method == 'interpolate':
728
+ # Interpolation
729
+ series_imputed = series.mask(outlier_mask).interpolate()
730
+
731
+ elif method == 'rolling':
732
+ # Replace with moving average
733
+ window = kwargs.get('window', 5)
734
+ rolling_mean = series.rolling(window=window, center=True, min_periods=1).mean()
735
+ series_imputed = series.mask(outlier_mask, rolling_mean)
736
+
737
+ return series_imputed
738
+
739
+ def _adaptive_outlier_handling(
740
+ self,
741
+ series: pd.Series,
742
+ outlier_mask: pd.Series,
743
+ **kwargs
744
+ ) -> pd.Series:
745
+ """Adaptive outlier handling"""
746
+ series_processed = series.copy()
747
+ outlier_indices = series[outlier_mask].index
748
+
749
+ for idx in outlier_indices:
750
+ if idx in series.index:
751
+ value = series.loc[idx]
752
+ col_stats = self.outlier_info.get(series.name, {}).get('statistics', {})
753
+
754
+ # Determine outlier type
755
+ q1 = col_stats.get('q1', series.quantile(0.25))
756
+ q3 = col_stats.get('q3', series.quantile(0.75))
757
+ iqr = q3 - q1
758
+
759
+ if value < q1 - 3 * iqr:
760
+ # Extreme low outlier
761
+ series_processed.loc[idx] = q1 - 1.5 * iqr
762
+ elif value > q3 + 3 * iqr:
763
+ # Extreme high outlier
764
+ series_processed.loc[idx] = q3 + 1.5 * iqr
765
+ else:
766
+ # Moderate outlier
767
+ pos = series.index.get_loc(idx)
768
+ # Use linear interpolation
769
+ if pos > 0 and pos < len(series) - 1:
770
+ series_processed.loc[idx] = (series.iloc[pos-1] + series.iloc[pos+1]) / 2
771
+
772
+ return series_processed
773
+
774
+ def _plot_outlier_handling_results(
775
+ self,
776
+ original_data: pd.DataFrame,
777
+ processed_data: pd.DataFrame,
778
+ methods_applied: Dict
779
+ ) -> None:
780
+ """Visualise outlier handling results"""
781
+ cols_to_plot = list(methods_applied.keys())[:3]
782
+
783
+ if not cols_to_plot:
784
+ return
785
+
786
+ fig, axes = plt.subplots(len(cols_to_plot), 2, figsize=(14, 4 * len(cols_to_plot)))
787
+ if len(cols_to_plot) == 1:
788
+ axes = axes.reshape(1, -1)
789
+
790
+ for idx, col in enumerate(cols_to_plot):
791
+ if col not in original_data.columns or col not in processed_data.columns:
792
+ continue
793
+
794
+ # Distribution before handling
795
+ axes[idx, 0].hist(original_data[col].dropna(), bins=30, alpha=0.5, label='Before', density=True)
796
+ axes[idx, 0].hist(processed_data[col].dropna(), bins=30, alpha=0.5, label='After', density=True)
797
+ axes[idx, 0].set_title(f'{col}: Distribution before/after')
798
+ axes[idx, 0].set_xlabel('Value')
799
+ axes[idx, 0].set_ylabel('Density')
800
+ axes[idx, 0].legend()
801
+ axes[idx, 0].grid(True, alpha=0.3)
802
+
803
+ # QQ plot for normality check
804
+ stats.probplot(original_data[col].dropna(), dist="norm", plot=axes[idx, 1])
805
+ axes[idx, 1].set_title(f'{col}: Q-Q plot (before handling)')
806
+ axes[idx, 1].grid(True, alpha=0.3)
807
+
808
+ plt.tight_layout()
809
+ plt.savefig(
810
+ f'{self.config.results_dir}/plots/outlier_handling_results.png',
811
+ dpi=300,
812
+ bbox_inches='tight'
813
+ )
814
+ plt.show()
815
+
816
+ def create_validation_rules(self) -> Dict:
817
+ """Create validation rules based on outlier analysis"""
818
+ rules = {}
819
+
820
+ for col, info in self.outlier_info.items():
821
+ outliers_percent = info['outliers_percent']
822
+ skewness = info['statistics']['skewness']
823
+
824
+ rule = {
825
+ 'outliers_percent': outliers_percent,
826
+ 'skewness': skewness,
827
+ 'recommended_action': 'none'
828
+ }
829
+
830
+ if outliers_percent > 10:
831
+ rule['recommended_action'] = 'aggressive_handling'
832
+ rule['reason'] = f'High outliers: {outliers_percent:.1f}%'
833
+ elif outliers_percent > 5:
834
+ rule['recommended_action'] = 'moderate_handling'
835
+ rule['reason'] = f'Moderate outliers: {outliers_percent:.1f}%'
836
+ elif outliers_percent > 1:
837
+ rule['recommended_action'] = 'conservative_handling'
838
+ rule['reason'] = f'Low outliers: {outliers_percent:.1f}%'
839
+
840
+ if abs(skewness) > 1:
841
+ rule['skewness_issue'] = True
842
+ rule['skewness_reason'] = f'Strong skewness: {skewness:.2f}'
843
+ if rule['recommended_action'] == 'none':
844
+ rule['recommended_action'] = 'transformation'
845
+
846
+ rules[col] = rule
847
+
848
+ return rules
849
+
850
+ def get_report(self) -> Dict:
851
+ """Get outlier analysis report"""
852
+ return {
853
+ 'outlier_info': self.outlier_info,
854
+ 'handling_methods': self.handling_methods,
855
+ 'detection_methods': self.detection_methods,
856
+ 'validation_rules': self.create_validation_rules()
857
+ }
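
A minimal usage sketch for the analyser above (hedged: the `Config` keyword arguments and the input CSV are illustrative, and `analyse()` must run before `handle()` because `handle()` reads `self.outlier_info`):

```python
import pandas as pd

from config.config import Config
from outliers.outlier_analyzer import OutlierAnalyser

# Illustrative input: any CSV with a 'date' column and numeric series
df = pd.read_csv('temp_data.csv', parse_dates=['date'], index_col='date')

config = Config(outlier_method='iqr', results_dir='results')  # assumed kwargs
analyser = OutlierAnalyser(config)

# Detection first, then handling: handle() relies on self.outlier_info
analyser.analyse(
    df,
    method=config.outlier_method,
    columns=df.select_dtypes(include='number').columns.tolist(),
)
df_clean = analyser.handle(df, method='clip')
print(analyser.get_report()['handling_methods'])
```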
pipeline/__init__.py ADDED
File without changes
pipeline/main_pipeline.py ADDED
@@ -0,0 +1,603 @@
1
+ # ============================================
2
+ # CLASS 14: MAIN PIPELINE
3
+ # ============================================
4
+ from datetime import datetime
5
+ import json
6
+ import os
7
+ import traceback
8
+ from typing import Any, Dict, Optional
9
+ import logging
10
+ from config.config import Config
11
+ from correlations.correlation_analyzer import CorrelationAnalyzer
12
+ from data_loader.data_loader import DataLoader
13
+ from decomposition.decomposer import TimeSeriesDecomposer
14
+ from feature_selection.feature_selector import FeatureSelector
15
+ from features.feature_engineer import FeatureEngineer
16
+
17
+ from missing_values.missing_analyzer import MissingValueAnalyser
18
+ from outliers.outlier_analyzer import OutlierAnalyser
19
+ from scaling.data_scaler import DataScaler
20
+ from splitting.data_splitter import DataSplitter
21
+ from stationarity.stationarity_checker import StationarityChecker
22
+ from validation.data_validator import DataValidator
23
+ import pandas as pd
24
+ import numpy as np
25
+
26
+ from visualization.visualization_manager import VisualisationManager
27
+
+ logger = logging.getLogger(__name__)
28
+ class EnhancedDataPreprocessingPipeline:
29
+ """Enhanced main data preprocessing pipeline"""
30
+
31
+ def __init__(self, config: Config):
32
+ """
33
+ Initialise pipeline
34
+
35
+ Parameters:
36
+ -----------
37
+ config : Config
38
+ Experiment configuration
39
+ """
40
+ self.config = config
41
+ self.data_loader = DataLoader(config)
42
+ self.missing_analyser = MissingValueAnalyser(config)
43
+ self.outlier_analyser = OutlierAnalyser(config)
44
+ self.feature_engineer = FeatureEngineer(config)
45
+ self.stationarity_checker = StationarityChecker(config)
46
+ self.decomposer = TimeSeriesDecomposer(config)
47
+ self.correlation_analyser = CorrelationAnalyzer(config)
48
+ self.data_splitter = DataSplitter(config)
49
+ self.data_scaler = DataScaler(config)
50
+ self.feature_selector = FeatureSelector(config)
51
+ self.data_validator = DataValidator(config)
52
+ self.visualisation_manager = VisualisationManager(config)
53
+
54
+ self.results = {}
55
+ self.processed_data = None
56
+ self.train_data = None
57
+ self.val_data = None
58
+ self.test_data = None
59
+ self.is_fitted = False
60
+
61
+ def run_full_pipeline(
62
+ self,
63
+ data_path: Optional[str] = None,
64
+ use_synthetic: bool = False,
65
+ save_intermediate: bool = True,
66
+ create_reports: bool = True
67
+ ) -> pd.DataFrame:
68
+ """
69
+ Run enhanced full data preprocessing pipeline
70
+
71
+ Parameters:
72
+ -----------
73
+ data_path : str, optional
74
+ Path to data. If None, uses configuration value.
75
+ use_synthetic : bool
76
+ Use synthetic data for testing
77
+ save_intermediate : bool
78
+ Save intermediate results
79
+ create_reports : bool
80
+ Create reports
81
+
82
+ Returns:
83
+ --------
84
+ pd.DataFrame
85
+ Processed data
86
+ """
87
+ logger.info("\n" + "="*80)
88
+ logger.info("RUNNING ENHANCED DATA PREPROCESSING PIPELINE")
89
+ logger.info("="*80)
90
+
91
+ start_time = datetime.now()
92
+
93
+ try:
94
+ # Step 1: Data loading
95
+ logger.info("\n" + "="*80)
96
+ logger.info("STEP 1: DATA LOADING")
97
+ logger.info("="*80)
98
+
99
+ if use_synthetic:
100
+ data = self.data_loader.create_synthetic_data(
101
+ n_days=365*20,
102
+ trend_strength=0.01,
103
+ noise_std=10,
104
+ include_exogenous=True
105
+ )
106
+ else:
107
+ data = self.data_loader.load_from_csv(
108
+ data_path=data_path,
109
+ parse_dates=['date']
110
+ )
111
+
112
+ # Check for date index
113
+ if not isinstance(data.index, pd.DatetimeIndex):
114
+ logger.warning("Index is not DatetimeIndex, setting...")
115
+ if 'date' in data.columns:
116
+ data = data.set_index('date')
117
+ logger.info("Index set from 'date' column")
118
+
119
+ self.results['data_loading'] = {
120
+ 'shape': list(data.shape),
121
+ 'columns': list(data.columns),
122
+ 'date_range': {
123
+ 'min': data.index.min().strftime('%Y-%m-%d') if isinstance(data.index, pd.DatetimeIndex) else None,
124
+ 'max': data.index.max().strftime('%Y-%m-%d') if isinstance(data.index, pd.DatetimeIndex) else None
125
+ },
126
+ 'is_datetime_index': isinstance(data.index, pd.DatetimeIndex)
127
+ }
128
+
129
+ # Save raw data information
130
+ self.data_loader.save_raw_data_info()
131
+
132
+ # Step 2: Raw data validation
133
+ logger.info("\n" + "="*80)
134
+ logger.info("STEP 2: RAW DATA VALIDATION")
135
+ logger.info("="*80)
136
+
137
+ raw_validation = self.data_validator.validate(
138
+ data, stage='raw', detailed=True
139
+ )
140
+ self.results['raw_validation'] = raw_validation
141
+
142
+ if raw_validation['status'] == 'FAIL':
143
+ logger.warning("⚠ Raw data has critical issues!")
144
+ if not self.config.enable_validation:
145
+ logger.warning("Validation disabled in configuration, continuing processing")
146
+ else:
147
+ logger.error("Pipeline interrupted due to data issues")
148
+ return None
149
+
150
+ # Step 3: Missing values analysis and handling
151
+ logger.info("\n" + "="*80)
152
+ logger.info("STEP 3: MISSING VALUES HANDLING")
153
+ logger.info("="*80)
154
+
155
+ missing_info = self.missing_analyser.analyse(data, detailed=True)
156
+ self.results['missing_analysis'] = missing_info
157
+
158
+ # Handle missing values
159
+ data = self.missing_analyser.handle(
160
+ data,
161
+ method='interpolate',
162
+ strategy='columnwise'
163
+ )
164
+ self.results['missing_handling'] = self.missing_analyser.handling_methods
165
+
166
+ # Step 4: Outlier analysis and handling
167
+ logger.info("\n" + "="*80)
168
+ logger.info("STEP 4: OUTLIER HANDLING")
169
+ logger.info("="*80)
170
+
171
+ outlier_info = self.outlier_analyser.analyse(
172
+ data,
173
+ method=self.config.outlier_method,
174
+ columns=data.select_dtypes(include=[np.number]).columns.tolist()
175
+ )
176
+ self.results['outlier_analysis'] = outlier_info
177
+
178
+ # Handle outliers
179
+ data = self.outlier_analyser.handle(
180
+ data,
181
+ method='clip',
182
+ strategy='columnwise'
183
+ )
184
+ self.results['outlier_handling'] = self.outlier_analyser.handling_methods
185
+
186
+ # Step 5: Feature engineering
187
+ logger.info("\n" + "="*80)
188
+ logger.info("STEP 5: FEATURE ENGINEERING")
189
+ logger.info("="*80)
190
+
191
+ data = self.feature_engineer.create_all_features(data)
192
+ self.results['feature_engineering'] = self.feature_engineer.feature_info
193
+
194
+ # Check for data after feature engineering
195
+ if len(data) == 0:
196
+ logger.error("No data remaining after feature engineering!")
197
+ return None
198
+
199
+ # Step 6: Stationarity analysis
200
+ logger.info("\n" + "="*80)
201
+ logger.info("STEP 6: STATIONARITY ANALYSIS")
202
+ logger.info("="*80)
203
+
204
+ stationarity_results = self.stationarity_checker.check(
205
+ data,
206
+ target_col=self.config.target_column,
207
+ make_stationary=True,
208
+ try_transformations=True
209
+ )
210
+ self.results['stationarity_analysis'] = stationarity_results
211
+
212
+ # Step 7: Time series decomposition
213
+ logger.info("\n" + "="*80)
214
+ logger.info("STEP 7: TIME SERIES DECOMPOSITION")
215
+ logger.info("="*80)
216
+
217
+ if isinstance(data.index, pd.DatetimeIndex) and len(data) > 365:
218
+ decomposition_results = self.decomposer.decompose(
219
+ data,
220
+ target_col=self.config.target_column,
221
+ method='stl',
222
+ period=self.config.seasonal_period
223
+ )
224
+ self.results['decomposition'] = decomposition_results
225
+ else:
226
+ logger.info("Skipped: insufficient data or no DatetimeIndex")
227
+ self.results['decomposition'] = {'skipped': 'insufficient data or no DatetimeIndex'}
228
+
229
+ # Step 8: Correlation analysis
230
+ logger.info("\n" + "="*80)
231
+ logger.info("STEP 8: CORRELATION ANALYSIS")
232
+ logger.info("="*80)
233
+
234
+ corr_matrix = self.correlation_analyser.analyze(
235
+ data,
236
+ target_col=self.config.target_column,
237
+ threshold=0.8,
238
+ detailed=True
239
+ )
240
+ self.results['correlation_analysis'] = self.correlation_analyser.get_report()
241
+
242
+ # Remove highly correlated features
243
+ if not corr_matrix.empty:
244
+ data = self.correlation_analyser.remove_highly_correlated(
245
+ data,
246
+ threshold=0.95,
247
+ method='variance',
248
+ keep_target=True
249
+ )
250
+ else:
251
+ logger.warning("Correlation matrix empty, skipping feature removal")
252
+
253
+ # Step 9: Processed data validation
254
+ logger.info("\n" + "="*80)
255
+ logger.info("STEP 9: PROCESSED DATA VALIDATION")
256
+ logger.info("="*80)
257
+
258
+ processed_validation = self.data_validator.validate(
259
+ data, stage='processed', detailed=True
260
+ )
261
+ self.results['processed_validation'] = processed_validation
262
+
263
+ if processed_validation['status'] == 'FAIL':
264
+ logger.warning("⚠ Processed data failed validation!")
265
+ logger.warning("Continuing pipeline, but data quality may be low")
266
+ elif processed_validation['status'] == 'WARNING':
267
+ logger.warning("⚠ Processed data requires attention")
268
+
269
+ # Step 10: Data splitting
270
+ logger.info("\n" + "="*80)
271
+ logger.info("STEP 10: DATA SPLITTING")
272
+ logger.info("="*80)
273
+
274
+ train_data, val_data, test_data = self.data_splitter.split(
275
+ data,
276
+ method=self.config.split_method,
277
+ test_size=self.config.test_size,
278
+ validation_size=self.config.validation_size
279
+ )
280
+
281
+ self.train_data = train_data
282
+ self.val_data = val_data
283
+ self.test_data = test_data
284
+
285
+ self.results['data_splitting'] = self.data_splitter.split_info
286
+
287
+ # Step 11: Data scaling
288
+ logger.info("\n" + "="*80)
289
+ logger.info("STEP 11: DATA SCALING")
290
+ logger.info("="*80)
291
+
292
+ # Scale training data
293
+ train_data_scaled = self.data_scaler.fit_transform(
294
+ train_data,
295
+ method=self.config.scaling_method,
296
+ target_col=self.config.target_column,
297
+ fit_on_train=True
298
+ )
299
+
300
+ # Apply same scaling to validation and test data
301
+ val_data_scaled = self.data_scaler.transform(val_data)
302
+ test_data_scaled = self.data_scaler.transform(test_data)
303
+
304
+ self.train_data = train_data_scaled
305
+ self.val_data = val_data_scaled
306
+ self.test_data = test_data_scaled
307
+
308
+ self.results['data_scaling'] = self.data_scaler.get_report()
309
+
310
+ # Step 12: Feature selection
311
+ logger.info("\n" + "="*80)
312
+ logger.info("STEP 12: FEATURE SELECTION")
313
+ logger.info("="*80)
314
+
315
+ if len(train_data_scaled.columns) > 5:
316
+ # Select features on training data
317
+ train_data_selected = self.feature_selector.select(
318
+ train_data_scaled,
319
+ method=self.config.feature_selection_method,
320
+ n_features=min(self.config.max_features, len(train_data_scaled.columns) - 1)
321
+ )
322
+
323
+ # Save selected features
324
+ selected_features = self.feature_selector.selected_features
325
+
326
+ # Apply same selection to validation and test data
327
+ features_to_keep = selected_features + [self.config.target_column]
328
+ features_to_keep = [f for f in features_to_keep if f in val_data_scaled.columns]
329
+
330
+ if len(features_to_keep) > 1:
331
+ self.train_data = train_data_scaled[features_to_keep].copy()
332
+ self.val_data = val_data_scaled[features_to_keep].copy()
333
+ self.test_data = test_data_scaled[features_to_keep].copy()
334
+ else:
335
+ logger.warning("Failed to select features, using all")
336
+
337
+ self.results['feature_selection'] = self.feature_selector.get_report()
338
+ else:
339
+ logger.info("Skipped: insufficient features for selection")
340
+ self.results['feature_selection'] = {'skipped': 'insufficient features'}
341
+
342
+ # Step 13: Final validation
343
+ logger.info("\n" + "="*80)
344
+ logger.info("STEP 13: FINAL VALIDATION")
345
+ logger.info("="*80)
346
+
347
+ # Combine all data for final validation
348
+ all_processed_data = pd.concat([self.train_data, self.val_data, self.test_data])
349
+
350
+ final_validation = self.data_validator.validate(
351
+ all_processed_data, stage='final', detailed=True
352
+ )
353
+ self.results['final_validation'] = final_validation
354
+
355
+ self.processed_data = all_processed_data
356
+ self.is_fitted = True
357
+
358
+ # Step 14: Additional multicollinearity cleaning
359
+ logger.info("\n" + "="*80)
360
+ logger.info("STEP 14: ADDITIONAL MULTICOLLINEARITY CLEANING")
361
+ logger.info("="*80)
362
+
363
+ # Remove temporal features with extreme VIF
364
+ self.processed_data = self._remove_extreme_vif_features(self.processed_data)
365
+ self.train_data = self.train_data[self.processed_data.columns]
366
+ self.val_data = self.val_data[self.processed_data.columns]
367
+ self.test_data = self.test_data[self.processed_data.columns]
368
+
369
+ # Step 15: Create visualisations and reports
370
+ logger.info("\n" + "="*80)
371
+ logger.info("STEP 15: CREATING REPORTS AND VISUALISATIONS")
372
+ logger.info("="*80)
373
+
374
+ if create_reports:
375
+ self.create_all_reports()
376
+ self.create_all_visualisations()
377
+
378
+ # Calculate execution time
379
+ execution_time = (datetime.now() - start_time).total_seconds()
380
+
381
+ # Save final results
382
+ self.results['pipeline_execution'] = {
383
+ 'start_time': start_time.strftime('%Y-%m-%d %H:%M:%S'),
384
+ 'end_time': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
385
+ 'execution_time_seconds': execution_time,
386
+ 'success': True,
387
+ 'stages_completed': 15
388
+ }
389
+
390
+ # Save configuration and results
391
+ self.save_pipeline_results()
392
+
393
+ logger.info("\n" + "="*80)
394
+ logger.info("ENHANCED PIPELINE SUCCESSFULLY COMPLETED!")
395
+ logger.info("="*80)
396
+ logger.info(f"Execution time: {execution_time:.2f} seconds")
397
+ logger.info(f"Initial data size: {self.results['data_loading']['shape']}")
398
+ logger.info(f"Final data size: {list(self.processed_data.shape)}")
399
+ logger.info(f"Data quality: {final_validation['overall_score']}/100")
400
+ logger.info(f"Status: {final_validation['status']}")
401
+ logger.info(f"Training data: {len(self.train_data)} records")
402
+ logger.info(f"Features in final set: {len(self.train_data.columns)}")
403
+
404
+ return self.processed_data
405
+
406
+ except Exception as e:
407
+ logger.error(f"✗ Pipeline error: {e}")
408
+ logger.error(traceback.format_exc())
409
+
410
+ self.results['pipeline_execution'] = {
411
+ 'success': False,
412
+ 'error': str(e),
413
+ 'traceback': traceback.format_exc()
414
+ }
415
+
416
+ # Save partial results
417
+ self.save_pipeline_results()
418
+
419
+ return None
420
+
421
+ def _remove_extreme_vif_features(self, data: pd.DataFrame) -> pd.DataFrame:
422
+ """Remove features with extreme VIF"""
423
+ data_clean = data.copy()
424
+
425
+ # Identify features with extreme VIF for removal
426
+ extreme_vif_features = [
427
+ 'year', 'day', 'dayofyear', 'days_from_start',
428
+ 'raskhodvoda_zscore' # Usually has extreme VIF
429
+ ]
430
+
431
+ # Remove only those present in data
432
+ features_to_remove = [f for f in extreme_vif_features if f in data_clean.columns]
433
+
434
+ if features_to_remove:
435
+ logger.info(f"Removing features with extreme VIF: {features_to_remove}")
436
+ data_clean = data_clean.drop(columns=features_to_remove)
437
+
438
+ return data_clean
439
+
440
+ def create_all_reports(self) -> None:
441
+ """Create all reports"""
442
+ logger.info("Creating reports...")
443
+
444
+ # 1. Save validation results
445
+ for stage in ['raw', 'processed', 'final']:
446
+ if stage in self.data_validator.validation_results:
447
+ self.data_validator.save_report(stage)
448
+
449
+ # 2. Save plots information
450
+ self.visualisation_manager.save_plots_info()
451
+
452
+ # 3. Create summary report
453
+ self.create_summary_report()
454
+
455
+ logger.info("✓ All reports created")
456
+
457
+ def create_all_visualisations(self) -> None:
458
+ """Create all visualisations"""
459
+ logger.info("Creating visualisations...")
460
+
461
+ if self.processed_data is not None:
462
+ # 1. Summary dashboard
463
+ preprocessing_stages = {
464
+ 'Loading': self.results['data_loading']['shape'][1] if 'data_loading' in self.results else 0,
465
+ 'After cleaning': len(self.processed_data.columns),
466
+ 'Features created': self.feature_engineer.feature_info.get('features_created', 0),
467
+ 'Features selected': len(self.feature_selector.selected_features) if hasattr(self.feature_selector, 'selected_features') else 0
468
+ }
469
+
470
+ self.visualisation_manager.create_summary_dashboard(
471
+ self.processed_data,
472
+ preprocessing_stages
473
+ )
474
+
475
+ logger.info("✓ All visualisations created")
476
+
477
+ def create_summary_report(self) -> None:
478
+ """Create summary report"""
479
+ report = {
480
+ 'pipeline_summary': {
481
+ 'config': self.config.to_dict(),
482
+ 'execution': self.results.get('pipeline_execution', {}),
483
+ 'data_statistics': {
484
+ 'initial_shape': self.results.get('data_loading', {}).get('shape', []),
485
+ 'final_shape': list(self.processed_data.shape) if self.processed_data is not None else [],
486
+ 'target_column': self.config.target_column,
487
+ 'features_created': self.feature_engineer.feature_info.get('features_created', 0),
488
+ 'features_selected': len(self.feature_selector.selected_features) if hasattr(self.feature_selector, 'selected_features') else 0
489
+ }
490
+ },
491
+ 'validation_summary': {},
492
+ 'quality_metrics': {}
493
+ }
494
+
495
+ # Add validation results
496
+ for stage in ['raw', 'processed', 'final']:
497
+ if stage in self.data_validator.validation_results:
498
+ stage_results = self.data_validator.validation_results[stage]
499
+ report['validation_summary'][stage] = {
500
+ 'status': stage_results.get('status'),
501
+ 'score': stage_results.get('overall_score'),
502
+ 'issues_count': sum(len(issues) for issues in stage_results.get('issues', {}).values()),
503
+ 'checks_passed': sum(1 for check in stage_results.get('basic_checks', {}).values()
504
+ if check.get('passed', False))
505
+ }
506
+
507
+ # Save report
508
+ report_path = f'{self.config.results_dir}/reports/pipeline_summary.json'
509
+
510
+ with open(report_path, 'w', encoding='utf-8') as f:
511
+ json.dump(report, f, indent=4, ensure_ascii=False)
512
+
513
+ logger.info(f"✓ Summary report saved: {report_path}")
514
+
515
+ def save_pipeline_results(self) -> None:
516
+ """Save all pipeline results"""
517
+ # Save configuration
518
+ self.config.save()
519
+
520
+ # Save data
521
+ if self.processed_data is not None:
522
+ # Save processed data
523
+ data_path = f'{self.config.results_dir}/processed_data/processed_data.csv'
524
+ self.processed_data.to_csv(data_path)
525
+ logger.info(f"✓ Processed data saved: {data_path}")
526
+
527
+ # Save split data
528
+ if self.train_data is not None:
529
+ self.train_data.to_csv(f'{self.config.results_dir}/processed_data/train_data.csv')
530
+ self.val_data.to_csv(f'{self.config.results_dir}/processed_data/val_data.csv')
531
+ self.test_data.to_csv(f'{self.config.results_dir}/processed_data/test_data.csv')
532
+
533
+ def get_final_data_for_modelling(self) -> Dict[str, Any]:
534
+ """Prepare data for modelling"""
535
+ if not self.is_fitted:
536
+ logger.warning("Pipeline not executed, data not ready")
537
+ return {}
538
+
539
+ return {
540
+ 'X_train': self.train_data.drop(columns=[self.config.target_column]),
541
+ 'y_train': self.train_data[self.config.target_column],
542
+ 'X_val': self.val_data.drop(columns=[self.config.target_column]),
543
+ 'y_val': self.val_data[self.config.target_column],
544
+ 'X_test': self.test_data.drop(columns=[self.config.target_column]),
545
+ 'y_test': self.test_data[self.config.target_column],
546
+ 'feature_names': self.train_data.drop(columns=[self.config.target_column]).columns.tolist(),
547
+ 'scaler': self.data_scaler,
548
+ 'feature_selector': self.feature_selector,
549
+ 'results': self.results
550
+ }
551
+
552
+
553
+ # ============================================
554
+ # QUICK LAUNCH FUNCTION
555
+ # ============================================
556
+ def run_enhanced_preprocessing(
557
+ config_path: Optional[str] = None,
558
+ data_path: Optional[str] = None,
559
+ use_synthetic: bool = False,
560
+ save_results: bool = True
561
+ ) -> EnhancedDataPreprocessingPipeline:
562
+ """
563
+ Quick launch function for enhanced pipeline
564
+
565
+ Parameters:
566
+ -----------
567
+ config_path : str, optional
568
+ Path to configuration file
569
+ data_path : str, optional
570
+ Path to data
571
+ use_synthetic : bool
572
+ Use synthetic data
573
+ save_results : bool
574
+ Save results
575
+
576
+ Returns:
577
+ --------
578
+ EnhancedDataPreprocessingPipeline
579
+ Pipeline object with results
580
+ """
581
+ # Load or create configuration
582
+ if config_path and os.path.exists(config_path):
583
+ config = Config.load(config_path)
584
+ logger.info(f"Configuration loaded from {config_path}")
585
+ else:
586
+ config = Config()
587
+ logger.info("Using default configuration")
588
+
589
+ # Update data path if specified
590
+ if data_path:
591
+ config.data_path = data_path
592
+
593
+ # Create and run pipeline
594
+ pipeline = EnhancedDataPreprocessingPipeline(config)
595
+
596
+ pipeline.run_full_pipeline(
597
+ data_path=data_path,
598
+ use_synthetic=use_synthetic,
599
+ save_intermediate=save_results,
600
+ create_reports=save_results
601
+ )
602
+
603
+ return pipeline
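
For context, a hedged sketch of how the dictionary returned by `get_final_data_for_modelling()` might feed a downstream model (the estimator choice here is illustrative, not part of the pipeline):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Quick launch via the helper defined above
pipeline = run_enhanced_preprocessing(data_path='temp_data.csv')
bundle = pipeline.get_final_data_for_modelling()

if bundle:  # an empty dict means the pipeline did not complete
    model = RandomForestRegressor(n_estimators=200, random_state=42)
    model.fit(bundle['X_train'], bundle['y_train'])
    preds = model.predict(bundle['X_val'])
    print('Validation MAE:', mean_absolute_error(bundle['y_val'], preds))
```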
requirements.txt CHANGED
@@ -1,3 +1,100 @@
1
- altair
2
- pandas
3
- streamlit
1
+ absl-py==2.3.1
2
+ altair==6.0.0
3
+ anyio==4.11.0
4
+ astunparse==1.6.3
5
+ attrs==25.4.0
6
+ blinker==1.9.0
7
+ cachetools==6.2.4
8
+ certifi==2025.11.12
9
+ charset-normalizer==3.4.4
10
+ click==8.3.1
11
+ colorama==0.4.6
12
+ contourpy==1.3.2
13
+ cycler==0.12.1
14
+ et_xmlfile==2.0.0
15
+ filelock==3.20.1
16
+ flatbuffers==25.12.19
17
+ fonttools==4.61.1
18
+ fsspec==2025.12.0
19
+ gast==0.7.0
20
+ gensim==4.4.0
21
+ gitdb==4.0.12
22
+ GitPython==3.1.46
23
+ google-pasta==0.2.0
24
+ grpcio==1.76.0
25
+ h5py==3.15.1
26
+ h11==0.16.0
27
+ hf-xet==1.2.0
28
+ httpcore==1.0.9
29
+ httpx==0.28.1
30
+ huggingface_hub==1.1.2
31
+ idna==3.11
32
+ Jinja2==3.1.6
33
+ joblib==1.5.3
34
+ jsonschema==4.25.1
35
+ jsonschema-specifications==2025.9.1
36
+ keras==3.13.0
37
+ kiwisolver==1.4.9
38
+ libclang==18.1.1
39
+ Markdown==3.10
40
+ markdown-it-py==4.0.0
41
+ MarkupSafe==3.0.3
42
+ matplotlib==3.10.8
43
+ mdurl==0.1.2
44
+ ml_dtypes==0.5.4
45
+ mpmath==1.3.0
46
+ namex==0.1.0
47
+ narwhals==2.14.0
48
+ networkx==3.6.1
49
+ numpy==2.4.0
50
+ openpyxl==3.1.5
51
+ opt_einsum==3.4.0
52
+ optree==0.18.0
53
+ packaging==25.0
54
+ pandas==2.3.3
55
+ patsy==1.0.2
56
+ pillow==12.0.0
57
+ plotly==6.5.0
58
+ protobuf==6.33.2
59
+ pyarrow==22.0.0
60
+ pydeck==0.9.1
61
+ Pygments==2.19.2
62
+ pyparsing==3.3.1
63
+ pyperclip==1.11.0
64
+ python-dateutil==2.9.0.post0
65
+ pytz==2025.2
66
+ PyYAML==6.0.3
67
+ referencing==0.37.0
68
+ requests==2.32.5
69
+ rich==14.2.0
70
+ rpds-py==0.30.0
71
+ scikit-learn==1.8.0
72
+ scipy==1.16.3
73
+ seaborn==0.13.2
74
+ setuptools==80.9.0
75
+ shellingham==1.5.4
76
+ six==1.17.0
77
+ smart_open==7.5.0
78
+ smmap==5.0.2
79
+ sniffio==1.3.1
80
+ statsmodels==0.14.6
81
+ streamlit==1.52.2
82
+ sympy==1.14.0
83
+ tenacity==9.1.2
84
+ tensorboard==2.20.0
85
+ tensorboard-data-server==0.7.2
86
+ tensorflow==2.20.0
87
+ termcolor==3.3.0
88
+ threadpoolctl==3.6.0
89
+ toml==0.10.2
90
+ torch==2.9.1
91
+ tornado==6.5.4
92
+ tqdm==4.67.1
93
+ typer-slim==0.20.0
94
+ typing_extensions==4.15.0
95
+ tzdata==2025.3
96
+ urllib3==2.6.2
97
+ watchdog==6.0.0
98
+ Werkzeug==3.1.4
99
+ wheel==0.45.1
100
+ wrapt==2.0.1
run_pipeline.py ADDED
@@ -0,0 +1,62 @@
1
+ # ============================================
2
+ # RUN
3
+ # ============================================
4
+ from config.config import Config
5
+ from pipeline.main_pipeline import EnhancedDataPreprocessingPipeline
6
+ import pandas as pd
7
+
8
+ if __name__ == "__main__":
9
+ """
10
+ Pipeline execution
11
+ """
12
+
13
+ # Configuration with reasonable parameters
14
+ config = Config(
15
+ data_path='temp_data.csv',
16
+ results_dir='enhanced_preprocessing_results',
17
+ target_column='raskhodvoda',
18
+ start_year=1970,
19
+ end_year=1990,
20
+ max_lags=5,
21
+ seasonal_period=365,
22
+ rolling_windows=[7, 30, 90],
23
+ expanding_windows=[30, 90],
24
+ test_size=0.2,
25
+ validation_size=0.1,
26
+ scaling_method='robust',
27
+ feature_selection_method='correlation',
28
+ max_features=20,
29
+ missing_threshold=0.3,
30
+ outlier_method='iqr',
31
+ enable_validation=True
32
+ )
33
+
34
+ # Run enhanced pipeline
35
+ pipeline = EnhancedDataPreprocessingPipeline(config)
36
+ processed_data = pipeline.run_full_pipeline(
37
+ use_synthetic=False,
38
+ save_intermediate=True,
39
+ create_reports=True
40
+ )
41
+
42
+ if processed_data is not None:
43
+ print("\n" + "="*80)
44
+ print("ENHANCED PIPELINE SUCCESSFULLY COMPLETED!")
45
+ print("="*80)
46
+ print(f"Final data size: {processed_data.shape}")
47
+ print(f"Columns: {list(processed_data.columns)}")
48
+
49
+ # Get modeling data
50
+ modeling_data = pipeline.get_final_data_for_modelling()
51
+
52
+ if modeling_data:
53
+ print(f"\nModeling data ready:")
54
+ print(f" X_train: {modeling_data['X_train'].shape}")
55
+ print(f" X_val: {modeling_data['X_val'].shape}")
56
+ print(f" X_test: {modeling_data['X_test'].shape}")
57
+ print(f" Features: {len(modeling_data['feature_names'])}")
58
+
59
+ # Save final data
60
+ processed_data.to_csv('enhanced_preprocessing_results/processed_data/enhanced_final_processed_data.csv',
61
+ index=isinstance(processed_data.index, pd.DatetimeIndex))
62
+ print(f"\n✓ Final data saved to 'enhanced_final_processed_data.csv'")
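
The saved splits can be reloaded later without re-running the pipeline; a minimal sketch, assuming the `results_dir` and `target_column` from the configuration above:

```python
import pandas as pd

results_dir = 'enhanced_preprocessing_results/processed_data'
train = pd.read_csv(f'{results_dir}/train_data.csv', index_col=0, parse_dates=True)
test = pd.read_csv(f'{results_dir}/test_data.csv', index_col=0, parse_dates=True)

X_train = train.drop(columns=['raskhodvoda'])
y_train = train['raskhodvoda']
print(X_train.shape, y_train.shape, test.shape)
```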
scaling/__init__.py ADDED
File without changes
scaling/data_scaler.py ADDED
@@ -0,0 +1,634 @@
1
+ # ============================================
2
+ # CLASS 10: DATA SCALING
3
+ # ============================================
4
+ from typing import Dict, List, Optional, Tuple
5
+ import logging
6
+ import pandas as pd
7
+ from config.config import Config
8
+ import numpy as np
9
+ import matplotlib.pyplot as plt
10
+
+ logger = logging.getLogger(__name__)
11
+
12
+ class DataScaler:
13
+ """Class for data scaling and normalisation"""
14
+
15
+ def __init__(self, config: Config):
16
+ """
17
+ Initialise scaler
18
+
19
+ Parameters:
20
+ -----------
21
+ config : Config
22
+ Experiment configuration
23
+ """
24
+ self.config = config
25
+ self.scalers = {}
26
+ self.scaling_info = {}
27
+ self.transforms_applied = {}
28
+
29
+ def fit_transform(
30
+ self,
31
+ data: pd.DataFrame,
32
+ method: str = None,
33
+ columns: List[str] = None,
34
+ target_col: Optional[str] = None,
35
+ fit_on_train: bool = True,
36
+ **kwargs
37
+ ) -> pd.DataFrame:
38
+ """
39
+ Scale data
40
+
41
+ Parameters:
42
+ -----------
43
+ data : pd.DataFrame
44
+ Input data
45
+ method : str, optional
46
+ Scaling method. If None, uses configuration value.
47
+ columns : List[str], optional
48
+ List of columns to scale. If None, uses all numeric columns.
49
+ target_col : str, optional
50
+ Target variable (not scaled by default)
51
+ fit_on_train : bool
52
+ Whether to save scaling parameters for applying to new data
53
+ **kwargs : dict
54
+ Additional parameters for method
55
+
56
+ Returns:
57
+ --------
58
+ pd.DataFrame
59
+ Scaled data
60
+ """
61
+ logger.info("\n" + "="*80)
62
+ logger.info("DATA SCALING")
63
+ logger.info("="*80)
64
+
65
+ method = method or self.config.scaling_method
66
+ data_scaled = data.copy()
67
+
68
+ if columns is None:
69
+ # Select all numeric columns except target
70
+ numeric_cols = data_scaled.select_dtypes(include=[np.number]).columns
71
+ if target_col and target_col in numeric_cols:
72
+ columns = [col for col in numeric_cols if col != target_col]
73
+ else:
74
+ columns = list(numeric_cols)
75
+
76
+ logger.info(f"Scaling method: {method}")
77
+ logger.info(f"Columns to process: {len(columns)}")
78
+
79
+ # Apply scaling
80
+ for col in columns:
81
+ if col in data_scaled.columns:
82
+ try:
83
+ # Check feature type
84
+ series = data_scaled[col].dropna()
85
+
86
+ # Special handling for different feature types
87
+ if self._is_binary_feature(series):
88
+ logger.debug(f" {col}: binary feature, scaling not applied")
89
+ scaler_info = {
90
+ 'method': 'none',
91
+ 'scaler_type': 'binary',
92
+ 'original_values': sorted(series.unique().tolist()),
93
+ 'note': 'binary feature, no scaling applied'
94
+ }
95
+ self.scaling_info[col] = scaler_info
96
+
97
+ if fit_on_train:
98
+ self.scalers[col] = scaler_info
99
+
100
+ elif self._is_categorical_feature(series):
101
+ logger.debug(f" {col}: categorical feature, using min-max")
102
+ scaled_series, scaler_info = self._apply_scaling(
103
+ data_scaled[col], 'minmax', fit_on_train, **kwargs
104
+ )
105
+ data_scaled[col] = scaled_series
106
+ self.scaling_info[col] = scaler_info
107
+
108
+ if fit_on_train:
109
+ if scaler_info.get('scaler_type') == 'sklearn':
110
+ self.scalers[col] = scaler_info['scaler_object']
111
+ else:
112
+ self.scalers[col] = scaler_info
113
+
114
+ else:
115
+ # Regular scaling for continuous features
116
+ scaled_series, scaler_info = self._apply_scaling(
117
+ data_scaled[col], method, fit_on_train, **kwargs
118
+ )
119
+ data_scaled[col] = scaled_series
120
+ self.scaling_info[col] = scaler_info
121
+
122
+ if fit_on_train:
123
+ if scaler_info.get('scaler_type') == 'sklearn':
124
+ self.scalers[col] = scaler_info['scaler_object']
125
+ else:
126
+ self.scalers[col] = scaler_info
127
+
128
+ except Exception as e:
129
+ logger.warning(f"Error processing column {col}: {e}")
130
+ # Save error information
131
+ self.scaling_info[col] = {
132
+ 'method': 'error',
133
+ 'error': str(e),
134
+ 'scaler_type': 'none'
135
+ }
136
+
137
+ logger.info(f"✓ Data processed using {method} method")
138
+
139
+ # Visualisation of results
140
+ if self.config.save_plots and columns:
141
+ self._plot_scaling_results(data, data_scaled, columns, method)
142
+
143
+ return data_scaled
144
+
145
+ def _is_binary_feature(self, series: pd.Series) -> bool:
146
+ """Check if feature is binary"""
147
+ unique_values = series.dropna().unique()
148
+ return len(unique_values) == 2 and set(unique_values).issubset({0, 1})
149
+
150
+ def _is_categorical_feature(self, series: pd.Series, max_categories: int = 10) -> bool:
151
+ """Check if feature is categorical"""
152
+ unique_values = series.dropna().unique()
153
+ return len(unique_values) <= max_categories and series.dtype in ['int64', 'float64']
154
+
155
+ def _apply_scaling(
156
+ self,
157
+ series: pd.Series,
158
+ method: str,
159
+ fit_on_train: bool,
160
+ **kwargs
161
+ ) -> Tuple[pd.Series, Dict]:
162
+ """Apply specific scaling method"""
163
+ series_clean = series.dropna()
164
+
165
+ if len(series_clean) == 0:
166
+ return series, {
167
+ 'method': 'none',
168
+ 'scaler_type': 'none',
169
+ 'error': 'all values are NaN'
170
+ }
171
+
172
+ scaler_info = {
173
+ 'method': method,
174
+ 'scaler_type': 'simple',
175
+ 'original_mean': float(series_clean.mean()),
176
+ 'original_std': float(series_clean.std()),
177
+ 'original_min': float(series_clean.min()),
178
+ 'original_max': float(series_clean.max()),
179
+ 'scaler': None,
180
+ 'scaler_object': None
181
+ }
182
+
183
+ try:
184
+ if method == 'standard':
185
+ # Standardisation (z-score normalisation)
186
+ mean = series_clean.mean()
187
+ std = series_clean.std()
188
+
189
+ if std > 0:
190
+ series_scaled = (series - mean) / std
191
+ scaler_info['scaler'] = {'mean': float(mean), 'std': float(std)}
192
+ else:
193
+ series_scaled = series - mean # If std = 0, just center
194
+ scaler_info['scaler'] = {'mean': float(mean), 'std': 0}
195
+
196
+ elif method == 'minmax':
197
+ # Min-Max normalisation
198
+ min_val = series_clean.min()
199
+ max_val = series_clean.max()
200
+
201
+ if max_val > min_val:
202
+ series_scaled = (series - min_val) / (max_val - min_val)
203
+ scaler_info['scaler'] = {'min': float(min_val), 'max': float(max_val)}
204
+ else:
205
+ series_scaled = series - min_val # If all values equal
206
+ scaler_info['scaler'] = {'min': float(min_val), 'max': float(min_val)}
207
+
208
+ elif method == 'robust':
209
+ # Robust scaling (outlier resistant)
210
+ # Check sufficient values for quartile calculation
211
+ if len(series_clean) >= 4:
212
+ median = series_clean.median()
213
+ q1 = series_clean.quantile(0.25)
214
+ q3 = series_clean.quantile(0.75)
215
+ iqr = q3 - q1
216
+
217
+ if iqr > 0:
218
+ series_scaled = (series - median) / iqr
219
+ scaler_info['scaler'] = {
220
+ 'median': float(median),
221
+ 'q1': float(q1),
222
+ 'q3': float(q3),
223
+ 'iqr': float(iqr)
224
+ }
225
+ else:
226
+ # If IQR = 0, use standard deviation
227
+ std = series_clean.std()
228
+ if std > 0:
229
+ series_scaled = (series - median) / std
230
+ scaler_info['scaler'] = {'median': float(median), 'std': float(std)}
231
+ else:
232
+ series_scaled = series - median
233
+ scaler_info['scaler'] = {'median': float(median), 'iqr': 0}
234
+ else:
235
+ # If insufficient data, use standardisation
236
+ mean = series_clean.mean()
237
+ std = series_clean.std()
238
+ if std > 0:
239
+ series_scaled = (series - mean) / std
240
+ scaler_info['scaler'] = {'mean': float(mean), 'std': float(std)}
241
+ scaler_info['method'] = 'standard' # Change method in info
242
+ else:
243
+ series_scaled = series - mean
244
+ scaler_info['scaler'] = {'mean': float(mean), 'std': 0}
245
+ scaler_info['method'] = 'standard'
246
+
247
+ elif method == 'log':
248
+ # Logarithmic transformation
249
+ min_val = series_clean.min()
250
+
251
+ if min_val <= 0:
252
+ shift = abs(min_val) + 1
253
+ series_scaled = np.log(series + shift)
254
+ scaler_info['scaler'] = {'shift': float(shift)}
255
+ else:
256
+ series_scaled = np.log(series)
257
+ scaler_info['scaler'] = {'shift': 0}
258
+
259
+ elif method == 'boxcox':
260
+ # Box-Cox transformation
261
+ try:
262
+ from scipy.stats import boxcox
263
+
264
+ min_val = series_clean.min()
265
+ if min_val <= 0:
266
+ shift = abs(min_val) + 1
267
+ series_to_transform = series + shift
268
+ else:
269
+ shift = 0
270
+ series_to_transform = series
271
+
272
+ transformed, lambda_val = boxcox(series_to_transform.dropna())
273
+
274
+ # Map transformed values back onto the original index
275
+ series_scaled = series.copy()
276
+ valid_mask = series_to_transform.notna()
277
+ series_scaled[valid_mask] = transformed
278
+
279
+ scaler_info['scaler'] = {
280
+ 'lambda': float(lambda_val),
281
+ 'shift': float(shift)
282
+ }
283
+
284
+ except Exception as e:
285
+ logger.warning(f"Box-Cox transformation failed for {series.name}: {e}")
286
+ # Return original series and change method
287
+ series_scaled = series
288
+ scaler_info['method'] = 'none'
289
+ scaler_info['scaler_type'] = 'none'
290
+ scaler_info['error'] = str(e)
291
+
292
+ elif method == 'quantile':
293
+ # Quantile transformation (rank-based)
294
+ try:
295
+ from sklearn.preprocessing import QuantileTransformer
296
+
297
+ qt = QuantileTransformer(
298
+ n_quantiles=kwargs.get('n_quantiles', min(100, len(series_clean))),
299
+ output_distribution=kwargs.get('output_distribution', 'normal'),
300
+ random_state=kwargs.get('random_state', 42)
301
+ )
302
+
303
+ series_reshaped = series.values.reshape(-1, 1)
304
+ series_scaled_values = qt.fit_transform(series_reshaped)
305
+ series_scaled = pd.Series(series_scaled_values.flatten(), index=series.index)
306
+
307
+ scaler_info['scaler_type'] = 'sklearn'
308
+ scaler_info['scaler_object'] = qt
309
+
310
+ except Exception as e:
311
+ logger.warning(f"Quantile transform failed for {series.name}: {e}")
312
+ series_scaled = series
313
+ scaler_info['method'] = 'none'
314
+ scaler_info['scaler_type'] = 'none'
315
+ scaler_info['error'] = str(e)
316
+
317
+ elif method == 'power':
318
+ # Power transform (Yeo-Johnson)
319
+ try:
320
+ from sklearn.preprocessing import PowerTransformer
321
+
322
+ pt = PowerTransformer(method='yeo-johnson', standardize=True)
323
+
324
+ series_reshaped = series.values.reshape(-1, 1)
325
+ series_scaled_values = pt.fit_transform(series_reshaped)
326
+ series_scaled = pd.Series(series_scaled_values.flatten(), index=series.index)
327
+
328
+ scaler_info['scaler_type'] = 'sklearn'
329
+ scaler_info['scaler_object'] = pt
330
+
331
+ except Exception as e:
332
+ logger.warning(f"Power transform failed for {series.name}: {e}")
333
+ series_scaled = series
334
+ scaler_info['method'] = 'none'
335
+ scaler_info['scaler_type'] = 'none'
336
+ scaler_info['error'] = str(e)
337
+
338
+ elif method == 'none':
339
+ # No scaling
340
+ series_scaled = series
341
+ scaler_info['method'] = 'none'
342
+ scaler_info['scaler_type'] = 'none'
343
+
344
+ else:
345
+ logger.warning(f"Unknown scaling method: {method}, using standardisation")
346
+ return self._apply_scaling(series, 'standard', fit_on_train, **kwargs)
347
+
348
+ # Add statistics after scaling
349
+ scaled_clean = series_scaled.dropna()
350
+ if len(scaled_clean) > 0:
351
+ scaler_info.update({
352
+ 'scaled_mean': float(scaled_clean.mean()),
353
+ 'scaled_std': float(scaled_clean.std()),
354
+ 'scaled_min': float(scaled_clean.min()),
355
+ 'scaled_max': float(scaled_clean.max()),
356
+ 'skewness_before': float(series_clean.skew()),
357
+ 'skewness_after': float(scaled_clean.skew()),
358
+ 'kurtosis_before': float(series_clean.kurtosis()),
359
+ 'kurtosis_after': float(scaled_clean.kurtosis())
360
+ })
361
+
362
+ return series_scaled, scaler_info
363
+
364
+ except Exception as e:
365
+ logger.warning(f"Error applying method {method} for {series.name}: {e}")
366
+ return series, {
367
+ 'method': 'error',
368
+ 'scaler_type': 'none',
369
+ 'error': str(e)
370
+ }
371
+
372
+ def transform(
373
+ self,
374
+ data: pd.DataFrame,
375
+ columns: List[str] = None
376
+ ) -> pd.DataFrame:
377
+ """
378
+ Apply saved scaling to new data
379
+
380
+ Parameters:
381
+ -----------
382
+ data : pd.DataFrame
383
+ New data
384
+ columns : List[str], optional
385
+ List of columns to transform
386
+
387
+ Returns:
388
+ --------
389
+ pd.DataFrame
390
+ Transformed data
391
+ """
392
+ if not self.scalers:
393
+ logger.warning("Scalers not trained, use fit_transform first")
394
+ return data
395
+
396
+ data_transformed = data.copy()
397
+
398
+ if columns is None:
399
+ columns = [col for col in self.scalers.keys() if col in data_transformed.columns]
400
+
401
+ for col in columns:
402
+ if col in data_transformed.columns and col in self.scalers:
403
+ try:
404
+ scaler_info = self.scaling_info.get(col, {})
405
+ scaler_data = self.scalers[col]
406
+ method = scaler_info.get('method', 'unknown')
407
+
408
+ # For binary features, do nothing
409
+ if method == 'none' and scaler_info.get('scaler_type') == 'binary':
410
+ continue
411
+
412
+ # Skip errors
413
+ if method == 'error':
414
+ continue
415
+
416
+ if isinstance(scaler_data, dict) and 'scaler' in scaler_data:
417
+ scaler_params = scaler_data['scaler']
418
+
419
+ if method == 'standard':
420
+ mean = scaler_params.get('mean', 0)
421
+ std = scaler_params.get('std', 1)
422
+ if std > 0:
423
+ data_transformed[col] = (data_transformed[col] - mean) / std
424
+
425
+ elif method == 'minmax':
426
+ min_val = scaler_params.get('min', 0)
427
+ max_val = scaler_params.get('max', 1)
428
+ if max_val > min_val:
429
+ data_transformed[col] = (data_transformed[col] - min_val) / (max_val - min_val)
430
+
431
+ elif method == 'robust':
432
+ median = scaler_params.get('median', 0)
433
+ iqr = scaler_params.get('iqr', 1)
434
+ if iqr > 0:
435
+ data_transformed[col] = (data_transformed[col] - median) / iqr
436
+ else:
437
+ std = scaler_params.get('std', 1)
438
+ if std > 0:
439
+ data_transformed[col] = (data_transformed[col] - median) / std
440
+
441
+ elif hasattr(scaler_data, 'transform'):
442
+ # For sklearn objects
443
+ from sklearn.base import BaseEstimator
444
+ if isinstance(scaler_data, BaseEstimator):
445
+ try:
446
+ transformed = scaler_data.transform(
447
+ data_transformed[[col]].values.reshape(-1, 1)
448
+ ).flatten()
449
+ data_transformed[col] = transformed
450
+ except Exception as e:
451
+ logger.warning(f"Error in sklearn transformation for {col}: {e}")
452
+
453
+ except Exception as e:
454
+ logger.warning(f"Error transforming column {col}: {e}")
455
+
456
+ return data_transformed
457
+
458
+ def inverse_transform(
459
+ self,
460
+ data: pd.DataFrame,
461
+ columns: List[str] = None
462
+ ) -> pd.DataFrame:
463
+ """
464
+ Inverse transform scaled data
465
+
466
+ Parameters:
467
+ -----------
468
+ data : pd.DataFrame
469
+ Scaled data
470
+ columns : List[str], optional
471
+ List of columns for inverse transform
472
+
473
+ Returns:
474
+ --------
475
+ pd.DataFrame
476
+ Data in original scale
477
+ """
478
+ if not self.scalers:
479
+ logger.warning("Scalers not trained")
480
+ return data
481
+
482
+ data_inverse = data.copy()
483
+
484
+ if columns is None:
485
+ columns = [col for col in self.scalers.keys() if col in data_inverse.columns]
486
+
487
+ for col in columns:
488
+ if col in data_inverse.columns and col in self.scalers:
489
+ try:
490
+ scaler_info = self.scaling_info.get(col, {})
491
+ scaler_data = self.scalers[col]
492
+ method = scaler_info.get('method', 'unknown')
493
+
494
+ # For binary and categorical features, do nothing
495
+ if method in ['none', 'error']:
496
+ continue
497
+
498
+ if isinstance(scaler_data, dict) and 'scaler' in scaler_data:
499
+ scaler_params = scaler_data['scaler']
500
+
501
+ if method == 'standard':
502
+ mean = scaler_params.get('mean', 0)
503
+ std = scaler_params.get('std', 1)
504
+ if std > 0:
+ data_inverse[col] = data_inverse[col] * std + mean
+ else:
+ data_inverse[col] = data_inverse[col] + mean  # std was 0 at fit time, so only the centring is undone
505
+
506
+ elif method == 'minmax':
507
+ min_val = scaler_params.get('min', 0)
508
+ max_val = scaler_params.get('max', 1)
509
+ if max_val > min_val:
510
+ data_inverse[col] = data_inverse[col] * (max_val - min_val) + min_val
511
+
512
+ elif method == 'robust':
513
+ median = scaler_params.get('median', 0)
514
+ iqr = scaler_params.get('iqr', 1)
515
+ if iqr > 0:
516
+ data_inverse[col] = data_inverse[col] * iqr + median
517
+ else:
518
+ std = scaler_params.get('std', 1)
519
+ if std > 0:
520
+ data_inverse[col] = data_inverse[col] * std + median
521
+
522
+ elif hasattr(scaler_data, 'inverse_transform'):
523
+ # For sklearn objects
524
+ from sklearn.base import BaseEstimator
525
+ if isinstance(scaler_data, BaseEstimator):
526
+ try:
527
+ inverse_transformed = scaler_data.inverse_transform(
528
+ data_inverse[[col]].values.reshape(-1, 1)
529
+ ).flatten()
530
+ data_inverse[col] = inverse_transformed
531
+ except Exception as e:
532
+ logger.warning(f"Error in sklearn inverse transformation for {col}: {e}")
533
+
534
+ except Exception as e:
535
+ logger.warning(f"Error in inverse transformation for column {col}: {e}")
536
+
537
+ return data_inverse
538
+
539
+ def _plot_scaling_results(
540
+ self,
541
+ original_data: pd.DataFrame,
542
+ scaled_data: pd.DataFrame,
543
+ columns: List[str],
544
+ method: str
545
+ ) -> None:
546
+ """Visualise scaling results"""
547
+ # Limit number of columns for visualisation
548
+ cols_to_plot = [col for col in columns if col in original_data.columns and col in scaled_data.columns][:8]
549
+
550
+ if not cols_to_plot:
551
+ return
552
+
553
+ n_cols = 4
554
+ n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols
555
+
556
+ fig, axes = plt.subplots(n_rows, n_cols * 2, figsize=(16, 4 * n_rows), squeeze=False)  # squeeze=False keeps axes 2-D when n_rows == 1
557
+
558
+ for idx, col in enumerate(cols_to_plot):
559
+ row = idx // n_cols
560
+ col_idx = (idx % n_cols) * 2
561
+
562
+ # Distribution before scaling
563
+ axes[row, col_idx].hist(
564
+ original_data[col].dropna(),
565
+ bins=30,
566
+ alpha=0.7,
567
+ color='blue',
568
+ density=True
569
+ )
570
+ axes[row, col_idx].set_title(f'{col} (before)', fontsize=10)
571
+ axes[row, col_idx].set_xlabel('Value')
572
+ axes[row, col_idx].set_ylabel('Density')
573
+ axes[row, col_idx].grid(True, alpha=0.3)
574
+
575
+ # Distribution after scaling
576
+ axes[row, col_idx + 1].hist(
577
+ scaled_data[col].dropna(),
578
+ bins=30,
579
+ alpha=0.7,
580
+ color='green',
581
+ density=True
582
+ )
583
+ axes[row, col_idx + 1].set_title(f'{col} (after)', fontsize=10)
584
+ axes[row, col_idx + 1].set_xlabel('Scaled value')
585
+ axes[row, col_idx + 1].set_ylabel('Density')
586
+ axes[row, col_idx + 1].grid(True, alpha=0.3)
587
+
588
+ # Hide unused subplots
589
+ total_plots = n_rows * n_cols * 2
590
+ for idx in range(len(cols_to_plot) * 2, total_plots):
591
+ row = idx // (n_cols * 2)
592
+ col_idx = idx % (n_cols * 2)
593
+ axes[row, col_idx].set_visible(False)
594
+
595
+ plt.suptitle(f'Scaling results using {method} method', fontsize=14)
596
+ plt.tight_layout()
597
+ plt.savefig(
598
+ f'{self.config.results_dir}/plots/scaling_results.png',
599
+ dpi=300,
600
+ bbox_inches='tight'
601
+ )
602
+ plt.show()
603
+
604
+ def get_report(self) -> Dict:
605
+ """Get scaling report"""
606
+ summary = {
607
+ 'total_columns': len(self.scaling_info),
608
+ 'methods_used': {},
609
+ 'binary_features': [],
610
+ 'categorical_features': [],
611
+ 'continuous_features': [],
612
+ 'errors': []
613
+ }
614
+
615
+ for col, info in self.scaling_info.items():
616
+ method = info.get('method', 'unknown')
617
+ if method not in summary['methods_used']:
618
+ summary['methods_used'][method] = 0
619
+ summary['methods_used'][method] += 1
620
+
621
+ if method == 'none' and info.get('scaler_type') == 'binary':
622
+ summary['binary_features'].append(col)
623
+ elif method in ['minmax', 'standard', 'robust']:
624
+ summary['continuous_features'].append(col)
625
+ elif method == 'error':
626
+ summary['errors'].append({
627
+ 'column': col,
628
+ 'error': info.get('error', 'unknown')
629
+ })
630
+
631
+ return {
632
+ 'summary': summary,
633
+ 'details': self.scaling_info
634
+ }
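
Because the simple methods ('standard', 'minmax', 'robust') persist plain parameter dicts rather than fitted sklearn objects, the `transform`/`inverse_transform` round trip reduces to arithmetic. A minimal standalone sketch with hypothetical numbers; `params` stands in for a fitted `self.scalers[col]['scaler']` entry:

```python
import pandas as pd

# Hypothetical fitted parameters, shaped like self.scalers[col]['scaler']
params = {'mean': 12.5, 'std': 3.2}

raw = pd.Series([10.0, 12.5, 18.9], name='temperature')

# Forward pass, as in transform(): z = (x - mean) / std
scaled = (raw - params['mean']) / params['std']

# Backward pass, as in inverse_transform(): x = z * std + mean
restored = scaled * params['std'] + params['mean']

assert (restored - raw).abs().max() < 1e-9
```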
splitting/__init__.py ADDED
File without changes
splitting/data_splitter.py ADDED
@@ -0,0 +1,403 @@
1
+ # ============================================
2
+ # CLASS 9: DATA SPLITTING
3
+ # ============================================
4
+ from datetime import datetime
5
+ from typing import Dict, Optional, Tuple
6
+ import logging
+
+ logger = logging.getLogger(__name__)
7
+ import pandas as pd
8
+ from config.config import Config
9
+ import numpy as np
10
+ import matplotlib.pyplot as plt
11
+
12
+
13
+ class DataSplitter:
14
+ """Class for splitting data into train, validation and test sets"""
15
+
16
+ def __init__(self, config: Config):
17
+ """
18
+ Initialise data splitter
19
+
20
+ Parameters:
21
+ -----------
22
+ config : Config
23
+ Experiment configuration
24
+ """
25
+ self.config = config
26
+ self.split_info = {}
27
+ self.split_indices = {}
28
+ self.split_strategy = None
29
+
30
+ def split(
31
+ self,
32
+ data: pd.DataFrame,
33
+ test_size: Optional[float] = None,
34
+ validation_size: Optional[float] = None,
35
+ method: str = None,
36
+ random_state: int = 42,
37
+ **kwargs
38
+ ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
39
+ """
40
+ Split data into train, validation and test sets
41
+
42
+ Parameters:
43
+ -----------
44
+ data : pd.DataFrame
45
+ Input data
46
+ test_size : float, optional
47
+ Test set size. If None, uses configuration value.
48
+ validation_size : float, optional
49
+ Validation set size. If None, uses configuration value.
50
+ method : str, optional
51
+ Splitting method: 'time', 'random', 'expanding_window', 'sliding_window'
52
+ random_state : int
53
+ Seed for reproducibility
54
+ **kwargs : dict
55
+ Additional parameters for method
56
+
57
+ Returns:
58
+ --------
59
+ Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]
60
+ Train, validation and test data
61
+ """
62
+ logger.info("\n" + "="*80)
63
+ logger.info("DATA SPLITTING")
64
+ logger.info("="*80)
65
+
66
+ test_size = test_size or self.config.test_size
67
+ validation_size = validation_size or self.config.validation_size
68
+ method = method or self.config.split_method
69
+
70
+ n = len(data)
71
+
72
+ logger.info(f"Total data: {n} records")
73
+ logger.info(f"Splitting method: {method}")
74
+ logger.info(f"Sizes: train={1-test_size-validation_size:.1%}, val={validation_size:.1%}, test={test_size:.1%}")
75
+
76
+ if method == 'time':
77
+ train_data, val_data, test_data = self._time_based_split(
78
+ data, test_size, validation_size
79
+ )
80
+ elif method == 'random':
81
+ train_data, val_data, test_data = self._random_split(
82
+ data, test_size, validation_size, random_state
83
+ )
84
+ elif method == 'expanding_window':
85
+ train_data, val_data, test_data = self._expanding_window_split(
86
+ data, test_size, validation_size, **kwargs
87
+ )
88
+ elif method == 'sliding_window':
89
+ train_data, val_data, test_data = self._sliding_window_split(
90
+ data, **kwargs
91
+ )
92
+ else:
93
+ logger.warning(f"Method {method} not supported, using time-based split")
94
+ train_data, val_data, test_data = self._time_based_split(
95
+ data, test_size, validation_size
96
+ )
97
+
98
+ # Save splitting information
99
+ self._save_split_info(data, train_data, val_data, test_data, method)
100
+
101
+ # Output information
102
+ self._log_split_summary(train_data, val_data, test_data)
103
+
104
+ # Visualisation of split
105
+ if self.config.save_plots:
106
+ self._plot_data_split(data, train_data, val_data, test_data)
107
+
108
+ return train_data, val_data, test_data
109
+
110
+ def _time_based_split(
111
+ self,
112
+ data: pd.DataFrame,
113
+ test_size: float,
114
+ validation_size: float
115
+ ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
116
+ """Time-based splitting preserving temporal order"""
117
+ n = len(data)
118
+
119
+ # Calculate set sizes
120
+ test_size_int = int(n * test_size)
121
+ val_size_int = int(n * validation_size)
122
+ train_size_int = n - test_size_int - val_size_int
123
+
124
+ # Split data
125
+ train_data = data.iloc[:train_size_int].copy()
126
+ val_data = data.iloc[train_size_int:train_size_int + val_size_int].copy()
127
+ test_data = data.iloc[train_size_int + val_size_int:].copy()
128
+
129
+ self.split_strategy = 'time_based'
130
+
131
+ return train_data, val_data, test_data
132
+
133
+ def _random_split(
134
+ self,
135
+ data: pd.DataFrame,
136
+ test_size: float,
137
+ validation_size: float,
138
+ random_state: int
139
+ ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
140
+ """Random data splitting"""
141
+ from sklearn.model_selection import train_test_split
142
+
143
+ # First split into train+val and test
144
+ train_val_data, test_data = train_test_split(
145
+ data,
146
+ test_size=test_size,
147
+ random_state=random_state,
148
+ shuffle=True
149
+ )
150
+
151
+ # Then split train+val into train and val
152
+ val_relative_size = validation_size / (1 - test_size)
153
+ train_data, val_data = train_test_split(
154
+ train_val_data,
155
+ test_size=val_relative_size,
156
+ random_state=random_state,
157
+ shuffle=True
158
+ )
159
+
160
+ self.split_strategy = 'random'
161
+
162
+ return train_data, val_data, test_data
163
+
164
+ def _expanding_window_split(
165
+ self,
166
+ data: pd.DataFrame,
167
+ test_size: float,
168
+ validation_size: float,
169
+ **kwargs
170
+ ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
171
+ """Expanding window split"""
172
+ n = len(data)
173
+
174
+ # Minimum initial window size
175
+ initial_window = kwargs.get('initial_window', max(100, int(n * 0.1)))
176
+
177
+ # Final set sizes
178
+ test_size_int = int(n * test_size)
179
+ val_size_int = int(n * validation_size)
180
+
181
+ # Determine boundaries
182
+ test_start = n - test_size_int
183
+ val_start = test_start - val_size_int
184
+
185
+ # For expanding window, use all data up to val_start for training
186
+ train_data = data.iloc[:val_start].copy()
187
+ val_data = data.iloc[val_start:test_start].copy()
188
+ test_data = data.iloc[test_start:].copy()
189
+
190
+ self.split_strategy = 'expanding_window'
191
+
192
+ return train_data, val_data, test_data
193
+
194
+ def _sliding_window_split(
195
+ self,
196
+ data: pd.DataFrame,
197
+ **kwargs
198
+ ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
199
+ """Sliding window split (for multiple train-val-test pairs)"""
200
+ window_size = kwargs.get('window_size', len(data) // 3)
201
+ step = kwargs.get('step', window_size // 2)  # currently unused; a multi-split version would iterate with this stride
202
+
203
+ # For simplicity, return a single split
204
+ # A fuller implementation could return a list of (train, val, test) windows
205
+ n = len(data)
206
+
207
+ train_end = n - window_size
208
+ val_end = train_end + window_size // 3
209
+ test_end = n
210
+
211
+ train_data = data.iloc[:train_end].copy()
212
+ val_data = data.iloc[train_end:val_end].copy()
213
+ test_data = data.iloc[val_end:].copy()
214
+
215
+ self.split_strategy = 'sliding_window'
216
+
217
+ return train_data, val_data, test_data
218
+
219
+ def _save_split_info(
220
+ self,
221
+ full_data: pd.DataFrame,
222
+ train_data: pd.DataFrame,
223
+ val_data: pd.DataFrame,
224
+ test_data: pd.DataFrame,
225
+ method: str
226
+ ) -> None:
227
+ """Save splitting information"""
228
+ n = len(full_data)
229
+
230
+ self.split_info = {
231
+ 'method': method,
232
+ 'strategy': self.split_strategy,
233
+ 'train_size': len(train_data),
234
+ 'val_size': len(val_data),
235
+ 'test_size': len(test_data),
236
+ 'train_percent': len(train_data) / n * 100,
237
+ 'val_percent': len(val_data) / n * 100,
238
+ 'test_percent': len(test_data) / n * 100,
239
+ 'total_samples': n,
240
+ 'features_count': len(full_data.columns),
241
+ 'split_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
242
+ }
243
+
244
+ # Add temporal period information if available
245
+ if isinstance(full_data.index, pd.DatetimeIndex):
246
+ self.split_info.update({
247
+ 'train_period': {
248
+ 'start': train_data.index.min().strftime('%Y-%m-%d'),
249
+ 'end': train_data.index.max().strftime('%Y-%m-%d')
250
+ },
251
+ 'val_period': {
252
+ 'start': val_data.index.min().strftime('%Y-%m-%d'),
253
+ 'end': val_data.index.max().strftime('%Y-%m-%d')
254
+ },
255
+ 'test_period': {
256
+ 'start': test_data.index.min().strftime('%Y-%m-%d'),
257
+ 'end': test_data.index.max().strftime('%Y-%m-%d')
258
+ }
259
+ })
260
+
261
+ # Save split indices
262
+ self.split_indices = {
263
+ 'train': train_data.index.tolist(),
264
+ 'val': val_data.index.tolist(),
265
+ 'test': test_data.index.tolist()
266
+ }
267
+
268
+ def _log_split_summary(
269
+ self,
270
+ train_data: pd.DataFrame,
271
+ val_data: pd.DataFrame,
272
+ test_data: pd.DataFrame
273
+ ) -> None:
274
+ """Log splitting summary"""
275
+ logger.info("✓ Data split completed:")
276
+ logger.info(f" Train: {len(train_data)} records ({self.split_info['train_percent']:.1f}%)")
277
+ logger.info(f" Validation: {len(val_data)} records ({self.split_info['val_percent']:.1f}%)")
278
+ logger.info(f" Test: {len(test_data)} records ({self.split_info['test_percent']:.1f}%)")
279
+
280
+ if 'train_period' in self.split_info:
281
+ logger.info(f"\nPeriods:")
282
+ logger.info(f" Train: {self.split_info['train_period']['start']} - {self.split_info['train_period']['end']}")
283
+ logger.info(f" Validation: {self.split_info['val_period']['start']} - {self.split_info['val_period']['end']}")
284
+ logger.info(f" Test: {self.split_info['test_period']['start']} - {self.split_info['test_period']['end']}")
285
+
286
+ # Target variable statistics
287
+ target = self.config.target_column
288
+ if target in train_data.columns:
289
+ logger.info(f"\nTarget variable '{target}' statistics:")
290
+ logger.info(f" Train: mean={train_data[target].mean():.2f}, std={train_data[target].std():.2f}")
291
+ logger.info(f" Validation: mean={val_data[target].mean():.2f}, std={val_data[target].std():.2f}")
292
+ logger.info(f" Test: mean={test_data[target].mean():.2f}, std={test_data[target].std():.2f}")
293
+
294
+ def _plot_data_split(
295
+ self,
296
+ full_data: pd.DataFrame,
297
+ train_data: pd.DataFrame,
298
+ val_data: pd.DataFrame,
299
+ test_data: pd.DataFrame
300
+ ) -> None:
301
+ """Visualise data splitting"""
302
+ fig, axes = plt.subplots(2, 2, figsize=(14, 10))
303
+
304
+ target = self.config.target_column
305
+
306
+ # 1. Time series with set highlighting
307
+ if target in full_data.columns and isinstance(full_data.index, pd.DatetimeIndex):
308
+ axes[0, 0].plot(train_data.index, train_data[target],
309
+ label='Train', color='blue', alpha=0.7, linewidth=1)
310
+ axes[0, 0].plot(val_data.index, val_data[target],
311
+ label='Validation', color='orange', alpha=0.7, linewidth=1)
312
+ axes[0, 0].plot(test_data.index, test_data[target],
313
+ label='Test', color='red', alpha=0.7, linewidth=1)
314
+
315
+ axes[0, 0].set_title(f'Data Split: {target}')
316
+ axes[0, 0].set_xlabel('Date')
317
+ axes[0, 0].set_ylabel(target)
318
+ axes[0, 0].legend()
319
+ axes[0, 0].grid(True, alpha=0.3)
320
+
321
+ # 2. Yearly distribution
322
+ if isinstance(full_data.index, pd.DatetimeIndex):
323
+ full_data['year'] = full_data.index.year
324
+ train_data['year'] = train_data.index.year
325
+ val_data['year'] = val_data.index.year
326
+ test_data['year'] = test_data.index.year
327
+
328
+ years = sorted(full_data['year'].unique())
329
+ train_counts = [len(train_data[train_data['year'] == year]) for year in years]
330
+ val_counts = [len(val_data[val_data['year'] == year]) for year in years]
331
+ test_counts = [len(test_data[test_data['year'] == year]) for year in years]
332
+
333
+ x = np.arange(len(years))
334
+ width = 0.25
335
+
336
+ axes[0, 1].bar(x - width, train_counts, width, label='Train', color='blue', alpha=0.7)
337
+ axes[0, 1].bar(x, val_counts, width, label='Validation', color='orange', alpha=0.7)
338
+ axes[0, 1].bar(x + width, test_counts, width, label='Test', color='red', alpha=0.7)
339
+
340
+ axes[0, 1].set_title('Yearly Data Distribution')
341
+ axes[0, 1].set_xlabel('Year')
342
+ axes[0, 1].set_ylabel('Number of Records')
343
+ axes[0, 1].set_xticks(x)
344
+ axes[0, 1].set_xticklabels(years, rotation=45)
345
+ axes[0, 1].legend()
346
+ axes[0, 1].grid(True, alpha=0.3)
347
+
348
+ # Remove added columns
349
+ for df in [full_data, train_data, val_data, test_data]:
350
+ if 'year' in df.columns:
351
+ df.drop('year', axis=1, inplace=True)
352
+
353
+ # 3. Target variable distribution
354
+ if target in full_data.columns:
355
+ axes[1, 0].hist(train_data[target].dropna(), bins=30, alpha=0.5, label='Train', density=True)
356
+ axes[1, 0].hist(val_data[target].dropna(), bins=30, alpha=0.5, label='Validation', density=True)
357
+ axes[1, 0].hist(test_data[target].dropna(), bins=30, alpha=0.5, label='Test', density=True)
358
+
359
+ axes[1, 0].set_title(f'{target} Distribution Across Sets')
360
+ axes[1, 0].set_xlabel(target)
361
+ axes[1, 0].set_ylabel('Density')
362
+ axes[1, 0].legend()
363
+ axes[1, 0].grid(True, alpha=0.3)
364
+
365
+ # 4. Set statistics
366
+ if target in full_data.columns:
367
+ stats_data = []
368
+ for name, df in [('Train', train_data), ('Validation', val_data), ('Test', test_data)]:
369
+ if target in df.columns:
370
+ stats_data.append({
371
+ 'Dataset': name,
372
+ 'Mean': df[target].mean(),
373
+ 'Std': df[target].std(),
374
+ 'Min': df[target].min(),
375
+ 'Max': df[target].max()
376
+ })
377
+
378
+ if stats_data:
379
+ stats_df = pd.DataFrame(stats_data)
380
+ stats_table = axes[1, 1].table(
381
+ cellText=stats_df.round(2).values,
382
+ colLabels=stats_df.columns,
383
+ cellLoc='center',
384
+ loc='center'
385
+ )
386
+ stats_table.auto_set_font_size(False)
387
+ stats_table.set_fontsize(9)
388
+ stats_table.scale(1, 1.5)
389
+ axes[1, 1].axis('off')
390
+ axes[1, 1].set_title('Set Statistics')
391
+
392
+ plt.suptitle(f'Data Splitting: {self.split_info["method"]} method', fontsize=14)
393
+ plt.tight_layout()
394
+ plt.savefig(
395
+ f'{self.config.results_dir}/plots/data_split.png',
396
+ dpi=300,
397
+ bbox_inches='tight'
398
+ )
399
+ plt.show()
400
+
401
+ def get_report(self) -> Dict:
402
+ """Get data splitting report"""
403
+ return self.split_info
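
For reference, `_time_based_split` is plain positional arithmetic with no shuffling, so nothing from the future can leak into the training window. A self-contained sketch with hypothetical sizes (15% test, 15% validation; the real values come from `Config`):

```python
import numpy as np
import pandas as pd

# Hypothetical daily series and split fractions (not the Config defaults)
idx = pd.date_range('2020-01-01', periods=1000, freq='D')
data = pd.DataFrame({'y': np.random.default_rng(42).normal(size=1000)}, index=idx)
test_size, validation_size = 0.15, 0.15

n = len(data)
n_test = int(n * test_size)
n_val = int(n * validation_size)
n_train = n - n_test - n_val

# Chronological order is preserved: train -> validation -> test
train = data.iloc[:n_train]
val = data.iloc[n_train:n_train + n_val]
test = data.iloc[n_train + n_val:]

assert len(train) + len(val) + len(test) == n
assert train.index.max() < val.index.min()
assert val.index.max() < test.index.min()
```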
src/streamlit_app.py DELETED
@@ -1,40 +0,0 @@
1
- import altair as alt
2
- import numpy as np
3
- import pandas as pd
4
- import streamlit as st
5
-
6
- """
7
- # Welcome to Streamlit!
8
-
9
- Edit `/streamlit_app.py` to customize this app to your heart's desire :heart:.
10
- If you have any questions, checkout our [documentation](https://docs.streamlit.io) and [community
11
- forums](https://discuss.streamlit.io).
12
-
13
- In the meantime, below is an example of what you can do with just a few lines of code:
14
- """
15
-
16
- num_points = st.slider("Number of points in spiral", 1, 10000, 1100)
17
- num_turns = st.slider("Number of turns in spiral", 1, 300, 31)
18
-
19
- indices = np.linspace(0, 1, num_points)
20
- theta = 2 * np.pi * num_turns * indices
21
- radius = indices
22
-
23
- x = radius * np.cos(theta)
24
- y = radius * np.sin(theta)
25
-
26
- df = pd.DataFrame({
27
- "x": x,
28
- "y": y,
29
- "idx": indices,
30
- "rand": np.random.randn(num_points),
31
- })
32
-
33
- st.altair_chart(alt.Chart(df, height=700, width=700)
34
- .mark_point(filled=True)
35
- .encode(
36
- x=alt.X("x", axis=None),
37
- y=alt.Y("y", axis=None),
38
- color=alt.Color("idx", legend=None, scale=alt.Scale()),
39
- size=alt.Size("rand", legend=None, scale=alt.Scale(range=[1, 150])),
40
- ))
stationarity/__init__.py ADDED
File without changes
stationarity/stationarity_checker.py ADDED
@@ -0,0 +1,631 @@
1
+ # ============================================
2
+ # CLASS 6: STATIONARITY ANALYSIS
3
+ # ============================================
4
+ from typing import Dict, Optional
5
+ import logging
+
+ logger = logging.getLogger(__name__)
6
+ from config.config import Config
7
+ import pandas as pd
8
+ import numpy as np
9
+ import matplotlib.pyplot as plt
10
+ from statsmodels.tsa.stattools import adfuller, kpss, acf, pacf
11
+ from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
12
+
13
+ class StationarityChecker:
14
+ """Class for checking time series stationarity"""
15
+
16
+ def __init__(self, config: Config):
17
+ """
18
+ Initialise stationarity checker
19
+
20
+ Parameters:
21
+ -----------
22
+ config : Config
23
+ Experiment configuration
24
+ """
25
+ self.config = config
26
+ self.test_results = {}
27
+ self.transformed_series = {}
28
+ self.best_transformation = {}
+ self.transformed_data = None
29
+
30
+ def check(
31
+ self,
32
+ data: pd.DataFrame,
33
+ target_col: Optional[str] = None,
34
+ make_stationary: bool = True,
35
+ try_transformations: bool = True
36
+ ) -> Dict:
37
+ """
38
+ Check time series stationarity
39
+
40
+ Parameters:
41
+ -----------
42
+ data : pd.DataFrame
43
+ Input data
44
+ target_col : str, optional
45
+ Target variable. If None, uses configuration default.
46
+ make_stationary : bool
47
+ Transform series to stationary form
48
+ try_transformations : bool
49
+ Try various transformations to achieve stationarity
50
+
51
+ Returns:
52
+ --------
53
+ Dict
54
+ Stationarity test results
55
+ """
56
+ logger.info("\n" + "="*80)
57
+ logger.info("STATIONARITY ANALYSIS")
58
+ logger.info("="*80)
59
+
60
+ target_col = target_col or self.config.target_column
61
+
62
+ if target_col not in data.columns:
63
+ logger.error(f"Target variable '{target_col}' not found")
64
+ return {}
65
+
66
+ series = data[target_col].dropna()
67
+
68
+ if len(series) < 10:
69
+ logger.warning("Insufficient data for stationarity analysis")
70
+ return {}
71
+
72
+ # Perform analysis
73
+ results = self._perform_stationarity_tests(series, target_col)
74
+
75
+ # Save results
76
+ self.test_results[target_col] = results
77
+
78
+ # Visualisation
79
+ if self.config.save_plots:
80
+ self._plot_stationarity_analysis(data, target_col, results)
81
+
82
+ # Log results
83
+ self._log_test_results(target_col, results)
84
+
85
+ # Transform to stationary form
86
+ if make_stationary and not results['overall']['is_stationary']:
87
+ if try_transformations:
88
+ transformed_data = self._make_stationary(data, target_col, results)
89
+ if transformed_data is not None:
90
+ # keep the augmented frame for callers; check() itself returns only the test dict
+ self.transformed_data = transformed_data
91
+
92
+ return results
93
+
94
+
95
+ def _perform_stationarity_tests(
96
+ self,
97
+ series: pd.Series,
98
+ target_col: str
99
+ ) -> Dict:
100
+ """Perform various stationarity tests"""
101
+ results = {
102
+ 'adf': self._adf_test(series),
103
+ 'kpss': self._kpss_test(series),
104
+ 'pp': self._pp_test(series),
105
+ 'hurst': self._hurst_exponent(series),
106
+ 'variance_ratio': self._variance_ratio_test(series),
107
+ 'overall': {}
108
+ }
109
+
110
+ # Determine overall stationarity
111
+ adf_stationary = results['adf'].get('is_stationary', False)
112
+ kpss_stationary = results['kpss'].get('is_stationary', False)
113
+ pp_stationary = results['pp'].get('is_stationary', False)
114
+
115
+ # Stationarity determination logic
116
+ if adf_stationary and kpss_stationary:
117
+ overall_stationary = True
118
+ confidence = 'high'
119
+ elif adf_stationary and not kpss_stationary:
120
+ overall_stationary = True  # tests disagree: ADF rejects a unit root, so lean towards stationary at reduced confidence
121
+ confidence = 'medium'
122
+ elif not adf_stationary and kpss_stationary:
123
+ overall_stationary = False # KPSS indicates non-stationarity
124
+ confidence = 'medium'
125
+ else:
126
+ overall_stationary = False
127
+ confidence = 'high'
128
+
129
+ results['overall'] = {
130
+ 'is_stationary': overall_stationary,
131
+ 'confidence': confidence,
132
+ 'recommendation': self._get_stationarity_recommendation(results)
133
+ }
134
+
135
+ return results
136
+
137
+ def _adf_test(self, series: pd.Series) -> Dict:
138
+ """Augmented Dickey-Fuller (ADF) test"""
139
+ try:
140
+ adf_result = adfuller(series, autolag='AIC')
141
+
142
+ return {
143
+ 'statistic': float(adf_result[0]),
144
+ 'pvalue': float(adf_result[1]),
145
+ 'critical_values': {k: float(v) for k, v in adf_result[4].items()},
146
+ 'is_stationary': adf_result[1] < 0.05,
147
+ 'used_lag': int(adf_result[2]),
148
+ 'nobs': int(adf_result[3])
149
+ }
150
+ except Exception as e:
151
+ logger.warning(f"ADF test failed: {e}")
152
+ return {
153
+ 'statistic': np.nan,
154
+ 'pvalue': np.nan,
155
+ 'critical_values': {},
156
+ 'is_stationary': False,
157
+ 'error': str(e)
158
+ }
159
+
160
+ def _kpss_test(self, series: pd.Series) -> Dict:
161
+ """KPSS test"""
162
+ try:
163
+ kpss_result = kpss(series, regression='c', nlags='auto')
164
+
165
+ return {
166
+ 'statistic': float(kpss_result[0]),
167
+ 'pvalue': float(kpss_result[1]),
168
+ 'critical_values': {k: float(v) for k, v in kpss_result[3].items()},
169
+ 'is_stationary': kpss_result[1] > 0.05, # KPSS: p > 0.05 indicates stationarity
170
+ 'used_lag': int(kpss_result[2])
171
+ }
172
+ except Exception as e:
173
+ logger.warning(f"KPSS test failed: {e}")
174
+ return {
175
+ 'statistic': np.nan,
176
+ 'pvalue': np.nan,
177
+ 'critical_values': {},
178
+ 'is_stationary': False,
179
+ 'error': str(e)
180
+ }
181
+
182
+ def _pp_test(self, series: pd.Series) -> Dict:
183
+ """Phillips-Perron test"""
184
+ try:
185
+ # Phillips-Perron test (provided by the arch package, not statsmodels)
186
+ from arch.unitroot import PhillipsPerron
187
+
188
+ pp_result = PhillipsPerron(series)
189
+
190
+ return {
191
+ 'statistic': float(pp_result.stat),
192
+ 'pvalue': float(pp_result.pvalue),
193
+ 'critical_values': pp_result.critical_values,
194
+ 'is_stationary': pp_result.pvalue < 0.05
195
+ }
196
+ except Exception:
197
+ # The arch package is not installed, so the PP test is unavailable
198
+ return {
199
+ 'statistic': np.nan,
200
+ 'pvalue': np.nan,
201
+ 'critical_values': {},
202
+ 'is_stationary': False,
203
+ 'note': 'Phillips-Perron test not available'
204
+ }
205
+
206
+ def _hurst_exponent(self, series: pd.Series) -> Dict:
207
+ """Calculate Hurst exponent"""
208
+ try:
209
+ # Simplified Hurst exponent calculation
210
+ lags = range(2, min(100, len(series)//4))
211
+ tau = []
212
+
213
+ for lag in lags:
214
+ # Split series into subsequences of length lag
215
+ n = len(series) // lag
216
+ if n < 2:
217
+ continue
218
+
219
+ subseries = [series[i*lag:(i+1)*lag] for i in range(n)]
220
+ # Calculate R/S for each subsequence
221
+ rs_values = []
222
+ for sub in subseries:
223
+ if len(sub) > 1:
224
+ mean = np.mean(sub)
225
+ deviations = sub - mean
226
+ z = np.cumsum(deviations)
227
+ r = np.max(z) - np.min(z)
228
+ s = np.std(sub)
229
+ if s > 0:
230
+ rs_values.append(r / s)
231
+
232
+ if rs_values:
233
+ tau.append(np.mean(rs_values))
234
+
235
+ if len(tau) > 2:
236
+ # Linear regression in log coordinates
237
+ x = np.log(lags[:len(tau)])
238
+ y = np.log(tau)
239
+
240
+ if len(x) > 1 and len(y) > 1:
241
+ slope = np.polyfit(x, y, 1)[0]
242
+
243
+ # Hurst exponent interpretation
244
+ if slope > 0.5:
245
+ trend_type = 'persistent'
246
+ elif slope < 0.5:
247
+ trend_type = 'anti-persistent'
248
+ else:
249
+ trend_type = 'random'
250
+
251
+ return {
252
+ 'exponent': float(slope),
253
+ 'trend_type': trend_type,
254
+ 'interpretation': self._interpret_hurst(slope)
255
+ }
256
+
257
+ return {
258
+ 'exponent': np.nan,
259
+ 'trend_type': 'unknown',
260
+ 'interpretation': 'Insufficient data'
261
+ }
262
+
263
+ except Exception as e:
264
+ logger.debug(f"Hurst exponent not calculated: {e}")
265
+ return {
266
+ 'exponent': np.nan,
267
+ 'trend_type': 'unknown',
268
+ 'error': str(e)
269
+ }
270
+
271
+ def _interpret_hurst(self, hurst_exponent: float) -> str:
272
+ """Interpret Hurst exponent"""
273
+ if hurst_exponent > 0.75:
274
+ return "Strong persistence (long-term memory)"
275
+ elif hurst_exponent > 0.6:
276
+ return "Moderate persistence"
277
+ elif hurst_exponent > 0.4:
278
+ return "Weak persistence / random walk"
279
+ elif hurst_exponent > 0.25:
280
+ return "Weak anti-persistence"
281
+ else:
282
+ return "Strong anti-persistence (frequent trend reversal)"
283
+
284
+ def _variance_ratio_test(self, series: pd.Series) -> Dict:
285
+ """Variance Ratio test for random walk"""
286
+ try:
287
+ # Simplified variance ratio test
288
+ if len(series) < 20:
289
+ return {'ratio': np.nan, 'is_random_walk': False}
290
+
291
+ # Calculate differences
292
+ diff1 = series.diff(1).dropna()
293
+ diff2 = series.diff(2).dropna()[1:] # Shift to align indices
294
+
295
+ if len(diff1) < 5 or len(diff2) < 5:
296
+ return {'ratio': np.nan, 'is_random_walk': False}
297
+
298
+ var1 = np.var(diff1)
299
+ var2 = np.var(diff2)
300
+
301
+ if var1 > 0:
302
+ ratio = var2 / (2 * var1)
303
+
304
+ # For random walk ratio ≈ 1
305
+ is_random_walk = 0.8 < ratio < 1.2
306
+
307
+ return {
308
+ 'ratio': float(ratio),
309
+ 'is_random_walk': bool(is_random_walk),
310
+ 'var_diff1': float(var1),
311
+ 'var_diff2': float(var2)
312
+ }
313
+ else:
314
+ return {'ratio': np.nan, 'is_random_walk': False}
315
+
316
+ except Exception as e:
317
+ logger.debug(f"Variance ratio test failed: {e}")
318
+ return {'ratio': np.nan, 'is_random_walk': False, 'error': str(e)}
319
+
320
+ def _get_stationarity_recommendation(self, results: Dict) -> str:
321
+ """Get stationarity recommendations"""
322
+ # Check for keys before access
323
+ if 'overall' not in results or 'is_stationary' not in results['overall']:
324
+ return "Could not determine stationarity. Check data and test settings."
325
+
326
+ if results['overall']['is_stationary']:
327
+ return "Series is stationary, suitable for modelling"
328
+ else:
329
+ recommendations = []
330
+
331
+ # Check Hurst test results
332
+ if 'hurst' in results and 'exponent' in results['hurst']:
333
+ hurst_exponent = results['hurst']['exponent']
334
+ if not np.isnan(hurst_exponent) and hurst_exponent > 0.6:
335
+ recommendations.append("Apply differencing to remove trend")
336
+
337
+ # Check ADF test
338
+ if 'adf' in results and 'pvalue' in results['adf']:
339
+ adf_pvalue = results['adf']['pvalue']
340
+ if not np.isnan(adf_pvalue) and adf_pvalue > 0.1:
341
+ recommendations.append("Consider seasonal differencing due to non-stationarity")
342
+
343
+ if len(recommendations) == 0:
344
+ recommendations.append("Try logarithmic transformation and differencing")
345
+
346
+ return "; ".join(recommendations)
347
+
348
+ def _plot_stationarity_analysis(
349
+ self,
350
+ data: pd.DataFrame,
351
+ target_col: str,
352
+ results: Dict
353
+ ) -> None:
354
+ """Visualise stationarity analysis"""
355
+ series = data[target_col]
356
+
357
+ fig, axes = plt.subplots(2, 3, figsize=(16, 10))
358
+
359
+ # 1. Original series
360
+ axes[0, 0].plot(series.index, series, linewidth=1)
361
+ axes[0, 0].set_title(f'Original Time Series: {target_col}')
362
+ axes[0, 0].set_xlabel('Date')
363
+ axes[0, 0].set_ylabel(target_col)
364
+ axes[0, 0].grid(True, alpha=0.3)
365
+
366
+ # 2. Rolling statistics
367
+ rolling_mean = series.rolling(window=365, center=True, min_periods=1).mean()  # 365-step window assumes daily data
368
+ rolling_std = series.rolling(window=365, center=True, min_periods=1).std()
369
+
370
+ axes[0, 1].plot(series.index, series, label='Original series', alpha=0.7, linewidth=0.5)
371
+ axes[0, 1].plot(rolling_mean.index, rolling_mean, label='Rolling mean (365)', color='red', linewidth=2)
372
+ axes[0, 1].plot(rolling_std.index, rolling_std, label='Rolling STD (365)', color='green', linewidth=2)
373
+ axes[0, 1].set_title(f'Rolling Statistics: {target_col}')
374
+ axes[0, 1].set_xlabel('Date')
375
+ axes[0, 1].set_ylabel(target_col)
376
+ axes[0, 1].legend(fontsize=8)
377
+ axes[0, 1].grid(True, alpha=0.3)
378
+
379
+ # 3. ACF
380
+ plot_acf(series.dropna(), lags=50, ax=axes[0, 2], alpha=0.05)
381
+ axes[0, 2].set_title(f'Autocorrelation Function (ACF): {target_col}')
382
+ axes[0, 2].set_xlabel('Lag')
383
+ axes[0, 2].set_ylabel('Autocorrelation')
384
+ axes[0, 2].grid(True, alpha=0.3)
385
+
386
+ # 4. PACF
387
+ plot_pacf(series.dropna(), lags=50, ax=axes[1, 0], alpha=0.05)
388
+ axes[1, 0].set_title(f'Partial Autocorrelation Function (PACF): {target_col}')
389
+ axes[1, 0].set_xlabel('Lag')
390
+ axes[1, 0].set_ylabel('Partial Autocorrelation')
391
+ axes[1, 0].grid(True, alpha=0.3)
392
+
393
+ # 5. Histogram and Q-Q plot
394
+ axes[1, 1].hist(series.dropna(), bins=30, edgecolor='black', alpha=0.7, density=True)
395
+ axes[1, 1].set_title(f'Distribution: {target_col}')
396
+ axes[1, 1].set_xlabel('Value')
397
+ axes[1, 1].set_ylabel('Density')
398
+ axes[1, 1].grid(True, alpha=0.3)
399
+
400
+ # 6. Series differences
401
+ diff1 = series.diff(1).dropna()
402
+ axes[1, 2].plot(diff1.index, diff1, linewidth=0.5)
403
+ axes[1, 2].set_title(f'First Difference: {target_col}')
404
+ axes[1, 2].set_xlabel('Date')
405
+ axes[1, 2].set_ylabel(f'Δ{target_col}')
406
+ axes[1, 2].grid(True, alpha=0.3)
407
+
408
+ plt.suptitle(
409
+ f'Stationarity Analysis: {target_col}\n'
410
+ f'Stationary: {"✓ Yes" if results["overall"]["is_stationary"] else "✗ No"} '
411
+ f'(confidence: {results["overall"]["confidence"]})',
412
+ fontsize=14
413
+ )
414
+
415
+ plt.tight_layout()
416
+ plt.savefig(
417
+ f'{self.config.results_dir}/plots/stationarity_{target_col}.png',
418
+ dpi=300,
419
+ bbox_inches='tight'
420
+ )
421
+ plt.show()
422
+
423
+ def _log_test_results(self, target_col: str, results: Dict) -> None:
424
+ """Log test results"""
425
+ logger.info("\nSTATIONARITY TEST RESULTS:")
426
+ logger.info("-" * 50)
427
+
428
+ # ADF test
429
+ adf = results['adf']
430
+ logger.info(f"Augmented Dickey-Fuller (ADF) test:")
431
+ logger.info(f" Statistic: {adf['statistic']:.4f}")
432
+ logger.info(f" p-value: {adf['pvalue']:.4f}")
433
+ logger.info(f" Stationary: {'✓ Yes' if adf['is_stationary'] else '✗ No'}")
434
+
435
+ # KPSS test
436
+ kpss_test = results['kpss']
437
+ if 'statistic' in kpss_test and not np.isnan(kpss_test['statistic']):
438
+ logger.info(f"\nKPSS test:")
439
+ logger.info(f" Statistic: {kpss_test['statistic']:.4f}")
440
+ logger.info(f" p-value: {kpss_test['pvalue']:.4f}")
441
+ logger.info(f" Stationary: {'✓ Yes' if kpss_test['is_stationary'] else '✗ No'}")
442
+
443
+ # Hurst exponent
444
+ hurst = results['hurst']
445
+ if 'exponent' in hurst and not np.isnan(hurst['exponent']):
446
+ logger.info(f"\nHurst exponent:")
447
+ logger.info(f" Value: {hurst['exponent']:.3f}")
448
+ logger.info(f" Trend type: {hurst['trend_type']}")
449
+ logger.info(f" Interpretation: {hurst.get('interpretation', '')}")
450
+
451
+ # Overall interpretation
452
+ logger.info(f"\nOVERALL CONCLUSION:")
453
+ logger.info("-" * 30)
454
+ logger.info(f"Stationary: {'✓ Yes' if results['overall']['is_stationary'] else '✗ No'}")
455
+ logger.info(f"Confidence: {results['overall']['confidence']}")
456
+ logger.info(f"Recommendation: {results['overall']['recommendation']}")
457
+
458
+ def _make_stationary(
459
+ self,
460
+ data: pd.DataFrame,
461
+ target_col: str,
462
+ results: Dict
463
+ ) -> Optional[pd.DataFrame]:
464
+ """
465
+ Transform series to stationary form
466
+
467
+ Parameters:
468
+ -----------
469
+ data : pd.DataFrame
470
+ Input data
471
+ target_col : str
472
+ Target variable
473
+ results : Dict
474
+ Stationarity test results
475
+
476
+ Returns:
477
+ --------
478
+ Optional[pd.DataFrame]
479
+ Data with stationary series or None if transformation failed
480
+ """
481
+ logger.info("\nTRANSFORMING TO STATIONARY FORM:")
482
+ logger.info("-" * 40)
483
+
484
+ data_processed = data.copy()
485
+ series = data_processed[target_col]
486
+
487
+ # Stationarisation methods in order of preference
488
+ methods = [
489
+ ('diff', 'first-order differencing'),
490
+ ('seasonal_diff', f'seasonal differencing (period={self.config.seasonal_period})'),
491
+ ('log_diff', 'logarithmic differencing'),
492
+ ('boxcox_diff', 'Box-Cox + differencing'),
493
+ ('detrend', 'detrending'),
494
+ ('combination', 'combined method')
495
+ ]
496
+
497
+ best_method = None
498
+ best_series = None
499
+ best_pvalue = 1.0
500
+ best_stationary = False
501
+
502
+ for method, method_name in methods:
503
+ try:
504
+ if method == 'diff':
505
+ # Simple differencing
506
+ transformed = series.diff(1).dropna()
507
+ test_series = transformed
508
+
509
+ elif method == 'seasonal_diff':
510
+ # Seasonal differencing
511
+ transformed = series.diff(self.config.seasonal_period).dropna()
512
+ test_series = transformed
513
+
514
+ elif method == 'log_diff':
515
+ # Logarithmic differencing
516
+ if (series > 0).all():
517
+ log_series = np.log(series)
518
+ transformed = log_series.diff(1).dropna()
519
+ test_series = transformed
520
+ else:
521
+ # Shift for negative values
522
+ shift = abs(series.min()) + 1 if series.min() <= 0 else 0
523
+ log_series = np.log(series + shift)
524
+ transformed = log_series.diff(1).dropna()
525
+ test_series = transformed
526
+
527
+ elif method == 'boxcox_diff':
528
+ # Box-Cox transformation + differencing
529
+ try:
530
+ from scipy.stats import boxcox
531
+ # Add constant for positive values
532
+ shift = abs(series.min()) + 1 if series.min() <= 0 else 0
533
+ boxcox_series, _ = boxcox(series + shift)
534
+ transformed = pd.Series(boxcox_series, index=series.index).diff(1).dropna()
535
+ test_series = transformed
536
+ except Exception:
537
+ continue
538
+
539
+ elif method == 'detrend':
540
+ # Linear detrending
541
+ x = np.arange(len(series))
542
+ y = series.values
543
+ coeffs = np.polyfit(x, y, 1)
544
+ trend = np.polyval(coeffs, x)
545
+ transformed = pd.Series(y - trend, index=series.index)
546
+ test_series = transformed
547
+
548
+ elif method == 'combination':
549
+ # Combined method: log + differencing + detrending
550
+ if (series > 0).all():
551
+ log_series = np.log(series)
552
+ diff_series = log_series.diff(1)
553
+
554
+ # Detrending residuals
555
+ x = np.arange(len(diff_series))
556
+ y = diff_series.values
557
+ valid_mask = ~np.isnan(y)
558
+
559
+ if valid_mask.sum() > 2:
560
+ coeffs = np.polyfit(x[valid_mask], y[valid_mask], 1)
561
+ trend = np.polyval(coeffs, x)
562
+ transformed = pd.Series(y - trend, index=series.index)
563
+ test_series = transformed.dropna()
564
+ else:
565
+ test_series = diff_series.dropna()
566
+ else:
567
+ continue
568
+
569
+ # Check stationarity after transformation
570
+ if len(test_series) > 10:
571
+ adf_result = adfuller(test_series.dropna())
572
+ is_stationary = adf_result[1] < 0.05
573
+ pvalue = adf_result[1]
574
+
575
+ logger.info(f" Method: {method_name}")
576
+ logger.info(f" ADF p-value: {pvalue:.4f}")
577
+ logger.info(f" Stationary: {'✓ Yes' if is_stationary else '✗ No'}")
578
+
579
+ # Save best method
580
+ if is_stationary and pvalue < best_pvalue:
581
+ best_pvalue = pvalue
582
+ best_method = method
583
+ best_series = transformed
584
+ best_stationary = True
585
+
586
+ if pvalue < 0.01: # Very good result
587
+ break
588
+
589
+ except Exception as e:
590
+ logger.debug(f" Method {method} failed: {e}")
591
+ continue
592
+
593
+ # Save results
594
+ if best_series is not None:
595
+ new_col_name = f'{target_col}_stationary_{best_method}'
596
+
597
+ # Align indices
598
+ aligned_series = pd.Series(
599
+ best_series.values,
600
+ index=data_processed.index[-len(best_series):]
601
+ )
602
+
603
+ data_processed[new_col_name] = aligned_series
604
+
605
+ self.transformed_series[target_col] = {
606
+ 'method': best_method,
607
+ 'new_column': new_col_name,
608
+ 'pvalue': float(best_pvalue),
609
+ 'is_stationary': best_stationary,
610
+ 'original_shape': len(series),
611
+ 'transformed_shape': len(best_series)
612
+ }
613
+
614
+ self.best_transformation[target_col] = best_method
615
+
616
+ logger.info(f"\n✓ Selected method: {best_method}")
617
+ logger.info(f" Saved as '{new_col_name}'")
618
+ logger.info(f" p-value: {best_pvalue:.4f}")
619
+
620
+ return data_processed
621
+ else:
622
+ logger.warning("✗ Could not find suitable transformation for stationarisation")
623
+ return None
624
+
625
+ def get_report(self) -> Dict:
626
+ """Get stationarity report"""
627
+ return {
628
+ 'test_results': self.test_results,
629
+ 'transformed_series': self.transformed_series,
630
+ 'best_transformations': self.best_transformation
631
+ }
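
The decision rule in `_perform_stationarity_tests` combines two tests with opposite null hypotheses: ADF assumes a unit root, KPSS assumes stationarity, so "ADF p < 0.05 and KPSS p > 0.05" is the agreeing case. A minimal sketch of that combination on a synthetic random walk, which first differencing stationarises:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller, kpss

rng = np.random.default_rng(0)
walk = pd.Series(np.cumsum(rng.normal(size=500)))  # synthetic random walk (non-stationary)

adf_p = adfuller(walk, autolag='AIC')[1]
kpss_p = kpss(walk, regression='c', nlags='auto')[1]
print(f"level:      ADF p={adf_p:.3f} (stationary if < 0.05), KPSS p={kpss_p:.3f} (stationary if > 0.05)")

diff_p = adfuller(walk.diff().dropna(), autolag='AIC')[1]
print(f"difference: ADF p={diff_p:.3f}")
```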
streamlit/streamlit_app.py ADDED
The diff for this file is too large to render. See raw diff
 
temp_data.csv ADDED
The diff for this file is too large to render. See raw diff
 
validation/__init__.py ADDED
File without changes
validation/data_validator.py ADDED
@@ -0,0 +1,655 @@
1
+ # ============================================
2
+ # CLASS 12: DATA VALIDATION
3
+ # ============================================
4
+ from datetime import datetime
5
+ import json
6
+ from pathlib import Path
7
+ from typing import Dict, List
8
+ import logging
+
+ logger = logging.getLogger(__name__)
9
+
10
+ from config.config import Config
11
+ import pandas as pd
12
+ import numpy as np
13
+
14
+ class DataValidator:
15
+ """Class for data quality validation"""
16
+
17
+ def __init__(self, config: Config):
18
+ """
19
+ Initialise data validator
20
+
21
+ Parameters:
22
+ -----------
23
+ config : Config
24
+ Experiment configuration
25
+ """
26
+ self.config = config
27
+ self.validation_results = {}
28
+ self.quality_metrics = {}
29
+ self.issues_found = {}
30
+
31
+ def validate(
32
+ self,
33
+ data: pd.DataFrame,
34
+ stage: str = 'final',
35
+ rules: Dict = None,
36
+ detailed: bool = True
37
+ ) -> Dict:
38
+ """
39
+ Validate data quality
40
+
41
+ Parameters:
42
+ -----------
43
+ data : pd.DataFrame
44
+ Input data
45
+ stage : str
46
+ Validation stage: 'raw', 'processed', 'final'
47
+ rules : Dict, optional
48
+ Validation rules. If None, uses configuration defaults.
49
+ detailed : bool
50
+ Whether to perform detailed validation
51
+
52
+ Returns:
53
+ --------
54
+ Dict
55
+ Validation results
56
+ """
57
+ logger.info("\n" + "="*80)
58
+ logger.info(f"DATA VALIDATION ({stage.upper()})")
59
+ logger.info("="*80)
60
+
61
+ rules = rules or self.config.validation_rules
62
+
63
+ validation_results = {
64
+ 'stage': stage,
65
+ 'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
66
+ 'data_shape': list(data.shape),
67
+ 'basic_checks': {},
68
+ 'quality_metrics': {},
69
+ 'issues': {},
70
+ 'recommendations': [],
71
+ 'overall_score': 0,
72
+ 'status': 'PASS'
73
+ }
74
+
75
+ # Basic checks
76
+ validation_results['basic_checks'] = self._basic_checks(data, rules)
77
+
78
+ # Quality checks
79
+ validation_results['quality_metrics'] = self._quality_metrics(data, rules)
80
+
81
+ # Problem detection
82
+ if detailed:
83
+ validation_results['issues'] = self._find_issues(data, rules)
84
+
85
+ # Recommendation generation
86
+ validation_results['recommendations'] = self._generate_recommendations(
87
+ validation_results['basic_checks'],
88
+ validation_results['quality_metrics'],
89
+ validation_results['issues']
90
+ )
91
+
92
+ # Overall score calculation
93
+ validation_results['overall_score'] = self._calculate_overall_score(validation_results)
94
+
95
+ # Status determination
96
+ if validation_results['overall_score'] >= 80:
97
+ validation_results['status'] = 'PASS'
98
+ elif validation_results['overall_score'] >= 60:
99
+ validation_results['status'] = 'WARNING'
100
+ else:
101
+ validation_results['status'] = 'FAIL'
102
+
103
+ # Save results
104
+ self.validation_results[stage] = validation_results
105
+ self.quality_metrics[stage] = validation_results['quality_metrics']
106
+
107
+ # Log results
108
+ self._log_validation_results(validation_results)
109
+
110
+ return validation_results
111
+
112
+ def _basic_checks(self, data: pd.DataFrame, rules: Dict) -> Dict:
113
+ """Basic data checks"""
114
+ checks = {}
115
+
116
+ # 1. Data size check
117
+ checks['min_rows'] = {
118
+ 'value': len(data),
119
+ 'threshold': rules.get('min_rows', 100),
120
+ 'passed': len(data) >= rules.get('min_rows', 100)
121
+ }
122
+
123
+ # 2. Target variable presence check
124
+ target = self.config.target_column
125
+ checks['has_target'] = {
126
+ 'value': target in data.columns,
127
+ 'passed': target in data.columns
128
+ }
129
+
130
+ # 3. Missing values check
131
+ missing_percentage = (data.isnull().sum().sum() / data.size) * 100
132
+ checks['missing_percentage'] = {
133
+ 'value': missing_percentage,
134
+ 'threshold': rules.get('max_missing_percentage', 30),
135
+ 'passed': missing_percentage <= rules.get('max_missing_percentage', 30)
136
+ }
137
+
138
+ # 4. Duplicates check
139
+ duplicate_count = data.duplicated().sum()
140
+ duplicate_percentage = (duplicate_count / len(data)) * 100
141
+ checks['duplicates'] = {
142
+ 'value': duplicate_percentage,
143
+ 'threshold': 5, # Maximum 5% duplicates
144
+ 'passed': duplicate_percentage <= 5
145
+ }
146
+
147
+ # 5. Data types check
148
+ numeric_count = len(data.select_dtypes(include=[np.number]).columns)
149
+ checks['numeric_features'] = {
150
+ 'value': numeric_count,
151
+ 'passed': numeric_count >= 1 # At least one numeric feature required
152
+ }
153
+
154
+ return checks
155
+
156
+ def _quality_metrics(self, data: pd.DataFrame, rules: Dict) -> Dict:
157
+ """Data quality metrics"""
158
+ metrics = {}
159
+
160
+ # 1. Numeric features statistics
161
+ numeric_cols = data.select_dtypes(include=[np.number]).columns
162
+
163
+ if len(numeric_cols) > 0:
164
+ numeric_stats = {}
165
+ for col in numeric_cols:
166
+ col_data = data[col].dropna()
167
+ if len(col_data) > 0:
168
+ numeric_stats[col] = {
169
+ 'mean': float(col_data.mean()),
170
+ 'std': float(col_data.std()),
171
+ 'skewness': float(col_data.skew()),
172
+ 'kurtosis': float(col_data.kurtosis()),
173
+ 'zeros_percentage': float((col_data == 0).sum() / len(col_data) * 100),
174
+ 'unique_percentage': float(col_data.nunique() / len(col_data) * 100)
175
+ }
176
+
177
+ metrics['numeric_statistics'] = numeric_stats
178
+
179
+ # 2. Data stability (for time series)
180
+ if isinstance(data.index, pd.DatetimeIndex):
181
+ stability_metrics = self._calculate_temporal_stability(data)
182
+ metrics['temporal_stability'] = stability_metrics
183
+
184
+ # 3. Feature informativeness
185
+ if self.config.target_column in data.columns:
186
+ informativeness = self._calculate_feature_informativeness(data)
187
+ metrics['feature_informativeness'] = informativeness
188
+
189
+ # 4. Target variable quality
190
+ target = self.config.target_column
191
+ if target in data.columns:
192
+ target_data = data[target].dropna()
193
+ if len(target_data) > 0:
194
+ target_metrics = {
195
+ # measured on the raw column; target_data already has its NaNs dropped
+ 'missing_percentage': float(data[target].isnull().sum() / len(data) * 100),
196
+ 'unique_values': int(target_data.nunique()),
197
+ 'is_constant': bool(target_data.nunique() <= 1),
198
+ 'has_outliers': self._check_target_outliers(target_data),
199
+ 'distribution_type': self._identify_distribution(target_data)
200
+ }
201
+ metrics['target_quality'] = target_metrics
202
+
203
+ # 5. Class balance (for classification) - not applicable here, but kept as placeholder
204
+ metrics['class_balance'] = {'note': 'Not applicable for regression'}
205
+
206
+ return metrics
207
+
208
+ def _calculate_temporal_stability(self, data: pd.DataFrame) -> Dict:
209
+ """Calculate time series stability metrics"""
210
+ stability = {}
211
+
212
+ if not isinstance(data.index, pd.DatetimeIndex):
213
+ return stability
214
+
215
+ # Split into periods (e.g., by years)
216
+ if 'year' not in data.columns:
217
+ data_copy = data.copy()
218
+ data_copy['year'] = data_copy.index.year
219
+ else:
220
+ data_copy = data
221
+
222
+ years = sorted(data_copy['year'].unique())
223
+
224
+ if len(years) > 1:
225
+ # Statistics by years for numeric columns
226
+ year_stats = {}
227
+ for col in data.select_dtypes(include=[np.number]).columns[:5]: # First 5 columns
228
+ yearly_means = data_copy.groupby('year')[col].mean()
229
+ yearly_stds = data_copy.groupby('year')[col].std()
230
+
231
+ # Coefficient of variation between years
232
+ if yearly_means.std() > 0:
233
+ cv_between_years = yearly_means.std() / yearly_means.mean()
234
+ else:
235
+ cv_between_years = 0
236
+
237
+ year_stats[col] = {
238
+ 'yearly_means': yearly_means.to_dict(),
239
+ 'yearly_stds': yearly_stds.to_dict(),
240
+ 'cv_between_years': float(cv_between_years),
241
+ 'mean_stability': float(1 - cv_between_years) # 1 - CV, closer to 1 means more stable
242
+ }
243
+
244
+ stability['yearly_statistics'] = year_stats
245
+
246
+ # Check for time gaps
247
+ time_diff = pd.Series(data.index).diff().dropna()
248
+ if len(time_diff) > 0:
249
+ max_gap = time_diff.max()
250
+ avg_gap = time_diff.mean()
251
+ gap_std = time_diff.std()
252
+
253
+ stability['time_gaps'] = {
254
+ 'max_gap_days': float(max_gap.days if hasattr(max_gap, 'days') else max_gap),
255
+ 'avg_gap_days': float(avg_gap.days if hasattr(avg_gap, 'days') else avg_gap),
256
+ 'gap_std': float(gap_std.days if hasattr(gap_std, 'days') else gap_std),
257
+ 'has_irregular_gaps': gap_std > avg_gap * 0.5 # If standard deviation > 50% of mean
258
+ }
259
+
260
+ # Seasonal stability
261
+ if len(data) > 365:
262
+ try:
263
+ # Analyse seasonal patterns
264
+ seasonal_stability = self._analyse_seasonal_stability(data)
265
+ stability['seasonal_stability'] = seasonal_stability
266
+ except Exception:
267
+ pass
268
+
269
+ return stability
270
+
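# Illustrative sketch (not part of this diff): the between-year coefficient of
# variation computed above, reproduced standalone on synthetic daily data.
import numpy as np
import pandas as pd

idx = pd.date_range('2020-01-01', '2022-12-31', freq='D')
df = pd.DataFrame({'y': np.random.default_rng(0).normal(100, 10, len(idx))}, index=idx)
yearly_means = df.groupby(df.index.year)['y'].mean()
cv_between_years = yearly_means.std() / yearly_means.mean()  # near 0 => stable series
print(f"mean_stability = {1 - cv_between_years:.3f}")  # the 1 - CV metric stored above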
271
+ def _analyse_seasonal_stability(self, data: pd.DataFrame) -> Dict:
272
+ """Analyse seasonal patterns stability"""
273
+ if not isinstance(data.index, pd.DatetimeIndex):
274
+ return {}
275
+
276
+ # For simplicity, analyse only target variable
277
+ target = self.config.target_column
278
+ if target not in data.columns:
279
+ return {}
280
+
283
+ # Split by years and compare seasonal patterns
284
+ data_copy = data.copy()
285
+ data_copy['year'] = data_copy.index.year
286
+ data_copy['month'] = data_copy.index.month
287
+
288
+ if 'year' in data_copy.columns and 'month' in data_copy.columns:
289
+ monthly_means = data_copy.groupby(['year', 'month'])[target].mean().unstack()
290
+
291
+ if not monthly_means.empty:
292
+ # Average correlation between years' monthly profiles (transpose so columns are years)
293
+ yearly_corr = monthly_means.T.corr().mean().mean()
294
+
295
+ # Variation between years
296
+ monthly_cv = monthly_means.std() / monthly_means.mean()
297
+ avg_monthly_cv = monthly_cv.mean()
298
+
299
+ return {
300
+ 'yearly_correlation': float(yearly_corr),
301
+ 'average_monthly_cv': float(avg_monthly_cv),
302
+ 'seasonal_consistency': 'high' if yearly_corr > 0.8 and avg_monthly_cv < 0.3 else
303
+ 'medium' if yearly_corr > 0.6 else 'low'
304
+ }
305
+
306
+ return {}
307
+
308
+ def _calculate_feature_informativeness(self, data: pd.DataFrame) -> Dict:
309
+ """Calculate feature informativeness"""
310
+ informativeness = {}
311
+
312
+ target = self.config.target_column
313
+ if target not in data.columns:
314
+ return informativeness
315
+
316
+ numeric_cols = data.select_dtypes(include=[np.number]).columns
317
+ numeric_cols = [col for col in numeric_cols if col != target]
318
+
319
+ for col in numeric_cols[:20]: # Limit number of features for analysis
320
+ try:
321
+ # Correlation with target variable
322
+ correlation = data[col].corr(data[target])
+ if pd.isna(correlation):
+ continue  # constant or all-NaN feature: correlation is undefined
323
+
324
+ # Mutual information (approximated)
325
+ # For simplicity, use absolute correlation as informativeness measure
326
+ informativeness[col] = {
327
+ 'correlation_with_target': float(correlation),
328
+ 'abs_correlation': float(abs(correlation)),
329
+ 'informativeness': 'high' if abs(correlation) > 0.5 else
330
+ 'medium' if abs(correlation) > 0.3 else 'low'
331
+ }
332
+ except Exception:
333
+ continue
334
+
335
+ return informativeness
336
+
337
+ def _check_target_outliers(self, target_series: pd.Series) -> Dict:
338
+ """Check target variable for outliers"""
339
+ if len(target_series) < 10:
340
+ return {'has_outliers': False, 'outlier_percentage': 0}
341
+
342
+ q1 = target_series.quantile(0.25)
343
+ q3 = target_series.quantile(0.75)
344
+ iqr = q3 - q1
345
+
346
+ if iqr > 0:
347
+ lower_bound = q1 - 1.5 * iqr
348
+ upper_bound = q3 + 1.5 * iqr
349
+
350
+ outliers = target_series[(target_series < lower_bound) | (target_series > upper_bound)]
351
+ outlier_percentage = len(outliers) / len(target_series) * 100
352
+
353
+ return {
354
+ 'has_outliers': len(outliers) > 0,
355
+ 'outlier_count': int(len(outliers)),
356
+ 'outlier_percentage': float(outlier_percentage),
357
+ 'outlier_bounds': {'lower': float(lower_bound), 'upper': float(upper_bound)}
358
+ }
359
+
360
+ return {'has_outliers': False, 'outlier_percentage': 0}
361
+
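# Illustrative sketch (not part of this diff): the 1.5*IQR rule applied above,
# on a toy series where 40 is an obvious outlier.
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
print(s[mask].tolist(), f"-> {mask.mean() * 100:.1f}% outliers")  # [40] -> 10.0%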
362
+ def _identify_distribution(self, series: pd.Series) -> str:
363
+ """Identify distribution type"""
364
+ if len(series) < 30:
365
+ return 'insufficient_data'
366
+
367
+ skewness = series.skew()
368
+ kurtosis = series.kurtosis()  # pandas reports excess (Fisher) kurtosis: ~0 for a normal distribution
369
+
370
+ if abs(skewness) < 0.5 and abs(kurtosis) < 1:
371
+ return 'normal_like'
372
+ elif skewness > 1:
373
+ return 'right_skewed'
374
+ elif skewness < -1:
375
+ return 'left_skewed'
376
+ elif kurtosis > 3:
377
+ return 'heavy_tailed'
378
+ elif kurtosis < 2:
379
+ return 'light_tailed'
380
+ else:
381
+ return 'unknown'
382
+
383
+ def _find_issues(self, data: pd.DataFrame, rules: Dict) -> Dict:
384
+ """Find data problems"""
385
+ issues = {
386
+ 'critical': [],
387
+ 'warning': [],
388
+ 'info': []
389
+ }
390
+
391
+ # 1. Check missing values in important features
392
+ missing_info = data.isnull().sum()
393
+ high_missing_cols = missing_info[missing_info / len(data) * 100 > 20].index.tolist()
394
+
395
+ for col in high_missing_cols:
396
+ missing_pct = missing_info[col] / len(data) * 100
397
+ if missing_pct > 50:
398
+ issues['critical'].append(f"Column '{col}': {missing_pct:.1f}% missing values (critical)")
399
+ elif missing_pct > 20:
400
+ issues['warning'].append(f"Column '{col}': {missing_pct:.1f}% missing values")
401
+
402
+ # 2. Check constant features
403
+ for col in data.columns:
404
+ if data[col].nunique() <= 1:
405
+ issues['critical'].append(f"Column '{col}': constant value")
406
+
407
+ # 3. Check correlation of lag/diff features with their base columns
408
+ numeric_cols = data.select_dtypes(include=[np.number]).columns
409
+ for col in numeric_cols:
410
+ if '_lag_' in col or '_diff_' in col:
411
+ base_col = col.split('_lag_')[0] if '_lag_' in col else col.split('_diff_')[0]
412
+ if base_col in numeric_cols:
413
+ correlation = data[col].corr(data[base_col])
414
+ if pd.notna(correlation) and abs(correlation) > 0.95:
415
+ issues['info'].append(f"Column '{col}': high correlation with '{base_col}' ({correlation:.3f})")
416
+
417
+ # 4. Check time gaps
418
+ if isinstance(data.index, pd.DatetimeIndex):
419
+ time_diff = pd.Series(data.index).diff().dropna()
420
+ if len(time_diff) > 0:
421
+ max_gap = time_diff.max()
422
+ if hasattr(max_gap, 'days') and max_gap.days > 30:
423
+ issues['warning'].append(f"Detected time gap: {max_gap.days} days")
424
+
425
+ # 5. Check target variable
426
+ target = self.config.target_column
427
+ if target in data.columns:
428
+ target_data = data[target].dropna()
429
+ if len(target_data) > 0:
430
+ if target_data.nunique() <= 1:
431
+ issues['critical'].append(f"Target variable '{target}': constant value")
432
+
433
+ # Check for outliers
434
+ outlier_check = self._check_target_outliers(target_data)
435
+ if outlier_check.get('has_outliers', False) and outlier_check.get('outlier_percentage', 0) > 10:
436
+ issues['warning'].append(f"Target variable '{target}': {outlier_check['outlier_percentage']:.1f}% outliers")
437
+
438
+ # 6. Check multicollinearity (simplified)
439
+ if len(numeric_cols) > 5:
440
+ corr_matrix = data[numeric_cols].corr().abs()
441
+ high_corr_pairs = []
442
+
443
+ for i in range(len(corr_matrix.columns)):
444
+ for j in range(i+1, len(corr_matrix.columns)):
445
+ if corr_matrix.iloc[i, j] > 0.9:
446
+ col1 = corr_matrix.columns[i]
447
+ col2 = corr_matrix.columns[j]
448
+ high_corr_pairs.append((col1, col2, corr_matrix.iloc[i, j]))
449
+
450
+ if len(high_corr_pairs) > 5:
451
+ issues['warning'].append(f"Detected multicollinearity: {len(high_corr_pairs)} pairs with correlation > 0.9")
452
+
453
+ return issues
454
+
455
+ def _generate_recommendations(
456
+ self,
457
+ basic_checks: Dict,
458
+ quality_metrics: Dict,
459
+ issues: Dict
460
+ ) -> List[str]:
461
+ """Generate data improvement recommendations"""
462
+ recommendations = []
463
+
464
+ # Recommendations based on basic checks
465
+ for check_name, check_info in basic_checks.items():
466
+ if not check_info.get('passed', True):
467
+ if check_name == 'min_rows':
468
+ recommendations.append(f"Increase data volume: current row count ({check_info['value']}) below minimum threshold ({check_info['threshold']})")
469
+ elif check_name == 'has_target':
470
+ recommendations.append(f"Add target variable '{self.config.target_column}' to data")
471
+ elif check_name == 'missing_percentage':
472
+ recommendations.append(f"Handle missing values: {check_info['value']:.1f}% missing exceeds threshold {check_info['threshold']}%")
473
+ elif check_name == 'duplicates':
474
+ recommendations.append(f"Remove duplicates: {check_info['value']:.1f}% duplicate rows")
475
+
476
+ # Recommendations based on issues
477
+ if issues.get('critical'):
478
+ recommendations.append("Resolve critical issues before using data")
479
+
480
+ if issues.get('warning'):
481
+ recommendations.append("Consider addressing warnings to improve data quality")
482
+
483
+ # Recommendations based on quality metrics
484
+ target_metrics = quality_metrics.get('target_quality', {})
485
+ if target_metrics.get('is_constant', False):
486
+ recommendations.append(f"Target variable '{self.config.target_column}' is constant, different target variable needed")
487
+
488
+ if target_metrics.get('has_outliers', {}).get('has_outliers', False):
489
+ outlier_pct = target_metrics['has_outliers'].get('outlier_percentage', 0)
490
+ if outlier_pct > 5:
491
+ recommendations.append(f"Handle outliers in target variable: {outlier_pct:.1f}% outliers")
492
+
493
+ # Time series stability recommendations
494
+ temporal_stability = quality_metrics.get('temporal_stability', {})
495
+ if temporal_stability.get('time_gaps', {}).get('has_irregular_gaps', False):
496
+ recommendations.append("Detected irregular time intervals, consider resampling")
497
+
498
+ return recommendations
499
+
500
+ def _calculate_overall_score(self, validation_results: Dict) -> float:
501
+ """Calculate overall data quality score"""
502
+ score = 100
503
+
504
+ # Penalties for basic checks
505
+ basic_checks = validation_results.get('basic_checks', {})
506
+ for check_name, check_info in basic_checks.items():
507
+ if not check_info.get('passed', True):
508
+ if check_name == 'min_rows':
509
+ score -= 30
510
+ elif check_name == 'has_target':
511
+ score -= 50
512
+ elif check_name == 'missing_percentage':
513
+ missing_pct = check_info.get('value', 0)
514
+ if missing_pct > 50:
515
+ score -= 40
516
+ elif missing_pct > 20:
517
+ score -= 20
518
+ elif missing_pct > 5:
519
+ score -= 10
520
+ elif check_name == 'duplicates':
521
+ duplicate_pct = check_info.get('value', 0)
522
+ if duplicate_pct > 20:
523
+ score -= 30
524
+ elif duplicate_pct > 10:
525
+ score -= 15
526
+ elif duplicate_pct > 5:
527
+ score -= 5
528
+
529
+ # Penalties for issues
530
+ issues = validation_results.get('issues', {})
531
+ if issues.get('critical'):
532
+ score -= len(issues['critical']) * 20
533
+
534
+ if issues.get('warning'):
535
+ score -= len(issues['warning']) * 5
536
+
537
+ # Bonuses for good metrics
538
+ quality_metrics = validation_results.get('quality_metrics', {})
539
+ target_metrics = quality_metrics.get('target_quality', {})
540
+
541
+ if not target_metrics.get('is_constant', True):
542
+ score += 10
543
+
544
+ if target_metrics.get('missing_percentage', 100) < 1:
545
+ score += 5
546
+
547
+ # Limit score to 0-100 range
548
+ return max(0, min(100, score))
549
+
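# Illustrative sketch (not part of this diff): how the penalty/bonus arithmetic
# above plays out for a hypothetical dataset that passes all basic checks.
score = 100
score -= 1 * 20   # one critical issue
score -= 2 * 5    # two warnings
score += 10       # target variable is not constant
score += 5        # under 1% missing values in the target
print(score)      # 85, then clamped into [0, 100] by max(0, min(100, score))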
550
+ def _log_validation_results(self, validation_results: Dict) -> None:
551
+ """Log validation results"""
552
+ stage = validation_results['stage']
553
+ status = validation_results['status']
554
+ score = validation_results['overall_score']
555
+
556
+ logger.info(f"VALIDATION RESULTS ({stage}):")
557
+ logger.info(f" Status: {status}")
558
+ logger.info(f" Overall score: {score}/100")
559
+ logger.info(f" Data shape: {validation_results['data_shape'][0]}x{validation_results['data_shape'][1]}")
560
+
561
+ # Basic checks
562
+ logger.info("\nBASIC CHECKS:")
563
+ for check_name, check_info in validation_results['basic_checks'].items():
564
+ status_icon = "✓" if check_info.get('passed', True) else "✗"
565
+ logger.info(f" {status_icon} {check_name}: {check_info.get('value', 'N/A')}")
566
+
567
+ # Issues
568
+ issues = validation_results['issues']
569
+ if any(issues.values()):
570
+ logger.info("\nDETECTED ISSUES:")
571
+ for severity, issue_list in issues.items():
572
+ if issue_list:
573
+ logger.info(f" {severity.upper()}:")
574
+ for issue in issue_list[:5]: # Show only first 5 issues of each type
575
+ logger.info(f" - {issue}")
576
+ if len(issue_list) > 5:
577
+ logger.info(f" ... and {len(issue_list) - 5} more issues")
578
+ else:
579
+ logger.info("\n✓ No issues detected")
580
+
581
+ # Recommendations
582
+ recommendations = validation_results['recommendations']
583
+ if recommendations:
584
+ logger.info("\nRECOMMENDATIONS:")
585
+ for i, rec in enumerate(recommendations, 1):
586
+ logger.info(f" {i}. {rec}")
587
+
588
+ # Conclusion
589
+ if status == 'PASS':
590
+ logger.info("\n✓ Data passed validation and is ready for use")
591
+ elif status == 'WARNING':
592
+ logger.info("\n⚠ Data requires attention, there are issues to address")
593
+ else:
594
+ logger.info("\n✗ Data requires significant improvement before use")
595
+
596
+ def generate_report(self, stage: str = 'final') -> Dict:
597
+ """Generate detailed validation report"""
598
+ if stage not in self.validation_results:
599
+ return {}
600
+
601
+ report = self.validation_results[stage].copy()
602
+
603
+ # Add metadata
604
+ report['config'] = self.config.to_dict()
605
+ report['validator_version'] = '1.0'
606
+ report['generation_time'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
607
+
608
+ # Add detailed metrics
609
+ quality_metrics = report.get('quality_metrics', {})
610
+
611
+ if 'numeric_statistics' in quality_metrics:
612
+ # Numeric features summary
613
+ numeric_stats = quality_metrics['numeric_statistics']
614
+ report['numeric_summary'] = {
615
+ 'total_numeric_features': len(numeric_stats),
616
+ 'features_with_high_skewness': sum(1 for s in numeric_stats.values() if abs(s.get('skewness', 0)) > 1),
617
+ 'features_with_high_kurtosis': sum(1 for s in numeric_stats.values() if abs(s.get('kurtosis', 0)) > 3),
618
+ 'features_with_many_zeros': sum(1 for s in numeric_stats.values() if s.get('zeros_percentage', 0) > 50)
619
+ }
620
+
621
+ return report
622
+
623
+ def save_report(self, stage: str = 'final', path: str = None) -> None:
624
+ """Save validation report to file"""
625
+ if stage not in self.validation_results:
626
+ logger.warning(f"Report for stage '{stage}' not found")
627
+ return
628
+
629
+ report = self.generate_report(stage)
630
+
631
+ if path is None:
632
+ path = f'{self.config.results_dir}/reports/validation_report_{stage}.json'
633
+
634
+ # Create directory if needed
635
+ Path(path).parent.mkdir(parents=True, exist_ok=True)
636
+
637
+ # Custom JSON encoder
638
+ class NumpyEncoder(json.JSONEncoder):
639
+ def default(self, obj):
640
+ if isinstance(obj, np.integer):
641
+ return int(obj)  # keep integer values as integers in JSON
642
+ elif isinstance(obj, np.floating):
643
+ return None if np.isnan(obj) else float(obj)
644
+ elif isinstance(obj, np.bool_):
645
+ return bool(obj)
646
+ elif isinstance(obj, np.ndarray):
647
+ return obj.tolist()
648
+ elif isinstance(obj, pd.Timestamp):
649
+ return obj.strftime('%Y-%m-%d %H:%M:%S')
650
+ return super().default(obj)
651
+
652
+ with open(path, 'w', encoding='utf-8') as f:
653
+ json.dump(report, f, indent=4, ensure_ascii=False, cls=NumpyEncoder)
654
+
655
+ logger.info(f"✓ Validation report saved: {path}")
visualization/__init__.py ADDED
File without changes
visualization/visualization_manager.py ADDED
@@ -0,0 +1,1462 @@
1
+ # ============================================
2
+ # CLASS 13: VISUALISATION MANAGER (UPDATED)
3
+ # ============================================
4
+ import os
5
+ from datetime import datetime
6
+ import json
7
+ from typing import Dict, List, Optional, Tuple, Union, Any
8
+
9
+ import pandas as pd
10
+ import numpy as np
11
+ from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
12
+ import matplotlib
13
+ matplotlib.use('Agg')  # select the non-interactive backend before pyplot is imported
14
+ import matplotlib.pyplot as plt
15
+ import seaborn as sns
16
+ from scipy.stats import gaussian_kde
17
+
18
+ from config.config import Config
19
+ import logging
20
+
21
+ # Logging setup
22
+ logging.basicConfig(level=logging.INFO)
23
+ logger = logging.getLogger(__name__)
24
+
25
+
26
+ class VisualisationManager:
27
+ """Class for managing all visualisations"""
28
+
29
+ def __init__(self, config: Config):
30
+ """
31
+ Initialise visualisation manager
32
+
33
+ Parameters:
34
+ -----------
35
+ config : Config
36
+ Experiment configuration
37
+ """
38
+ self.config = config
39
+ self.plots_generated = {}
40
+ self.plot_files = {}
41
+ self.figure_count = 0
42
+
43
+ # Create directory structure for saving plots
44
+ self._create_directory_structure()
45
+
46
+ def _create_directory_structure(self) -> None:
47
+ """Create directory structure for saving plots"""
48
+ base_dir = self.config.results_dir
49
+
50
+ # Main plot directories
51
+ self.plots_dir = os.path.join(base_dir, "plots")
52
+ self.correlations_dir = os.path.join(base_dir, "plots", "correlations")
53
+ self.distributions_dir = os.path.join(base_dir, "plots", "distributions")
54
+ self.features_dir = os.path.join(base_dir, "plots", "features")
55
+ self.time_series_dir = os.path.join(base_dir, "plots", "time_series")
56
+ self.preprocessing_dir = os.path.join(base_dir, "plots", "preprocessing")
57
+ self.summary_dir = os.path.join(base_dir, "plots", "summary")
58
+ self.reports_dir = os.path.join(base_dir, "reports")
59
+
60
+ # Create directories
61
+ directories = [
62
+ self.plots_dir,
63
+ self.correlations_dir,
64
+ self.distributions_dir,
65
+ self.features_dir,
66
+ self.time_series_dir,
67
+ self.preprocessing_dir,
68
+ self.summary_dir,
69
+ self.reports_dir
70
+ ]
71
+
72
+ for directory in directories:
73
+ os.makedirs(directory, exist_ok=True)
74
+ logger.debug(f"Created directory: {directory}")
75
+
76
+ def _save_figure(self, fig: plt.Figure, filename: str,
77
+ subdirectory: str = None, dpi: int = 300) -> str:
78
+ """
79
+ Save plot and close it
80
+
81
+ Parameters:
82
+ -----------
83
+ fig : matplotlib.figure.Figure
84
+ Plot figure object
85
+ filename : str
86
+ Filename for saving
87
+ subdirectory : str, optional
88
+ Subdirectory for saving
89
+ dpi : int
90
+ Save quality
91
+
92
+ Returns:
93
+ --------
94
+ str : full path to saved file
95
+ """
96
+ if not filename.endswith('.png'):
97
+ filename = f"{filename}.png"
98
+
99
+ if subdirectory:
100
+ save_dir = os.path.join(self.plots_dir, subdirectory)
101
+ os.makedirs(save_dir, exist_ok=True)
102
+ else:
103
+ save_dir = self.plots_dir
104
+
105
+ filepath = os.path.join(save_dir, filename)
106
+
107
+ try:
108
+ fig.savefig(filepath, dpi=dpi, bbox_inches='tight', facecolor='white')
109
+ logger.info(f"✓ Plot saved: {filepath}")
110
+ except Exception as e:
111
+ logger.error(f"✗ Error saving plot {filename}: {e}")
112
+ filepath = None
113
+
114
+ # Close plot without display
115
+ plt.close(fig)
116
+
117
+ return filepath
118
+
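# Illustrative sketch (not part of this diff): the save-and-close pattern that
# _save_figure wraps, with the Agg backend selected so no window is opened.
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 1, 3])
fig.savefig('example.png', dpi=300, bbox_inches='tight', facecolor='white')
plt.close(fig)  # explicit close prevents figure accumulation in long batch runs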
119
+ # ============================================
120
+ # MAIN VISUALISATION METHODS
121
+ # ============================================
122
+
123
+ def create_summary_dashboard(
124
+ self,
125
+ data: pd.DataFrame,
126
+ preprocessing_stages: Dict = None,
127
+ filename: str = "summary_dashboard"
128
+ ) -> str:
129
+ """
130
+ Create summary visualisation dashboard
131
+
132
+ Parameters:
133
+ -----------
134
+ data : pd.DataFrame
135
+ Data for visualisation
136
+ preprocessing_stages : Dict, optional
137
+ Preprocessing stages information
138
+ filename : str
139
+ Filename for saving
140
+
141
+ Returns:
142
+ --------
143
+ str : path to saved file or None if error
144
+ """
145
+ logger.info("\n" + "="*80)
146
+ logger.info("CREATING SUMMARY DASHBOARD")
147
+ logger.info("="*80)
148
+
149
+ target_col = self.config.target_column
150
+
151
+ try:
152
+ # Create large dashboard
153
+ fig = plt.figure(figsize=(20, 24))
154
+ gs = fig.add_gridspec(6, 4, hspace=0.3, wspace=0.3)
155
+
156
+ # 1. Time series of target variable
157
+ ax1 = fig.add_subplot(gs[0, :2])
158
+ if target_col in data.columns and isinstance(data.index, pd.DatetimeIndex):
159
+ ax1.plot(data.index, data[target_col], linewidth=1, color='blue', alpha=0.7)
160
+ ax1.set_title(f'Time Series: {target_col}', fontsize=12, fontweight='bold')
161
+ ax1.set_xlabel('Date', fontsize=10)
162
+ ax1.set_ylabel(target_col, fontsize=10)
163
+ ax1.grid(True, alpha=0.3)
164
+ ax1.tick_params(axis='x', rotation=45)
165
+ else:
166
+ ax1.text(0.5, 0.5, 'No time series data available',
167
+ ha='center', va='center', transform=ax1.transAxes)
168
+
169
+ # 2. Target variable distribution
170
+ ax2 = fig.add_subplot(gs[0, 2:])
171
+ if target_col in data.columns:
172
+ values = data[target_col].dropna()
173
+ if len(values) > 0:
174
+ ax2.hist(values, bins=30, edgecolor='black', alpha=0.7, color='green')
175
+ ax2.set_title(f'Distribution: {target_col}', fontsize=12, fontweight='bold')
176
+ ax2.set_xlabel(target_col, fontsize=10)
177
+ ax2.set_ylabel('Frequency', fontsize=10)
178
+ ax2.grid(True, alpha=0.3)
179
+ else:
180
+ ax2.text(0.5, 0.5, 'No data for distribution',
181
+ ha='center', va='center', transform=ax2.transAxes)
182
+
183
+ # 3. Correlation matrix (top features)
184
+ ax3 = fig.add_subplot(gs[1, :])
185
+ numeric_cols = data.select_dtypes(include=[np.number]).columns
186
+ if len(numeric_cols) > 1:
187
+ display_cols = list(numeric_cols[:15])
188
+ if target_col not in display_cols and target_col in data.columns:
189
+ display_cols = [target_col] + [c for c in display_cols if c != target_col][:14]
190
+
191
+ corr_matrix = data[display_cols].corr()
192
+ mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
193
+
194
+ im = ax3.imshow(corr_matrix.where(~mask), cmap='coolwarm', vmin=-1, vmax=1, aspect='auto')
195
+ ax3.set_title('Correlation Matrix (Top 15 Features)',
196
+ fontsize=12, fontweight='bold')
197
+ ax3.set_xticks(range(len(display_cols)))
198
+ ax3.set_yticks(range(len(display_cols)))
199
+ ax3.set_xticklabels(display_cols, rotation=90, fontsize=8)
200
+ ax3.set_yticklabels(display_cols, fontsize=8)
201
+ plt.colorbar(im, ax=ax3, shrink=0.8)
202
+
203
+ # 4. Seasonal patterns
204
+ ax4 = fig.add_subplot(gs[2, :2])
205
+ if target_col in data.columns and isinstance(data.index, pd.DatetimeIndex):
206
+ data_copy = data.copy()
207
+ data_copy['month'] = data_copy.index.month
208
+
209
+ monthly_avg = data_copy.groupby('month')[target_col].mean()
210
+ colors = plt.cm.Set3(np.linspace(0, 1, len(monthly_avg)))
211
+ ax4.bar(monthly_avg.index, monthly_avg.values, color=colors, edgecolor='black')
212
+ ax4.set_title('Average Values by Month', fontsize=12, fontweight='bold')
213
+ ax4.set_xlabel('Month', fontsize=10)
214
+ ax4.set_ylabel(f'Average {target_col}', fontsize=10)
215
+ ax4.set_xticks(range(1, 13))
216
+ month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
217
+ 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
218
+ ax4.set_xticklabels(month_names)
219
+ ax4.grid(True, alpha=0.3, axis='y')
220
+
221
+ # 5. Weekly patterns
222
+ ax5 = fig.add_subplot(gs[2, 2:])
223
+ if target_col in data.columns and isinstance(data.index, pd.DatetimeIndex):
224
+ data_copy = data.copy()
225
+ data_copy['dayofweek'] = data_copy.index.dayofweek
226
+
227
+ daily_avg = data_copy.groupby('dayofweek')[target_col].mean()
228
+ colors = plt.cm.Paired(np.linspace(0, 1, len(daily_avg)))
229
+ ax5.bar(daily_avg.index, daily_avg.values, color=colors, edgecolor='black')
230
+ ax5.set_title('Average Values by Day of Week', fontsize=12, fontweight='bold')
231
+ ax5.set_xlabel('Day of Week', fontsize=10)
232
+ ax5.set_ylabel(f'Average {target_col}', fontsize=10)
233
+ ax5.set_xticks(range(7))
234
+ ax5.set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
235
+ ax5.grid(True, alpha=0.3, axis='y')
236
+
237
+ # 6. Trend and seasonality
238
+ ax6 = fig.add_subplot(gs[3, :])
239
+ if target_col in data.columns and len(data) > 30:
240
+ try:
241
+ window_size = min(365, len(data) // 10)
242
+ if window_size >= 7:
243
+ rolling_mean = data[target_col].rolling(window=window_size, center=True).mean()
244
+ rolling_std = data[target_col].rolling(window=window_size, center=True).std()
245
+
246
+ ax6.plot(data.index, data[target_col], alpha=0.5,
247
+ label='Original Series', linewidth=0.5, color='blue')
248
+ ax6.plot(rolling_mean.index, rolling_mean,
249
+ label=f'Rolling Mean ({window_size} days)',
250
+ color='red', linewidth=2)
251
+ ax6.fill_between(rolling_mean.index,
252
+ rolling_mean - rolling_std,
253
+ rolling_mean + rolling_std,
254
+ alpha=0.2, color='red')
255
+
256
+ ax6.set_title('Trend and Volatility', fontsize=12, fontweight='bold')
257
+ ax6.set_xlabel('Date', fontsize=10)
258
+ ax6.set_ylabel(target_col, fontsize=10)
259
+ ax6.legend(fontsize=9, loc='upper left')
260
+ ax6.grid(True, alpha=0.3)
261
+ else:
262
+ ax6.text(0.5, 0.5, 'Insufficient data for trend analysis',
263
+ ha='center', va='center', transform=ax6.transAxes)
264
+ except Exception as e:
265
+ logger.warning(f"Error plotting trend: {e}")
266
+ ax6.text(0.5, 0.5, 'Error plotting trend',
267
+ ha='center', va='center', transform=ax6.transAxes)
268
+
269
+ # 7. Preprocessing statistics
270
+ if preprocessing_stages:
271
+ ax7 = fig.add_subplot(gs[4, :2])
272
+
273
+ stages = list(preprocessing_stages.keys())
274
+ values = list(preprocessing_stages.values())
275
+
276
+ colors = plt.cm.viridis(np.linspace(0.3, 0.9, len(stages)))
277
+ bars = ax7.bar(range(len(stages)), values, color=colors, edgecolor='black')
278
+ ax7.set_title('Preprocessing Statistics', fontsize=12, fontweight='bold')
279
+ ax7.set_xlabel('Processing Stage', fontsize=10)
280
+ ax7.set_ylabel('Value', fontsize=10)
281
+ ax7.set_xticks(range(len(stages)))
282
+ ax7.set_xticklabels([s[:15] + '...' if len(s) > 15 else s for s in stages],
283
+ rotation=45, ha='right', fontsize=9)
284
+ ax7.grid(True, alpha=0.3, axis='y')
285
+
286
+ # Add values on bars
287
+ for bar, value in zip(bars, values):
288
+ height = bar.get_height()
289
+ ax7.text(bar.get_x() + bar.get_width()/2., height,
290
+ f'{value:.2f}', ha='center', va='bottom', fontsize=8)
291
+
292
+ # 8. Data information
293
+ ax8 = fig.add_subplot(gs[4, 2:])
294
+ ax8.axis('off')
295
+
296
+ info_text = []
297
+ info_text.append("GENERAL CHARACTERISTICS:")
298
+ info_text.append(f"• Number of records: {len(data):,}")
299
+ info_text.append(f"• Number of features: {len(data.columns)}")
300
+
301
+ if isinstance(data.index, pd.DatetimeIndex):
302
+ info_text.append(f"• Period: {data.index.min().strftime('%Y-%m-%d')} - "
303
+ f"{data.index.max().strftime('%Y-%m-%d')}")
304
+ info_text.append(f"• Days of data: {(data.index.max() - data.index.min()).days}")
305
+
306
+ if target_col in data.columns:
307
+ target_stats = data[target_col].describe()
308
+ info_text.append(f"\nTARGET VARIABLE '{target_col}':")
309
+ info_text.append(f"• Mean: {target_stats['mean']:.2f}")
310
+ info_text.append(f"• Standard deviation: {target_stats['std']:.2f}")
311
+ info_text.append(f"• Minimum: {target_stats['min']:.2f}")
312
+ info_text.append(f"• 25%: {target_stats['25%']:.2f}")
313
+ info_text.append(f"• 50% (median): {target_stats['50%']:.2f}")
314
+ info_text.append(f"• 75%: {target_stats['75%']:.2f}")
315
+ info_text.append(f"• Maximum: {target_stats['max']:.2f}")
316
+
317
+ info_text.append(f"\nDATA TYPES:")
318
+ for dtype, count in data.dtypes.value_counts().items():
319
+ info_text.append(f"• {dtype}: {count} columns")
320
+
321
+ missing_info = data.isnull().sum()
322
+ missing_total = missing_info.sum()
323
+ missing_percent = missing_total / data.size * 100
324
+ info_text.append(f"\nMISSING VALUES:")
325
+ info_text.append(f"• Total missing: {missing_total:,}")
326
+ info_text.append(f"• Missing percentage: {missing_percent:.2f}%")
327
+
328
+ if missing_total > 0:
329
+ top_missing = missing_info.nlargest(5)
330
+ info_text.append(f"• Top 5 columns with missing values:")
331
+ for col, count in top_missing.items():
332
+ percent = count / len(data) * 100
333
+ info_text.append(f" {col}: {count} ({percent:.1f}%)")
334
+
335
+ ax8.text(0.02, 0.98, '\n'.join(info_text), transform=ax8.transAxes,
336
+ fontsize=8, verticalalignment='top', fontfamily='monospace')
337
+
338
+ # 9. Autocorrelation plot
339
+ ax9 = fig.add_subplot(gs[5, :2])
340
+ if target_col in data.columns:
341
+ try:
342
+ series = data[target_col].dropna()
343
+ if len(series) > 50:
344
+ plot_acf(series, lags=min(50, len(series)-1), ax=ax9, alpha=0.05)
345
+ ax9.set_title('Autocorrelation Function (ACF)', fontsize=12, fontweight='bold')
346
+ ax9.set_xlabel('Lag', fontsize=10)
347
+ ax9.set_ylabel('Autocorrelation', fontsize=10)
348
+ ax9.grid(True, alpha=0.3)
349
+ else:
350
+ ax9.text(0.5, 0.5, 'Insufficient data for ACF',
351
+ ha='center', va='center', transform=ax9.transAxes)
352
+ except Exception as e:
353
+ logger.warning(f"Error plotting ACF: {e}")
354
+ ax9.text(0.5, 0.5, 'Error calculating ACF',
355
+ ha='center', va='center', transform=ax9.transAxes)
356
+
357
+ # 10. Partial autocorrelation plot
358
+ ax10 = fig.add_subplot(gs[5, 2:])
359
+ if target_col in data.columns:
360
+ try:
361
+ series = data[target_col].dropna()
362
+ if len(series) > 50:
363
+ plot_pacf(series, lags=min(50, len(series)-1), ax=ax10, alpha=0.05)
364
+ ax10.set_title('Partial Autocorrelation Function (PACF)',
365
+ fontsize=12, fontweight='bold')
366
+ ax10.set_xlabel('Lag', fontsize=10)
367
+ ax10.set_ylabel('Partial Autocorrelation', fontsize=10)
368
+ ax10.grid(True, alpha=0.3)
369
+ else:
370
+ ax10.text(0.5, 0.5, 'Insufficient data for PACF',
371
+ ha='center', va='center', transform=ax10.transAxes)
372
+ except Exception as e:
373
+ logger.warning(f"Error plotting PACF: {e}")
374
+ ax10.text(0.5, 0.5, 'Error calculating PACF',
375
+ ha='center', va='center', transform=ax10.transAxes)
376
+
377
+ plt.suptitle('Data Analysis Summary Dashboard', fontsize=16, fontweight='bold', y=0.98)
378
+ plt.tight_layout()
379
+
380
+ # Save
381
+ filepath = self._save_figure(fig, filename, "summary")
382
+ self.plot_files['summary_dashboard'] = filepath
383
+ return filepath
384
+
385
+ except Exception as e:
386
+ logger.error(f"Error creating summary dashboard: {e}")
387
+ return None
388
+
389
+ # ============================================
390
+ # SPECIFIC METHODS FOR SAVING PIPELINE-GENERATED PLOTS
391
+ # ============================================
392
+
393
+ def save_data_split_plot(self, filename: str = "data_split.png") -> str:
394
+ """
395
+ Save data split plot
396
+
397
+ Parameters:
398
+ -----------
399
+ filename : str
400
+ Filename for saving
401
+
402
+ Returns:
403
+ --------
404
+ str : path to saved file
405
+ """
406
+ try:
407
+ fig = plt.gcf() # Get current figure
408
+ filepath = self._save_figure(fig, filename, "time_series")
409
+ self.plot_files['data_split'] = filepath
410
+ return filepath
411
+ except Exception as e:
412
+ logger.error(f"Error saving data_split plot: {e}")
413
+ return None
414
+
415
+ def save_feature_selection_correlation_plot(self, filename: str = "feature_selection_correlation.png") -> str:
416
+ """
417
+ Save feature selection correlation plot
418
+
419
+ Parameters:
420
+ -----------
421
+ filename : str
422
+ Filename for saving
423
+
424
+ Returns:
425
+ --------
426
+ str : path to saved file
427
+ """
428
+ try:
429
+ fig = plt.gcf() # Get current figure
430
+ filepath = self._save_figure(fig, filename, "correlations")
431
+ self.plot_files['feature_selection_correlation'] = filepath
432
+ return filepath
433
+ except Exception as e:
434
+ logger.error(f"Error saving feature_selection_correlation plot: {e}")
435
+ return None
436
+
437
+ def save_missing_values_analysis_plot(self, filename: str = "missing_values_analysis.png") -> str:
438
+ """
439
+ Save missing values analysis plot
440
+
441
+ Parameters:
442
+ -----------
443
+ filename : str
444
+ Filename for saving
445
+
446
+ Returns:
447
+ --------
448
+ str : path to saved file
449
+ """
450
+ try:
451
+ fig = plt.gcf() # Get current figure
452
+ filepath = self._save_figure(fig, filename, "preprocessing")
453
+ self.plot_files['missing_values_analysis'] = filepath
454
+ return filepath
455
+ except Exception as e:
456
+ logger.error(f"Error saving missing_values_analysis plot: {e}")
457
+ return None
458
+
459
+ def save_outlier_handling_results_plot(self, filename: str = "outlier_handling_results.png") -> str:
460
+ """
461
+ Save outlier handling results plot
462
+
463
+ Parameters:
464
+ -----------
465
+ filename : str
466
+ Filename for saving
467
+
468
+ Returns:
469
+ --------
470
+ str : path to saved file
471
+ """
472
+ try:
473
+ fig = plt.gcf() # Get current figure
474
+ filepath = self._save_figure(fig, filename, "preprocessing")
475
+ self.plot_files['outlier_handling_results'] = filepath
476
+ return filepath
477
+ except Exception as e:
478
+ logger.error(f"Error saving outlier_handling_results plot: {e}")
479
+ return None
480
+
481
+ def save_outliers_analysis_plot(self, filename: str = "outliers_analysis.png") -> str:
482
+ """
483
+ Save outliers analysis plot
484
+
485
+ Parameters:
486
+ -----------
487
+ filename : str
488
+ Filename for saving
489
+
490
+ Returns:
491
+ --------
492
+ str : path to saved file
493
+ """
494
+ try:
495
+ fig = plt.gcf() # Get current figure
496
+ filepath = self._save_figure(fig, filename, "preprocessing")
497
+ self.plot_files['outliers_analysis'] = filepath
498
+ return filepath
499
+ except Exception as e:
500
+ logger.error(f"Error saving outliers_analysis plot: {e}")
501
+ return None
502
+
503
+ def save_scaling_results_plot(self, filename: str = "scaling_results.png") -> str:
504
+ """
505
+ Save scaling results plot
506
+
507
+ Parameters:
508
+ -----------
509
+ filename : str
510
+ Filename for saving
511
+
512
+ Returns:
513
+ --------
514
+ str : path to saved file
515
+ """
516
+ try:
517
+ fig = plt.gcf() # Get current figure
518
+ filepath = self._save_figure(fig, filename, "preprocessing")
519
+ self.plot_files['scaling_results'] = filepath
520
+ return filepath
521
+ except Exception as e:
522
+ logger.error(f"Error saving scaling_results plot: {e}")
523
+ return None
524
+
525
+ def save_stationarity_analysis_plot(self, filename: str = "stationarity_analysis.png") -> str:
526
+ """
527
+ Save stationarity analysis plot
528
+
529
+ Parameters:
530
+ -----------
531
+ filename : str
532
+ Filename for saving
533
+
534
+ Returns:
535
+ --------
536
+ str : path to saved file
537
+ """
538
+ try:
539
+ fig = plt.gcf() # Get current figure
540
+ filepath = self._save_figure(fig, filename, "time_series")
541
+ self.plot_files['stationarity_analysis'] = filepath
542
+ return filepath
543
+ except Exception as e:
544
+ logger.error(f"Error saving stationarity_analysis plot: {e}")
545
+ return None
546
+
547
+ def save_temporal_outliers_plot(self, filename: str = "temporal_outliers.png") -> str:
548
+ """
549
+ Save temporal outliers plot
550
+
551
+ Parameters:
552
+ -----------
553
+ filename : str
554
+ Filename for saving
555
+
556
+ Returns:
557
+ --------
558
+ str : path to saved file
559
+ """
560
+ try:
561
+ fig = plt.gcf() # Get current figure
562
+ filepath = self._save_figure(fig, filename, "time_series")
563
+ self.plot_files['temporal_outliers'] = filepath
564
+ return filepath
565
+ except Exception as e:
566
+ logger.error(f"Error saving temporal_outliers plot: {e}")
567
+ return None
568
+
569
+ # ============================================
570
+ # UNIVERSAL METHOD FOR SAVING ANY PLOT
571
+ # ============================================
572
+
573
+ def save_current_plot(self, filename: str, subdirectory: str = None) -> str:
574
+ """
575
+ Universal method for saving current plot
576
+
577
+ Parameters:
578
+ -----------
579
+ filename : str
580
+ Filename for saving
581
+ subdirectory : str, optional
582
+ Subdirectory for saving
583
+
584
+ Returns:
585
+ --------
586
+ str : path to saved file
587
+ """
588
+ try:
589
+ fig = plt.gcf() # Get current figure
590
+ filepath = self._save_figure(fig, filename, subdirectory)
591
+
592
+ # Save plot information
593
+ plot_key = filename.replace('.png', '').replace('.jpg', '')
594
+ self.plot_files[plot_key] = filepath
595
+
596
+ return filepath
597
+ except Exception as e:
598
+ logger.error(f"Error saving plot {filename}: {e}")
599
+ return None
600
+
601
+ # ============================================
602
+ # ADDITIONAL VISUALISATION METHODS
603
+ # ============================================
604
+
605
+ def create_feature_importance_plot(
606
+ self,
607
+ feature_importance: Dict,
608
+ top_n: int = 20,
609
+ filename: str = "feature_importance"
610
+ ) -> str:
611
+ """
612
+ Create feature importance plot
613
+
614
+ Parameters:
615
+ -----------
616
+ feature_importance : Dict
617
+ Dictionary with feature importance
618
+ top_n : int
619
+ Number of top features to display
620
+ filename : str
621
+ Filename for saving
622
+
623
+ Returns:
624
+ --------
625
+ str : path to saved file or None if error
626
+ """
627
+ if not feature_importance:
628
+ logger.warning("No feature importance data for visualisation")
629
+ return None
630
+
631
+ try:
632
+ # Convert to Series and sort
633
+ importance_series = pd.Series(feature_importance).sort_values(ascending=False)
634
+ top_features = importance_series.head(top_n)
635
+
636
+ # Create plot
637
+ fig, ax = plt.subplots(figsize=(12, 8))
638
+
639
+ y_pos = np.arange(len(top_features))
640
+ colors = plt.cm.plasma(np.linspace(0.2, 0.9, len(top_features)))
641
+
642
+ bars = ax.barh(y_pos, top_features.values, color=colors, edgecolor='black')
643
+ ax.set_yticks(y_pos)
644
+ ax.set_yticklabels(top_features.index, fontsize=10)
645
+ ax.invert_yaxis()
646
+ ax.set_xlabel('Feature Importance', fontsize=11, fontweight='bold')
647
+ ax.set_title(f'Top-{top_n} Most Important Features', fontsize=14, fontweight='bold')
648
+ ax.grid(True, alpha=0.3, axis='x')
649
+
650
+ # Add values on bars
651
+ for i, (bar, value) in enumerate(zip(bars, top_features.values)):
652
+ width = bar.get_width()
653
+ ax.text(width * 1.01, bar.get_y() + bar.get_height()/2,
654
+ f'{value:.4f}', va='center', fontsize=9, fontweight='bold')
655
+
656
+ # Add additional information
657
+ plt.text(0.02, 0.98, f'Total features: {len(importance_series)}',
658
+ transform=fig.transFigure, fontsize=9, verticalalignment='top')
659
+
660
+ plt.tight_layout()
661
+
662
+ # Save
663
+ filepath = self._save_figure(fig, filename, "features")
664
+ self.plot_files['feature_importance'] = filepath
665
+ return filepath
666
+
667
+ except Exception as e:
668
+ logger.error(f"Error creating feature importance plot: {e}")
669
+ return None
670
+
671
+ def create_correlation_heatmap(
672
+ self,
673
+ data: pd.DataFrame,
674
+ top_n: int = 20,
675
+ filename: str = "correlation_heatmap"
676
+ ) -> Tuple[str, Optional[str]]:
677
+ """
678
+ Create correlation heatmap
679
+
680
+ Parameters:
681
+ -----------
682
+ data : pd.DataFrame
683
+ Data for analysis
684
+ top_n : int
685
+ Number of top features to display
686
+ filename : str
687
+ Filename for saving
688
+
689
+ Returns:
690
+ --------
691
+ Tuple[str, Optional[str]]:
692
+ (path to main heatmap, path to target correlation heatmap)
693
+ """
694
+ target_col = self.config.target_column
695
+
696
+ try:
697
+ numeric_cols = data.select_dtypes(include=[np.number]).columns.tolist()
698
+
699
+ if len(numeric_cols) < 2:
700
+ logger.warning("Insufficient numeric features for correlation analysis")
701
+ return None, None
702
+
703
+ # Create two heatmaps
704
+
705
+ # 1. Main correlation heatmap between all features
706
+ main_filepath = self._create_main_correlation_heatmap(data, numeric_cols, top_n, filename)
707
+
708
+ # 2. Target correlation heatmap
709
+ target_filepath = None
710
+ if target_col in data.columns and target_col in numeric_cols:
711
+ target_filepath = self._create_target_correlation_heatmap(data, target_col, numeric_cols, filename)
712
+
713
+ return main_filepath, target_filepath
714
+
715
+ except Exception as e:
716
+ logger.error(f"Error creating correlation heatmap: {e}")
717
+ return None, None
718
+
719
+ def _create_main_correlation_heatmap(
720
+ self,
721
+ data: pd.DataFrame,
722
+ numeric_cols: List[str],
723
+ top_n: int,
724
+ filename: str
725
+ ) -> str:
726
+ """Create main correlation heatmap"""
727
+ # Limit number of features for better readability
728
+ if len(numeric_cols) > top_n:
729
+ # Select features with highest variance
730
+ variances = data[numeric_cols].var().sort_values(ascending=False)
731
+ selected_cols = variances.head(top_n).index.tolist()
732
+ else:
733
+ selected_cols = numeric_cols
734
+
735
+ # Calculate correlation
736
+ corr_matrix = data[selected_cols].corr()
737
+
738
+ fig, ax = plt.subplots(figsize=(14, 12))
739
+
740
+ # Mask for upper triangle
741
+ mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
742
+
743
+ # Create heatmap
744
+ sns.heatmap(
745
+ corr_matrix,
746
+ annot=True,
747
+ fmt='.2f',
748
+ cmap='coolwarm',
749
+ center=0,
750
+ square=True,
751
+ mask=mask,
752
+ cbar_kws={'shrink': 0.8, 'label': 'Correlation Coefficient'},
753
+ linewidths=0.5,
754
+ linecolor='white',
755
+ ax=ax,
756
+ annot_kws={'size': 8}
757
+ )
758
+
759
+ ax.set_title(f'Correlation Matrix Between Features (Top-{top_n})',
760
+ fontsize=14, fontweight='bold', pad=20)
761
+
762
+ plt.tight_layout()
763
+
764
+ # Save
765
+ filepath = self._save_figure(fig, filename, "correlations")
766
+ self.plot_files['correlation_heatmap_main'] = filepath
767
+ return filepath
768
+
769
+ def _create_target_correlation_heatmap(
770
+ self,
771
+ data: pd.DataFrame,
772
+ target_col: str,
773
+ numeric_cols: List[str],
774
+ filename: str
775
+ ) -> str:
776
+ """Create target correlation heatmap"""
777
+ # Calculate correlations with target variable
778
+ correlations = data[numeric_cols].corrwith(data[target_col]).sort_values(key=abs, ascending=False)
779
+
780
+ # Exclude target variable itself
781
+ correlations = correlations[correlations.index != target_col]
782
+
783
+ # Take top 15 features
784
+ top_features = correlations.head(15)
785
+
786
+ fig, ax = plt.subplots(figsize=(10, 8))
787
+
788
+ colors = ['red' if x < 0 else 'green' for x in top_features.values]
789
+ bars = ax.barh(range(len(top_features)), top_features.values, color=colors, edgecolor='black')
790
+
791
+ ax.set_yticks(range(len(top_features)))
792
+ ax.set_yticklabels(top_features.index, fontsize=10)
793
+ ax.invert_yaxis()
794
+ ax.set_xlabel('Correlation Coefficient', fontsize=11, fontweight='bold')
795
+ ax.set_title(f'Feature Correlations with Target Variable "{target_col}"',
796
+ fontsize=14, fontweight='bold', pad=20)
797
+ ax.grid(True, alpha=0.3, axis='x')
798
+ ax.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
799
+
800
+ # Add values on bars
801
+ for bar, value in zip(bars, top_features.values):
802
+ width = bar.get_width()
803
+ ax.text(width + (0.01 if width >= 0 else -0.04),
804
+ bar.get_y() + bar.get_height()/2,
805
+ f'{value:.3f}',
806
+ va='center',
807
+ ha='left' if width >= 0 else 'right',
808
+ fontsize=9,
809
+ fontweight='bold',
810
+ color='black')
811
+
812
+ plt.tight_layout()
813
+
814
+ # Save
815
+ target_filename = f"{filename}_with_target"
816
+ filepath = self._save_figure(fig, target_filename, "correlations")
817
+ self.plot_files['correlation_with_target'] = filepath
818
+ return filepath
819
+
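# Illustrative sketch (not part of this diff): corrwith plus sort_values(key=abs),
# the core of the target-correlation plot above, on synthetic data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({'y': rng.normal(size=200)})
df['x1'] = 0.8 * df['y'] + 0.2 * rng.normal(size=200)   # strongly related
df['x2'] = rng.normal(size=200)                          # unrelated
corr = df[['x1', 'x2']].corrwith(df['y']).sort_values(key=abs, ascending=False)
print(corr)  # x1 ranks first because ordering is by absolute correlation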
820
+ def create_distribution_comparison(
821
+ self,
822
+ original_data: pd.DataFrame,
823
+ processed_data: pd.DataFrame,
824
+ columns: List[str] = None,
825
+ max_columns: int = 12,
826
+ filename: str = "distribution_comparison"
827
+ ) -> str:
828
+ """
829
+ Compare distributions before and after processing
830
+
831
+ Parameters:
832
+ -----------
833
+ original_data : pd.DataFrame
834
+ Original data
835
+ processed_data : pd.DataFrame
836
+ Processed data
837
+ columns : List[str], optional
838
+ List of columns to compare
839
+ max_columns : int
840
+ Maximum number of columns to display
841
+ filename : str
842
+ Filename for saving
843
+
844
+ Returns:
845
+ --------
846
+ str : path to saved file or None if error
847
+ """
848
+ try:
849
+ if columns is None:
850
+ # Select numeric columns common to both datasets
851
+ numeric_cols_original = original_data.select_dtypes(include=[np.number]).columns
852
+ numeric_cols_processed = processed_data.select_dtypes(include=[np.number]).columns
853
+ common_cols = list(set(numeric_cols_original) & set(numeric_cols_processed))
854
+
855
+ # Sort by variance in original data
856
+ variances = original_data[common_cols].var().sort_values(ascending=False)
857
+ columns = variances.head(max_columns).index.tolist()
858
+
859
+ n_cols = min(4, len(columns))
860
+ n_rows = (len(columns) + n_cols - 1) // n_cols
861
+
862
+ fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 4, n_rows * 3.5))
863
+ fig.suptitle('Distribution Comparison Before and After Processing',
864
+ fontsize=16, fontweight='bold', y=0.98)
865
+
866
+ # Normalise to a flat ndarray so len() and integer indexing always work below
867
+ axes = np.atleast_1d(axes).ravel()
869
+
870
+ for idx, col in enumerate(columns):
871
+ if idx >= len(axes):
872
+ break
873
+
874
+ ax = axes[idx]
875
+
876
+ if col in original_data.columns and col in processed_data.columns:
877
+ original_values = original_data[col].dropna()
878
+ processed_values = processed_data[col].dropna()
879
+
880
+ if len(original_values) > 0 and len(processed_values) > 0:
881
+ # Use common bins for comparison
882
+ all_values = pd.concat([original_values, processed_values])
883
+ bins = np.histogram_bin_edges(all_values, bins=30)
884
+
885
+ # Histograms
886
+ ax.hist(original_values, bins=bins, alpha=0.5,
887
+ label='Before Processing', density=True, color='blue')
888
+ ax.hist(processed_values, bins=bins, alpha=0.5,
889
+ label='After Processing', density=True, color='orange')
890
+
891
+ # Add KDE
892
+ try:
893
+ if len(original_values) > 10:
894
+ kde_original = gaussian_kde(original_values)
895
+ x_range = np.linspace(original_values.min(), original_values.max(), 100)
896
+ ax.plot(x_range, kde_original(x_range), 'b-', linewidth=1.5, alpha=0.8)
897
+
898
+ if len(processed_values) > 10:
899
+ kde_processed = gaussian_kde(processed_values)
900
+ x_range = np.linspace(processed_values.min(), processed_values.max(), 100)
901
+ ax.plot(x_range, kde_processed(x_range), 'orange', linewidth=1.5, alpha=0.8)
902
+ except:
903
+ pass
904
+
905
+ # Add statistics
906
+ stats_text = []
907
+ if len(original_values) > 0:
908
+ stats_text.append(f"Before: μ={original_values.mean():.2f}, σ={original_values.std():.2f}")
909
+ if len(processed_values) > 0:
910
+ stats_text.append(f"After: μ={processed_values.mean():.2f}, σ={processed_values.std():.2f}")
911
+
912
+ if stats_text:
913
+ ax.text(0.02, 0.98, '\n'.join(stats_text),
914
+ transform=ax.transAxes, fontsize=8,
915
+ verticalalignment='top',
916
+ bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
917
+
918
+ ax.set_title(f'{col}', fontsize=11, fontweight='bold')
919
+ ax.set_xlabel('Value', fontsize=9)
920
+ ax.set_ylabel('Density', fontsize=9)
921
+ ax.legend(fontsize=8)
922
+ ax.grid(True, alpha=0.3)
923
+ else:
924
+ ax.text(0.5, 0.5, 'No data',
925
+ ha='center', va='center', transform=ax.transAxes)
926
+ else:
927
+ ax.text(0.5, 0.5, 'Column not found',
928
+ ha='center', va='center', transform=ax.transAxes)
929
+
930
+ # Hide unused subplots
931
+ for idx in range(len(columns), len(axes)):
932
+ axes[idx].set_visible(False)
933
+
934
+ plt.tight_layout()
935
+
936
+ # Save
937
+ filepath = self._save_figure(fig, filename, "distributions")
938
+ self.plot_files['distribution_comparison'] = filepath
939
+ return filepath
940
+
941
+ except Exception as e:
942
+ logger.error(f"Error creating distribution comparison: {e}")
943
+ return None
944
+
945
+ def create_time_series_decomposition_plot(
946
+ self,
947
+ decomposition_result: Dict,
948
+ filename: str = "time_series_decomposition"
949
+ ) -> str:
950
+ """
951
+ Visualise time series decomposition
952
+
953
+ Parameters:
954
+ -----------
955
+ decomposition_result : Dict
956
+ Decomposition results
957
+ filename : str
958
+ Filename for saving
959
+
960
+ Returns:
961
+ --------
962
+ str : path to saved file or None if error
963
+ """
964
+ target_col = self.config.target_column
965
+
966
+ try:
967
+ fig, axes = plt.subplots(4, 1, figsize=(14, 10))
968
+ fig.suptitle(f'Time Series Decomposition: {target_col}',
969
+ fontsize=16, fontweight='bold', y=0.98)
970
+
971
+ # Original series
972
+ if 'observed' in decomposition_result:
973
+ observed = decomposition_result['observed']
974
+ axes[0].plot(observed, color='blue', linewidth=1.5)
975
+ axes[0].set_ylabel('Observed', fontsize=11, fontweight='bold')
976
+ axes[0].grid(True, alpha=0.3)
977
+ axes[0].set_title('Original Time Series', fontsize=12)
978
+
979
+ # Trend
980
+ if 'trend' in decomposition_result and decomposition_result['trend'] is not None:
981
+ trend = decomposition_result['trend']
982
+ axes[1].plot(trend, color='red', linewidth=2)
983
+ axes[1].set_ylabel('Trend', fontsize=11, fontweight='bold')
984
+ axes[1].grid(True, alpha=0.3)
985
+ axes[1].set_title('Trend Component', fontsize=12)
986
+
987
+ # Seasonality
988
+ if 'seasonal' in decomposition_result and decomposition_result['seasonal'] is not None:
989
+ seasonal = decomposition_result['seasonal']
990
+ axes[2].plot(seasonal, color='green', linewidth=1.5)
991
+ axes[2].set_ylabel('Seasonal', fontsize=11, fontweight='bold')
992
+ axes[2].grid(True, alpha=0.3)
993
+ axes[2].set_title('Seasonal Component', fontsize=12)
994
+
995
+ # Residuals
996
+ if 'residual' in decomposition_result and decomposition_result['residual'] is not None:
997
+ residual = decomposition_result['residual']
998
+ axes[3].plot(residual, color='purple', linewidth=1, alpha=0.7)
999
+ axes[3].set_ylabel('Residuals', fontsize=11, fontweight='bold')
1000
+ axes[3].set_xlabel('Date', fontsize=11, fontweight='bold')
1001
+ axes[3].grid(True, alpha=0.3)
1002
+ axes[3].set_title('Residual Component', fontsize=12)
1003
+
1004
+ # Add residual statistics
1005
+ if len(residual) > 0:
1006
+ stats_text = (f"Mean: {residual.mean():.4f}\n"
1007
+ f"Std: {residual.std():.4f}\n"
1008
+ f"Min: {residual.min():.4f}\n"
1009
+ f"Max: {residual.max():.4f}")
1010
+ axes[3].text(0.02, 0.98, stats_text, transform=axes[3].transAxes,
1011
+ fontsize=8, verticalalignment='top',
1012
+ bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
1013
+
1014
+ plt.tight_layout()
1015
+
1016
+ # Save
1017
+ filepath = self._save_figure(fig, filename, "time_series")
1018
+ self.plot_files['time_series_decomposition'] = filepath
1019
+ return filepath
1020
+
1021
+ except Exception as e:
1022
+ logger.error(f"Error creating time series decomposition: {e}")
1023
+ return None
1024
+
1025
+ def create_data_quality_report(
1026
+ self,
1027
+ validation_results: Dict,
1028
+ filename: str = "data_quality_report"
1029
+ ) -> str:
1030
+ """
1031
+ Create visual data quality report
1032
+
1033
+ Parameters:
1034
+ -----------
1035
+ validation_results : Dict
1036
+ Validation results
1037
+ filename : str
1038
+ Filename for saving
1039
+
1040
+ Returns:
1041
+ --------
1042
+ str : path to saved file or None if error
1043
+ """
1044
+ try:
1045
+ fig = plt.figure(figsize=(16, 12))
1046
+ fig.suptitle('Data Quality Report', fontsize=18, fontweight='bold', y=0.98)
1047
+
1048
+ # Use GridSpec for more complex layout
1049
+ gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)
1050
+
1051
+ # 1. Quality radar chart (top left)
1052
+ ax1 = fig.add_subplot(gs[0, 0], projection='polar')
1053
+
1054
+ categories = ['Size', 'Missing', 'Duplicates', 'Stability', 'Informativeness']
1055
+
1056
+ # Extract values from validation results
1057
+ if 'quality_metrics' in validation_results:
1058
+ values = [
1059
+ validation_results['quality_metrics'].get('size_score', 0.5),
1060
+ validation_results['quality_metrics'].get('missing_score', 0.5),
1061
+ validation_results['quality_metrics'].get('duplicates_score', 0.5),
1062
+ validation_results['quality_metrics'].get('stability_score', 0.5),
1063
+ validation_results['quality_metrics'].get('informativeness_score', 0.5)
1064
+ ]
1065
+ else:
1066
+ values = [0.8, 0.7, 0.9, 0.6, 0.8]
1067
+
1068
+ N = len(categories)
1069
+ angles = [n / float(N) * 2 * np.pi for n in range(N)]
1070
+ angles += angles[:1]
1071
+ values += values[:1]
1072
+
1073
+ ax1.plot(angles, values, 'o-', linewidth=2, color='blue')
1074
+ ax1.fill(angles, values, alpha=0.25, color='blue')
1075
+ ax1.set_xticks(angles[:-1])
1076
+ ax1.set_xticklabels(categories, fontsize=10)
1077
+ ax1.set_ylim(0, 1)
1078
+ ax1.set_title('Data Quality Radar Chart', fontsize=12, fontweight='bold')
1079
+ ax1.grid(True)
1080
+
1081
+ # 2. Check status (top centre)
1082
+ ax2 = fig.add_subplot(gs[0, 1])
1083
+
1084
+ basic_checks = validation_results.get('basic_checks', {})
1085
+ checks_passed = sum(1 for check in basic_checks.values() if check.get('passed', False))
1086
+ checks_total = len(basic_checks)
1087
+ checks_failed = checks_total - checks_passed
1088
+
1089
+ if checks_total > 0:
1090
+ colors = ['#4CAF50' if checks_passed > 0 else '#FF6B6B',
1091
+ '#FF6B6B' if checks_failed > 0 else '#4CAF50']
1092
+ bars = ax2.bar(['Passed', 'Failed'],
1093
+ [checks_passed, checks_failed],
1094
+ color=colors, edgecolor='black')
1095
+
1096
+ ax2.set_title(f'Basic Checks: {checks_passed}/{checks_total}',
1097
+ fontsize=12, fontweight='bold')
1098
+ ax2.set_ylabel('Number of Checks', fontsize=10)
1099
+ ax2.grid(True, alpha=0.3, axis='y')
1100
+
1101
+ # Add values on bars
1102
+ for bar, value in zip(bars, [checks_passed, checks_failed]):
1103
+ height = bar.get_height()
1104
+ ax2.text(bar.get_x() + bar.get_width()/2., height,
1105
+ f'{value}', ha='center', va='bottom', fontsize=10, fontweight='bold')
1106
+ else:
1107
+ ax2.text(0.5, 0.5, 'No check data available',
1108
+ ha='center', va='center', transform=ax2.transAxes)
1109
+ ax2.set_title('Basic Checks', fontsize=12, fontweight='bold')
1110
+
+            # 3. Overall score (top right)
+            ax3 = fig.add_subplot(gs[0, 2])
+
+            overall_score = validation_results.get('overall_score', 0)
+            status = validation_results.get('status', 'UNKNOWN')
+
+            # Score pie chart (score vs. remainder out of 100)
+            sizes = [overall_score, 100 - overall_score]
+
+            if overall_score >= 80:
+                colors = ['#4CAF50', '#E0E0E0']  # Green
+            elif overall_score >= 60:
+                colors = ['#FFC107', '#E0E0E0']  # Yellow
+            else:
+                colors = ['#F44336', '#E0E0E0']  # Red
+
+            wedges, texts, autotexts = ax3.pie(sizes, colors=colors, startangle=90,
+                                               autopct='%1.1f%%', pctdistance=0.85)
+
+            # Status text in the center of the pie
+            status_colors = {'PASS': '#4CAF50', 'WARNING': '#FFC107', 'FAIL': '#F44336'}
+            status_color = status_colors.get(status, '#757575')
+
+            ax3.text(0, 0, f'{overall_score}/100\n{status}',
+                     ha='center', va='center', fontsize=14, fontweight='bold',
+                     color=status_color)
+            ax3.set_title('Overall Quality Score', fontsize=12, fontweight='bold')
+
+            # 4. Issue distribution by type (middle left)
+            ax4 = fig.add_subplot(gs[1, 0])
+
+            issues = validation_results.get('issues', {})
+            issue_counts = {
+                'Critical': len(issues.get('critical', [])),
+                'Warnings': len(issues.get('warning', [])),
+                'Informational': len(issues.get('info', []))
+            }
+
+            if any(issue_counts.values()):
+                colors = ['#F44336', '#FF9800', '#2196F3']
+                bars = ax4.bar(issue_counts.keys(), issue_counts.values(),
+                               color=colors, edgecolor='black')
+
+                ax4.set_title('Data Issues by Type', fontsize=12, fontweight='bold')
+                ax4.set_ylabel('Number of Issues', fontsize=10)
+                ax4.tick_params(axis='x', rotation=45)
+                ax4.grid(True, alpha=0.3, axis='y')
+
+                # Add values on bars
+                for bar, value in zip(bars, issue_counts.values()):
+                    height = bar.get_height()
+                    ax4.text(bar.get_x() + bar.get_width()/2., height,
+                             f'{value}', ha='center', va='bottom', fontsize=10, fontweight='bold')
+            else:
+                ax4.text(0.5, 0.5, 'No issues detected',
+                         ha='center', va='center', transform=ax4.transAxes, fontsize=12)
+                ax4.set_title('Data Issues', fontsize=12, fontweight='bold')
+
+            # 5. Detailed information (remaining cells)
+            ax5 = fig.add_subplot(gs[1:, 1:])
+            ax5.axis('off')
+
+            # Build the text report
+            report_text = []
+            report_text.append("DETAILED REPORT:")
+            report_text.append("=" * 40)
+
+            # Basic information
+            report_text.append("\nBASIC INFORMATION:")
+            report_text.append(f"• Overall score: {overall_score}/100")
+            report_text.append(f"• Status: {status}")
+            report_text.append(f"• Checks passed: {checks_passed}/{checks_total}")
+
+            # Check details
+            if basic_checks:
+                report_text.append("\nCHECK DETAILS:")
+                for check_name, check_result in basic_checks.items():
+                    status_icon = "✓" if check_result.get('passed', False) else "✗"
+                    report_text.append(f"• {status_icon} {check_name}: {check_result.get('message', '')}")
+
+            # Issues
+            if any(issue_counts.values()):
+                report_text.append("\nDETECTED ISSUES:")
+
+                if issue_counts['Critical'] > 0:
+                    report_text.append("\nCRITICAL:")
+                    for issue in issues.get('critical', []):
+                        report_text.append(f"  • {issue}")
+
+                if issue_counts['Warnings'] > 0:
+                    report_text.append("\nWARNINGS:")
+                    for issue in issues.get('warning', []):
+                        report_text.append(f"  • {issue}")
+
+                if issue_counts['Informational'] > 0:
+                    report_text.append("\nINFORMATIONAL:")
+                    for issue in issues.get('info', []):
+                        report_text.append(f"  • {issue}")
+
+            # Recommendations
+            recommendations = validation_results.get('recommendations', [])
+            if recommendations:
+                report_text.append("\nRECOMMENDATIONS:")
+                for i, rec in enumerate(recommendations, 1):
+                    report_text.append(f"{i}. {rec}")
+
+            ax5.text(0.02, 0.98, '\n'.join(report_text), transform=ax5.transAxes,
+                     fontsize=9, verticalalignment='top', fontfamily='monospace',
+                     bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.1))
+
+            plt.tight_layout()
+
+            # Save
+            filepath = self._save_figure(fig, filename, "reports")
+            self.plot_files['data_quality_report'] = filepath
+            return filepath
+
+        except Exception as e:
+            logger.error(f"Error creating data quality report: {e}")
+            return None
+
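For reference, the dict shape that `create_data_quality_report` consumes, reconstructed from the `.get(...)` calls in the method above. The key names come directly from the code; the concrete values, check names, and messages are illustrative:

```python
# Shape of validation_results as read by create_data_quality_report
validation_results = {
    "overall_score": 72,               # 0-100; drives the pie color
    "status": "WARNING",               # 'PASS' | 'WARNING' | 'FAIL'
    "quality_metrics": {               # radar axes, each in [0, 1]
        "size_score": 0.8,
        "missing_score": 0.6,
        "duplicates_score": 0.9,
        "stability_score": 0.5,
        "informativeness_score": 0.7,
    },
    "basic_checks": {                  # name -> {'passed': bool, 'message': str}
        "datetime_index": {"passed": True, "message": "Index is monotonic"},
        "min_rows": {"passed": False, "message": "Fewer than 1000 rows"},
    },
    "issues": {
        "critical": [],
        "warning": ["3.2% missing values in target column"],
        "info": ["Constant column detected"],
    },
    "recommendations": ["Impute missing target values before modeling"],
}

# visualizer.create_data_quality_report(validation_results)
```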
+    # ============================================
+    # METHODS FOR BATCH SAVING
+    # ============================================
+
+    def save_all_preprocessing_plots(self) -> Dict[str, str]:
+        """
+        Save all preprocessing plots from the current session
+
+        Returns:
+        --------
+        Dict[str, str] : dictionary with paths to saved plots
+        """
+        logger.info("Saving all preprocessing plots...")
+
+        plots_saved = {}
+
+        # Get all currently open matplotlib figures
+        figure_numbers = plt.get_fignums()
+
+        if not figure_numbers:
+            logger.warning("No open plots to save")
+            return plots_saved
+
+        # Save each plot
+        for fig_num in figure_numbers:
+            fig = plt.figure(fig_num)
+            filename = f"preprocessing_plot_{fig_num}.png"
+            filepath = self._save_figure(fig, filename, "preprocessing")
+            if filepath:
+                plots_saved[f"plot_{fig_num}"] = filepath
+
+        logger.info(f"Saved {len(plots_saved)} preprocessing plots")
+        return plots_saved
+
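A note on ordering: this method snapshots whatever figures are currently open via `plt.get_fignums()`, so it must run before any `plt.close('all')` in the pipeline. A minimal usage sketch, assuming `visualizer` is a hypothetical instance of this class:

```python
import matplotlib.pyplot as plt

# Call before closing figures; plt.get_fignums() only sees open ones.
saved = visualizer.save_all_preprocessing_plots()
for name, path in saved.items():
    print(f"{name} -> {path}")
plt.close('all')  # safe to clean up afterwards
```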
+    def create_all_visualizations(
+        self,
+        data: pd.DataFrame,
+        processed_data: pd.DataFrame = None,
+        feature_importance: Dict = None,
+        decomposition_result: Dict = None,
+        validation_results: Dict = None,
+        preprocessing_stages: Dict = None
+    ) -> Dict[str, str]:
+        """
+        Create all visualizations in one call
+
+        Parameters:
+        -----------
+        data : pd.DataFrame
+            Original data
+        processed_data : pd.DataFrame, optional
+            Processed data
+        feature_importance : Dict, optional
+            Feature importance
+        decomposition_result : Dict, optional
+            Decomposition results
+        validation_results : Dict, optional
+            Validation results
+        preprocessing_stages : Dict, optional
+            Preprocessing stages
+
+        Returns:
+        --------
+        Dict[str, str] : dictionary with paths to created plots
+        """
+        logger.info("\n" + "="*80)
+        logger.info("STARTING CREATION OF ALL VISUALIZATIONS")
+        logger.info("="*80)
+
+        result_files = {}
+
+        # 1. Summary dashboard
+        if data is not None:
+            logger.info("Creating summary dashboard...")
+            summary_path = self.create_summary_dashboard(data, preprocessing_stages)
+            if summary_path:
+                result_files['summary'] = summary_path
+
+        # 2. Correlation heatmaps
+        if data is not None:
+            logger.info("Creating correlation heatmaps...")
+            main_corr, target_corr = self.create_correlation_heatmap(data)
+            if main_corr:
+                result_files['correlation_main'] = main_corr
+            if target_corr:
+                result_files['correlation_target'] = target_corr
+
+        # 3. Distribution comparison
+        if data is not None and processed_data is not None:
+            logger.info("Creating distribution comparison...")
+            dist_path = self.create_distribution_comparison(data, processed_data)
+            if dist_path:
+                result_files['distribution'] = dist_path
+
+        # 4. Feature importance
+        if feature_importance:
+            logger.info("Creating feature importance plot...")
+            feat_path = self.create_feature_importance_plot(feature_importance)
+            if feat_path:
+                result_files['feature_importance'] = feat_path
+
+        # 5. Time series decomposition
+        if decomposition_result:
+            logger.info("Creating time series decomposition...")
+            decomp_path = self.create_time_series_decomposition_plot(decomposition_result)
+            if decomp_path:
+                result_files['decomposition'] = decomp_path
+
+        # 6. Data quality report
+        if validation_results:
+            logger.info("Creating data quality report...")
+            quality_path = self.create_data_quality_report(validation_results)
+            if quality_path:
+                result_files['quality_report'] = quality_path
+
+        # Save information about all plots
+        self.save_plots_info()
+
+        logger.info("\n" + "="*80)
+        logger.info("ALL VISUALIZATIONS CREATED SUCCESSFULLY")
+        logger.info("="*80)
+
+        for plot_name, plot_path in result_files.items():
+            if plot_path:
+                logger.info(f"✓ {plot_name}: {plot_path}")
+
+        return result_files
+
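An end-to-end call sketch for `create_all_visualizations`. Only `data` is required; each optional argument unlocks the corresponding plot. Here `visualizer` and the upstream objects are placeholders for values produced earlier in the pipeline, not names from this commit:

```python
result_files = visualizer.create_all_visualizations(
    data=raw_df,                                # original DataFrame (required)
    processed_data=processed_df,                # enables the distribution comparison
    feature_importance={"lag_1": 0.31,
                        "rolling_mean_7": 0.18},  # illustrative scores
    decomposition_result=decomposition_result,  # see the sketch above
    validation_results=validation_results,      # see the dict shape above
    preprocessing_stages=stages_dict,           # per-stage snapshots, if tracked
)
for name, path in result_files.items():
    print(f"{name}: {path}")
```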
+    def get_all_plots(self) -> Dict:
+        """Get information about all created plots"""
+        return self.plot_files
+
+    def save_plots_info(self, filename: str = "plots_info.json") -> None:
+        """Save plot information to a JSON file"""
+        try:
+            plots_info = {
+                'total_plots': len(self.plot_files),
+                'plots': self.plot_files,
+                'directories': {
+                    'correlations': self.correlations_dir,
+                    'distributions': self.distributions_dir,
+                    'features': self.features_dir,
+                    'time_series': self.time_series_dir,
+                    'preprocessing': self.preprocessing_dir,
+                    'summary': self.summary_dir,
+                    'reports': self.reports_dir
+                },
+                'generation_time': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
+                'config': {
+                    'target_column': self.config.target_column,
+                    'results_dir': self.config.results_dir
+                }
+            }
+
+            filepath = os.path.join(self.reports_dir, filename)
+
+            with open(filepath, 'w', encoding='utf-8') as f:
+                json.dump(plots_info, f, indent=4, ensure_ascii=False, default=str)
+
+            logger.info(f"✓ Plot information saved: {filepath}")
+
+        except Exception as e:
+            logger.error(f"✗ Error saving plot information: {e}")
+
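The manifest written by `save_plots_info` can be read back to build a report index or a Streamlit gallery. A small sketch; the path below assumes a `<results_dir>/plots/reports/` layout and is illustrative, since the directory attributes are configured elsewhere in the class:

```python
import json
import os

# Assumed location: <results_dir>/plots/reports/plots_info.json
manifest_path = os.path.join("streamlit_results", "plots", "reports", "plots_info.json")
with open(manifest_path, encoding="utf-8") as f:
    info = json.load(f)

print(f"{info['total_plots']} plots recorded at {info['generation_time']}")
for name, path in info["plots"].items():
    print(f"  {name}: {path}")
```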
+    def move_existing_plots(self, source_dir: str = None) -> Dict[str, str]:
+        """
+        Move existing plots from a specified directory into the structured folders
+
+        Parameters:
+        -----------
+        source_dir : str, optional
+            Directory with existing plots
+
+        Returns:
+        --------
+        Dict[str, str] : dictionary with information about moved files
+        """
+        if source_dir is None:
+            source_dir = self.plots_dir
+
+        if not os.path.exists(source_dir):
+            logger.warning(f"Source directory doesn't exist: {source_dir}")
+            return {}
+
+        # File to folder mapping
+        file_to_folder_map = {
+            # Time series
+            'data_split.png': 'time_series',
+            'stationarity_raskhodvoda.png': 'time_series',
+            'stationarity_analysis.png': 'time_series',
+            'temporal_outliers.png': 'time_series',
+
+            # Correlations
+            'feature_selection_correlation.png': 'correlations',
+
+            # Preprocessing
+            'missing_values_analysis.png': 'preprocessing',
+            'outlier_handling_results.png': 'preprocessing',
+            'outliers_analysis.png': 'preprocessing',
+            'scaling_results.png': 'preprocessing',
+
+            # Default
+            'default': 'summary'
+        }
+
+        moved_files = {}
+
+        for filename in os.listdir(source_dir):
+            if filename.endswith('.png'):
+                source_path = os.path.join(source_dir, filename)
+
+                # Determine the destination folder
+                target_folder = file_to_folder_map.get(filename, file_to_folder_map['default'])
+                target_dir = os.path.join(self.plots_dir, target_folder)
+
+                # Create the destination folder if it doesn't exist
+                os.makedirs(target_dir, exist_ok=True)
+
+                # Target path
+                target_path = os.path.join(target_dir, filename)
+
+                try:
+                    # Move the file (os.rename only works within one filesystem)
+                    os.rename(source_path, target_path)
+                    moved_files[filename] = target_path
+                    logger.info(f"Moved: {filename} -> {target_folder}/")
+                except Exception as e:
+                    logger.error(f"Error moving {filename}: {e}")
+
+        logger.info(f"Moved {len(moved_files)} files")
+        return moved_files
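
One caveat on the move logic above: `os.rename` only works within a single filesystem and raises `OSError` across mount points, which can matter in Docker setups with mounted volumes. A hedged drop-in for the inner `try` block using `shutil.move`, which falls back to copy-and-delete across filesystems:

```python
import shutil

try:
    # shutil.move copies and deletes when source and target
    # live on different filesystems, unlike os.rename
    shutil.move(source_path, target_path)
    moved_files[filename] = target_path
    logger.info(f"Moved: {filename} -> {target_folder}/")
except Exception as e:
    logger.error(f"Error moving {filename}: {e}")
```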