ArabovMK committed on
Commit
d8f69a9
·
1 Parent(s): d9e6371

Update all files

.gitignore CHANGED
@@ -1,2 +1,9 @@
1
  .venv/
2
  .venv
1
  .venv/
2
  .venv
3
+ __pycache__/
4
+ *__pycache__/
5
+ *.pyc
6
+ *.pyo
7
+ *.pyd
8
+ .Python
9
+ streamlit_results/
Dockerfile CHANGED
@@ -1,4 +1,4 @@
1
- FROM python:3.11-slim
2
 
3
  WORKDIR /app
4
 
 
1
+ FROM python:3.12-slim
2
 
3
  WORKDIR /app
4
 
README.md CHANGED
@@ -1,358 +1,268 @@
1
  ---
2
- title: Pectin Production Predictor
3
- emoji: 🧪
4
- colorFrom: green
5
- colorTo: blue
6
  sdk: docker
7
  pinned: true
8
  app_file: app.py
 
9
  ---
10
 
11
-
12
- # 🧪 Pectin Production Predictor
13
 
14
  <div align="center">
15
 
16
- **Predict Pectin Production Parameters: Multi-Model Comparison and Analysis**
17
 
18
- *Machine learning models for predicting pectin yield, galacturonic acid content, molecular weight, and esterification degree*
19
 
20
- [![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face%20Space-blue)](https://huggingface.co/spaces/arabovs-ai-lab/PectinProductionModels)
21
- [![License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
22
  [![Streamlit](https://img.shields.io/badge/Interface-Streamlit-FF4B4B)](https://streamlit.io)
 
23
 
24
  </div>
25
 
26
  ## 🌟 Overview
27
 
28
- Pectin Production Predictor provides a comprehensive suite of machine learning models for predicting key parameters in pectin production processes. The application enables researchers and industry professionals to optimize extraction conditions and predict product quality metrics using trained regression models.
29
-
30
- **Supported algorithms:**
31
- - **Gradient Boosting** (Best overall performance)
32
- - **Random Forest**
33
- - **XGBoost**
34
- - **Linear Regression**
35
-
36
- All models are published to the Hugging Face Hub under [`arabovs-ai-lab/PectinProductionModels`](https://huggingface.co/arabovs-ai-lab/PectinProductionModels) for easy access and integration.
37
-
38
- ---
39
-
40
- ## 🚀 Features
41
-
42
- ### 🔮 Prediction Capabilities
43
- - **Single prediction**: Interactive parameter tuning for individual experiments
44
- - **Batch processing**: Upload Excel/CSV files for bulk predictions
45
- - **Multi-model comparison**: Compare predictions across different algorithms
46
- - **Real-time visualization**: Interactive charts and performance metrics
47
-
48
- ### 📊 Analytical Tools
49
- - **Statistical comparison**: Side-by-side model performance analysis
50
- - **Quality metrics**: MAE, MSE, RMSE, R², MAPE, and correlation coefficients
51
- - **Visual analytics**: Distribution plots, comparison charts, and trend analysis
52
- - **Data validation**: Automatic file structure detection and data preprocessing
53
 
54
- ### 🎯 Target Parameters
55
- - **Pectin Yield (%)** - Extraction efficiency
56
- - **Galacturonic Acid (%)** - Pectin purity indicator
57
- - **Molecular Weight (Da)** - Molecular characteristics
58
- - **Esterification Degree (%)** - Functional properties
59
 
60
- ---
61
-
62
- ## 🧪 Input Parameters
63
-
64
- | Parameter | Description | Range | Unit |
65
- |-----------|-------------|-------|------|
66
- | **Sample Type** | Raw material type | 7 variants | - |
67
- | **Time** | Extraction duration | 0-300 | minutes |
68
- | **Temperature** | Extraction temperature | 0-200 | °C |
69
- | **Pressure** | Process pressure | 0-10 | atm |
70
- | **pH** | Acidity level | 0-14 | - |
71
-
72
- ### 📋 Sample Type Reference
73
 
74
- | Sample Code | Description |
75
- |-------------|-------------|
76
- | Абр. | Абрикосовый (Apricot) |
77
- | Рв. | Ревень (Rhubarb) |
78
- | Айв. | Айвы (Quince) |
79
- | Ткв | Тыквенный (Pumpkin) |
80
- | КрП | Корзинка подсолнечника (Sunflower head) |
81
- | ЯП(Ф) | Яблочный пектин Файзобод (Apple pectin Fayzobod) |
82
- | ЯП(М) | Яблочный пектин Муминобод (Apple pectin Muminobod) |
83
 
84
- ---
85
-
86
- ## 📊 Experimental Data Examples
 
 
87
 
88
- ### Sample Input-Output Data
89
 
90
- | Exp | Sample | Time (min) | Temp (°C) | Pressure (atm) | pH | Pectin Yield (%) | Galacturonic Acid (%) | Molecular Weight (Da) | Esterification Degree (%) |
91
- |-----|--------|------------|-----------|----------------|----|------------------|---------------------|----------------------|-------------------------|
92
- | 1 | ЯП(М) | 7 | 120 | 2.08 | 2.0 | 25.864 | 52.706 | 103,773.64 | 71.17 |
93
- | 2 | ЯП(М) | 7 | 120 | 1.74 | 2.08 | 24.830 | 51.645 | 103,098.49 | 70.015 |
94
- | 3 | Абр. | 5 | 130 | 2.09 | 1.74 | 14.755 | 67.550 | 127,235.35 | 82.813 |
95
- | 4 | ЯП(М) | 7 | 120 | 2.05 | 2.0 | 26.353 | 53.804 | 105,994.85 | 65.415 |
96
- | 5 | КрП | 60 | 85 | 1.03 | 2.0 | 19.505 | 66.606 | 145,498.37 | 67.756 |
97
 
98
- ---
99
 
100
- ## 📈 Model Performance
101
 
102
- ### Comprehensive Model Comparison (Test Set Averages)
103
 
104
- | Model | Avg Test R² | Avg Test MAE | Avg Test RMSE |
105
- |-------|------------:|-------------:|--------------:|
106
- | **Gradient Boosting** | **0.9427** | **868.440** | **1074.277** |
107
- | Random Forest | 0.9259 | 978.007 | 1214.291 |
108
- | XGBoost | 0.9203 | 1074.231 | 1327.170 |
109
- | Extra Trees | 0.9135 | 1060.174 | 1314.689 |
110
- | K-Neighbors | 0.8684 | 1287.513 | 2230.119 |
111
- | MultiLayer Perceptron | 0.8046 | 4253.843 | 5488.065 |
112
- | Linear Regression | 0.6965 | 3730.755 | 4818.582 |
113
- | Ridge Regression | 0.5553 | 3665.310 | 4850.510 |
114
- | SVR | 0.4832 | 6612.236 | 7939.850 |
115
- | Lasso Regression | 0.3846 | 3702.033 | 4828.528 |
116
 
117
- ### Production-Ready Models Selection
118
 
119
- | Model | Best For | Training Time | Robustness | Recommendation |
120
- |-------|----------|---------------|------------|----------------|
121
- | **Gradient Boosting** | Overall accuracy | Medium | High | ⭐⭐⭐⭐⭐ **Primary Choice** |
122
- | **Random Forest** | Stability & Speed | Fast | Very High | ⭐⭐⭐⭐ **Secondary Choice** |
123
- | **XGBoost** | Large datasets | Fast | High | ⭐⭐⭐⭐ **Alternative** |
124
- | Linear Regression | Baseline | Very Fast | Medium | ⭐⭐ **Reference** |
125
 
126
- > **Note:** Metrics represent averages across all four target variables. Gradient Boosting demonstrates superior performance across all evaluation criteria.
127
 
128
- ---
129
 
130
- ## 🛠️ Quick Start
131
 
132
- ### Installation with Version Control
133
- ```bash
134
- # Install with specific Streamlit version to avoid compatibility issues
135
- pip install streamlit==1.28.0 pandas numpy plotly scikit-learn joblib huggingface-hub xgboost
136
- ```
137
 
138
- ### Run the Application
139
  ```bash
140
- streamlit run app_pectin.py
141
- ```
 
142
 
143
- ### Troubleshooting Version Issues
144
- If you encounter Streamlit version warnings, ensure you're using the tested version:
145
- ```bash
146
- pip install --force-reinstall streamlit==1.28.0
147
- ```
148
 
149
- ### Programmatic Usage
150
- ```python
151
- from huggingface_hub import hf_hub_download
152
- import joblib
153
-
154
- # Download model artifacts
155
- model_path = hf_hub_download(
156
- repo_id="arabovs-ai-lab/PectinProductionModels",
157
- filename="gradient_boosting_model.pkl", # Best performing model
158
- repo_type="model"
159
- )
160
-
161
- # Load model
162
- model = joblib.load(model_path)
163
-
164
- # Example prediction
165
- input_data = {
166
- 'sample': 'ЯП(М)',
167
- 'time_min': 120,
168
- 'temperature_c': 90,
169
- 'pressure_atm': 1.5,
170
- 'ph': 2.0
171
- }
172
- prediction = model.predict(input_data)
173
  ```
174
 
175
- ---
176
-
177
- ## 📦 Model Architecture
178
-
179
- ### Feature Engineering
180
- - **Sample encoding**: Intelligent categorical encoding of 7 pectin source types
181
- - **Method detection**: Automatic extraction method classification
182
- - **Feature scaling**: Standardized input parameters (Time, Temperature, Pressure, pH)
183
- - **Multi-output regression**: Simultaneous prediction of 4 target variables
184
- - **Cross-validation**: Robust training with 5-fold cross-validation
185
 
186
- ### Model Structure
187
- ```
188
- Input Features → Preprocessing → Multi-output Regression → Target Predictions
-  (5 parameters)   (Encoding & Scaling)   (GB, RF, XGB, LR)    (4 quality metrics)
192
  ```
193
 
194
- ---
195
-
196
- ## 🔧 Usage Examples
197
 
198
- ### Single Prediction
199
  ```python
200
- # Optimal extraction parameters example
201
- optimal_input = {
202
-     'sample': 'ЯП(М)',
203
-     'time_min': 7,
204
-     'temperature_c': 120,
205
-     'pressure_atm': 2.08,
206
-     'ph': 2.0
207
  }
208
 
209
- # Expected output based on experimental data:
210
- # Pectin Yield: ~25.8%, Galacturonic Acid: ~52.7%
211
- # Molecular Weight: ~103,774 Da, Esterification: ~71.2%
212
- ```
213
 
214
- ### Batch Processing Template
215
- ```csv
216
- sample,time_min,temperature_c,pressure_atm,ph
217
- ЯП(М),7,120,2.08,2.0
218
- Абр.,5,130,2.09,1.74
219
- КрП,60,85,1.03,2.0
220
  ```
221
 
222
- ---
223
-
224
- ## 📊 Performance Analysis
225
-
226
- ### Key Findings from Model Evaluation
227
- 1. **Gradient Boosting** achieves the highest R² score (0.9427) indicating excellent explanatory power
228
- 2. **Tree-based models** (GB, RF, XGBoost) significantly outperform linear models
229
- 3. **Random Forest** provides the best balance of performance and training speed
230
- 4. **Ensemble methods** demonstrate superior robustness to experimental variability
231
 
232
- ### Metric Interpretation
233
- - **R² (0.9427)**: Models explain 94.27% of variance in target variables
234
- - **MAE (868.44)**: Average prediction error across all targets
235
- - **RMSE (1074.28)**: Standard deviation of prediction residuals
 
236
 
237
- ---
238
-
239
- ## 🎯 Application Interface
240
 
241
- ### Main Tabs
242
- 1. **🎯 Single Prediction**: Interactive parameter tuning with real-time predictions
243
- 2. **📁 Batch Processing**: File upload and bulk analysis with validation
244
- 3. **📊 Model Comparison**: Side-by-side performance evaluation
245
- 4. **🔄 Multi-Model Processing**: Apply multiple algorithms simultaneously
246
 
247
- ### Input Validation
248
- - **Range checking**: Automatic validation of physical parameter limits
249
- - **Sample verification**: Validation against supported sample types
250
- - **Data integrity**: Comprehensive file structure validation for batch processing
251
-
252
- ---
253
 
254
- ## 📁 Data Compatibility
255
 
256
- ### Supported Formats
257
- - **Excel** (.xlsx, .xls): Automatic structure detection and sheet selection
258
- - **CSV/TXT**: Multiple encoding support (UTF-8, Windows-1251) and delimiter auto-detection
259
- - **Column mapping**: Automatic Russian ↔ English column name translation
260
 
261
- ### Required Input Format
262
- ```csv
263
- sample,time_min,temperature_c,pressure_atm,ph
264
- ЯП(М),7,120,2.08,2.0
265
- Абр.,5,130,2.09,1.74
 
 
266
  ```
267
 
268
- ### Optional Ground Truth Columns
269
- ```csv
270
- sample,time_min,temperature_c,pressure_atm,ph,pectin_yield,galacturonic_acid,molecular_weight,esterification_degree
271
- ЯП(М),7,120,2.08,2.0,25.864,52.706,103773.64,71.17
 
272
  ```
273
 
274
- ---
275
-
276
- ## 🔬 Scientific Background
277
-
278
- ### Pectin Production Optimization
279
- Pectin is a complex polysaccharide found in plant cell walls, with critical applications as:
280
- - **Gelling agent** in jams, jellies, and confectionery
281
- - **Stabilizer** in dairy products and beverages
282
- - **Pharmaceutical excipient** in drug delivery systems
283
- - **Functional ingredient** in cosmetic formulations
284
-
285
- ### Extraction Parameter Effects
286
- - **Time & Temperature**: Directly impact yield and molecular weight degradation
287
- - **pH**: Critical for galacturonic acid content and esterification degree
288
- - **Pressure**: Influences extraction efficiency and pectin quality
289
- - **Raw Material**: Determines inherent pectin characteristics and optimal conditions
290
-
291
- ---
292
-
293
  ## 🤝 Contributing
294
 
295
- We welcome contributions in the following areas:
296
 
297
- - **New algorithms**: Additional machine learning models (Neural Networks, etc.)
298
- - **Feature engineering**: Advanced preprocessing and feature selection techniques
299
- - **Visualization**: Enhanced analytical dashboards and interactive plots
300
- - **Documentation**: Additional usage examples and case studies
 
 
301
 
302
- ### Development Setup
303
  ```bash
304
- git clone https://huggingface.co/spaces/arabovs-ai-lab/PectinProductionModels
305
- cd PectinProductionModels
306
- pip install -r requirements.txt
307
- streamlit run app_pectin.py
308
- ```
309
 
310
- ### Version Management
311
- To ensure compatibility, maintain the specified package versions:
312
- ```txt
313
- streamlit==1.28.0
314
- scikit-learn>=1.0.0
315
- xgboost>=1.5.0
316
  ```
317
 
318
- ---
319
-
320
- ## 📜 Citation
321
 
322
- If you use this tool in your research, please cite:
323
-
324
- ```bibtex
325
- @misc{pectin_predictor_2024,
326
- title = {Pectin Production Predictor: Machine Learning Models for Pectin Quality Prediction},
327
- author = {Arabovs AI Lab},
328
- year = {2024},
329
- publisher = {Hugging Face},
330
- url = {https://huggingface.co/arabovs-ai-lab/PectinProductionModels}
331
- }
332
- ```
333
 
334
- ---
335
 
336
- ## 📄 License
337
 
338
- MIT License - See [LICENSE](LICENSE) file for details.
339
 
340
- ---
341
 
342
- ## 🏛️ Institutional Support
343
 
344
- This project is maintained by **Arabovs AI Lab** as part of our commitment to advancing applied machine learning in industrial and biotechnological applications.
345
 
346
  ---
347
 
348
  <div align="center">
349
 
350
- **Advancing Biotechnology with Machine Learning**
351
- Brought to you by **Arabovs AI Lab**
352
 
353
- [![Repository](https://img.shields.io/badge/🔗-Model%20Repository-171717)](https://huggingface.co/arabovs-ai-lab/PectinProductionModels)
354
- [![Live Demo](https://img.shields.io/badge/🚀-Live%20Demo-FF4B4B)](https://huggingface.co/spaces/arabovs-ai-lab/PectinProductionModels)
355
 
356
- </div>
 
357
 
358
- *Last updated: November 2025*
 
1
  ---
2
+ title: TimeFlow Pro
3
+ emoji: 📊
4
+ colorFrom: blue
5
+ colorTo: indigo
6
  sdk: docker
7
  pinned: true
8
  app_file: app.py
9
+ sdk_version: 1.52.2
10
  ---
11
 
12
+ # 📊 TimeFlow Pro
 
13
 
14
  <div align="center">
15
 
16
+ **Intelligent Time Series Data Analysis and Preprocessing Platform**
17
 
18
+ *Advanced pipeline for data preparation and feature engineering*
19
 
20
+ [![Hugging Face](https://img.shields.io/badge/🤗-Hugging%20Face%20Space-blue)](https://huggingface.co/spaces/your-username/timeflow-pro)
 
21
  [![Streamlit](https://img.shields.io/badge/Interface-Streamlit-FF4B4B)](https://streamlit.io)
22
+ [![Python](https://img.shields.io/badge/Python-3.9+-blue)](https://python.org)
23
 
24
  </div>
25
 
26
  ## 🌟 Overview
27
 
28
+ TimeFlow Pro is a comprehensive platform for time series data analysis, preprocessing, and feature engineering. Designed for data scientists and analysts, it provides an intuitive interface for transforming raw time series data into ML-ready datasets with advanced preprocessing capabilities.
 
29
 
30
+ ## 🚀 Key Features
31
 
32
+ ### 📈 **Data Analysis & Visualization**
33
+ - **Interactive Data Exploration**: Real-time preview and statistics
34
+ - **Missing Value Analysis**: Smart detection and handling strategies
35
+ - **Outlier Detection**: Multiple methods including IQR, Z-Score, Isolation Forest
36
+ - **Temporal Analysis**: Seasonality detection, trend analysis, decomposition
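+ 
+ As a point of reference, the IQR rule listed above flags values outside [Q1 − 1.5·IQR, Q3 + 1.5·IQR]. A minimal pandas sketch, assuming `series` is one numeric column (illustrative, not the app's internal code):
+ 
+ ```python
+ q1, q3 = series.quantile(0.25), series.quantile(0.75)
+ iqr = q3 - q1
+ # Points beyond 1.5 * IQR on either side are flagged as outliers
+ outliers = series[(series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)]
+ ```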
37
 
38
+ ### ⚙️ **Advanced Preprocessing Pipeline**
39
+ - **Feature Engineering**: Automatic lag features, rolling statistics, seasonal components
40
+ - **Stationarity Checking**: ADF tests and transformation suggestions
41
+ - **Data Scaling**: Robust, Standard, MinMax, and custom scaling methods
42
+ - **Feature Selection**: Correlation, variance, mutual information, RF importance
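+ 
+ The lag and rolling-statistics features from the first bullet above can be pictured with a small pandas sketch; the function name and defaults are illustrative, not TimeFlow Pro's internal API:
+ 
+ ```python
+ import pandas as pd
+ 
+ def add_basic_features(df: pd.DataFrame, target: str, max_lags: int = 5,
+                        windows: tuple = (7, 30)) -> pd.DataFrame:
+     out = df.copy()
+     for lag in range(1, max_lags + 1):
+         out[f"{target}_lag_{lag}"] = out[target].shift(lag)   # lag features
+     for w in windows:
+         out[f"{target}_roll_mean_{w}"] = out[target].rolling(w).mean()  # rolling stats
+         out[f"{target}_roll_std_{w}"] = out[target].rolling(w).std()
+     return out.dropna()  # drop rows lost to shift/rolling warm-up
+ ```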
43
 
44
+ ### 🏗️ **ML-Ready Outputs**
45
+ - **Train/Validation/Test Splits**: Time-based or random splitting
46
+ - **Multiple Export Formats**: CSV, Parquet, Excel, JSON
47
+ - **Model Integration**: Ready-to-use datasets for scikit-learn, XGBoost, LightGBM
48
+ - **Visual Reports**: Comprehensive pipeline execution reports
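+ 
+ A time-based split keeps chronological order so the test set never leaks future information into training. In plain pandas it amounts to the sketch below, assuming `df` is the loaded DataFrame sorted by time (ratios are illustrative):
+ 
+ ```python
+ n = len(df)
+ train = df.iloc[: int(0.7 * n)]            # earliest 70%
+ val = df.iloc[int(0.7 * n) : int(0.8 * n)]  # next 10%
+ test = df.iloc[int(0.8 * n) :]              # most recent 20%
+ ```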
49
 
50
+ ## 🎮 Quick Start
51
 
52
+ ### 1. **Upload Your Data**
53
+ - Support for CSV, Excel, Parquet formats
54
+ - Automatic date parsing and validation
55
+ - Smart column type detection
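+ 
+ The equivalent of this step in plain pandas looks roughly like the following (the `date` column name is an assumption for illustration):
+ 
+ ```python
+ import pandas as pd
+ 
+ df = pd.read_csv("your_data.csv", parse_dates=["date"])  # automatic date parsing
+ df = df.sort_values("date").set_index("date")            # chronological index
+ ```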
56
 
57
+ ### 2. **Configure Pipeline**
58
+ ```python
59
+ # Example configuration
60
+ config = {
61
+     'target_column': 'sales',
62
+     'test_size': 0.2,
63
+     'max_lags': 5,
64
+     'seasonal_period': 365,
65
+     'scaling_method': 'robust'
66
+ }
67
+ ```
68
 
69
+ ### 3. **Run Pipeline & Export**
70
+ - Execute full preprocessing pipeline
71
+ - Download processed data
72
+ - Get feature importance reports
73
+ - Export modeling datasets
74
 
75
+ ## 📊 Technical Architecture
76
 
77
+ ### 🔧 **Pipeline Components**
78
+ ```
79
+ Data Loading → Validation → Missing Handling → Outlier Treatment
80
+                                 ↓
81
+ Feature Engineering → Stationarity Check → Correlation Analysis
82
+                                 ↓
83
+ Data Splitting → Scaling → Feature Selection → Final Validation
84
+ ```
85
 
86
+ ### 🏆 **Core Features**
87
+ - **Multi-stage Validation**: Raw, processed, and final data validation
88
+ - **Memory Optimization**: Efficient handling of large datasets
89
+ - **Error Recovery**: Graceful handling of pipeline failures
90
+ - **Reproducible Results**: Configuration saving and logging
91
 
92
+ ## 📚 Use Cases
93
 
94
+ ### 🏢 **Business Analytics**
95
+ - Sales forecasting and trend analysis
96
+ - Inventory optimization
97
+ - Customer behavior prediction
98
+ - Financial time series analysis
99
 
100
+ ### 🏭 **Industrial Applications**
101
+ - Sensor data preprocessing
102
+ - Predictive maintenance
103
+ - Quality control monitoring
104
+ - Energy consumption forecasting
105
 
106
+ ### 🎓 **Academic Research**
107
+ - Time series modeling experiments
108
+ - Feature engineering research
109
+ - Algorithm comparison studies
110
+ - Educational tool for data science
111
 
112
+ ## 🛠️ Installation
113
 
114
+ ### Local Development
115
  ```bash
116
+ # Clone repository
117
+ git clone https://huggingface.co/spaces/your-username/timeflow-pro
118
+ cd timeflow-pro
119
 
120
+ # Install dependencies
121
+ pip install -r requirements.txt
122
 
123
+ # Run application
124
+ streamlit run app.py
125
  ```
126
 
127
+ ### Docker Deployment
128
+ ```bash
129
+ # Build Docker image
130
+ docker build -t timeflow-pro .
131
 
132
+ # Run container
133
+ docker run -p 8501:8501 timeflow-pro
134
  ```
135
 
136
+ ## 🌐 API Usage Example
 
 
137
 
 
138
  ```python
139
+ from timeflow_pro import TimeFlowPipeline
140
+ import pandas as pd
141
+
142
+ # Load your data
143
+ data = pd.read_csv('your_data.csv')
144
+
145
+ # Configure pipeline
146
+ config = {
147
+     'target_column': 'target',
148
+     'test_size': 0.2,
149
+     'max_lags': 7,
150
+     'seasonal_period': 30
151
  }
152
 
153
+ # Create and run pipeline
154
+ pipeline = TimeFlowPipeline(config)
155
+ processed_data = pipeline.run(data)
 
156
 
157
+ # Get modeling data
158
+ modeling_data = pipeline.get_modeling_data()
159
+ X_train, y_train = modeling_data['X_train'], modeling_data['y_train']
160
  ```
161
 
162
+ ## 📈 Performance Benchmarks
163
 
164
+ | Dataset Size | Processing Time | Memory Usage | Features Generated |
165
+ |--------------|----------------|--------------|-------------------|
166
+ | 10K rows | ~5 seconds | <500 MB | 50-100 features |
167
+ | 100K rows | ~30 seconds | <1 GB | 100-200 features |
168
+ | 1M rows | ~5 minutes | <2 GB | 200-500 features |
169
 
170
+ ## 🔧 Configuration Options
 
 
171
 
172
+ ### **Data Processing**
173
+ - `missing_threshold`: Threshold for column removal (0.0-0.5)
174
+ - `outlier_method`: IQR, Z-Score, or Isolation Forest
175
+ - `scaling_method`: Robust, Standard, MinMax, or None
 
176
 
177
+ ### **Feature Engineering**
178
+ - `max_lags`: Maximum lag features (1-20)
179
+ - `seasonal_period`: Seasonal window (7, 30, 90, 365)
180
+ - `rolling_windows`: List of rolling windows [7, 30, 90]
 
 
181
 
182
+ ### **Model Preparation**
183
+ - `feature_selection_method`: Correlation, Variance, RF, Mutual Info
184
+ - `max_features`: Maximum features to select (5-100)
185
+ - `split_method`: Time-based or random splitting
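+ 
+ Taken together, these options form a single configuration dictionary; the sketch below combines the documented keys with illustrative values:
+ 
+ ```python
+ config = {
+     # Data processing
+     'missing_threshold': 0.3,
+     'outlier_method': 'iqr',
+     'scaling_method': 'robust',
+     # Feature engineering
+     'max_lags': 7,
+     'seasonal_period': 30,
+     'rolling_windows': [7, 30, 90],
+     # Model preparation
+     'feature_selection_method': 'correlation',
+     'max_features': 50,
+     'split_method': 'time',
+ }
+ ```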
186
 
187
+ ## 📋 Requirements
188
 
189
+ ### **Core Dependencies**
190
+ ```txt
191
+ streamlit>=1.28.0
192
+ pandas>=2.0.0
193
+ numpy>=1.24.0
194
+ plotly>=5.17.0
195
+ scikit-learn>=1.3.0
196
  ```
197
 
198
+ ### **Optional Dependencies**
199
+ ```txt
200
+ xgboost>=2.0.0 # For XGBoost feature importance
201
+ lightgbm>=4.0.0 # For LightGBM integration
202
+ statsmodels>=0.14.0 # For advanced time series analysis
203
  ```
204
 
205
  ## 🤝 Contributing
206
 
207
+ We welcome contributions! Here's how you can help:
208
 
209
+ ### **Areas for Contribution**
210
+ 1. **New Feature Engineering Methods**
211
+ 2. **Additional Visualization Types**
212
+ 3. **Export Format Support**
213
+ 4. **Performance Optimizations**
214
+ 5. **Documentation Improvements**
215
 
216
+ ### **Development Workflow**
217
  ```bash
218
+ # 1. Fork the repository
219
+ # 2. Create feature branch
220
+ git checkout -b feature/new-feature
 
 
221
 
222
+ # 3. Make changes and test
223
+ # 4. Submit pull request
224
  ```
225
 
226
+ ## 📜 License
 
 
227
 
228
+ This project is licensed under the **MIT License** - see the [LICENSE](LICENSE) file for details.
229
 
230
+ ## 🙏 Acknowledgments
231
 
232
+ ### **Special Thanks To:**
233
+ - **Streamlit Team** for the amazing framework
234
+ - **Hugging Face** for hosting the Space
235
+ - **Open Source Community** for invaluable libraries
236
+ - **All Contributors** who helped improve TimeFlow Pro
237
 
238
+ ### **Built With:**
239
+ - 🐍 Python
240
+ - 📊 Streamlit
241
+ - 🎨 Plotly
242
+ - 🔧 Scikit-learn
243
+ - 📈 Pandas & NumPy
244
 
245
+ ## 📞 Support & Contact
246
 
247
+ ### **Get Help:**
248
+ - 📧 **Email**: cool.araby@gmail.com
249
+ - 💬 **Issues**: [GitHub Issues](https://github.com/your-username/timeflow-pro/issues)
250
+ - 💡 **Discussions**: [Community Forum](https://github.com/your-username/timeflow-pro/discussions)
251
 
252
+ ### **Stay Updated:**
253
+ - ⭐ **Star** the repository
254
+ - 👁️ **Watch** for releases
255
+ - 🔔 **Enable notifications**
256
 
257
  ---
258
 
259
  <div align="center">
260
 
261
+ **Transform Your Time Series Data with Ease**
 
262
 
263
+ *TimeFlow Pro - Making Data Preparation Simple and Powerful*
 
264
 
265
+ [![Follow on Hugging Face](https://img.shields.io/badge/Follow%20on-🤗%20Hugging%20Face-yellow)](https://huggingface.co/your-username)
266
+ [![GitHub Stars](https://img.shields.io/github/stars/your-username/timeflow-pro?style=social)](https://github.com/your-username/timeflow-pro)
267
 
268
+ </div>
app.py CHANGED
The diff for this file is too large to render. See raw diff
 
config/__init__.py ADDED
File without changes
config/config.py ADDED
@@ -0,0 +1,169 @@
1
+ # ============================================
2
+ # ENUMERATION CLASSES
3
+ # ============================================
4
+ from dataclasses import asdict, dataclass, field
5
+ from enum import Enum
6
+ import json
7
+ import logging
8
+ from pathlib import Path
9
+ from typing import Dict, List, Optional
10
+ logger = logging.getLogger(__name__)
11
+
12
+
13
+ class DataType(Enum):
14
+ """Data types"""
15
+ NUMERIC = "numeric"
16
+ CATEGORICAL = "categorical"
17
+ TEMPORAL = "temporal"
18
+ TEXT = "text"
19
+
20
+
21
+ class PreprocessingMethod(Enum):
22
+ """Data preprocessing methods"""
23
+ FILL_MEAN = "fill_mean"
24
+ FILL_MEDIAN = "fill_median"
25
+ FILL_INTERPOLATE = "fill_interpolate"
26
+ FILL_KNN = "fill_knn"
27
+ REMOVE = "remove"
28
+ CLIP = "clip"
29
+ WINSORIZE = "winsorize"
30
+ NORMALIZE = "normalize"
31
+ STANDARDIZE = "standardize"
32
+ LOG_TRANSFORM = "log_transform"
33
+ BOX_COX = "box_cox"
34
+ DIFFERENCING = "differencing"
35
+
36
+
37
+ class SeasonalityType(Enum):
38
+ """Seasonality types"""
39
+ DAILY = "daily"
40
+ WEEKLY = "weekly"
41
+ MONTHLY = "monthly"
42
+ QUARTERLY = "quarterly"
43
+ YEARLY = "yearly"
44
+ MULTIPLE = "multiple"
45
+
46
+
47
+ # ============================================
48
+ # CLASS 1: CONFIGURATION
49
+ # ============================================
50
+ @dataclass
51
+ class Config:
52
+ """Experiment configuration for data preprocessing"""
53
+
54
+ # Paths and directories
55
+ data_path: str = 'temp_data.csv'
56
+ results_dir: str = 'data_preprocessing_results'
57
+
58
+ # Temporal parameters
59
+ start_year: int = 1970
60
+ end_year: int = 1990
61
+ freq: str = 'D' # Data frequency: D (daily), H (hourly), M (monthly)
62
+
63
+ # Target variable
64
+ target_column: str = 'raskhodvoda'
65
+
66
+ # Feature parameters
67
+ max_lags: int = 12
68
+ seasonal_period: int = 365
69
+ rolling_windows: List[int] = field(default_factory=lambda: [7, 30, 90, 365])
70
+ expanding_windows: List[int] = field(default_factory=lambda: [30, 90, 365])
71
+
72
+ # Processing parameters
73
+ missing_threshold: float = 0.3 # Threshold for dropping columns with missing values
74
+ outlier_method: str = 'iqr' # Outlier detection method: iqr, zscore, lof
75
+ outlier_alpha: float = 1.5 # IQR multiplier
76
+ outlier_contamination: float = 0.1 # For methods like LOF
77
+
78
+ # Data splitting
79
+ test_size: float = 0.2
80
+ validation_size: float = 0.1
81
+ split_method: str = 'time' # time, random, expanding_window
82
+
83
+ # Scaling
84
+ scaling_method: str = 'robust' # standard, minmax, robust, none
85
+
86
+ # Feature selection
87
+ feature_selection_method: str = 'correlation' # correlation, mutual_info, rf, pca
88
+ max_features: int = 50
89
+
90
+ # Validation
91
+ enable_validation: bool = True
92
+ validation_rules: Dict = field(default_factory=dict)
93
+
94
+ # Visualisation
95
+ save_plots: bool = True
96
+ plot_style: str = 'seaborn-v0_8'
97
+
98
+ # Performance
99
+ use_multiprocessing: bool = False
100
+ n_jobs: int = -1
101
+ chunk_size: int = 10000
102
+
103
+ # Logging
104
+ log_level: str = 'INFO'
105
+ save_reports: bool = True
106
+
107
+ def __post_init__(self):
108
+ """Post-initialisation for creating directories and setting up logging"""
109
+ self.create_directories()
110
+ self.setup_logging()
111
+
112
+ # Setting default validation rules
113
+ if not self.validation_rules:
114
+ self.validation_rules = {
115
+ 'min_rows': 100,
116
+ 'max_missing_percentage': 30,
117
+ 'min_unique_values': 2,
118
+ 'max_skewness': 3,
119
+ 'max_kurtosis': 10
120
+ }
121
+
122
+ def create_directories(self) -> None:
123
+ """Create directories for preprocessing results"""
124
+ dirs = [
125
+ self.results_dir,
126
+ f'{self.results_dir}/plots',
127
+ f'{self.results_dir}/plots/time_series',
128
+ f'{self.results_dir}/plots/distributions',
129
+ f'{self.results_dir}/plots/correlations',
130
+ f'{self.results_dir}/plots/features',
131
+ f'{self.results_dir}/tables',
132
+ f'{self.results_dir}/processed_data',
133
+ f'{self.results_dir}/models',
134
+ f'{self.results_dir}/reports',
135
+ f'{self.results_dir}/logs',
136
+ f'{self.results_dir}/checkpoints'
137
+ ]
138
+
139
+ for directory in dirs:
140
+ Path(directory).mkdir(parents=True, exist_ok=True)
141
+
142
+ logger.info(f"Directories created in {self.results_dir}")
143
+
144
+ def setup_logging(self) -> None:
145
+ """Configure logging"""
146
+ log_level = getattr(logging, self.log_level.upper())
147
+ logger.setLevel(log_level)
148
+
149
+ def to_dict(self) -> Dict:
150
+ """Convert configuration to dictionary"""
151
+ return asdict(self)
152
+
153
+ def save(self, path: Optional[str] = None) -> None:
154
+ """Save configuration to file"""
155
+ if path is None:
156
+ path = f'{self.results_dir}/config.json'
157
+
158
+ with open(path, 'w', encoding='utf-8') as f:
159
+ json.dump(self.to_dict(), f, indent=4, ensure_ascii=False)
160
+
161
+ logger.info(f"Configuration saved to {path}")
162
+
163
+ @classmethod
164
+ def load(cls, path: str) -> 'Config':
+     """Load configuration from file, ignoring keys the dataclass does not define"""
+     with open(path, 'r', encoding='utf-8') as f:
+         config_dict = json.load(f)
+ 
+     # default_config.json carries extra keys (e.g. plot_dpi, streamlit_settings)
+     # that Config does not define; passing them to cls(**...) would raise TypeError
+     from dataclasses import fields
+     known = {f.name for f in fields(cls)}
+     return cls(**{k: v for k, v in config_dict.items() if k in known})
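+ 
+ 
+ # Usage sketch (illustrative): round-trip a configuration
+ #   cfg = Config(target_column='sales', max_lags=7)
+ #   cfg.save()                                   # writes <results_dir>/config.json
+ #   same = Config.load(f'{cfg.results_dir}/config.json')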
config/default_config.json ADDED
@@ -0,0 +1,78 @@
1
+ {
2
+ "data_path": "temp_data.csv",
3
+ "results_dir": "results",
4
+
5
+ "start_year": 1970,
6
+ "end_year": 1990,
7
+ "freq": "D",
8
+
9
+ "target_column": "raskhodvoda",
10
+
11
+ "max_lags": 12,
12
+ "seasonal_period": 365,
13
+ "rolling_windows": [7, 30, 90, 365],
14
+ "expanding_windows": [30, 90, 365],
15
+
16
+ "missing_threshold": 0.3,
17
+ "outlier_method": "iqr",
18
+ "outlier_alpha": 1.5,
19
+ "outlier_contamination": 0.1,
20
+
21
+ "test_size": 0.2,
22
+ "validation_size": 0.1,
23
+ "split_method": "time",
24
+
25
+ "scaling_method": "robust",
26
+
27
+ "feature_selection_method": "correlation",
28
+ "max_features": 50,
29
+
30
+ "enable_validation": true,
31
+ "validation_rules": {
32
+ "min_rows": 100,
33
+ "max_missing_percentage": 30,
34
+ "min_unique_values": 2,
35
+ "max_skewness": 3,
36
+ "max_kurtosis": 10,
37
+ "min_variance": 0.001,
38
+ "max_constant_columns": 0
39
+ },
40
+
41
+ "save_plots": true,
42
+ "plot_style": "seaborn-whitegrid",
43
+ "plot_dpi": 300,
44
+ "plot_format": "png",
45
+
46
+ "use_multiprocessing": false,
47
+ "n_jobs": -1,
48
+ "chunk_size": 10000,
49
+ "memory_limit_gb": 4,
50
+
51
+ "log_level": "INFO",
52
+ "save_reports": true,
53
+ "report_format": "json",
54
+
55
+ "decomposition_method": "stl",
56
+ "stationarity_tests": ["adf", "kpss"],
57
+ "correlation_threshold": 0.85,
58
+ "vif_threshold": 10,
59
+
60
+ "random_seed": 42,
61
+ "enable_profiling": false,
62
+ "save_intermediate": true,
63
+
64
+ "streamlit_settings": {
65
+ "theme": "light",
66
+ "sidebar_state": "expanded",
67
+ "page_title": "Time Series Preprocessing",
68
+ "page_icon": "📊",
69
+ "layout": "wide"
70
+ },
71
+
72
+ "export_options": {
73
+ "csv": true,
74
+ "parquet": false,
75
+ "excel": false,
76
+ "pickle": true
77
+ }
78
+ }
config/settings.py ADDED
@@ -0,0 +1,375 @@
1
+ """
2
+ General project settings: visualisation, paths, constants
3
+ """
4
+
5
+ import warnings
6
+ import matplotlib.pyplot as plt
7
+ import seaborn as sns
8
+ from pathlib import Path
9
+ from typing import Dict, Any, Optional
10
+ import yaml
11
+ import json
12
+ import os
13
+
14
+ # ============================================================================
15
+ # PATHS AND DIRECTORIES
16
+ # ============================================================================
17
+
18
+ PROJECT_ROOT = Path(__file__).parent.parent.parent
19
+ DATA_DIR = PROJECT_ROOT / "data"
20
+ RAW_DATA_DIR = DATA_DIR / "raw"
21
+ PROCESSED_DATA_DIR = DATA_DIR / "processed"
22
+ EXTERNAL_DATA_DIR = DATA_DIR / "external"
23
+
24
+ RESULTS_DIR = PROJECT_ROOT / "results"
25
+ PLOTS_DIR = RESULTS_DIR / "plots"
26
+ MODELS_DIR = RESULTS_DIR / "models"
27
+ REPORTS_DIR = RESULTS_DIR / "reports"
28
+ LOGS_DIR = RESULTS_DIR / "logs"
29
+
30
+ CONFIGS_DIR = PROJECT_ROOT / "configs"
31
+ NOTEBOOKS_DIR = PROJECT_ROOT / "notebooks"
32
+ TESTS_DIR = PROJECT_ROOT / "tests"
33
+
34
+ # Create directories on import
35
+ for directory in [RAW_DATA_DIR, PROCESSED_DATA_DIR, EXTERNAL_DATA_DIR,
36
+ PLOTS_DIR, MODELS_DIR, REPORTS_DIR, LOGS_DIR]:
37
+ directory.mkdir(parents=True, exist_ok=True)
38
+
39
+ # ============================================================================
40
+ # VISUALISATION SETTINGS
41
+ # ============================================================================
42
+
43
+ def setup_visualization(
44
+ style: str = "seaborn-whitegrid",
45
+ palette: str = "husl",
46
+ context: str = "notebook",
47
+ font_scale: float = 1.0,
48
+ dpi: int = 150,
49
+ figsize: tuple = (12, 6),
50
+ **kwargs
51
+ ):
52
+ """
53
+ Configure visualisation parameters for matplotlib and seaborn
54
+
55
+ Parameters:
56
+ -----------
57
+ style : str
58
+ Matplotlib style: 'seaborn-v0_8-whitegrid', 'ggplot', 'bmh', 'dark_background'
59
+ palette : str
60
+ Seaborn palette: 'husl', 'Set2', 'viridis', 'mako'
61
+ context : str
62
+ Seaborn context: 'paper', 'notebook', 'talk', 'poster'
63
+ font_scale : float
64
+ Font scale
65
+ dpi : int
66
+ Plot resolution
67
+ figsize : tuple
68
+ Default figure size
69
+ """
70
+ # Ignore warnings
71
+ warnings.filterwarnings('ignore')
72
+
73
+ # Matplotlib settings
74
+ plt.style.use(style)
75
+
76
+ # RC parameters
77
+ rc_params = {
78
+ 'font.size': 10,
79
+ 'figure.figsize': figsize,
80
+ 'figure.dpi': dpi,
81
+ 'savefig.dpi': 300,
82
+ 'savefig.bbox': 'tight',
83
+ 'savefig.format': 'png',
84
+ 'axes.titlesize': 12,
85
+ 'axes.labelsize': 10,
86
+ 'xtick.labelsize': 9,
87
+ 'ytick.labelsize': 9,
88
+ 'legend.fontsize': 9,
89
+ 'font.family': ['DejaVu Sans', 'Arial', 'sans-serif'],
90
+ 'figure.titlesize': 14,
91
+ 'axes.grid': True,
92
+ 'grid.alpha': 0.3,
93
+ 'lines.linewidth': 1.5,
94
+ 'lines.markersize': 6,
95
+ 'patch.edgecolor': 'black',
96
+ 'patch.force_edgecolor': True,
97
+ 'xtick.top': False,
98
+ 'ytick.right': False,
99
+ 'axes.spines.top': False,
100
+ 'axes.spines.right': False
101
+ }
102
+
103
+ # Update additional parameters
104
+ rc_params.update(kwargs)
105
+ plt.rcParams.update(rc_params)
106
+
107
+ # Seaborn settings
108
+ sns.set_style(style.replace('seaborn-v0_8-', '').replace('seaborn-', ''))
109
+ sns.set_palette(palette)
110
+ sns.set_context(context, font_scale=font_scale)
111
+
112
+ print(f"✓ Visualisation settings applied: style={style}, palette={palette}")
113
+
114
+
115
+ def get_color_palette(name: str = "husl", n_colors: int = 8) -> list:
116
+ """
117
+ Get colour palette
118
+
119
+ Parameters:
120
+ -----------
121
+ name : str
122
+ Palette name
123
+ n_colors : int
124
+ Number of colours
125
+
126
+ Returns:
127
+ --------
128
+ list
129
+ List of colours in HEX format
130
+ """
131
+ palette_map = {
132
+ "husl": sns.color_palette("husl", n_colors),
133
+ "Set2": sns.color_palette("Set2", n_colors),
134
+ "Set3": sns.color_palette("Set3", n_colors),
135
+ "viridis": sns.color_palette("viridis", n_colors),
136
+ "plasma": sns.color_palette("plasma", n_colors),
137
+ "coolwarm": sns.color_palette("coolwarm", n_colors),
138
+ "RdYlBu": sns.color_palette("RdYlBu", n_colors),
139
+ "Spectral": sns.color_palette("Spectral", n_colors),
140
+ "tab10": sns.color_palette("tab10", n_colors),
141
+ "tab20": sns.color_palette("tab20", n_colors),
142
+ }
143
+
144
+ palette = palette_map.get(name, sns.color_palette("husl", n_colors))
145
+ return [f"#{int(r*255):02x}{int(g*255):02x}{int(b*255):02x}"
146
+ for r, g, b in palette]
147
+
148
+
149
+ # ============================================================================
150
+ # CONSTANTS
151
+ # ============================================================================
152
+
153
+ # Data types
154
+ DATETIME_FORMATS = [
155
+ "%Y-%m-%d", "%Y/%m/%d", "%d.%m.%Y", "%d/%m/%Y",
156
+ "%Y-%m-%d %H:%M:%S", "%Y/%m/%d %H:%M:%S",
157
+ "%d.%m.%Y %H:%M:%S", "%d/%m/%Y %H:%M:%S"
158
+ ]
159
+
160
+ # Metrics
161
+ METRICS = {
162
+ "regression": ["mse", "rmse", "mae", "mape", "r2", "explained_variance"],
163
+ "classification": ["accuracy", "precision", "recall", "f1", "roc_auc"]
164
+ }
165
+
166
+ # Statistical constants
167
+ STATS_CONSTANTS = {
168
+ "confidence_levels": [0.9, 0.95, 0.99],
169
+ "z_scores": {0.9: 1.645, 0.95: 1.96, 0.99: 2.576},
170
+ "outlier_multipliers": {"mild": 1.5, "extreme": 3.0}
171
+ }
172
+
173
+ # Time series parameters
174
+ TIME_SERIES_CONSTANTS = {
175
+ "frequencies": {
176
+ "H": "hourly",
177
+ "D": "daily",
178
+ "W": "weekly",
179
+ "M": "monthly",
180
+ "Q": "quarterly",
181
+ "Y": "yearly"
182
+ },
183
+ "seasonal_periods": {
184
+ "hourly": 24,
185
+ "daily": 7,
186
+ "weekly": 52,
187
+ "monthly": 12,
188
+ "quarterly": 4,
189
+ "yearly": 1
190
+ }
191
+ }
192
+
193
+ # ============================================================================
194
+ # CONFIGURATION UTILITIES
195
+ # ============================================================================
196
+
197
+ def load_config(config_path: Optional[str] = None) -> Dict[str, Any]:
198
+ """
199
+ Load configuration from file
200
+
201
+ Parameters:
202
+ -----------
203
+ config_path : str, optional
204
+ Path to configuration file
205
+
206
+ Returns:
207
+ --------
208
+ Dict[str, Any]
209
+ Configuration dictionary
210
+ """
211
+ if config_path is None:
212
+ config_path = CONFIGS_DIR / "default_config.json"
213
+
214
+ config_path = Path(config_path)
215
+
216
+ if not config_path.exists():
217
+ print(f"⚠ Configuration file not found: {config_path}")
218
+ return {}
219
+
220
+ # Determine file format
221
+ if config_path.suffix.lower() in ['.json']:
222
+ with open(config_path, 'r', encoding='utf-8') as f:
223
+ config = json.load(f)
224
+ elif config_path.suffix.lower() in ['.yaml', '.yml']:
225
+ with open(config_path, 'r', encoding='utf-8') as f:
226
+ config = yaml.safe_load(f)
227
+ else:
228
+ raise ValueError(f"Unsupported file format: {config_path.suffix}")
229
+
230
+ print(f"✓ Configuration loaded from: {config_path}")
231
+ return config
232
+
233
+
234
+ def save_config(config: Dict[str, Any], config_path: str) -> None:
235
+ """
236
+ Save configuration to file
237
+
238
+ Parameters:
239
+ -----------
240
+ config : Dict[str, Any]
241
+ Configuration to save
242
+ config_path : str
243
+ Save path
244
+ """
245
+ config_path = Path(config_path)
246
+ config_path.parent.mkdir(parents=True, exist_ok=True)
247
+
248
+ # Determine format
249
+ if config_path.suffix.lower() in ['.json']:
250
+ with open(config_path, 'w', encoding='utf-8') as f:
251
+ json.dump(config, f, indent=2, ensure_ascii=False)
252
+ elif config_path.suffix.lower() in ['.yaml', '.yml']:
253
+ with open(config_path, 'w', encoding='utf-8') as f:
254
+ yaml.dump(config, f, default_flow_style=False, allow_unicode=True)
255
+ else:
256
+ raise ValueError(f"Unsupported file format: {config_path.suffix}")
257
+
258
+ print(f"✓ Configuration saved to: {config_path}")
259
+
260
+
261
+ def merge_configs(base_config: Dict[str, Any],
262
+ override_config: Dict[str, Any]) -> Dict[str, Any]:
263
+ """
264
+ Recursive configuration merging
265
+
266
+ Parameters:
267
+ -----------
268
+ base_config : Dict[str, Any]
269
+ Base configuration
270
+ override_config : Dict[str, Any]
271
+ Override configuration
272
+
273
+ Returns:
274
+ --------
275
+ Dict[str, Any]
276
+ Merged configuration
277
+ """
278
+ result = base_config.copy()
279
+
280
+ for key, value in override_config.items():
281
+ if (key in result and isinstance(result[key], dict)
282
+ and isinstance(value, dict)):
283
+ result[key] = merge_configs(result[key], value)
284
+ else:
285
+ result[key] = value
286
+
287
+ return result
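+ 
+ 
+ # Usage sketch (illustrative): override one nested key, keep the rest of the base config
+ #   merged = merge_configs(load_config(), {"streamlit_settings": {"theme": "dark"}})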
288
+
289
+
290
+ # ============================================================================
291
+ # ENVIRONMENT SETUP
292
+ # ============================================================================
293
+
294
+ def setup_environment(
295
+ log_level: str = "INFO",
296
+ random_seed: int = 42,
297
+ enable_warnings: bool = False,
298
+ memory_limit_gb: Optional[int] = None
299
+ ) -> None:
300
+ """
301
+ Set up environment for reproducibility
302
+
303
+ Parameters:
304
+ -----------
305
+ log_level : str
306
+ Logging level
307
+ random_seed : int
308
+ Seed for random generators
309
+ enable_warnings : bool
310
+ Enable warnings
311
+ memory_limit_gb : int, optional
312
+ Memory limit in GB
313
+ """
314
+     import numpy as np
+     import random
+ 
+     # Set seeds
+     np.random.seed(random_seed)
+     random.seed(random_seed)
+ 
+     # Optional frameworks are imported lazily so a missing package cannot break setup
+     try:
+         import torch
+         torch.manual_seed(random_seed)
+     except ImportError:
+         pass
+ 
+     try:
+         import tensorflow as tf
+         tf.random.set_seed(random_seed)
+     except ImportError:
+         pass
332
+
333
+ # Configure warnings
334
+ if enable_warnings:
335
+ warnings.filterwarnings('default')
336
+ else:
337
+ warnings.filterwarnings('ignore')
338
+
339
+ # Memory limit (if specified)
340
+ if memory_limit_gb:
341
+ import resource
342
+ soft, hard = resource.getrlimit(resource.RLIMIT_AS)
343
+ memory_limit = memory_limit_gb * 1024**3 # GB to bytes
344
+ resource.setrlimit(resource.RLIMIT_AS, (memory_limit, hard))
345
+ print(f"✓ Memory limit set: {memory_limit_gb} GB")
346
+
347
+ print(f"✓ Environment configured. Random seed: {random_seed}")
348
+
349
+
350
+ # ============================================================================
351
+ # AUTOMATIC SETUP ON IMPORT
352
+ # ============================================================================
353
+
354
+ # Automatically apply visualisation settings
355
+ setup_visualization()
356
+
357
+ # Export useful variables
358
+ __all__ = [
359
+ 'setup_visualization',
360
+ 'get_color_palette',
361
+ 'load_config',
362
+ 'save_config',
363
+ 'merge_configs',
364
+ 'setup_environment',
365
+ 'PROJECT_ROOT',
366
+ 'DATA_DIR',
367
+ 'RAW_DATA_DIR',
368
+ 'PROCESSED_DATA_DIR',
369
+ 'RESULTS_DIR',
370
+ 'PLOTS_DIR',
371
+ 'DATETIME_FORMATS',
372
+ 'METRICS',
373
+ 'STATS_CONSTANTS',
374
+ 'TIME_SERIES_CONSTANTS'
375
+ ]
correlations/__init__.py ADDED
File without changes
correlations/correlation_analyzer.py ADDED
@@ -0,0 +1,687 @@
1
+ # ============================================
2
+ # CLASS 8: CORRELATION AND MULTICOLLINEARITY ANALYSIS
3
+ # ============================================
4
+ import os
5
+ import traceback
6
+ from typing import Any, Dict, List, Optional
7
+ import logging
+ 
+ logger = logging.getLogger(__name__)
8
+
9
+ from config.config import Config
10
+ import numpy as np
11
+ import pandas as pd
12
+
13
+ class CorrelationAnalyzer:
14
+ """Class for comprehensive correlation and multicollinearity analysis"""
15
+
16
+ def __init__(self, config: Config):
17
+ """
18
+ Initialise the analyser
19
+
20
+ Parameters:
21
+ -----------
22
+ config : Config
23
+ Experiment configuration
24
+ """
25
+ self.config = config
26
+ self.correlation_matrices = {}
27
+ self.high_correlation_pairs = {}
28
+ self.multicollinearity_info = {}
29
+ self.vif_scores = {}
30
+
31
+ def analyze(
32
+ self,
33
+ data: pd.DataFrame,
34
+ target_col: Optional[str] = None,
35
+ threshold: float = 0.8,
36
+ detailed: bool = True,
37
+ **kwargs
38
+ ) -> pd.DataFrame:
39
+ """
40
+ Analyse correlations in the data
41
+
42
+ Parameters:
43
+ -----------
44
+ data : pd.DataFrame
45
+ Input data
46
+ target_col : str, optional
47
+ Target variable
48
+ threshold : float
49
+ Threshold for identifying high correlations
50
+ detailed : bool
51
+ Whether to perform detailed analysis
52
+ **kwargs : dict
53
+ Additional parameters
54
+
55
+ Returns:
56
+ --------
57
+ pd.DataFrame
58
+ Correlation matrix
59
+ """
60
+ logger.info("\n" + "="*80)
61
+ logger.info("CORRELATION AND MULTICOLLINEARITY ANALYSIS")
62
+ logger.info("="*80)
63
+
64
+ target_col = target_col or self.config.target_column
65
+
66
+ try:
67
+ # 1. Calculate correlation matrix
68
+ corr_matrix = self._compute_correlations(data, target_col)
69
+
70
+ if corr_matrix.empty:
71
+ logger.warning("Correlation matrix is empty")
72
+ return pd.DataFrame()
73
+
74
+ # 2. Identify high correlations
75
+ high_correlations = self._detect_high_correlations(corr_matrix, threshold)
76
+ self.high_correlation_pairs['pearson'] = high_correlations
77
+
78
+ # 3. Analyse correlations with target variable
79
+ target_correlations = []
80
+ if target_col in corr_matrix.columns:
81
+ target_correlations = self._get_target_correlations(corr_matrix, target_col)
82
+
83
+ # 4. Analyse multicollinearity (VIF)
84
+ vif_results = self._compute_vif_scores(data)
85
+
86
+ # 5. Detailed analysis if required
87
+ if detailed:
88
+ self._detailed_correlation_analysis(data, corr_matrix, target_col)
89
+
90
+ # 6. Visualisation
91
+ if self.config.save_plots:
92
+ self._plot_correlation_analysis(data, corr_matrix, target_col, high_correlations, vif_results)
93
+
94
+ # 7. Output results
95
+ self._log_analysis_results(corr_matrix, high_correlations, target_correlations, vif_results)
96
+
97
+ return corr_matrix
98
+
99
+ except Exception as e:
100
+ logger.error(f"Error in correlation analysis: {e}")
101
+ logger.error(traceback.format_exc())
102
+ return pd.DataFrame()
103
+
104
+ def _compute_correlations(
105
+ self,
106
+ data: pd.DataFrame,
107
+ target_col: str
108
+ ) -> pd.DataFrame:
109
+ """Calculate correlation matrix"""
110
+ logger.info("Calculating correlation matrix...")
111
+
112
+ # Select only numeric columns
113
+ numeric_data = data.select_dtypes(include=[np.number])
114
+
115
+ # Remove constant columns
116
+ numeric_data = numeric_data.loc[:, numeric_data.nunique() > 1]
117
+
118
+ if numeric_data.shape[1] < 2:
119
+ logger.warning("Insufficient numeric features for analysis")
120
+ return pd.DataFrame()
121
+
122
+ # Remove missing values
123
+ numeric_data_clean = numeric_data.dropna()
124
+
125
+ if len(numeric_data_clean) < 10:
126
+ logger.warning("Insufficient data after cleaning")
127
+ return pd.DataFrame()
128
+
129
+ # Calculate Pearson correlation
130
+ try:
131
+ corr_matrix = numeric_data_clean.corr(method='pearson')
132
+ self.correlation_matrices['pearson'] = corr_matrix
133
+ logger.info(f"✓ Correlation matrix calculated: {corr_matrix.shape}")
134
+ return corr_matrix
135
+ except Exception as e:
136
+ logger.error(f"Error calculating correlation: {e}")
137
+ return pd.DataFrame()
138
+
139
+ def _detect_high_correlations(
140
+ self,
141
+ corr_matrix: pd.DataFrame,
142
+ threshold: float = 0.8
143
+ ) -> List[Dict[str, Any]]:
144
+ """Detect high correlations"""
145
+ high_correlations = []
146
+
147
+ if corr_matrix.empty:
148
+ return high_correlations
149
+
150
+ # Use upper triangle of matrix
151
+ upper_triangle = corr_matrix.where(
152
+ np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
153
+ )
154
+
155
+ # Find pairs with correlation above threshold
156
+ for col in upper_triangle.columns:
157
+ if col in upper_triangle:
158
+ high_corr_series = upper_triangle[col][abs(upper_triangle[col]) > threshold]
159
+
160
+ for row_idx, correlation in high_corr_series.items():
161
+ if not pd.isna(correlation):
162
+ high_correlations.append({
163
+ 'feature1': row_idx,
164
+ 'feature2': col,
165
+ 'correlation': float(correlation),
166
+ 'abs_correlation': abs(float(correlation))
167
+ })
168
+
169
+ # Sort by absolute correlation value
170
+ high_correlations.sort(key=lambda x: x['abs_correlation'], reverse=True)
171
+
172
+ logger.info(f"High correlations detected (> {threshold}): {len(high_correlations)}")
173
+ return high_correlations
174
+
175
+ def _get_target_correlations(
176
+ self,
177
+ corr_matrix: pd.DataFrame,
178
+ target_col: str
179
+ ) -> List[Dict[str, Any]]:
180
+ """Get correlations with target variable"""
181
+ target_correlations = []
182
+
183
+ if target_col not in corr_matrix.columns:
184
+ return target_correlations
185
+
186
+ # Extract correlations with target variable
187
+ target_corr_series = corr_matrix[target_col]
188
+
189
+ for feature, correlation in target_corr_series.items():
190
+ if feature != target_col and not pd.isna(correlation):
191
+ target_correlations.append({
192
+ 'feature': feature,
193
+ 'correlation': float(correlation),
194
+ 'abs_correlation': abs(float(correlation)),
195
+ 'direction': 'positive' if correlation > 0 else 'negative'
196
+ })
197
+
198
+ # Sort by absolute value
199
+ target_correlations.sort(key=lambda x: x['abs_correlation'], reverse=True)
200
+
201
+ logger.info(f"Correlations with target variable calculated: {len(target_correlations)}")
202
+ return target_correlations
203
+
204
+ def _compute_vif_scores(self, data: pd.DataFrame) -> Dict[str, Any]:
205
+ """Calculate VIF (Variance Inflation Factor)"""
206
+ logger.info("Analysing multicollinearity (VIF)...")
207
+
208
+ vif_results = {
209
+ 'scores': {},
210
+ 'issues': [],
211
+ 'summary': {
212
+ 'critical': 0,
213
+ 'high': 0,
214
+ 'medium': 0,
215
+ 'low': 0
216
+ }
217
+ }
218
+
219
+ try:
220
+ from statsmodels.stats.outliers_influence import variance_inflation_factor
221
+ import statsmodels.api as sm
222
+
223
+ # Prepare data
224
+ numeric_data = data.select_dtypes(include=[np.number])
225
+ numeric_data = numeric_data.loc[:, numeric_data.nunique() > 1]
226
+
227
+ # Remove missing and infinite values
228
+ clean_data = numeric_data.replace([np.inf, -np.inf], np.nan).dropna()
229
+
230
+ if clean_data.shape[0] < 10 or clean_data.shape[1] < 2:
231
+ logger.warning("Insufficient data for VIF analysis")
232
+ return vif_results
233
+
234
+ # Add constant
235
+ X = sm.add_constant(clean_data, has_constant='add')
236
+
237
+ # Calculate VIF for each feature
238
+ vif_scores = {}
239
+ for i, column in enumerate(X.columns):
240
+ if column == 'const':
241
+ continue
242
+
243
+ try:
244
+ vif = variance_inflation_factor(X.values, i)
245
+
246
+ # Handle extreme values
247
+ if np.isinf(vif) or vif > 1e6:
248
+ vif = 1e6
249
+
250
+ vif_scores[column] = float(vif)
251
+
252
+ # Classify by severity
253
+ if vif > 100:
254
+ vif_results['summary']['critical'] += 1
255
+ vif_results['issues'].append({
256
+ 'feature': column,
257
+ 'vif': float(vif),
258
+ 'severity': 'critical',
259
+ 'recommendation': 'Remove feature'
260
+ })
261
+ elif vif > 10:
262
+ vif_results['summary']['high'] += 1
263
+ vif_results['issues'].append({
264
+ 'feature': column,
265
+ 'vif': float(vif),
266
+ 'severity': 'high',
267
+ 'recommendation': 'Consider removal'
268
+ })
269
+ elif vif > 5:
270
+ vif_results['summary']['medium'] += 1
271
+ else:
272
+ vif_results['summary']['low'] += 1
273
+
274
+ except Exception as e:
275
+ logger.warning(f"VIF error for {column}: {e}")
276
+ vif_scores[column] = np.nan
277
+
278
+ vif_results['scores'] = vif_scores
279
+ self.vif_scores = vif_scores
280
+
281
+ logger.info(f"✓ VIF analysis completed. Critical features: {vif_results['summary']['critical']}")
282
+
283
+ except ImportError:
284
+ logger.warning("statsmodels not installed, skipping VIF analysis")
285
+ except Exception as e:
286
+ logger.error(f"VIF analysis error: {e}")
287
+
288
+ return vif_results
289
+
290
+ def _detailed_correlation_analysis(
291
+ self,
292
+ data: pd.DataFrame,
293
+ corr_matrix: pd.DataFrame,
294
+ target_col: str
295
+ ) -> None:
296
+ """Detailed correlation analysis"""
297
+ # Analyse correlation clusters
298
+ if not corr_matrix.empty and corr_matrix.shape[0] > 3:
299
+ try:
300
+ # Use clustering to group correlated features
301
+ from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
302
+ from scipy.spatial.distance import squareform
303
+
304
+ # Convert correlations to distances
305
+ distance_matrix = 1 - abs(corr_matrix)
306
+ np.fill_diagonal(distance_matrix.values, 0)
307
+
308
+ # Clustering
309
+ condensed_dist = squareform(distance_matrix)
310
+ Z = linkage(condensed_dist, method='average')
311
+
312
+ # Determine clusters
313
+ clusters = fcluster(Z, t=0.5, criterion='distance')
314
+
315
+ # Group features by cluster
316
+ feature_clusters = {}
317
+ for idx, cluster_id in enumerate(clusters):
318
+ feature = corr_matrix.columns[idx]
319
+ if cluster_id not in feature_clusters:
320
+ feature_clusters[cluster_id] = []
321
+ feature_clusters[cluster_id].append(feature)
322
+
323
+ # Save cluster information
324
+ self.multicollinearity_info['correlation_clusters'] = feature_clusters
325
+ logger.info(f"Correlated feature clusters detected: {len(feature_clusters)}")
326
+
327
+ except Exception as e:
328
+ logger.debug(f"Cluster analysis failed: {e}")
329
+
330
+ def _plot_correlation_analysis(
331
+ self,
332
+ data: pd.DataFrame,
333
+ corr_matrix: pd.DataFrame,
334
+ target_col: str,
335
+ high_correlations: List[Dict[str, Any]],
336
+ vif_results: Dict[str, Any]
337
+ ) -> None:
338
+ """Visualise correlation analysis"""
339
+ try:
340
+ import matplotlib.pyplot as plt
341
+ import seaborn as sns
342
+ from matplotlib import rcParams
343
+
344
+ # Style settings
345
+ plt.style.use('seaborn-v0_8-darkgrid')
346
+ rcParams.update({
347
+ 'figure.figsize': (12, 8),
348
+ 'font.size': 10,
349
+ 'axes.titlesize': 14,
350
+ 'axes.labelsize': 12
351
+ })
352
+
353
+ # Create directory
354
+ plots_dir = os.path.join(self.config.results_dir, 'plots', 'correlations')
355
+ os.makedirs(plots_dir, exist_ok=True)
356
+
357
+ # 1. Correlation matrix heatmap
358
+ if not corr_matrix.empty and corr_matrix.shape[0] > 1:
359
+ fig, ax = plt.subplots(figsize=(14, 12))
360
+
361
+ mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
362
+ sns.heatmap(
363
+ corr_matrix,
364
+ mask=mask,
365
+ annot=True,
366
+ fmt='.2f',
367
+ cmap='coolwarm',
368
+ center=0,
369
+ square=True,
370
+ linewidths=0.5,
371
+ cbar_kws={"shrink": 0.8},
372
+ ax=ax
373
+ )
374
+ ax.set_title('Correlation Matrix (Pearson)', fontweight='bold')
375
+ plt.tight_layout()
376
+ plt.savefig(os.path.join(plots_dir, 'correlation_matrix.png'),
377
+ dpi=150, bbox_inches='tight')
378
+ plt.close()
379
+
380
+ # 2. Target variable correlations
381
+ if target_col in corr_matrix.columns:
382
+ target_corrs = corr_matrix[target_col].drop(target_col, errors='ignore')
383
+ if not target_corrs.empty:
384
+ fig, ax = plt.subplots(figsize=(10, 8))
385
+
386
+ top_corrs = target_corrs.abs().sort_values(ascending=True).tail(20)
387
+ colors = ['red' if target_corrs[feat] < 0 else 'blue'
388
+ for feat in top_corrs.index]
389
+
390
+ ax.barh(range(len(top_corrs)), top_corrs.values, color=colors)
391
+ ax.set_yticks(range(len(top_corrs)))
392
+ ax.set_yticklabels(top_corrs.index)
393
+ ax.set_xlabel('Absolute correlation')
394
+ ax.set_title(f'Top-20 correlations with {target_col}', fontweight='bold')
395
+ ax.grid(True, alpha=0.3, axis='x')
396
+
397
+ plt.tight_layout()
398
+ plt.savefig(os.path.join(plots_dir, 'target_correlations.png'),
399
+ dpi=150, bbox_inches='tight')
400
+ plt.close()
401
+
402
+ # 3. VIF scores plot
403
+ if vif_results['scores']:
404
+ valid_scores = {k: v for k, v in vif_results['scores'].items()
405
+ if not pd.isna(v)}
406
+ if valid_scores:
407
+ fig, ax = plt.subplots(figsize=(12, 8))
408
+
409
+ sorted_scores = dict(sorted(valid_scores.items(),
410
+ key=lambda x: x[1],
411
+ reverse=True)[:25])
412
+
413
+ colors = []
414
+ for vif in sorted_scores.values():
415
+ if vif > 100:
416
+ colors.append('red')
417
+ elif vif > 10:
418
+ colors.append('orange')
419
+ elif vif > 5:
420
+ colors.append('yellow')
421
+ else:
422
+ colors.append('green')
423
+
424
+ bars = ax.barh(list(sorted_scores.keys()),
425
+ list(sorted_scores.values()),
426
+ color=colors, edgecolor='black')
427
+
428
+ ax.set_xlabel('VIF Score')
429
+ ax.set_title('VIF Scores (multicollinearity)', fontweight='bold')
430
+ ax.axvline(x=5, color='yellow', linestyle='--', alpha=0.7)
431
+ ax.axvline(x=10, color='orange', linestyle='--', alpha=0.7)
432
+ ax.axvline(x=100, color='red', linestyle='--', alpha=0.7)
433
+ ax.grid(True, alpha=0.3, axis='x')
434
+
435
+ plt.tight_layout()
436
+ plt.savefig(os.path.join(plots_dir, 'vif_scores.png'),
437
+ dpi=150, bbox_inches='tight')
438
+ plt.close()
439
+
440
+ # 4. High correlations plot
441
+ if high_correlations:
442
+ fig, ax = plt.subplots(figsize=(12, 8))
443
+
444
+ # Limit number for display
445
+ display_corrs = high_correlations[:15]
446
+
447
+ # Create labels for feature pairs
448
+ labels = [f"{corr['feature1']} ↔ {corr['feature2']}"
449
+ for corr in display_corrs]
450
+ values = [corr['correlation'] for corr in display_corrs]
451
+ colors = ['red' if v < 0 else 'blue' for v in values]
452
+
453
+ y_pos = np.arange(len(display_corrs))
454
+ ax.barh(y_pos, values, color=colors)
455
+ ax.set_yticks(y_pos)
456
+ ax.set_yticklabels(labels, fontsize=9)
457
+ ax.invert_yaxis()
458
+ ax.set_xlabel('Correlation')
459
+ ax.set_title('High correlations (|r| > 0.8)', fontweight='bold')
460
+ ax.grid(True, alpha=0.3, axis='x')
461
+
462
+ plt.tight_layout()
463
+ plt.savefig(os.path.join(plots_dir, 'high_correlations.png'),
464
+ dpi=150, bbox_inches='tight')
465
+ plt.close()
466
+
467
+ logger.info(f"Visualisations saved to {plots_dir}")
468
+
469
+ except Exception as e:
470
+ logger.warning(f"Error creating visualisations: {e}")
471
+
472
+ def _log_analysis_results(
473
+ self,
474
+ corr_matrix: pd.DataFrame,
475
+ high_correlations: List[Dict[str, Any]],
476
+ target_correlations: List[Dict[str, Any]],
477
+ vif_results: Dict[str, Any]
478
+ ) -> None:
479
+ """Log analysis results"""
480
+ logger.info("\n" + "="*80)
481
+ logger.info("CORRELATION AND MULTICOLLINEARITY ANALYSIS REPORT")
482
+ logger.info("="*80)
483
+
484
+ # General information
485
+ logger.info(f"\n📊 GENERAL INFORMATION:")
486
+ logger.info(f" Correlation matrix size: {corr_matrix.shape}")
487
+ logger.info(f" Total features: {len(corr_matrix.columns)}")
488
+
489
+ # High correlations
490
+ if high_correlations:
491
+ logger.info(f"\n⚠ HIGH CORRELATIONS (|r| > 0.8): {len(high_correlations)}")
492
+ logger.info(" " + "-" * 60)
493
+
494
+ for i, corr in enumerate(high_correlations[:10]):
495
+ sign = "🟥" if corr['correlation'] < 0 else "🟩"
496
+ logger.info(f" {i+1:2d}. {sign} {corr['feature1']:25s} ↔ {corr['feature2']:25s}: {corr['correlation']:7.4f}")
497
+
498
+ if len(high_correlations) > 10:
499
+ logger.info(f" ... and {len(high_correlations) - 10} more pairs")
500
+
501
+ # Target variable correlations
502
+ if target_correlations:
503
+ logger.info(f"\n🎯 CORRELATIONS WITH TARGET VARIABLE:")
504
+ logger.info(" " + "-" * 60)
505
+
506
+ for i, corr in enumerate(target_correlations[:10]):
507
+ direction = "↓" if corr['correlation'] < 0 else "↑"
508
+ logger.info(f" {i+1:2d}. {direction} {corr['feature']:35s}: {corr['correlation']:7.4f}")
509
+
510
+ # Multicollinearity analysis
511
+ if vif_results['scores']:
512
+ logger.info(f"\n📈 MULTICOLLINEARITY ANALYSIS (VIF):")
513
+ logger.info(" " + "-" * 60)
514
+ logger.info(f" Critical (VIF > 100): {vif_results['summary']['critical']}")
515
+ logger.info(f" High (10 < VIF ≤ 100): {vif_results['summary']['high']}")
516
+ logger.info(f" Medium (5 < VIF ≤ 10): {vif_results['summary']['medium']}")
517
+ logger.info(f" Low (VIF ≤ 5): {vif_results['summary']['low']}")
518
+
519
+ # Top problematic features
520
+ if vif_results['issues']:
521
+ logger.info(f"\n🔴 PROBLEMATIC FEATURES (VIF > 10):")
522
+ for issue in vif_results['issues'][:10]:
523
+ logger.info(f" • {issue['feature']:35s}: VIF = {issue['vif']:7.1f} ({issue['severity']})")
524
+
525
+ logger.info("\n" + "="*80)
526
+ logger.info("RECOMMENDATIONS:")
527
+ logger.info("="*80)
528
+
529
+ # Generate recommendations
530
+ recommendations = []
531
+
532
+ if len(high_correlations) > 20:
533
+ recommendations.append("1. Remove highly correlated features (correlation method)")
534
+
535
+ if vif_results['summary']['critical'] > 0:
536
+ recommendations.append("2. Remove features with critical VIF (>100)")
537
+
538
+ if vif_results['summary']['high'] > 5:
539
+ recommendations.append("3. Consider removing features with VIF > 10")
540
+
541
+ if not recommendations:
542
+ recommendations.append("1. Data in good condition, no serious issues detected")
543
+ recommendations.append("2. Proceed to modelling")
544
+
545
+ for i, rec in enumerate(recommendations, 1):
546
+ logger.info(f" {rec}")
547
+
548
+ logger.info("\n" + "="*80)
549
+
550
+ def remove_highly_correlated(
551
+ self,
552
+ data: pd.DataFrame,
553
+ threshold: float = 0.85,
554
+ method: str = 'variance',
555
+ keep_target: bool = True,
556
+ keep_features: List[str] = None
557
+ ) -> pd.DataFrame:
558
+ """
559
+ Remove highly correlated features
560
+
561
+ Parameters:
562
+ -----------
563
+ data : pd.DataFrame
564
+ Source data
565
+ threshold : float
566
+ Correlation threshold for removal
567
+ method : str
568
+ Feature selection method for removal: 'variance', 'random', 'importance'
569
+ keep_target : bool
570
+ Whether to keep target variable
571
+ keep_features : List[str], optional
572
+ Features to keep
573
+
574
+ Returns:
575
+ --------
576
+ pd.DataFrame
577
+ Data after removing highly correlated features
578
+ """
579
+ logger.info("\n" + "="*80)
580
+ logger.info("REMOVING HIGHLY CORRELATED FEATURES")
581
+ logger.info("="*80)
582
+
583
+ data_clean = data.copy()
584
+
585
+ if 'pearson' not in self.correlation_matrices:
586
+ logger.warning("Correlation matrix not calculated, run analyze() first")
587
+ return data_clean
588
+
589
+ corr_matrix = self.correlation_matrices['pearson']
590
+
591
+ # Features to keep
592
+ features_to_keep = set()
593
+
594
+ if keep_target and self.config.target_column in data_clean.columns:
595
+ features_to_keep.add(self.config.target_column)
596
+
597
+ if keep_features:
598
+ for feat in keep_features:
599
+ if feat in data_clean.columns:
600
+ features_to_keep.add(feat)
601
+
602
+ # Temporal features (usually important for time series)
603
+ temporal_patterns = ['year', 'month', 'day', 'week', 'quarter',
604
+ 'hour', 'minute', 'second', 'sin', 'cos']
605
+
606
+ for col in data_clean.columns:
607
+ if any(pattern in col.lower() for pattern in temporal_patterns):
608
+ features_to_keep.add(col)
609
+
610
+ # Find highly correlated pairs
611
+ upper_triangle = corr_matrix.where(
612
+ np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
613
+ )
614
+
615
+ # Collect highly correlated features
616
+ correlated_features = set()
617
+ for col in upper_triangle.columns:
618
+ if col in features_to_keep:
619
+ continue
620
+
621
+ high_corr = upper_triangle[col][abs(upper_triangle[col]) > threshold]
622
+ for row_idx, corr_value in high_corr.items():
623
+ if not pd.isna(corr_value) and row_idx not in features_to_keep:
624
+ # Select which feature to remove
625
+ if method == 'variance':
626
+ # Remove the one with lower variance
627
+ var_col = data_clean[col].var()
628
+ var_row = data_clean[row_idx].var()
629
+ feature_to_remove = col if var_col < var_row else row_idx
630
+ elif method == 'importance':
631
+ # Remove the one with lower correlation to target variable
632
+ if self.config.target_column in corr_matrix.columns:
633
+ corr_col_target = abs(corr_matrix.loc[col, self.config.target_column])
634
+ corr_row_target = abs(corr_matrix.loc[row_idx, self.config.target_column])
635
+ feature_to_remove = col if corr_col_target < corr_row_target else row_idx
636
+ else:
637
+ # If no target, remove randomly
638
+ feature_to_remove = np.random.choice([col, row_idx])
639
+ else:
640
+ # Remove randomly
641
+ feature_to_remove = np.random.choice([col, row_idx])
642
+
643
+ correlated_features.add(feature_to_remove)
644
+
645
+ # Remove features
646
+ features_to_remove = list(correlated_features)
647
+
648
+ if features_to_remove:
649
+ data_clean = data_clean.drop(columns=features_to_remove)
650
+
651
+ logger.info(f"\n📊 REMOVAL RESULTS:")
652
+ logger.info(f" Initial feature count: {len(data.columns)}")
653
+ logger.info(f" Features removed: {len(features_to_remove)}")
654
+ logger.info(f" Final feature count: {len(data_clean.columns)}")
655
+ logger.info(f" Retained: {len(data_clean.columns)/len(data.columns)*100:.1f}%")
656
+
657
+ if features_to_remove:
658
+ logger.info(f"\n🗑️ REMOVED FEATURES:")
659
+ for i, feat in enumerate(sorted(features_to_remove)[:20]):
660
+ logger.info(f" {i+1:2d}. {feat}")
661
+ if len(features_to_remove) > 20:
662
+ logger.info(f" ... and {len(features_to_remove) - 20} more features")
663
+ else:
664
+ logger.info("✓ No highly correlated features detected, all features retained")
665
+
666
+ logger.info("="*80)
667
+ return data_clean
668
+
669
+ def get_report(self) -> Dict[str, Any]:
670
+ """Get analysis report"""
671
+ report = {
672
+ "correlation_matrix_shape": None,
673
+ "high_correlation_count": 0,
674
+ "vif_summary": {},
675
+ "target_correlation_count": 0
676
+ }
677
+
678
+ if 'pearson' in self.correlation_matrices:
679
+ report["correlation_matrix_shape"] = self.correlation_matrices['pearson'].shape
680
+
681
+ if 'pearson' in self.high_correlation_pairs:
682
+ report["high_correlation_count"] = len(self.high_correlation_pairs['pearson'])
683
+
684
+ if self.vif_scores:
685
+ report["vif_summary"] = self.vif_scores.get('summary', {})
686
+
687
+ return report
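For orientation, a minimal usage sketch of the analyzer defined above. The class name `CorrelationAnalyzer` and the no-argument `Config()` constructor are assumptions (only the methods appear in this hunk); `analyze()` is the entry point the code itself says must run before `remove_highly_correlated()`.

```python
import pandas as pd
from config.config import Config
# from <module> import CorrelationAnalyzer  # module path not shown in this hunk

config = Config()                       # assumes target_column / results_dir are configured
data = pd.read_csv(config.data_path)    # data_path is read the same way in DataLoader below

analyzer = CorrelationAnalyzer(config)  # assumed class name
analyzer.analyze(data)                  # must run first: it fills self.correlation_matrices

# Drop one feature from every pair with |r| > 0.85, keeping the target column
# and, per method='variance', preferring to keep the higher-variance feature.
data_clean = analyzer.remove_highly_correlated(data, threshold=0.85, method='variance')
print(analyzer.get_report())
```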
data_loader/__init__.py ADDED
File without changes
data_loader/data_loader.py ADDED
@@ -0,0 +1,487 @@
1
+ # ============================================
2
+ # CLASS 2: DATA LOADER
3
+ # ============================================
4
+ from datetime import datetime
5
+ import hashlib
6
+ import json
7
+ import traceback
8
+ from typing import Dict, List, Optional
9
+ import logging
+ logger = logging.getLogger(__name__)
10
+ from config.config import Config, DataType
11
+ import numpy as np
12
+ import pandas as pd
13
+
14
+ class DataLoader:
15
+ """Class for loading and initial data processing"""
16
+
17
+ def __init__(self, config: Config):
18
+ """
19
+ Initialise data loader
20
+
21
+ Parameters:
22
+ -----------
23
+ config : Config
24
+ Experiment configuration
25
+ """
26
+ self.config = config
27
+ self.data = None
28
+ self.metadata = {}
29
+ self.data_hash = None
30
+ self.loading_time = None
31
+ self.data_types = {}
32
+ self.original_shape = None
33
+
34
+ def load_from_csv(
35
+ self,
36
+ data_path: Optional[str] = None,
37
+ parse_dates: List[str] = None,
38
+ date_format: str = None,
39
+ dtype: Dict = None,
40
+ **kwargs
41
+ ) -> pd.DataFrame:
42
+ """
43
+ Load data from CSV file
44
+
45
+ Parameters:
46
+ -----------
47
+ data_path : str, optional
48
+ Path to CSV file. If None, uses path from configuration.
49
+ parse_dates : List[str], optional
50
+ List of columns to parse as dates
51
+ date_format : str, optional
52
+ Date format
53
+ dtype : Dict, optional
54
+ Data types for columns
55
+ **kwargs : dict
56
+ Additional parameters for pd.read_csv
57
+
58
+ Returns:
59
+ --------
60
+ pd.DataFrame
61
+ Loaded data
62
+ """
63
+ logger.info("="*80)
64
+ logger.info("LOADING DATA FROM CSV")
65
+ logger.info("="*80)
66
+
67
+ start_time = datetime.now()
68
+
69
+ try:
70
+ path = data_path or self.config.data_path
71
+
72
+ if parse_dates is None:
73
+ parse_dates = ['date']
74
+
75
+ # Load data
76
+ self.data = pd.read_csv(
77
+ path,
78
+ parse_dates=parse_dates,
79
+ dayfirst=False,
80
+ dtype=dtype,
81
+ **kwargs
82
+ )
83
+
84
+ # Convert dates if needed
85
+ for date_col in parse_dates:
86
+ if date_col in self.data.columns:
87
+ if date_format:
88
+ self.data[date_col] = pd.to_datetime(
89
+ self.data[date_col],
90
+ format=date_format,
91
+ errors='coerce'
92
+ )
93
+ else:
94
+ self.data[date_col] = pd.to_datetime(
95
+ self.data[date_col],
96
+ errors='coerce'
97
+ )
98
+
99
+ # Save original shape
100
+ self.original_shape = self.data.shape
101
+
102
+ # Filter by years
103
+ if 'date' in self.data.columns:
104
+ mask = (self.data['date'].dt.year >= self.config.start_year) & \
105
+ (self.data['date'].dt.year <= self.config.end_year)
106
+ self.data = self.data.loc[mask].copy()
107
+
108
+ # Sort by date
109
+ if 'date' in self.data.columns:
110
+ self.data = self.data.sort_values('date').reset_index(drop=True)
111
+ # Set date as index
112
+ self.data.set_index('date', inplace=True)
113
+
114
+ # Calculate data hash
115
+ self.data_hash = self._calculate_data_hash()
116
+
117
+ # Analyse data types
118
+ self._analyse_data_types()
119
+
120
+ # Save metadata
121
+ self._save_metadata()
122
+
123
+ # Loading time
124
+ self.loading_time = (datetime.now() - start_time).total_seconds()
125
+
126
+ logger.info(f"✓ Loaded {len(self.data)} records, {len(self.data.columns)} columns")
127
+ logger.info(f" Period: {self.data.index.min()} - {self.data.index.max()}")
128
+ logger.info(f" Data types: {self.data_types}")
129
+ logger.info(f" Target variable: {self.config.target_column}")
130
+ logger.info(f" Loading time: {self.loading_time:.2f} sec")
131
+
132
+ return self.data
133
+
134
+ except Exception as e:
135
+ logger.error(f"✗ Error loading data: {e}")
136
+ logger.error(traceback.format_exc())
137
+ raise
138
+
139
+ def create_synthetic_data(
140
+ self,
141
+ n_days: int = 365*21,
142
+ trend_strength: float = 0.01,
143
+ seasonal_amplitude: List[float] = None,
144
+ noise_std: float = 10,
145
+ include_exogenous: bool = True,
146
+ random_state: int = 42
147
+ ) -> pd.DataFrame:
148
+ """
149
+ Create synthetic data for testing
150
+
151
+ Parameters:
152
+ -----------
153
+ n_days : int
154
+ Number of days to generate
155
+ trend_strength : float
156
+ Trend strength
157
+ seasonal_amplitude : List[float], optional
158
+ Seasonal component amplitudes
159
+ noise_std : float
160
+ Noise standard deviation
161
+ include_exogenous : bool
162
+ Whether to include exogenous variables
163
+ random_state : int
164
+ Seed for reproducibility
165
+
166
+ Returns:
167
+ --------
168
+ pd.DataFrame
169
+ Synthetic data
170
+ """
171
+ logger.info("="*80)
172
+ logger.info("CREATING SYNTHETIC DATA")
173
+ logger.info("="*80)
174
+
175
+ if seasonal_amplitude is None:
176
+ seasonal_amplitude = [50, 30, 20]
177
+
178
+ np.random.seed(random_state)
179
+
180
+ # Generate dates
181
+ dates = pd.date_range(
182
+ start=f'{self.config.start_year}-01-01',
183
+ periods=n_days,
184
+ freq='D'
185
+ )
186
+
187
+ t = np.arange(n_days)
188
+
189
+ # Base components
190
+ trend = trend_strength * t
191
+
192
+ # Seasonal components
193
+ seasonal = 0
194
+ periods = [365, 30, 7] # yearly, monthly, weekly seasonality
195
+ for i, (period, amplitude) in enumerate(zip(periods, seasonal_amplitude)):
196
+ seasonal += amplitude * np.sin(2 * np.pi * t / period)
197
+ if i < len(seasonal_amplitude) - 1:
198
+ seasonal += 0.5 * amplitude * np.cos(4 * np.pi * t / period)
199
+
200
+ # Cyclical component (business cycles)
201
+ cycle = 20 * np.sin(2 * np.pi * t / (365*5)) # 5-year cycle
202
+
203
+ # Noise
204
+ noise = np.random.normal(0, noise_std, n_days)
205
+
206
+ # Generate target variable
207
+ raskhodvoda = 100 + trend + seasonal + cycle + noise
208
+
209
+ # Create DataFrame
210
+ self.data = pd.DataFrame(
211
+ index=dates,
212
+ data={'raskhodvoda': raskhodvoda}
213
+ )
214
+
215
+ # Generate exogenous variables
216
+ if include_exogenous:
217
+ # Temperature with seasonality
218
+ tavg = 10 + 8 * np.sin(2 * np.pi * t / 365) + np.random.normal(0, 3, n_days)
219
+ tmin = tavg - 5 + np.random.normal(0, 2, n_days)
220
+ tmax = tavg + 5 + np.random.normal(0, 2, n_days)
221
+
222
+ # Water level with trend and seasonality
223
+ urovenvoda = 200 + 0.5 * t + 20 * np.sin(2 * np.pi * t / 365) + np.random.normal(0, 5, n_days)
224
+
225
+ # Add to DataFrame
226
+ self.data['tavg'] = tavg
227
+ self.data['tmin'] = tmin
228
+ self.data['tmax'] = tmax
229
+ self.data['urovenvoda'] = urovenvoda
230
+
231
+ # Add noisy lags
232
+ for lag in [1, 7, 30]:
233
+ self.data[f'tavg_lag_{lag}'] = self.data['tavg'].shift(lag) + np.random.normal(0, 1, n_days)
234
+
235
+ # Add missing values and outliers for testing
236
+ if n_days > 100:
237
+ # Missing values (5% of data)
238
+ mask_missing = np.random.random(n_days) < 0.05
239
+ self.data.loc[mask_missing, 'tavg'] = np.nan
240
+
241
+ # Outliers (1% of data)
242
+ mask_outliers = np.random.random(n_days) < 0.01
243
+ self.data.loc[mask_outliers, 'raskhodvoda'] *= 2
244
+
245
+ # Save metadata
246
+ self.metadata.update({
247
+ 'is_synthetic': True,
248
+ 'synthetic_params': {
249
+ 'n_days': n_days,
250
+ 'trend_strength': trend_strength,
251
+ 'seasonal_amplitude': seasonal_amplitude,
252
+ 'noise_std': noise_std,
253
+ 'include_exogenous': include_exogenous,
254
+ 'random_state': random_state
255
+ }
256
+ })
257
+
258
+ logger.info(f"✓ Created {len(self.data)} synthetic records")
259
+ logger.info(f" Columns: {list(self.data.columns)}")
260
+
261
+ return self.data
262
+
263
+ def _calculate_data_hash(self) -> str:
264
+ """Calculate data hash for tracking changes"""
265
+ if self.data is None:
266
+ return None
267
+
268
+ # Hash a sample of the first 1000 rows to detect data changes
269
+ sample = self.data.head(1000).to_string().encode()
270
+ return hashlib.md5(sample).hexdigest()
271
+
272
+ def _analyse_data_types(self) -> None:
273
+ """Analyse data types in DataFrame"""
274
+ if self.data is None:
275
+ return
276
+
277
+ for col in self.data.columns:
278
+ dtype = str(self.data[col].dtype)
279
+
280
+ if 'datetime' in dtype:
281
+ self.data_types[col] = DataType.TEMPORAL.value
282
+ elif 'int' in dtype or 'float' in dtype:
283
+ self.data_types[col] = DataType.NUMERIC.value
284
+ elif 'object' in dtype or 'category' in dtype:
285
+ # Check if categorical
286
+ unique_ratio = self.data[col].nunique() / len(self.data)
287
+ if unique_ratio < 0.1: # Less than 10% unique values
288
+ self.data_types[col] = DataType.CATEGORICAL.value
289
+ else:
290
+ self.data_types[col] = DataType.TEXT.value
291
+ else:
292
+ self.data_types[col] = 'unknown'
293
+
294
+ def _save_metadata(self) -> None:
295
+ """Save data metadata"""
296
+ if self.data is None:
297
+ return
298
+
299
+ # Basic metadata
300
+ self.metadata.update({
301
+ 'original_shape': list(self.original_shape) if self.original_shape else [],
302
+ 'current_shape': list(self.data.shape),
303
+ 'columns': list(self.data.columns),
304
+ 'data_types': self.data_types,
305
+ 'date_range': {
306
+ 'min': self.data.index.min().strftime('%Y-%m-%d') if pd.notnull(self.data.index.min()) else None,
307
+ 'max': self.data.index.max().strftime('%Y-%m-%d') if pd.notnull(self.data.index.max()) else None
308
+ },
309
+ 'data_hash': self.data_hash,
310
+ 'loading_time': self.loading_time
311
+ })
312
+
313
+ # Statistics for numeric columns
314
+ numeric_cols = self.data.select_dtypes(include=[np.number]).columns
315
+ if len(numeric_cols) > 0:
316
+ stats = self.data[numeric_cols].describe().to_dict()
317
+ # Add additional statistics
318
+ for col in numeric_cols:
319
+ stats[col]['skewness'] = float(self.data[col].skew())
320
+ stats[col]['kurtosis'] = float(self.data[col].kurtosis())
321
+ stats[col]['cv'] = float(self.data[col].std() / self.data[col].mean()) if self.data[col].mean() != 0 else np.nan
322
+
323
+ self.metadata['numeric_statistics'] = stats
324
+
325
+ # Missing values information
326
+ missing_info = {
327
+ 'total_missing': int(self.data.isnull().sum().sum()),
328
+ 'missing_by_column': self.data.isnull().sum().to_dict(),
329
+ 'missing_percentage': (self.data.isnull().sum() / len(self.data) * 100).to_dict(),
330
+ 'rows_with_missing': int(self.data.isnull().any(axis=1).sum()),
331
+ 'columns_with_missing': self.data.columns[self.data.isnull().any()].tolist()
332
+ }
333
+ self.metadata['missing_info'] = missing_info
334
+
335
+ def get_data_info(self) -> Dict:
336
+ """Get information about data"""
337
+ if self.data is None:
338
+ return {}
339
+
340
+ info = {
341
+ 'shape': list(self.data.shape),
342
+ 'columns': list(self.data.columns),
343
+ 'data_types': self.data_types,
344
+ 'date_range': {
345
+ 'min': self.data.index.min().strftime('%Y-%m-%d') if pd.notnull(self.data.index.min()) else None,
346
+ 'max': self.data.index.max().strftime('%Y-%m-%d') if pd.notnull(self.data.index.max()) else None
347
+ },
348
+ 'target_column': self.config.target_column,
349
+ 'numeric_columns': self.data.select_dtypes(include=[np.number]).columns.tolist(),
350
+ 'categorical_columns': [col for col, dtype in self.data_types.items()
351
+ if dtype == DataType.CATEGORICAL.value],
352
+ 'missing_info': self.metadata.get('missing_info', {})
353
+ }
354
+
355
+ return info
356
+
357
+ def save_raw_data_info(self) -> None:
358
+ """Save raw data information"""
359
+ if self.data is None:
360
+ return
361
+
362
+ info_path = f'{self.config.results_dir}/reports/raw_data_info.json'
363
+
364
+ # Custom JSON encoder for handling numpy types
365
+ class NumpyEncoder(json.JSONEncoder):
366
+ def default(self, obj):
367
+ if isinstance(obj, (np.integer, np.floating)):
368
+ if np.isnan(obj):
369
+ return None
370
+ return float(obj)
371
+ elif isinstance(obj, np.bool_):
372
+ return bool(obj)
373
+ elif isinstance(obj, np.ndarray):
374
+ return obj.tolist()
375
+ elif isinstance(obj, pd.Timestamp):
376
+ return obj.strftime('%Y-%m-%d %H:%M:%S')
377
+ elif isinstance(obj, pd.Period):
378
+ return str(obj)
379
+ return super().default(obj)
380
+
381
+ with open(info_path, 'w', encoding='utf-8') as f:
382
+ json.dump(self.metadata, f, indent=4, ensure_ascii=False, cls=NumpyEncoder)
383
+
384
+ logger.info(f"✓ Raw data information saved: {info_path}")
385
+
386
+ def resample_data(
387
+ self,
388
+ freq: str = None,
389
+ method: str = 'mean'
390
+ ) -> pd.DataFrame:
391
+ """
392
+ Resample time series data
393
+
394
+ Parameters:
395
+ -----------
396
+ freq : str, optional
397
+ New frequency (e.g., 'D', 'W', 'M')
398
+ method : str
399
+ Aggregation method: 'mean', 'sum', 'last', 'first'
400
+
401
+ Returns:
402
+ --------
403
+ pd.DataFrame
404
+ Resampled data
405
+ """
406
+ if self.data is None:
407
+ logger.warning("Data not loaded")
408
+ return None
409
+
410
+ freq = freq or self.config.freq
411
+
412
+ # Check if index is datetime
413
+ if not isinstance(self.data.index, pd.DatetimeIndex):
414
+ logger.error("Data index is not DatetimeIndex")
415
+ return self.data
416
+
417
+ # Aggregation methods
418
+ agg_methods = {
419
+ 'mean': np.mean,
420
+ 'sum': np.sum,
421
+ 'last': lambda x: x.iloc[-1],
422
+ 'first': lambda x: x.iloc[0],
423
+ 'min': np.min,
424
+ 'max': np.max,
425
+ 'median': np.median
426
+ }
427
+
428
+ if method not in agg_methods:
429
+ logger.warning(f"Method {method} not supported, using mean")
430
+ method = 'mean'
431
+
432
+ # Resampling
433
+ try:
434
+ if method == 'last':
435
+ resampled_data = self.data.resample(freq).last()
436
+ elif method == 'first':
437
+ resampled_data = self.data.resample(freq).first()
438
+ else:
439
+ resampled_data = self.data.resample(freq).agg(agg_methods[method])
440
+
441
+ logger.info(f"Data resampled to frequency {freq}, method {method}")
442
+ logger.info(f"Size before: {len(self.data)}, after: {len(resampled_data)}")
443
+
444
+ self.data = resampled_data
445
+ return self.data
446
+
447
+ except Exception as e:
448
+ logger.error(f"Error during resampling: {e}")
449
+ return self.data
450
+
451
+ def detect_frequency(self) -> str:
452
+ """
453
+ Automatically detect data frequency
454
+
455
+ Returns:
456
+ --------
457
+ str
458
+ Detected data frequency
459
+ """
460
+ if self.data is None or len(self.data) < 2:
461
+ return 'unknown'
462
+
463
+ if not isinstance(self.data.index, pd.DatetimeIndex):
464
+ return 'irregular'
465
+
466
+ # Calculate differences between timestamps
467
+ diffs = pd.Series(self.data.index).diff().dropna()
468
+
469
+ if len(diffs) == 0:
470
+ return 'unknown'
471
+
472
+ # Most frequent difference
473
+ mode_diff = diffs.mode().iloc[0] if not diffs.mode().empty else diffs.iloc[0]
474
+
475
+ # Determine frequency
476
+ if mode_diff <= pd.Timedelta('1 hour'):
+ return 'H' # Hourly
+ elif mode_diff <= pd.Timedelta('1 day'):
+ return 'D' # Daily
+ elif mode_diff <= pd.Timedelta('7 days'):
+ return 'W' # Weekly
+ elif mode_diff <= pd.Timedelta('31 days'):
+ return 'M' # Monthly
+ elif mode_diff <= pd.Timedelta('92 days'):
+ return 'Q' # Quarterly
+ else:
+ return 'Y' # Yearly
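A short, hedged example of how this loader appears intended to be driven; the `Config` fields used here (`data_path`, `start_year`/`end_year`, `target_column`, `freq`) are assumed to carry sensible defaults.

```python
from config.config import Config
from data_loader.data_loader import DataLoader

config = Config()  # assumes data_path, start_year/end_year, target_column and freq are set
loader = DataLoader(config)

# Real data: df = loader.load_from_csv(parse_dates=['date'])
df = loader.create_synthetic_data(n_days=365 * 3, noise_std=5)

print(loader.detect_frequency())        # 'D' for the daily synthetic series
weekly = loader.resample_data(freq='W', method='mean')
print(loader.get_data_info()['shape'])  # row count shrinks ~7x after weekly resampling
```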
decomposition/__init__.py ADDED
File without changes
decomposition/decomposer.py ADDED
@@ -0,0 +1,690 @@
1
+ # ============================================
2
+ # CLASS 7: TIME SERIES DECOMPOSITION
3
+ # ============================================
4
+ import traceback
5
+ from typing import Dict, Optional
6
+ import logging
+ logger = logging.getLogger(__name__)
7
+
8
+ from config.config import Config
9
+
10
+ import pandas as pd
11
+ import numpy as np
12
+ import matplotlib.pyplot as plt
13
+ import statsmodels.api as sm
14
+ from scipy import stats
15
+ from statsmodels.tsa.seasonal import seasonal_decompose, STL
16
+ from statsmodels.tsa.stattools import adfuller, kpss, acf, pacf
17
+ from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
18
+ from statsmodels.stats.diagnostic import acorr_ljungbox
19
+ from statsmodels.tsa.statespace.sarimax import SARIMAX
20
+ from statsmodels.tsa.holtwinters import ExponentialSmoothing
21
+
22
+
23
+ class TimeSeriesDecomposer:
24
+ """Class for time series decomposition"""
25
+
26
+ def __init__(self, config: Config):
27
+ """
28
+ Initialise decomposer
29
+
30
+ Parameters:
31
+ -----------
32
+ config : Config
33
+ Experiment configuration
34
+ """
35
+ self.config = config
36
+ self.decomposition_results = {}
37
+ self.decomposition_models = {}
38
+ self.seasonal_periods = {}
39
+
40
+ def decompose(
41
+ self,
42
+ data: pd.DataFrame,
43
+ target_col: Optional[str] = None,
44
+ method: str = 'stl',
45
+ period: Optional[int] = None,
46
+ **kwargs
47
+ ) -> Dict:
48
+ """
49
+ Decompose time series into components
50
+
51
+ Parameters:
52
+ -----------
53
+ data : pd.DataFrame
54
+ Input data
55
+ target_col : str, optional
56
+ Target variable. If None, uses configuration value.
57
+ method : str
58
+ Decomposition model: 'stl', 'seasonal_decompose', 'mstl', 'naive'
59
+ period : int, optional
60
+ Seasonality period. If None, uses configuration value.
61
+ **kwargs : dict
62
+ Additional parameters for method
63
+
64
+ Returns:
65
+ --------
66
+ Dict
67
+ Decomposition results
68
+ """
69
+ logger.info("\n" + "="*80)
70
+ logger.info("TIME SERIES DECOMPOSITION")
71
+ logger.info("="*80)
72
+
73
+ target_col = target_col or self.config.target_column
74
+ period = period or self.config.seasonal_period
75
+
76
+ if target_col not in data.columns:
77
+ logger.error(f"Target variable '{target_col}' not found")
78
+ return {}
79
+
80
+ # Set date as index if not set
81
+ if not isinstance(data.index, pd.DatetimeIndex):
82
+ if 'date' in data.columns:
83
+ data = data.set_index('date')
84
+ else:
85
+ logger.error("DatetimeIndex required for decomposition")
86
+ return {}
87
+
88
+ series = data[target_col]
89
+
90
+ # Automatic seasonality period detection
91
+ if period is None or period == 'auto':
92
+ period = self._detect_seasonal_period(series)
93
+ logger.info(f"Automatically detected seasonality period: {period}")
94
+
95
+ try:
96
+ decomposition_result = None
97
+
98
+ if method == 'stl':
99
+ decomposition_result = self._stl_decomposition(series, period, **kwargs)
100
+ elif method == 'seasonal_decompose':
101
+ decomposition_result = self._seasonal_decompose(series, period, **kwargs)
102
+ elif method == 'mstl':
103
+ decomposition_result = self._mstl_decomposition(series, **kwargs)
104
+ elif method == 'naive':
105
+ decomposition_result = self._naive_decomposition(series, period, **kwargs)
106
+ else:
107
+ logger.warning(f"Method {method} not supported, using STL")
108
+ decomposition_result = self._stl_decomposition(series, period, **kwargs)
109
+
110
+ if decomposition_result is None:
111
+ logger.error("Decomposition failed")
112
+ return {}
113
+
114
+ # Analyse residuals
115
+ residuals_info = self._analyse_residuals(decomposition_result.get('residual', None))
116
+
117
+ # Analyse seasonality
118
+ seasonal_info = self._analyse_seasonality(
119
+ decomposition_result.get('seasonal', None),
120
+ period
121
+ )
122
+
123
+ # Save results
124
+ self.decomposition_results[target_col] = {
125
+ 'method': method,
126
+ 'period': period,
127
+ 'residuals_analysis': residuals_info,
128
+ 'seasonality_analysis': seasonal_info,
129
+ 'components_present': list(decomposition_result.keys()),
130
+ 'decomposition_stats': {
131
+ 'trend_strength': self._calculate_trend_strength(
132
+ decomposition_result.get('trend', None),
133
+ decomposition_result.get('residual', None)
134
+ ),
135
+ 'seasonal_strength': self._calculate_seasonal_strength(
136
+ decomposition_result.get('seasonal', None),
137
+ decomposition_result.get('residual', None)
138
+ )
139
+ }
140
+ }
141
+
142
+ # Visualisation
143
+ if self.config.save_plots:
144
+ self._plot_decomposition(data, target_col, decomposition_result, method, period)
145
+
146
+ # Additional visualisation
147
+ if residuals_info:
148
+ self._plot_residuals_analysis(decomposition_result.get('residual', None), target_col)
149
+
150
+ return self.decomposition_results[target_col]
151
+
152
+ except Exception as e:
153
+ logger.error(f"Error during decomposition: {e}")
154
+ logger.error(traceback.format_exc())
155
+ return {}
156
+
157
+ def _detect_seasonal_period(self, series: pd.Series) -> int:
158
+ """Automatic seasonality period detection"""
159
+ if len(series) < 100:
160
+ return self.config.seasonal_period
161
+
162
+ try:
163
+ # Use autocorrelation to determine period
164
+ acf_values = acf(series.dropna(), nlags=min(500, len(series)//2))
165
+
166
+ # Find peaks in autocorrelation
167
+ peaks = []
168
+ for i in range(1, len(acf_values)-1):
169
+ if acf_values[i] > acf_values[i-1] and acf_values[i] > acf_values[i+1]:
170
+ if acf_values[i] > 0.3: # Significance threshold
171
+ peaks.append(i)
172
+
173
+ if peaks:
174
+ # Take the first (shortest-lag) peak as the candidate period
175
+ dominant_period = peaks[0]
176
+
177
+ # Check for multiple periods
178
+ for period in [7, 30, 90, 365]:
179
+ if abs(dominant_period - period) <= 2:
180
+ return period
181
+
182
+ return dominant_period
183
+
184
+ return self.config.seasonal_period
185
+
186
+ except Exception:
187
+ return self.config.seasonal_period
188
+
189
+ def _stl_decomposition(
190
+ self,
191
+ series: pd.Series,
192
+ period: int,
193
+ **kwargs
194
+ ) -> Optional[Dict]:
195
+ """STL decomposition"""
196
+ try:
197
+ if len(series) < 2 * period:
198
+ logger.warning(f"Insufficient data for STL decomposition with period {period}")
199
+ return self._seasonal_decompose(series, period, **kwargs)
200
+
201
+ # STL decomposition
202
+ stl = STL(
203
+ series,
204
+ period=period,
205
+ seasonal=kwargs.get('seasonal', 7),
206
+ trend=kwargs.get('trend', None),
207
+ robust=kwargs.get('robust', True),
208
+ seasonal_deg=kwargs.get('seasonal_deg', 1),
209
+ trend_deg=kwargs.get('trend_deg', 1),
210
+ low_pass_deg=kwargs.get('low_pass_deg', 1)
211
+ )
212
+
213
+ result = stl.fit()
214
+
215
+ return {
216
+ 'trend': result.trend,
217
+ 'seasonal': result.seasonal,
218
+ 'residual': result.resid,
219
+ 'observed': series
220
+ }
221
+
222
+ except Exception as e:
223
+ logger.warning(f"STL decomposition failed: {e}")
224
+ return self._seasonal_decompose(series, period, **kwargs)
225
+
226
+ def _seasonal_decompose(
227
+ self,
228
+ series: pd.Series,
229
+ period: int,
230
+ **kwargs
231
+ ) -> Optional[Dict]:
232
+ """Seasonal decomposition from statsmodels"""
233
+ try:
234
+ model = kwargs.get('model', 'additive')
235
+
236
+ if len(series) < 2 * period:
237
+ # Reduce period if insufficient data
238
+ period = max(7, len(series) // 4)
239
+
240
+ decomposition = seasonal_decompose(
241
+ series,
242
+ model=model,
243
+ period=period,
244
+ extrapolate_trend=kwargs.get('extrapolate_trend', 'freq'),
245
+ two_sided=kwargs.get('two_sided', True)
246
+ )
247
+
248
+ return {
249
+ 'trend': decomposition.trend,
250
+ 'seasonal': decomposition.seasonal,
251
+ 'residual': decomposition.resid,
252
+ 'observed': series
253
+ }
254
+
255
+ except Exception as e:
256
+ logger.warning(f"Seasonal decompose failed: {e}")
257
+ return self._naive_decomposition(series, period, **kwargs)
258
+
259
+ def _mstl_decomposition(
260
+ self,
261
+ series: pd.Series,
262
+ **kwargs
263
+ ) -> Optional[Dict]:
264
+ """Multi-seasonal decomposition (simplified)"""
265
+ try:
266
+ # Simplified MSTL version
267
+ periods = kwargs.get('periods', [7, 365])
268
+
269
+ result = {
270
+ 'observed': series,
271
+ 'trend': None,
272
+ 'seasonal': pd.Series(0, index=series.index),
273
+ 'residual': series.copy()
274
+ }
275
+
276
+ # Sequentially remove seasonal components
277
+ for period in periods:
278
+ if len(series) >= 2 * period:
279
+ try:
280
+ decomp = seasonal_decompose(
281
+ result['residual'],
282
+ model='additive',
283
+ period=period,
284
+ extrapolate_trend='freq'
285
+ )
286
+
287
+ if result['trend'] is None:
288
+ result['trend'] = decomp.trend
289
+
290
+ result['seasonal'] = result['seasonal'] + decomp.seasonal
291
+ result['residual'] = decomp.resid
292
+ except Exception:
293
+ continue
294
+
295
+ if result['trend'] is None:
296
+ result['trend'] = series.rolling(window=min(365, len(series)//4), center=True).mean()
297
+
298
+ return result
299
+
300
+ except Exception as e:
301
+ logger.warning(f"MSTL decomposition failed: {e}")
302
+ return self._seasonal_decompose(series, 365, **kwargs)
303
+
304
+ def _naive_decomposition(
305
+ self,
306
+ series: pd.Series,
307
+ period: int,
308
+ **kwargs
309
+ ) -> Optional[Dict]:
310
+ """Naive decomposition"""
311
+ try:
312
+ # Simple decomposition using moving averages
313
+ trend = series.rolling(
314
+ window=min(period, len(series)//4),
315
+ center=True,
316
+ min_periods=1
317
+ ).mean()
318
+
319
+ # Seasonal component
320
+ if period > 1:
321
+ # Average by seasons
322
+ seasonal = series.groupby(series.index.dayofyear if period == 365 else
323
+ series.index.dayofweek if period == 7 else
324
+ series.index.month).transform('mean')
325
+ seasonal = seasonal - seasonal.mean()
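+ # centre the seasonal pattern at zero so the overall level stays in the trend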
326
+ else:
327
+ seasonal = pd.Series(0, index=series.index)
328
+
329
+ residual = series - trend - seasonal
330
+
331
+ return {
332
+ 'trend': trend,
333
+ 'seasonal': seasonal,
334
+ 'residual': residual,
335
+ 'observed': series
336
+ }
337
+
338
+ except Exception as e:
339
+ logger.error(f"Naive decomposition failed: {e}")
340
+ return None
341
+
342
+ def _analyse_residuals(self, residuals) -> Dict:
343
+ """Analyse decomposition residuals"""
344
+ if residuals is None:
345
+ return {}
346
+
347
+ residuals_clean = residuals.dropna()
348
+
349
+ if len(residuals_clean) == 0:
350
+ return {}
351
+
352
+ stats_info = {
353
+ 'mean': float(residuals_clean.mean()),
354
+ 'std': float(residuals_clean.std()),
355
+ 'skewness': float(residuals_clean.skew()),
356
+ 'kurtosis': float(residuals_clean.kurtosis()),
357
+ 'min': float(residuals_clean.min()),
358
+ 'max': float(residuals_clean.max()),
359
+ 'mad': float((residuals_clean - residuals_clean.mean()).abs().mean()),
360
+ 'normality_tests': {},
361
+ 'autocorrelation_tests': {}
362
+ }
363
+
364
+ # Normality test
365
+ if len(residuals_clean) > 3:
366
+ try:
367
+ # Shapiro-Wilk test
368
+ shapiro_stat, shapiro_p = stats.shapiro(residuals_clean.iloc[:5000])
369
+ stats_info['normality_tests']['shapiro_wilk'] = {
370
+ 'statistic': float(shapiro_stat),
371
+ 'pvalue': float(shapiro_p),
372
+ 'is_normal': shapiro_p > 0.05
373
+ }
374
+
375
+ # Anderson-Darling test
376
+ anderson_result = stats.anderson(residuals_clean, dist='norm')
377
+ stats_info['normality_tests']['anderson_darling'] = {
378
+ 'statistic': float(anderson_result.statistic),
379
+ 'critical_values': {str(level): float(value)
380
+ for level, value in zip(anderson_result.significance_level,
381
+ anderson_result.critical_values)},
382
+ 'is_normal': anderson_result.statistic < anderson_result.critical_values[2] # At 5% level
383
+ }
384
+ except Exception:
385
+ stats_info['normality_tests']['error'] = 'not enough data or calculation error'
386
+
387
+ # Autocorrelation test
388
+ try:
389
+ # Ljung-Box test
390
+ lb_test = acorr_ljungbox(residuals_clean, lags=[10, 20, 30], return_df=True)
391
+
392
+ autocorr_info = {}
393
+ for idx, row in lb_test.iterrows():
394
+ autocorr_info[f'lag_{int(row.name)}'] = {
395
+ 'statistic': float(row['lb_stat']),
396
+ 'pvalue': float(row['lb_pvalue']),
397
+ 'has_autocorrelation': row['lb_pvalue'] < 0.05
398
+ }
399
+
400
+ stats_info['autocorrelation_tests']['ljung_box'] = autocorr_info
401
+
402
+ # Durbin-Watson test
403
+ try:
404
+ dw_stat = sm.stats.stattools.durbin_watson(residuals_clean)
405
+ stats_info['autocorrelation_tests']['durbin_watson'] = {
406
+ 'statistic': float(dw_stat),
407
+ 'interpretation': 'no autocorrelation' if 1.5 < dw_stat < 2.5 else
408
+ 'positive autocorrelation' if dw_stat < 1.5 else
409
+ 'negative autocorrelation'
410
+ }
411
+ except Exception:
412
+ pass
413
+
414
+ except Exception:
415
+ stats_info['autocorrelation_tests']['error'] = 'calculation error'
416
+
417
+ # Heteroskedasticity test
418
+ try:
419
+ # ARCH test
420
+ from statsmodels.stats.diagnostic import het_arch
421
+ arch_test = het_arch(residuals_clean)
422
+ stats_info['heteroskedasticity_tests'] = {
423
+ 'arch': {
424
+ 'statistic': float(arch_test[0]),
425
+ 'pvalue': float(arch_test[1]),
426
+ 'is_homoskedastic': arch_test[1] > 0.05
427
+ }
428
+ }
429
+ except Exception:
430
+ pass
431
+
432
+ return stats_info
433
+
434
+ def _analyse_seasonality(self, seasonal_component, period: int) -> Dict:
435
+ """Analyse seasonal component"""
436
+ if seasonal_component is None:
437
+ return {}
438
+
439
+ seasonal_clean = seasonal_component.dropna()
440
+
441
+ if len(seasonal_clean) == 0:
442
+ return {}
443
+
444
+ analysis = {
445
+ 'period': period,
446
+ 'amplitude': float(seasonal_clean.max() - seasonal_clean.min()),
447
+ 'mean_amplitude': float(seasonal_clean.abs().mean()),
448
+ 'seasonal_strength': float(seasonal_clean.std()),
449
+ 'periodicity_check': {}
450
+ }
451
+
452
+ # Check periodicity via autocorrelation
453
+ if len(seasonal_clean) > period * 2:
454
+ try:
455
+ acf_values = acf(seasonal_clean, nlags=min(period * 3, len(seasonal_clean)//2))
456
+
457
+ # Look for peaks at expected lags
458
+ expected_lags = [period, period*2]
459
+ peaks_found = []
460
+
461
+ for lag in expected_lags:
462
+ if lag < len(acf_values):
463
+ if acf_values[lag] > 0.5: # Strong autocorrelation at period
464
+ peaks_found.append({
465
+ 'lag': lag,
466
+ 'autocorrelation': float(acf_values[lag]),
467
+ 'is_significant': True
468
+ })
469
+
470
+ analysis['periodicity_check']['autocorrelation_peaks'] = peaks_found
471
+ analysis['periodicity_check']['is_periodic'] = len(peaks_found) > 0
472
+ except Exception:
473
+ pass
474
+
475
+ # Seasonality pattern analysis
476
+ if isinstance(seasonal_clean.index, pd.DatetimeIndex):
477
+ try:
478
+ # Group by months/week days
479
+ if period == 12 or period == 365:
480
+ # Monthly seasonality
481
+ monthly_seasonal = seasonal_clean.groupby(seasonal_clean.index.month).mean()
482
+ analysis['monthly_pattern'] = monthly_seasonal.to_dict()
483
+
484
+ if period == 7 or period == 365:
485
+ # Daily seasonality
486
+ daily_seasonal = seasonal_clean.groupby(seasonal_clean.index.dayofweek).mean()
487
+ analysis['daily_pattern'] = daily_seasonal.to_dict()
488
+ except Exception:
489
+ pass
490
+
491
+ return analysis
492
+
493
+ def _calculate_trend_strength(self, trend, residual) -> float:
494
+ """Calculate trend strength"""
495
+ if trend is None or residual is None:
496
+ return 0.0
497
+
498
+ trend_clean = trend.dropna()
499
+ residual_clean = residual.dropna()
500
+
501
+ if len(trend_clean) == 0 or len(residual_clean) == 0:
502
+ return 0.0
503
+
504
+ # Trend strength = 1 - Var(residual) / Var(trend + residual)
505
+ try:
506
+ var_total = np.var(trend_clean + residual_clean)
507
+ if var_total > 0:
508
+ trend_strength = 1 - np.var(residual_clean) / var_total
509
+ return max(0.0, min(1.0, float(trend_strength)))
510
+ except Exception:
511
+ pass
512
+
513
+ return 0.0
514
+
515
+ def _calculate_seasonal_strength(self, seasonal, residual) -> float:
516
+ """Calculate seasonality strength"""
517
+ if seasonal is None or residual is None:
518
+ return 0.0
519
+
520
+ seasonal_clean = seasonal.dropna()
521
+ residual_clean = residual.dropna()
522
+
523
+ if len(seasonal_clean) == 0 or len(residual_clean) == 0:
524
+ return 0.0
525
+
526
+ # Seasonality strength = 1 - Var(residual) / Var(seasonal + residual)
527
+ try:
528
+ var_total = np.var(seasonal_clean + residual_clean)
529
+ if var_total > 0:
530
+ seasonal_strength = 1 - np.var(residual_clean) / var_total
531
+ return max(0.0, min(1.0, float(seasonal_strength)))
532
+ except Exception:
533
+ pass
534
+
535
+ return 0.0
536
+
537
+ def _plot_decomposition(
538
+ self,
539
+ data: pd.DataFrame,
540
+ target_col: str,
541
+ decomposition: Dict,
542
+ method: str,
543
+ period: int
544
+ ) -> None:
545
+ """Visualise decomposition"""
546
+ fig, axes = plt.subplots(4, 1, figsize=(14, 12))
547
+
548
+ # Original series
549
+ axes[0].plot(decomposition.get('observed', pd.Series()))
550
+ axes[0].set_ylabel('Observed')
551
+ axes[0].set_title(f'Time Series Decomposition: {target_col} ({method}, period={period})')
552
+ axes[0].grid(True, alpha=0.3)
553
+
554
+ # Trend
555
+ if 'trend' in decomposition and decomposition['trend'] is not None:
556
+ axes[1].plot(decomposition['trend'])
557
+ axes[1].set_ylabel('Trend')
558
+ axes[1].grid(True, alpha=0.3)
559
+
560
+ # Seasonality
561
+ if 'seasonal' in decomposition and decomposition['seasonal'] is not None:
562
+ axes[2].plot(decomposition['seasonal'])
563
+ axes[2].set_ylabel('Seasonality')
564
+ axes[2].grid(True, alpha=0.3)
565
+
566
+ # Residuals
567
+ if 'residual' in decomposition and decomposition['residual'] is not None:
568
+ axes[3].plot(decomposition['residual'])
569
+ axes[3].set_ylabel('Residuals')
570
+ axes[3].set_xlabel('Date')
571
+ axes[3].grid(True, alpha=0.3)
572
+
573
+ plt.tight_layout()
574
+ plt.savefig(
575
+ f'{self.config.results_dir}/plots/decomposition_{target_col}.png',
576
+ dpi=300,
577
+ bbox_inches='tight'
578
+ )
579
+ plt.show()
580
+
581
+ # Additional plots
582
+ self._plot_decomposition_components(data, target_col, decomposition)
583
+
584
+ def _plot_decomposition_components(
585
+ self,
586
+ data: pd.DataFrame,
587
+ target_col: str,
588
+ decomposition: Dict
589
+ ) -> None:
590
+ """Visualise decomposition components"""
591
+ fig, axes = plt.subplots(2, 2, figsize=(14, 10))
592
+
593
+ # 1. Sum of components vs original series
594
+ if all(k in decomposition for k in ['trend', 'seasonal', 'residual']):
595
+ reconstructed = decomposition['trend'] + decomposition['seasonal'] + decomposition['residual']
596
+ axes[0, 0].plot(decomposition['observed'], alpha=0.7, label='Original')
597
+ axes[0, 0].plot(reconstructed, alpha=0.7, label='Reconstructed')
598
+ axes[0, 0].set_title('Original vs Reconstructed Series')
599
+ axes[0, 0].set_xlabel('Date')
600
+ axes[0, 0].set_ylabel(target_col)
601
+ axes[0, 0].legend()
602
+ axes[0, 0].grid(True, alpha=0.3)
603
+
604
+ # 2. Residuals distribution
605
+ if 'residual' in decomposition and decomposition['residual'] is not None:
606
+ residuals = decomposition['residual'].dropna()
607
+ axes[0, 1].hist(residuals, bins=30, edgecolor='black', alpha=0.7, density=True)
608
+
609
+ # Normal distribution for comparison
610
+ xmin, xmax = axes[0, 1].get_xlim()
611
+ x = np.linspace(xmin, xmax, 100)
612
+ p = stats.norm.pdf(x, residuals.mean(), residuals.std())
613
+ axes[0, 1].plot(x, p, 'k', linewidth=2, label='Normal distribution')
614
+
615
+ axes[0, 1].set_title('Residuals Distribution')
616
+ axes[0, 1].set_xlabel('Residuals')
617
+ axes[0, 1].set_ylabel('Density')
618
+ axes[0, 1].legend()
619
+ axes[0, 1].grid(True, alpha=0.3)
620
+
621
+ # 3. ACF of residuals
622
+ if 'residual' in decomposition and decomposition['residual'] is not None:
623
+ plot_acf(decomposition['residual'].dropna(), lags=50, ax=axes[1, 0], alpha=0.05)
624
+ axes[1, 0].set_title('Residuals ACF')
625
+ axes[1, 0].set_xlabel('Lag')
626
+ axes[1, 0].set_ylabel('Autocorrelation')
627
+ axes[1, 0].grid(True, alpha=0.3)
628
+
629
+ # 4. Seasonal pattern
630
+ if 'seasonal' in decomposition and decomposition['seasonal'] is not None:
631
+ seasonal = decomposition['seasonal']
632
+ if isinstance(seasonal.index, pd.DatetimeIndex):
633
+ # Group by months
634
+ try:
635
+ monthly_seasonal = seasonal.groupby(seasonal.index.month).mean()
636
+ axes[1, 1].bar(monthly_seasonal.index, monthly_seasonal.values)
637
+ axes[1, 1].set_title('Average Seasonal Pattern by Month')
638
+ axes[1, 1].set_xlabel('Month')
639
+ axes[1, 1].set_ylabel('Seasonality')
640
+ axes[1, 1].set_xticks(range(1, 13))
641
+ axes[1, 1].grid(True, alpha=0.3)
642
+ except Exception:
643
+ axes[1, 1].plot(seasonal.index, seasonal.values)
644
+ axes[1, 1].set_title('Seasonal Component')
645
+ axes[1, 1].grid(True, alpha=0.3)
646
+
647
+ plt.tight_layout()
648
+ plt.savefig(
649
+ f'{self.config.results_dir}/plots/decomposition_components_{target_col}.png',
650
+ dpi=300,
651
+ bbox_inches='tight'
652
+ )
653
+ plt.show()
654
+
655
+ def _plot_residuals_analysis(self, residuals, target_col: str) -> None:
656
+ """Visualise residual analysis"""
657
+ if residuals is None:
658
+ return
659
+
660
+ residuals_clean = residuals.dropna()
661
+
662
+ if len(residuals_clean) == 0:
663
+ return
664
+
665
+ fig, axes = plt.subplots(1, 2, figsize=(12, 4))
666
+
667
+ # Q-Q plot
668
+ stats.probplot(residuals_clean, dist="norm", plot=axes[0])
669
+ axes[0].set_title('Residuals Q-Q plot')
670
+ axes[0].grid(True, alpha=0.3)
671
+
672
+ # Residuals over time
673
+ axes[1].plot(residuals_clean.index, residuals_clean.values, linewidth=0.5)
674
+ axes[1].axhline(y=0, color='r', linestyle='-', alpha=0.3)
675
+ axes[1].set_title('Residuals Over Time')
676
+ axes[1].set_xlabel('Date')
677
+ axes[1].set_ylabel('Residuals')
678
+ axes[1].grid(True, alpha=0.3)
679
+
680
+ plt.tight_layout()
681
+ plt.savefig(
682
+ f'{self.config.results_dir}/plots/residuals_analysis_{target_col}.png',
683
+ dpi=300,
684
+ bbox_inches='tight'
685
+ )
686
+ plt.show()
687
+
688
+ def get_report(self) -> Dict:
689
+ """Get decomposition report"""
690
+ return self.decomposition_results
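A minimal sketch of running the decomposer against the synthetic loader above. It assumes `save_plots` is disabled or `{results_dir}/plots` already exists, since the plotting helpers save there without creating the directory.

```python
from config.config import Config
from data_loader.data_loader import DataLoader
from decomposition.decomposer import TimeSeriesDecomposer

config = Config()  # assumes target_column='raskhodvoda' and save_plots=False for this sketch
loader = DataLoader(config)
df = loader.create_synthetic_data(n_days=365 * 5)

decomposer = TimeSeriesDecomposer(config)
result = decomposer.decompose(df, target_col='raskhodvoda', method='stl', period=365)

stats = result.get('decomposition_stats', {})
print(f"trend strength:    {stats.get('trend_strength', 0.0):.2f}")
print(f"seasonal strength: {stats.get('seasonal_strength', 0.0):.2f}")
```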
feature_selection/__init__.py ADDED
File without changes
feature_selection/feature_selector.py ADDED
@@ -0,0 +1,478 @@
1
+ # ============================================
2
+ # CLASS 11: FEATURE SELECTION
3
+ # ============================================
4
+ from typing import Dict, List, Optional, Tuple
5
+ import logging
+ logger = logging.getLogger(__name__)
6
+ from config.config import Config
7
+
8
+ import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ from sklearn.ensemble import RandomForestRegressor
+ from sklearn.decomposition import PCA
+ from sklearn.preprocessing import StandardScaler
19
+
20
+ from sklearn.inspection import permutation_importance, partial_dependence
21
+ from sklearn.feature_selection import (
22
+ SelectKBest, SelectPercentile, RFE, RFECV, VarianceThreshold,
23
+ f_regression, mutual_info_regression
24
+ )
25
+
26
+ class FeatureSelector:
27
+ """Class for selecting the most important features"""
28
+
29
+ def __init__(self, config: Config):
30
+ """
31
+ Initialise feature selector
32
+
33
+ Parameters:
34
+ -----------
35
+ config : Config
36
+ Experiment configuration
37
+ """
38
+ self.config = config
39
+ self.selected_features = []
40
+ self.feature_importances = {}
41
+ self.selection_methods = {}
42
+ self.selector_objects = {}
43
+
44
+ def select(
45
+ self,
46
+ data: pd.DataFrame,
47
+ target_col: Optional[str] = None,
48
+ method: str = None,
49
+ n_features: int = None,
50
+ **kwargs
51
+ ) -> pd.DataFrame:
52
+ """
53
+ Select the most important features
54
+
55
+ Parameters:
56
+ -----------
57
+ data : pd.DataFrame
58
+ Input data
59
+ target_col : str, optional
60
+ Target variable. If None, uses configuration value.
61
+ method : str, optional
62
+ Selection method. If None, uses configuration value.
63
+ n_features : int, optional
64
+ Number of features to select. If None, uses configuration value.
65
+ **kwargs : dict
66
+ Additional parameters for method
67
+
68
+ Returns:
69
+ --------
70
+ pd.DataFrame
71
+ Data with selected features
72
+ """
73
+ logger.info("\n" + "="*80)
74
+ logger.info("FEATURE SELECTION")
75
+ logger.info("="*80)
76
+
77
+ target_col = target_col or self.config.target_column
78
+ method = method or self.config.feature_selection_method
+         n_features = n_features or self.config.max_features
+ 
+         if target_col not in data.columns:
+             logger.error(f"Target variable '{target_col}' not found")
+             return data
+ 
+         # Prepare data
+         X = data.drop(columns=[target_col]).select_dtypes(include=[np.number])
+         y = data[target_col]
+ 
+         # Remove missing values
+         mask = X.notna().all(axis=1) & y.notna()
+         X_clean = X[mask]
+         y_clean = y[mask]
+ 
+         if len(X_clean) < 10 or len(X_clean.columns) < 2:
+             logger.warning("Insufficient data for feature selection")
+             return data
+ 
+         logger.info(f"Selection method: {method}")
+         logger.info(f"Target number of features: {n_features}")
+         logger.info(f"Initial number of features: {len(X.columns)}")
+         logger.info(f"Data for selection: {len(X_clean)} records")
+ 
+         # Apply the selection method
+         selected_features_list = []
+         feature_importance_dict = {}
+ 
+         if method == 'correlation':
+             selected_features_list, feature_importance_dict = self._correlation_selection(
+                 X_clean, y_clean, n_features, **kwargs
+             )
+         elif method == 'mutual_info':
+             selected_features_list, feature_importance_dict = self._mutual_info_selection(
+                 X_clean, y_clean, n_features, **kwargs
+             )
+         elif method == 'rf':
+             selected_features_list, feature_importance_dict = self._random_forest_selection(
+                 X_clean, y_clean, n_features, **kwargs
+             )
+         elif method == 'pca':
+             selected_features_list, feature_importance_dict = self._pca_selection(
+                 X_clean, y_clean, n_features, **kwargs
+             )
+         elif method == 'rfe':
+             selected_features_list, feature_importance_dict = self._rfe_selection(
+                 X_clean, y_clean, n_features, **kwargs
+             )
+         elif method == 'lasso':
+             selected_features_list, feature_importance_dict = self._lasso_selection(
+                 X_clean, y_clean, n_features, **kwargs
+             )
+         elif method == 'hybrid':
+             selected_features_list, feature_importance_dict = self._hybrid_selection(
+                 X_clean, y_clean, n_features, **kwargs
+             )
+         else:
+             logger.warning(f"Method {method} not supported, using correlation")
+             selected_features_list, feature_importance_dict = self._correlation_selection(
+                 X_clean, y_clean, n_features, **kwargs
+             )
+ 
+         # Save the selected features
+         self.selected_features = selected_features_list
+         self.feature_importances = feature_importance_dict
+         self.selection_methods[method] = {
+             'selected_features': selected_features_list,
+             'n_features': len(selected_features_list),
+             'feature_importances': feature_importance_dict
+         }
+ 
+         # Form the final dataset
+         features_to_keep = selected_features_list + [target_col]
+         features_to_keep = [f for f in features_to_keep if f in data.columns]
+ 
+         data_selected = data[features_to_keep].copy()
+ 
+         logger.info(f"✓ Selected {len(selected_features_list)} features")
+         logger.info(f"  Total features kept: {len(data_selected.columns)}")
+ 
+         # Visualisation
+         if self.config.save_plots and selected_features_list:
+             self._plot_feature_selection(
+                 X_clean, y_clean, selected_features_list,
+                 feature_importance_dict, method
+             )
+ 
+         return data_selected
+ 
+     def _correlation_selection(
+         self,
+         X: pd.DataFrame,
+         y: pd.Series,
+         n_features: int,
+         **kwargs
+     ) -> Tuple[List[str], Dict]:
+         """Feature selection based on correlation"""
+         # Absolute correlation of each feature with the target, strongest first
+         correlations = X.corrwith(y).abs().sort_values(ascending=False)
+ 
+         # Select the top n_features
+         selected_features = correlations.head(n_features).index.tolist()
+         feature_importance = correlations.to_dict()
+ 
+         return selected_features, feature_importance
+ 
+     def _mutual_info_selection(
+         self,
+         X: pd.DataFrame,
+         y: pd.Series,
+         n_features: int,
+         **kwargs
+     ) -> Tuple[List[str], Dict]:
+         """Feature selection based on mutual information"""
+         try:
+             mi_scores = mutual_info_regression(X, y, random_state=kwargs.get('random_state', 42))
+             mi_series = pd.Series(mi_scores, index=X.columns)
+             mi_series = mi_series.sort_values(ascending=False)
+ 
+             selected_features = mi_series.head(n_features).index.tolist()
+             feature_importance = mi_series.to_dict()
+ 
+             return selected_features, feature_importance
+ 
+         except Exception as e:
+             logger.warning(f"Mutual information selection failed: {e}, using correlation")
+             return self._correlation_selection(X, y, n_features, **kwargs)
+ 
+     def _random_forest_selection(
+         self,
+         X: pd.DataFrame,
+         y: pd.Series,
+         n_features: int,
+         **kwargs
+     ) -> Tuple[List[str], Dict]:
+         """Feature selection based on Random Forest importances"""
+         try:
+             rf = RandomForestRegressor(
+                 n_estimators=kwargs.get('n_estimators', 100),
+                 max_depth=kwargs.get('max_depth', None),
+                 random_state=kwargs.get('random_state', 42),
+                 n_jobs=self.config.n_jobs if self.config.use_multiprocessing else None
+             )
+ 
+             rf.fit(X, y)
+             importances = pd.Series(rf.feature_importances_, index=X.columns)
+             importances = importances.sort_values(ascending=False)
+ 
+             selected_features = importances.head(n_features).index.tolist()
+             feature_importance = importances.to_dict()
+ 
+             self.selector_objects['random_forest'] = rf
+ 
+             return selected_features, feature_importance
+ 
+         except Exception as e:
+             logger.warning(f"Random Forest selection failed: {e}, using correlation")
+             return self._correlation_selection(X, y, n_features, **kwargs)
+ 
+     def _pca_selection(
+         self,
+         X: pd.DataFrame,
+         y: pd.Series,
+         n_features: int,
+         **kwargs
+     ) -> Tuple[List[str], Dict]:
+         """Feature selection based on PCA loadings"""
+         try:
+             # Standardise the data first
+             from sklearn.preprocessing import StandardScaler
+ 
+             scaler = StandardScaler()
+             X_scaled = scaler.fit_transform(X)
+ 
+             # Apply PCA
+             pca = PCA(n_components=min(n_features, len(X.columns)))
+             pca.fit(X_scaled)
+ 
+             # Score each feature by the summed absolute component loadings
+             importance = np.abs(pca.components_).sum(axis=0)
+             importance_series = pd.Series(importance, index=X.columns)
+             importance_series = importance_series.sort_values(ascending=False)
+ 
+             selected_features = importance_series.head(n_features).index.tolist()
+             feature_importance = importance_series.to_dict()
+ 
+             self.selector_objects['pca'] = pca
+             self.selector_objects['scaler'] = scaler
+ 
+             return selected_features, feature_importance
+ 
+         except Exception as e:
+             logger.warning(f"PCA selection failed: {e}, using correlation")
+             return self._correlation_selection(X, y, n_features, **kwargs)
+ 
+     def _rfe_selection(
+         self,
+         X: pd.DataFrame,
+         y: pd.Series,
+         n_features: int,
+         **kwargs
+     ) -> Tuple[List[str], Dict]:
+         """Recursive Feature Elimination"""
+         try:
+             from sklearn.feature_selection import RFE
+             from sklearn.linear_model import LinearRegression
+ 
+             estimator = LinearRegression()
+             rfe = RFE(
+                 estimator=estimator,
+                 n_features_to_select=n_features,
+                 step=kwargs.get('step', 1)
+             )
+ 
+             rfe.fit(X, y)
+             selected_features = X.columns[rfe.support_].tolist()
+ 
+             # Convert RFE rankings (1 = best) into importances
+             ranking = pd.Series(rfe.ranking_, index=X.columns)
+             feature_importance = (1 / ranking).to_dict()
+ 
+             self.selector_objects['rfe'] = rfe
+ 
+             return selected_features, feature_importance
+ 
+         except Exception as e:
+             logger.warning(f"RFE selection failed: {e}, using correlation")
+             return self._correlation_selection(X, y, n_features, **kwargs)
+ 
+     def _lasso_selection(
+         self,
+         X: pd.DataFrame,
+         y: pd.Series,
+         n_features: int,
+         **kwargs
+     ) -> Tuple[List[str], Dict]:
+         """Feature selection using Lasso"""
+         try:
+             from sklearn.linear_model import LassoCV
+ 
+             lasso = LassoCV(
+                 cv=kwargs.get('cv', 5),
+                 random_state=kwargs.get('random_state', 42),
+                 max_iter=kwargs.get('max_iter', 1000)
+             )
+ 
+             lasso.fit(X, y)
+ 
+             # Keep only features with non-zero coefficients
+             coefficients = pd.Series(lasso.coef_, index=X.columns)
+             non_zero_features = coefficients[coefficients != 0].abs().sort_values(ascending=False)
+ 
+             # Select the top n_features
+             selected_features = non_zero_features.head(n_features).index.tolist()
+             feature_importance = non_zero_features.to_dict()
+ 
+             self.selector_objects['lasso'] = lasso
+ 
+             return selected_features, feature_importance
+ 
+         except Exception as e:
+             logger.warning(f"Lasso selection failed: {e}, using correlation")
+             return self._correlation_selection(X, y, n_features, **kwargs)
+ 
+     def _hybrid_selection(
+         self,
+         X: pd.DataFrame,
+         y: pd.Series,
+         n_features: int,
+         **kwargs
+     ) -> Tuple[List[str], Dict]:
+         """Hybrid feature selection combining several base methods"""
+         methods = kwargs.get('methods', ['correlation', 'mutual_info', 'rf'])
+         weights = kwargs.get('weights', [0.3, 0.3, 0.4])
+ 
+         all_importances = {}
+ 
+         for method, weight in zip(methods, weights):
+             try:
+                 if method == 'correlation':
+                     _, importance = self._correlation_selection(X, y, n_features, **kwargs)
+                 elif method == 'mutual_info':
+                     _, importance = self._mutual_info_selection(X, y, n_features, **kwargs)
+                 elif method == 'rf':
+                     _, importance = self._random_forest_selection(X, y, n_features, **kwargs)
+                 else:
+                     continue
+ 
+                 # Min-max normalise the importances before weighting
+                 importance_series = pd.Series(importance)
+                 if importance_series.max() > importance_series.min():
+                     importance_normalized = (importance_series - importance_series.min()) / \
+                                             (importance_series.max() - importance_series.min())
+                 else:
+                     importance_normalized = pd.Series(1, index=importance_series.index)
+ 
+                 # Accumulate the weighted importances
+                 for feature in importance_normalized.index:
+                     if feature not in all_importances:
+                         all_importances[feature] = 0
+                     all_importances[feature] += importance_normalized[feature] * weight
+ 
+             except Exception as e:
+                 logger.debug(f"Method {method} failed in hybrid selection: {e}")
+ 
+         # Rank by combined importance
+         combined_importance = pd.Series(all_importances).sort_values(ascending=False)
+         selected_features = combined_importance.head(n_features).index.tolist()
+ 
+         return selected_features, combined_importance.to_dict()
+ 
+     def _plot_feature_selection(
+         self,
+         X: pd.DataFrame,
+         y: pd.Series,
+         selected_features: List[str],
+         feature_importance: Dict,
+         method: str
+     ) -> None:
+         """Visualise feature selection results"""
+         # Prepare data for plotting
+         importance_series = pd.Series(feature_importance).sort_values(ascending=False)
+ 
+         # Limit the number of features displayed
+         display_features = importance_series.head(20)
+ 
+         fig, axes = plt.subplots(2, 2, figsize=(14, 10))
+ 
+         # 1. Feature importance
+         y_pos = np.arange(len(display_features))
+         axes[0, 0].barh(y_pos, display_features.values)
+         axes[0, 0].set_yticks(y_pos)
+         axes[0, 0].set_yticklabels(display_features.index, fontsize=9)
+         axes[0, 0].invert_yaxis()
+         axes[0, 0].set_xlabel('Importance')
+         axes[0, 0].set_title(f'Top-{len(display_features)} features by importance ({method})')
+         axes[0, 0].grid(True, alpha=0.3, axis='x')
+ 
+         # 2. Cumulative importance
+         cumulative_importance = importance_series.cumsum() / importance_series.sum()
+         axes[0, 1].plot(range(1, len(cumulative_importance) + 1), cumulative_importance.values)
+         axes[0, 1].axhline(y=0.8, color='r', linestyle='--', alpha=0.7, label='80% importance')
+         axes[0, 1].axhline(y=0.9, color='orange', linestyle='--', alpha=0.7, label='90% importance')
+         axes[0, 1].set_xlabel('Number of features')
+         axes[0, 1].set_ylabel('Cumulative importance')
+         axes[0, 1].set_title('Cumulative feature importance')
+         axes[0, 1].legend()
+         axes[0, 1].grid(True, alpha=0.3)
+ 
+         # 3. Correlation matrix of the selected features
+         if len(selected_features) > 1:
+             selected_X = X[selected_features]
+             corr_matrix = selected_X.corr()
+ 
+             mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
+             sns.heatmap(
+                 corr_matrix,
+                 annot=True,
+                 fmt='.2f',
+                 cmap='coolwarm',
+                 center=0,
+                 square=True,
+                 mask=mask,
+                 cbar_kws={'shrink': 0.8},
+                 ax=axes[1, 0]
+             )
+             axes[1, 0].set_title(f'Correlation of selected features ({len(selected_features)})')
+ 
+         # 4. Importance distribution
+         axes[1, 1].hist(importance_series.values, bins=30, edgecolor='black', alpha=0.7)
+         axes[1, 1].set_xlabel('Feature importance')
+         axes[1, 1].set_ylabel('Frequency')
+         axes[1, 1].set_title('Feature importance distribution')
+         axes[1, 1].grid(True, alpha=0.3)
+ 
+         plt.suptitle(f'Feature selection results using {method} method', fontsize=14)
+         plt.tight_layout()
+         plt.savefig(
+             f'{self.config.results_dir}/plots/feature_selection_{method}.png',
+             dpi=300,
+             bbox_inches='tight'
+         )
+         plt.show()
+ 
+     def get_report(self) -> Dict:
+         """Get the feature selection report"""
+         return {
+             'selected_features': self.selected_features,
+             'feature_importances': self.feature_importances,
+             'selection_methods': self.selection_methods
+         }
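
The `hybrid` branch above is the only non-trivial combination step: each base method's importances are min-max normalised onto [0, 1] and then blended with user-supplied weights. A self-contained sketch of just that blending logic (the function name and the sample values are illustrative, not part of the repository):

```python
import pandas as pd

def combine_importances(per_method: dict, weights: dict) -> pd.Series:
    """Min-max normalise each method's importances, then blend with weights."""
    combined = pd.Series(dtype=float)
    for name, imp in per_method.items():
        span = imp.max() - imp.min()
        norm = (imp - imp.min()) / span if span > 0 else pd.Series(1.0, index=imp.index)
        combined = combined.add(norm * weights[name], fill_value=0.0)
    return combined.sort_values(ascending=False)

corr = pd.Series({"x0": 0.90, "x1": 0.20, "x2": 0.50})  # e.g. |corr| with target
mi = pd.Series({"x0": 0.70, "x1": 0.10, "x2": 0.60})    # e.g. mutual information
print(combine_importances({"corr": corr, "mi": mi}, {"corr": 0.5, "mi": 0.5}))
# x0 ranks first (1.0), x2 second (~0.63), x1 last (0.0)
```

Normalising before weighting matters because the raw scales differ: absolute correlations live in [0, 1], while Random Forest importances sum to 1 across all features.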
features/__init__.py ADDED
File without changes
features/feature_engineer.py ADDED
@@ -0,0 +1,638 @@
+ # ============================================
+ # CLASS 5: FEATURE ENGINEER
+ # ============================================
+ import logging
+ 
+ from typing import Dict, List, Optional
+ 
+ import numpy as np
+ import pandas as pd
+ 
+ from config.config import Config
+ 
+ # Module-level logger (the original `from venv import logger` was a stray
+ # auto-import and does not provide a usable logger)
+ logger = logging.getLogger(__name__)
+ 
+ 
+ class FeatureEngineer:
+     """Class for creating new features for time series"""
+ 
+     def __init__(self, config: Config):
+         """
+         Initialise the feature engineer
+ 
+         Parameters:
+         -----------
+         config : Config
+             Experiment configuration
+         """
+         self.config = config
+         self.created_features = []
+         self.feature_info = {}
+         self.feature_importances = {}
+         self.transforms_applied = {}
+ 
+     def create_all_features(
+         self,
+         data: pd.DataFrame,
+         target_col: Optional[str] = None
+     ) -> pd.DataFrame:
+         """
+         Create all types of features
+ 
+         Parameters:
+         -----------
+         data : pd.DataFrame
+             Input data
+         target_col : str, optional
+             Target variable. If None, the configured value is used.
+ 
+         Returns:
+         --------
+         pd.DataFrame
+             Data with all features
+         """
+         logger.info("\n" + "=" * 80)
+         logger.info("CREATING FEATURES FOR TIME SERIES")
+         logger.info("=" * 80)
+ 
+         target_col = target_col or self.config.target_column
+         initial_features = len(data.columns)
+         initial_rows = len(data)
+ 
+         # Check the index type before deciding which feature groups apply
+         index_is_datetime = isinstance(data.index, pd.DatetimeIndex)
+ 
+         logger.info(f"Initial number of features: {initial_features}")
+         logger.info(f"Initial number of rows: {initial_rows}")
+         logger.info(f"Index is DatetimeIndex: {index_is_datetime}")
+ 
+         # If the index is not a DatetimeIndex but a 'date' column exists, use it
+         if not index_is_datetime and 'date' in data.columns:
+             logger.info("Attempting to set DatetimeIndex from 'date' column")
+             try:
+                 data = data.set_index('date')
+                 if isinstance(data.index, pd.DatetimeIndex):
+                     index_is_datetime = True
+                     logger.info("✓ DatetimeIndex set from 'date' column")
+                 else:
+                     logger.warning("Failed to set DatetimeIndex")
+             except Exception as e:
+                 logger.warning(f"Error setting DatetimeIndex: {e}")
+ 
+         # Work on a copy so the caller's frame stays untouched
+         data_processed = data.copy()
+ 
+         # 1. Basic temporal features (only if a date index exists)
+         if index_is_datetime:
+             logger.info("\n1. BASIC TEMPORAL FEATURES")
+             data_processed = self.create_temporal_features(data_processed)
+         else:
+             logger.info("\n1. BASIC TEMPORAL FEATURES: skipped (no DatetimeIndex)")
+ 
+         # 2. Statistical features
+         logger.info("\n2. STATISTICAL FEATURES")
+         data_processed = self.create_statistical_features(data_processed, target_col)
+ 
+         # 3. Rolling features
+         logger.info("\n3. ROLLING FEATURES")
+         data_processed = self.create_rolling_features(data_processed, target_col)
+ 
+         # 4. Lag features (limited quantity)
+         logger.info("\n4. LAG FEATURES")
+         data_processed = self.create_lag_features(data_processed, target_col)
+ 
+         # 5. Interaction features
+         logger.info("\n5. INTERACTION FEATURES")
+         data_processed = self.create_interaction_features(data_processed, target_col)
+ 
+         # 6. Spectral features (only with sufficient data)
+         logger.info("\n6. SPECTRAL FEATURES")
+         if len(data_processed) > 100:
+             data_processed = self.create_spectral_features(data_processed, target_col)
+         else:
+             logger.info("   Skipped: insufficient data")
+ 
+         # 7. Decomposition features (needs enough data and a date index)
+         logger.info("\n7. DECOMPOSITION FEATURES")
+         if len(data_processed) > 365 and index_is_datetime:
+             data_processed = self.create_decomposition_features(data_processed, target_col)
+         else:
+             logger.info("   Skipped: insufficient data or no DatetimeIndex")
+ 
+         # Remove rows with NaN introduced by lags and differences
+         rows_before_nan = len(data_processed)
+         data_processed = data_processed.dropna()
+         removed_rows = rows_before_nan - len(data_processed)
+ 
+         # Remove constant features
+         constant_features = [
+             col for col in data_processed.columns
+             if data_processed[col].nunique() <= 1
+         ]
+ 
+         if constant_features:
+             logger.info(f"\nRemoving constant features: {len(constant_features)} found")
+             for feat in constant_features[:10]:
+                 logger.info(f"   - {feat}")
+             if len(constant_features) > 10:
+                 logger.info(f"   ... and {len(constant_features) - 10} more features")
+ 
+             data_processed = data_processed.drop(columns=constant_features)
+             # Keep the created-features list in sync
+             self.created_features = [f for f in self.created_features if f not in constant_features]
+ 
+         # Save summary information
+         self.feature_info = {
+             'initial_features': initial_features,
+             'final_features': len(data_processed.columns),
+             'features_created': len(self.created_features),
+             'initial_rows': initial_rows,
+             'final_rows': len(data_processed),
+             'removed_rows': removed_rows,
+             'constant_features_removed': len(constant_features),
+             'created_features_list': self.created_features,
+             'feature_categories': self.get_feature_categories()
+         }
+ 
+         logger.info("\nFeature creation summary:")
+         logger.info(f"   Initial number of features: {initial_features}")
+         logger.info(f"   Final number of features: {len(data_processed.columns)}")
+         logger.info(f"   New features created: {len(self.created_features)}")
+         logger.info(f"   Initial number of rows: {initial_rows}")
+         logger.info(f"   Final number of rows: {len(data_processed)}")
+         logger.info(f"   Rows removed due to NaN: {removed_rows}")
+         logger.info(f"   Constant features removed: {len(constant_features)}")
+ 
+         return data_processed
+ 
+     def create_temporal_features(self, data: pd.DataFrame) -> pd.DataFrame:
+         """
+         Create temporal features
+ 
+         Parameters:
+         -----------
+         data : pd.DataFrame
+             Input data
+ 
+         Returns:
+         --------
+         pd.DataFrame
+             Data with temporal features
+         """
+         data_processed = data.copy()
+ 
+         if not isinstance(data_processed.index, pd.DatetimeIndex):
+             logger.warning("Temporal features not created: index is not a DatetimeIndex")
+             return data_processed
+ 
+         try:
+             # Basic calendar features
+             data_processed['year'] = data_processed.index.year
+             data_processed['month'] = data_processed.index.month
+             data_processed['day'] = data_processed.index.day
+             data_processed['dayofyear'] = data_processed.index.dayofyear
+             data_processed['dayofweek'] = data_processed.index.dayofweek
+             data_processed['weekofyear'] = data_processed.index.isocalendar().week.astype(int)
+             data_processed['quarter'] = data_processed.index.quarter
+             data_processed['is_weekend'] = data_processed['dayofweek'].isin([5, 6]).astype(int)
+ 
+             # Cyclic encodings for seasonality
+             data_processed['month_sin'] = np.sin(2 * np.pi * data_processed['month'] / 12)
+             data_processed['month_cos'] = np.cos(2 * np.pi * data_processed['month'] / 12)
+             data_processed['dayofyear_sin'] = np.sin(2 * np.pi * data_processed['dayofyear'] / 365.25)
+             data_processed['dayofyear_cos'] = np.cos(2 * np.pi * data_processed['dayofyear'] / 365.25)
+             data_processed['dayofweek_sin'] = np.sin(2 * np.pi * data_processed['dayofweek'] / 7)
+             data_processed['dayofweek_cos'] = np.cos(2 * np.pi * data_processed['dayofweek'] / 7)
+ 
+             # Days elapsed since the start of the series (relative feature)
+             min_date = data_processed.index.min()
+             data_processed['days_from_start'] = (data_processed.index - min_date).days
+ 
+             # Register the created features
+             temporal_features = ['year', 'month', 'day', 'dayofyear', 'dayofweek',
+                                  'weekofyear', 'quarter', 'is_weekend', 'month_sin',
+                                  'month_cos', 'dayofyear_sin', 'dayofyear_cos',
+                                  'dayofweek_sin', 'dayofweek_cos', 'days_from_start']
+ 
+             self.created_features.extend([f for f in temporal_features if f not in self.created_features])
+ 
+             logger.info(f"✓ Created {len(temporal_features)} temporal features")
+ 
+         except Exception as e:
+             logger.warning(f"Error creating temporal features: {e}")
+ 
+         return data_processed
+ 
+     def create_statistical_features(
+         self,
+         data: pd.DataFrame,
+         target_col: str
+     ) -> pd.DataFrame:
+         """
+         Create statistical features
+ 
+         Parameters:
+         -----------
+         data : pd.DataFrame
+             Input data
+         target_col : str
+             Target variable
+ 
+         Returns:
+         --------
+         pd.DataFrame
+             Data with statistical features
+         """
+         data_processed = data.copy()
+ 
+         if target_col not in data_processed.columns:
+             logger.warning(f"Target variable '{target_col}' not found")
+             return data_processed
+ 
+         # Yearly statistics, only if a 'year' feature is available
+         if 'year' in data_processed.columns:
+             try:
+                 yearly_stats = data_processed.groupby('year')[target_col].agg([
+                     'mean', 'std', 'min', 'max', 'median'
+                 ])
+                 yearly_stats.columns = [f'{target_col}_yearly_{col}' for col in yearly_stats.columns]
+                 # join on 'year' keeps the original index (merge would reset it)
+                 data_processed = data_processed.join(yearly_stats, on='year')
+ 
+                 # Add the created features to the list
+                 for col in yearly_stats.columns:
+                     self.created_features.append(col)
+             except Exception as e:
+                 logger.debug(f"Yearly statistics not created: {e}")
+ 
+         # Normalised feature (only if there is variation)
+         std_val = data_processed[target_col].std()
+         if std_val > 0:
+             data_processed[f'{target_col}_zscore'] = (
+                 data_processed[target_col] - data_processed[target_col].mean()
+             ) / std_val
+             self.created_features.append(f'{target_col}_zscore')
+ 
+         # Percentile-based binary features
+         try:
+             for p in [0.25, 0.5, 0.75]:
+                 quantile_val = data_processed[target_col].quantile(p)
+                 data_processed[f'{target_col}_above_p{int(p * 100)}'] = (
+                     data_processed[target_col] > quantile_val
+                 ).astype(int)
+                 self.created_features.append(f'{target_col}_above_p{int(p * 100)}')
+         except Exception as e:
+             logger.debug(f"Quantile features not created: {e}")
+ 
+         logger.info(f"✓ Statistical features created: {len([c for c in data_processed.columns if c not in data.columns])}")
+         return data_processed
+ 
+     def create_rolling_features(
+         self,
+         data: pd.DataFrame,
+         target_col: str
+     ) -> pd.DataFrame:
+         """
+         Create rolling statistics
+ 
+         Parameters:
+         -----------
+         data : pd.DataFrame
+             Input data
+         target_col : str
+             Target variable
+ 
+         Returns:
+         --------
+         pd.DataFrame
+             Data with rolling features
+         """
+         data_processed = data.copy()
+ 
+         if target_col not in data_processed.columns:
+             logger.warning(f"Target variable '{target_col}' not found")
+             return data_processed
+ 
+         # Use only windows from the configuration that fit the series length
+         windows = [w for w in self.config.rolling_windows if w < len(data_processed) // 2]
+ 
+         for window in windows:
+             try:
+                 # Basic statistics over a centred window
+                 data_processed[f'{target_col}_rolling_mean_{window}'] = data_processed[target_col].rolling(
+                     window=window, min_periods=max(1, window // 4), center=True
+                 ).mean()
+ 
+                 data_processed[f'{target_col}_rolling_std_{window}'] = data_processed[target_col].rolling(
+                     window=window, min_periods=max(1, window // 4), center=True
+                 ).std()
+ 
+                 data_processed[f'{target_col}_rolling_min_{window}'] = data_processed[target_col].rolling(
+                     window=window, min_periods=max(1, window // 4), center=True
+                 ).min()
+ 
+                 data_processed[f'{target_col}_rolling_max_{window}'] = data_processed[target_col].rolling(
+                     window=window, min_periods=max(1, window // 4), center=True
+                 ).max()
+ 
+                 self.created_features.extend([
+                     f'{target_col}_rolling_mean_{window}',
+                     f'{target_col}_rolling_std_{window}',
+                     f'{target_col}_rolling_min_{window}',
+                     f'{target_col}_rolling_max_{window}'
+                 ])
+             except Exception as e:
+                 logger.debug(f"Rolling features for window {window} not created: {e}")
+                 continue
+ 
+         logger.info(f"✓ Rolling features created: {len([c for c in data_processed.columns if 'rolling' in c and c not in data.columns])}")
+         return data_processed
+ 
+     def create_lag_features(
+         self,
+         data: pd.DataFrame,
+         target_col: str
+     ) -> pd.DataFrame:
+         """
+         Create lag features
+ 
+         Parameters:
+         -----------
+         data : pd.DataFrame
+             Input data
+         target_col : str
+             Target variable
+ 
+         Returns:
+         --------
+         pd.DataFrame
+             Data with lag features
+         """
+         data_processed = data.copy()
+ 
+         if target_col not in data_processed.columns:
+             logger.warning(f"Target variable '{target_col}' not found")
+             return data_processed
+ 
+         # Limited number of lags (capped at 7); larger candidate lags are skipped
+         max_lags = min(self.config.max_lags, 7)
+ 
+         for lag in [1, 2, 3, 7, 14, 30]:
+             if lag <= max_lags:
+                 data_processed[f'{target_col}_lag_{lag}'] = data_processed[target_col].shift(lag)
+                 self.created_features.append(f'{target_col}_lag_{lag}')
+ 
+         # Seasonal lag (only with sufficient data)
+         if len(data_processed) > 365:
+             try:
+                 data_processed[f'{target_col}_seasonal_lag_365'] = data_processed[target_col].shift(365)
+                 self.created_features.append(f'{target_col}_seasonal_lag_365')
+             except Exception as e:
+                 logger.debug(f"Seasonal lag not created: {e}")
+ 
+         # Differences (towards stationarity)
+         data_processed[f'{target_col}_diff_1'] = data_processed[target_col].diff(1)
+         self.created_features.append(f'{target_col}_diff_1')
+ 
+         if len(data_processed) > 7:
+             data_processed[f'{target_col}_diff_7'] = data_processed[target_col].diff(7)
+             self.created_features.append(f'{target_col}_diff_7')
+ 
+         logger.info(f"✓ Lag features created: {len([c for c in data_processed.columns if ('lag' in c or 'diff' in c) and c not in data.columns])}")
+         return data_processed
+ 
+     def create_interaction_features(
+         self,
+         data: pd.DataFrame,
+         target_col: str
+     ) -> pd.DataFrame:
+         """
+         Create interaction features
+ 
+         Parameters:
+         -----------
+         data : pd.DataFrame
+             Input data
+         target_col : str
+             Target variable
+ 
+         Returns:
+         --------
+         pd.DataFrame
+             Data with interaction features
+         """
+         data_processed = data.copy()
+ 
+         if target_col not in data_processed.columns:
+             logger.warning(f"Target variable '{target_col}' not found")
+             return data_processed
+ 
+         # Interactions with temperature columns, where available
+         temp_cols = ['tavg', 'tmin', 'tmax']
+         available_temp_cols = [col for col in temp_cols if col in data_processed.columns]
+ 
+         for temp_col in available_temp_cols:
+             try:
+                 # Guard against division by zero (zeros become NaN)
+                 temp_data = data_processed[temp_col].replace(0, np.nan)
+                 if temp_data.notna().all():
+                     data_processed[f'{target_col}_{temp_col}_ratio'] = data_processed[target_col] / temp_data
+                     self.created_features.append(f'{target_col}_{temp_col}_ratio')
+ 
+                     # Product
+                     data_processed[f'{target_col}_{temp_col}_product'] = data_processed[target_col] * temp_data
+                     self.created_features.append(f'{target_col}_{temp_col}_product')
+             except Exception as e:
+                 logger.debug(f"Interaction feature with {temp_col} not created: {e}")
+ 
+         # Interaction with the water level, if present
+         if 'urovenvoda' in data_processed.columns:
+             try:
+                 uroven_data = data_processed['urovenvoda'].replace(0, np.nan)
+                 if uroven_data.notna().all():
+                     data_processed[f'{target_col}_urovenvoda_ratio'] = data_processed[target_col] / uroven_data
+                     self.created_features.append(f'{target_col}_urovenvoda_ratio')
+             except Exception as e:
+                 logger.debug(f"Interaction feature with urovenvoda not created: {e}")
+ 
+         logger.info(f"✓ Interaction features created: {len([c for c in data_processed.columns if ('ratio' in c or 'product' in c) and c not in data.columns])}")
+         return data_processed
+ 
+     def create_spectral_features(
+         self,
+         data: pd.DataFrame,
+         target_col: str
+     ) -> pd.DataFrame:
+         """
+         Create spectral features
+ 
+         Parameters:
+         -----------
+         data : pd.DataFrame
+             Input data
+         target_col : str
+             Target variable
+ 
+         Returns:
+         --------
+         pd.DataFrame
+             Data with spectral features
+         """
+         data_processed = data.copy()
+ 
+         if target_col not in data_processed.columns:
+             logger.warning(f"Target variable '{target_col}' not found")
+             return data_processed
+ 
+         if len(data_processed) < 100:
+             logger.info("Insufficient data for creating spectral features")
+             return data_processed
+ 
+         try:
+             # Estimate the power spectrum of the target series
+             series = data_processed[target_col].dropna().values
+ 
+             if len(series) > 50:
+                 from scipy.signal import periodogram
+                 freqs, psd = periodogram(series, fs=1.0)
+ 
+                 # Find the dominant frequencies
+                 if len(psd) > 3:
+                     # Top-3 frequencies by spectral power
+                     top_indices = np.argsort(psd)[-3:][::-1]
+ 
+                     for i, idx in enumerate(top_indices, 1):
+                         if idx < len(freqs):
+                             freq = freqs[idx]
+                             if freq > 0:
+                                 period = 1 / freq
+                                 # Constant per column; note that such columns are later
+                                 # dropped by the constant-feature filter in create_all_features
+                                 data_processed[f'{target_col}_dominant_period_{i}'] = period
+                                 self.created_features.append(f'{target_col}_dominant_period_{i}')
+ 
+         except Exception as e:
+             logger.debug(f"Spectral features creation failed: {e}")
+ 
+         return data_processed
+ 
+     def create_decomposition_features(
+         self,
+         data: pd.DataFrame,
+         target_col: str
+     ) -> pd.DataFrame:
+         """
+         Create features based on seasonal decomposition
+ 
+         Parameters:
+         -----------
+         data : pd.DataFrame
+             Input data
+         target_col : str
+             Target variable
+ 
+         Returns:
+         --------
+         pd.DataFrame
+             Data with decomposition features
+         """
+         data_processed = data.copy()
+ 
+         if target_col not in data_processed.columns:
+             logger.warning(f"Target variable '{target_col}' not found")
+             return data_processed
+ 
+         if len(data_processed) < 365:
+             logger.info("Insufficient data for decomposition")
+             return data_processed
+ 
+         try:
+             if isinstance(data_processed.index, pd.DatetimeIndex):
+                 # STL decomposition needs at least two yearly cycles
+                 if len(data_processed) > 730:
+                     try:
+                         from statsmodels.tsa.seasonal import STL
+ 
+                         stl = STL(
+                             data_processed[target_col].ffill(),
+                             period=365,
+                             robust=True
+                         )
+                         result = stl.fit()
+ 
+                         # Add the components
+                         data_processed[f'{target_col}_trend'] = result.trend
+                         data_processed[f'{target_col}_seasonal'] = result.seasonal
+                         data_processed[f'{target_col}_residual'] = result.resid
+ 
+                         self.created_features.extend([
+                             f'{target_col}_trend',
+                             f'{target_col}_seasonal',
+                             f'{target_col}_residual'
+                         ])
+ 
+                         logger.info("✓ STL decomposition successful")
+ 
+                     except Exception as e:
+                         logger.debug(f"STL decomposition failed: {e}")
+                         # Fall back to classical seasonal decomposition
+                         try:
+                             from statsmodels.tsa.seasonal import seasonal_decompose
+ 
+                             decomposition = seasonal_decompose(
+                                 data_processed[target_col].ffill(),
+                                 model='additive',
+                                 period=365,
+                                 extrapolate_trend='freq'
+                             )
+ 
+                             data_processed[f'{target_col}_trend'] = decomposition.trend
+                             data_processed[f'{target_col}_seasonal'] = decomposition.seasonal
+ 
+                             self.created_features.extend([
+                                 f'{target_col}_trend',
+                                 f'{target_col}_seasonal'
+                             ])
+ 
+                             logger.info("✓ Seasonal decomposition successful")
+                         except Exception as e2:
+                             logger.debug(f"Seasonal decomposition failed: {e2}")
+ 
+         except Exception as e:
+             logger.debug(f"Decomposition features creation failed: {e}")
+ 
+         return data_processed
+ 
+     def get_feature_categories(self) -> Dict[str, List[str]]:
+         """Group the created features by category"""
+         categories = {
+             'temporal': [],
+             'statistical': [],
+             'rolling': [],
+             'lag': [],
+             'interaction': [],
+             'spectral': [],
+             'decomposition': [],
+             'binary': []
+         }
+ 
+         for feature in self.created_features:
+             # Check statistical markers before generic calendar keywords,
+             # otherwise 'yearly_' features would match 'year' and be misfiled
+             if any(keyword in feature for keyword in ['zscore', 'above_p', 'yearly_']):
+                 if 'above_p' in feature:
+                     categories['binary'].append(feature)
+                 else:
+                     categories['statistical'].append(feature)
+             elif any(keyword in feature for keyword in ['year', 'month', 'day', 'week', 'quarter', 'sin', 'cos', 'is_weekend']):
+                 categories['temporal'].append(feature)
+             elif 'rolling' in feature:
+                 categories['rolling'].append(feature)
+             elif any(keyword in feature for keyword in ['lag', 'diff']):
+                 categories['lag'].append(feature)
+             elif 'ratio' in feature or 'product' in feature:
+                 categories['interaction'].append(feature)
+             elif 'dominant' in feature:
+                 categories['spectral'].append(feature)
+             elif any(keyword in feature for keyword in ['trend', 'seasonal', 'residual']):
+                 categories['decomposition'].append(feature)
+ 
+         # Drop empty categories
+         return {k: v for k, v in categories.items() if v}
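
A minimal usage sketch for `FeatureEngineer` on synthetic data. It assumes the project's `Config` can be instantiated with defaults and exposes `target_column`, `rolling_windows`, and `max_lags`, since the class reads those attributes; treat this as an illustration rather than the repository's own entry point:

```python
import numpy as np
import pandas as pd

from config.config import Config
from features.feature_engineer import FeatureEngineer

# Synthetic daily series: trend + seasonality + noise (illustrative data only)
idx = pd.date_range("2020-01-01", periods=400, freq="D")
rng = np.random.default_rng(0)
values = np.linspace(0, 5, 400) + np.sin(np.arange(400) / 30) + rng.normal(0, 0.1, 400)
df = pd.DataFrame({"level": values}, index=idx)

engineer = FeatureEngineer(Config())  # assumes Config() provides usable defaults
enriched = engineer.create_all_features(df, target_col="level")
print(enriched.shape, "->", len(engineer.created_features), "new features")
```

Note that rows at the start of the series are dropped by the final `dropna()`, because lag and difference features have no history to draw on there.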
missing_values/__init__.py ADDED
File without changes
missing_values/missing_analyzer.py ADDED
@@ -0,0 +1,700 @@
1
+ # ============================================
2
+ # CLASS 3: MISSING VALUE ANALYSER
3
+ # ============================================
4
+ from typing import Dict, Tuple
5
+ from venv import logger
6
+
7
+ from config.config import Config
8
+ from scipy.interpolate import interp1d
9
+ from statsmodels.tsa.seasonal import seasonal_decompose, STL
10
+ try:
11
+ import pandas as pd
12
+ import numpy as np
13
+ import matplotlib.pyplot as plt
14
+ print("✅ All imports working!")
15
+ except ImportError as e:
16
+ print(f"❌ Import error: {e}")
17
+
18
+ class MissingValueAnalyser:
19
+ """Class for analysing and handling missing values"""
20
+
21
+ def __init__(self, config: Config):
22
+ """
23
+ Initialise missing value analyser
24
+
25
+ Parameters:
26
+ -----------
27
+ config : Config
28
+ Experiment configuration
29
+ """
30
+ self.config = config
31
+ self.missing_info = {}
32
+ self.handling_methods = {}
33
+ self.imputers = {}
34
+ self.missing_patterns = {}
35
+
36
+ def analyse(
37
+ self,
38
+ data: pd.DataFrame,
39
+ detailed: bool = True
40
+ ) -> Dict:
41
+ """
42
+ Analyse missing values in data
43
+
44
+ Parameters:
45
+ -----------
46
+ data : pd.DataFrame
47
+ Input data
48
+ detailed : bool
49
+ Whether to perform detailed analysis
50
+
51
+ Returns:
52
+ --------
53
+ Dict
54
+ Information about missing values
55
+ """
56
+ logger.info("\n" + "="*80)
57
+ logger.info("MISSING VALUE ANALYSIS")
58
+ logger.info("="*80)
59
+
60
+ # Calculate missing values
61
+ missing_total = data.isnull().sum()
62
+ missing_percent = (missing_total / len(data)) * 100
63
+
64
+ missing_df = pd.DataFrame({
65
+ 'missing_count': missing_total,
66
+ 'missing_percent': missing_percent,
67
+ 'dtype': data.dtypes.astype(str)
68
+ })
69
+
70
+ # Detailed analysis
71
+ if detailed:
72
+ self._detailed_missing_analysis(data, missing_df)
73
+
74
+ # Save information
75
+ self.missing_info = {
76
+ 'summary': {
77
+ col: {
78
+ 'missing_count': int(missing_df.loc[col, 'missing_count']),
79
+ 'missing_percent': float(missing_df.loc[col, 'missing_percent']),
80
+ 'dtype': missing_df.loc[col, 'dtype']
81
+ }
82
+ for col in missing_df.index
83
+ },
84
+ 'overall': {
85
+ 'total_missing': int(missing_total.sum()),
86
+ 'total_rows': int(len(data)),
87
+ 'total_cells': int(data.size),
88
+ 'overall_missing_percentage': float(missing_total.sum() / data.size * 100),
89
+ 'rows_with_any_missing': int(data.isnull().any(axis=1).sum()),
90
+ 'rows_all_missing': int(data.isnull().all(axis=1).sum()),
91
+ 'columns_with_missing': missing_df[missing_df['missing_count'] > 0].index.tolist(),
92
+ 'columns_all_missing': missing_df[missing_df['missing_count'] == len(data)].index.tolist()
93
+ }
94
+ }
95
+
96
+ # Visualisation
97
+ if self.config.save_plots:
98
+ self._plot_missing_values(data, missing_df)
99
+
100
+ # Output results
101
+ self._log_missing_summary(missing_df)
102
+
103
+ return self.missing_info
104
+
105
+ def _detailed_missing_analysis(
106
+ self,
107
+ data: pd.DataFrame,
108
+ missing_df: pd.DataFrame
109
+ ) -> None:
110
+ """Detailed missing value analysis"""
111
+ # Analyse missing patterns
112
+ missing_matrix = data.isnull()
113
+
114
+ # Row missing patterns
115
+ row_patterns = missing_matrix.apply(lambda x: ''.join(x.astype(int).astype(str)), axis=1)
116
+ row_pattern_counts = row_patterns.value_counts().head(10)
117
+
118
+ # Column missing patterns
119
+ col_patterns = missing_matrix.apply(lambda x: ''.join(x.astype(int).astype(str)), axis=0)
120
+ col_pattern_counts = col_patterns.value_counts().head(10)
121
+
122
+ # Time-based missing patterns analysis
123
+ time_patterns = {}
124
+ if isinstance(data.index, pd.DatetimeIndex):
125
+ # Missing values by time
126
+ time_missing = data.isnull().resample('M').sum()
127
+ time_patterns['monthly_missing'] = time_missing.sum(axis=1).to_dict()
128
+
129
+ # Missing values by day of week
130
+ data_with_dow = data.copy()
131
+ data_with_dow['dayofweek'] = data.index.dayofweek
132
+ dow_missing = data_with_dow.groupby('dayofweek').apply(lambda x: x.isnull().sum().sum())
133
+ time_patterns['dayofweek_missing'] = dow_missing.to_dict()
134
+
135
+ self.missing_patterns = {
136
+ 'row_patterns': row_pattern_counts.to_dict(),
137
+ 'col_patterns': col_pattern_counts.to_dict(),
138
+ 'time_patterns': time_patterns,
139
+ 'missing_correlation': missing_matrix.corr().to_dict() # Missing value correlation
140
+ }
141
+
142
+ logger.debug(f"Found {len(row_pattern_counts)} unique row missing patterns")
143
+ logger.debug(f"Found {len(col_pattern_counts)} unique column missing patterns")
144
+
145
+ def _plot_missing_values(
146
+ self,
147
+ data: pd.DataFrame,
148
+ missing_df: pd.DataFrame
149
+ ) -> None:
150
+ """Visualise missing values"""
151
+ fig, axes = plt.subplots(3, 2, figsize=(16, 12))
152
+
153
+ # 1. Missing percentage histogram
154
+ axes[0, 0].barh(
155
+ missing_df.index,
156
+ missing_df['missing_percent']
157
+ )
158
+ axes[0, 0].axvline(self.config.missing_threshold, color='red', linestyle='--')
159
+ axes[0, 0].set_title('Missing Percentage by Column')
160
+ axes[0, 0].set_xlabel('Missing Percentage (%)')
161
+ axes[0, 0].set_ylabel('Columns')
162
+ axes[0, 0].grid(True, alpha=0.3)
163
+
164
+ # 2. Missing values heatmap
165
+ missing_matrix = data.isnull()
166
+ axes[0, 1].imshow(
167
+ missing_matrix.T if len(data) > 1000 else missing_matrix.T[:1000],
168
+ aspect='auto',
169
+ cmap='binary',
170
+ interpolation='none'
171
+ )
172
+ axes[0, 1].set_title('Missing Values Matrix')
173
+ axes[0, 1].set_xlabel('Observation Index')
174
+ axes[0, 1].set_ylabel('Variables')
175
+ axes[0, 1].set_yticks(range(len(data.columns)))
176
+ axes[0, 1].set_yticklabels(data.columns, fontsize=8)
177
+
178
+ # 3. Missing values over time (if time series)
179
+ if isinstance(data.index, pd.DatetimeIndex):
180
+ time_missing = data.isnull().resample('M').sum()
181
+
182
+ axes[1, 0].plot(time_missing.sum(axis=1))
183
+ axes[1, 0].set_title('Missing Values by Month')
184
+ axes[1, 0].set_xlabel('Date')
185
+ axes[1, 0].set_ylabel('Number of Missing Values')
186
+ axes[1, 0].grid(True, alpha=0.3)
187
+
188
+ # 4. Missing values by day of week
189
+ data_with_dow = data.copy()
190
+ data_with_dow['dayofweek'] = data.index.dayofweek
191
+ dow_missing = data_with_dow.groupby('dayofweek').apply(lambda x: x.isnull().sum().sum())
192
+ dow_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
193
+
194
+ axes[1, 1].bar(range(7), dow_missing)
195
+ axes[1, 1].set_title('Missing Values by Day of Week')
196
+ axes[1, 1].set_xlabel('Day of Week')
197
+ axes[1, 1].set_ylabel('Number of Missing Values')
198
+ axes[1, 1].set_xticks(range(7))
199
+ axes[1, 1].set_xticklabels(dow_names)
200
+ axes[1, 1].grid(True, alpha=0.3)
201
+
202
+ # 5. Missing value correlation
203
+ missing_corr = data.isnull().corr()
204
+ im = axes[2, 0].imshow(
205
+ missing_corr,
206
+ cmap='coolwarm',
207
+ vmin=-1,
208
+ vmax=1,
209
+ aspect='auto'
210
+ )
211
+ axes[2, 0].set_title('Missing Value Correlation Between Variables')
212
+ axes[2, 0].set_xlabel('Variables')
213
+ axes[2, 0].set_ylabel('Variables')
214
+ plt.colorbar(im, ax=axes[2, 0])
215
+
216
+ # 6. Cumulative missing sum
217
+ cumulative_missing = data.isnull().cumsum()
218
+ for col in data.columns[:5]: # First 5 columns
219
+ if data[col].isnull().any():
220
+ axes[2, 1].plot(
221
+ cumulative_missing.index,
222
+ cumulative_missing[col],
223
+ label=col[:20]
224
+ )
225
+ axes[2, 1].set_title('Cumulative Missing Values')
226
+ axes[2, 1].set_xlabel('Time/Index')
227
+ axes[2, 1].set_ylabel('Cumulative Missing')
228
+ axes[2, 1].legend(fontsize=8)
229
+ axes[2, 1].grid(True, alpha=0.3)
230
+
231
+ plt.tight_layout()
232
+ plt.savefig(
233
+ f'{self.config.results_dir}/plots/missing_values_analysis.png',
234
+ dpi=300,
235
+ bbox_inches='tight'
236
+ )
237
+ plt.show()
238
+
239
+ def _log_missing_summary(self, missing_df: pd.DataFrame) -> None:
240
+ """Log missing value summary"""
241
+ missing_columns = missing_df[missing_df['missing_count'] > 0]
242
+
243
+ if len(missing_columns) > 0:
244
+ logger.info("MISSING VALUES FOUND:")
245
+ logger.info("-" * 50)
246
+ logger.info(f"Total missing values: {self.missing_info['overall']['total_missing']}")
247
+ logger.info(f"Overall missing percentage: {self.missing_info['overall']['overall_missing_percentage']:.2f}%")
248
+ logger.info(f"Rows with missing values: {self.missing_info['overall']['rows_with_any_missing']}")
249
+ logger.info(f"Columns with missing values: {len(self.missing_info['overall']['columns_with_missing'])}")
250
+
251
+ logger.info("\nTop-10 columns by missing values:")
252
+ top_missing = missing_df.nlargest(10, 'missing_percent')
253
+ for idx, (col, row) in enumerate(top_missing.iterrows(), 1):
254
+ logger.info(f" {idx:2d}. {col}: {int(row['missing_count'])} missing ({row['missing_percent']:.2f}%)")
255
+ else:
256
+ logger.info("✓ No missing values found")
257
+
258
+ def handle(
259
+ self,
260
+ data: pd.DataFrame,
261
+ method: str = 'interpolate',
262
+ strategy: str = 'columnwise',
263
+ **kwargs
264
+ ) -> pd.DataFrame:
265
+ """
266
+ Handle missing values
267
+
268
+ Parameters:
269
+ -----------
270
+ data : pd.DataFrame
271
+ Input data
272
+ method : str
273
+ Handling method: 'interpolate', 'ffill', 'bfill', 'mean', 'median', 'mode', 'knn', 'regression'
274
+ strategy : str
275
+ Strategy: 'columnwise', 'rowwise', 'global'
276
+ **kwargs : dict
277
+ Additional parameters for method
278
+
279
+ Returns:
280
+ --------
281
+ pd.DataFrame
282
+ Data with handled missing values
283
+ """
284
+ logger.info("\n" + "="*80)
285
+ logger.info("HANDLING MISSING VALUES")
286
+ logger.info("="*80)
287
+
288
+ data_processed = data.copy()
289
+ methods_applied = {}
290
+
291
+ # Determine columns to process
292
+ if strategy == 'columnwise':
293
+ columns_to_process = data_processed.columns
294
+ elif strategy == 'rowwise':
295
+ # Row-wise handling (for time series)
296
+ data_processed = self._handle_rowwise(data_processed, method, **kwargs)
297
+ return data_processed
298
+ else:
299
+ columns_to_process = data_processed.select_dtypes(include=[np.number]).columns
300
+
301
+ # Process each column
302
+ for col in columns_to_process:
303
+ missing_before = data_processed[col].isnull().sum()
304
+
305
+ if missing_before > 0:
306
+ # Check if missing percentage exceeds threshold
307
+ missing_percent = (missing_before / len(data_processed)) * 100
308
+
309
+ if missing_percent > self.config.missing_threshold:
310
+ logger.warning(f" {col}: {missing_before} missing ({missing_percent:.1f}%) > threshold {self.config.missing_threshold}%")
311
+
312
+ if kwargs.get('drop_high_missing', False):
313
+ data_processed = data_processed.drop(columns=[col])
314
+ method_used = f"dropped (>{self.config.missing_threshold}% missing)"
315
+ missing_after = 0
316
+ else:
317
+ # Use selected method
318
+ data_processed[col], method_used = self._apply_imputation_method(
319
+ data_processed[col], method, **kwargs
320
+ )
321
+ missing_after = data_processed[col].isnull().sum()
322
+ else:
323
+ # Use selected method
324
+ data_processed[col], method_used = self._apply_imputation_method(
325
+ data_processed[col], method, **kwargs
326
+ )
327
+ missing_after = data_processed[col].isnull().sum()
328
+
329
+ methods_applied[col] = {
330
+ 'method': method_used,
331
+ 'missing_before': int(missing_before),
332
+ 'missing_after': int(missing_after),
333
+ 'missing_percent_before': float(missing_percent)
334
+ }
335
+
336
+ if missing_before > 0:
337
+ logger.info(f" {col}: {missing_before} → {missing_after} missing ({method_used})")
338
+
339
+ self.handling_methods = methods_applied
340
+
341
+ # Check that all missing values are handled
342
+ remaining_missing = data_processed.isnull().sum().sum()
343
+ if remaining_missing == 0:
344
+ logger.info("✓ All missing values successfully handled")
345
+ else:
346
+ logger.warning(f"⚠ {remaining_missing} missing values remain")
347
+ # Additional handling of remaining missing values
348
+ data_processed = data_processed.fillna(method='ffill').fillna(method='bfill')
349
+ remaining_after = data_processed.isnull().sum().sum()
350
+ if remaining_after == 0:
351
+ logger.info("✓ Remaining missing values handled with ffill/bfill combination")
352
+
353
+ return data_processed
354
+
355
+ def _apply_imputation_method(
356
+ self,
357
+ series: pd.Series,
358
+ method: str,
359
+ **kwargs
360
+ ) -> Tuple[pd.Series, str]:
361
+ """
362
+ Apply imputation method to individual series
363
+
364
+ Parameters:
365
+ -----------
366
+ series : pd.Series
367
+ Input series
368
+ method : str
369
+ Imputation method
370
+ **kwargs : dict
371
+ Additional parameters
372
+
373
+ Returns:
374
+ --------
375
+ Tuple[pd.Series, str]
376
+ Processed series and method description
377
+ """
378
+ if method == 'interpolate':
379
+ # Interpolation for time series
380
+ if isinstance(series.index, pd.DatetimeIndex):
381
+ method_name = f"{kwargs.get('interpolation_method', 'linear')} interpolation"
382
+ series_filled = series.interpolate(
383
+ method=kwargs.get('interpolation_method', 'linear'),
384
+ limit_direction=kwargs.get('limit_direction', 'both'),
385
+ limit=kwargs.get('limit', None)
386
+ )
387
+ else:
388
+ method_name = 'linear interpolation'
389
+ series_filled = series.interpolate(method='linear')
390
+
391
+ elif method == 'time_weighted':
392
+ # Time-weighted interpolation
393
+ method_name = 'time-weighted interpolation'
394
+ series_filled = self._time_weighted_interpolation(series)
395
+
396
+ elif method == 'seasonal':
397
+ # Seasonal interpolation
398
+ method_name = 'seasonal interpolation'
399
+ series_filled = self._seasonal_interpolation(series, **kwargs)
400
+
401
+ elif method == 'ffill':
402
+ # Forward fill
403
+ method_name = 'forward fill'
404
+ series_filled = series.ffill(limit=kwargs.get('limit', None))
405
+
406
+ elif method == 'bfill':
407
+ # Backward fill
408
+ method_name = 'backward fill'
409
+ series_filled = series.bfill(limit=kwargs.get('limit', None))
410
+
411
+ elif method == 'mean':
412
+ # Mean imputation
413
+ method_name = 'mean imputation'
414
+ series_filled = series.fillna(series.mean())
415
+
416
+ elif method == 'median':
417
+ # Median imputation
418
+ method_name = 'median imputation'
419
+ series_filled = series.fillna(series.median())
420
+
421
+ elif method == 'mode':
422
+ # Mode imputation
423
+ method_name = 'mode imputation'
424
+ mode_value = series.mode()
425
+ if not mode_value.empty:
426
+ series_filled = series.fillna(mode_value.iloc[0])
427
+ else:
428
+ series_filled = series.fillna(series.median())
429
+
430
+ elif method == 'knn':
431
+ # KNN imputation
432
+ method_name = f"KNN imputation (k={kwargs.get('k', 5)})"
433
+ # Simplified version using nearest neighbour mean
434
+ series_filled = self._knn_imputation(series, k=kwargs.get('k', 5))
435
+
436
+ elif method == 'regression':
437
+ # Regression imputation
438
+ method_name = 'regression imputation'
439
+ series_filled = self._regression_imputation(series, **kwargs)
440
+
441
+ elif method == 'spline':
442
+ # Spline interpolation
443
+ method_name = 'spline interpolation'
444
+ series_filled = series.interpolate(method='spline', order=kwargs.get('order', 3))
445
+
446
+ elif method == 'stl':
447
+ # STL decomposition + interpolation
448
+ method_name = 'STL-based imputation'
449
+ series_filled = self._stl_imputation(series, **kwargs)
450
+
451
+ else:
452
+ raise ValueError(f"Unknown method: {method}")
453
+
454
+ # If missing values remain, fill with ffill/bfill
455
+ if series_filled.isnull().any():
456
+ series_filled = series_filled.ffill().bfill()
457
+ method_name += " + ffill/bfill"
458
+
459
+ return series_filled, method_name
460
+
461
+ def _time_weighted_interpolation(self, series: pd.Series) -> pd.Series:
462
+ """Time-weighted interpolation"""
463
+ if not isinstance(series.index, pd.DatetimeIndex):
464
+ return series.interpolate()
465
+
466
+ # Create timestamps
467
+ time_numeric = pd.Series(range(len(series)), index=series.index)
468
+
469
+ # Interpolate timestamps for missing values
470
+ time_interpolated = time_numeric.interpolate()
471
+
472
+ # Interpolate values based on timestamps
473
+ valid_mask = series.notna()
474
+ if valid_mask.sum() < 2:
475
+ return series.ffill().bfill()
476
+
477
+ # Use linear interpolation
478
+ valid_times = time_numeric[valid_mask]
479
+ valid_values = series[valid_mask]
480
+
481
+ # Interpolation
482
+ interp_func = interp1d(
483
+ valid_times,
484
+ valid_values,
485
+ kind='linear',
486
+ bounds_error=False,
487
+ fill_value='extrapolate'
488
+ )
489
+
490
+ series_filled = series.copy()
491
+ missing_mask = series.isna()
492
+ series_filled[missing_mask] = interp_func(time_interpolated[missing_mask])
493
+
494
+ return series_filled
495
+
496
+ def _seasonal_interpolation(
497
+ self,
498
+ series: pd.Series,
499
+ **kwargs
500
+ ) -> pd.Series:
501
+ """Seasonal interpolation"""
502
+ if not isinstance(series.index, pd.DatetimeIndex):
503
+ return series.interpolate()
504
+
505
+ period = kwargs.get('period', self.config.seasonal_period)
506
+
507
+ # Create series copy
508
+ series_filled = series.copy()
509
+
510
+ # Interpolation considering seasonality
511
+ for i in range(len(series)):
512
+ if pd.isna(series.iloc[i]):
513
+ # Find values at same seasonal position
514
+ seasonal_indices = []
515
+ for offset in range(1, 10): # Look in previous/next cycles
516
+ idx_back = i - offset * period
517
+ idx_forward = i + offset * period
518
+
519
+ if idx_back >= 0 and not pd.isna(series.iloc[idx_back]):
520
+ seasonal_indices.append(idx_back)
521
+
522
+ if idx_forward < len(series) and not pd.isna(series.iloc[idx_forward]):
523
+                     seasonal_indices.append(idx_forward)
+ 
+                 if seasonal_indices:
+                     # Take mean value from seasonal positions
+                     seasonal_values = series.iloc[seasonal_indices]
+                     series_filled.iloc[i] = seasonal_values.mean()
+ 
+         # Fill remaining missing values with regular interpolation
+         series_filled = series_filled.interpolate()
+ 
+         return series_filled
+ 
+     def _knn_imputation(
+         self,
+         series: pd.Series,
+         k: int = 5
+     ) -> pd.Series:
+         """KNN imputation for time series"""
+         # Simplified KNN for time series
+         series_filled = series.copy()
+ 
+         for i in range(len(series)):
+             if pd.isna(series.iloc[i]):
+                 # Find nearest k non-missing values
+                 distances = []
+                 values = []
+ 
+                 for j in range(max(0, i - k * 10), min(len(series), i + k * 10)):
+                     if j != i and not pd.isna(series.iloc[j]):
+                         distance = abs(i - j)
+                         distances.append(distance)
+                         values.append(series.iloc[j])
+ 
+                     if len(values) >= k:
+                         break
+ 
+                 if values:
+                     # Distance-weighted average
+                     weights = [1 / (d + 1) for d in distances]
+                     weighted_avg = np.average(values, weights=weights)
+                     series_filled.iloc[i] = weighted_avg
+ 
+         return series_filled
+ 
+     def _regression_imputation(
+         self,
+         series: pd.Series,
+         **kwargs
+     ) -> pd.Series:
+         """Regression imputation based on neighbouring values"""
+         # Simplified regression for time series
+         series_filled = series.copy()
+ 
+         if series.notna().sum() < 3:
+             return series.ffill().bfill()
+ 
+         # Use polynomial regression
+         x = np.arange(len(series))
+         y = series.values
+ 
+         # Valid values mask
+         valid_mask = ~np.isnan(y)
+ 
+         if valid_mask.sum() < 2:
+             return series.ffill().bfill()
+ 
+         # Polynomial regression degree 2
+         coeffs = np.polyfit(x[valid_mask], y[valid_mask], 2)
+         poly_func = np.poly1d(coeffs)
+ 
+         # Fill missing values
+         missing_mask = np.isnan(y)
+         series_filled.iloc[missing_mask] = poly_func(x[missing_mask])
+ 
+         return series_filled
+ 
+     def _stl_imputation(
+         self,
+         series: pd.Series,
+         **kwargs
+     ) -> pd.Series:
+         """STL decomposition-based imputation"""
+         try:
+             if not isinstance(series.index, pd.DatetimeIndex):
+                 return series.interpolate()
+ 
+             # STL decomposition
+             stl = STL(
+                 series.ffill().bfill(),  # Fill missing for STL
+                 period=kwargs.get('period', self.config.seasonal_period),
+                 robust=True
+             )
+             result = stl.fit()
+ 
+             # Reconstruct series without noise
+             reconstructed = result.trend + result.seasonal
+ 
+             # Replace missing values with reconstructed values
+             series_filled = series.copy()
+             missing_mask = series.isna()
+             series_filled[missing_mask] = reconstructed[missing_mask]
+ 
+             return series_filled
+ 
+         except Exception as e:
+             logger.warning(f"STL imputation failed: {e}, using interpolation")
+             return series.interpolate()
+ 
+     def _handle_rowwise(
+         self,
+         data: pd.DataFrame,
+         method: str,
+         **kwargs
+     ) -> pd.DataFrame:
+         """Row-wise missing value handling"""
+         data_processed = data.copy()
+ 
+         # Remove rows with high missing counts
+         if kwargs.get('drop_rows_threshold', 0) > 0:
+             threshold = kwargs['drop_rows_threshold']
+             rows_before = len(data_processed)
+             missing_per_row = data_processed.isnull().sum(axis=1) / data_processed.shape[1] * 100
+             rows_to_drop = missing_per_row[missing_per_row > threshold].index
+             data_processed = data_processed.drop(rows_to_drop)
+             rows_after = len(data_processed)
+             logger.info(f"Rows removed: {rows_before - rows_after} (missing > {threshold}%)")
+ 
+         # Row-wise imputation
+         if method == 'row_mean':
+             data_processed = data_processed.T.fillna(data_processed.mean(axis=1)).T
+         elif method == 'row_median':
+             data_processed = data_processed.T.fillna(data_processed.median(axis=1)).T
+         elif method == 'row_ffill':
+             data_processed = data_processed.ffill(axis=1).bfill(axis=1)
+ 
+         return data_processed
+ 
+     def create_validation_rules(self) -> Dict:
+         """Create validation rules based on missing value analysis"""
+         rules = {}
+ 
+         for col, info in self.missing_info['summary'].items():
+             missing_percent = info['missing_percent']
+ 
+             if missing_percent > 50:
+                 rules[col] = {
+                     'action': 'drop_column',
+                     'reason': f'Missing > 50%: {missing_percent:.1f}%'
+                 }
+             elif missing_percent > 20:
+                 rules[col] = {
+                     'action': 'advanced_imputation',
+                     'reason': f'High missing: {missing_percent:.1f}%',
+                     'recommended_method': 'knn'
+                 }
+             elif missing_percent > 5:
+                 rules[col] = {
+                     'action': 'standard_imputation',
+                     'reason': f'Moderate missing: {missing_percent:.1f}%',
+                     'recommended_method': 'interpolate'
+                 }
+             elif missing_percent > 0:
+                 rules[col] = {
+                     'action': 'simple_imputation',
+                     'reason': f'Low missing: {missing_percent:.1f}%',
+                     'recommended_method': 'ffill'
+                 }
+ 
+         return rules
+ 
+     def get_report(self) -> Dict:
+         """Get missing values report"""
+         return {
+             'missing_info': self.missing_info,
+             'handling_methods': self.handling_methods,
+             'missing_patterns': self.missing_patterns,
+             'validation_rules': self.create_validation_rules()
+         }
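A minimal usage sketch of the analyser above (illustrative, not from the commit). It drives `analyse`, `handle`, and `create_validation_rules` directly, matching the calls the pipeline makes later in this diff; the synthetic series, the `seasonal_period=7` value, and the assumption that `Config` fields default sensibly and are settable are all assumptions of the sketch.

```python
import numpy as np
import pandas as pd

from config.config import Config
from missing_values.missing_analyzer import MissingValueAnalyser

# Daily series with a few gaps punched into it
idx = pd.date_range("2020-01-01", periods=120, freq="D")
data = pd.DataFrame({"value": np.sin(np.arange(120) / 7.0)}, index=idx)
data.iloc[[10, 11, 50, 90], 0] = np.nan

config = Config(seasonal_period=7)
config.save_plots = False  # assumption: attribute is settable; avoids plot side effects

analyser = MissingValueAnalyser(config)
analyser.analyse(data, detailed=True)                 # populates missing_info
filled = analyser.handle(data, method="interpolate", strategy="columnwise")

print(analyser.create_validation_rules())             # threshold-based recommendations
print(int(filled["value"].isna().sum()))              # expect 0 after interpolation
```

The validation rules encode the thresholds used above: columns missing more than 50% are dropped, more than 20% get KNN imputation, more than 5% get interpolation, and anything else gets a forward fill.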
outliers/__init__.py ADDED
File without changes
outliers/outlier_analyzer.py ADDED
@@ -0,0 +1,857 @@
+ # ============================================
+ # CLASS 4: OUTLIER ANALYSER
+ # ============================================
+ import logging
+ from typing import Dict, List, Tuple
+ 
+ from config.config import Config
+ import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ from sklearn.neighbors import LocalOutlierFactor
+ from sklearn.covariance import EllipticEnvelope
+ from scipy import stats
+ 
+ logger = logging.getLogger(__name__)
+ 
+ class OutlierAnalyser:
+     """Class for analysing and handling outliers"""
+ 
+     def __init__(self, config: Config):
+         """
+         Initialise outlier analyser
+ 
+         Parameters:
+         -----------
+         config : Config
+             Experiment configuration
+         """
+         self.config = config
+         self.outlier_info = {}
+         self.handling_methods = {}
+         self.detection_methods = {}
+         self.outlier_models = {}
+ 
+     def analyse(
+         self,
+         data: pd.DataFrame,
+         method: str = None,
+         columns: List[str] = None,
+         **kwargs
+     ) -> Dict:
+         """
+         Analyse outliers in data
+ 
+         Parameters:
+         -----------
+         data : pd.DataFrame
+             Input data
+         method : str, optional
+             Detection method. If None, uses configuration value.
+         columns : List[str], optional
+             List of columns to analyse. If None, uses all numeric columns.
+         **kwargs : dict
+             Additional parameters for method
+ 
+         Returns:
+         --------
+         Dict
+             Information about outliers
+         """
+         logger.info("\n" + "="*80)
+         logger.info("OUTLIER ANALYSIS")
+         logger.info("="*80)
+ 
+         method = method or self.config.outlier_method
+         if columns is None:
+             columns = data.select_dtypes(include=[np.number]).columns
+ 
+         outliers_info = {}
+ 
+         # Apply various detection methods
+         detection_results = {}
+ 
+         # 1. Statistical methods
+         if method in ['iqr', 'zscore', 'sigma', 'all']:
+             detection_results.update(self._statistical_methods(data, columns, method, **kwargs))
+ 
+         # 2. ML methods
+         if method in ['lof', 'isolation_forest', 'elliptic_envelope', 'all']:
+             detection_results.update(self._ml_methods(data, columns, method, **kwargs))
+ 
+         # 3. Temporal methods
+         if isinstance(data.index, pd.DatetimeIndex):
+             detection_results.update(self._temporal_methods(data, columns, **kwargs))
+ 
+         # Aggregate results
+         for col in columns:
+             if col in detection_results:
+                 # Combine results from different methods
+                 combined_mask = self._combine_detection_methods(detection_results, col)
+ 
+                 outliers_count = combined_mask.sum()
+                 outliers_percent = (outliers_count / len(data)) * 100
+ 
+                 # Detailed information (col_stats avoids shadowing the scipy.stats import)
+                 col_data = data[col].dropna()
+                 col_stats = {
+                     'mean': float(col_data.mean()),
+                     'std': float(col_data.std()),
+                     'median': float(col_data.median()),
+                     'q1': float(col_data.quantile(0.25)),
+                     'q3': float(col_data.quantile(0.75)),
+                     'min': float(col_data.min()),
+                     'max': float(col_data.max()),
+                     'skewness': float(col_data.skew()),
+                     'kurtosis': float(col_data.kurtosis())
+                 }
+ 
+                 outliers_info[col] = {
+                     'method': method,
+                     'statistics': col_stats,
+                     'outliers_count': int(outliers_count),
+                     'outliers_percent': float(outliers_percent),
+                     'outlier_indices': data[combined_mask].index.tolist() if outliers_count > 0 else [],
+                     'outlier_values': data.loc[combined_mask, col].tolist() if outliers_count > 0 else [],
+                     'detection_methods': {
+                         name: {
+                             'count': int(mask.sum()),
+                             'percent': float(mask.sum() / len(data) * 100)
+                         }
+                         for name, mask in detection_results[col].items()
+                     }
+                 }
+ 
+                 logger.info(f"{col}: {outliers_count} outliers ({outliers_percent:.2f}%)")
+ 
+         self.outlier_info = outliers_info
+         self.detection_methods = detection_results
+ 
+         # Visualisation
+         if self.config.save_plots and len(columns) > 0:
+             self._plot_outlier_analysis(data, columns, outliers_info)
+ 
+         return outliers_info
+ 
+     def _statistical_methods(
+         self,
+         data: pd.DataFrame,
+         columns: List[str],
+         method: str,
+         **kwargs
+     ) -> Dict:
+         """Statistical outlier detection methods"""
+         results = {}
+ 
+         for col in columns:
+             col_results = {}
+             series = data[col].dropna()
+ 
+             if len(series) < 3:
+                 continue
+ 
+             # IQR method
+             if method in ['iqr', 'all']:
+                 q1 = series.quantile(0.25)
+                 q3 = series.quantile(0.75)
+                 iqr = q3 - q1
+                 lower_bound = q1 - self.config.outlier_alpha * iqr
+                 upper_bound = q3 + self.config.outlier_alpha * iqr
+ 
+                 iqr_mask = (data[col] < lower_bound) | (data[col] > upper_bound)
+                 col_results['iqr'] = iqr_mask
+ 
+             # Z-score method
+             if method in ['zscore', 'sigma', 'all']:
+                 z_threshold = kwargs.get('z_threshold', 3)
+                 z_scores = np.abs((data[col] - series.mean()) / series.std())
+                 z_mask = z_scores > z_threshold
+                 col_results['zscore'] = z_mask
+ 
+             # Modified Z-score method
+             if method in ['zscore', 'all']:
+                 median = series.median()
+                 mad = np.median(np.abs(series - median))
+                 if mad != 0:
+                     modified_z_scores = 0.6745 * (data[col] - median) / mad
+                     mz_mask = np.abs(modified_z_scores) > 3.5
+                     col_results['modified_zscore'] = mz_mask
+ 
+             # Tukey's fences (reuses q1/q3/iqr from the IQR branch above, same guard)
+             if method in ['iqr', 'all']:
+                 inner_lower = q1 - 1.5 * iqr
+                 inner_upper = q3 + 1.5 * iqr
+                 outer_lower = q1 - 3 * iqr
+                 outer_upper = q3 + 3 * iqr
+ 
+                 mild_mask = ((data[col] < inner_lower) | (data[col] > inner_upper)) & \
+                             ((data[col] >= outer_lower) & (data[col] <= outer_upper))
+                 extreme_mask = (data[col] < outer_lower) | (data[col] > outer_upper)
+ 
+                 col_results['tukey_mild'] = mild_mask
+                 col_results['tukey_extreme'] = extreme_mask
+ 
+             results[col] = col_results
+ 
+         return results
+ 
+     def _ml_methods(
+         self,
+         data: pd.DataFrame,
+         columns: List[str],
+         method: str,
+         **kwargs
+     ) -> Dict:
+         """ML outlier detection methods"""
+         results = {}
+ 
+         numeric_data = data[columns].dropna()
+ 
+         if len(numeric_data) < 10:
+             return results
+ 
+         try:
+             # Local Outlier Factor
+             if method in ['lof', 'all']:
+                 lof = LocalOutlierFactor(
+                     contamination=self.config.outlier_contamination,
+                     n_neighbors=kwargs.get('n_neighbors', 20)
+                 )
+                 lof_labels = lof.fit_predict(numeric_data)
+                 lof_mask = pd.Series(lof_labels == -1, index=numeric_data.index)
+ 
+                 for col in columns:
+                     if col in numeric_data.columns:
+                         if col not in results:
+                             results[col] = {}
+                         results[col]['lof'] = lof_mask
+ 
+             # Elliptic Envelope
+             if method in ['elliptic_envelope', 'all']:
+                 try:
+                     envelope = EllipticEnvelope(
+                         contamination=self.config.outlier_contamination,
+                         random_state=42
+                     )
+                     envelope_labels = envelope.fit_predict(numeric_data)
+                     envelope_mask = pd.Series(envelope_labels == -1, index=numeric_data.index)
+ 
+                     for col in columns:
+                         if col in numeric_data.columns:
+                             if col not in results:
+                                 results[col] = {}
+                             results[col]['elliptic_envelope'] = envelope_mask
+                 except Exception as e:
+                     logger.warning(f"Elliptic Envelope failed: {e}")
+ 
+         except Exception as e:
+             logger.warning(f"ML outlier detection methods failed: {e}")
+ 
+         return results
+ 
+     def _temporal_methods(
+         self,
+         data: pd.DataFrame,
+         columns: List[str],
+         **kwargs
+     ) -> Dict:
+         """Outlier detection methods for time series"""
+         results = {}
+ 
+         for col in columns:
+             col_results = {}
+             series = data[col].dropna()
+ 
+             if len(series) < 30:
+                 continue
+ 
+             # Rolling statistics method
+             window = kwargs.get('temporal_window', 30)
+             rolling_mean = series.rolling(window=window, center=True).mean()
+             rolling_std = series.rolling(window=window, center=True).std()
+ 
+             # Outliers relative to moving average
+             threshold = kwargs.get('temporal_threshold', 3)
+             temporal_mask = np.abs(series - rolling_mean) > (threshold * rolling_std)
+             col_results['temporal'] = temporal_mask
+ 
+             # Seasonal detrending + outlier detection
+             try:
+                 # Simple seasonal detrending
+                 if len(series) > 365:
+                     seasonal_period = kwargs.get('seasonal_period', 365)
+                     seasonal_mean = series.rolling(window=seasonal_period, center=True).mean()
+                     detrended = series - seasonal_mean
+ 
+                     # Outliers in detrended series
+                     q1 = detrended.quantile(0.25)
+                     q3 = detrended.quantile(0.75)
+                     iqr = q3 - q1
+                     seasonal_lower = q1 - 3 * iqr
+                     seasonal_upper = q3 + 3 * iqr
+ 
+                     seasonal_mask = (detrended < seasonal_lower) | (detrended > seasonal_upper)
+                     col_results['seasonal'] = seasonal_mask
+             except Exception as e:
+                 logger.debug(f"Seasonal outlier detection failed for {col}: {e}")
+ 
+             results[col] = col_results
+ 
+         return results
+ 
+     def _combine_detection_methods(
+         self,
+         detection_results: Dict,
+         column: str
+     ) -> pd.Series:
+         """Combine results from different detection methods"""
+         if column not in detection_results:
+             return pd.Series(False, index=pd.RangeIndex(0))
+ 
+         methods = detection_results[column]
+         combined_mask = None
+ 
+         for method_name, mask in methods.items():
+             if combined_mask is None:
+                 combined_mask = mask.copy()
+             else:
+                 # Combine via OR (outlier by any method)
+                 combined_mask = combined_mask | mask
+ 
+         # Guard against a column with no detection results at all
+         if combined_mask is None:
+             return pd.Series(False, index=pd.RangeIndex(0))
+ 
+         return combined_mask.fillna(False)
+ 
+     def _plot_outlier_analysis(
+         self,
+         data: pd.DataFrame,
+         columns: List[str],
+         outliers_info: Dict
+     ) -> None:
+         """Visualise outlier analysis"""
+         n_cols = min(len(columns), 4)
+         n_rows = (len(columns) + n_cols - 1) // n_cols
+ 
+         fig = plt.figure(figsize=(16, 4 * n_rows))
+         gs = fig.add_gridspec(n_rows, n_cols)
+ 
+         for idx, col in enumerate(columns):
+             if col not in outliers_info:
+                 continue
+ 
+             row = idx // n_cols
+             col_idx = idx % n_cols
+ 
+             ax = fig.add_subplot(gs[row, col_idx])
+ 
+             # Data
+             series = data[col].dropna()
+ 
+             # 1. Box plot
+             bp = ax.boxplot(
+                 series.values,
+                 vert=True,
+                 patch_artist=True,
+                 widths=0.6,
+                 showfliers=False
+             )
+ 
+             # Colours for box plot
+             bp['boxes'][0].set_facecolor('lightblue')
+             bp['medians'][0].set_color('red')
+             bp['whiskers'][0].set_color('black')
+             bp['whiskers'][1].set_color('black')
+             bp['caps'][0].set_color('black')
+             bp['caps'][1].set_color('black')
+ 
+             # 2. Outliers
+             if outliers_info[col]['outliers_count'] > 0:
+                 outlier_values = outliers_info[col]['outlier_values']
+ 
+                 # Jitter x positions around the single box
+                 jitter = np.random.normal(0, 0.05, len(outlier_values))
+ 
+                 ax.scatter(
+                     np.ones(len(outlier_values)) + jitter,
+                     outlier_values,
+                     color='red',
+                     alpha=0.6,
+                     s=30,
+                     edgecolors='black',
+                     label=f'Outliers ({outliers_info[col]["outliers_count"]})'
+                 )
+ 
+             # 3. Histogram on same plot
+             ax2 = ax.twinx()
+             ax2.hist(
+                 series.values,
+                 bins=30,
+                 alpha=0.3,
+                 color='green',
+                 density=True
+             )
+ 
+             # 4. Normal distribution for comparison
+             if len(series) > 10:
+                 x = np.linspace(series.min(), series.max(), 100)
+                 mean = series.mean()
+                 std = series.std()
+ 
+                 if std > 0:
+                     p = stats.norm.pdf(x, mean, std)
+                     ax2.plot(x, p, 'k--', linewidth=1, label='Normal distribution')
+ 
+             ax.set_title(f'{col}\nOutliers: {outliers_info[col]["outliers_count"]} ({outliers_info[col]["outliers_percent"]:.1f}%)')
+             ax.set_ylabel('Value')
+             ax.grid(True, alpha=0.3)
+ 
+             # Legend
+             if outliers_info[col]['outliers_count'] > 0:
+                 ax.legend(loc='upper right', fontsize=8)
+                 ax2.legend(loc='upper left', fontsize=8)
+ 
+         plt.tight_layout()
+         plt.savefig(
+             f'{self.config.results_dir}/plots/outliers_analysis.png',
+             dpi=300,
+             bbox_inches='tight'
+         )
+         plt.show()
+ 
+         # Additional plots for time series
+         if isinstance(data.index, pd.DatetimeIndex) and len(columns) > 0:
+             self._plot_temporal_outliers(data, columns, outliers_info)
+ 
+     def _plot_temporal_outliers(
+         self,
+         data: pd.DataFrame,
+         columns: List[str],
+         outliers_info: Dict
+     ) -> None:
+         """Visualise outliers over time"""
+         n_plots = min(len(columns), 3)
+ 
+         fig, axes = plt.subplots(n_plots, 1, figsize=(14, 4 * n_plots))
+         if n_plots == 1:
+             axes = [axes]
+ 
+         for idx, (col, ax) in enumerate(zip(columns[:n_plots], axes)):
+             if col not in outliers_info:
+                 continue
+ 
+             # Time series
+             ax.plot(data.index, data[col], alpha=0.7, linewidth=1, label='Original series')
+ 
+             # Outliers
+             if outliers_info[col]['outliers_count'] > 0:
+                 outlier_indices = outliers_info[col]['outlier_indices']
+                 outlier_values = outliers_info[col]['outlier_values']
+ 
+                 ax.scatter(
+                     outlier_indices,
+                     outlier_values,
+                     color='red',
+                     s=40,
+                     edgecolors='black',
+                     zorder=5,
+                     label='Outliers'
+                 )
+ 
+             # Moving average
+             if len(data) > 30:
+                 rolling_mean = data[col].rolling(window=30, center=True).mean()
+                 ax.plot(data.index, rolling_mean, 'orange', linewidth=2, label='Moving average (30)')
+ 
+             ax.set_title(f'Outliers over time: {col}')
+             ax.set_xlabel('Date')
+             ax.set_ylabel(col)
+             ax.legend(fontsize=8)
+             ax.grid(True, alpha=0.3)
+ 
+         plt.tight_layout()
+         plt.savefig(
+             f'{self.config.results_dir}/plots/temporal_outliers.png',
+             dpi=300,
+             bbox_inches='tight'
+         )
+         plt.show()
+ 
+     def handle(
+         self,
+         data: pd.DataFrame,
+         method: str = 'clip',
+         strategy: str = 'columnwise',
+         **kwargs
+     ) -> pd.DataFrame:
+         """
+         Handle outliers
+ 
+         Parameters:
+         -----------
+         data : pd.DataFrame
+             Input data
+         method : str
+             Handling method: 'clip', 'remove', 'mean', 'median', 'winsorize', 'transform', 'impute'
+         strategy : str
+             Strategy: 'columnwise', 'global', 'adaptive'
+         **kwargs : dict
+             Additional parameters for method
+ 
+         Returns:
+         --------
+         pd.DataFrame
+             Data with handled outliers
+         """
+         logger.info("\n" + "="*80)
+         logger.info("HANDLING OUTLIERS")
+         logger.info("="*80)
+ 
+         if not self.outlier_info:
+             logger.warning("⚠ Perform outlier analysis first")
+             return data
+ 
+         data_processed = data.copy()
+         methods_applied = {}
+ 
+         for col, info in self.outlier_info.items():
+             if col not in data_processed.columns:
+                 continue
+ 
+             outliers_count = info['outliers_count']
+ 
+             if outliers_count > 0:
+                 # Create outlier mask
+                 outlier_mask = pd.Series(False, index=data_processed.index)
+                 if info['outlier_indices']:
+                     outlier_indices = [idx for idx in info['outlier_indices'] if idx in data_processed.index]
+                     outlier_mask.loc[outlier_indices] = True
+ 
+                 # Determine boundaries
+                 col_stats = info['statistics']
+                 q1, q3 = col_stats['q1'], col_stats['q3']
+                 iqr = q3 - q1
+                 lower_bound = q1 - self.config.outlier_alpha * iqr
+                 upper_bound = q3 + self.config.outlier_alpha * iqr
+ 
+                 if method == 'clip':
+                     # Clip values to boundaries
+                     data_processed[col] = data_processed[col].clip(
+                         lower=lower_bound,
+                         upper=upper_bound
+                     )
+                     method_used = 'clipping'
+                     affected = outliers_count
+ 
+                 elif method == 'remove':
+                     # Remove rows with outliers
+                     data_processed = data_processed[~outlier_mask]
+                     method_used = 'removal'
+                     affected = outliers_count
+ 
+                 elif method == 'mean':
+                     # Replace outliers with mean value
+                     mean_val = data_processed[col].mean()
+                     data_processed.loc[outlier_mask, col] = mean_val
+                     method_used = 'mean imputation'
+                     affected = outliers_count
+ 
+                 elif method == 'median':
+                     # Replace outliers with median
+                     median_val = data_processed[col].median()
+                     data_processed.loc[outlier_mask, col] = median_val
+                     method_used = 'median imputation'
+                     affected = outliers_count
+ 
+                 elif method == 'winsorize':
+                     # Winsorisation
+                     data_processed[col] = self._winsorize_series(
+                         data_processed[col],
+                         limits=kwargs.get('limits', (0.05, 0.05))
+                     )
+                     method_used = 'winsorization'
+                     affected = outliers_count
+ 
+                 elif method == 'transform':
+                     # Transformation to reduce outlier impact
+                     transform_method = kwargs.get('transform_method', 'log')
+                     data_processed[col] = self._transform_series(
+                         data_processed[col],
+                         method=transform_method
+                     )
+                     method_used = f'{transform_method} transformation'
+                     affected = 'all'  # Transformation applied to all values
+ 
+                 elif method == 'impute':
+                     # Smart outlier imputation
+                     impute_method = kwargs.get('impute_method', 'neighbors')
+                     data_processed[col] = self._impute_outliers(
+                         data_processed[col],
+                         outlier_mask,
+                         method=impute_method,
+                         **kwargs
+                     )
+                     method_used = f'{impute_method} imputation'
+                     affected = outliers_count
+ 
+                 elif method == 'adaptive':
+                     # Adaptive handling
+                     data_processed[col] = self._adaptive_outlier_handling(
+                         data_processed[col],
+                         outlier_mask,
+                         **kwargs
+                     )
+                     method_used = 'adaptive handling'
+                     affected = outliers_count
+ 
+                 else:
+                     raise ValueError(f"Unknown method: {method}")
+ 
+                 methods_applied[col] = {
+                     'method': method_used,
+                     'outliers_before': outliers_count,
+                     'affected': affected,
+                     'bounds': {
+                         'lower': float(lower_bound),
+                         'upper': float(upper_bound)
+                     }
+                 }
+ 
+                 logger.info(f"  {col}: {outliers_count} outliers handled ({method_used})")
+ 
+         self.handling_methods = methods_applied
+ 
+         # Handling statistics
+         total_outliers = sum(info['outliers_count'] for info in self.outlier_info.values())
+         total_affected = sum(method['affected'] for method in methods_applied.values()
+                              if isinstance(method['affected'], (int, np.integer)))
+ 
+         logger.info(f"\n✓ {total_affected} out of {total_outliers} outliers handled")
+         logger.info(f"  Data size before: {len(data)} rows")
+         logger.info(f"  Data size after: {len(data_processed)} rows")
+ 
+         # Visualise results
+         if self.config.save_plots and methods_applied:
+             self._plot_outlier_handling_results(data, data_processed, methods_applied)
+ 
+         return data_processed
+ 
+     def _winsorize_series(
+         self,
+         series: pd.Series,
+         limits: Tuple[float, float] = (0.05, 0.05)
+     ) -> pd.Series:
+         """Winsorize series"""
+         from scipy.stats.mstats import winsorize
+         try:
+             winsorized = winsorize(series.values, limits=limits)
+             return pd.Series(winsorized, index=series.index)
+         except Exception:
+             return series
+ 
+     def _transform_series(
+         self,
+         series: pd.Series,
+         method: str = 'log'
+     ) -> pd.Series:
+         """Transform series to reduce outlier impact"""
+         series_transformed = series.copy()
+ 
+         if method == 'log':
+             # Logarithmic transformation
+             min_val = series.min()
+             if min_val <= 0:
+                 shift = abs(min_val) + 1
+                 series_transformed = np.log(series + shift)
+             else:
+                 series_transformed = np.log(series)
+ 
+         elif method == 'boxcox':
+             # Box-Cox transformation
+             try:
+                 from scipy.stats import boxcox
+                 transformed, _ = boxcox(series - series.min() + 1)
+                 series_transformed = pd.Series(transformed, index=series.index)
+             except Exception:
+                 logger.warning("Box-Cox transformation failed, using log")
+                 return self._transform_series(series, 'log')
+ 
+         elif method == 'sqrt':
+             # Square root
+             min_val = series.min()
+             if min_val < 0:
+                 series_transformed = np.sqrt(series - min_val)
+             else:
+                 series_transformed = np.sqrt(series)
+ 
+         elif method == 'yeojohnson':
+             # Yeo-Johnson transformation
+             try:
+                 from scipy.stats import yeojohnson
+                 transformed, _ = yeojohnson(series)
+                 series_transformed = pd.Series(transformed, index=series.index)
+             except Exception:
+                 logger.warning("Yeo-Johnson transformation failed, using log")
+                 return self._transform_series(series, 'log')
+ 
+         return series_transformed
+ 
+     def _impute_outliers(
+         self,
+         series: pd.Series,
+         outlier_mask: pd.Series,
+         method: str = 'neighbors',
+         **kwargs
+     ) -> pd.Series:
+         """Smart outlier imputation"""
+         series_imputed = series.copy()
+ 
+         if method == 'neighbors':
+             # Replace with mean of neighbouring values
+             for idx in series[outlier_mask].index:
+                 if idx in series.index:
+                     pos = series.index.get_loc(idx)
+                     neighbours = []
+ 
+                     # Find nearest non-outlier to the left
+                     for offset in range(1, 6):
+                         if pos - offset >= 0 and not outlier_mask.iloc[pos - offset]:
+                             neighbours.append(series.iloc[pos - offset])
+                             break
+ 
+                     # Find nearest non-outlier to the right
+                     for offset in range(1, 6):
+                         if pos + offset < len(series) and not outlier_mask.iloc[pos + offset]:
+                             neighbours.append(series.iloc[pos + offset])
+                             break
+ 
+                     if neighbours:
+                         series_imputed.loc[idx] = np.mean(neighbours)
+ 
+         elif method == 'interpolate':
+             # Interpolation
+             series_imputed = series.mask(outlier_mask).interpolate()
+ 
+         elif method == 'rolling':
+             # Replace with moving average
+             window = kwargs.get('window', 5)
+             rolling_mean = series.rolling(window=window, center=True, min_periods=1).mean()
+             series_imputed = series.mask(outlier_mask, rolling_mean)
+ 
+         return series_imputed
+ 
+     def _adaptive_outlier_handling(
+         self,
+         series: pd.Series,
+         outlier_mask: pd.Series,
+         **kwargs
+     ) -> pd.Series:
+         """Adaptive outlier handling"""
+         series_processed = series.copy()
+         outlier_indices = series[outlier_mask].index
+ 
+         for idx in outlier_indices:
+             if idx in series.index:
+                 value = series.loc[idx]
+                 col_stats = self.outlier_info.get(series.name, {}).get('statistics', {})
+ 
+                 # Determine outlier type
+                 q1 = col_stats.get('q1', series.quantile(0.25))
+                 q3 = col_stats.get('q3', series.quantile(0.75))
+                 iqr = q3 - q1
+ 
+                 if value < q1 - 3 * iqr:
+                     # Extreme low outlier
+                     series_processed.loc[idx] = q1 - 1.5 * iqr
+                 elif value > q3 + 3 * iqr:
+                     # Extreme high outlier
+                     series_processed.loc[idx] = q3 + 1.5 * iqr
+                 else:
+                     # Moderate outlier
+                     pos = series.index.get_loc(idx)
+                     # Use linear interpolation
+                     if pos > 0 and pos < len(series) - 1:
+                         series_processed.loc[idx] = (series.iloc[pos-1] + series.iloc[pos+1]) / 2
+ 
+         return series_processed
+ 
+     def _plot_outlier_handling_results(
+         self,
+         original_data: pd.DataFrame,
+         processed_data: pd.DataFrame,
+         methods_applied: Dict
+     ) -> None:
+         """Visualise outlier handling results"""
+         cols_to_plot = list(methods_applied.keys())[:3]
+ 
+         if not cols_to_plot:
+             return
+ 
+         fig, axes = plt.subplots(len(cols_to_plot), 2, figsize=(14, 4 * len(cols_to_plot)))
+         if len(cols_to_plot) == 1:
+             axes = axes.reshape(1, -1)
+ 
+         for idx, col in enumerate(cols_to_plot):
+             if col not in original_data.columns or col not in processed_data.columns:
+                 continue
+ 
+             # Distribution before handling
+             axes[idx, 0].hist(original_data[col].dropna(), bins=30, alpha=0.5, label='Before', density=True)
+             axes[idx, 0].hist(processed_data[col].dropna(), bins=30, alpha=0.5, label='After', density=True)
+             axes[idx, 0].set_title(f'{col}: Distribution before/after')
+             axes[idx, 0].set_xlabel('Value')
+             axes[idx, 0].set_ylabel('Density')
+             axes[idx, 0].legend()
+             axes[idx, 0].grid(True, alpha=0.3)
+ 
+             # QQ plot for normality check
+             stats.probplot(original_data[col].dropna(), dist="norm", plot=axes[idx, 1])
+             axes[idx, 1].set_title(f'{col}: Q-Q plot (before handling)')
+             axes[idx, 1].grid(True, alpha=0.3)
+ 
+         plt.tight_layout()
+         plt.savefig(
+             f'{self.config.results_dir}/plots/outlier_handling_results.png',
+             dpi=300,
+             bbox_inches='tight'
+         )
+         plt.show()
+ 
+     def create_validation_rules(self) -> Dict:
+         """Create validation rules based on outlier analysis"""
+         rules = {}
+ 
+         for col, info in self.outlier_info.items():
+             outliers_percent = info['outliers_percent']
+             skewness = info['statistics']['skewness']
+ 
+             rule = {
+                 'outliers_percent': outliers_percent,
+                 'skewness': skewness,
+                 'recommended_action': 'none'
+             }
+ 
+             if outliers_percent > 10:
+                 rule['recommended_action'] = 'aggressive_handling'
+                 rule['reason'] = f'High outliers: {outliers_percent:.1f}%'
+             elif outliers_percent > 5:
+                 rule['recommended_action'] = 'moderate_handling'
+                 rule['reason'] = f'Moderate outliers: {outliers_percent:.1f}%'
+             elif outliers_percent > 1:
+                 rule['recommended_action'] = 'conservative_handling'
+                 rule['reason'] = f'Low outliers: {outliers_percent:.1f}%'
+ 
+             if abs(skewness) > 1:
+                 rule['skewness_issue'] = True
+                 rule['skewness_reason'] = f'Strong skewness: {skewness:.2f}'
+                 if rule['recommended_action'] == 'none':
+                     rule['recommended_action'] = 'transformation'
+ 
+             rules[col] = rule
+ 
+         return rules
+ 
+     def get_report(self) -> Dict:
+         """Get outlier analysis report"""
+         return {
+             'outlier_info': self.outlier_info,
+             'handling_methods': self.handling_methods,
+             'detection_methods': self.detection_methods,
+             'validation_rules': self.create_validation_rules()
+         }
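The analyser above separates detection (`analyse`) from treatment (`handle`), so one detection pass can be reused with different handling methods. A small sketch (illustrative, not from the commit), assuming `Config` provides usable defaults for `outlier_method`, `outlier_alpha`, and `outlier_contamination`, and that `save_plots` is a settable attribute:

```python
import numpy as np
import pandas as pd

from config.config import Config
from outliers.outlier_analyzer import OutlierAnalyser

rng = np.random.default_rng(0)
df = pd.DataFrame({"flow": rng.normal(100, 10, 500)})
df.loc[[25, 250], "flow"] = [400.0, -150.0]   # inject two obvious outliers

config = Config(outlier_method="iqr")
config.save_plots = False                     # assumption: settable; skips the plot step

analyser = OutlierAnalyser(config)
info = analyser.analyse(df, method="iqr", columns=["flow"])
clipped = analyser.handle(df, method="clip")  # clip to Q1/Q3 -/+ outlier_alpha * IQR

print(info["flow"]["outliers_count"])          # expect at least the two injected points
print(float(clipped["flow"].min()), float(clipped["flow"].max()))
```

Because the index here is a plain RangeIndex rather than a DatetimeIndex, the temporal detection branch is skipped automatically, which is the same guard the pipeline relies on.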
pipeline/__init__.py ADDED
File without changes
pipeline/main_pipeline.py ADDED
@@ -0,0 +1,603 @@
+ # ============================================
+ # CLASS 14: MAIN PIPELINE
+ # ============================================
+ from datetime import datetime
+ import json
+ import logging
+ import os
+ import traceback
+ from typing import Any, Dict, Optional
+ 
+ from config.config import Config
+ from correlations.correlation_analyzer import CorrelationAnalyzer
+ from data_loader.data_loader import DataLoader
+ from decomposition.decomposer import TimeSeriesDecomposer
+ from feature_selection.feature_selector import FeatureSelector
+ from features.feature_engineer import FeatureEngineer
+ from missing_values.missing_analyzer import MissingValueAnalyser
+ from outliers.outlier_analyzer import OutlierAnalyser
+ from scaling.data_scaler import DataScaler
+ from splitting.data_splitter import DataSplitter
+ from stationarity.stationarity_checker import StationarityChecker
+ from validation.data_validator import DataValidator
+ from visualization.visualization_manager import VisualisationManager
+ import pandas as pd
+ import numpy as np
+ 
+ logger = logging.getLogger(__name__)
+ 
+ class EnhancedDataPreprocessingPipeline:
+     """Enhanced main data preprocessing pipeline"""
+ 
+     def __init__(self, config: Config):
+         """
+         Initialise pipeline
+ 
+         Parameters:
+         -----------
+         config : Config
+             Experiment configuration
+         """
+         self.config = config
+         self.data_loader = DataLoader(config)
+         self.missing_analyser = MissingValueAnalyser(config)
+         self.outlier_analyser = OutlierAnalyser(config)
+         self.feature_engineer = FeatureEngineer(config)
+         self.stationarity_checker = StationarityChecker(config)
+         self.decomposer = TimeSeriesDecomposer(config)
+         self.correlation_analyser = CorrelationAnalyzer(config)
+         self.data_splitter = DataSplitter(config)
+         self.data_scaler = DataScaler(config)
+         self.feature_selector = FeatureSelector(config)
+         self.data_validator = DataValidator(config)
+         self.visualisation_manager = VisualisationManager(config)
+ 
+         self.results = {}
+         self.processed_data = None
+         self.train_data = None
+         self.val_data = None
+         self.test_data = None
+         self.is_fitted = False
+ 
+     def run_full_pipeline(
+         self,
+         data_path: Optional[str] = None,
+         use_synthetic: bool = False,
+         save_intermediate: bool = True,
+         create_reports: bool = True
+     ) -> pd.DataFrame:
+         """
+         Run enhanced full data preprocessing pipeline
+ 
+         Parameters:
+         -----------
+         data_path : str, optional
+             Path to data. If None, uses configuration value.
+         use_synthetic : bool
+             Use synthetic data for testing
+         save_intermediate : bool
+             Save intermediate results
+         create_reports : bool
+             Create reports
+ 
+         Returns:
+         --------
+         pd.DataFrame
+             Processed data
+         """
+         logger.info("\n" + "="*80)
+         logger.info("RUNNING ENHANCED DATA PREPROCESSING PIPELINE")
+         logger.info("="*80)
+ 
+         start_time = datetime.now()
+ 
+         try:
+             # Step 1: Data loading
+             logger.info("\n" + "="*80)
+             logger.info("STEP 1: DATA LOADING")
+             logger.info("="*80)
+ 
+             if use_synthetic:
+                 data = self.data_loader.create_synthetic_data(
+                     n_days=365*20,
+                     trend_strength=0.01,
+                     noise_std=10,
+                     include_exogenous=True
+                 )
+             else:
+                 data = self.data_loader.load_from_csv(
+                     data_path=data_path,
+                     parse_dates=['date']
+                 )
+ 
+             # Check for date index
+             if not isinstance(data.index, pd.DatetimeIndex):
+                 logger.warning("Index is not DatetimeIndex, setting...")
+                 if 'date' in data.columns:
+                     data = data.set_index('date')
+                     logger.info("Index set from 'date' column")
+ 
+             self.results['data_loading'] = {
+                 'shape': list(data.shape),
+                 'columns': list(data.columns),
+                 'date_range': {
+                     'min': data.index.min().strftime('%Y-%m-%d') if isinstance(data.index, pd.DatetimeIndex) else None,
+                     'max': data.index.max().strftime('%Y-%m-%d') if isinstance(data.index, pd.DatetimeIndex) else None
+                 },
+                 'is_datetime_index': isinstance(data.index, pd.DatetimeIndex)
+             }
+ 
+             # Save raw data information
+             self.data_loader.save_raw_data_info()
+ 
+             # Step 2: Raw data validation
+             logger.info("\n" + "="*80)
+             logger.info("STEP 2: RAW DATA VALIDATION")
+             logger.info("="*80)
+ 
+             raw_validation = self.data_validator.validate(
+                 data, stage='raw', detailed=True
+             )
+             self.results['raw_validation'] = raw_validation
+ 
+             if raw_validation['status'] == 'FAIL':
+                 logger.warning("⚠ Raw data has critical issues!")
+                 if not self.config.enable_validation:
+                     logger.warning("Validation disabled in configuration, continuing processing")
+                 else:
+                     logger.error("Pipeline interrupted due to data issues")
+                     return None
+ 
+             # Step 3: Missing values analysis and handling
+             logger.info("\n" + "="*80)
+             logger.info("STEP 3: MISSING VALUES HANDLING")
+             logger.info("="*80)
+ 
+             missing_info = self.missing_analyser.analyse(data, detailed=True)
+             self.results['missing_analysis'] = missing_info
+ 
+             # Handle missing values
+             data = self.missing_analyser.handle(
+                 data,
+                 method='interpolate',
+                 strategy='columnwise'
+             )
+             self.results['missing_handling'] = self.missing_analyser.handling_methods
+ 
+             # Step 4: Outlier analysis and handling
+             logger.info("\n" + "="*80)
+             logger.info("STEP 4: OUTLIER HANDLING")
+             logger.info("="*80)
+ 
+             outlier_info = self.outlier_analyser.analyse(
+                 data,
+                 method=self.config.outlier_method,
+                 columns=data.select_dtypes(include=[np.number]).columns.tolist()
+             )
+             self.results['outlier_analysis'] = outlier_info
+ 
+             # Handle outliers
+             data = self.outlier_analyser.handle(
+                 data,
+                 method='clip',
+                 strategy='columnwise'
+             )
+             self.results['outlier_handling'] = self.outlier_analyser.handling_methods
+ 
+             # Step 5: Feature engineering
+             logger.info("\n" + "="*80)
+             logger.info("STEP 5: FEATURE ENGINEERING")
+             logger.info("="*80)
+ 
+             data = self.feature_engineer.create_all_features(data)
+             self.results['feature_engineering'] = self.feature_engineer.feature_info
+ 
+             # Check for data after feature engineering
+             if len(data) == 0:
+                 logger.error("No data remaining after feature engineering!")
+                 return None
+ 
+             # Step 6: Stationarity analysis
+             logger.info("\n" + "="*80)
+             logger.info("STEP 6: STATIONARITY ANALYSIS")
+             logger.info("="*80)
+ 
+             stationarity_results = self.stationarity_checker.check(
+                 data,
+                 target_col=self.config.target_column,
+                 make_stationary=True,
+                 try_transformations=True
+             )
+             self.results['stationarity_analysis'] = stationarity_results
+ 
+             # Step 7: Time series decomposition
+             logger.info("\n" + "="*80)
+             logger.info("STEP 7: TIME SERIES DECOMPOSITION")
+             logger.info("="*80)
+ 
+             if isinstance(data.index, pd.DatetimeIndex) and len(data) > 365:
+                 decomposition_results = self.decomposer.decompose(
+                     data,
+                     target_col=self.config.target_column,
+                     method='stl',
+                     period=self.config.seasonal_period
+                 )
+                 self.results['decomposition'] = decomposition_results
+             else:
+                 logger.info("Skipped: insufficient data or no DatetimeIndex")
+                 self.results['decomposition'] = {'skipped': 'insufficient data or no DatetimeIndex'}
+ 
+             # Step 8: Correlation analysis
+             logger.info("\n" + "="*80)
+             logger.info("STEP 8: CORRELATION ANALYSIS")
+             logger.info("="*80)
+ 
+             corr_matrix = self.correlation_analyser.analyze(
+                 data,
+                 target_col=self.config.target_column,
+                 threshold=0.8,
+                 detailed=True
+             )
+             self.results['correlation_analysis'] = self.correlation_analyser.get_report()
+ 
+             # Remove highly correlated features
+             if not corr_matrix.empty:
+                 data = self.correlation_analyser.remove_highly_correlated(
+                     data,
+                     threshold=0.95,
+                     method='variance',
+                     keep_target=True
+                 )
+             else:
+                 logger.warning("Correlation matrix empty, skipping feature removal")
+ 
+             # Step 9: Processed data validation
+             logger.info("\n" + "="*80)
+             logger.info("STEP 9: PROCESSED DATA VALIDATION")
+             logger.info("="*80)
+ 
+             processed_validation = self.data_validator.validate(
+                 data, stage='processed', detailed=True
+             )
+             self.results['processed_validation'] = processed_validation
+ 
+             if processed_validation['status'] == 'FAIL':
+                 logger.warning("⚠ Processed data failed validation!")
+                 logger.warning("Continuing pipeline, but data quality may be low")
+             elif processed_validation['status'] == 'WARNING':
+                 logger.warning("⚠ Processed data requires attention")
+ 
+             # Step 10: Data splitting
+             logger.info("\n" + "="*80)
+             logger.info("STEP 10: DATA SPLITTING")
+             logger.info("="*80)
+ 
+             train_data, val_data, test_data = self.data_splitter.split(
+                 data,
+                 method=self.config.split_method,
+                 test_size=self.config.test_size,
+                 validation_size=self.config.validation_size
+             )
+ 
+             self.train_data = train_data
+             self.val_data = val_data
+             self.test_data = test_data
+ 
+             self.results['data_splitting'] = self.data_splitter.split_info
+ 
+             # Step 11: Data scaling
+             logger.info("\n" + "="*80)
+             logger.info("STEP 11: DATA SCALING")
+             logger.info("="*80)
+ 
+             # Scale training data
+             train_data_scaled = self.data_scaler.fit_transform(
+                 train_data,
+                 method=self.config.scaling_method,
+                 target_col=self.config.target_column,
+                 fit_on_train=True
+             )
+ 
+             # Apply same scaling to validation and test data
+             val_data_scaled = self.data_scaler.transform(val_data)
+             test_data_scaled = self.data_scaler.transform(test_data)
+ 
+             self.train_data = train_data_scaled
+             self.val_data = val_data_scaled
+             self.test_data = test_data_scaled
+ 
+             self.results['data_scaling'] = self.data_scaler.get_report()
+ 
+             # Step 12: Feature selection
+             logger.info("\n" + "="*80)
+             logger.info("STEP 12: FEATURE SELECTION")
+             logger.info("="*80)
+ 
+             if len(train_data_scaled.columns) > 5:
+                 # Select features on training data
+                 train_data_selected = self.feature_selector.select(
+                     train_data_scaled,
+                     method=self.config.feature_selection_method,
+                     n_features=min(self.config.max_features, len(train_data_scaled.columns) - 1)
+                 )
+ 
+                 # Save selected features
+                 selected_features = self.feature_selector.selected_features
+ 
+                 # Apply same selection to validation and test data
+                 features_to_keep = selected_features + [self.config.target_column]
+                 features_to_keep = [f for f in features_to_keep if f in val_data_scaled.columns]
+ 
+                 if len(features_to_keep) > 1:
+                     self.train_data = train_data_scaled[features_to_keep].copy()
+                     self.val_data = val_data_scaled[features_to_keep].copy()
+                     self.test_data = test_data_scaled[features_to_keep].copy()
+                 else:
+                     logger.warning("Failed to select features, using all")
+ 
+                 self.results['feature_selection'] = self.feature_selector.get_report()
+             else:
+                 logger.info("Skipped: insufficient features for selection")
+                 self.results['feature_selection'] = {'skipped': 'insufficient features'}
+ 
+             # Step 13: Final validation
+             logger.info("\n" + "="*80)
+             logger.info("STEP 13: FINAL VALIDATION")
+             logger.info("="*80)
+ 
+             # Combine all data for final validation
+             all_processed_data = pd.concat([self.train_data, self.val_data, self.test_data])
+ 
+             final_validation = self.data_validator.validate(
+                 all_processed_data, stage='final', detailed=True
+             )
+             self.results['final_validation'] = final_validation
+ 
+             self.processed_data = all_processed_data
+             self.is_fitted = True
+ 
+             # Step 14: Additional multicollinearity cleaning
+             logger.info("\n" + "="*80)
+             logger.info("STEP 14: ADDITIONAL MULTICOLLINEARITY CLEANING")
+             logger.info("="*80)
+ 
+             # Remove temporal features with extreme VIF
+             self.processed_data = self._remove_extreme_vif_features(self.processed_data)
+             self.train_data = self.train_data[self.processed_data.columns]
+             self.val_data = self.val_data[self.processed_data.columns]
+             self.test_data = self.test_data[self.processed_data.columns]
+ 
+             # Step 15: Create visualisations and reports
+             logger.info("\n" + "="*80)
+             logger.info("STEP 15: CREATING REPORTS AND VISUALISATIONS")
+             logger.info("="*80)
+ 
+             if create_reports:
+                 self.create_all_reports()
+                 self.create_all_visualisations()
+ 
+             # Calculate execution time
+             execution_time = (datetime.now() - start_time).total_seconds()
+ 
+             # Save final results
+             self.results['pipeline_execution'] = {
+                 'start_time': start_time.strftime('%Y-%m-%d %H:%M:%S'),
+                 'end_time': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
+                 'execution_time_seconds': execution_time,
+                 'success': True,
+                 'stages_completed': 15
+             }
+ 
+             # Save configuration and results
+             self.save_pipeline_results()
+ 
+             logger.info("\n" + "="*80)
+             logger.info("ENHANCED PIPELINE SUCCESSFULLY COMPLETED!")
+             logger.info("="*80)
+             logger.info(f"Execution time: {execution_time:.2f} seconds")
+             logger.info(f"Initial data size: {self.results['data_loading']['shape']}")
+             logger.info(f"Final data size: {list(self.processed_data.shape)}")
+             logger.info(f"Data quality: {final_validation['overall_score']}/100")
+             logger.info(f"Status: {final_validation['status']}")
+             logger.info(f"Training data: {len(self.train_data)} records")
+             logger.info(f"Features in final set: {len(self.train_data.columns)}")
+ 
+             return self.processed_data
+ 
+         except Exception as e:
+             logger.error(f"✗ Pipeline error: {e}")
+             logger.error(traceback.format_exc())
+ 
+             self.results['pipeline_execution'] = {
+                 'success': False,
+                 'error': str(e),
+                 'traceback': traceback.format_exc()
+             }
+ 
+             # Save partial results
+             self.save_pipeline_results()
+ 
+             return None
+ 
+     def _remove_extreme_vif_features(self, data: pd.DataFrame) -> pd.DataFrame:
+         """Remove features with extreme VIF"""
+         data_clean = data.copy()
+ 
+         # Identify features with extreme VIF for removal
+         extreme_vif_features = [
+             'year', 'day', 'dayofyear', 'days_from_start',
+             'raskhodvoda_zscore'  # Usually has extreme VIF
+         ]
+ 
+         # Remove only those present in data
+         features_to_remove = [f for f in extreme_vif_features if f in data_clean.columns]
+ 
+         if features_to_remove:
+             logger.info(f"Removing features with extreme VIF: {features_to_remove}")
+             data_clean = data_clean.drop(columns=features_to_remove)
+ 
+         return data_clean
+ 
+     def create_all_reports(self) -> None:
+         """Create all reports"""
+         logger.info("Creating reports...")
+ 
+         # 1. Save validation results
+         for stage in ['raw', 'processed', 'final']:
+             if stage in self.data_validator.validation_results:
+                 self.data_validator.save_report(stage)
+ 
+         # 2. Save plots information
+         self.visualisation_manager.save_plots_info()
+ 
+         # 3. Create summary report
+         self.create_summary_report()
+ 
+         logger.info("✓ All reports created")
+ 
+     def create_all_visualisations(self) -> None:
+         """Create all visualisations"""
+         logger.info("Creating visualisations...")
+ 
+         if self.processed_data is not None:
+             # 1. Summary dashboard
+             preprocessing_stages = {
+                 'Loading': self.results['data_loading']['shape'][1] if 'data_loading' in self.results else 0,
+                 'After cleaning': len(self.processed_data.columns),
+                 'Features created': self.feature_engineer.feature_info.get('features_created', 0),
+                 'Features selected': len(self.feature_selector.selected_features) if hasattr(self.feature_selector, 'selected_features') else 0
+             }
+ 
+             self.visualisation_manager.create_summary_dashboard(
+                 self.processed_data,
+                 preprocessing_stages
+             )
+ 
+         logger.info("✓ All visualisations created")
+ 
+     def create_summary_report(self) -> None:
+         """Create summary report"""
+         report = {
+             'pipeline_summary': {
+                 'config': self.config.to_dict(),
+                 'execution': self.results.get('pipeline_execution', {}),
+                 'data_statistics': {
+                     'initial_shape': self.results.get('data_loading', {}).get('shape', []),
+                     'final_shape': list(self.processed_data.shape) if self.processed_data is not None else [],
+                     'target_column': self.config.target_column,
+                     'features_created': self.feature_engineer.feature_info.get('features_created', 0),
+                     'features_selected': len(self.feature_selector.selected_features) if hasattr(self.feature_selector, 'selected_features') else 0
+                 }
+             },
+             'validation_summary': {},
+             'quality_metrics': {}
+         }
+ 
+         # Add validation results
+         for stage in ['raw', 'processed', 'final']:
+             if stage in self.data_validator.validation_results:
+                 stage_results = self.data_validator.validation_results[stage]
+                 report['validation_summary'][stage] = {
+                     'status': stage_results.get('status'),
+                     'score': stage_results.get('overall_score'),
+                     'issues_count': sum(len(issues) for issues in stage_results.get('issues', {}).values()),
+                     'checks_passed': sum(1 for check in stage_results.get('basic_checks', {}).values()
+                                          if check.get('passed', False))
+                 }
+ 
+         # Save report
+         report_path = f'{self.config.results_dir}/reports/pipeline_summary.json'
+ 
+         with open(report_path, 'w', encoding='utf-8') as f:
+             json.dump(report, f, indent=4, ensure_ascii=False)
+ 
+         logger.info(f"✓ Summary report saved: {report_path}")
+ 
+     def save_pipeline_results(self) -> None:
+         """Save all pipeline results"""
+         # Save configuration
+         self.config.save()
+ 
+         # Save data
+         if self.processed_data is not None:
+             # Save processed data
+             data_path = f'{self.config.results_dir}/processed_data/processed_data.csv'
+             self.processed_data.to_csv(data_path)
+             logger.info(f"✓ Processed data saved: {data_path}")
+ 
+         # Save split data
+         if self.train_data is not None:
+             self.train_data.to_csv(f'{self.config.results_dir}/processed_data/train_data.csv')
+             self.val_data.to_csv(f'{self.config.results_dir}/processed_data/val_data.csv')
+             self.test_data.to_csv(f'{self.config.results_dir}/processed_data/test_data.csv')
+ 
+     def get_final_data_for_modelling(self) -> Dict[str, Any]:
+         """Prepare data for modelling"""
+         if not self.is_fitted:
+             logger.warning("Pipeline not executed, data not ready")
+             return {}
+ 
+         return {
+             'X_train': self.train_data.drop(columns=[self.config.target_column]),
+             'y_train': self.train_data[self.config.target_column],
+             'X_val': self.val_data.drop(columns=[self.config.target_column]),
+             'y_val': self.val_data[self.config.target_column],
+             'X_test': self.test_data.drop(columns=[self.config.target_column]),
+             'y_test': self.test_data[self.config.target_column],
+             'feature_names': self.train_data.drop(columns=[self.config.target_column]).columns.tolist(),
+             'scaler': self.data_scaler,
+             'feature_selector': self.feature_selector,
+             'results': self.results
+         }
+ 
+ 
+ # ============================================
+ # QUICK LAUNCH FUNCTION
+ # ============================================
+ def run_enhanced_preprocessing(
+     config_path: Optional[str] = None,
+     data_path: Optional[str] = None,
+     use_synthetic: bool = False,
+     save_results: bool = True
+ ) -> EnhancedDataPreprocessingPipeline:
+     """
+     Quick launch function for enhanced pipeline
+ 
+     Parameters:
+     -----------
+     config_path : str, optional
+         Path to configuration file
+     data_path : str, optional
+         Path to data
+     use_synthetic : bool
+         Use synthetic data
+     save_results : bool
+         Save results
+ 
+     Returns:
+     --------
+     EnhancedDataPreprocessingPipeline
+         Pipeline object with results
+     """
+     # Load or create configuration
+     if config_path and os.path.exists(config_path):
+         config = Config.load(config_path)
+         logger.info(f"Configuration loaded from {config_path}")
+     else:
+         config = Config()
+         logger.info("Using default configuration")
+ 
+     # Update data path if specified
+     if data_path:
+         config.data_path = data_path
+ 
+     # Create and run pipeline
+     pipeline = EnhancedDataPreprocessingPipeline(config)
+ 
+     pipeline.run_full_pipeline(
+         data_path=data_path,
+         use_synthetic=use_synthetic,
+         save_intermediate=save_results,
+         create_reports=save_results
+     )
+ 
+     return pipeline
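Taken together, `get_final_data_for_modelling` hands back ready-to-use train/validation/test matrices plus the fitted scaler and selector. A launch sketch (illustrative, not from the commit), assuming the default `Config` creates its results directories and that scikit-learn from requirements.txt is installed:

```python
from sklearn.linear_model import LinearRegression

from pipeline.main_pipeline import run_enhanced_preprocessing

# Synthetic mode exercises every pipeline stage without needing a real CSV
pipeline = run_enhanced_preprocessing(use_synthetic=True)

if pipeline.is_fitted:
    bundle = pipeline.get_final_data_for_modelling()
    model = LinearRegression().fit(bundle["X_train"], bundle["y_train"])
    print("validation R^2:", model.score(bundle["X_val"], bundle["y_val"]))
```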
requirements.txt CHANGED
@@ -1,9 +1,100 @@
- streamlit
- pandas
- numpy
- plotly
- joblib
- huggingface_hub
- scikit-learn
- openpyxl
- xgboost
+ absl-py==2.3.1
+ altair==6.0.0
+ anyio==4.11.0
+ astunparse==1.6.3
+ attrs==25.4.0
+ blinker==1.9.0
+ cachetools==6.2.4
+ certifi==2025.11.12
+ charset-normalizer==3.4.4
+ click==8.3.1
+ colorama==0.4.6
+ contourpy==1.3.2
+ cycler==0.12.1
+ et_xmlfile==2.0.0
+ filelock==3.20.1
+ flatbuffers==25.12.19
+ fonttools==4.61.1
+ fsspec==2025.12.0
+ gast==0.7.0
+ gensim==4.4.0
+ gitdb==4.0.12
+ GitPython==3.1.46
+ google-pasta==0.2.0
+ grpcio==1.76.0
+ h5py==3.15.1
+ h11==0.16.0
+ hf-xet==1.2.0
+ httpcore==1.0.9
+ httpx==0.28.1
+ huggingface_hub==1.1.2
+ idna==3.11
+ Jinja2==3.1.6
+ joblib==1.5.3
+ jsonschema==4.25.1
+ jsonschema-specifications==2025.9.1
+ keras==3.13.0
+ kiwisolver==1.4.9
+ libclang==18.1.1
+ Markdown==3.10
+ markdown-it-py==4.0.0
+ MarkupSafe==3.0.3
+ matplotlib==3.10.8
+ mdurl==0.1.2
+ ml_dtypes==0.5.4
+ mpmath==1.3.0
+ namex==0.1.0
+ narwhals==2.14.0
+ networkx==3.6.1
+ numpy==2.4.0
+ openpyxl==3.1.5
+ opt_einsum==3.4.0
+ optree==0.18.0
+ packaging==25.0
+ pandas==2.3.3
+ patsy==1.0.2
+ pillow==12.0.0
+ plotly==6.5.0
+ protobuf==6.33.2
+ pyarrow==22.0.0
+ pydeck==0.9.1
+ Pygments==2.19.2
+ pyparsing==3.3.1
+ pyperclip==1.11.0
+ python-dateutil==2.9.0.post0
+ pytz==2025.2
+ PyYAML==6.0.3
+ referencing==0.37.0
+ requests==2.32.5
+ rich==14.2.0
+ rpds-py==0.30.0
+ scikit-learn==1.8.0
+ scipy==1.16.3
+ seaborn==0.13.2
+ setuptools==80.9.0
+ shellingham==1.5.4
+ six==1.17.0
+ smart_open==7.5.0
+ smmap==5.0.2
+ sniffio==1.3.1
+ statsmodels==0.14.6
+ streamlit==1.52.2
+ sympy==1.14.0
+ tenacity==9.1.2
+ tensorboard==2.20.0
+ tensorboard-data-server==0.7.2
+ tensorflow==2.20.0
+ termcolor==3.3.0
+ threadpoolctl==3.6.0
+ toml==0.10.2
+ torch==2.9.1
+ tornado==6.5.4
+ tqdm==4.67.1
+ typer-slim==0.20.0
+ typing_extensions==4.15.0
+ tzdata==2025.3
+ urllib3==2.6.2
+ watchdog==6.0.0
+ Werkzeug==3.1.4
+ wheel==0.45.1
+ wrapt==2.0.1
run_pipeline.py ADDED
@@ -0,0 +1,62 @@
+ # ============================================
+ # RUN
+ # ============================================
+ from config.config import Config
+ from pipeline.main_pipeline import EnhancedDataPreprocessingPipeline
+ import pandas as pd
+ 
+ if __name__ == "__main__":
+     """
+     Pipeline execution
+     """
+ 
+     # Configuration with reasonable parameters
+     config = Config(
+         data_path='temp_data.csv',
+         results_dir='enhanced_preprocessing_results',
+         target_column='raskhodvoda',
+         start_year=1970,
+         end_year=1990,
+         max_lags=5,
+         seasonal_period=365,
+         rolling_windows=[7, 30, 90],
+         expanding_windows=[30, 90],
+         test_size=0.2,
+         validation_size=0.1,
+         scaling_method='robust',
+         feature_selection_method='correlation',
+         max_features=20,
+         missing_threshold=0.3,
+         outlier_method='iqr',
+         enable_validation=True
+     )
+ 
+     # Run enhanced pipeline
+     pipeline = EnhancedDataPreprocessingPipeline(config)
+     processed_data = pipeline.run_full_pipeline(
+         use_synthetic=False,
+         save_intermediate=True,
+         create_reports=True
+     )
+ 
+     if processed_data is not None:
+         print("\n" + "="*80)
+         print("ENHANCED PIPELINE SUCCESSFULLY COMPLETED!")
+         print("="*80)
+         print(f"Final data size: {processed_data.shape}")
+         print(f"Columns: {list(processed_data.columns)}")
+ 
+         # Get modelling data
+         modelling_data = pipeline.get_final_data_for_modelling()
+ 
+         if modelling_data:
+             print("\nModelling data ready:")
+             print(f"  X_train: {modelling_data['X_train'].shape}")
+             print(f"  X_val: {modelling_data['X_val'].shape}")
+             print(f"  X_test: {modelling_data['X_test'].shape}")
+             print(f"  Features: {len(modelling_data['feature_names'])}")
+ 
+         # Save final data (forward slashes keep the path portable)
+         processed_data.to_csv(
+             'enhanced_preprocessing_results/processed_data/enhanced_final_processed_data.csv',
+             index=True if isinstance(processed_data.index, pd.DatetimeIndex) else False
+         )
+         print("\n✓ Final data saved to 'enhanced_final_processed_data.csv'")
scaling/__init__.py ADDED
File without changes
scaling/data_scaler.py ADDED
@@ -0,0 +1,634 @@
+ # ============================================
+ # CLASS 10: DATA SCALING
+ # ============================================
+ import logging
+ from typing import Dict, List, Optional, Tuple
+ import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ from config.config import Config
+ 
+ logger = logging.getLogger(__name__)
+ 
+ 
+ class DataScaler:
+     """Class for data scaling and normalisation"""
+ 
+     def __init__(self, config: Config):
+         """
+         Initialise scaler
+ 
+         Parameters:
+         -----------
+         config : Config
+             Experiment configuration
+         """
+         self.config = config
+         self.scalers = {}
+         self.scaling_info = {}
+         self.transforms_applied = {}
+ 
+     def fit_transform(
+         self,
+         data: pd.DataFrame,
+         method: str = None,
+         columns: List[str] = None,
+         target_col: Optional[str] = None,
+         fit_on_train: bool = True,
+         **kwargs
+     ) -> pd.DataFrame:
+         """
+         Scale data
+ 
+         Parameters:
+         -----------
+         data : pd.DataFrame
+             Input data
+         method : str, optional
+             Scaling method. If None, uses the configuration value.
+         columns : List[str], optional
+             List of columns to scale. If None, uses all numeric columns.
+         target_col : str, optional
+             Target variable (not scaled by default)
+         fit_on_train : bool
+             Whether to save scaling parameters for applying to new data
+         **kwargs : dict
+             Additional parameters for the method
+ 
+         Returns:
+         --------
+         pd.DataFrame
+             Scaled data
+         """
+         logger.info("\n" + "="*80)
+         logger.info("DATA SCALING")
+         logger.info("="*80)
+ 
+         method = method or self.config.scaling_method
+         data_scaled = data.copy()
+ 
+         if columns is None:
+             # Select all numeric columns except the target
+             numeric_cols = data_scaled.select_dtypes(include=[np.number]).columns
+             if target_col and target_col in numeric_cols:
+                 columns = [col for col in numeric_cols if col != target_col]
+             else:
+                 columns = list(numeric_cols)
+ 
+         logger.info(f"Scaling method: {method}")
+         logger.info(f"Columns to process: {len(columns)}")
+ 
+         # Apply scaling
+         for col in columns:
+             if col in data_scaled.columns:
+                 try:
+                     # Check feature type
+                     series = data_scaled[col].dropna()
+ 
+                     # Special handling for different feature types
+                     if self._is_binary_feature(series):
+                         logger.debug(f"  {col}: binary feature, scaling not applied")
+                         scaler_info = {
+                             'method': 'none',
+                             'scaler_type': 'binary',
+                             'original_values': sorted(series.unique().tolist()),
+                             'note': 'binary feature, no scaling applied'
+                         }
+                         self.scaling_info[col] = scaler_info
+ 
+                         if fit_on_train:
+                             self.scalers[col] = scaler_info
+ 
+                     elif self._is_categorical_feature(series):
+                         logger.debug(f"  {col}: categorical feature, using min-max")
+                         scaled_series, scaler_info = self._apply_scaling(
+                             data_scaled[col], 'minmax', fit_on_train, **kwargs
+                         )
+                         data_scaled[col] = scaled_series
+                         self.scaling_info[col] = scaler_info
+ 
+                         if fit_on_train:
+                             if scaler_info.get('scaler_type') == 'sklearn':
+                                 self.scalers[col] = scaler_info['scaler_object']
+                             else:
+                                 self.scalers[col] = scaler_info
+ 
+                     else:
+                         # Regular scaling for continuous features
+                         scaled_series, scaler_info = self._apply_scaling(
+                             data_scaled[col], method, fit_on_train, **kwargs
+                         )
+                         data_scaled[col] = scaled_series
+                         self.scaling_info[col] = scaler_info
+ 
+                         if fit_on_train:
+                             if scaler_info.get('scaler_type') == 'sklearn':
+                                 self.scalers[col] = scaler_info['scaler_object']
+                             else:
+                                 self.scalers[col] = scaler_info
+ 
+                 except Exception as e:
+                     logger.warning(f"Error processing column {col}: {e}")
+                     # Save error information
+                     self.scaling_info[col] = {
+                         'method': 'error',
+                         'error': str(e),
+                         'scaler_type': 'none'
+                     }
+ 
+         logger.info(f"✓ Data processed using {method} method")
+ 
+         # Visualisation of results
+         if self.config.save_plots and columns:
+             self._plot_scaling_results(data, data_scaled, columns, method)
+ 
+         return data_scaled
+ 
+     def _is_binary_feature(self, series: pd.Series) -> bool:
+         """Check if feature is binary"""
+         unique_values = series.dropna().unique()
+         return len(unique_values) == 2 and set(unique_values).issubset({0, 1})
+ 
+     def _is_categorical_feature(self, series: pd.Series, max_categories: int = 10) -> bool:
+         """Check if feature is categorical"""
+         unique_values = series.dropna().unique()
+         return len(unique_values) <= max_categories and series.dtype in ['int64', 'float64']
+ 
+     def _apply_scaling(
+         self,
+         series: pd.Series,
+         method: str,
+         fit_on_train: bool,
+         **kwargs
+     ) -> Tuple[pd.Series, Dict]:
+         """Apply a specific scaling method"""
+         series_clean = series.dropna()
+ 
+         if len(series_clean) == 0:
+             return series, {
+                 'method': 'none',
+                 'scaler_type': 'none',
+                 'error': 'all values are NaN'
+             }
+ 
+         scaler_info = {
+             'method': method,
+             'scaler_type': 'simple',
+             'original_mean': float(series_clean.mean()),
+             'original_std': float(series_clean.std()),
+             'original_min': float(series_clean.min()),
+             'original_max': float(series_clean.max()),
+             'scaler': None,
+             'scaler_object': None
+         }
+ 
+         try:
+             if method == 'standard':
+                 # Standardisation (z-score normalisation)
+                 mean = series_clean.mean()
+                 std = series_clean.std()
+ 
+                 if std > 0:
+                     series_scaled = (series - mean) / std
+                     scaler_info['scaler'] = {'mean': float(mean), 'std': float(std)}
+                 else:
+                     series_scaled = series - mean  # If std = 0, just centre
+                     scaler_info['scaler'] = {'mean': float(mean), 'std': 0}
+ 
+             elif method == 'minmax':
+                 # Min-max normalisation
+                 min_val = series_clean.min()
+                 max_val = series_clean.max()
+ 
+                 if max_val > min_val:
+                     series_scaled = (series - min_val) / (max_val - min_val)
+                     scaler_info['scaler'] = {'min': float(min_val), 'max': float(max_val)}
+                 else:
+                     series_scaled = series - min_val  # If all values are equal
+                     scaler_info['scaler'] = {'min': float(min_val), 'max': float(min_val)}
+ 
+             elif method == 'robust':
+                 # Robust scaling (outlier resistant)
+                 # Check there are enough values for quartile calculation
+                 if len(series_clean) >= 4:
+                     median = series_clean.median()
+                     q1 = series_clean.quantile(0.25)
+                     q3 = series_clean.quantile(0.75)
+                     iqr = q3 - q1
+ 
+                     if iqr > 0:
+                         series_scaled = (series - median) / iqr
+                         scaler_info['scaler'] = {
+                             'median': float(median),
+                             'q1': float(q1),
+                             'q3': float(q3),
+                             'iqr': float(iqr)
+                         }
+                     else:
+                         # If IQR = 0, use the standard deviation
+                         std = series_clean.std()
+                         if std > 0:
+                             series_scaled = (series - median) / std
+                             scaler_info['scaler'] = {'median': float(median), 'std': float(std)}
+                         else:
+                             series_scaled = series - median
+                             scaler_info['scaler'] = {'median': float(median), 'iqr': 0}
+                 else:
+                     # If there is insufficient data, fall back to standardisation
+                     mean = series_clean.mean()
+                     std = series_clean.std()
+                     if std > 0:
+                         series_scaled = (series - mean) / std
+                         scaler_info['scaler'] = {'mean': float(mean), 'std': float(std)}
+                         scaler_info['method'] = 'standard'  # Record the method actually used
+                     else:
+                         series_scaled = series - mean
+                         scaler_info['scaler'] = {'mean': float(mean), 'std': 0}
+                         scaler_info['method'] = 'standard'
+ 
+             elif method == 'log':
+                 # Logarithmic transformation
+                 min_val = series_clean.min()
+ 
+                 if min_val <= 0:
+                     shift = abs(min_val) + 1
+                     series_scaled = np.log(series + shift)
+                     scaler_info['scaler'] = {'shift': float(shift)}
+                 else:
+                     series_scaled = np.log(series)
+                     scaler_info['scaler'] = {'shift': 0}
+ 
+             elif method == 'boxcox':
+                 # Box-Cox transformation
+                 try:
+                     from scipy.stats import boxcox
+ 
+                     min_val = series_clean.min()
+                     if min_val <= 0:
+                         shift = abs(min_val) + 1
+                         series_to_transform = series + shift
+                     else:
+                         shift = 0
+                         series_to_transform = series
+ 
+                     transformed, lambda_val = boxcox(series_to_transform.dropna())
+ 
+                     # Write transformed values back to the non-NaN positions
+                     series_scaled = series.copy()
+                     valid_mask = series_to_transform.notna()
+                     series_scaled[valid_mask] = transformed
+ 
+                     scaler_info['scaler'] = {
+                         'lambda': float(lambda_val),
+                         'shift': float(shift)
+                     }
+ 
+                 except Exception as e:
+                     logger.warning(f"Box-Cox transformation failed for {series.name}: {e}")
+                     # Return the original series and record the failure
+                     series_scaled = series
+                     scaler_info['method'] = 'none'
+                     scaler_info['scaler_type'] = 'none'
+                     scaler_info['error'] = str(e)
+ 
+             elif method == 'quantile':
+                 # Quantile transformation (rank-based)
+                 try:
+                     from sklearn.preprocessing import QuantileTransformer
+ 
+                     qt = QuantileTransformer(
+                         n_quantiles=kwargs.get('n_quantiles', min(100, len(series_clean))),
+                         output_distribution=kwargs.get('output_distribution', 'normal'),
+                         random_state=kwargs.get('random_state', 42)
+                     )
+ 
+                     series_reshaped = series.values.reshape(-1, 1)
+                     series_scaled_values = qt.fit_transform(series_reshaped)
+                     series_scaled = pd.Series(series_scaled_values.flatten(), index=series.index)
+ 
+                     scaler_info['scaler_type'] = 'sklearn'
+                     scaler_info['scaler_object'] = qt
+ 
+                 except Exception as e:
+                     logger.warning(f"Quantile transform failed for {series.name}: {e}")
+                     series_scaled = series
+                     scaler_info['method'] = 'none'
+                     scaler_info['scaler_type'] = 'none'
+                     scaler_info['error'] = str(e)
+ 
+             elif method == 'power':
+                 # Power transform (Yeo-Johnson)
+                 try:
+                     from sklearn.preprocessing import PowerTransformer
+ 
+                     pt = PowerTransformer(method='yeo-johnson', standardize=True)
+ 
+                     series_reshaped = series.values.reshape(-1, 1)
+                     series_scaled_values = pt.fit_transform(series_reshaped)
+                     series_scaled = pd.Series(series_scaled_values.flatten(), index=series.index)
+ 
+                     scaler_info['scaler_type'] = 'sklearn'
+                     scaler_info['scaler_object'] = pt
+ 
+                 except Exception as e:
+                     logger.warning(f"Power transform failed for {series.name}: {e}")
+                     series_scaled = series
+                     scaler_info['method'] = 'none'
+                     scaler_info['scaler_type'] = 'none'
+                     scaler_info['error'] = str(e)
+ 
+             elif method == 'none':
+                 # No scaling
+                 series_scaled = series
+                 scaler_info['method'] = 'none'
+                 scaler_info['scaler_type'] = 'none'
+ 
+             else:
+                 logger.warning(f"Unknown scaling method: {method}, using standardisation")
+                 return self._apply_scaling(series, 'standard', fit_on_train, **kwargs)
+ 
+             # Add post-scaling statistics
+             scaled_clean = series_scaled.dropna()
+             if len(scaled_clean) > 0:
+                 scaler_info.update({
+                     'scaled_mean': float(scaled_clean.mean()),
+                     'scaled_std': float(scaled_clean.std()),
+                     'scaled_min': float(scaled_clean.min()),
+                     'scaled_max': float(scaled_clean.max()),
+                     'skewness_before': float(series_clean.skew()),
+                     'skewness_after': float(scaled_clean.skew()),
+                     'kurtosis_before': float(series_clean.kurtosis()),
+                     'kurtosis_after': float(scaled_clean.kurtosis())
+                 })
+ 
+             return series_scaled, scaler_info
+ 
+         except Exception as e:
+             logger.warning(f"Error applying method {method} for {series.name}: {e}")
+             return series, {
+                 'method': 'error',
+                 'scaler_type': 'none',
+                 'error': str(e)
+             }
+ 
+     def transform(
+         self,
+         data: pd.DataFrame,
+         columns: List[str] = None
+     ) -> pd.DataFrame:
+         """
+         Apply saved scaling to new data
+ 
+         Parameters:
+         -----------
+         data : pd.DataFrame
+             New data
+         columns : List[str], optional
+             List of columns to transform
+ 
+         Returns:
+         --------
+         pd.DataFrame
+             Transformed data
+         """
+         if not self.scalers:
+             logger.warning("Scalers not trained, use fit_transform first")
+             return data
+ 
+         data_transformed = data.copy()
+ 
+         if columns is None:
+             columns = [col for col in self.scalers.keys() if col in data_transformed.columns]
+ 
+         for col in columns:
+             if col in data_transformed.columns and col in self.scalers:
+                 try:
+                     scaler_info = self.scaling_info.get(col, {})
+                     scaler_data = self.scalers[col]
+                     method = scaler_info.get('method', 'unknown')
+ 
+                     # Binary features are passed through unchanged
+                     if method == 'none' and scaler_info.get('scaler_type') == 'binary':
+                         continue
+ 
+                     # Skip columns that failed during fitting
+                     if method == 'error':
+                         continue
+ 
+                     if isinstance(scaler_data, dict) and 'scaler' in scaler_data:
+                         scaler_params = scaler_data['scaler']
+ 
+                         if method == 'standard':
+                             mean = scaler_params.get('mean', 0)
+                             std = scaler_params.get('std', 1)
+                             if std > 0:
+                                 data_transformed[col] = (data_transformed[col] - mean) / std
+ 
+                         elif method == 'minmax':
+                             min_val = scaler_params.get('min', 0)
+                             max_val = scaler_params.get('max', 1)
+                             if max_val > min_val:
+                                 data_transformed[col] = (data_transformed[col] - min_val) / (max_val - min_val)
+ 
+                         elif method == 'robust':
+                             median = scaler_params.get('median', 0)
+                             iqr = scaler_params.get('iqr', 1)
+                             if iqr > 0:
+                                 data_transformed[col] = (data_transformed[col] - median) / iqr
+                             else:
+                                 std = scaler_params.get('std', 1)
+                                 if std > 0:
+                                     data_transformed[col] = (data_transformed[col] - median) / std
+ 
+                     elif hasattr(scaler_data, 'transform'):
+                         # For sklearn objects
+                         from sklearn.base import BaseEstimator
+                         if isinstance(scaler_data, BaseEstimator):
+                             try:
+                                 transformed = scaler_data.transform(
+                                     data_transformed[[col]].values.reshape(-1, 1)
+                                 ).flatten()
+                                 data_transformed[col] = transformed
+                             except Exception as e:
+                                 logger.warning(f"Error in sklearn transformation for {col}: {e}")
+ 
+                 except Exception as e:
+                     logger.warning(f"Error transforming column {col}: {e}")
+ 
+         return data_transformed
+ 
+     def inverse_transform(
+         self,
+         data: pd.DataFrame,
+         columns: List[str] = None
+     ) -> pd.DataFrame:
+         """
+         Inverse transform scaled data
+ 
+         Parameters:
+         -----------
+         data : pd.DataFrame
+             Scaled data
+         columns : List[str], optional
+             List of columns for the inverse transform
+ 
+         Returns:
+         --------
+         pd.DataFrame
+             Data in the original scale
+         """
+         if not self.scalers:
+             logger.warning("Scalers not trained")
+             return data
+ 
+         data_inverse = data.copy()
+ 
+         if columns is None:
+             columns = [col for col in self.scalers.keys() if col in data_inverse.columns]
+ 
+         for col in columns:
+             if col in data_inverse.columns and col in self.scalers:
+                 try:
+                     scaler_info = self.scaling_info.get(col, {})
+                     scaler_data = self.scalers[col]
+                     method = scaler_info.get('method', 'unknown')
+ 
+                     # Binary and categorical features are passed through unchanged
+                     if method in ['none', 'error']:
+                         continue
+ 
+                     if isinstance(scaler_data, dict) and 'scaler' in scaler_data:
+                         scaler_params = scaler_data['scaler']
+ 
+                         if method == 'standard':
+                             mean = scaler_params.get('mean', 0)
+                             std = scaler_params.get('std', 1)
+                             data_inverse[col] = data_inverse[col] * std + mean
+ 
+                         elif method == 'minmax':
+                             min_val = scaler_params.get('min', 0)
+                             max_val = scaler_params.get('max', 1)
+                             if max_val > min_val:
+                                 data_inverse[col] = data_inverse[col] * (max_val - min_val) + min_val
+ 
+                         elif method == 'robust':
+                             median = scaler_params.get('median', 0)
+                             iqr = scaler_params.get('iqr', 1)
+                             if iqr > 0:
+                                 data_inverse[col] = data_inverse[col] * iqr + median
+                             else:
+                                 std = scaler_params.get('std', 1)
+                                 if std > 0:
+                                     data_inverse[col] = data_inverse[col] * std + median
+ 
+                     elif hasattr(scaler_data, 'inverse_transform'):
+                         # For sklearn objects
+                         from sklearn.base import BaseEstimator
+                         if isinstance(scaler_data, BaseEstimator):
+                             try:
+                                 inverse_transformed = scaler_data.inverse_transform(
+                                     data_inverse[[col]].values.reshape(-1, 1)
+                                 ).flatten()
+                                 data_inverse[col] = inverse_transformed
+                             except Exception as e:
+                                 logger.warning(f"Error in sklearn inverse transformation for {col}: {e}")
+ 
+                 except Exception as e:
+                     logger.warning(f"Error in inverse transformation for column {col}: {e}")
+ 
+         return data_inverse
+ 
+     def _plot_scaling_results(
+         self,
+         original_data: pd.DataFrame,
+         scaled_data: pd.DataFrame,
+         columns: List[str],
+         method: str
+     ) -> None:
+         """Visualise scaling results"""
+         # Limit the number of columns shown
+         cols_to_plot = [col for col in columns if col in original_data.columns and col in scaled_data.columns][:8]
+ 
+         if not cols_to_plot:
+             return
+ 
+         n_cols = 4
+         n_rows = (len(cols_to_plot) + n_cols - 1) // n_cols
+ 
+         # squeeze=False keeps the axes array 2-D even when there is a single row
+         fig, axes = plt.subplots(n_rows, n_cols * 2, figsize=(16, 4 * n_rows), squeeze=False)
+ 
+         for idx, col in enumerate(cols_to_plot):
+             row = idx // n_cols
+             col_idx = (idx % n_cols) * 2
+ 
+             # Distribution before scaling
+             axes[row, col_idx].hist(
+                 original_data[col].dropna(),
+                 bins=30,
+                 alpha=0.7,
+                 color='blue',
+                 density=True
+             )
+             axes[row, col_idx].set_title(f'{col} (before)', fontsize=10)
+             axes[row, col_idx].set_xlabel('Value')
+             axes[row, col_idx].set_ylabel('Density')
+             axes[row, col_idx].grid(True, alpha=0.3)
+ 
+             # Distribution after scaling
+             axes[row, col_idx + 1].hist(
+                 scaled_data[col].dropna(),
+                 bins=30,
+                 alpha=0.7,
+                 color='green',
+                 density=True
+             )
+             axes[row, col_idx + 1].set_title(f'{col} (after)', fontsize=10)
+             axes[row, col_idx + 1].set_xlabel('Scaled value')
+             axes[row, col_idx + 1].set_ylabel('Density')
+             axes[row, col_idx + 1].grid(True, alpha=0.3)
+ 
+         # Hide unused subplots
+         total_plots = n_rows * n_cols * 2
+         for idx in range(len(cols_to_plot) * 2, total_plots):
+             row = idx // (n_cols * 2)
+             col_idx = idx % (n_cols * 2)
+             axes[row, col_idx].set_visible(False)
+ 
+         plt.suptitle(f'Scaling results using {method} method', fontsize=14)
+         plt.tight_layout()
+         plt.savefig(
+             f'{self.config.results_dir}/plots/scaling_results.png',
+             dpi=300,
+             bbox_inches='tight'
+         )
+         plt.show()
+ 
+     def get_report(self) -> Dict:
+         """Get scaling report"""
+         summary = {
+             'total_columns': len(self.scaling_info),
+             'methods_used': {},
+             'binary_features': [],
+             'categorical_features': [],
+             'continuous_features': [],
+             'errors': []
+         }
+ 
+         for col, info in self.scaling_info.items():
+             method = info.get('method', 'unknown')
+             if method not in summary['methods_used']:
+                 summary['methods_used'][method] = 0
+             summary['methods_used'][method] += 1
+ 
+             if method == 'none' and info.get('scaler_type') == 'binary':
+                 summary['binary_features'].append(col)
+             elif method in ['minmax', 'standard', 'robust']:
+                 summary['continuous_features'].append(col)
+             elif method == 'error':
+                 summary['errors'].append({
+                     'column': col,
+                     'error': info.get('error', 'unknown')
+                 })
+ 
+         return {
+             'summary': summary,
+             'details': self.scaling_info
+         }
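A minimal usage sketch for the new DataScaler, with hypothetical column names and a hand-rolled DataFrame; it assumes Config accepts results_dir, scaling_method and save_plots keyword arguments, mirroring how run_pipeline.py constructs it and how the class reads them:

    import numpy as np
    import pandas as pd
    from config.config import Config
    from scaling.data_scaler import DataScaler

    config = Config(results_dir='demo_results', scaling_method='robust', save_plots=False)
    train = pd.DataFrame({'flow': np.random.rand(100) * 50,
                          'is_flood': np.random.randint(0, 2, 100)})
    test = pd.DataFrame({'flow': np.random.rand(20) * 50,
                         'is_flood': np.random.randint(0, 2, 20)})

    scaler = DataScaler(config)
    train_scaled = scaler.fit_transform(train)        # fits and stores per-column parameters
    test_scaled = scaler.transform(test)              # reuses the parameters fitted on train
    restored = scaler.inverse_transform(test_scaled)  # back to the original scale
    print(scaler.get_report()['summary'])

Here is_flood is detected as binary and passed through unchanged, while flow gets the robust median/IQR scaling; fitting on train and reusing the parameters on test is what prevents leakage of test statistics into the scaler.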
splitting/__init__.py ADDED
File without changes
splitting/data_splitter.py ADDED
@@ -0,0 +1,403 @@
+ # ============================================
+ # CLASS 9: DATA SPLITTING
+ # ============================================
+ import logging
+ from datetime import datetime
+ from typing import Dict, Optional, Tuple
+ import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ from config.config import Config
+ 
+ logger = logging.getLogger(__name__)
+ 
+ 
+ class DataSplitter:
+     """Class for splitting data into train, validation and test sets"""
+ 
+     def __init__(self, config: Config):
+         """
+         Initialise data splitter
+ 
+         Parameters:
+         -----------
+         config : Config
+             Experiment configuration
+         """
+         self.config = config
+         self.split_info = {}
+         self.split_indices = {}
+         self.split_strategy = None
+ 
+     def split(
+         self,
+         data: pd.DataFrame,
+         test_size: Optional[float] = None,
+         validation_size: Optional[float] = None,
+         method: str = None,
+         random_state: int = 42,
+         **kwargs
+     ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
+         """
+         Split data into train, validation and test sets
+ 
+         Parameters:
+         -----------
+         data : pd.DataFrame
+             Input data
+         test_size : float, optional
+             Test set size. If None, uses the configuration value.
+         validation_size : float, optional
+             Validation set size. If None, uses the configuration value.
+         method : str, optional
+             Splitting method: 'time', 'random', 'expanding_window', 'sliding_window'
+         random_state : int
+             Seed for reproducibility
+         **kwargs : dict
+             Additional parameters for the method
+ 
+         Returns:
+         --------
+         Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]
+             Train, validation and test data
+         """
+         logger.info("\n" + "="*80)
+         logger.info("DATA SPLITTING")
+         logger.info("="*80)
+ 
+         test_size = test_size or self.config.test_size
+         validation_size = validation_size or self.config.validation_size
+         method = method or self.config.split_method
+ 
+         n = len(data)
+ 
+         logger.info(f"Total data: {n} records")
+         logger.info(f"Splitting method: {method}")
+         logger.info(f"Sizes: train={1-test_size-validation_size:.1%}, val={validation_size:.1%}, test={test_size:.1%}")
+ 
+         if method == 'time':
+             train_data, val_data, test_data = self._time_based_split(
+                 data, test_size, validation_size
+             )
+         elif method == 'random':
+             train_data, val_data, test_data = self._random_split(
+                 data, test_size, validation_size, random_state
+             )
+         elif method == 'expanding_window':
+             train_data, val_data, test_data = self._expanding_window_split(
+                 data, test_size, validation_size, **kwargs
+             )
+         elif method == 'sliding_window':
+             train_data, val_data, test_data = self._sliding_window_split(
+                 data, **kwargs
+             )
+         else:
+             logger.warning(f"Method {method} not supported, using time-based split")
+             train_data, val_data, test_data = self._time_based_split(
+                 data, test_size, validation_size
+             )
+ 
+         # Save splitting information
+         self._save_split_info(data, train_data, val_data, test_data, method)
+ 
+         # Output information
+         self._log_split_summary(train_data, val_data, test_data)
+ 
+         # Visualisation of the split
+         if self.config.save_plots:
+             self._plot_data_split(data, train_data, val_data, test_data)
+ 
+         return train_data, val_data, test_data
+ 
+     def _time_based_split(
+         self,
+         data: pd.DataFrame,
+         test_size: float,
+         validation_size: float
+     ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
+         """Time-based splitting preserving temporal order"""
+         n = len(data)
+ 
+         # Calculate set sizes
+         test_size_int = int(n * test_size)
+         val_size_int = int(n * validation_size)
+         train_size_int = n - test_size_int - val_size_int
+ 
+         # Split data
+         train_data = data.iloc[:train_size_int].copy()
+         val_data = data.iloc[train_size_int:train_size_int + val_size_int].copy()
+         test_data = data.iloc[train_size_int + val_size_int:].copy()
+ 
+         self.split_strategy = 'time_based'
+ 
+         return train_data, val_data, test_data
+ 
+     def _random_split(
+         self,
+         data: pd.DataFrame,
+         test_size: float,
+         validation_size: float,
+         random_state: int
+     ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
+         """Random data splitting"""
+         from sklearn.model_selection import train_test_split
+ 
+         # First split into train+val and test
+         train_val_data, test_data = train_test_split(
+             data,
+             test_size=test_size,
+             random_state=random_state,
+             shuffle=True
+         )
+ 
+         # Then split train+val into train and val
+         val_relative_size = validation_size / (1 - test_size)
+         train_data, val_data = train_test_split(
+             train_val_data,
+             test_size=val_relative_size,
+             random_state=random_state,
+             shuffle=True
+         )
+ 
+         self.split_strategy = 'random'
+ 
+         return train_data, val_data, test_data
+ 
+     def _expanding_window_split(
+         self,
+         data: pd.DataFrame,
+         test_size: float,
+         validation_size: float,
+         **kwargs
+     ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
+         """Expanding window split"""
+         n = len(data)
+ 
+         # Minimum initial window size
+         initial_window = kwargs.get('initial_window', max(100, int(n * 0.1)))
+ 
+         # Final set sizes
+         test_size_int = int(n * test_size)
+         val_size_int = int(n * validation_size)
+ 
+         # Determine boundaries
+         test_start = n - test_size_int
+         val_start = test_start - val_size_int
+ 
+         # For an expanding window, use all data up to val_start for training
+         train_data = data.iloc[:val_start].copy()
+         val_data = data.iloc[val_start:test_start].copy()
+         test_data = data.iloc[test_start:].copy()
+ 
+         self.split_strategy = 'expanding_window'
+ 
+         return train_data, val_data, test_data
+ 
+     def _sliding_window_split(
+         self,
+         data: pd.DataFrame,
+         **kwargs
+     ) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
+         """Sliding window split (for multiple train-val-test pairs)"""
+         window_size = kwargs.get('window_size', len(data) // 3)
+         step = kwargs.get('step', window_size // 2)
+ 
+         # For simplicity return a single split;
+         # in real scenarios this could return a list of splits
+         n = len(data)
+ 
+         train_end = n - window_size
+         val_end = train_end + window_size // 3
+         test_end = n
+ 
+         train_data = data.iloc[:train_end].copy()
+         val_data = data.iloc[train_end:val_end].copy()
+         test_data = data.iloc[val_end:].copy()
+ 
+         self.split_strategy = 'sliding_window'
+ 
+         return train_data, val_data, test_data
+ 
+     def _save_split_info(
+         self,
+         full_data: pd.DataFrame,
+         train_data: pd.DataFrame,
+         val_data: pd.DataFrame,
+         test_data: pd.DataFrame,
+         method: str
+     ) -> None:
+         """Save splitting information"""
+         n = len(full_data)
+ 
+         self.split_info = {
+             'method': method,
+             'strategy': self.split_strategy,
+             'train_size': len(train_data),
+             'val_size': len(val_data),
+             'test_size': len(test_data),
+             'train_percent': len(train_data) / n * 100,
+             'val_percent': len(val_data) / n * 100,
+             'test_percent': len(test_data) / n * 100,
+             'total_samples': n,
+             'features_count': len(full_data.columns),
+             'split_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
+         }
+ 
+         # Add temporal period information if available
+         if isinstance(full_data.index, pd.DatetimeIndex):
+             self.split_info.update({
+                 'train_period': {
+                     'start': train_data.index.min().strftime('%Y-%m-%d'),
+                     'end': train_data.index.max().strftime('%Y-%m-%d')
+                 },
+                 'val_period': {
+                     'start': val_data.index.min().strftime('%Y-%m-%d'),
+                     'end': val_data.index.max().strftime('%Y-%m-%d')
+                 },
+                 'test_period': {
+                     'start': test_data.index.min().strftime('%Y-%m-%d'),
+                     'end': test_data.index.max().strftime('%Y-%m-%d')
+                 }
+             })
+ 
+         # Save split indices
+         self.split_indices = {
+             'train': train_data.index.tolist(),
+             'val': val_data.index.tolist(),
+             'test': test_data.index.tolist()
+         }
+ 
+     def _log_split_summary(
+         self,
+         train_data: pd.DataFrame,
+         val_data: pd.DataFrame,
+         test_data: pd.DataFrame
+     ) -> None:
+         """Log splitting summary"""
+         logger.info("✓ Data split completed:")
+         logger.info(f"  Train: {len(train_data)} records ({self.split_info['train_percent']:.1f}%)")
+         logger.info(f"  Validation: {len(val_data)} records ({self.split_info['val_percent']:.1f}%)")
+         logger.info(f"  Test: {len(test_data)} records ({self.split_info['test_percent']:.1f}%)")
+ 
+         if 'train_period' in self.split_info:
+             logger.info("\nPeriods:")
+             logger.info(f"  Train: {self.split_info['train_period']['start']} - {self.split_info['train_period']['end']}")
+             logger.info(f"  Validation: {self.split_info['val_period']['start']} - {self.split_info['val_period']['end']}")
+             logger.info(f"  Test: {self.split_info['test_period']['start']} - {self.split_info['test_period']['end']}")
+ 
+         # Target variable statistics
+         target = self.config.target_column
+         if target in train_data.columns:
+             logger.info(f"\nTarget variable '{target}' statistics:")
+             logger.info(f"  Train: mean={train_data[target].mean():.2f}, std={train_data[target].std():.2f}")
+             logger.info(f"  Validation: mean={val_data[target].mean():.2f}, std={val_data[target].std():.2f}")
+             logger.info(f"  Test: mean={test_data[target].mean():.2f}, std={test_data[target].std():.2f}")
+ 
+     def _plot_data_split(
+         self,
+         full_data: pd.DataFrame,
+         train_data: pd.DataFrame,
+         val_data: pd.DataFrame,
+         test_data: pd.DataFrame
+     ) -> None:
+         """Visualise data splitting"""
+         fig, axes = plt.subplots(2, 2, figsize=(14, 10))
+ 
+         target = self.config.target_column
+ 
+         # 1. Time series with set highlighting
+         if target in full_data.columns and isinstance(full_data.index, pd.DatetimeIndex):
+             axes[0, 0].plot(train_data.index, train_data[target],
+                             label='Train', color='blue', alpha=0.7, linewidth=1)
+             axes[0, 0].plot(val_data.index, val_data[target],
+                             label='Validation', color='orange', alpha=0.7, linewidth=1)
+             axes[0, 0].plot(test_data.index, test_data[target],
+                             label='Test', color='red', alpha=0.7, linewidth=1)
+ 
+             axes[0, 0].set_title(f'Data Split: {target}')
+             axes[0, 0].set_xlabel('Date')
+             axes[0, 0].set_ylabel(target)
+             axes[0, 0].legend()
+             axes[0, 0].grid(True, alpha=0.3)
+ 
+         # 2. Yearly distribution
+         if isinstance(full_data.index, pd.DatetimeIndex):
+             full_data['year'] = full_data.index.year
+             train_data['year'] = train_data.index.year
+             val_data['year'] = val_data.index.year
+             test_data['year'] = test_data.index.year
+ 
+             years = sorted(full_data['year'].unique())
+             train_counts = [len(train_data[train_data['year'] == year]) for year in years]
+             val_counts = [len(val_data[val_data['year'] == year]) for year in years]
+             test_counts = [len(test_data[test_data['year'] == year]) for year in years]
+ 
+             x = np.arange(len(years))
+             width = 0.25
+ 
+             axes[0, 1].bar(x - width, train_counts, width, label='Train', color='blue', alpha=0.7)
+             axes[0, 1].bar(x, val_counts, width, label='Validation', color='orange', alpha=0.7)
+             axes[0, 1].bar(x + width, test_counts, width, label='Test', color='red', alpha=0.7)
+ 
+             axes[0, 1].set_title('Yearly Data Distribution')
+             axes[0, 1].set_xlabel('Year')
+             axes[0, 1].set_ylabel('Number of Records')
+             axes[0, 1].set_xticks(x)
+             axes[0, 1].set_xticklabels(years, rotation=45)
+             axes[0, 1].legend()
+             axes[0, 1].grid(True, alpha=0.3)
+ 
+             # Remove the helper columns added above
+             for df in [full_data, train_data, val_data, test_data]:
+                 if 'year' in df.columns:
+                     df.drop('year', axis=1, inplace=True)
+ 
+         # 3. Target variable distribution
+         if target in full_data.columns:
+             axes[1, 0].hist(train_data[target].dropna(), bins=30, alpha=0.5, label='Train', density=True)
+             axes[1, 0].hist(val_data[target].dropna(), bins=30, alpha=0.5, label='Validation', density=True)
+             axes[1, 0].hist(test_data[target].dropna(), bins=30, alpha=0.5, label='Test', density=True)
+ 
+             axes[1, 0].set_title(f'{target} Distribution Across Sets')
+             axes[1, 0].set_xlabel(target)
+             axes[1, 0].set_ylabel('Density')
+             axes[1, 0].legend()
+             axes[1, 0].grid(True, alpha=0.3)
+ 
+         # 4. Set statistics
+         if target in full_data.columns:
+             stats_data = []
+             for name, df in [('Train', train_data), ('Validation', val_data), ('Test', test_data)]:
+                 if target in df.columns:
+                     stats_data.append({
+                         'Dataset': name,
+                         'Mean': df[target].mean(),
+                         'Std': df[target].std(),
+                         'Min': df[target].min(),
+                         'Max': df[target].max()
+                     })
+ 
+             if stats_data:
+                 stats_df = pd.DataFrame(stats_data)
+                 stats_table = axes[1, 1].table(
+                     cellText=stats_df.round(2).values,
+                     colLabels=stats_df.columns,
+                     cellLoc='center',
+                     loc='center'
+                 )
+                 stats_table.auto_set_font_size(False)
+                 stats_table.set_fontsize(9)
+                 stats_table.scale(1, 1.5)
+                 axes[1, 1].axis('off')
+                 axes[1, 1].set_title('Set Statistics')
+ 
+         plt.suptitle(f'Data Splitting: {self.split_info["method"]} method', fontsize=14)
+         plt.tight_layout()
+         plt.savefig(
+             f'{self.config.results_dir}/plots/data_split.png',
+             dpi=300,
+             bbox_inches='tight'
+         )
+         plt.show()
+ 
+     def get_report(self) -> Dict:
+         """Get data splitting report"""
+         return self.split_info
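A minimal sketch of the new DataSplitter on a synthetic daily series (hypothetical values; it assumes Config accepts target_column, test_size, validation_size, split_method and save_plots, which split() and the logging/plotting helpers read above):

    import numpy as np
    import pandas as pd
    from config.config import Config
    from splitting.data_splitter import DataSplitter

    idx = pd.date_range('1970-01-01', periods=1000, freq='D')
    data = pd.DataFrame({'raskhodvoda': np.random.rand(1000) * 100}, index=idx)

    config = Config(target_column='raskhodvoda', test_size=0.2,
                    validation_size=0.1, split_method='time', save_plots=False)
    splitter = DataSplitter(config)
    train, val, test = splitter.split(data)  # chronological 70/10/20 split
    report = splitter.get_report()
    print(report['train_size'], report['val_size'], report['test_size'])  # 700 100 200

The 'time' method keeps temporal order, which avoids look-ahead leakage; 'random' shuffles and is only appropriate when rows are independent.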
stationarity/__init__.py ADDED
File without changes
stationarity/stationarity_checker.py ADDED
@@ -0,0 +1,631 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # ============================================
2
+ # CLASS 6: STATIONARITY ANALYSIS
3
+ # ============================================
4
+ from typing import Dict, Optional
5
+ from venv import logger
6
+ from config.config import Config
7
+ import pandas as pd
8
+ import numpy as np
9
+ import matplotlib.pyplot as plt
10
+ from statsmodels.tsa.stattools import adfuller, kpss, acf, pacf
11
+ from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
12
+
13
+ class StationarityChecker:
14
+ """Class for checking time series stationarity"""
15
+
16
+ def __init__(self, config: Config):
17
+ """
18
+ Initialise stationarity checker
19
+
20
+ Parameters:
21
+ -----------
22
+ config : Config
23
+ Experiment configuration
24
+ """
25
+ self.config = config
26
+ self.test_results = {}
27
+ self.transformed_series = {}
28
+ self.best_transformation = {}
29
+
30
+ def check(
31
+ self,
32
+ data: pd.DataFrame,
33
+ target_col: Optional[str] = None,
34
+ make_stationary: bool = True,
35
+ try_transformations: bool = True
36
+ ) -> Dict:
37
+ """
38
+ Check time series stationarity
39
+
40
+ Parameters:
41
+ -----------
42
+ data : pd.DataFrame
43
+ Input data
44
+ target_col : str, optional
45
+ Target variable. If None, uses configuration default.
46
+ make_stationary : bool
47
+ Transform series to stationary form
48
+ try_transformations : bool
49
+ Try various transformations to achieve stationarity
50
+
51
+ Returns:
52
+ --------
53
+ Dict
54
+ Stationarity test results
55
+ """
56
+ logger.info("\n" + "="*80)
57
+ logger.info("STATIONARITY ANALYSIS")
58
+ logger.info("="*80)
59
+
60
+ target_col = target_col or self.config.target_column
61
+
62
+ if target_col not in data.columns:
63
+ logger.error(f"Target variable '{target_col}' not found")
64
+ return {}
65
+
66
+ series = data[target_col].dropna()
67
+
68
+ if len(series) < 10:
69
+ logger.warning("Insufficient data for stationarity analysis")
70
+ return {}
71
+
72
+ # Perform analysis
73
+ results = self._perform_stationarity_tests(series, target_col)
74
+
75
+ # Save results
76
+ self.test_results[target_col] = results
77
+
78
+ # Visualisation
79
+ if self.config.save_plots:
80
+ self._plot_stationarity_analysis(data, target_col, results)
81
+
82
+ # Log results
83
+ self._log_test_results(target_col, results)
84
+
85
+ # Transform to stationary form
86
+ if make_stationary and not results['overall']['is_stationary']:
87
+ if try_transformations:
88
+ transformed_data = self._make_stationary(data, target_col, results)
89
+ if transformed_data is not None:
90
+ data = transformed_data
91
+
92
+ return results
93
+
94
+
95
+ def _perform_stationarity_tests(
96
+ self,
97
+ series: pd.Series,
98
+ target_col: str
99
+ ) -> Dict:
100
+ """Perform various stationarity tests"""
101
+ results = {
102
+ 'adf': self._adf_test(series),
103
+ 'kpss': self._kpss_test(series),
104
+ 'pp': self._pp_test(series),
105
+ 'hurst': self._hurst_exponent(series),
106
+ 'variance_ratio': self._variance_ratio_test(series),
107
+ 'overall': {}
108
+ }
109
+
110
+ # Determine overall stationarity
111
+ adf_stationary = results['adf'].get('is_stationary', False)
112
+ kpss_stationary = results['kpss'].get('is_stationary', False)
113
+ pp_stationary = results['pp'].get('is_stationary', False)
114
+
115
+ # Stationarity determination logic
116
+ if adf_stationary and kpss_stationary:
117
+ overall_stationary = True
118
+ confidence = 'high'
119
+ elif adf_stationary and not kpss_stationary:
120
+ overall_stationary = True # ADF more reliable for detecting stationarity
121
+ confidence = 'medium'
122
+ elif not adf_stationary and kpss_stationary:
123
+ overall_stationary = False # KPSS indicates non-stationarity
124
+ confidence = 'medium'
125
+ else:
126
+ overall_stationary = False
127
+ confidence = 'high'
128
+
129
+ results['overall'] = {
130
+ 'is_stationary': overall_stationary,
131
+ 'confidence': confidence,
132
+ 'recommendation': self._get_stationarity_recommendation(results)
133
+ }
134
+
135
+ return results
136
+
137
+ def _adf_test(self, series: pd.Series) -> Dict:
138
+ """Augmented Dickey-Fuller (ADF) test"""
139
+ try:
140
+ adf_result = adfuller(series, autolag='AIC')
141
+
142
+ return {
143
+ 'statistic': float(adf_result[0]),
144
+ 'pvalue': float(adf_result[1]),
145
+ 'critical_values': {k: float(v) for k, v in adf_result[4].items()},
146
+ 'is_stationary': adf_result[1] < 0.05,
147
+ 'used_lag': int(adf_result[2]),
148
+ 'nobs': int(adf_result[3])
149
+ }
150
+ except Exception as e:
151
+ logger.warning(f"ADF test failed: {e}")
152
+ return {
153
+ 'statistic': np.nan,
154
+ 'pvalue': np.nan,
155
+ 'critical_values': {},
156
+ 'is_stationary': False,
157
+ 'error': str(e)
158
+ }
159
+
160
+ def _kpss_test(self, series: pd.Series) -> Dict:
161
+ """KPSS test"""
162
+ try:
163
+ kpss_result = kpss(series, regression='c', nlags='auto')
164
+
165
+ return {
166
+ 'statistic': float(kpss_result[0]),
167
+ 'pvalue': float(kpss_result[1]),
168
+ 'critical_values': {k: float(v) for k, v in kpss_result[3].items()},
169
+ 'is_stationary': kpss_result[1] > 0.05, # KPSS: p > 0.05 indicates stationarity
170
+ 'used_lag': int(kpss_result[2])
171
+ }
172
+ except Exception as e:
173
+ logger.warning(f"KPSS test failed: {e}")
174
+ return {
175
+ 'statistic': np.nan,
176
+ 'pvalue': np.nan,
177
+ 'critical_values': {},
178
+ 'is_stationary': False,
179
+ 'error': str(e)
180
+ }
181
+
182
+ def _pp_test(self, series: pd.Series) -> Dict:
183
+ """Phillips-Perron test"""
184
+ try:
185
+ # Simplified PP test version
186
+ from statsmodels.tsa.stattools import PhillipsPerron
187
+
188
+ pp_result = PhillipsPerron(series)
189
+
190
+ return {
191
+ 'statistic': float(pp_result.stat),
192
+ 'pvalue': float(pp_result.pvalue),
193
+ 'critical_values': pp_result.critical_values,
194
+ 'is_stationary': pp_result.pvalue < 0.05
195
+ }
196
+ except:
197
+ # If statsmodels with PP test not available
198
+ return {
199
+ 'statistic': np.nan,
200
+ 'pvalue': np.nan,
201
+ 'critical_values': {},
202
+ 'is_stationary': False,
203
+ 'note': 'Phillips-Perron test not available'
204
+ }
205
+
206
+ def _hurst_exponent(self, series: pd.Series) -> Dict:
207
+ """Calculate Hurst exponent"""
208
+ try:
209
+ # Simplified Hurst exponent calculation
210
+ lags = range(2, min(100, len(series)//4))
211
+ tau = []
212
+
213
+ for lag in lags:
214
+ # Split series into subsequences of length lag
215
+ n = len(series) // lag
216
+ if n < 2:
217
+ continue
218
+
219
+ subseries = [series[i*lag:(i+1)*lag] for i in range(n)]
220
+ # Calculate R/S for each subsequence
221
+ rs_values = []
222
+ for sub in subseries:
223
+ if len(sub) > 1:
224
+ mean = np.mean(sub)
225
+ deviations = sub - mean
226
+ z = np.cumsum(deviations)
227
+ r = np.max(z) - np.min(z)
228
+ s = np.std(sub)
229
+ if s > 0:
230
+ rs_values.append(r / s)
231
+
232
+ if rs_values:
233
+ tau.append(np.mean(rs_values))
234
+
235
+ if len(tau) > 2:
236
+ # Linear regression in log coordinates
237
+ x = np.log(lags[:len(tau)])
238
+ y = np.log(tau)
239
+
240
+ if len(x) > 1 and len(y) > 1:
241
+ slope = np.polyfit(x, y, 1)[0]
242
+
243
+ # Hurst exponent interpretation
244
+ if slope > 0.5:
245
+ trend_type = 'persistent'
246
+ elif slope < 0.5:
247
+ trend_type = 'anti-persistent'
248
+ else:
249
+ trend_type = 'random'
250
+
251
+ return {
252
+ 'exponent': float(slope),
253
+ 'trend_type': trend_type,
254
+ 'interpretation': self._interpret_hurst(slope)
255
+ }
256
+
257
+ return {
258
+ 'exponent': np.nan,
259
+ 'trend_type': 'unknown',
260
+ 'interpretation': 'Insufficient data'
261
+ }
262
+
263
+ except Exception as e:
264
+ logger.debug(f"Hurst exponent not calculated: {e}")
265
+ return {
266
+ 'exponent': np.nan,
267
+ 'trend_type': 'unknown',
268
+ 'error': str(e)
269
+ }
270
+
271
+ def _interpret_hurst(self, hurst_exponent: float) -> str:
272
+ """Interpret Hurst exponent"""
273
+ if hurst_exponent > 0.75:
274
+ return "Strong persistence (long-term memory)"
275
+ elif hurst_exponent > 0.6:
276
+ return "Moderate persistence"
277
+ elif hurst_exponent > 0.4:
278
+ return "Weak persistence / random walk"
279
+ elif hurst_exponent > 0.25:
280
+ return "Weak anti-persistence"
281
+ else:
282
+ return "Strong anti-persistence (frequent trend reversal)"
283
+
284
+ def _variance_ratio_test(self, series: pd.Series) -> Dict:
285
+ """Variance Ratio test for random walk"""
286
+ try:
287
+ # Simplified variance ratio test
288
+ if len(series) < 20:
289
+ return {'ratio': np.nan, 'is_random_walk': False}
290
+
291
+ # Calculate differences
292
+ diff1 = series.diff(1).dropna()
293
+ diff2 = series.diff(2).dropna()[1:] # Shift to align indices
294
+
295
+ if len(diff1) < 5 or len(diff2) < 5:
296
+ return {'ratio': np.nan, 'is_random_walk': False}
297
+
298
+ var1 = np.var(diff1)
299
+ var2 = np.var(diff2)
300
+
301
+ if var1 > 0:
302
+ ratio = var2 / (2 * var1)
303
+
304
+ # For random walk ratio ≈ 1
305
+ is_random_walk = 0.8 < ratio < 1.2
306
+
307
+ return {
308
+ 'ratio': float(ratio),
309
+ 'is_random_walk': bool(is_random_walk),
310
+ 'var_diff1': float(var1),
311
+ 'var_diff2': float(var2)
312
+ }
313
+ else:
314
+ return {'ratio': np.nan, 'is_random_walk': False}
315
+
316
+ except Exception as e:
317
+ logger.debug(f"Variance ratio test failed: {e}")
318
+ return {'ratio': np.nan, 'is_random_walk': False, 'error': str(e)}
319
+
320
+ def _get_stationarity_recommendation(self, results: Dict) -> str:
321
+ """Get stationarity recommendations"""
322
+ # Check for keys before access
323
+ if 'overall' not in results or 'is_stationary' not in results['overall']:
324
+ return "Could not determine stationarity. Check data and test settings."
325
+
326
+ if results['overall']['is_stationary']:
327
+ return "Series is stationary, suitable for modelling"
328
+ else:
329
+ recommendations = []
330
+
331
+ # Check Hurst test results
332
+ if 'hurst' in results and 'exponent' in results['hurst']:
333
+ hurst_exponent = results['hurst']['exponent']
334
+ if not np.isnan(hurst_exponent) and hurst_exponent > 0.6:
335
+ recommendations.append("Apply differencing to remove trend")
336
+
337
+ # Check ADF test
338
+ if 'adf' in results and 'pvalue' in results['adf']:
339
+ adf_pvalue = results['adf']['pvalue']
340
+ if not np.isnan(adf_pvalue) and adf_pvalue > 0.1:
341
+ recommendations.append("Consider seasonal differencing due to non-stationarity")
342
+
343
+ if len(recommendations) == 0:
344
+ recommendations.append("Try logarithmic transformation and differencing")
345
+
346
+ return "; ".join(recommendations)
347
+
348
+ def _plot_stationarity_analysis(
349
+ self,
350
+ data: pd.DataFrame,
351
+ target_col: str,
352
+ results: Dict
353
+ ) -> None:
354
+ """Visualise stationarity analysis"""
355
+ series = data[target_col]
356
+
357
+ fig, axes = plt.subplots(2, 3, figsize=(16, 10))
358
+
359
+ # 1. Original series
360
+ axes[0, 0].plot(series.index, series, linewidth=1)
361
+ axes[0, 0].set_title(f'Original Time Series: {target_col}')
362
+ axes[0, 0].set_xlabel('Date')
363
+ axes[0, 0].set_ylabel(target_col)
364
+ axes[0, 0].grid(True, alpha=0.3)
365
+
366
+ # 2. Rolling statistics
367
+ rolling_mean = series.rolling(window=365, center=True, min_periods=1).mean()
368
+ rolling_std = series.rolling(window=365, center=True, min_periods=1).std()
369
+
370
+ axes[0, 1].plot(series.index, series, label='Original series', alpha=0.7, linewidth=0.5)
371
+ axes[0, 1].plot(rolling_mean.index, rolling_mean, label='Rolling mean (365)', color='red', linewidth=2)
372
+ axes[0, 1].plot(rolling_std.index, rolling_std, label='Rolling STD (365)', color='green', linewidth=2)
373
+ axes[0, 1].set_title(f'Rolling Statistics: {target_col}')
374
+ axes[0, 1].set_xlabel('Date')
375
+ axes[0, 1].set_ylabel(target_col)
376
+ axes[0, 1].legend(fontsize=8)
377
+ axes[0, 1].grid(True, alpha=0.3)
378
+
379
+ # 3. ACF
380
+ plot_acf(series.dropna(), lags=50, ax=axes[0, 2], alpha=0.05)
381
+ axes[0, 2].set_title(f'Autocorrelation Function (ACF): {target_col}')
382
+ axes[0, 2].set_xlabel('Lag')
383
+ axes[0, 2].set_ylabel('Autocorrelation')
384
+ axes[0, 2].grid(True, alpha=0.3)
385
+
386
+ # 4. PACF
387
+ plot_pacf(series.dropna(), lags=50, ax=axes[1, 0], alpha=0.05)
388
+ axes[1, 0].set_title(f'Partial Autocorrelation Function (PACF): {target_col}')
389
+ axes[1, 0].set_xlabel('Lag')
390
+ axes[1, 0].set_ylabel('Partial Autocorrelation')
391
+ axes[1, 0].grid(True, alpha=0.3)
392
+
393
+ # 5. Histogram and Q-Q plot
394
+ axes[1, 1].hist(series.dropna(), bins=30, edgecolor='black', alpha=0.7, density=True)
395
+ axes[1, 1].set_title(f'Distribution: {target_col}')
396
+ axes[1, 1].set_xlabel('Value')
397
+ axes[1, 1].set_ylabel('Density')
398
+ axes[1, 1].grid(True, alpha=0.3)
399
+
400
+ # 6. Series differences
401
+ diff1 = series.diff(1).dropna()
402
+ axes[1, 2].plot(diff1.index, diff1, linewidth=0.5)
403
+ axes[1, 2].set_title(f'First Difference: {target_col}')
404
+ axes[1, 2].set_xlabel('Date')
405
+ axes[1, 2].set_ylabel(f'Δ{target_col}')
406
+ axes[1, 2].grid(True, alpha=0.3)
407
+
408
+ plt.suptitle(
409
+ f'Stationarity Analysis: {target_col}\n'
410
+ f'Stationary: {"✓ Yes" if results["overall"]["is_stationary"] else "✗ No"} '
411
+ f'(confidence: {results["overall"]["confidence"]})',
412
+ fontsize=14
413
+ )
414
+
415
+ plt.tight_layout()
416
+ plt.savefig(
417
+ f'{self.config.results_dir}/plots/stationarity_{target_col}.png',
418
+ dpi=300,
419
+ bbox_inches='tight'
420
+ )
421
+ plt.show()
422
+
423
+ def _log_test_results(self, target_col: str, results: Dict) -> None:
424
+ """Log test results"""
425
+ logger.info("\nSTATIONARITY TEST RESULTS:")
426
+ logger.info("-" * 50)
427
+
428
+ # ADF test
429
+ adf = results['adf']
430
+ logger.info(f"Augmented Dickey-Fuller (ADF) test:")
431
+ logger.info(f" Statistic: {adf['statistic']:.4f}")
432
+ logger.info(f" p-value: {adf['pvalue']:.4f}")
433
+ logger.info(f" Stationary: {'✓ Yes' if adf['is_stationary'] else '✗ No'}")
434
+
435
+ # KPSS test
436
+ kpss_test = results['kpss']
437
+ if 'statistic' in kpss_test and not np.isnan(kpss_test['statistic']):
438
+ logger.info(f"\nKPSS test:")
439
+ logger.info(f" Statistic: {kpss_test['statistic']:.4f}")
440
+ logger.info(f" p-value: {kpss_test['pvalue']:.4f}")
441
+ logger.info(f" Stationary: {'✓ Yes' if kpss_test['is_stationary'] else '✗ No'}")
442
+
443
+ # Hurst exponent
444
+ hurst = results['hurst']
445
+ if 'exponent' in hurst and not np.isnan(hurst['exponent']):
446
+ logger.info(f"\nHurst exponent:")
447
+ logger.info(f" Value: {hurst['exponent']:.3f}")
448
+ logger.info(f" Trend type: {hurst['trend_type']}")
449
+ logger.info(f" Interpretation: {hurst.get('interpretation', '')}")
450
+
451
+ # Overall interpretation
452
+ logger.info(f"\nOVERALL CONCLUSION:")
453
+ logger.info("-" * 30)
454
+ logger.info(f"Stationary: {'✓ Yes' if results['overall']['is_stationary'] else '✗ No'}")
455
+ logger.info(f"Confidence: {results['overall']['confidence']}")
456
+ logger.info(f"Recommendation: {results['overall']['recommendation']}")
457
+
458
+ def _make_stationary(
459
+ self,
460
+ data: pd.DataFrame,
461
+ target_col: str,
462
+ results: Dict
463
+ ) -> Optional[pd.DataFrame]:
464
+ """
465
+ Transform series to stationary form
466
+
467
+ Parameters:
468
+ -----------
469
+ data : pd.DataFrame
470
+ Input data
471
+ target_col : str
472
+ Target variable
473
+ results : Dict
474
+ Stationarity test results
475
+
476
+ Returns:
477
+ --------
478
+ Optional[pd.DataFrame]
479
+ Data with stationary series or None if transformation failed
480
+ """
481
+ logger.info("\nTRANSFORMING TO STATIONARY FORM:")
482
+ logger.info("-" * 40)
483
+
484
+ data_processed = data.copy()
485
+ series = data_processed[target_col]
486
+
487
+ # Stationarisation methods in order of preference
488
+ methods = [
489
+ ('diff', 'first-order differencing'),
490
+ ('seasonal_diff', f'seasonal differencing (period={self.config.seasonal_period})'),
491
+ ('log_diff', 'logarithmic differencing'),
492
+ ('boxcox_diff', 'Box-Cox + differencing'),
493
+ ('detrend', 'detrending'),
494
+ ('combination', 'combined method')
495
+ ]
496
+
497
+ best_method = None
498
+ best_series = None
499
+ best_pvalue = 1.0
500
+ best_stationary = False
501
+
502
+ for method, method_name in methods:
503
+ try:
504
+ if method == 'diff':
505
+ # Simple differencing
506
+ transformed = series.diff(1).dropna()
507
+ test_series = transformed
508
+
509
+ elif method == 'seasonal_diff':
510
+ # Seasonal differencing
511
+ transformed = series.diff(self.config.seasonal_period).dropna()
512
+ test_series = transformed
513
+
514
+ elif method == 'log_diff':
515
+ # Logarithmic differencing
516
+ if (series > 0).all():
517
+ log_series = np.log(series)
518
+ transformed = log_series.diff(1).dropna()
519
+ test_series = transformed
520
+ else:
521
+ # Shift for negative values
522
+ shift = abs(series.min()) + 1 if series.min() <= 0 else 0
523
+ log_series = np.log(series + shift)
524
+ transformed = log_series.diff(1).dropna()
525
+ test_series = transformed
526
+
527
+ elif method == 'boxcox_diff':
528
+ # Box-Cox transformation + differencing
529
+ try:
530
+ from scipy.stats import boxcox
531
+ # Add constant for positive values
532
+ shift = abs(series.min()) + 1 if series.min() <= 0 else 0
533
+ boxcox_series, _ = boxcox(series + shift)
534
+ transformed = pd.Series(boxcox_series, index=series.index).diff(1).dropna()
535
+ test_series = transformed
536
+ except:
537
+ continue
538
+
539
+ elif method == 'detrend':
540
+ # Linear detrending
541
+ x = np.arange(len(series))
542
+ y = series.values
543
+ coeffs = np.polyfit(x, y, 1)
544
+ trend = np.polyval(coeffs, x)
545
+ transformed = pd.Series(y - trend, index=series.index)
546
+ test_series = transformed
547
+
548
+ elif method == 'combination':
549
+ # Combined method: log + differencing + detrending
550
+ if (series > 0).all():
551
+ log_series = np.log(series)
552
+ diff_series = log_series.diff(1)
553
+
554
+ # Detrending residuals
555
+ x = np.arange(len(diff_series))
556
+ y = diff_series.values
557
+ valid_mask = ~np.isnan(y)
558
+
559
+ if valid_mask.sum() > 2:
560
+ coeffs = np.polyfit(x[valid_mask], y[valid_mask], 1)
561
+ trend = np.polyval(coeffs, x)
562
+ transformed = pd.Series(y - trend, index=series.index)
563
+ test_series = transformed.dropna()
564
+ else:
565
+ test_series = diff_series.dropna()
566
+ else:
567
+ continue
568
+
569
+ # Check stationarity after transformation
570
+ if len(test_series) > 10:
571
+ adf_result = adfuller(test_series.dropna())
572
+ is_stationary = adf_result[1] < 0.05
573
+ pvalue = adf_result[1]
574
+
575
+ logger.info(f" Method: {method_name}")
576
+ logger.info(f" ADF p-value: {pvalue:.4f}")
577
+ logger.info(f" Stationary: {'✓ Yes' if is_stationary else '✗ No'}")
578
+
579
+ # Save best method
580
+ if is_stationary and pvalue < best_pvalue:
581
+ best_pvalue = pvalue
582
+ best_method = method
583
+ best_series = transformed
584
+ best_stationary = True
585
+
586
+ if pvalue < 0.01:  # strong rejection of the unit root; stop searching
587
+ break
588
+
589
+ except Exception as e:
590
+ logger.debug(f" Method {method} failed: {e}")
591
+ continue
592
+
593
+ # Save results
594
+ if best_series is not None:
595
+ new_col_name = f'{target_col}_stationary_{best_method}'
596
+
597
+ # Align indices
598
+ aligned_series = pd.Series(
599
+ best_series.values,
600
+ index=data_processed.index[-len(best_series):]
601
+ )
602
+
603
+ data_processed[new_col_name] = aligned_series
604
+
605
+ self.transformed_series[target_col] = {
606
+ 'method': best_method,
607
+ 'new_column': new_col_name,
608
+ 'pvalue': float(best_pvalue),
609
+ 'is_stationary': best_stationary,
610
+ 'original_shape': len(series),
611
+ 'transformed_shape': len(best_series)
612
+ }
613
+
614
+ self.best_transformation[target_col] = best_method
615
+
616
+ logger.info(f"\n✓ Selected method: {best_method}")
617
+ logger.info(f" Saved as '{new_col_name}'")
618
+ logger.info(f" p-value: {best_pvalue:.4f}")
619
+
620
+ return data_processed
621
+ else:
622
+ logger.warning("✗ Could not find suitable transformation for stationarisation")
623
+ return None
624
+
625
+ def get_report(self) -> Dict:
626
+ """Get stationarity report"""
627
+ return {
628
+ 'test_results': self.test_results,
629
+ 'transformed_series': self.transformed_series,
630
+ 'best_transformations': self.best_transformation
631
+ }
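
A minimal usage sketch for the stationarity helper above. The enclosing class name, its constructor, and the module path are not shown in this diff, so `StationarityProcessor` and the import below are assumptions; only `_make_stationary` and `get_report` appear here.

```python
# Hypothetical usage of the stationarity helper defined above.
# StationarityProcessor, its module path, and the Config fields
# (seasonal_period, target_column) are assumptions, not confirmed by this diff.
import numpy as np
import pandas as pd

from config.config import Config
from preprocessing.stationarity import StationarityProcessor  # assumed path

config = Config()
processor = StationarityProcessor(config)

# A synthetic trending series: clearly non-stationary until differenced
idx = pd.date_range("2020-01-01", periods=500, freq="D")
rng = np.random.default_rng(42)
data = pd.DataFrame(
    {"demand": np.linspace(0.0, 50.0, 500) + rng.normal(0.0, 1.0, 500)},
    index=idx,
)

transformed = processor._make_stationary(data, "demand", results={})
if transformed is not None:
    # Expect first-order differencing to win for a linear trend
    print(processor.get_report()["best_transformations"])  # e.g. {'demand': 'diff'}
```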
temp_data.csv ADDED
The diff for this file is too large to render. See raw diff
 
validation/__init__.py ADDED
File without changes
validation/data_validator.py ADDED
@@ -0,0 +1,655 @@
1
+ # ============================================
2
+ # CLASS 12: DATA VALIDATION
3
+ # ============================================
4
+ from datetime import datetime
5
+ import json
6
+ from pathlib import Path
7
+ from typing import Dict, List
8
+ import logging
+ logger = logging.getLogger(__name__)
9
+
10
+ from config.config import Config
11
+ import pandas as pd
12
+ import numpy as np
13
+
14
+ class DataValidator:
15
+ """Class for data quality validation"""
16
+
17
+ def __init__(self, config: Config):
18
+ """
19
+ Initialise data validator
20
+
21
+ Parameters:
22
+ -----------
23
+ config : Config
24
+ Experiment configuration
25
+ """
26
+ self.config = config
27
+ self.validation_results = {}
28
+ self.quality_metrics = {}
29
+ self.issues_found = {}
30
+
31
+ def validate(
32
+ self,
33
+ data: pd.DataFrame,
34
+ stage: str = 'final',
35
+ rules: Dict = None,
36
+ detailed: bool = True
37
+ ) -> Dict:
38
+ """
39
+ Validate data quality
40
+
41
+ Parameters:
42
+ -----------
43
+ data : pd.DataFrame
44
+ Input data
45
+ stage : str
46
+ Validation stage: 'raw', 'processed', 'final'
47
+ rules : Dict, optional
48
+ Validation rules. If None, uses configuration defaults.
49
+ detailed : bool
50
+ Whether to perform detailed validation
51
+
52
+ Returns:
53
+ --------
54
+ Dict
55
+ Validation results
56
+ """
57
+ logger.info("\n" + "="*80)
58
+ logger.info(f"DATA VALIDATION ({stage.upper()})")
59
+ logger.info("="*80)
60
+
61
+ rules = rules or self.config.validation_rules
62
+
63
+ validation_results = {
64
+ 'stage': stage,
65
+ 'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
66
+ 'data_shape': list(data.shape),
67
+ 'basic_checks': {},
68
+ 'quality_metrics': {},
69
+ 'issues': {},
70
+ 'recommendations': [],
71
+ 'overall_score': 0,
72
+ 'status': 'PASS'
73
+ }
74
+
75
+ # Basic checks
76
+ validation_results['basic_checks'] = self._basic_checks(data, rules)
77
+
78
+ # Quality checks
79
+ validation_results['quality_metrics'] = self._quality_metrics(data, rules)
80
+
81
+ # Problem detection
82
+ if detailed:
83
+ validation_results['issues'] = self._find_issues(data, rules)
84
+
85
+ # Recommendation generation
86
+ validation_results['recommendations'] = self._generate_recommendations(
87
+ validation_results['basic_checks'],
88
+ validation_results['quality_metrics'],
89
+ validation_results['issues']
90
+ )
91
+
92
+ # Overall score calculation
93
+ validation_results['overall_score'] = self._calculate_overall_score(validation_results)
94
+
95
+ # Status determination
96
+ if validation_results['overall_score'] >= 80:
97
+ validation_results['status'] = 'PASS'
98
+ elif validation_results['overall_score'] >= 60:
99
+ validation_results['status'] = 'WARNING'
100
+ else:
101
+ validation_results['status'] = 'FAIL'
102
+
103
+ # Save results
104
+ self.validation_results[stage] = validation_results
105
+ self.quality_metrics[stage] = validation_results['quality_metrics']
106
+
107
+ # Log results
108
+ self._log_validation_results(validation_results)
109
+
110
+ return validation_results
111
+
112
+ def _basic_checks(self, data: pd.DataFrame, rules: Dict) -> Dict:
113
+ """Basic data checks"""
114
+ checks = {}
115
+
116
+ # 1. Data size check
117
+ checks['min_rows'] = {
118
+ 'value': len(data),
119
+ 'threshold': rules.get('min_rows', 100),
120
+ 'passed': len(data) >= rules.get('min_rows', 100)
121
+ }
122
+
123
+ # 2. Target variable presence check
124
+ target = self.config.target_column
125
+ checks['has_target'] = {
126
+ 'value': target in data.columns,
127
+ 'passed': target in data.columns
128
+ }
129
+
130
+ # 3. Missing values check
131
+ missing_percentage = (data.isnull().sum().sum() / data.size) * 100
132
+ checks['missing_percentage'] = {
133
+ 'value': missing_percentage,
134
+ 'threshold': rules.get('max_missing_percentage', 30),
135
+ 'passed': missing_percentage <= rules.get('max_missing_percentage', 30)
136
+ }
137
+
138
+ # 4. Duplicates check
139
+ duplicate_count = data.duplicated().sum()
140
+ duplicate_percentage = (duplicate_count / len(data)) * 100
141
+ checks['duplicates'] = {
142
+ 'value': duplicate_percentage,
143
+ 'threshold': 5, # Maximum 5% duplicates
144
+ 'passed': duplicate_percentage <= 5
145
+ }
146
+
147
+ # 5. Data types check
148
+ numeric_count = len(data.select_dtypes(include=[np.number]).columns)
149
+ checks['numeric_features'] = {
150
+ 'value': numeric_count,
151
+ 'passed': numeric_count >= 1 # At least one numeric feature required
152
+ }
153
+
154
+ return checks
155
+
156
+ def _quality_metrics(self, data: pd.DataFrame, rules: Dict) -> Dict:
157
+ """Data quality metrics"""
158
+ metrics = {}
159
+
160
+ # 1. Numeric features statistics
161
+ numeric_cols = data.select_dtypes(include=[np.number]).columns
162
+
163
+ if len(numeric_cols) > 0:
164
+ numeric_stats = {}
165
+ for col in numeric_cols:
166
+ col_data = data[col].dropna()
167
+ if len(col_data) > 0:
168
+ numeric_stats[col] = {
169
+ 'mean': float(col_data.mean()),
170
+ 'std': float(col_data.std()),
171
+ 'skewness': float(col_data.skew()),
172
+ 'kurtosis': float(col_data.kurtosis()),
173
+ 'zeros_percentage': float((col_data == 0).sum() / len(col_data) * 100),
174
+ 'unique_percentage': float(col_data.nunique() / len(col_data) * 100)
175
+ }
176
+
177
+ metrics['numeric_statistics'] = numeric_stats
178
+
179
+ # 2. Data stability (for time series)
180
+ if isinstance(data.index, pd.DatetimeIndex):
181
+ stability_metrics = self._calculate_temporal_stability(data)
182
+ metrics['temporal_stability'] = stability_metrics
183
+
184
+ # 3. Feature informativeness
185
+ if self.config.target_column in data.columns:
186
+ informativeness = self._calculate_feature_informativeness(data)
187
+ metrics['feature_informativeness'] = informativeness
188
+
189
+ # 4. Target variable quality
190
+ target = self.config.target_column
191
+ if target in data.columns:
192
+ target_data = data[target].dropna()
193
+ if len(target_data) > 0:
194
+ target_metrics = {
195
+ 'missing_percentage': float(data[target].isnull().sum() / len(data) * 100),  # measure on the raw column; target_data is already dropna'd
196
+ 'unique_values': int(target_data.nunique()),
197
+ 'is_constant': bool(target_data.nunique() <= 1),
198
+ 'has_outliers': self._check_target_outliers(target_data),
199
+ 'distribution_type': self._identify_distribution(target_data)
200
+ }
201
+ metrics['target_quality'] = target_metrics
202
+
203
+ # 5. Class balance (for classification) - not applicable here, but kept as placeholder
204
+ metrics['class_balance'] = {'note': 'Not applicable for regression'}
205
+
206
+ return metrics
207
+
208
+ def _calculate_temporal_stability(self, data: pd.DataFrame) -> Dict:
209
+ """Calculate time series stability metrics"""
210
+ stability = {}
211
+
212
+ if not isinstance(data.index, pd.DatetimeIndex):
213
+ return stability
214
+
215
+ # Split into periods (e.g., by years)
216
+ if 'year' not in data.columns:
217
+ data_copy = data.copy()
218
+ data_copy['year'] = data_copy.index.year
219
+ else:
220
+ data_copy = data
221
+
222
+ years = sorted(data_copy['year'].unique())
223
+
224
+ if len(years) > 1:
225
+ # Statistics by years for numeric columns
226
+ year_stats = {}
227
+ for col in data.select_dtypes(include=[np.number]).columns[:5]: # First 5 columns
228
+ yearly_means = data_copy.groupby('year')[col].mean()
229
+ yearly_stds = data_copy.groupby('year')[col].std()
230
+
231
+ # Coefficient of variation between years
232
+ if yearly_means.std() > 0:
233
+ cv_between_years = yearly_means.std() / yearly_means.mean()
234
+ else:
235
+ cv_between_years = 0
236
+
237
+ year_stats[col] = {
238
+ 'yearly_means': yearly_means.to_dict(),
239
+ 'yearly_stds': yearly_stds.to_dict(),
240
+ 'cv_between_years': float(cv_between_years),
241
+ 'mean_stability': float(1 - cv_between_years) # 1 - CV, closer to 1 means more stable
242
+ }
243
+
244
+ stability['yearly_statistics'] = year_stats
245
+
246
+ # Check for time gaps
247
+ time_diff = pd.Series(data.index).diff().dropna()
248
+ if len(time_diff) > 0:
249
+ max_gap = time_diff.max()
250
+ avg_gap = time_diff.mean()
251
+ gap_std = time_diff.std()
252
+
253
+ stability['time_gaps'] = {
254
+ 'max_gap_days': float(max_gap.days if hasattr(max_gap, 'days') else max_gap),
255
+ 'avg_gap_days': float(avg_gap.days if hasattr(avg_gap, 'days') else avg_gap),
256
+ 'gap_std': float(gap_std.days if hasattr(gap_std, 'days') else gap_std),
257
+ 'has_irregular_gaps': gap_std > avg_gap * 0.5 # If standard deviation > 50% of mean
258
+ }
259
+
260
+ # Seasonal stability
261
+ if len(data) > 365:
262
+ try:
263
+ # Analyse seasonal patterns
264
+ seasonal_stability = self._analyse_seasonal_stability(data)
265
+ stability['seasonal_stability'] = seasonal_stability
266
+ except Exception:
267
+ pass
268
+
269
+ return stability
270
+
271
+ def _analyse_seasonal_stability(self, data: pd.DataFrame) -> Dict:
272
+ """Analyse seasonal patterns stability"""
273
+ if not isinstance(data.index, pd.DatetimeIndex):
274
+ return {}
275
+
276
+ # For simplicity, analyse only target variable
277
+ target = self.config.target_column
278
+ if target not in data.columns:
279
+ return {}
280
+
281
+ series = data[target]
282
+
283
+ # Split by years and compare seasonal patterns
284
+ data_copy = data.copy()
285
+ data_copy['year'] = data_copy.index.year
286
+ data_copy['month'] = data_copy.index.month
287
+
288
+ if 'year' in data_copy.columns and 'month' in data_copy.columns:
289
+ monthly_means = data_copy.groupby(['year', 'month'])[target].mean().unstack()
290
+
291
+ if not monthly_means.empty:
292
+ # Correlation between years
293
+ yearly_corr = monthly_means.T.corr().mean().mean()  # correlate year profiles, not month columns
294
+
295
+ # Variation between years
296
+ monthly_cv = monthly_means.std() / monthly_means.mean()
297
+ avg_monthly_cv = monthly_cv.mean()
298
+
299
+ return {
300
+ 'yearly_correlation': float(yearly_corr),
301
+ 'average_monthly_cv': float(avg_monthly_cv),
302
+ 'seasonal_consistency': 'high' if yearly_corr > 0.8 and avg_monthly_cv < 0.3 else
303
+ 'medium' if yearly_corr > 0.6 else 'low'
304
+ }
305
+
306
+ return {}
307
+
308
+ def _calculate_feature_informativeness(self, data: pd.DataFrame) -> Dict:
309
+ """Calculate feature informativeness"""
310
+ informativeness = {}
311
+
312
+ target = self.config.target_column
313
+ if target not in data.columns:
314
+ return informativeness
315
+
316
+ numeric_cols = data.select_dtypes(include=[np.number]).columns
317
+ numeric_cols = [col for col in numeric_cols if col != target]
318
+
319
+ for col in numeric_cols[:20]: # Limit number of features for analysis
320
+ try:
321
+ # Correlation with target variable
322
+ correlation = data[col].corr(data[target])
323
+
324
+ # Mutual information (approximated)
325
+ # For simplicity, use absolute correlation as informativeness measure
326
+ informativeness[col] = {
327
+ 'correlation_with_target': float(correlation),
328
+ 'abs_correlation': float(abs(correlation)),
329
+ 'informativeness': 'high' if abs(correlation) > 0.5 else
330
+ 'medium' if abs(correlation) > 0.3 else 'low'
331
+ }
332
+ except:
333
+ continue
334
+
335
+ return informativeness
336
+
337
+ def _check_target_outliers(self, target_series: pd.Series) -> Dict:
338
+ """Check target variable for outliers"""
339
+ if len(target_series) < 10:
340
+ return {'has_outliers': False, 'outlier_percentage': 0}
341
+
342
+ q1 = target_series.quantile(0.25)
343
+ q3 = target_series.quantile(0.75)
344
+ iqr = q3 - q1
345
+
346
+ if iqr > 0:
347
+ lower_bound = q1 - 1.5 * iqr
348
+ upper_bound = q3 + 1.5 * iqr
349
+
350
+ outliers = target_series[(target_series < lower_bound) | (target_series > upper_bound)]
351
+ outlier_percentage = len(outliers) / len(target_series) * 100
352
+
353
+ return {
354
+ 'has_outliers': len(outliers) > 0,
355
+ 'outlier_count': int(len(outliers)),
356
+ 'outlier_percentage': float(outlier_percentage),
357
+ 'outlier_bounds': {'lower': float(lower_bound), 'upper': float(upper_bound)}
358
+ }
359
+
360
+ return {'has_outliers': False, 'outlier_percentage': 0}
361
+
362
+ def _identify_distribution(self, series: pd.Series) -> str:
363
+ """Identify distribution type"""
364
+ if len(series) < 30:
365
+ return 'insufficient_data'
366
+
367
+ skewness = series.skew()
368
+ kurtosis = series.kurtosis()  # pandas returns excess kurtosis (0 for a normal distribution)
369
+
370
+ if abs(skewness) < 0.5 and abs(kurtosis) < 1:
371
+ return 'normal_like'
372
+ elif skewness > 1:
373
+ return 'right_skewed'
374
+ elif skewness < -1:
375
+ return 'left_skewed'
376
+ elif kurtosis > 3:
377
+ return 'heavy_tailed'
378
+ elif kurtosis < 2:
379
+ return 'light_tailed'
380
+ else:
381
+ return 'unknown'
382
+
383
+ def _find_issues(self, data: pd.DataFrame, rules: Dict) -> Dict:
384
+ """Find data problems"""
385
+ issues = {
386
+ 'critical': [],
387
+ 'warning': [],
388
+ 'info': []
389
+ }
390
+
391
+ # 1. Check missing values in important features
392
+ missing_info = data.isnull().sum()
393
+ high_missing_cols = missing_info[missing_info / len(data) * 100 > 20].index.tolist()
394
+
395
+ for col in high_missing_cols:
396
+ missing_pct = missing_info[col] / len(data) * 100
397
+ if missing_pct > 50:
398
+ issues['critical'].append(f"Column '{col}': {missing_pct:.1f}% missing values (critical)")
399
+ elif missing_pct > 20:
400
+ issues['warning'].append(f"Column '{col}': {missing_pct:.1f}% missing values")
401
+
402
+ # 2. Check constant features
403
+ for col in data.columns:
404
+ if data[col].nunique() <= 1:
405
+ issues['critical'].append(f"Column '{col}': constant value")
406
+
407
+ # 3. Check feature correlation with itself (lags)
408
+ numeric_cols = data.select_dtypes(include=[np.number]).columns
409
+ for col in numeric_cols:
410
+ if '_lag_' in col or '_diff_' in col:
411
+ base_col = col.split('_lag_')[0] if '_lag_' in col else col.split('_diff_')[0]
412
+ if base_col in numeric_cols:
413
+ correlation = data[col].corr(data[base_col])
414
+ if pd.notna(correlation) and abs(correlation) > 0.95:
415
+ issues['info'].append(f"Column '{col}': high correlation with '{base_col}' ({correlation:.3f})")
416
+
417
+ # 4. Check time gaps
418
+ if isinstance(data.index, pd.DatetimeIndex):
419
+ time_diff = pd.Series(data.index).diff().dropna()
420
+ if len(time_diff) > 0:
421
+ max_gap = time_diff.max()
422
+ if hasattr(max_gap, 'days') and max_gap.days > 30:
423
+ issues['warning'].append(f"Detected time gap: {max_gap.days} days")
424
+
425
+ # 5. Check target variable
426
+ target = self.config.target_column
427
+ if target in data.columns:
428
+ target_data = data[target].dropna()
429
+ if len(target_data) > 0:
430
+ if target_data.nunique() <= 1:
431
+ issues['critical'].append(f"Target variable '{target}': constant value")
432
+
433
+ # Check for outliers
434
+ outlier_check = self._check_target_outliers(target_data)
435
+ if outlier_check.get('has_outliers', False) and outlier_check.get('outlier_percentage', 0) > 10:
436
+ issues['warning'].append(f"Target variable '{target}': {outlier_check['outlier_percentage']:.1f}% outliers")
437
+
438
+ # 6. Check multicollinearity (simplified)
439
+ if len(numeric_cols) > 5:
440
+ corr_matrix = data[numeric_cols].corr().abs()
441
+ high_corr_pairs = []
442
+
443
+ for i in range(len(corr_matrix.columns)):
444
+ for j in range(i+1, len(corr_matrix.columns)):
445
+ if corr_matrix.iloc[i, j] > 0.9:
446
+ col1 = corr_matrix.columns[i]
447
+ col2 = corr_matrix.columns[j]
448
+ high_corr_pairs.append((col1, col2, corr_matrix.iloc[i, j]))
449
+
450
+ if len(high_corr_pairs) > 5:
451
+ issues['warning'].append(f"Detected multicollinearity: {len(high_corr_pairs)} pairs with correlation > 0.9")
452
+
453
+ return issues
454
+
455
+ def _generate_recommendations(
456
+ self,
457
+ basic_checks: Dict,
458
+ quality_metrics: Dict,
459
+ issues: Dict
460
+ ) -> List[str]:
461
+ """Generate data improvement recommendations"""
462
+ recommendations = []
463
+
464
+ # Recommendations based on basic checks
465
+ for check_name, check_info in basic_checks.items():
466
+ if not check_info.get('passed', True):
467
+ if check_name == 'min_rows':
468
+ recommendations.append(f"Increase data volume: current row count ({check_info['value']}) below minimum threshold ({check_info['threshold']})")
469
+ elif check_name == 'has_target':
470
+ recommendations.append(f"Add target variable '{self.config.target_column}' to data")
471
+ elif check_name == 'missing_percentage':
472
+ recommendations.append(f"Handle missing values: {check_info['value']:.1f}% missing exceeds threshold {check_info['threshold']}%")
473
+ elif check_name == 'duplicates':
474
+ recommendations.append(f"Remove duplicates: {check_info['value']:.1f}% duplicate rows")
475
+
476
+ # Recommendations based on issues
477
+ if issues.get('critical'):
478
+ recommendations.append("Resolve critical issues before using data")
479
+
480
+ if issues.get('warning'):
481
+ recommendations.append("Consider addressing warnings to improve data quality")
482
+
483
+ # Recommendations based on quality metrics
484
+ target_metrics = quality_metrics.get('target_quality', {})
485
+ if target_metrics.get('is_constant', False):
486
+ recommendations.append(f"Target variable '{self.config.target_column}' is constant, different target variable needed")
487
+
488
+ if target_metrics.get('has_outliers', {}).get('has_outliers', False):
489
+ outlier_pct = target_metrics['has_outliers'].get('outlier_percentage', 0)
490
+ if outlier_pct > 5:
491
+ recommendations.append(f"Handle outliers in target variable: {outlier_pct:.1f}% outliers")
492
+
493
+ # Time series stability recommendations
494
+ temporal_stability = quality_metrics.get('temporal_stability', {})
495
+ if temporal_stability.get('time_gaps', {}).get('has_irregular_gaps', False):
496
+ recommendations.append("Detected irregular time intervals, consider resampling")
497
+
498
+ return recommendations
499
+
500
+ def _calculate_overall_score(self, validation_results: Dict) -> float:
501
+ """Calculate overall data quality score"""
502
+ score = 100
503
+
504
+ # Penalties for basic checks
505
+ basic_checks = validation_results.get('basic_checks', {})
506
+ for check_name, check_info in basic_checks.items():
507
+ if not check_info.get('passed', True):
508
+ if check_name == 'min_rows':
509
+ score -= 30
510
+ elif check_name == 'has_target':
511
+ score -= 50
512
+ elif check_name == 'missing_percentage':
513
+ missing_pct = check_info.get('value', 0)
514
+ if missing_pct > 50:
515
+ score -= 40
516
+ elif missing_pct > 20:
517
+ score -= 20
518
+ elif missing_pct > 5:
519
+ score -= 10
520
+ elif check_name == 'duplicates':
521
+ duplicate_pct = check_info.get('value', 0)
522
+ if duplicate_pct > 20:
523
+ score -= 30
524
+ elif duplicate_pct > 10:
525
+ score -= 15
526
+ elif duplicate_pct > 5:
527
+ score -= 5
528
+
529
+ # Penalties for issues
530
+ issues = validation_results.get('issues', {})
531
+ if issues.get('critical'):
532
+ score -= len(issues['critical']) * 20
533
+
534
+ if issues.get('warning'):
535
+ score -= len(issues['warning']) * 5
536
+
537
+ # Bonuses for good metrics
538
+ quality_metrics = validation_results.get('quality_metrics', {})
539
+ target_metrics = quality_metrics.get('target_quality', {})
540
+
541
+ if not target_metrics.get('is_constant', True):
542
+ score += 10
543
+
544
+ if target_metrics.get('missing_percentage', 100) < 1:
545
+ score += 5
546
+
547
+ # Limit score to 0-100 range
548
+ return max(0, min(100, score))
549
+
550
+ def _log_validation_results(self, validation_results: Dict) -> None:
551
+ """Log validation results"""
552
+ stage = validation_results['stage']
553
+ status = validation_results['status']
554
+ score = validation_results['overall_score']
555
+
556
+ logger.info(f"VALIDATION RESULTS ({stage}):")
557
+ logger.info(f" Status: {status}")
558
+ logger.info(f" Overall score: {score}/100")
559
+ logger.info(f" Data shape: {validation_results['data_shape'][0]}x{validation_results['data_shape'][1]}")
560
+
561
+ # Basic checks
562
+ logger.info("\nBASIC CHECKS:")
563
+ for check_name, check_info in validation_results['basic_checks'].items():
564
+ status_icon = "✓" if check_info.get('passed', True) else "✗"
565
+ logger.info(f" {status_icon} {check_name}: {check_info.get('value', 'N/A')}")
566
+
567
+ # Issues
568
+ issues = validation_results['issues']
569
+ if any(issues.values()):
570
+ logger.info("\nDETECTED ISSUES:")
571
+ for severity, issue_list in issues.items():
572
+ if issue_list:
573
+ logger.info(f" {severity.upper()}:")
574
+ for issue in issue_list[:5]: # Show only first 5 issues of each type
575
+ logger.info(f" - {issue}")
576
+ if len(issue_list) > 5:
577
+ logger.info(f" ... and {len(issue_list) - 5} more issues")
578
+ else:
579
+ logger.info("\n✓ No issues detected")
580
+
581
+ # Recommendations
582
+ recommendations = validation_results['recommendations']
583
+ if recommendations:
584
+ logger.info("\nRECOMMENDATIONS:")
585
+ for i, rec in enumerate(recommendations, 1):
586
+ logger.info(f" {i}. {rec}")
587
+
588
+ # Conclusion
589
+ if status == 'PASS':
590
+ logger.info("\n✓ Data passed validation and is ready for use")
591
+ elif status == 'WARNING':
592
+ logger.info("\n⚠ Data requires attention, there are issues to address")
593
+ else:
594
+ logger.info("\n✗ Data requires significant improvement before use")
595
+
596
+ def generate_report(self, stage: str = 'final') -> Dict:
597
+ """Generate detailed validation report"""
598
+ if stage not in self.validation_results:
599
+ return {}
600
+
601
+ report = self.validation_results[stage].copy()
602
+
603
+ # Add metadata
604
+ report['config'] = self.config.to_dict()
605
+ report['validator_version'] = '1.0'
606
+ report['generation_time'] = datetime.now().strftime('%Y-%m-%d %H:%M:%S')
607
+
608
+ # Add detailed metrics
609
+ quality_metrics = report.get('quality_metrics', {})
610
+
611
+ if 'numeric_statistics' in quality_metrics:
612
+ # Numeric features summary
613
+ numeric_stats = quality_metrics['numeric_statistics']
614
+ report['numeric_summary'] = {
615
+ 'total_numeric_features': len(numeric_stats),
616
+ 'features_with_high_skewness': sum(1 for s in numeric_stats.values() if abs(s.get('skewness', 0)) > 1),
617
+ 'features_with_high_kurtosis': sum(1 for s in numeric_stats.values() if abs(s.get('kurtosis', 0)) > 3),
618
+ 'features_with_many_zeros': sum(1 for s in numeric_stats.values() if s.get('zeros_percentage', 0) > 50)
619
+ }
620
+
621
+ return report
622
+
623
+ def save_report(self, stage: str = 'final', path: str = None) -> None:
624
+ """Save validation report to file"""
625
+ if stage not in self.validation_results:
626
+ logger.warning(f"Report for stage '{stage}' not found")
627
+ return
628
+
629
+ report = self.generate_report(stage)
630
+
631
+ if path is None:
632
+ path = f'{self.config.results_dir}/reports/validation_report_{stage}.json'
633
+
634
+ # Create directory if needed
635
+ Path(path).parent.mkdir(parents=True, exist_ok=True)
636
+
637
+ # Custom JSON encoder
638
+ class NumpyEncoder(json.JSONEncoder):
639
+ def default(self, obj):
640
+ if isinstance(obj, (np.integer, np.floating)):
641
+ if np.isnan(obj):
642
+ return None
643
+ return float(obj)
644
+ elif isinstance(obj, np.bool_):
645
+ return bool(obj)
646
+ elif isinstance(obj, np.ndarray):
647
+ return obj.tolist()
648
+ elif isinstance(obj, pd.Timestamp):
649
+ return obj.strftime('%Y-%m-%d %H:%M:%S')
650
+ return super().default(obj)
651
+
652
+ with open(path, 'w', encoding='utf-8') as f:
653
+ json.dump(report, f, indent=4, ensure_ascii=False, cls=NumpyEncoder)
654
+
655
+ logger.info(f"✓ Validation report saved: {path}")
visualization/__init__.py ADDED
File without changes
visualization/visualization_manager.py ADDED
@@ -0,0 +1,1462 @@
1
+ # ============================================
2
+ # CLASS 13: VISUALISATION MANAGER (UPDATED)
3
+ # ============================================
4
+ import os
5
+ from datetime import datetime
6
+ import json
7
+ from typing import Dict, List, Optional, Tuple, Union, Any
8
+
9
+ import pandas as pd
10
+ import numpy as np
11
+ import matplotlib
12
+ matplotlib.use('Agg')  # Select the non-interactive backend before importing pyplot
13
+ import matplotlib.pyplot as plt
14
+ from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
15
+ import seaborn as sns
16
+ from scipy.stats import gaussian_kde
17
+
18
+ from config.config import Config
19
+ import logging
20
+
21
+ # Logging setup
22
+ logging.basicConfig(level=logging.INFO)
23
+ logger = logging.getLogger(__name__)
24
+
25
+
26
+ class VisualisationManager:
27
+ """Class for managing all visualisations"""
28
+
29
+ def __init__(self, config: Config):
30
+ """
31
+ Initialise visualisation manager
32
+
33
+ Parameters:
34
+ -----------
35
+ config : Config
36
+ Experiment configuration
37
+ """
38
+ self.config = config
39
+ self.plots_generated = {}
40
+ self.plot_files = {}
41
+ self.figure_count = 0
42
+
43
+ # Create directory structure for saving plots
44
+ self._create_directory_structure()
45
+
46
+ def _create_directory_structure(self) -> None:
47
+ """Create directory structure for saving plots"""
48
+ base_dir = self.config.results_dir
49
+
50
+ # Main plot directories
51
+ self.plots_dir = os.path.join(base_dir, "plots")
52
+ self.correlations_dir = os.path.join(base_dir, "plots", "correlations")
53
+ self.distributions_dir = os.path.join(base_dir, "plots", "distributions")
54
+ self.features_dir = os.path.join(base_dir, "plots", "features")
55
+ self.time_series_dir = os.path.join(base_dir, "plots", "time_series")
56
+ self.preprocessing_dir = os.path.join(base_dir, "plots", "preprocessing")
57
+ self.summary_dir = os.path.join(base_dir, "plots", "summary")
58
+ self.reports_dir = os.path.join(base_dir, "reports")
59
+
60
+ # Create directories
61
+ directories = [
62
+ self.plots_dir,
63
+ self.correlations_dir,
64
+ self.distributions_dir,
65
+ self.features_dir,
66
+ self.time_series_dir,
67
+ self.preprocessing_dir,
68
+ self.summary_dir,
69
+ self.reports_dir
70
+ ]
71
+
72
+ for directory in directories:
73
+ os.makedirs(directory, exist_ok=True)
74
+ logger.debug(f"Created directory: {directory}")
75
+
76
+ def _save_figure(self, fig: plt.Figure, filename: str,
77
+ subdirectory: str = None, dpi: int = 300) -> str:
78
+ """
79
+ Save plot and close it
80
+
81
+ Parameters:
82
+ -----------
83
+ fig : matplotlib.figure.Figure
84
+ Plot figure object
85
+ filename : str
86
+ Filename for saving
87
+ subdirectory : str, optional
88
+ Subdirectory for saving
89
+ dpi : int
90
+ Save quality
91
+
92
+ Returns:
93
+ --------
94
+ str : full path to saved file
95
+ """
96
+ if not filename.endswith('.png'):
97
+ filename = f"{filename}.png"
98
+
99
+ if subdirectory:
100
+ save_dir = os.path.join(self.plots_dir, subdirectory)
101
+ os.makedirs(save_dir, exist_ok=True)
102
+ else:
103
+ save_dir = self.plots_dir
104
+
105
+ filepath = os.path.join(save_dir, filename)
106
+
107
+ try:
108
+ fig.savefig(filepath, dpi=dpi, bbox_inches='tight', facecolor='white')
109
+ logger.info(f"✓ Plot saved: {filepath}")
110
+ except Exception as e:
111
+ logger.error(f"✗ Error saving plot {filename}: {e}")
112
+ filepath = None
113
+
114
+ # Close plot without display
115
+ plt.close(fig)
116
+
117
+ return filepath
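+
+ # Usage sketch (illustrative, not part of the pipeline): any matplotlib
+ # figure can be routed through this helper, e.g.
+ #   fig, ax = plt.subplots()
+ #   ax.plot(range(10))
+ #   manager._save_figure(fig, "demo_plot", subdirectory="summary")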
118
+
119
+ # ============================================
120
+ # MAIN VISUALISATION METHODS
121
+ # ============================================
122
+
123
+ def create_summary_dashboard(
124
+ self,
125
+ data: pd.DataFrame,
126
+ preprocessing_stages: Dict = None,
127
+ filename: str = "summary_dashboard"
128
+ ) -> str:
129
+ """
130
+ Create summary visualisation dashboard
131
+
132
+ Parameters:
133
+ -----------
134
+ data : pd.DataFrame
135
+ Data for visualisation
136
+ preprocessing_stages : Dict, optional
137
+ Preprocessing stages information
138
+ filename : str
139
+ Filename for saving
140
+
141
+ Returns:
142
+ --------
143
+ str : path to saved file or None if error
144
+ """
145
+ logger.info("\n" + "="*80)
146
+ logger.info("CREATING SUMMARY DASHBOARD")
147
+ logger.info("="*80)
148
+
149
+ target_col = self.config.target_column
150
+
151
+ try:
152
+ # Create large dashboard
153
+ fig = plt.figure(figsize=(20, 24))
154
+ gs = fig.add_gridspec(6, 4, hspace=0.3, wspace=0.3)
155
+
156
+ # 1. Time series of target variable
157
+ ax1 = fig.add_subplot(gs[0, :2])
158
+ if target_col in data.columns and isinstance(data.index, pd.DatetimeIndex):
159
+ ax1.plot(data.index, data[target_col], linewidth=1, color='blue', alpha=0.7)
160
+ ax1.set_title(f'Time Series: {target_col}', fontsize=12, fontweight='bold')
161
+ ax1.set_xlabel('Date', fontsize=10)
162
+ ax1.set_ylabel(target_col, fontsize=10)
163
+ ax1.grid(True, alpha=0.3)
164
+ ax1.tick_params(axis='x', rotation=45)
165
+ else:
166
+ ax1.text(0.5, 0.5, 'No time series data available',
167
+ ha='center', va='center', transform=ax1.transAxes)
168
+
169
+ # 2. Target variable distribution
170
+ ax2 = fig.add_subplot(gs[0, 2:])
171
+ if target_col in data.columns:
172
+ values = data[target_col].dropna()
173
+ if len(values) > 0:
174
+ ax2.hist(values, bins=30, edgecolor='black', alpha=0.7, color='green')
175
+ ax2.set_title(f'Distribution: {target_col}', fontsize=12, fontweight='bold')
176
+ ax2.set_xlabel(target_col, fontsize=10)
177
+ ax2.set_ylabel('Frequency', fontsize=10)
178
+ ax2.grid(True, alpha=0.3)
179
+ else:
180
+ ax2.text(0.5, 0.5, 'No data for distribution',
181
+ ha='center', va='center', transform=ax2.transAxes)
182
+
183
+ # 3. Correlation matrix (top features)
184
+ ax3 = fig.add_subplot(gs[1, :])
185
+ numeric_cols = data.select_dtypes(include=[np.number]).columns
186
+ if len(numeric_cols) > 1:
187
+ display_cols = list(numeric_cols[:15])
188
+ if target_col not in display_cols and target_col in data.columns:
189
+ display_cols = [target_col] + [c for c in display_cols if c != target_col][:14]
190
+
191
+ corr_matrix = data[display_cols].corr()
192
+ mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
193
+
194
+ im = ax3.imshow(corr_matrix.where(~mask), cmap='coolwarm', vmin=-1, vmax=1, aspect='auto')
195
+ ax3.set_title('Correlation Matrix (Top 15 Features)',
196
+ fontsize=12, fontweight='bold')
197
+ ax3.set_xticks(range(len(display_cols)))
198
+ ax3.set_yticks(range(len(display_cols)))
199
+ ax3.set_xticklabels(display_cols, rotation=90, fontsize=8)
200
+ ax3.set_yticklabels(display_cols, fontsize=8)
201
+ plt.colorbar(im, ax=ax3, shrink=0.8)
202
+
203
+ # 4. Seasonal patterns
204
+ ax4 = fig.add_subplot(gs[2, :2])
205
+ if target_col in data.columns and isinstance(data.index, pd.DatetimeIndex):
206
+ data_copy = data.copy()
207
+ data_copy['month'] = data_copy.index.month
208
+
209
+ monthly_avg = data_copy.groupby('month')[target_col].mean()
210
+ colors = plt.cm.Set3(np.linspace(0, 1, len(monthly_avg)))
211
+ ax4.bar(monthly_avg.index, monthly_avg.values, color=colors, edgecolor='black')
212
+ ax4.set_title('Average Values by Month', fontsize=12, fontweight='bold')
213
+ ax4.set_xlabel('Month', fontsize=10)
214
+ ax4.set_ylabel(f'Average {target_col}', fontsize=10)
215
+ ax4.set_xticks(range(1, 13))
216
+ month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
217
+ 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
218
+ ax4.set_xticklabels(month_names)
219
+ ax4.grid(True, alpha=0.3, axis='y')
220
+
221
+ # 5. Weekly patterns
222
+ ax5 = fig.add_subplot(gs[2, 2:])
223
+ if target_col in data.columns and isinstance(data.index, pd.DatetimeIndex):
224
+ data_copy = data.copy()
225
+ data_copy['dayofweek'] = data_copy.index.dayofweek
226
+
227
+ daily_avg = data_copy.groupby('dayofweek')[target_col].mean()
228
+ colors = plt.cm.Paired(np.linspace(0, 1, len(daily_avg)))
229
+ ax5.bar(daily_avg.index, daily_avg.values, color=colors, edgecolor='black')
230
+ ax5.set_title('Average Values by Day of Week', fontsize=12, fontweight='bold')
231
+ ax5.set_xlabel('Day of Week', fontsize=10)
232
+ ax5.set_ylabel(f'Average {target_col}', fontsize=10)
233
+ ax5.set_xticks(range(7))
234
+ ax5.set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
235
+ ax5.grid(True, alpha=0.3, axis='y')
236
+
237
+ # 6. Trend and seasonality
238
+ ax6 = fig.add_subplot(gs[3, :])
239
+ if target_col in data.columns and len(data) > 30:
240
+ try:
241
+ window_size = min(365, len(data) // 10)
242
+ if window_size >= 7:
243
+ rolling_mean = data[target_col].rolling(window=window_size, center=True).mean()
244
+ rolling_std = data[target_col].rolling(window=window_size, center=True).std()
245
+
246
+ ax6.plot(data.index, data[target_col], alpha=0.5,
247
+ label='Original Series', linewidth=0.5, color='blue')
248
+ ax6.plot(rolling_mean.index, rolling_mean,
249
+ label=f'Rolling Mean ({window_size} days)',
250
+ color='red', linewidth=2)
251
+ ax6.fill_between(rolling_mean.index,
252
+ rolling_mean - rolling_std,
253
+ rolling_mean + rolling_std,
254
+ alpha=0.2, color='red')
255
+
256
+ ax6.set_title('Trend and Volatility', fontsize=12, fontweight='bold')
257
+ ax6.set_xlabel('Date', fontsize=10)
258
+ ax6.set_ylabel(target_col, fontsize=10)
259
+ ax6.legend(fontsize=9, loc='upper left')
260
+ ax6.grid(True, alpha=0.3)
261
+ else:
262
+ ax6.text(0.5, 0.5, 'Insufficient data for trend analysis',
263
+ ha='center', va='center', transform=ax6.transAxes)
264
+ except Exception as e:
265
+ logger.warning(f"Error plotting trend: {e}")
266
+ ax6.text(0.5, 0.5, 'Error plotting trend',
267
+ ha='center', va='center', transform=ax6.transAxes)
268
+
269
+ # 7. Preprocessing statistics
270
+ if preprocessing_stages:
271
+ ax7 = fig.add_subplot(gs[4, :2])
272
+
273
+ stages = list(preprocessing_stages.keys())
274
+ values = list(preprocessing_stages.values())
275
+
276
+ colors = plt.cm.viridis(np.linspace(0.3, 0.9, len(stages)))
277
+ bars = ax7.bar(range(len(stages)), values, color=colors, edgecolor='black')
278
+ ax7.set_title('Preprocessing Statistics', fontsize=12, fontweight='bold')
279
+ ax7.set_xlabel('Processing Stage', fontsize=10)
280
+ ax7.set_ylabel('Value', fontsize=10)
281
+ ax7.set_xticks(range(len(stages)))
282
+ ax7.set_xticklabels([s[:15] + '...' if len(s) > 15 else s for s in stages],
283
+ rotation=45, ha='right', fontsize=9)
284
+ ax7.grid(True, alpha=0.3, axis='y')
285
+
286
+ # Add values on bars
287
+ for bar, value in zip(bars, values):
288
+ height = bar.get_height()
289
+ ax7.text(bar.get_x() + bar.get_width()/2., height,
290
+ f'{value:.2f}', ha='center', va='bottom', fontsize=8)
291
+
292
+ # 8. Data information
293
+ ax8 = fig.add_subplot(gs[4, 2:])
294
+ ax8.axis('off')
295
+
296
+ info_text = []
297
+ info_text.append("GENERAL CHARACTERISTICS:")
298
+ info_text.append(f"• Number of records: {len(data):,}")
299
+ info_text.append(f"• Number of features: {len(data.columns)}")
300
+
301
+ if isinstance(data.index, pd.DatetimeIndex):
302
+ info_text.append(f"• Period: {data.index.min().strftime('%Y-%m-%d')} - "
303
+ f"{data.index.max().strftime('%Y-%m-%d')}")
304
+ info_text.append(f"• Days of data: {(data.index.max() - data.index.min()).days}")
305
+
306
+ if target_col in data.columns:
307
+ target_stats = data[target_col].describe()
308
+ info_text.append(f"\nTARGET VARIABLE '{target_col}':")
309
+ info_text.append(f"• Mean: {target_stats['mean']:.2f}")
310
+ info_text.append(f"• Standard deviation: {target_stats['std']:.2f}")
311
+ info_text.append(f"• Minimum: {target_stats['min']:.2f}")
312
+ info_text.append(f"• 25%: {target_stats['25%']:.2f}")
313
+ info_text.append(f"• 50% (median): {target_stats['50%']:.2f}")
314
+ info_text.append(f"• 75%: {target_stats['75%']:.2f}")
315
+ info_text.append(f"• Maximum: {target_stats['max']:.2f}")
316
+
317
+ info_text.append(f"\nDATA TYPES:")
318
+ for dtype, count in data.dtypes.value_counts().items():
319
+ info_text.append(f"• {dtype}: {count} columns")
320
+
321
+ missing_info = data.isnull().sum()
322
+ missing_total = missing_info.sum()
323
+ missing_percent = missing_total / data.size * 100
324
+ info_text.append(f"\nMISSING VALUES:")
325
+ info_text.append(f"• Total missing: {missing_total:,}")
326
+ info_text.append(f"• Missing percentage: {missing_percent:.2f}%")
327
+
328
+ if missing_total > 0:
329
+ top_missing = missing_info.nlargest(5)
330
+ info_text.append(f"• Top 5 columns with missing values:")
331
+ for col, count in top_missing.items():
332
+ percent = count / len(data) * 100
333
+ info_text.append(f" {col}: {count} ({percent:.1f}%)")
334
+
335
+ ax8.text(0.02, 0.98, '\n'.join(info_text), transform=ax8.transAxes,
336
+ fontsize=8, verticalalignment='top', fontfamily='monospace')
337
+
338
+ # 9. Autocorrelation plot
339
+ ax9 = fig.add_subplot(gs[5, :2])
340
+ if target_col in data.columns:
341
+ try:
342
+ series = data[target_col].dropna()
343
+ if len(series) > 50:
344
+ plot_acf(series, lags=min(50, len(series)-1), ax=ax9, alpha=0.05)
345
+ ax9.set_title('Autocorrelation Function (ACF)', fontsize=12, fontweight='bold')
346
+ ax9.set_xlabel('Lag', fontsize=10)
347
+ ax9.set_ylabel('Autocorrelation', fontsize=10)
348
+ ax9.grid(True, alpha=0.3)
349
+ else:
350
+ ax9.text(0.5, 0.5, 'Insufficient data for ACF',
351
+ ha='center', va='center', transform=ax9.transAxes)
352
+ except Exception as e:
353
+ logger.warning(f"Error plotting ACF: {e}")
354
+ ax9.text(0.5, 0.5, 'Error calculating ACF',
355
+ ha='center', va='center', transform=ax9.transAxes)
356
+
357
+ # 10. Partial autocorrelation plot
358
+ ax10 = fig.add_subplot(gs[5, 2:])
359
+ if target_col in data.columns:
360
+ try:
361
+ series = data[target_col].dropna()
362
+ if len(series) > 50:
363
+ plot_pacf(series, lags=min(50, len(series)-1), ax=ax10, alpha=0.05)
364
+ ax10.set_title('Partial Autocorrelation Function (PACF)',
365
+ fontsize=12, fontweight='bold')
366
+ ax10.set_xlabel('Lag', fontsize=10)
367
+ ax10.set_ylabel('Partial Autocorrelation', fontsize=10)
368
+ ax10.grid(True, alpha=0.3)
369
+ else:
370
+ ax10.text(0.5, 0.5, 'Insufficient data for PACF',
371
+ ha='center', va='center', transform=ax10.transAxes)
372
+ except Exception as e:
373
+ logger.warning(f"Error plotting PACF: {e}")
374
+ ax10.text(0.5, 0.5, 'Error calculating PACF',
375
+ ha='center', va='center', transform=ax10.transAxes)
376
+
377
+ plt.suptitle('Data Analysis Summary Dashboard', fontsize=16, fontweight='bold', y=0.98)
378
+ plt.tight_layout()
379
+
380
+ # Save
381
+ filepath = self._save_figure(fig, filename, "summary")
382
+ self.plot_files['summary_dashboard'] = filepath
383
+ return filepath
384
+
385
+ except Exception as e:
386
+ logger.error(f"Error creating summary dashboard: {e}")
387
+ return None
388
+
389
+ # ============================================
390
+ # SPECIFIC METHODS FOR SAVING PIPELINE PLOTS
391
+ # ============================================
392
+
393
+ def save_data_split_plot(self, filename: str = "data_split.png") -> str:
394
+ """
395
+ Save data split plot
396
+
397
+ Parameters:
398
+ -----------
399
+ filename : str
400
+ Filename for saving
401
+
402
+ Returns:
403
+ --------
404
+ str : path to saved file
405
+ """
406
+ try:
407
+ fig = plt.gcf() # Get current figure
408
+ filepath = self._save_figure(fig, filename, "time_series")
409
+ self.plot_files['data_split'] = filepath
410
+ return filepath
411
+ except Exception as e:
412
+ logger.error(f"Error saving data_split plot: {e}")
413
+ return None
414
+
415
+ def save_feature_selection_correlation_plot(self, filename: str = "feature_selection_correlation.png") -> str:
416
+ """
417
+ Save feature selection correlation plot
418
+
419
+ Parameters:
420
+ -----------
421
+ filename : str
422
+ Filename for saving
423
+
424
+ Returns:
425
+ --------
426
+ str : path to saved file
427
+ """
428
+ try:
429
+ fig = plt.gcf() # Get current figure
430
+ filepath = self._save_figure(fig, filename, "correlations")
431
+ self.plot_files['feature_selection_correlation'] = filepath
432
+ return filepath
433
+ except Exception as e:
434
+ logger.error(f"Error saving feature_selection_correlation plot: {e}")
435
+ return None
436
+
437
+ def save_missing_values_analysis_plot(self, filename: str = "missing_values_analysis.png") -> str:
438
+ """
439
+ Save missing values analysis plot
440
+
441
+ Parameters:
442
+ -----------
443
+ filename : str
444
+ Filename for saving
445
+
446
+ Returns:
447
+ --------
448
+ str : path to saved file
449
+ """
450
+ try:
451
+ fig = plt.gcf() # Get current figure
452
+ filepath = self._save_figure(fig, filename, "preprocessing")
453
+ self.plot_files['missing_values_analysis'] = filepath
454
+ return filepath
455
+ except Exception as e:
456
+ logger.error(f"Error saving missing_values_analysis plot: {e}")
457
+ return None
458
+
459
+ def save_outlier_handling_results_plot(self, filename: str = "outlier_handling_results.png") -> str:
460
+ """
461
+ Save outlier handling results plot
462
+
463
+ Parameters:
464
+ -----------
465
+ filename : str
466
+ Filename for saving
467
+
468
+ Returns:
469
+ --------
470
+ str : path to saved file
471
+ """
472
+ try:
473
+ fig = plt.gcf() # Get current figure
474
+ filepath = self._save_figure(fig, filename, "preprocessing")
475
+ self.plot_files['outlier_handling_results'] = filepath
476
+ return filepath
477
+ except Exception as e:
478
+ logger.error(f"Error saving outlier_handling_results plot: {e}")
479
+ return None
480
+
481
+ def save_outliers_analysis_plot(self, filename: str = "outliers_analysis.png") -> str:
482
+ """
483
+ Save outliers analysis plot
484
+
485
+ Parameters:
486
+ -----------
487
+ filename : str
488
+ Filename for saving
489
+
490
+ Returns:
491
+ --------
492
+ str : path to saved file
493
+ """
494
+ try:
495
+ fig = plt.gcf() # Get current figure
496
+ filepath = self._save_figure(fig, filename, "preprocessing")
497
+ self.plot_files['outliers_analysis'] = filepath
498
+ return filepath
499
+ except Exception as e:
500
+ logger.error(f"Error saving outliers_analysis plot: {e}")
501
+ return None
502
+
503
+ def save_scaling_results_plot(self, filename: str = "scaling_results.png") -> str:
504
+ """
505
+ Save scaling results plot
506
+
507
+ Parameters:
508
+ -----------
509
+ filename : str
510
+ Filename for saving
511
+
512
+ Returns:
513
+ --------
514
+ str : path to saved file
515
+ """
516
+ try:
517
+ fig = plt.gcf() # Get current figure
518
+ filepath = self._save_figure(fig, filename, "preprocessing")
519
+ self.plot_files['scaling_results'] = filepath
520
+ return filepath
521
+ except Exception as e:
522
+ logger.error(f"Error saving scaling_results plot: {e}")
523
+ return None
524
+
525
+ def save_stationarity_analysis_plot(self, filename: str = "stationarity_analysis.png") -> str:
526
+ """
527
+ Save stationarity analysis plot
528
+
529
+ Parameters:
530
+ -----------
531
+ filename : str
532
+ Filename for saving
533
+
534
+ Returns:
535
+ --------
536
+ str : path to saved file
537
+ """
538
+ try:
539
+ fig = plt.gcf() # Get current figure
540
+ filepath = self._save_figure(fig, filename, "time_series")
541
+ self.plot_files['stationarity_analysis'] = filepath
542
+ return filepath
543
+ except Exception as e:
544
+ logger.error(f"Error saving stationarity_analysis plot: {e}")
545
+ return None
546
+
547
+ def save_temporal_outliers_plot(self, filename: str = "temporal_outliers.png") -> str:
548
+ """
549
+ Save temporal outliers plot
550
+
551
+ Parameters:
552
+ -----------
553
+ filename : str
554
+ Filename for saving
555
+
556
+ Returns:
557
+ --------
558
+ str : path to saved file
559
+ """
560
+ try:
561
+ fig = plt.gcf() # Get current figure
562
+ filepath = self._save_figure(fig, filename, "time_series")
563
+ self.plot_files['temporal_outliers'] = filepath
564
+ return filepath
565
+ except Exception as e:
566
+ logger.error(f"Error saving temporal_outliers plot: {e}")
567
+ return None
568
+
569
+ # ============================================
570
+ # UNIVERSAL METHOD FOR SAVING ANY PLOT
571
+ # ============================================
572
+
573
+ def save_current_plot(self, filename: str, subdirectory: str = None) -> str:
574
+ """
575
+ Universal method for saving current plot
576
+
577
+ Parameters:
578
+ -----------
579
+ filename : str
580
+ Filename for saving
581
+ subdirectory : str, optional
582
+ Subdirectory for saving
583
+
584
+ Returns:
585
+ --------
586
+ str : path to saved file
587
+ """
588
+ try:
589
+ fig = plt.gcf() # Get current figure
590
+ filepath = self._save_figure(fig, filename, subdirectory)
591
+
592
+ # Save plot information
593
+ plot_key = filename.replace('.png', '').replace('.jpg', '')
594
+ self.plot_files[plot_key] = filepath
595
+
596
+ return filepath
597
+ except Exception as e:
598
+ logger.error(f"Error saving plot {filename}: {e}")
599
+ return None
600
+
601
+ # ============================================
602
+ # ADDITIONAL VISUALISATION METHODS
603
+ # ============================================
604
+
605
+ def create_feature_importance_plot(
606
+ self,
607
+ feature_importance: Dict,
608
+ top_n: int = 20,
609
+ filename: str = "feature_importance"
610
+ ) -> str:
611
+ """
612
+ Create feature importance plot
613
+
614
+ Parameters:
615
+ -----------
616
+ feature_importance : Dict
617
+ Dictionary with feature importance
618
+ top_n : int
619
+ Number of top features to display
620
+ filename : str
621
+ Filename for saving
622
+
623
+ Returns:
624
+ --------
625
+ str : path to saved file or None if error
626
+ """
627
+ if not feature_importance:
628
+ logger.warning("No feature importance data for visualisation")
629
+ return None
630
+
631
+ try:
632
+ # Convert to Series and sort
633
+ importance_series = pd.Series(feature_importance).sort_values(ascending=False)
634
+ top_features = importance_series.head(top_n)
635
+
636
+ # Create plot
637
+ fig, ax = plt.subplots(figsize=(12, 8))
638
+
639
+ y_pos = np.arange(len(top_features))
640
+ colors = plt.cm.plasma(np.linspace(0.2, 0.9, len(top_features)))
641
+
642
+ bars = ax.barh(y_pos, top_features.values, color=colors, edgecolor='black')
643
+ ax.set_yticks(y_pos)
644
+ ax.set_yticklabels(top_features.index, fontsize=10)
645
+ ax.invert_yaxis()
646
+ ax.set_xlabel('Feature Importance', fontsize=11, fontweight='bold')
647
+ ax.set_title(f'Top-{top_n} Most Important Features', fontsize=14, fontweight='bold')
648
+ ax.grid(True, alpha=0.3, axis='x')
649
+
650
+ # Add values on bars
651
+ for i, (bar, value) in enumerate(zip(bars, top_features.values)):
652
+ width = bar.get_width()
653
+ ax.text(width * 1.01, bar.get_y() + bar.get_height()/2,
654
+ f'{value:.4f}', va='center', fontsize=9, fontweight='bold')
655
+
656
+ # Add additional information
657
+ plt.text(0.02, 0.98, f'Total features: {len(importance_series)}',
658
+ transform=fig.transFigure, fontsize=9, verticalalignment='top')
659
+
660
+ plt.tight_layout()
661
+
662
+ # Save
663
+ filepath = self._save_figure(fig, filename, "features")
664
+ self.plot_files['feature_importance'] = filepath
665
+ return filepath
666
+
667
+ except Exception as e:
668
+ logger.error(f"Error creating feature importance plot: {e}")
669
+ return None
670
+
671
+ def create_correlation_heatmap(
672
+ self,
673
+ data: pd.DataFrame,
674
+ top_n: int = 20,
675
+ filename: str = "correlation_heatmap"
676
+ ) -> Tuple[str, Optional[str]]:
677
+ """
678
+ Create correlation heatmap
679
+
680
+ Parameters:
681
+ -----------
682
+ data : pd.DataFrame
683
+ Data for analysis
684
+ top_n : int
685
+ Number of top features to display
686
+ filename : str
687
+ Filename for saving
688
+
689
+ Returns:
690
+ --------
691
+ Tuple[str, Optional[str]]:
692
+ (path to main heatmap, path to target correlation heatmap)
693
+ """
694
+ target_col = self.config.target_column
695
+
696
+ try:
697
+ numeric_cols = data.select_dtypes(include=[np.number]).columns.tolist()
698
+
699
+ if len(numeric_cols) < 2:
700
+ logger.warning("Insufficient numeric features for correlation analysis")
701
+ return None, None
702
+
703
+ # Create two heatmaps
704
+
705
+ # 1. Main correlation heatmap between all features
706
+ main_filepath = self._create_main_correlation_heatmap(data, numeric_cols, top_n, filename)
707
+
708
+ # 2. Target correlation heatmap
709
+ target_filepath = None
710
+ if target_col in data.columns and target_col in numeric_cols:
711
+ target_filepath = self._create_target_correlation_heatmap(data, target_col, numeric_cols, filename)
712
+
713
+ return main_filepath, target_filepath
714
+
715
+ except Exception as e:
716
+ logger.error(f"Error creating correlation heatmap: {e}")
717
+ return None, None
718
+
719
+ def _create_main_correlation_heatmap(
720
+ self,
721
+ data: pd.DataFrame,
722
+ numeric_cols: List[str],
723
+ top_n: int,
724
+ filename: str
725
+ ) -> str:
726
+ """Create main correlation heatmap"""
727
+ # Limit number of features for better readability
728
+ if len(numeric_cols) > top_n:
729
+ # Select features with highest variance
730
+ variances = data[numeric_cols].var().sort_values(ascending=False)
731
+ selected_cols = variances.head(top_n).index.tolist()
732
+ else:
733
+ selected_cols = numeric_cols
734
+
735
+ # Calculate correlation
736
+ corr_matrix = data[selected_cols].corr()
737
+
738
+ fig, ax = plt.subplots(figsize=(14, 12))
739
+
740
+ # Mask for upper triangle
741
+ mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
742
+
743
+ # Create heatmap
744
+ sns.heatmap(
745
+ corr_matrix,
746
+ annot=True,
747
+ fmt='.2f',
748
+ cmap='coolwarm',
749
+ center=0,
750
+ square=True,
751
+ mask=mask,
752
+ cbar_kws={'shrink': 0.8, 'label': 'Correlation Coefficient'},
753
+ linewidths=0.5,
754
+ linecolor='white',
755
+ ax=ax,
756
+ annot_kws={'size': 8}
757
+ )
758
+
759
+ ax.set_title(f'Correlation Matrix Between Features (Top-{top_n})',
760
+ fontsize=14, fontweight='bold', pad=20)
761
+
762
+ plt.tight_layout()
763
+
764
+ # Save
765
+ filepath = self._save_figure(fig, filename, "correlations")
766
+ self.plot_files['correlation_heatmap_main'] = filepath
767
+ return filepath
768
+
769
+ def _create_target_correlation_heatmap(
770
+ self,
771
+ data: pd.DataFrame,
772
+ target_col: str,
773
+ numeric_cols: List[str],
774
+ filename: str
775
+ ) -> str:
776
+ """Create target correlation heatmap"""
777
+ # Calculate correlations with target variable
778
+ correlations = data[numeric_cols].corrwith(data[target_col]).sort_values(key=abs, ascending=False)
779
+
780
+ # Exclude target variable itself
781
+ correlations = correlations[correlations.index != target_col]
782
+
783
+ # Take top 15 features
784
+ top_features = correlations.head(15)
785
+
786
+ fig, ax = plt.subplots(figsize=(10, 8))
787
+
788
+ colors = ['red' if x < 0 else 'green' for x in top_features.values]
789
+ bars = ax.barh(range(len(top_features)), top_features.values, color=colors, edgecolor='black')
790
+
791
+ ax.set_yticks(range(len(top_features)))
792
+ ax.set_yticklabels(top_features.index, fontsize=10)
793
+ ax.invert_yaxis()
794
+ ax.set_xlabel('Correlation Coefficient', fontsize=11, fontweight='bold')
795
+ ax.set_title(f'Feature Correlations with Target Variable "{target_col}"',
796
+ fontsize=14, fontweight='bold', pad=20)
797
+ ax.grid(True, alpha=0.3, axis='x')
798
+ ax.axvline(x=0, color='black', linestyle='-', linewidth=0.5)
799
+
800
+ # Add values on bars
801
+ for bar, value in zip(bars, top_features.values):
802
+ width = bar.get_width()
803
+ ax.text(width + (0.01 if width >= 0 else -0.04),
804
+ bar.get_y() + bar.get_height()/2,
805
+ f'{value:.3f}',
806
+ va='center',
807
+ ha='left' if width >= 0 else 'right',
808
+ fontsize=9,
809
+ fontweight='bold',
810
+ color='black')
811
+
812
+ plt.tight_layout()
813
+
814
+ # Save
815
+ target_filename = f"{filename}_with_target"
816
+ filepath = self._save_figure(fig, target_filename, "correlations")
817
+ self.plot_files['correlation_with_target'] = filepath
818
+ return filepath
819
+
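Despite its name, this helper renders a signed horizontal bar chart, and the ranking is by absolute correlation, so strong negative relationships surface alongside positive ones. The core selection in isolation (note that the `key` argument to `sort_values` requires pandas 1.1+):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
df = pd.DataFrame({"a": rng.normal(size=200), "b": rng.normal(size=200)})
df["target"] = 2.0 * df["a"] - 0.5 * df["b"] + rng.normal(0, 0.1, 200)

corr = df.corrwith(df["target"]).sort_values(key=abs, ascending=False)
corr = corr[corr.index != "target"]  # drop the trivial self-correlation of 1.0
print(corr)  # 'a' first; sorting by |r| ranks a strong negative just as high
```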
820
+ def create_distribution_comparison(
821
+ self,
822
+ original_data: pd.DataFrame,
823
+ processed_data: pd.DataFrame,
824
+ columns: Optional[List[str]] = None,
825
+ max_columns: int = 12,
826
+ filename: str = "distribution_comparison"
827
+ ) -> Optional[str]:
828
+ """
829
+ Compare distributions before and after processing
830
+
831
+ Parameters:
832
+ -----------
833
+ original_data : pd.DataFrame
834
+ Original data
835
+ processed_data : pd.DataFrame
836
+ Processed data
837
+ columns : List[str], optional
838
+ List of columns to compare
839
+ max_columns : int
840
+ Maximum number of columns to display
841
+ filename : str
842
+ Filename for saving
843
+
844
+ Returns:
845
+ --------
846
+ str : path to saved file or None if error
847
+ """
848
+ try:
849
+ if columns is None:
850
+ # Select numeric columns common to both datasets
851
+ numeric_cols_original = original_data.select_dtypes(include=[np.number]).columns
852
+ numeric_cols_processed = processed_data.select_dtypes(include=[np.number]).columns
853
+ common_cols = list(set(numeric_cols_original) & set(numeric_cols_processed))
854
+
855
+ # Sort by variance in original data
856
+ variances = original_data[common_cols].var().sort_values(ascending=False)
857
+ columns = variances.head(max_columns).index.tolist()
858
+
859
+ if not columns:
+ logger.warning("No numeric columns available for distribution comparison")
+ return None
+
+ n_cols = min(4, len(columns))
860
+ n_rows = (len(columns) + n_cols - 1) // n_cols
861
+
862
+ fig, axes = plt.subplots(n_rows, n_cols, figsize=(n_cols * 4, n_rows * 3.5))
863
+ fig.suptitle('Distribution Comparison Before and After Processing',
864
+ fontsize=16, fontweight='bold', y=0.98)
865
+
866
+ # plt.subplots returns a bare Axes for a 1x1 grid; normalise to a
867
+ # flat ndarray so that len() and integer indexing work uniformly below
868
+ axes = np.atleast_1d(axes).ravel()
869
+
870
+ for idx, col in enumerate(columns):
871
+ if idx >= len(axes):
872
+ break
873
+
874
+ ax = axes[idx]
875
+
876
+ if col in original_data.columns and col in processed_data.columns:
877
+ original_values = original_data[col].dropna()
878
+ processed_values = processed_data[col].dropna()
879
+
880
+ if len(original_values) > 0 and len(processed_values) > 0:
881
+ # Use common bins for comparison
882
+ all_values = pd.concat([original_values, processed_values])
883
+ bins = np.histogram_bin_edges(all_values, bins=30)
884
+
885
+ # Histograms
886
+ ax.hist(original_values, bins=bins, alpha=0.5,
887
+ label='Before Processing', density=True, color='blue')
888
+ ax.hist(processed_values, bins=bins, alpha=0.5,
889
+ label='After Processing', density=True, color='orange')
890
+
891
+ # Add KDE
892
+ try:
893
+ if len(original_values) > 10:
894
+ kde_original = gaussian_kde(original_values)
895
+ x_range = np.linspace(original_values.min(), original_values.max(), 100)
896
+ ax.plot(x_range, kde_original(x_range), 'b-', linewidth=1.5, alpha=0.8)
897
+
898
+ if len(processed_values) > 10:
899
+ kde_processed = gaussian_kde(processed_values)
900
+ x_range = np.linspace(processed_values.min(), processed_values.max(), 100)
901
+ ax.plot(x_range, kde_processed(x_range), 'orange', linewidth=1.5, alpha=0.8)
902
+ except Exception:
903
+ pass  # KDE can fail on degenerate (e.g. constant) samples; skip the overlay
904
+
905
+ # Add statistics
906
+ stats_text = []
907
+ if len(original_values) > 0:
908
+ stats_text.append(f"Before: μ={original_values.mean():.2f}, σ={original_values.std():.2f}")
909
+ if len(processed_values) > 0:
910
+ stats_text.append(f"After: μ={processed_values.mean():.2f}, σ={processed_values.std():.2f}")
911
+
912
+ if stats_text:
913
+ ax.text(0.02, 0.98, '\n'.join(stats_text),
914
+ transform=ax.transAxes, fontsize=8,
915
+ verticalalignment='top',
916
+ bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
917
+
918
+ ax.set_title(f'{col}', fontsize=11, fontweight='bold')
919
+ ax.set_xlabel('Value', fontsize=9)
920
+ ax.set_ylabel('Density', fontsize=9)
921
+ ax.legend(fontsize=8)
922
+ ax.grid(True, alpha=0.3)
923
+ else:
924
+ ax.text(0.5, 0.5, 'No data',
925
+ ha='center', va='center', transform=ax.transAxes)
926
+ else:
927
+ ax.text(0.5, 0.5, 'Column not found',
928
+ ha='center', va='center', transform=ax.transAxes)
929
+
930
+ # Hide unused subplots
931
+ for idx in range(len(columns), len(axes)):
932
+ axes[idx].set_visible(False)
933
+
934
+ plt.tight_layout()
935
+
936
+ # Save
937
+ filepath = self._save_figure(fig, filename, "distributions")
938
+ self.plot_files['distribution_comparison'] = filepath
939
+ return filepath
940
+
941
+ except Exception as e:
942
+ logger.error(f"Error creating distribution comparison: {e}")
943
+ return None
944
+
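A minimal sketch of how this comparison might be driven, again assuming an existing `viz` instance; the "processing" here is a plain standardisation, purely for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
original = pd.DataFrame({
    "flow": rng.normal(100, 20, 500),
    "pressure": rng.normal(5, 1, 500),
})
processed = (original - original.mean()) / original.std()  # stand-in for the real pipeline

# `viz` is an assumed, already-constructed visualizer instance.
path = viz.create_distribution_comparison(original, processed, columns=["flow", "pressure"])
```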
945
+ def create_time_series_decomposition_plot(
946
+ self,
947
+ decomposition_result: Dict,
948
+ filename: str = "time_series_decomposition"
949
+ ) -> Optional[str]:
950
+ """
951
+ Visualise time series decomposition
952
+
953
+ Parameters:
954
+ -----------
955
+ decomposition_result : Dict
956
+ Decomposition results
957
+ filename : str
958
+ Filename for saving
959
+
960
+ Returns:
961
+ --------
962
+ str : path to saved file or None if error
963
+ """
964
+ target_col = self.config.target_column
965
+
966
+ try:
967
+ fig, axes = plt.subplots(4, 1, figsize=(14, 10))
968
+ fig.suptitle(f'Time Series Decomposition: {target_col}',
969
+ fontsize=16, fontweight='bold', y=0.98)
970
+
971
+ # Original series
972
+ if 'observed' in decomposition_result:
973
+ observed = decomposition_result['observed']
974
+ axes[0].plot(observed, color='blue', linewidth=1.5)
975
+ axes[0].set_ylabel('Observed', fontsize=11, fontweight='bold')
976
+ axes[0].grid(True, alpha=0.3)
977
+ axes[0].set_title('Original Time Series', fontsize=12)
978
+
979
+ # Trend
980
+ if 'trend' in decomposition_result and decomposition_result['trend'] is not None:
981
+ trend = decomposition_result['trend']
982
+ axes[1].plot(trend, color='red', linewidth=2)
983
+ axes[1].set_ylabel('Trend', fontsize=11, fontweight='bold')
984
+ axes[1].grid(True, alpha=0.3)
985
+ axes[1].set_title('Trend Component', fontsize=12)
986
+
987
+ # Seasonality
988
+ if 'seasonal' in decomposition_result and decomposition_result['seasonal'] is not None:
989
+ seasonal = decomposition_result['seasonal']
990
+ axes[2].plot(seasonal, color='green', linewidth=1.5)
991
+ axes[2].set_ylabel('Seasonal', fontsize=11, fontweight='bold')
992
+ axes[2].grid(True, alpha=0.3)
993
+ axes[2].set_title('Seasonal Component', fontsize=12)
994
+
995
+ # Residuals
996
+ if 'residual' in decomposition_result and decomposition_result['residual'] is not None:
997
+ residual = decomposition_result['residual']
998
+ axes[3].plot(residual, color='purple', linewidth=1, alpha=0.7)
999
+ axes[3].set_ylabel('Residuals', fontsize=11, fontweight='bold')
1000
+ axes[3].set_xlabel('Date', fontsize=11, fontweight='bold')
1001
+ axes[3].grid(True, alpha=0.3)
1002
+ axes[3].set_title('Residual Component', fontsize=12)
1003
+
1004
+ # Add residual statistics
1005
+ if len(residual) > 0:
1006
+ stats_text = (f"Mean: {residual.mean():.4f}\n"
1007
+ f"Std: {residual.std():.4f}\n"
1008
+ f"Min: {residual.min():.4f}\n"
1009
+ f"Max: {residual.max():.4f}")
1010
+ axes[3].text(0.02, 0.98, stats_text, transform=axes[3].transAxes,
1011
+ fontsize=8, verticalalignment='top',
1012
+ bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
1013
+
1014
+ plt.tight_layout()
1015
+
1016
+ # Save
1017
+ filepath = self._save_figure(fig, filename, "time_series")
1018
+ self.plot_files['time_series_decomposition'] = filepath
1019
+ return filepath
1020
+
1021
+ except Exception as e:
1022
+ logger.error(f"Error creating time series decomposition: {e}")
1023
+ return None
1024
+
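The method only needs a plain dict with 'observed', 'trend', 'seasonal', and 'residual' keys. One way to produce such a dict, an assumption since the project's own decomposition step lives elsewhere, is statsmodels' `seasonal_decompose`:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

idx = pd.date_range("2024-01-01", periods=120, freq="D")
values = 10 + np.sin(np.arange(120) * 2 * np.pi / 7) + np.random.default_rng(3).normal(0, 0.3, 120)
series = pd.Series(values, index=idx)

result = seasonal_decompose(series, model="additive", period=7)
decomposition_result = {
    "observed": result.observed,
    "trend": result.trend,      # NaN-padded at the edges by the moving average
    "seasonal": result.seasonal,
    "residual": result.resid,
}
# path = viz.create_time_series_decomposition_plot(decomposition_result)  # assumed instance
```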
1025
+ def create_data_quality_report(
1026
+ self,
1027
+ validation_results: Dict,
1028
+ filename: str = "data_quality_report"
1029
+ ) -> Optional[str]:
1030
+ """
1031
+ Create visual data quality report
1032
+
1033
+ Parameters:
1034
+ -----------
1035
+ validation_results : Dict
1036
+ Validation results
1037
+ filename : str
1038
+ Filename for saving
1039
+
1040
+ Returns:
1041
+ --------
1042
+ str : path to saved file or None if error
1043
+ """
1044
+ try:
1045
+ fig = plt.figure(figsize=(16, 12))
1046
+ fig.suptitle('Data Quality Report', fontsize=18, fontweight='bold', y=0.98)
1047
+
1048
+ # Use GridSpec for more complex layout
1049
+ gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)
1050
+
1051
+ # 1. Quality radar chart (top left)
1052
+ ax1 = fig.add_subplot(gs[0, 0], projection='polar')
1053
+
1054
+ categories = ['Size', 'Missing', 'Duplicates', 'Stability', 'Informativeness']
1055
+
1056
+ # Extract values from validation results
1057
+ if 'quality_metrics' in validation_results:
1058
+ values = [
1059
+ validation_results['quality_metrics'].get('size_score', 0.5),
1060
+ validation_results['quality_metrics'].get('missing_score', 0.5),
1061
+ validation_results['quality_metrics'].get('duplicates_score', 0.5),
1062
+ validation_results['quality_metrics'].get('stability_score', 0.5),
1063
+ validation_results['quality_metrics'].get('informativeness_score', 0.5)
1064
+ ]
1065
+ else:
1066
+ values = [0.8, 0.7, 0.9, 0.6, 0.8]  # neutral placeholders when quality_metrics is absent
1067
+
1068
+ N = len(categories)
1069
+ angles = [n / float(N) * 2 * np.pi for n in range(N)]
1070
+ angles += angles[:1]
1071
+ values += values[:1]
1072
+
1073
+ ax1.plot(angles, values, 'o-', linewidth=2, color='blue')
1074
+ ax1.fill(angles, values, alpha=0.25, color='blue')
1075
+ ax1.set_xticks(angles[:-1])
1076
+ ax1.set_xticklabels(categories, fontsize=10)
1077
+ ax1.set_ylim(0, 1)
1078
+ ax1.set_title('Data Quality Radar Chart', fontsize=12, fontweight='bold')
1079
+ ax1.grid(True)
1080
+
1081
+ # 2. Check status (top centre)
1082
+ ax2 = fig.add_subplot(gs[0, 1])
1083
+
1084
+ basic_checks = validation_results.get('basic_checks', {})
1085
+ checks_passed = sum(1 for check in basic_checks.values() if check.get('passed', False))
1086
+ checks_total = len(basic_checks)
1087
+ checks_failed = checks_total - checks_passed
1088
+
1089
+ if checks_total > 0:
1090
+ # Zero-height bars are invisible, so fixed colours are sufficient here
1091
+ colors = ['#4CAF50', '#FF6B6B']  # green for passed, red for failed
1092
+ bars = ax2.bar(['Passed', 'Failed'],
1093
+ [checks_passed, checks_failed],
1094
+ color=colors, edgecolor='black')
1095
+
1096
+ ax2.set_title(f'Basic Checks: {checks_passed}/{checks_total}',
1097
+ fontsize=12, fontweight='bold')
1098
+ ax2.set_ylabel('Number of Checks', fontsize=10)
1099
+ ax2.grid(True, alpha=0.3, axis='y')
1100
+
1101
+ # Add values on bars
1102
+ for bar, value in zip(bars, [checks_passed, checks_failed]):
1103
+ height = bar.get_height()
1104
+ ax2.text(bar.get_x() + bar.get_width()/2., height,
1105
+ f'{value}', ha='center', va='bottom', fontsize=10, fontweight='bold')
1106
+ else:
1107
+ ax2.text(0.5, 0.5, 'No check data available',
1108
+ ha='center', va='center', transform=ax2.transAxes)
1109
+ ax2.set_title('Basic Checks', fontsize=12, fontweight='bold')
1110
+
1111
+ # 3. Overall score (top right)
1112
+ ax3 = fig.add_subplot(gs[0, 2])
1113
+
1114
+ overall_score = validation_results.get('overall_score', 0)
1115
+ status = validation_results.get('status', 'UNKNOWN')
1116
+
1117
+ # Score pie chart
1118
+ sizes = [overall_score, 100 - overall_score]
1119
+
1120
+ if overall_score >= 80:
1121
+ colors = ['#4CAF50', '#E0E0E0'] # Green
1122
+ elif overall_score >= 60:
1123
+ colors = ['#FFC107', '#E0E0E0'] # Yellow
1124
+ else:
1125
+ colors = ['#F44336', '#E0E0E0'] # Red
1126
+
1127
+ wedges, texts, autotexts = ax3.pie(sizes, colors=colors, startangle=90,
1128
+ autopct='%1.1f%%', pctdistance=0.85)
1129
+
1130
+ # Central text
1131
+ status_colors = {'PASS': '#4CAF50', 'WARNING': '#FFC107', 'FAIL': '#F44336'}
1132
+ status_color = status_colors.get(status, '#757575')
1133
+
1134
+ ax3.text(0, 0, f'{overall_score}/100\n{status}',
1135
+ ha='center', va='center', fontsize=14, fontweight='bold',
1136
+ color=status_color)
1137
+ ax3.set_title('Overall Quality Score', fontsize=12, fontweight='bold')
1138
+
1139
+ # 4. Issue distribution by type (left middle)
1140
+ ax4 = fig.add_subplot(gs[1, 0])
1141
+
1142
+ issues = validation_results.get('issues', {})
1143
+ issue_counts = {
1144
+ 'Critical': len(issues.get('critical', [])),
1145
+ 'Warnings': len(issues.get('warning', [])),
1146
+ 'Informational': len(issues.get('info', []))
1147
+ }
1148
+
1149
+ if any(issue_counts.values()):
1150
+ colors = ['#F44336', '#FF9800', '#2196F3']
1151
+ bars = ax4.bar(issue_counts.keys(), issue_counts.values(),
1152
+ color=colors, edgecolor='black')
1153
+
1154
+ ax4.set_title('Data Issues by Type', fontsize=12, fontweight='bold')
1155
+ ax4.set_ylabel('Number of Issues', fontsize=10)
1156
+ ax4.tick_params(axis='x', rotation=45)
1157
+ ax4.grid(True, alpha=0.3, axis='y')
1158
+
1159
+ # Add values on bars
1160
+ for bar, value in zip(bars, issue_counts.values()):
1161
+ height = bar.get_height()
1162
+ ax4.text(bar.get_x() + bar.get_width()/2., height,
1163
+ f'{value}', ha='center', va='bottom', fontsize=10, fontweight='bold')
1164
+ else:
1165
+ ax4.text(0.5, 0.5, 'No issues detected',
1166
+ ha='center', va='center', transform=ax4.transAxes, fontsize=12)
1167
+ ax4.set_title('Data Issues', fontsize=12, fontweight='bold')
1168
+
1169
+ # 5. Detailed information (remaining cells)
1170
+ ax5 = fig.add_subplot(gs[1:, 1:])
1171
+ ax5.axis('off')
1172
+
1173
+ # Form text report
1174
+ report_text = []
1175
+ report_text.append("DETAILED REPORT:")
1176
+ report_text.append("=" * 40)
1177
+
1178
+ # Basic information
1179
+ report_text.append("\nBASIC INFORMATION:")
1180
+ report_text.append(f"• Overall score: {overall_score}/100")
1181
+ report_text.append(f"• Status: {status}")
1182
+ report_text.append(f"• Checks passed: {checks_passed}/{checks_total}")
1183
+
1184
+ # Check details
1185
+ if basic_checks:
1186
+ report_text.append("\nCHECK DETAILS:")
1187
+ for check_name, check_result in basic_checks.items():
1188
+ status_icon = "✓" if check_result.get('passed', False) else "✗"
1189
+ report_text.append(f"• {status_icon} {check_name}: {check_result.get('message', '')}")
1190
+
1191
+ # Issues
1192
+ if any(issue_counts.values()):
1193
+ report_text.append("\nDETECTED ISSUES:")
1194
+
1195
+ if issue_counts['Critical'] > 0:
1196
+ report_text.append("\nCRITICAL:")
1197
+ for issue in issues.get('critical', []):
1198
+ report_text.append(f" • {issue}")
1199
+
1200
+ if issue_counts['Warnings'] > 0:
1201
+ report_text.append("\nWARNINGS:")
1202
+ for issue in issues.get('warning', []):
1203
+ report_text.append(f" • {issue}")
1204
+
1205
+ if issue_counts['Informational'] > 0:
1206
+ report_text.append("\nINFORMATIONAL:")
1207
+ for issue in issues.get('info', []):
1208
+ report_text.append(f" • {issue}")
1209
+
1210
+ # Recommendations
1211
+ recommendations = validation_results.get('recommendations', [])
1212
+ if recommendations:
1213
+ report_text.append("\nRECOMMENDATIONS:")
1214
+ for i, rec in enumerate(recommendations, 1):
1215
+ report_text.append(f"{i}. {rec}")
1216
+
1217
+ ax5.text(0.02, 0.98, '\n'.join(report_text), transform=ax5.transAxes,
1218
+ fontsize=9, verticalalignment='top', fontfamily='monospace',
1219
+ bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.1))
1220
+
1221
+ plt.tight_layout()
1222
+
1223
+ # Save
1224
+ filepath = self._save_figure(fig, filename, "reports")
1225
+ self.plot_files['data_quality_report'] = filepath
1226
+ return filepath
1227
+
1228
+ except Exception as e:
1229
+ logger.error(f"Error creating data quality report: {e}")
1230
+ return None
1231
+
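The report pulls loosely-coupled keys out of `validation_results` with `.get()`, so the expected shape is only implicit. A minimal dict that exercises every panel (key names read off the code above; the real producer sits elsewhere in the pipeline):

```python
validation_results = {
    "overall_score": 72,
    "status": "WARNING",
    "quality_metrics": {
        "size_score": 0.9, "missing_score": 0.6, "duplicates_score": 1.0,
        "stability_score": 0.5, "informativeness_score": 0.7,
    },
    "basic_checks": {
        "has_rows": {"passed": True, "message": "500 rows found"},
        "target_present": {"passed": False, "message": "target column missing"},
    },
    "issues": {
        "critical": [],
        "warning": ["3% missing values in 'ph'"],
        "info": ["index is not monotonic"],
    },
    "recommendations": ["Impute missing 'ph' values before training."],
}
# path = viz.create_data_quality_report(validation_results)  # assumed instance
```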
1232
+ # ============================================
1233
+ # METHODS FOR BATCH SAVING
1234
+ # ============================================
1235
+
1236
+ def save_all_preprocessing_plots(self) -> Dict[str, str]:
1237
+ """
1238
+ Save all preprocessing plots from current session
1239
+
1240
+ Returns:
1241
+ --------
1242
+ Dict[str, str] : dictionary with paths to saved plots
1243
+ """
1244
+ logger.info("Saving all preprocessing plots...")
1245
+
1246
+ plots_saved = {}
1247
+
1248
+ # Get all open figures
1249
+ figure_numbers = plt.get_fignums()
1250
+
1251
+ if not figure_numbers:
1252
+ logger.warning("No open plots to save")
1253
+ return plots_saved
1254
+
1255
+ # Save each plot
1256
+ for fig_num in figure_numbers:
1257
+ fig = plt.figure(fig_num)
1258
+ filename = f"preprocessing_plot_{fig_num}.png"
1259
+ filepath = self._save_figure(fig, filename, "preprocessing")
1260
+ if filepath:
1261
+ plots_saved[f"plot_{fig_num}"] = filepath
1262
+
1263
+ logger.info(f"Saved {len(plots_saved)} preprocessing plots")
1264
+ return plots_saved
1265
+
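Because this sweeps `plt.get_fignums()`, it only captures figures that are still open; anything already released with `plt.close()` is not recoverable. A short sketch, with `viz` again an assumed instance:

```python
import matplotlib.pyplot as plt

plt.figure(); plt.plot([1, 2, 3])      # figure 1
plt.figure(); plt.hist([1, 1, 2, 3])   # figure 2

saved = viz.save_all_preprocessing_plots()
print(saved)  # e.g. {'plot_1': '.../preprocessing/...', 'plot_2': '.../preprocessing/...'}
```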
1266
+ def create_all_visualizations(
1267
+ self,
1268
+ data: pd.DataFrame,
1269
+ processed_data: Optional[pd.DataFrame] = None,
1270
+ feature_importance: Optional[Dict] = None,
1271
+ decomposition_result: Optional[Dict] = None,
1272
+ validation_results: Optional[Dict] = None,
1273
+ preprocessing_stages: Optional[Dict] = None
1274
+ ) -> Dict[str, str]:
1275
+ """
1276
+ Create all visualisations in one call
1277
+
1278
+ Parameters:
1279
+ -----------
1280
+ data : pd.DataFrame
1281
+ Original data
1282
+ processed_data : pd.DataFrame, optional
1283
+ Processed data
1284
+ feature_importance : Dict, optional
1285
+ Feature importance
1286
+ decomposition_result : Dict, optional
1287
+ Decomposition results
1288
+ validation_results : Dict, optional
1289
+ Validation results
1290
+ preprocessing_stages : Dict, optional
1291
+ Preprocessing stages
1292
+
1293
+ Returns:
1294
+ --------
1295
+ Dict[str, str] : dictionary with paths to created plots
1296
+ """
1297
+ logger.info("\n" + "="*80)
1298
+ logger.info("STARTING ALL VISUALISATIONS CREATION")
1299
+ logger.info("="*80)
1300
+
1301
+ result_files = {}
1302
+
1303
+ # 1. Summary dashboard
1304
+ if data is not None:
1305
+ logger.info("Creating summary dashboard...")
1306
+ summary_path = self.create_summary_dashboard(data, preprocessing_stages)
1307
+ if summary_path:
1308
+ result_files['summary'] = summary_path
1309
+
1310
+ # 2. Correlation heatmaps
1311
+ if data is not None:
1312
+ logger.info("Creating correlation heatmaps...")
1313
+ main_corr, target_corr = self.create_correlation_heatmap(data)
1314
+ if main_corr:
1315
+ result_files['correlation_main'] = main_corr
1316
+ if target_corr:
1317
+ result_files['correlation_target'] = target_corr
1318
+
1319
+ # 3. Distribution comparison
1320
+ if data is not None and processed_data is not None:
1321
+ logger.info("Creating distribution comparison...")
1322
+ dist_path = self.create_distribution_comparison(data, processed_data)
1323
+ if dist_path:
1324
+ result_files['distribution'] = dist_path
1325
+
1326
+ # 4. Feature importance
1327
+ if feature_importance:
1328
+ logger.info("Creating feature importance plot...")
1329
+ feat_path = self.create_feature_importance_plot(feature_importance)
1330
+ if feat_path:
1331
+ result_files['feature_importance'] = feat_path
1332
+
1333
+ # 5. Time series decomposition
1334
+ if decomposition_result:
1335
+ logger.info("Creating time series decomposition...")
1336
+ decomp_path = self.create_time_series_decomposition_plot(decomposition_result)
1337
+ if decomp_path:
1338
+ result_files['decomposition'] = decomp_path
1339
+
1340
+ # 6. Data quality report
1341
+ if validation_results:
1342
+ logger.info("Creating data quality report...")
1343
+ quality_path = self.create_data_quality_report(validation_results)
1344
+ if quality_path:
1345
+ result_files['quality_report'] = quality_path
1346
+
1347
+ # Save information about all plots
1348
+ self.save_plots_info()
1349
+
1350
+ logger.info("\n" + "="*80)
1351
+ logger.info("VISUALISATIONS SUCCESSFULLY CREATED")
1352
+ logger.info("="*80)
1353
+
1354
+ for plot_name, plot_path in result_files.items():
1355
+ if plot_path:
1356
+ logger.info(f"✓ {plot_name}: {plot_path}")
1357
+
1358
+ return result_files
1359
+
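An end-to-end sketch tying the pieces together, reusing `original`, `processed`, and `validation_results` from the sketches above. Only `data` is required; each optional argument simply enables its corresponding plot:

```python
files = viz.create_all_visualizations(
    data=original,                          # raw frame
    processed_data=processed,               # enables the distribution comparison
    validation_results=validation_results,  # enables the quality report
)
for name, path in files.items():
    print(f"{name}: {path}")
```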
1360
+ def get_all_plots(self) -> Dict:
1361
+ """Get information about all created plots"""
1362
+ return self.plot_files
1363
+
1364
+ def save_plots_info(self, filename: str = "plots_info.json") -> None:
1365
+ """Save plot information to JSON file"""
1366
+ try:
1367
+ plots_info = {
1368
+ 'total_plots': len(self.plot_files),
1369
+ 'plots': self.plot_files,
1370
+ 'directories': {
1371
+ 'correlations': self.correlations_dir,
1372
+ 'distributions': self.distributions_dir,
1373
+ 'features': self.features_dir,
1374
+ 'time_series': self.time_series_dir,
1375
+ 'preprocessing': self.preprocessing_dir,
1376
+ 'summary': self.summary_dir,
1377
+ 'reports': self.reports_dir
1378
+ },
1379
+ 'generation_time': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
1380
+ 'config': {
1381
+ 'target_column': self.config.target_column,
1382
+ 'results_dir': self.config.results_dir
1383
+ }
1384
+ }
1385
+
1386
+ filepath = os.path.join(self.reports_dir, filename)
1387
+
1388
+ with open(filepath, 'w', encoding='utf-8') as f:
1389
+ json.dump(plots_info, f, indent=4, ensure_ascii=False, default=str)
1390
+
1391
+ logger.info(f"✓ Plot information saved: {filepath}")
1392
+
1393
+ except Exception as e:
1394
+ logger.error(f"✗ Error saving plot information: {e}")
1395
+
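The manifest is ordinary JSON, so downstream tooling (or a Streamlit page) can recover every plot path without touching matplotlib. Reading it back, under the same `viz` assumption:

```python
import json
import os

with open(os.path.join(viz.reports_dir, "plots_info.json"), encoding="utf-8") as f:
    manifest = json.load(f)

print(manifest["total_plots"], "plots generated at", manifest["generation_time"])
```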
1396
+ def move_existing_plots(self, source_dir: Optional[str] = None) -> Dict[str, str]:
1397
+ """
1398
+ Move existing plots from specified directory to structured folders
1399
+
1400
+ Parameters:
1401
+ -----------
1402
+ source_dir : str, optional
1403
+ Directory with existing plots
1404
+
1405
+ Returns:
1406
+ --------
1407
+ Dict[str, str] : dictionary with information about moved files
1408
+ """
1409
+ if source_dir is None:
1410
+ source_dir = self.plots_dir
1411
+
1412
+ if not os.path.exists(source_dir):
1413
+ logger.warning(f"Source directory doesn't exist: {source_dir}")
1414
+ return {}
1415
+
1416
+ # File to folder mapping
1417
+ file_to_folder_map = {
1418
+ # Time series
1419
+ 'data_split.png': 'time_series',
1420
+ 'stationarity_raskhodvoda.png': 'time_series',
1421
+ 'stationarity_analysis.png': 'time_series',
1422
+ 'temporal_outliers.png': 'time_series',
1423
+
1424
+ # Correlations
1425
+ 'feature_selection_correlation.png': 'correlations',
1426
+
1427
+ # Preprocessing
1428
+ 'missing_values_analysis.png': 'preprocessing',
1429
+ 'outlier_handling_results.png': 'preprocessing',
1430
+ 'outliers_analysis.png': 'preprocessing',
1431
+ 'scaling_results.png': 'preprocessing',
1432
+
1433
+ # Default
1434
+ 'default': 'summary'
1435
+ }
1436
+
1437
+ moved_files = {}
1438
+
1439
+ for filename in os.listdir(source_dir):
1440
+ if filename.endswith('.png'):
1441
+ source_path = os.path.join(source_dir, filename)
1442
+
1443
+ # Determine destination folder
1444
+ target_folder = file_to_folder_map.get(filename, file_to_folder_map['default'])
1445
+ target_dir = os.path.join(self.plots_dir, target_folder)
1446
+
1447
+ # Create destination folder if doesn't exist
1448
+ os.makedirs(target_dir, exist_ok=True)
1449
+
1450
+ # Target path
1451
+ target_path = os.path.join(target_dir, filename)
1452
+
1453
+ try:
1454
+ # Move file (os.rename assumes source and target share a filesystem)
1455
+ os.rename(source_path, target_path)
1456
+ moved_files[filename] = target_path
1457
+ logger.info(f"Moved: {filename} -> {target_folder}/")
1458
+ except Exception as e:
1459
+ logger.error(f"Error moving {filename}: {e}")
1460
+
1461
+ logger.info(f"Moved {len(moved_files)} files")
1462
+ return moved_files
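Usage is a one-liner; note that `os.rename` only works within a single filesystem, so `shutil.move` would be the drop-in replacement if the results directory ever sits on another mount. The explicit-directory form below uses a hypothetical path:

```python
moved = viz.move_existing_plots()                     # defaults to viz.plots_dir
moved = viz.move_existing_plots("old_results/plots")  # hypothetical source directory
print(f"Reorganised {len(moved)} files")
```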