griffingoodwin04 committed on
Commit
c14af1c
·
1 Parent(s): 0f9cb89

Add data processing pipeline (process_data_pipeline.py)

data/README_pipeline.md ADDED
@@ -0,0 +1,144 @@
+ # Data Processing Pipeline
+
+ This directory contains a comprehensive data processing pipeline for solar flare data analysis.
+
+ ## Pipeline Scripts
+
+ ### Main Orchestrator
+ - **`process_data_pipeline.py`** - Main orchestrator script that runs all processing steps in sequence
+
+ ### Individual Processing Steps
+ 1. **`euv_data_cleaning.py`** - Removes bad AIA files based on timestamp validation
+ 2. **`iti_data_processing.py`** - Processes good AIA data using ITI methods
+ 3. **`align_data.py`** - Concatenates GOES data and checks for missing data
+
+ ### Configuration
+ - **`pipeline_config.py`** - Configuration system for all directory paths and settings
+ - **`pipeline_config_template.yaml`** - YAML template for creating custom configurations
+ - **`pipeline_config_template.py`** - Python template (kept for backward compatibility)
+
+ ## Usage
+
+ ### Run the Complete Pipeline
+ ```bash
+ # Run all steps (skipping completed ones) with the default configuration
+ python process_data_pipeline.py
+
+ # Force rerun of all steps
+ python process_data_pipeline.py --force
+
+ # Use a custom configuration file (YAML or Python)
+ python process_data_pipeline.py --config my_config.yaml
+
+ # Display the current configuration
+ python process_data_pipeline.py --show-config
+
+ # Validate configuration paths
+ python process_data_pipeline.py --validate
+
+ # Create a YAML configuration template
+ python process_data_pipeline.py --create-template
+ ```
+
+ ### Run Individual Steps
+ ```bash
+ # EUV data cleaning
+ python euv_data_cleaning.py
+
+ # ITI data processing
+ python iti_data_processing.py
+
+ # Data alignment
+ python align_data.py
+ ```
+
+ ## Pipeline Features
+
+ - **Modular Configuration**: Easy to change directory paths and settings
+ - **Automatic Step Skipping**: Steps that are already completed are skipped automatically
+ - **Comprehensive Logging**: All operations are logged to both the console and a file (`data_processing_pipeline.log`)
+ - **Error Handling**: The pipeline stops on the first error with detailed error reporting
+ - **Progress Tracking**: Real-time progress updates and timing information
+ - **Configuration Validation**: Check that all required paths exist before running
+ - **Template Generation**: Create custom configuration templates easily
+
+ ## Configuration
+
+ ### Default Directories
+
+ The pipeline uses the following default directories (all configurable):
+
+ **Input Directories:**
+ - `/mnt/data/AUGUST/SDO-AIA-timespan/` - Raw AIA data
+ - `/mnt/data/NEW-FLARE/combined/` - GOES CSV files
+ - `/mnt/data/NEW-FLARE/AIA_processed/` - Processed AIA files
+
+ **Output Directories:**
+ - `/mnt/data/AUGUST/SDO-AIA_bad/` - Bad AIA files (from EUV cleaning)
+ - `/mnt/data/AUGUST/AIA_ITI/` - Processed AIA data (from ITI processing)
+ - `/mnt/data/NEW-FLARE/GOES-SXR-A/` - GOES SXR-A data (from alignment)
+ - `/mnt/data/NEW-FLARE/GOES-SXR-B/` - GOES SXR-B data (from alignment)
+ - `/mnt/data/NEW-FLARE/AIA_ITI_MISSING/` - AIA files with missing GOES data
+
+ ### Custom Configuration
+
+ To use custom directories:
+
+ 1. **Create a YAML configuration file:**
+    ```bash
+    python process_data_pipeline.py --create-template
+    # Edit pipeline_config_template.yaml with your paths
+    ```
+
+ 2. **Use your custom configuration:**
+    ```bash
+    python process_data_pipeline.py --config my_config.yaml
+    ```
+
+ 3. **Validate your configuration:**
+    ```bash
+    python process_data_pipeline.py --config my_config.yaml --validate
+    ```
+
+ **YAML Configuration Example:**
+ ```yaml
+ # Data Processing Pipeline Configuration
+ base_data_dir: /your/data/path
+
+ euv:
+   input_folder: /your/data/AIA-raw
+   bad_files_dir: /your/data/AIA-bad
+   wavelengths: [94, 131, 171, 193, 211, 304]
+
+ iti:
+   input_folder: /your/data/AIA-raw
+   output_folder: /your/data/AIA-processed
+   wavelengths: [94, 131, 171, 193, 211, 304]
+
+ alignment:
+   goes_data_dir: /your/data/GOES
+   aia_processed_dir: /your/data/AIA-processed
+   output_sxr_a_dir: /your/data/GOES-SXR-A
+   output_sxr_b_dir: /your/data/GOES-SXR-B
+   aia_missing_dir: /your/data/AIA-missing
+
+ processing:
+   max_processes: null  # null = use all CPU cores
+   batch_size_multiplier: 4
+   min_batch_size: 1
+ ```
+
+ ## Requirements
+
+ - Python 3.6+
+ - Required packages: pandas, numpy, astropy, tqdm, pyyaml (`multiprocessing` is part of the standard library)
+ - Sufficient disk space for data processing
+ - Access to the specified data directories
+
+ ## Logging
+
+ The pipeline writes detailed logs to `data_processing_pipeline.log`, including:
+ - Start/end times for each step
+ - Success/failure status
+ - Error messages and stack traces
+ - Processing statistics
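The "Modular Configuration" feature above hinges on a JSON round-trip: the orchestrator serializes the config dict into the `PIPELINE_CONFIG` environment variable, and each step script parses it back. A minimal sketch of that handoff (`load_config` mirrors the pattern used in the step scripts; the default values here are illustrative only):

```python
import json
import os

def load_config(defaults):
    """Return pipeline settings from the PIPELINE_CONFIG env var, else defaults."""
    raw = os.environ.get("PIPELINE_CONFIG")
    if raw:
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            pass  # malformed JSON: fall back to the defaults below
    return defaults

DEFAULTS = {"processing": {"batch_size_multiplier": 4, "min_batch_size": 1}}

# Orchestrator side: the dict must be serialized with json.dumps;
# str() would emit single-quoted Python repr that json.loads rejects.
os.environ["PIPELINE_CONFIG"] = json.dumps(
    {"processing": {"batch_size_multiplier": 8, "min_batch_size": 2}}
)

cfg = load_config(DEFAULTS)
print(cfg["processing"]["batch_size_multiplier"])  # 8
```

Because the fallback runs whenever the variable is absent or unparseable, each step script can also be launched standalone, as the README's "Run Individual Steps" section describes.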
data/align_data.py CHANGED
@@ -13,29 +13,57 @@ import re
  warnings.filterwarnings('ignore')

  # =============================================================================
- # CONFIGURATION - Change these paths as needed
+ # CONFIGURATION - Load from environment or use defaults
  # =============================================================================
  #
- # To change directories, simply modify the variables below:
- # - All paths are relative to your system
- # - Directories will be created automatically if they don't exist
- # - Use absolute paths for best results
+ # Configuration is loaded from environment variables set by the pipeline
+ # orchestrator, or falls back to default values when running standalone.
  #
  # =============================================================================

+ import os
+ import json
+
+ def load_config():
+     """Load configuration from the environment or use defaults."""
+     if 'PIPELINE_CONFIG' in os.environ:
+         try:
+             return json.loads(os.environ['PIPELINE_CONFIG'])
+         except json.JSONDecodeError:
+             pass  # fall back to defaults on malformed JSON
+
+     # Default configuration
+     return {
+         'alignment': {
+             'goes_data_dir': "/mnt/data/NEW-FLARE/combined",
+             'aia_processed_dir': "/mnt/data/NEW-FLARE/AIA_processed",
+             'output_sxr_a_dir': "/mnt/data/NEW-FLARE/GOES-SXR-A",
+             'output_sxr_b_dir': "/mnt/data/NEW-FLARE/GOES-SXR-B",
+             'aia_missing_dir': "/mnt/data/NEW-FLARE/AIA_ITI_MISSING"
+         },
+         'processing': {
+             'batch_size_multiplier': 4,
+             'min_batch_size': 1,
+             'max_processes': None
+         }
+     }
+
+ config = load_config()
+
  # Input directories
- GOES_DATA_DIR = "/mnt/data/NEW-FLARE/combined"           # Directory containing GOES CSV files
- AIA_PROCESSED_DIR = "/mnt/data/NEW-FLARE/AIA_processed"  # Directory with processed AIA files
+ GOES_DATA_DIR = config['alignment']['goes_data_dir']
+ AIA_PROCESSED_DIR = config['alignment']['aia_processed_dir']

  # Output directories
- OUTPUT_SXR_A_DIR = "/mnt/data/NEW-FLARE/GOES-SXR-A"      # Output directory for SXR-A data
- OUTPUT_SXR_B_DIR = "/mnt/data/NEW-FLARE/GOES-SXR-B"      # Output directory for SXR-B data
- AIA_MISSING_DIR = "/mnt/data/NEW-FLARE/AIA_ITI_MISSING"  # Directory for AIA files with missing GOES data
+ OUTPUT_SXR_A_DIR = config['alignment']['output_sxr_a_dir']
+ OUTPUT_SXR_B_DIR = config['alignment']['output_sxr_b_dir']
+ AIA_MISSING_DIR = config['alignment']['aia_missing_dir']

  # Processing configuration
- BATCH_SIZE_MULTIPLIER = 4  # Number of batches per process (adjust for performance)
- MIN_BATCH_SIZE = 1         # Minimum batch size
- MAX_PROCESSES = None       # Maximum number of processes (None = use all CPU cores)
+ BATCH_SIZE_MULTIPLIER = config['processing']['batch_size_multiplier']
+ MIN_BATCH_SIZE = config['processing']['min_batch_size']
+ MAX_PROCESSES = config['processing']['max_processes']

  # =============================================================================
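`align_data.py` drives its multiprocessing from `BATCH_SIZE_MULTIPLIER`, `MIN_BATCH_SIZE`, and `MAX_PROCESSES`, but the batching arithmetic itself is outside this hunk. One plausible way such settings combine, purely as an illustration (`compute_batch_size` is a hypothetical helper, not part of the script):

```python
import math
import os

def compute_batch_size(n_items, max_processes=None, multiplier=4, min_batch=1):
    """Split n_items across (processes * multiplier) batches, floored at min_batch."""
    processes = max_processes or os.cpu_count() or 1
    n_batches = processes * multiplier
    return max(min_batch, math.ceil(n_items / n_batches))

print(compute_batch_size(1000, max_processes=8))  # ceil(1000 / 32) = 32
print(compute_batch_size(3, max_processes=8))     # floors at min_batch = 1
```

The multiplier trades scheduling overhead for load balance: more, smaller batches per worker keep cores busy when items take uneven time.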
data/euv_data_cleaning.py CHANGED
@@ -15,8 +15,31 @@ from itipy.data.dataset import get_intersecting_files
  from astropy.io import fits

  # Configuration for all wavelengths to process
- wavelengths = [94, 131, 171, 193, 211, 304]
- base_input_folder = '/mnt/data/NEW-FLARE/SDO-AIA-flaring'
+ # Load configuration from the environment or use defaults
+ import os
+ import json
+
+ def load_config():
+     """Load configuration from the environment or use defaults."""
+     if 'PIPELINE_CONFIG' in os.environ:
+         try:
+             return json.loads(os.environ['PIPELINE_CONFIG'])
+         except json.JSONDecodeError:
+             pass  # fall back to defaults on malformed JSON
+
+     # Default configuration
+     return {
+         'euv': {
+             'wavelengths': [94, 131, 171, 193, 211, 304],
+             'input_folder': '/mnt/data/AUGUST/SDO-AIA-timespan',
+             'bad_files_dir': '/mnt/data/AUGUST/SDO-AIA_bad'
+         }
+     }
+
+ config = load_config()
+ wavelengths = config['euv']['wavelengths']
+ base_input_folder = config['euv']['input_folder']

  aia_files = get_intersecting_files(base_input_folder, wavelengths)

@@ -64,7 +87,7 @@ for wavelength in wavelengths:
      filename = pd.to_datetime(names).strftime('%Y-%m-%dT%H:%M:%S') + ".fits"
      file_path = os.path.join(base_input_folder, f"{wavelength}/(unknown)")
      # Destination path
-     destination_folder = os.path.join("/mnt/data/NEW-FLARE/SDO-AIA_bad", str(wavelength))
+     destination_folder = os.path.join(config['euv']['bad_files_dir'], str(wavelength))
      os.makedirs(destination_folder, exist_ok=True)
      # Move or report missing
      if os.path.exists(file_path):
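The cleaning step flags files "based on timestamp validation" and builds names with the `%Y-%m-%dT%H:%M:%S` pattern seen above; the validation criterion itself is not in this hunk. A stdlib-only sketch of how a filename timestamp could be checked (`has_valid_timestamp` is a hypothetical helper, not code from `euv_data_cleaning.py`):

```python
from datetime import datetime

# Matches the strftime pattern used when building .fits filenames above.
FMT = "%Y-%m-%dT%H:%M:%S"

def has_valid_timestamp(filename):
    """Return True if the .fits filename stem parses as an ISO-like timestamp."""
    stem = filename[:-5] if filename.endswith(".fits") else filename
    try:
        datetime.strptime(stem, FMT)
        return True
    except ValueError:
        return False

print(has_valid_timestamp("2014-03-29T17:48:00.fits"))  # True
print(has_valid_timestamp("corrupted_header.fits"))     # False
```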
data/iti_data_processing.py CHANGED
@@ -14,9 +14,32 @@ from multiprocessing import Pool
  from tqdm import tqdm

  # Configuration for all wavelengths to process
- wavelengths = [94, 131, 171, 193, 211, 304]
- base_input_folder = '/mnt/data/SDO-AIA-flaring'
- output_folder = '/mnt/data/AIA_ITI'
+ # Load configuration from the environment or use defaults
+ import os
+ import json
+
+ def load_config():
+     """Load configuration from the environment or use defaults."""
+     if 'PIPELINE_CONFIG' in os.environ:
+         try:
+             return json.loads(os.environ['PIPELINE_CONFIG'])
+         except json.JSONDecodeError:
+             pass  # fall back to defaults on malformed JSON
+
+     # Default configuration
+     return {
+         'iti': {
+             'wavelengths': [94, 131, 171, 193, 211, 304],
+             'input_folder': '/mnt/data/AUGUST/SDO-AIA-timespan',
+             'output_folder': '/mnt/data/AUGUST/AIA_ITI'
+         }
+     }
+
+ config = load_config()
+ wavelengths = config['iti']['wavelengths']
+ base_input_folder = config['iti']['input_folder']
+ output_folder = config['iti']['output_folder']
  os.makedirs(output_folder, exist_ok=True)

  sdo_norms = {
data/pipeline_config_template.yaml ADDED
@@ -0,0 +1,39 @@
+ # Data Processing Pipeline Configuration
+ #
+ # Modify the paths below to match your system setup.
+ # All paths should be absolute.
+ #
+ # Usage: python process_data_pipeline.py --config this_file.yaml
+ #
+
+ base_data_dir: /mnt/data
+ euv:
+   input_folder: /mnt/data/AUGUST/SDO-AIA-timespan
+   bad_files_dir: /mnt/data/AUGUST/SDO-AIA_bad
+   wavelengths:
+     - 94
+     - 131
+     - 171
+     - 193
+     - 211
+     - 304
+ iti:
+   input_folder: /mnt/data/AUGUST/SDO-AIA-timespan
+   output_folder: /mnt/data/AUGUST/AIA_ITI
+   wavelengths:
+     - 94
+     - 131
+     - 171
+     - 193
+     - 211
+     - 304
+ alignment:
+   goes_data_dir: /mnt/data/NEW-FLARE/combined
+   aia_processed_dir: /mnt/data/NEW-FLARE/AIA_processed
+   output_sxr_a_dir: /mnt/data/NEW-FLARE/GOES-SXR-A
+   output_sxr_b_dir: /mnt/data/NEW-FLARE/GOES-SXR-B
+   aia_missing_dir: /mnt/data/NEW-FLARE/AIA_ITI_MISSING
+ processing:
+   max_processes: null
+   batch_size_multiplier: 4
+   min_batch_size: 1
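A configuration shaped like the template above can be sanity-checked for missing keys before the pipeline runs. A minimal sketch of such a check (the section/key layout mirrors the template; `REQUIRED_KEYS` and `missing_keys` are illustrative assumptions, not part of `pipeline_config.py`):

```python
# Section -> required keys, mirroring pipeline_config_template.yaml.
REQUIRED_KEYS = {
    "euv": ["input_folder", "bad_files_dir", "wavelengths"],
    "iti": ["input_folder", "output_folder", "wavelengths"],
    "alignment": ["goes_data_dir", "aia_processed_dir",
                  "output_sxr_a_dir", "output_sxr_b_dir", "aia_missing_dir"],
    "processing": ["max_processes", "batch_size_multiplier", "min_batch_size"],
}

def missing_keys(config):
    """List 'section.key' entries absent from a parsed configuration dict."""
    missing = []
    for section, keys in REQUIRED_KEYS.items():
        for key in keys:
            if key not in config.get(section, {}):
                missing.append(f"{section}.{key}")
    return missing

cfg = {"euv": {"input_folder": "/mnt/data/AUGUST/SDO-AIA-timespan"}}
print(missing_keys(cfg)[:2])  # ['euv.bad_files_dir', 'euv.wavelengths']
```

Run against the output of `yaml.safe_load` on the template, this kind of check catches typos in section or key names before any processing starts.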
data/process_data_pipeline.py ADDED
@@ -0,0 +1,294 @@
+ #!/usr/bin/env python3
+ """
+ Data Processing Pipeline Orchestrator
+
+ This script orchestrates the three main data processing steps:
+ 1. EUV data cleaning (euv_data_cleaning.py) - removes bad AIA files
+ 2. ITI data processing (iti_data_processing.py) - processes the good data
+ 3. Data alignment (align_data.py) - concatenates GOES data and checks for missing data
+
+ Each step is skipped if it has already completed.
+
+ Configuration:
+ - Use --config to specify a custom configuration file (YAML or Python)
+ - Use --show-config to display the current configuration
+ - Use --create-template to create a YAML configuration template
+ """
+
+ import os
+ import sys
+ import json
+ import subprocess
+ import time
+ import logging
+ from datetime import datetime
+ from pathlib import Path
+ from pipeline_config import PipelineConfig
+
+ # Configure logging
+ logging.basicConfig(
+     level=logging.INFO,
+     format='%(asctime)s - %(levelname)s - %(message)s',
+     handlers=[
+         logging.FileHandler('data_processing_pipeline.log'),
+         logging.StreamHandler(sys.stdout)
+     ]
+ )
+ logger = logging.getLogger(__name__)
+
+ class DataProcessingPipeline:
+     def __init__(self, base_dir=None, config=None):
+         """
+         Initialize the data processing pipeline.
+
+         Args:
+             base_dir: Base directory for the project. If None, uses the current script's directory.
+             config: PipelineConfig instance. If None, uses the default configuration.
+         """
+         if base_dir is None:
+             self.base_dir = Path(__file__).parent
+         else:
+             self.base_dir = Path(base_dir)
+
+         # Load configuration
+         self.config = config if config is not None else PipelineConfig()
+
+         # Define script paths
+         self.scripts = {
+             'euv_cleaning': self.base_dir / 'euv_data_cleaning.py',
+             'iti_processing': self.base_dir / 'iti_data_processing.py',
+             'align_data': self.base_dir / 'align_data.py'
+         }
+
+         # Define step names and descriptions
+         self.steps = {
+             'euv_cleaning': {
+                 'name': 'EUV Data Cleaning',
+                 'description': 'Remove bad AIA files based on timestamp validation',
+                 'output_check': self._check_euv_cleaning_complete
+             },
+             'iti_processing': {
+                 'name': 'ITI Data Processing',
+                 'description': 'Process good AIA data using ITI methods',
+                 'output_check': self._check_iti_processing_complete
+             },
+             'align_data': {
+                 'name': 'Data Alignment',
+                 'description': 'Concatenate GOES data and check for missing data',
+                 'output_check': self._check_align_data_complete
+             }
+         }
+
+     def _check_euv_cleaning_complete(self):
+         """Check if EUV data cleaning is complete by looking for the bad-files directory."""
+         bad_files_dir = Path(self.config.get_path('euv', 'bad_files_dir'))
+         if bad_files_dir.exists():
+             # Check if any files were moved (indicating cleaning was done)
+             wavelengths = self.config.get_path('euv', 'wavelengths')
+             wavelength_dirs = [bad_files_dir / str(wl) for wl in wavelengths]
+             return any(d.exists() and any(d.iterdir()) for d in wavelength_dirs)
+         return False
+
+     def _check_iti_processing_complete(self):
+         """Check if ITI data processing is complete by looking for processed files."""
+         output_dir = Path(self.config.get_path('iti', 'output_folder'))
+         if output_dir.exists():
+             # Check if there are processed .npy files
+             npy_files = list(output_dir.glob('*.npy'))
+             return len(npy_files) > 0
+         return False
+
+     def _check_align_data_complete(self):
+         """Check if data alignment is complete by looking for output directories."""
+         output_dirs = [
+             Path(self.config.get_path('alignment', 'output_sxr_a_dir')),
+             Path(self.config.get_path('alignment', 'output_sxr_b_dir'))
+         ]
+         return all(d.exists() and any(d.iterdir()) for d in output_dirs)
+
+     def run_script(self, script_name, step_info):
+         """
+         Run a single processing script.
+
+         Args:
+             script_name: Name of the script to run
+             step_info: Dictionary containing step information
+
+         Returns:
+             bool: True if successful, False otherwise
+         """
+         script_path = self.scripts[script_name]
+
+         if not script_path.exists():
+             logger.error(f"Script not found: {script_path}")
+             return False
+
+         logger.info(f"Starting {step_info['name']}...")
+         logger.info(f"Description: {step_info['description']}")
+         logger.info(f"Running: {script_path}")
+
+         # Pass the configuration to the child script via environment variables.
+         # json.dumps is required here: str() on a dict produces single-quoted
+         # Python repr, which json.loads in the step scripts cannot parse.
+         env = os.environ.copy()
+         env.update({
+             'PIPELINE_CONFIG': json.dumps(self.config.config),
+             'BASE_DATA_DIR': self.config.get_path('base_data_dir', 'base_data_dir')
+         })
+
+         start_time = time.time()
+
+         try:
+             # Run the script
+             result = subprocess.run(
+                 [sys.executable, str(script_path)],
+                 capture_output=True,
+                 text=True,
+                 cwd=self.base_dir,
+                 env=env
+             )
+
+             duration = time.time() - start_time
+
+             if result.returncode == 0:
+                 logger.info(f"✓ {step_info['name']} completed successfully in {duration:.2f} seconds")
+                 if result.stdout:
+                     logger.debug(f"Output: {result.stdout}")
+                 return True
+             else:
+                 logger.error(f"✗ {step_info['name']} failed with return code {result.returncode}")
+                 logger.error(f"Error output: {result.stderr}")
+                 return False
+
+         except Exception as e:
+             duration = time.time() - start_time
+             logger.error(f"✗ {step_info['name']} failed with exception: {e}")
+             logger.error(f"Duration: {duration:.2f} seconds")
+             return False
+
+     def run_pipeline(self, force_rerun=False):
+         """
+         Run the complete data processing pipeline.
+
+         Args:
+             force_rerun: If True, run all steps regardless of completion status
+         """
+         logger.info("=" * 80)
+         logger.info("Starting Data Processing Pipeline")
+         logger.info("=" * 80)
+         logger.info(f"Base directory: {self.base_dir}")
+         logger.info(f"Force rerun: {force_rerun}")
+         logger.info("=" * 80)
+
+         pipeline_start_time = time.time()
+         successful_steps = 0
+         failed_steps = 0
+
+         for step_name, step_info in self.steps.items():
+             logger.info(f"\n--- Step: {step_info['name']} ---")
+
+             # Check if the step is already complete
+             if not force_rerun and step_info['output_check']():
+                 logger.info(f"✓ {step_info['name']} already completed - skipping")
+                 successful_steps += 1
+                 continue
+
+             # Run the step
+             if self.run_script(step_name, step_info):
+                 successful_steps += 1
+             else:
+                 failed_steps += 1
+                 logger.error(f"Pipeline stopped due to failure in {step_info['name']}")
+                 break
+
+         total_duration = time.time() - pipeline_start_time
+
+         # Summary
+         logger.info("\n" + "=" * 80)
+         logger.info("PIPELINE SUMMARY")
+         logger.info("=" * 80)
+         logger.info(f"Total duration: {total_duration:.2f} seconds")
+         logger.info(f"Successful steps: {successful_steps}")
+         logger.info(f"Failed steps: {failed_steps}")
+
+         if failed_steps == 0:
+             logger.info("✓ All steps completed successfully!")
+         else:
+             logger.error("✗ Pipeline completed with errors")
+
+         logger.info("=" * 80)
+
+         return failed_steps == 0
+
+ def main():
+     """Main function to run the pipeline."""
+     import argparse
+
+     parser = argparse.ArgumentParser(description='Data Processing Pipeline Orchestrator')
+     parser.add_argument('--force', action='store_true',
+                         help='Force rerun of all steps regardless of completion status')
+     parser.add_argument('--base-dir', type=str,
+                         help='Base directory for the project (default: script directory)')
+     parser.add_argument('--config', type=str,
+                         help='Path to a custom configuration file (YAML or Python)')
+     parser.add_argument('--show-config', action='store_true',
+                         help='Display the current configuration and exit')
+     parser.add_argument('--create-template', action='store_true',
+                         help='Create a YAML configuration template file and exit')
+     parser.add_argument('--validate', action='store_true',
+                         help='Validate configuration paths and exit')
+
+     args = parser.parse_args()
+
+     # Handle special commands
+     if args.create_template:
+         config = PipelineConfig()
+         config.save_config_template()
+         return
+
+     # Load configuration
+     config = PipelineConfig(args.config)
+
+     if args.show_config:
+         config.print_config()
+         return
+
+     if args.validate:
+         is_valid, missing_paths = config.validate_paths()
+         if is_valid:
+             print("✓ All required paths exist")
+         else:
+             print("✗ Missing required paths:")
+             for path in missing_paths:
+                 print(f"  - {path}")
+         return
+
+     # Create the pipeline instance
+     pipeline = DataProcessingPipeline(args.base_dir, config)
+
+     # Validate paths before running
+     is_valid, missing_paths = config.validate_paths()
+     if not is_valid:
+         logger.error("Configuration validation failed. Missing required paths:")
+         for path in missing_paths:
+             logger.error(f"  - {path}")
+         logger.error("Use --validate to check the configuration")
+         sys.exit(1)
+
+     # Create necessary directories
+     config.create_directories()
+
+     # Run the pipeline
+     success = pipeline.run_pipeline(force_rerun=args.force)
+
+     # Exit with the appropriate code
+     sys.exit(0 if success else 1)
+
+ if __name__ == "__main__":
+     main()
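The orchestrator's step-skipping keys off output artifacts, e.g. the presence of `.npy` files for the ITI step. The pattern can be exercised in isolation; this sketch mimics the shape of `_check_iti_processing_complete` (standalone function rather than a method, with a temporary directory standing in for the real output folder):

```python
import tempfile
from pathlib import Path

def iti_step_complete(output_folder):
    """True once the output folder contains at least one .npy artifact."""
    out = Path(output_folder)
    return out.exists() and any(out.glob("*.npy"))

with tempfile.TemporaryDirectory() as tmp:
    print(iti_step_complete(tmp))  # False: directory exists but is empty
    (Path(tmp) / "2014-03-29T17:48:00.npy").touch()
    print(iti_step_complete(tmp))  # True: artifact present, so the step would be skipped
```

One caveat of artifact-based checks: a step that crashed after writing a partial output still looks "complete", which is what the `--force` flag is for.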