air_quality_forecast package

Submodules

air_quality_forecast.api_caller module

class air_quality_forecast.api_caller.APICaller[source]

Bases: object

_current_time() str[source]

Returns the current time in the format “YYYY-MM-DD HH:MM:SS”.

_get_luchtmeet_data(components: str, station_number: int) list[source]

Fetches luchtmeet data for the given components and station from the API over the past three days.

Parameters:
  • components – The components to query (e.g., ‘O3,NO2,PM25’).

  • station_number – The station number to query.

Returns:

List of JSON data from the API response for the given components.

_two_days_ago() str[source]

Returns the date and time three days before the current time.
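
The two time helpers can be sketched with the standard library alone. This is a hypothetical re-implementation, assuming the "YYYY-MM-DD HH:MM:SS" format stated above; note the class queries a three-day window, despite the helper's name.

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the two time helpers: the current time formatted
# as "YYYY-MM-DD HH:MM:SS", and the same format n days earlier.
def current_time() -> str:
    return datetime.now().strftime("%Y-%m-%d %H:%M:%S")

def days_ago(n: int) -> str:
    return (datetime.now() - timedelta(days=n)).strftime("%Y-%m-%d %H:%M:%S")
```

Because the format sorts lexicographically, the two strings can be compared directly when building the query window.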

get_luchtmeet_data() DataFrame[source]

Averages the luchtmeet data for all required formulas over the past three days.

Returns:

A pandas DataFrame with daily averaged data for each formula.

get_vc_data() DataFrame[source]

Fetches weather data for Utrecht for the last three days from Visual Crossing API.

Returns:

A pandas DataFrame with the weather data for the last three days.

lag_data() DataFrame[source]

Combine the lagged air quality and weather data into a single row.

Returns:

DataFrame with lagged data in a single row.
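
The lagging idea can be sketched as follows: several daily rows are flattened into one wide row of lagged features (lag 1 = most recent day), which is the shape a trained model expects at prediction time. Column names here are illustrative, not the package's actual feature names.

```python
import pandas as pd

# Three daily rows of (illustrative) measurements.
daily = pd.DataFrame({"NO2": [20.0, 22.0, 21.0], "O3": [30.0, 28.0, 29.0]})

# Walk the rows most-recent-first and flatten them into one wide row.
row = {}
for lag, (_, day) in enumerate(daily.iloc[::-1].iterrows(), start=1):
    for col, val in day.items():
        row[f"{col}_lag_{lag}"] = val
lagged = pd.DataFrame([row])
```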

air_quality_forecast.data_pipeline module

class air_quality_forecast.data_pipeline.DataLoader(raw_data_path: str, processed_data_path: str)[source]

Bases: object

save_to_csv(name: str, data: DataFrame) None[source]

Save the data to a CSV file.

Parameters:
  • name – The name of the file to save.

  • data – The Pandas DataFrame to save as a CSV.
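
A minimal sketch of save_to_csv, assuming the file is written into the processed-data directory under the given name (the join logic is an assumption, not confirmed by the source):

```python
import os
import tempfile
import pandas as pd

class DataLoader:
    def __init__(self, raw_data_path: str, processed_data_path: str):
        self.raw_data_path = raw_data_path
        self.processed_data_path = processed_data_path

    def save_to_csv(self, name: str, data: pd.DataFrame) -> None:
        # index=False keeps the CSV free of the DataFrame's row index
        data.to_csv(os.path.join(self.processed_data_path, name), index=False)
```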

class air_quality_forecast.data_pipeline.FeatureProcessor(griftpark_data: DataFrame, utrecht_data: DataFrame)[source]

Bases: object

apply_time_shift(t_max: int = 3) DataFrame[source]

Apply time shift to the dataset.

Parameters:

t_max – Maximum time shift, representing the number of days used to predict future values.

Returns:

The time-shifted Pandas DataFrame.
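
The time shift with t_max = 3 can be sketched with pandas: each feature gains columns holding its values from 1, 2, and 3 days earlier, and rows without a full history are dropped. Column names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"NO2": [10.0, 12.0, 11.0, 13.0, 9.0]})
t_max = 3
for t in range(1, t_max + 1):
    # shift(t) moves values down t rows, i.e. the value from t days earlier
    df[f"NO2_lag_{t}"] = df["NO2"].shift(t)
df = df.dropna().reset_index(drop=True)  # drop rows with incomplete history
```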

merge_raw_data() DataFrame[source]

Merge the raw datasets based on date.

Returns:

The merged Pandas DataFrame.
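
The date-based merge can be sketched as below, assuming both datasets carry a shared date column; an inner join keeps only the dates present in both. Column names are illustrative.

```python
import pandas as pd

griftpark = pd.DataFrame({"date": ["2024-01-01", "2024-01-02"], "NO2": [20.0, 22.0]})
utrecht = pd.DataFrame({"date": ["2024-01-02", "2024-01-03"], "temp": [5.0, 6.0]})

# Inner join on the date column: only 2024-01-02 appears in both frames.
merged = pd.merge(griftpark, utrecht, on="date", how="inner")
```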

preprocess_data() DataFrame[source]

Preprocess the data by applying feature selection, missing value removal, and time shift.

Returns:

The preprocessed Pandas DataFrame.

select_features() DataFrame[source]

Select relevant features from the merged data.

Returns:

A Pandas DataFrame with selected features.

sort_data_by_date() DataFrame[source]

Sort the merged data by date, starting from the most recent to the oldest.

Returns:

The sorted Pandas DataFrame.

class air_quality_forecast.data_pipeline.PreprocessingPipeline[source]

Bases: object

run_pipeline() DataFrame[source]

Run the entire preprocessing pipeline: load data, process features, normalize, and save to CSV.

Returns:

The final normalized Pandas DataFrame.

train_test_split(x: DataFrame, y: DataFrame, test_size: float = 0.2) Tuple[DataFrame, DataFrame, DataFrame, DataFrame, DataFrame, DataFrame][source]

Split the data into training and testing sets.

Parameters:
  • x – The feature data to split as a Pandas DataFrame.

  • y – The target data to split as a Pandas DataFrame.

  • test_size – The fraction of the data reserved for the test set.

Returns:

A tuple of the training and testing data as Pandas DataFrames.
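
For time-series data a chronological split (no shuffling) is the usual choice; the sketch below shows that idea, with the last test_size fraction of rows reserved for testing. The package's version returns six DataFrames; this sketch covers only the basic four-way split.

```python
import pandas as pd

def chronological_split(x: pd.DataFrame, y: pd.DataFrame, test_size: float = 0.2):
    # Cut the frames at the same row, keeping temporal order intact.
    cut = int(len(x) * (1 - test_size))
    return x.iloc[:cut], x.iloc[cut:], y.iloc[:cut], y.iloc[cut:]

x = pd.DataFrame({"f": range(10)})
y = pd.DataFrame({"t": range(10)})
x_train, x_test, y_train, y_test = chronological_split(x, y)
```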

air_quality_forecast.get_prediction_data module

air_quality_forecast.get_prediction_data.main()[source]

air_quality_forecast.main module

air_quality_forecast.model_development module

class air_quality_forecast.model_development.RegressorTrainer(experiment_name: str, regressor: BaseEstimator, param_space: Dict[str, Any], cv_splits: int = 5, n_iter: int = 50)[source]

Bases: object

_evaluate_model() None[source]

Evaluate the best model on the test data and log metrics.

Raises:
  • ValueError – If test data has not been set. Call _set_data first.

  • ValueError – If Bayesian search has not been performed. Call _perform_search first.

_optimize_and_evaluate()[source]

Perform Bayesian optimization for hyperparameters and evaluate the best model on the test data.

This method wraps _perform_search and _evaluate_model in a single MLflow run. The search initializes and runs a BayesSearchCV over the regressor's hyperparameters on the training data, using a TimeSeriesSplit cross-validation scheme. The best hyperparameters and the corresponding mean squared error are logged to MLflow.

Raises:

ValueError – If training data has not been set. Call _set_data first.
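
The TimeSeriesSplit idea behind the cross-validation can be sketched in plain Python: each fold trains on an expanding prefix and validates on the block that follows, so no fold ever trains on future data. This mirrors scikit-learn's scheme only loosely (the real implementation handles remainders differently).

```python
def time_series_splits(n_samples: int, n_splits: int):
    # Each validation block has n_samples // (n_splits + 1) rows; the
    # training set is everything that comes before it in time.
    fold = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        yield list(range(0, fold * i)), list(range(fold * i, fold * (i + 1)))
```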

_set_data(x_train: ndarray, y_train: ndarray, x_test: ndarray, y_test: ndarray) None[source]

Set the training and test data as class attributes.

Parameters:
  • x_train (np.ndarray) – Training data features.

  • y_train (np.ndarray) – Training data labels.

  • x_test (np.ndarray) – Test data features.

  • y_test (np.ndarray) – Test data labels.

_setup_mlflow() None[source]

Set up MLflow configuration.

This method launches the MLflow server, sets the MLflow experiment name and tracking URI, enables system metrics logging, and turns on autologging of metrics and parameters.

static launch_mlflow_server() None[source]

Launch MLflow server at http://127.0.0.1:5000, if not already running.

If the port is already in use, it will print a message saying so. If there is an error launching the server, it will print the error.

static port_in_use(port: int) bool[source]

Check if a port is in use.

Parameters:

port (int) – Port to check.

Returns:

True if the port is in use, False otherwise.

Return type:

bool
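
The port check can be sketched with the standard socket module: connect_ex returns 0 when something is listening on the port.

```python
import socket

def port_in_use(port: int) -> bool:
    # connect_ex returns 0 on a successful TCP connect, an errno otherwise
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        return s.connect_ex(("127.0.0.1", port)) == 0
```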

run(x_train: ndarray, y_train: ndarray, x_test: ndarray, y_test: ndarray) None[source]

Run the Bayesian optimization workflow.

Parameters:
  • x_train (np.ndarray) – Training data features.

  • y_train (np.ndarray) – Training data labels.

  • x_test (np.ndarray) – Test data features.

  • y_test (np.ndarray) – Test data labels.

air_quality_forecast.model_development.convert_param_space(param_space: dict)[source]

Convert a parameter space dictionary to a format usable by skopt.

This function takes a dictionary where the keys are parameter names and the values are lists of two values that represent the range of possible values for that parameter. The function then converts these ranges to skopt parameter objects and returns a new dictionary where the parameter names are the same but the values are now skopt parameter objects.

Parameters:

param_space (dict) – A dictionary with parameter names as keys and ranges of possible values as values.

Returns:

A dictionary with parameter names as keys and skopt parameter objects as values.

Return type:

dict
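
The conversion can be sketched without the skopt dependency: integer bounds map to an integer dimension and float bounds to a real one, mirroring skopt's Integer and Real spaces (the tuple encoding below is a stand-in for the actual skopt objects).

```python
def convert_param_space(param_space: dict) -> dict:
    converted = {}
    for name, (low, high) in param_space.items():
        if isinstance(low, int) and isinstance(high, int):
            # both bounds integral -> integer dimension (cf. skopt.space.Integer)
            converted[name] = ("integer", low, high)
        else:
            # otherwise a continuous dimension (cf. skopt.space.Real)
            converted[name] = ("real", float(low), float(high))
    return converted
```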

air_quality_forecast.model_development.run_bayesian_optimization(x_train: ndarray, y_train: ndarray, x_test: ndarray, y_test: ndarray, experiment_name: str, regressor: BaseEstimator, param_space: Dict[str, Any], n_iter: int) None[source]

Run Bayesian optimization to search for the best hyperparameters for a given regressor.

Parameters:
  • x_train (np.ndarray) – Training data features.

  • y_train (np.ndarray) – Training data labels.

  • x_test (np.ndarray) – Test data features.

  • y_test (np.ndarray) – Test data labels.

  • experiment_name (str) – The name of the MLflow experiment.

  • regressor (sklearn model) – The regressor to optimize.

  • param_space (dict) – The parameter space for Bayesian optimization.

  • n_iter (int) – Number of iterations for the search.

air_quality_forecast.model_development.train_all_models()[source]

Train all models using Bayesian optimization.

This function is used to train the three models (DecisionTreeRegressor, XGBRegressor, and RandomForestRegressor) using Bayesian optimization.

The training and test data are loaded from the “data/processed” directory.

The hyperparameter search spaces for each model are loaded from “configs/hyperparameter_search_spaces.yaml”.

The trained models are logged to the MLflow server.

air_quality_forecast.model_development.train_one_model(x_train: ndarray, y_train: ndarray, x_test: ndarray, y_test: ndarray, experiment_name: str, model: str, param_space: Dict[str, Any], n_iter: int) None[source]

air_quality_forecast.parser_ui module

air_quality_forecast.parser_ui.create_parser() ArgumentParser[source]

Create the command-line argument parser. Command-line arguments make it possible to plot or test a single algorithm without running all the others.

air_quality_forecast.parser_ui.load_data(x_train_path: str, y_train_path: str, x_test_path: str, y_test_path: str) Tuple[DataFrame, DataFrame, DataFrame, DataFrame][source]

Load the data from the given paths.

air_quality_forecast.parser_ui.main()[source]

air_quality_forecast.parser_ui.normalize_data(normalizer, x_train, x_test) Tuple[DataFrame, DataFrame][source]

Normalize the data using the given normalizer.

air_quality_forecast.parser_ui.train_model(x_train: DataFrame, y_train: DataFrame, x_test: DataFrame, y_test: DataFrame, experiment_name: str, model: str, param_space: Dict[str, Any], n_iter: int) None[source]

Train a model using Bayesian optimization.

air_quality_forecast.prediction module

class air_quality_forecast.prediction.PredictorModels[source]

Bases: object

_load_models() None[source]

Loads the pre-trained models from the saved_models directory.

The models are loaded in the following order:

  1. Decision Tree Regressor

  2. Random Forest Regressor

  3. XGBoost Regressor

The models are loaded from the following paths:

  • Decision Tree Regressor: saved_models/decision_tree.pkl

  • Random Forest Regressor: saved_models/random_forest.pkl

  • XGBoost Regressor: saved_models/xgboost.xgb
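
The pickle-based loading used for the .pkl models can be sketched as follows (the XGBoost model uses its own .xgb format and loader); a dict stands in for a fitted estimator.

```python
import os
import pickle
import tempfile

model = {"kind": "decision_tree"}  # stand-in for a fitted estimator
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "decision_tree.pkl")
    # Serialize the model, then load it back the way _load_models would.
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)
```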

decision_tree_predictions(x_test: DataFrame) ndarray[source]

Makes predictions using the loaded decision tree regressor.

Parameters:

x_test (pd.DataFrame) – Input data to make predictions on.

Returns:

y_pred – Predicted values.

Return type:

np.ndarray

random_forest_predictions(x_test: DataFrame) ndarray[source]

Makes predictions using the loaded Random Forest regressor.

Parameters:

x_test (pd.DataFrame) – Data points to make predictions on.

Returns:

y_pred – Predicted values for the input data points.

Return type:

np.ndarray

xgb_predictions(x_test: DataFrame, normalized: bool) ndarray[source]

Makes predictions using the loaded XGBoost regressor.

Parameters:
  • x_test (pd.DataFrame) – Data points to make predictions on.

  • normalized (bool) – Whether the data is normalized or not.

Returns:

y_pred – Predicted values for the input data points.

Return type:

np.ndarray

air_quality_forecast.utils module

class air_quality_forecast.utils.FeatureSelector[source]

Bases: object

change_to_numeric()[source]

Change each entry to a numerical value.

rename_initial_columns()[source]

Rename the columns of the datasets to remove whitespaces.

select_cols_by_correlation() list[source]

Select columns based on correlation criteria.
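
Correlation-based selection can be sketched as below: keep columns whose absolute correlation with the target clears a threshold. The threshold and column names are illustrative, not the package's actual criteria.

```python
import pandas as pd

df = pd.DataFrame({
    "target": [1.0, 2.0, 3.0, 4.0],
    "useful": [1.1, 2.2, 2.9, 4.1],   # tracks the target closely
    "noise":  [5.0, 1.0, 4.0, 2.0],   # unrelated to the target
})

# Absolute Pearson correlation of every column with the target.
corr = df.corr()["target"].abs()
selected = [c for c in df.columns if c != "target" and corr[c] > 0.8]
```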

uninformative_columns() list[source]

Return the columns that provide no information the model can use.

class air_quality_forecast.utils.InputValidator[source]

Bases: object

static validate_file_exists(path: str, variable_name: str) None[source]

Validate that the file path exists.

Parameters:
  • path – The file path to validate.

  • variable_name – The name of the variable for error messages.

Raises:

FileNotFoundError – If the path does not exist.

static validate_type(value, expected_type, variable_name: str) None[source]

Validate the type of the given variable.

Parameters:
  • value – The value to validate.

  • expected_type – The expected type of the value.

  • variable_name – The name of the variable for error messages.

Raises:

TypeError – If the value is not of the expected type.
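
A hypothetical re-implementation of the two validators described above, raising the documented exception types with the variable name in the message:

```python
import os

def validate_file_exists(path: str, variable_name: str) -> None:
    # Raise FileNotFoundError when the path does not exist on disk.
    if not os.path.exists(path):
        raise FileNotFoundError(f"{variable_name}: path '{path}' does not exist")

def validate_type(value, expected_type, variable_name: str) -> None:
    # Raise TypeError when the value is not of the expected type.
    if not isinstance(value, expected_type):
        raise TypeError(
            f"{variable_name} must be {expected_type.__name__}, "
            f"got {type(value).__name__}"
        )
```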

Module contents