air_quality_forecast package
Submodules
air_quality_forecast.api_caller module
- class air_quality_forecast.api_caller.APICaller[source]
Bases: object
- _get_luchtmeet_data(components: str, station_number: int) list[source]
Fetches luchtmeet data for the given components from the API over the past three days.
- Parameters:
components – The components to query (e.g., ‘O3,NO2,PM25’).
station_number – The station number to query.
- Returns:
List of JSON data from the API response for the given components.
- get_luchtmeet_data() DataFrame[source]
Averages out the luchtmeet data for all required formulas over the past three days.
- Returns:
A pandas DataFrame with daily averaged data for each formula.
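The daily averaging that get_luchtmeet_data performs can be sketched in plain Python. The record shape below (a timestamp plus one value per formula) is an assumption for illustration, not the actual Luchtmeetnet response format:

```python
from collections import defaultdict
from statistics import mean

def daily_averages(records):
    """Average hourly measurements per (date, formula) pair.

    `records` is assumed to be a list of dicts like
    {"timestamp": "2024-05-01T13:00", "formula": "NO2", "value": 21.0}.
    """
    buckets = defaultdict(list)
    for rec in records:
        day = rec["timestamp"][:10]  # keep only the YYYY-MM-DD part
        buckets[(day, rec["formula"])].append(rec["value"])
    return {key: mean(vals) for key, vals in buckets.items()}
```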
air_quality_forecast.data_pipeline module
- class air_quality_forecast.data_pipeline.DataLoader(raw_data_path: str, processed_data_path: str)[source]
Bases: object
- class air_quality_forecast.data_pipeline.FeatureProcessor(griftpark_data: DataFrame, utrecht_data: DataFrame)[source]
Bases: object
- apply_time_shift(t_max: int = 3) DataFrame[source]
Apply time shift to the dataset.
- Parameters:
t_max – Maximum time shift, representing the number of days used to predict future values.
- Returns:
The time-shifted Pandas DataFrame.
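The idea behind the time shift can be sketched on a single series: the previous t_max observations become the features for each target day. This mirrors what apply_time_shift presumably does column-wise on the DataFrame; the function name and list-based shape here are illustrative, not the package's API:

```python
def time_shifted_rows(values, t_max=3):
    """Build (lagged features, target) pairs from one series.

    For each index t, the t_max preceding observations form the
    feature window and values[t] is the target to predict.
    """
    rows = []
    for t in range(t_max, len(values)):
        rows.append((values[t - t_max:t], values[t]))
    return rows
```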
- merge_raw_data() DataFrame[source]
Merge the raw datasets based on date.
- Returns:
The merged Pandas DataFrame.
- preprocess_data() DataFrame[source]
Preprocess the data by applying feature selection, missing value removal, and time shift.
- Returns:
The preprocessed Pandas DataFrame.
- class air_quality_forecast.data_pipeline.PreprocessingPipeline[source]
Bases: object
- run_pipeline() DataFrame[source]
Run the entire preprocessing pipeline: load data, process features, normalize, and save to CSV.
- Returns:
The final normalized Pandas DataFrame.
- train_test_split(x: DataFrame, y: DataFrame, test_size: float = 0.2) Tuple[DataFrame, DataFrame, DataFrame, DataFrame, DataFrame, DataFrame][source]
Split the data into training and testing sets.
- Parameters:
x – The feature data to split as a Pandas DataFrame.
y – The target data to split as a Pandas DataFrame.
test_size – The fraction of the data reserved for testing (0.2 by default).
- Returns:
A tuple of the training and testing data as Pandas DataFrames.
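Because this is time-series data, the split is chronological rather than shuffled, so no future observations leak into the training set. A minimal sketch on plain sequences (the real method operates on DataFrames and returns six of them, plausibly including a validation split):

```python
def chronological_split(x, y, test_size=0.2):
    """Split paired sequences chronologically: the last `test_size`
    fraction of the data becomes the test set, preserving time order.
    """
    cut = int(len(x) * (1 - test_size))
    return x[:cut], x[cut:], y[:cut], y[cut:]
```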
air_quality_forecast.get_prediction_data module
air_quality_forecast.main module
air_quality_forecast.model_development module
- class air_quality_forecast.model_development.RegressorTrainer(experiment_name: str, regressor: BaseEstimator, param_space: Dict[str, Any], cv_splits: int = 5, n_iter: int = 50)[source]
Bases: object
- _evaluate_model() None[source]
Evaluate the best model on the test data and log metrics.
- Raises:
ValueError – If test data has not been set. Call _set_data first.
ValueError – If Bayesian search has not been performed. Call _perform_search first.
- _optimize_and_evaluate()[source]
Perform Bayesian optimization for hyperparameters and evaluate the best model on the test data.
This method wraps _perform_search and _evaluate_model in a single MLflow run.
- _perform_search() None[source]
Perform Bayesian optimization for hyperparameters.
This method initializes and performs a BayesSearchCV search for the best hyperparameters of the regressor. The search is performed on the training data using a TimeSeriesSplit cross-validation scheme. The best hyperparameters and the corresponding mean squared error are logged to MLflow.
- Raises:
ValueError – If training data has not been set. Call _set_data first.
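The TimeSeriesSplit scheme mentioned above keeps every test fold strictly after its training fold, so the cross-validation never trains on the future. A stdlib sketch in the spirit of sklearn's TimeSeriesSplit (the real search uses sklearn's implementation inside BayesSearchCV):

```python
def time_series_splits(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) expanding-window splits:
    the training window grows with each fold and the test fold
    always follows it in time.
    """
    fold = n_samples // (n_splits + 1)
    for i in range(1, n_splits + 1):
        train_idx = list(range(0, fold * i))
        test_idx = list(range(fold * i, fold * (i + 1)))
        yield train_idx, test_idx
```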
- _set_data(x_train: ndarray, y_train: ndarray, x_test: ndarray, y_test: ndarray) None[source]
Set the training and test data as class attributes.
- Parameters:
x_train (np.ndarray) – Training data features.
y_train (np.ndarray) – Training data labels.
x_test (np.ndarray) – Test data features.
y_test (np.ndarray) – Test data labels.
- _setup_mlflow() None[source]
Set up MLflow configuration.
This method launches the MLflow server, sets the MLflow experiment name and tracking URI, enables system metrics logging, and turns on autologging of metrics and parameters.
- static launch_mlflow_server() None[source]
Launch MLflow server at http://127.0.0.1:5000, if not already running.
If the port is already in use, it will print a message saying so. If there is an error launching the server, it will print the error.
- static port_in_use(port: int) bool[source]
Check if a port is in use.
- Parameters:
port (int) – Port to check.
- Returns:
True if the port is in use, False otherwise.
- Return type:
bool
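A port check like this is typically a matter of attempting a TCP connection to localhost; a minimal sketch of the idea (the package's actual implementation may differ):

```python
import socket

def port_in_use(port: int) -> bool:
    # A successful connect to localhost means something is listening
    # on the port; connect_ex returns 0 on success instead of raising.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        return sock.connect_ex(("127.0.0.1", port)) == 0
```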
- run(x_train: ndarray, y_train: ndarray, x_test: ndarray, y_test: ndarray) None[source]
Run the Bayesian optimization workflow.
- Parameters:
x_train (np.ndarray) – Training data features.
y_train (np.ndarray) – Training data labels.
x_test (np.ndarray) – Test data features.
y_test (np.ndarray) – Test data labels.
- air_quality_forecast.model_development.convert_param_space(param_space: dict)[source]
Convert a parameter space dictionary to a format usable by skopt.
This function takes a dictionary where the keys are parameter names and the values are lists of two values that represent the range of possible values for that parameter. The function then converts these ranges to skopt parameter objects and returns a new dictionary where the parameter names are the same but the values are now skopt parameter objects.
- Parameters:
param_space (dict) – A dictionary with parameter names as keys and ranges of possible values as values.
- Returns:
A dictionary with parameter names as keys and skopt parameter objects as values.
- Return type:
dict
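The conversion step hinges on classifying each two-value range before wrapping it in a skopt space object (Integer, Real, or Categorical). The stdlib sketch below shows only that classification logic, with plain tuples standing in for the skopt objects; the exact rules convert_param_space applies are an assumption:

```python
def infer_param_kinds(param_space):
    """Classify each range the way convert_param_space plausibly would:
    integer bounds become an Integer range, float bounds a Real range,
    and anything else a Categorical choice.
    """
    kinds = {}
    for name, bounds in param_space.items():
        if len(bounds) == 2 and all(isinstance(b, int) for b in bounds):
            kinds[name] = ("Integer", bounds[0], bounds[1])
        elif len(bounds) == 2 and all(isinstance(b, (int, float)) for b in bounds):
            kinds[name] = ("Real", float(bounds[0]), float(bounds[1]))
        else:
            kinds[name] = ("Categorical", tuple(bounds))
    return kinds
```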
- air_quality_forecast.model_development.run_bayesian_optimization(x_train: ndarray, y_train: ndarray, x_test: ndarray, y_test: ndarray, experiment_name: str, regressor: BaseEstimator, param_space: Dict[str, Any], n_iter: int) None[source]
Run Bayesian optimization to search for the best hyperparameters for a given regressor.
- Parameters:
x_train (np.ndarray) – Training data features.
y_train (np.ndarray) – Training data labels.
x_test (np.ndarray) – Test data features.
y_test (np.ndarray) – Test data labels.
experiment_name (str) – The name of the MLflow experiment.
regressor (sklearn model) – The regressor to optimize.
param_space (dict) – The parameter space for Bayesian optimization.
n_iter (int) – Number of iterations for the search.
- air_quality_forecast.model_development.train_all_models()[source]
Train all models using Bayesian optimization.
This function is used to train the three models (DecisionTreeRegressor, XGBRegressor, and RandomForestRegressor) using Bayesian optimization.
The training and test data are loaded from the “data/processed” directory.
The hyperparameter search spaces for each model are loaded from “configs/hyperparameter_search_spaces.yaml”.
The trained models are logged to the MLflow server.
air_quality_forecast.parser_ui module
- air_quality_forecast.parser_ui.create_parser() ArgumentParser[source]
Create the command-line argument parser. Command-line arguments make it convenient to plot or test a single algorithm without running all the others.
- air_quality_forecast.parser_ui.load_data(x_train_path: str, y_train_path: str, x_test_path: str, y_test_path: str) Tuple[DataFrame, DataFrame, DataFrame, DataFrame][source]
Load the data from the given paths.
air_quality_forecast.prediction module
- class air_quality_forecast.prediction.PredictorModels[source]
Bases: object
- _load_models() None[source]
Loads the pre-trained models from the saved_models directory.
The models are loaded in the following order:
Decision Tree Regressor
Random Forest Regressor
XGBoost Regressor
The models are loaded from the following paths:
Decision Tree Regressor: saved_models/decision_tree.pkl
Random Forest Regressor: saved_models/random_forest.pkl
XGBoost Regressor: saved_models/xgboost.xgb
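Loading the two pickled models boils down to a pickle round-trip; a minimal sketch (the XGBoost model is stored in its native .xgb format and needs xgboost's own loader instead):

```python
import pickle

def load_pickled_model(path):
    """Load one pickled estimator from disk, as _load_models does for
    the decision tree (decision_tree.pkl) and random forest
    (random_forest.pkl)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```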
- decision_tree_predictions(x_test: DataFrame) ndarray[source]
Makes predictions using the loaded decision tree regressor.
- Parameters:
x_test (pd.DataFrame) – Input data to make predictions on.
- Returns:
y_pred – Predicted values.
- Return type:
np.ndarray
- random_forest_predictions(x_test: DataFrame) ndarray[source]
Makes predictions using the loaded Random Forest regressor.
- Parameters:
x_test (pd.DataFrame) – Data points to make predictions on.
- Returns:
y_pred – Predicted values for the input data points.
- Return type:
np.ndarray
- xgb_predictions(x_test: DataFrame, normalized: bool) ndarray[source]
Makes predictions using the loaded XGBoost regressor.
- Parameters:
x_test (pd.DataFrame) – Data points to make predictions on.
normalized (bool) – Whether the data is normalized or not.
- Returns:
y_pred – Predicted values for the input data points.
- Return type:
np.ndarray
air_quality_forecast.utils module
- class air_quality_forecast.utils.InputValidator[source]
Bases: object
- static validate_file_exists(path: str, variable_name: str) None[source]
Validate that the file path exists.
- Parameters:
path – The file path to validate.
variable_name – The name of the variable for error messages.
- Raises:
FileNotFoundError – If the path does not exist.
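A sketch of the check, with the variable name included in the error so callers can tell which path was bad (the exact message wording is an assumption):

```python
import os

def validate_file_exists(path, variable_name):
    """Raise FileNotFoundError naming the offending variable,
    mirroring InputValidator.validate_file_exists."""
    if not os.path.exists(path):
        raise FileNotFoundError(
            f"{variable_name}: path {path!r} does not exist"
        )
```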
- static validate_type(value, expected_type, variable_name: str) None[source]
Validate the type of the given variable.
- Parameters:
value – The value to validate.
expected_type – The expected type of the value.
variable_name – The name of the variable for error messages.
- Raises:
TypeError – If the value is not of the expected type.
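The type check follows the same pattern; a sketch with an assumed message format:

```python
def validate_type(value, expected_type, variable_name):
    """Raise TypeError naming the offending variable, mirroring
    InputValidator.validate_type."""
    if not isinstance(value, expected_type):
        raise TypeError(
            f"{variable_name} must be {expected_type.__name__}, "
            f"got {type(value).__name__}"
        )
```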