GHCNM Data Processing Scripts
This document provides an overview of the Python scripts used to download, process, and format the Global Historical Climatology Network Monthly (GHCNM) datasets for the application.
1. ghcnm_read_prec.py (Precipitation Data)
Purpose:
Downloads the latest GHCNM v4 precipitation archive, parses the station files concurrently, applies quality control filters, and outputs the combined dataset into a compressed Parquet format suitable for the web application.
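The concurrent parsing step could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the script's actual code: `worker_count`, `parse_station`, and `parse_all` are hypothetical names, and only the 75% figure and the use of `ProcessPoolExecutor` come from this document.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def worker_count(fraction=0.75, cpu_total=None):
    """Return the number of worker processes: a fraction of the cores, at least 1."""
    if cpu_total is None:
        cpu_total = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, int(cpu_total * fraction))


def parse_station(path):
    """Hypothetical stand-in for the per-station parser; here it only returns the path length."""
    return len(path)


def parse_all(paths):
    """Run parse_station over all paths in a process pool, keeping input order."""
    with ProcessPoolExecutor(max_workers=worker_count()) as pool:
        # A chunksize above 1 reduces inter-process overhead for many small files.
        return list(pool.map(parse_station, paths, chunksize=64))

# Call parse_all(...) from under an `if __name__ == "__main__":` guard,
# which process pools require on platforms that spawn new interpreters.
```

Capping the pool below the full core count is what leaves headroom for the rest of the system while thousands of files are being parsed.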
Workflow Details:
- Download & Extract:
  - Scrapes the NOAA NCEI precipitation archive directory to find the latest `.tar.gz` file.
  - Downloads and extracts the fixed-width format `.csv` files into a temporary `misc/data/pp` directory.
- Concurrent Processing:
  - Uses `concurrent.futures.ProcessPoolExecutor` with 75% of the available CPU cores to process thousands of station files in parallel, leaving enough headroom to keep the system responsive.
- Data Cleaning & Transformation:
  - Parses the fixed-width formatted text files into a pandas DataFrame.
  - Splits the `year_month` string into separate `year` and `month` columns.
  - Converts precipitation values from tenths of millimeters to millimeters.
  - Handles "Trace Precipitation" records (flagged as `-1`) by converting them to `0` mm.
- Quality Control & Filtering:
  - Drops rows containing bad quality flags (`O`, `R`, `T`, `S`, `K`).
  - Skips stations entirely if they have fewer than 120 valid records (the equivalent of 10 years of monthly data).
  - Skips stations entirely if they contain extreme, likely erroneous outliers (precipitation > 2000 mm in a single month).
- Output:
  - Combines everything into a long-format dataset (`ID`, `YEAR`, `MONTH`, `VALUE`).
  - Saves the main dataset as `www/data/tabs/prec_long.parquet`.
  - Generates an availability summary (`first_year` and `last_year` per station) and saves it to `www/data/tabs/prec_availability.csv`.
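The cleaning and quality control rules above can be illustrated with a short sketch. It is not the script's actual code: the function name `clean_station`, the input column names (`year_month`, `value`, `qc_flag`), the `YYYYMM` layout of `year_month`, and the order in which the filters run are all assumptions made for the example.

```python
import pandas as pd

BAD_FLAGS = {"O", "R", "T", "S", "K"}  # quality flags listed in the workflow
MIN_RECORDS = 120                       # 10 years of monthly values
MAX_MONTHLY_MM = 2000                   # outlier threshold in millimeters


def clean_station(df):
    """Clean one station's records; return None if the station should be skipped.

    Assumed (hypothetical) input columns: year_month (str, 'YYYYMM'),
    value (int, tenths of mm, -1 = trace), qc_flag (str).
    """
    # Drop rows that carry a bad quality flag.
    df = df[~df["qc_flag"].isin(BAD_FLAGS)].copy()
    df["YEAR"] = df["year_month"].str[:4].astype(int)
    df["MONTH"] = df["year_month"].str[4:6].astype(int)
    # Trace precipitation (-1) becomes 0 before the unit conversion.
    raw = df["value"].where(df["value"] != -1, 0)
    df["VALUE"] = raw / 10.0  # tenths of mm -> mm
    # Skip short records and stations with implausible monthly totals.
    if len(df) < MIN_RECORDS or (df["VALUE"] > MAX_MONTHLY_MM).any():
        return None
    return df[["YEAR", "MONTH", "VALUE"]].reset_index(drop=True)
```

Returning `None` for a rejected station lets the caller filter skipped stations out before concatenating the remaining frames and adding the station `ID`.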
2. ghcnm_read_tavg.py (Average Temperature Data)
Purpose:
Downloads the latest GHCNM v4 average temperature dataset, parses both the data (.dat) and inventory/metadata (.inv) files, converts the data from a wide matrix to a long format, and cleans up the temporary files.
Workflow Details:
- Download & Extract:
  - Directly downloads the `ghcnm.tavg.latest.qcf.tar.gz` archive from NOAA.
  - Extracts the contents to `misc/data`.
- Data Processing (`.dat` file):
  - Finds the `.dat` file and parses the fixed-width format.
  - The raw data is in a "wide" format containing 12 columns for months (`VALUE1` to `VALUE12`).
  - Converts temperature values from hundredths of a degree Celsius to degrees Celsius (dividing by 100) and handles missing values (`-9999`).
  - Uses `pd.wide_to_long` to transform the data into a long, tidy format (`ID`, `YEAR`, `MONTH`, `VALUE`).
- Metadata Processing (`.inv` file):
  - Finds the `.inv` inventory file to extract station metadata.
  - Parses fixed-width columns for `ID`, `LATITUDE`, `LONGITUDE`, `STNELEV` (elevation), and `NAME`.
  - Replaces missing elevation values (`-999.0`) with `NaN`.
- Output:
  - Saves the main temperature dataset to `www/data/tabs/tavg_long.parquet`.
  - Generates an availability summary (`first_year` and `last_year` per station) and saves it to `www/data/tabs/tavg_availability.csv`.
  - Saves the parsed station metadata to `www/data/tabs/tavg_meta.csv`.
- Cleanup:
  - Automatically removes all downloaded folders and extracted files in `misc/data/` using `glob` and `shutil` to free up disk space.
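As an illustration of the wide-to-long step and the availability summary, here is a minimal sketch. The function names `tavg_wide_to_long` and `availability` are hypothetical, and dropping months with missing values (rather than keeping them as `NaN`) is an assumption; the script may handle them differently.

```python
import numpy as np
import pandas as pd


def tavg_wide_to_long(wide):
    """Reshape VALUE1..VALUE12 columns into one row per station, year, and month."""
    value_cols = [f"VALUE{m}" for m in range(1, 13)]
    wide = wide.copy()
    # -9999 marks a missing month; other values are hundredths of a degree Celsius.
    wide[value_cols] = wide[value_cols].replace(-9999, np.nan) / 100.0
    long = pd.wide_to_long(wide, stubnames="VALUE", i=["ID", "YEAR"], j="MONTH")
    # Assumption: months without data are dropped from the long table.
    long = long.reset_index().dropna(subset=["VALUE"])
    return long.sort_values(["ID", "YEAR", "MONTH"]).reset_index(drop=True)


def availability(long):
    """First and last year with data for each station."""
    years = long.groupby("ID")["YEAR"]
    return years.agg(first_year="min", last_year="max").reset_index()
```

`pd.wide_to_long` derives the `MONTH` values from the numeric suffix of each `VALUE` column, so no manual renaming is needed before reshaping.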