SNR-Bias GRL Reproducibility Package

This repository archives the project-controlled code, trained checkpoints, derived result tables, plotting code, and manuscript-facing figures for the GRL manuscript:

Signal-to-Noise Filtering Biases Seismic Deep Learning From Training to Deployment

The purpose of the archive is to make the manuscript results auditable without redistributing large upstream datasets. Raw waveform files, continuous picker JSONL files, station databases, and ambient-noise HDF5 data are not included. Those data products are already public or citable from their original sources and should be downloaded separately when rerunning training or association.

Quick Start: Open-Data Figure Reproduction

The primary reproduction entrypoint recomputes statistics from the public waveform, continuous-pick, and dispersion data products and then regenerates the main manuscript figures from the newly generated intermediate outputs:

pip install -r requirements.txt
python code/scripts/reproduce_paper_figures_from_open_data.py \
  --credit-h5 /path/to/credit-x1.h5 \
  --credit-keys /path/to/creditkeys.npz \
  --ncf-h5 /path/to/ncf_disp_dataset_with_disp_image.h5 \
  --continuous-pick-dir /path/to/SeismicX-Cont/hourly/all \
  --continuous-label-json /path/to/SeismicX-Cont/annotations_for_continuous_hdf5.json \
  --continuous-waveform-db /path/to/SeismicX-Cont/waveform_index.sqlite \
  --phase-balanced-root /path/to/phase_balanced_20190706_20211113 \
  --work-dir open_data_work

This writes:

open_data_work/
  open_data_reproduction_manifest.json
  outputs/
  training_manifests/
  figures/
    fig_observability_real_data_v1.pdf
    fig_learning_selection_generalization_summary_v2.pdf
    fig_event_geometry_distribution_polished.pdf

The workflow never reads results/manuscript_figures/*_data.csv as source input. CSV/JSON files created under open_data_work/ are intermediate outputs generated during that run. The training manifests in open_data_work/ record the exact seed-fixed phase-picking and dispersion training examples selected in that run.
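A quick post-run check can confirm that the files listed above were written. The sketch below is not part of the archive: the helper name `missing_outputs` is hypothetical, and it only tests for the manifest and the three figure files named in the listing.

```python
from pathlib import Path

# Figure file names copied from the output listing above.
EXPECTED_FIGURES = (
    "fig_observability_real_data_v1.pdf",
    "fig_learning_selection_generalization_summary_v2.pdf",
    "fig_event_geometry_distribution_polished.pdf",
)


def missing_outputs(work_dir):
    """Return the expected output paths that are absent under work_dir."""
    root = Path(work_dir)
    expected = [root / "open_data_reproduction_manifest.json"]
    expected += [root / "figures" / name for name in EXPECTED_FIGURES]
    return [str(path) for path in expected if not path.is_file()]


if __name__ == "__main__":
    # Hypothetical usage: the directory name matches --work-dir above.
    for path in missing_outputs("open_data_work"):
        print("missing:", path)
```

An empty result means the manifest and all three main figures exist; it says nothing about their contents.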

For detailed input expectations, smoke-test options, and the Figure 3 phase-balanced association preparation commands, see OPEN_DATA_REPRODUCTION.md.

Fast Cached Figure Check

For quick visual inspection only, the archive also includes the manuscript plotted-data exports and a renderer:

python code/scripts/plot_all_paper_figures.py

This path redraws the figures from cached CSV/JSON exports in results/manuscript_figures/. It is useful for checking figure rendering but is not the primary data-backed reproduction workflow.

Repository Layout

.
├── code/
│   ├── scripts/          # Training, evaluation, aggregation, bootstrap, and plotting scripts
│   ├── odata/            # Continuous filtering and REAL-association helper scripts
│   ├── models/           # Phase-picking neural-network definitions
│   ├── utils/            # Dataset and waveform utility code used by training scripts
│   ├── dispnet.v2.3.py   # Dispersion model and training utilities
│   └── pnsn.train.v3.60s.py
├── checkpoints/
│   ├── base/             # Base PNSN checkpoint used for transfer learning
│   ├── phase_picker/     # Fine-tuned and scratch phase-picking checkpoints
│   └── dispersion/       # DispNet checkpoints
├── configs/
│   └── manuscript_reproduction.json
├── training_manifests/   # Exact seed-fixed training keys/records, no raw waveforms
├── results/
│   ├── manuscript_figures/
│   ├── phase_picker/
│   ├── dispersion/
│   ├── multiseed/
│   ├── bootstrap/
│   └── snr_filtered_test_precision/
├── DATASETS.bib
├── OPEN_DATA_REPRODUCTION.md
├── CITATION.cff
├── LICENSE
├── NOTICE
└── CHECKSUMS.sha256

What Is Included

This archive includes:

All manuscript-facing Python scripts used for phase-picking, dispersion, filtering, aggregation, bootstrap summaries, and figure generation.
Phase-picker model definitions and the DispNet v2.3 model definition.
The base PNSN checkpoint used for transfer-learning experiments.
Three-seed phase-picking checkpoints for both fine-tuning and scratch training: seed20260609, seed20260610, and seed20260611.
Three-seed dispersion checkpoints for full, medium-SNR, and high-SNR matched training.
Fixed reproduction configuration in configs/manuscript_reproduction.json, including the manuscript seeds, subset-selection seed rules, thresholds, train/evaluation budgets, and association settings.
Exact training manifests:
- training_manifests/phase_picker/seed*_train_records.jsonl.gz lists the CREDIT-X1local record keys and labels selected for each seed and condition.
- training_manifests/dispersion/seed*_train_keys.txt lists the SeisDispFusion-NCF training keys selected for each seed and condition.
- training_manifests/source_caches/ stores compressed record-index and SNR caches used to regenerate those manifests from the public datasets.
Per-seed summaries, training logs, multi-seed aggregate tables, bootstrap tables, and SNR-filtered-test precision summaries.
The open-data reproduction driver code/scripts/reproduce_paper_figures_from_open_data.py.
The direct pretrained phase-picker baseline evaluator code/scripts/evaluate_phase_direct_baseline.py.
Cached plotted-data CSV/JSON exports and the final main figures used in the current manuscript, for fast visual checks.
Documentation for matching the archived results to the GRL manuscript figures and tables.

This archive does not include:

CREDIT-X1local waveform HDF5 files or split-key files.
SeismicX-Cont continuous waveform, annotation, station-index, or pick JSONL files.
SeisDispFusion-NCF waveform/dispersion HDF5 files.
Full intermediate REAL working directories or large phase-balanced continuous pick streams.
Unreported exploratory smoke-test outputs that were not used in the GRL manuscript.

Data Sources

Download or access the data below before rerunning training, evaluation, or continuous association.

Continuous Waveform and Association Data

Dataset: SeismicX-Cont
URL: https://huggingface.co/datasets/cangyeone/SeismicX-Cont
DOI: 10.57967/hf/9006
Revision used in the manuscript: 96367f8
Used for the two-day continuous association diagnostic and labeled pick coverage checks.

Ambient-Noise Dispersion Data

Dataset: SeisDispFusion-NCF
URL: https://huggingface.co/datasets/cangyeone/SeisDispFusion-NCF
DOI: 10.57967/hf/9114
Revision used in the manuscript: afcd805
Used for the ambient-noise dispersion SNR-filtering experiment.

CREDIT-X1local

Article: CREDIT-X1local: A reference dataset for machine learning seismology from ChinArray in Southwest China
DOI: 10.1016/j.eqs.2024.01.018
URL: https://www.equsci.org.cn/en/article/doi/10.1016/j.eqs.2024.01.018
Used for the matched phase-picking training experiment.

BibTeX entries for the data sources are in DATASETS.bib.

The primary open-data reproduction command is documented in OPEN_DATA_REPRODUCTION.md.

Environment

The scripts were run with Python 3.12 on macOS. Python 3.10 or newer should be adequate for the plotting and table scripts. Training requires PyTorch and HDF5 support.

python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -r requirements.txt

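For deterministic CPU-only checks, pin the math libraries to a single thread. The third variable is unrelated to determinism: it suppresses the duplicate OpenMP runtime error that some macOS installations raise.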

export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export KMP_DUPLICATE_LIB_OK=TRUE

Reproduce the Phase-Picking Experiment

Expected local inputs:

CREDIT-X1local waveform file, for example /path/to/credit-x1.h5
CREDIT split keys, for example /path/to/creditkeys.npz
Base checkpoint included here: checkpoints/base/pnsn.v3.pt

Run the three manuscript seeds:

for seed in 20260609 20260610 20260611; do
  python code/scripts/snr_transfer_phase_balanced_experiment.py \
    --h5 /path/to/credit-x1.h5 \
    --keys /path/to/creditkeys.npz \
    --base-ckpt checkpoints/base/pnsn.v3.pt \
    --out-dir results/phase_picker/seed${seed}_rerun \
    --seed ${seed} \
    --train-steps 2000 \
    --scratch-train-steps 10000 \
    --train-batch 16 \
    --eval-samples 10000 \
    --init-modes finetune scratch \
    --filter-mode record-any \
    --s-threshold-mode same-as-p \
    --match-mode phase-composition
done

The current manuscript used matched sample budgets:

60,837 waveforms per training condition.
identical record-level phase composition: 2,951 P-only, 57,844 P+S, and 42 S-only records.
retained-waveform rule: a waveform is kept if any P or S label passes the phase-aware SNR threshold, and all original labels in that waveform are kept.
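The retained-waveform rule can be illustrated with a minimal sketch. The record layout (`labels` as a list of `(phase, snr_db)` tuples), the function names, the inclusive `>=` comparison, and the 10 dB thresholds in the usage example are all assumptions for illustration; the archived training script defines the actual behavior.

```python
def passes(label, thresholds_db):
    """True if one (phase, snr_db) label meets its phase-specific threshold."""
    phase, snr_db = label
    return snr_db >= thresholds_db[phase]  # assumption: inclusive comparison


def retain(records, thresholds_db):
    """Keep a waveform if ANY label passes; kept records retain ALL labels."""
    return [
        rec for rec in records
        if any(passes(label, thresholds_db) for label in rec["labels"])
    ]


def composition(record):
    """Classify a record as P-only, S-only, or P+S from its label phases."""
    phases = {phase for phase, _ in record["labels"]}
    if phases == {"P"}:
        return "P-only"
    if phases == {"S"}:
        return "S-only"
    if phases == {"P", "S"}:
        return "P+S"
    raise ValueError(f"unexpected phases: {sorted(phases)}")


if __name__ == "__main__":
    # Hypothetical records and thresholds, for illustration only.
    demo = [
        {"key": "a", "labels": [("P", 12.0), ("S", 2.0)]},
        {"key": "b", "labels": [("P", 1.0)]},
    ]
    kept = retain(demo, {"P": 10.0, "S": 10.0})
    print([(rec["key"], composition(rec)) for rec in kept])
    # The three composition counts above sum to the per-condition budget.
    print(2951 + 57844 + 42)
```

Record `a` is kept with both labels even though its S label is below threshold, which is the point of the record-level rule.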

Published checkpoints from the manuscript runs are in:

checkpoints/phase_picker/seed20260609/
checkpoints/phase_picker/seed20260610/
checkpoints/phase_picker/seed20260611/

Each seed contains fine-tuned and scratch checkpoints for full, medium-SNR, and high-SNR matched training.

The exact seed-fixed training-record manifests are in:

training_manifests/phase_picker/
  phase_training_manifest_summary.csv
  seed20260609_full_train_records.jsonl.gz
  seed20260609_p5_s_bal_train_records.jsonl.gz
  seed20260609_p10_s_bal_train_records.jsonl.gz
  ...
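
The manifests are gzip-compressed JSON Lines files, so they can be inspected with the standard library alone. The sketch below only counts records and makes no assumption about field names, which are not documented here; `count_manifest_records` is a hypothetical helper, not an archive script.

```python
import gzip
import json


def count_manifest_records(path):
    """Count the JSON records in a .jsonl.gz manifest, one record per line."""
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)  # raises ValueError on a malformed record
                count += 1
    return count


if __name__ == "__main__":
    path = "training_manifests/phase_picker/seed20260609_full_train_records.jsonl.gz"
    print(path, count_manifest_records(path))
```

Comparing the count against the 60,837-waveform budget stated above is a cheap consistency check before retraining.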

Regenerate the manifests from the archived source caches with:

python code/scripts/export_training_manifests.py \
  --out-dir training_manifests_rerun \
  --phase-records-json training_manifests/source_caches/credit_records_train_all.json.gz \
  --phase-snr-json training_manifests/source_caches/credit_train_phase_snr_db.json.gz \
  --seeds 20260609 20260610 20260611

Reproduce the Dispersion Experiment

Expected local input:

SeisDispFusion-NCF HDF5 file, for example /path/to/ncf_disp_dataset_with_disp_image.h5

Run the three manuscript seeds:

for seed in 20260609 20260610 20260611; do
  python code/scripts/disp_snr_transfer_experiment.py \
    --h5 /path/to/ncf_disp_dataset_with_disp_image.h5 \
    --out-dir results/dispersion/seed${seed}_rerun \
    --seed ${seed} \
    --mode scratch \
    --epochs 5 \
    --batch-size 256 \
    --device cpu \
    --num-workers 0
done

The manuscript run used:

11,033 training samples per condition.
8,292 unfiltered test samples.
training conditions: full matched, SNR > 3.04 dB matched, and SNR > 6.77 dB matched.
optimizer: AdamW, learning rate 2e-4.
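
One way to picture a "matched" design is that every condition draws the same number of training keys from its own SNR-filtered pool with a fixed seed. The sketch below shows that idea only; the function name, condition names, and seeding scheme are hypothetical, and the archive's actual subset-selection rules are the ones recorded in configs/manuscript_reproduction.json.

```python
import random


def matched_subsets(snr_db_by_key, thresholds_db, budget, seed):
    """Draw `budget` keys per condition from its SNR-filtered pool.

    snr_db_by_key: dict mapping a sample key to its SNR in dB.
    thresholds_db: dict mapping a condition name to a lower SNR bound in dB,
                   or None for the unfiltered condition.
    """
    subsets = {}
    for name, threshold in thresholds_db.items():
        # Sort the pool so the draw does not depend on dict ordering.
        pool = sorted(
            key for key, snr in snr_db_by_key.items()
            if threshold is None or snr > threshold
        )
        if len(pool) < budget:
            raise ValueError(f"{name}: pool of {len(pool)} is below budget {budget}")
        rng = random.Random(f"{seed}-{name}")  # assumption: one stream per condition
        subsets[name] = rng.sample(pool, budget)
    return subsets


if __name__ == "__main__":
    # Synthetic SNR values; real values would come from the SNR cache.
    snr = {f"k{i:02d}": float(i) for i in range(20)}
    conditions = {"full": None, "snr_gt_3.04": 3.04, "snr_gt_6.77": 6.77}
    for name, keys in matched_subsets(snr, conditions, 5, 20260609).items():
        print(name, sorted(keys))
```

Holding the budget fixed across conditions separates the effect of SNR filtering from the effect of training-set size.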

Published trained checkpoints are in checkpoints/dispersion/seed*/.

The exact seed-fixed dispersion training-key manifests are in:

training_manifests/dispersion/
  dispersion_training_manifest_summary.csv
  seed20260609_full_train_keys.txt
  seed20260609_snr_q1_train_keys.txt
  seed20260609_snr_q2_train_keys.txt
  ...

Regenerate them from the archived SNR cache with:

python code/scripts/export_training_manifests.py \
  --out-dir training_manifests_rerun \
  --dispersion-snr-json training_manifests/source_caches/ncf_snr_cache_seed20260609.json.gz \
  --seeds 20260609 20260610 20260611

Reproduce the SNR-Filtered Test Precision Table

After the phase-picking checkpoints and CREDIT-X1local inputs are available, rerun the Table 2 precision/recall audit with:

python code/scripts/snr_filtered_test_precision_table.py \
  --h5 /path/to/credit-x1.h5 \
  --keys /path/to/creditkeys.npz \
  --source-dirs \
    results/phase_picker/seed20260609_rerun \
    results/phase_picker/seed20260610_rerun \
    results/phase_picker/seed20260611_rerun \
  --out-dir results/snr_filtered_test_precision_rerun

The manuscript-facing saved results are in results/snr_filtered_test_precision/.

Aggregate the Three Training Seeds

The saved manuscript aggregates are in results/multiseed/. To recompute aggregates from a parent-project layout, run:

python code/scripts/grl_aggregate_multiseed.py \
  --seeds 20260609 20260610 20260611 \
  --phase-prefix snr_transfer_phase_any_seed \
  --out-dir results/multiseed_rerun

If you run the script directly inside this archive after retraining, either point it at the rerun summaries or copy them so that the expected outputs/<experiment>/summary.json layout exists.

Continuous Association Diagnostic

The continuous association diagnostic is a deployment-side audit using the same learned picker output stream but different retention rules before REAL association.

Manuscript settings:

days: 2019-07-06 and 2021-11-13
SNR conversion: 10*log10(SNR ratio), where the stored picker SNR is a standard-deviation ratio
baseline SNR threshold: 4.25 dB
retained picks: 576,875
confidence comparator: top 576,875 picks by phase probability only
REAL chunking: 15-min windows
REAL -R: 0.4/25/0.05/3/5
REAL -S: 4/2/3/2/1.0/0.1/1.0
event matching: within 5 s in origin time and 30 km in epicentral distance for the baseline association; within 3 s and 20 km for the phase-balanced sensitivity
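
The two retention rules compared here can be sketched as follows. The conversion follows the formula stated above, 10*log10 of the stored ratio. The pick field names (`snr`, `prob`), the function names, and the inclusive threshold comparison are assumptions for illustration and may not match the helper scripts in code/odata/.

```python
import math


def ratio_to_db(ratio):
    """Convert a stored SNR ratio to dB using 10*log10, as stated above."""
    if ratio <= 0:
        return float("-inf")  # a non-positive ratio can never pass a threshold
    return 10.0 * math.log10(ratio)


def snr_filter(picks, threshold_db=4.25):
    """Keep picks whose converted SNR meets the threshold."""
    return [p for p in picks if ratio_to_db(p["snr"]) >= threshold_db]


def top_confidence(picks, n):
    """Keep the n picks with the highest phase probability, ignoring SNR."""
    return sorted(picks, key=lambda p: p["prob"], reverse=True)[:n]


if __name__ == "__main__":
    # Hypothetical picks, for illustration only.
    picks = [
        {"snr": 10.0, "prob": 0.2},
        {"snr": 2.0, "prob": 0.9},
        {"snr": 3.0, "prob": 0.5},
    ]
    by_snr = snr_filter(picks)
    # The comparator keeps the same NUMBER of picks as the SNR filter.
    by_conf = top_confidence(picks, len(by_snr))
    print(len(by_snr), len(by_conf))
```

Matching the retained-pick count is what makes the comparison fair: both rules pass the same number of picks to association and differ only in which picks they keep.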

Helper scripts are in code/odata/. The saved manuscript summaries are in results/continuous_association/, results/bootstrap/, and results/manuscript_figures/.

Main Numerical Results in the Archive

Baseline continuous association:

Filter                  Retained picks   Event TP   Event precision   Event recall   Event F1
SNR >= 4.25 dB          576,875          1,301      0.472             0.556          0.511
Top phase probability   576,875          1,561      0.491             0.667          0.565
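
The recall column can be cross-checked from the true-positive counts and the 2,340 reference catalog events reported in this section. The sketch below uses the standard definitions of recall and F1; the function names are illustrative and not taken from the archive.

```python
def recall(true_positives, n_reference):
    """Fraction of reference catalog events that were matched."""
    return true_positives / n_reference


def f1(precision, recall_value):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall_value / (precision + recall_value)


if __name__ == "__main__":
    n_reference = 2340  # reference catalog events reported in this section
    for name, tp in (("SNR >= 4.25 dB", 1301), ("Top phase probability", 1561)):
        print(name, round(recall(tp, n_reference), 3))
```

F1 recomputed from the rounded precision and recall in the table can differ from the tabulated value in the last digit, because the archived summaries were presumably computed from unrounded quantities.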

Paired catalog-event bootstrap:

SNR event recall: 0.556
confidence event recall: 0.667
SNR minus confidence recall difference: -0.111
95% interval: [-0.127, -0.094]
reference catalog events: 2,340
resamples: 10,000
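
A paired bootstrap over catalog events resamples the reference events with replacement and recomputes the recall difference on each resample, keeping both filters' outcomes for the same event together. The sketch below is a generic percentile version under that assumption; the function name, inputs, and interval construction are hypothetical, and the archived bootstrap script may differ in detail.

```python
import random


def paired_recall_bootstrap(matched_a, matched_b, n_resamples=10_000, seed=0):
    """Percentile bootstrap for recall(A) - recall(B) over the same events.

    matched_a, matched_b: equal-length sequences of 0/1 flags, one per
    reference catalog event, marking whether each filter recovered it.
    """
    if len(matched_a) != len(matched_b) or not matched_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(matched_a)
    # Resampling per-event differences keeps the pairing intact.
    per_event = [int(a) - int(b) for a, b in zip(matched_a, matched_b)]
    rng = random.Random(seed)
    diffs = sorted(
        sum(rng.choices(per_event, k=n)) / n for _ in range(n_resamples)
    )
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[min(n_resamples - 1, int(0.975 * n_resamples))]
    return {"diff": sum(per_event) / n, "lo": lower, "hi": upper}


if __name__ == "__main__":
    # Synthetic flags for 10 events; real flags would come from event matching.
    snr_flags = [1] * 5 + [0] * 5
    conf_flags = [1] * 7 + [0] * 3
    print(paired_recall_bootstrap(snr_flags, conf_flags, n_resamples=1000, seed=1))
```

An interval that excludes zero, like the one reported above, indicates that the recall gap is not explained by which catalog events happened to be in the reference set.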

Matched learning results are stored in:

results/multiseed/phase_any/
results/multiseed/phase_dispersion/
results/snr_filtered_test_precision/

Check File Integrity

After download, verify the archive contents:

shasum -a 256 -c CHECKSUMS.sha256
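
On systems without `shasum`, the same check can be done with Python's standard library. This sketch assumes CHECKSUMS.sha256 uses the usual `shasum` layout of one `<hex digest>  <relative path>` pair per line; `verify_checksums` is a hypothetical helper, not an archive script.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large checkpoints do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file, root="."):
    """Return the listed paths that are missing or whose digest differs."""
    failures = []
    for line in Path(checksum_file).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        name = name.lstrip("*")  # shasum marks binary-mode entries with '*'
        target = Path(root) / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures


if __name__ == "__main__":
    bad = verify_checksums("CHECKSUMS.sha256")
    print("OK" if not bad else f"FAILED: {bad}")
```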

If you rerun scripts, generated files will change and the checksum file should be regenerated for the new archive state.

Reproducibility Boundaries

This archive supports two levels of reproducibility.

First, manuscript figures and tables can be checked from the archived derived CSV/JSON summaries and plotted-data exports without downloading the raw data. This is the expected quick rendering check for reviewers.

Second, full training, testing, SNR filtering, and association can be rerun after downloading the public datasets listed above. These reruns require substantial storage and compute. The intended entrypoint is code/scripts/reproduce_paper_figures_from_open_data.py; the local data paths are explicit command-line arguments because raw data are not redistributed here.

The archive does not include the manuscript source itself. The submitted GRL manuscript should cite this repository URL and the final immutable commit hash in the Open Research section.

The archive includes exact training-key manifests and fixed seed values, but it does not redistribute raw CREDIT-X1local, SeismicX-Cont, or SeisDispFusion-NCF waveform/dispersion arrays. Those raw data should be obtained from the public sources listed above.

The archive does not claim that the reported SNR thresholds are universal physical thresholds across tasks. Phase-picking SNR and dispersion SNR are task-specific filtering scales, as described in the manuscript and SI.

Citation

Please cite this archive, the GRL manuscript, and the upstream datasets when reusing the materials. A CITATION.cff file and data-source BibTeX entries are included. The repository URL for this archive is:

https://huggingface.co/cangyeone/snr_bias

Use the final Hugging Face commit hash as the archive revision in the GRL Open Research statement after the upload is complete.

License

Unless otherwise noted, project-controlled code, derived outputs, summary tables, figure assets, model checkpoints, and reproducibility notes in this archive are released under the Creative Commons Attribution 4.0 International license (CC BY 4.0).

Raw datasets are not redistributed in this archive and should be used under the terms stated by their original repositories or publications.
