- SNR-Bias GRL Reproducibility Package
- Quick Start: Open-Data Figure Reproduction
- Fast Cached Figure Check
- Repository Layout
- What Is Included
- Data Sources
- Environment
- Reproduce the Phase-Picking Experiment
- Reproduce the Dispersion Experiment
- Reproduce the SNR-Filtered Test Precision Table
- Aggregate the Three Training Seeds
- Continuous Association Diagnostic
- Main Numerical Results in the Archive
- Check File Integrity
- Reproducibility Boundaries
- Citation
- License
SNR-Bias GRL Reproducibility Package
This repository archives the project-controlled code, trained checkpoints, derived result tables, plotting code, and manuscript-facing figures for the GRL manuscript:
Signal-to-Noise Filtering Biases Seismic Deep Learning From Training to Deployment
The purpose of the archive is to make the manuscript results auditable without redistributing large upstream datasets. Raw waveform files, continuous picker JSONL files, station databases, and ambient-noise HDF5 data are not included. Those data products are already public or citable from their original sources and should be downloaded separately when rerunning training or association.
Quick Start: Open-Data Figure Reproduction
The primary reproduction entrypoint recomputes statistics from the public waveform, continuous-pick, and dispersion data products and then regenerates the main manuscript figures from the newly generated intermediate outputs:
pip install -r requirements.txt
python code/scripts/reproduce_paper_figures_from_open_data.py \
--credit-h5 /path/to/credit-x1.h5 \
--credit-keys /path/to/creditkeys.npz \
--ncf-h5 /path/to/ncf_disp_dataset_with_disp_image.h5 \
--continuous-pick-dir /path/to/SeismicX-Cont/hourly/all \
--continuous-label-json /path/to/SeismicX-Cont/annotations_for_continuous_hdf5.json \
--continuous-waveform-db /path/to/SeismicX-Cont/waveform_index.sqlite \
--phase-balanced-root /path/to/phase_balanced_20190706_20211113 \
--work-dir open_data_work
This writes:
open_data_work/
  open_data_reproduction_manifest.json
  outputs/
  training_manifests/
  figures/
    fig_observability_real_data_v1.pdf
    fig_learning_selection_generalization_summary_v2.pdf
    fig_event_geometry_distribution_polished.pdf
The workflow never reads results/manuscript_figures/*_data.csv as source
input. CSV/JSON files created under open_data_work/ are intermediate outputs
generated during that run. The training manifests in open_data_work/ record
the exact seed-fixed phase-picking and dispersion training examples selected in
that run.
For detailed input expectations, smoke-test options, and the Figure 3
phase-balanced association preparation commands, see
OPEN_DATA_REPRODUCTION.md.
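After a run, a quick way to confirm that the expected outputs exist is a small check such as the sketch below. It is not part of the archive. It assumes the manifest sits at the top of the work directory and the three PDFs sit under figures/, and the helper name missing_outputs is hypothetical.

```python
from pathlib import Path

# Figure file names as listed in this README.
EXPECTED_FIGURES = (
    "fig_observability_real_data_v1.pdf",
    "fig_learning_selection_generalization_summary_v2.pdf",
    "fig_event_geometry_distribution_polished.pdf",
)


def missing_outputs(work_dir):
    """Return the expected output paths that are absent under work_dir."""
    root = Path(work_dir)
    # Assumed layout: manifest at the top level, PDFs under figures/.
    expected = [root / "open_data_reproduction_manifest.json"]
    expected += [root / "figures" / name for name in EXPECTED_FIGURES]
    return [str(path) for path in expected if not path.is_file()]
```

Calling missing_outputs("open_data_work") returns an empty list when the manifest and all three figures are present.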
Fast Cached Figure Check
For quick visual inspection only, the archive also includes the manuscript plotted-data exports and a renderer:
python code/scripts/plot_all_paper_figures.py
This path redraws the figures from cached CSV/JSON exports in
results/manuscript_figures/. It is useful for checking figure rendering but is
not the primary data-backed reproduction workflow.
Repository Layout
.
├── code/
│   ├── scripts/                  # Training, evaluation, aggregation, bootstrap, and plotting scripts
│   ├── odata/                    # Continuous filtering and REAL-association helper scripts
│   ├── models/                   # Phase-picking neural-network definitions
│   ├── utils/                    # Dataset and waveform utility code used by training scripts
│   ├── dispnet.v2.3.py           # Dispersion model and training utilities
│   └── pnsn.train.v3.60s.py
├── checkpoints/
│   ├── base/                     # Base PNSN checkpoint used for transfer learning
│   ├── phase_picker/             # Fine-tuned and scratch phase-picking checkpoints
│   └── dispersion/               # DispNet checkpoints
├── configs/
│   └── manuscript_reproduction.json
├── training_manifests/           # Exact seed-fixed training keys/records, no raw waveforms
├── results/
│   ├── manuscript_figures/
│   ├── phase_picker/
│   ├── dispersion/
│   ├── multiseed/
│   ├── bootstrap/
│   └── snr_filtered_test_precision/
├── DATASETS.bib
├── OPEN_DATA_REPRODUCTION.md
├── CITATION.cff
├── LICENSE
├── NOTICE
└── CHECKSUMS.sha256
What Is Included
This archive includes:
- All manuscript-facing Python scripts used for phase-picking, dispersion, filtering, aggregation, bootstrap summaries, and figure generation.
- Phase-picker model definitions and the DispNet v2.3 model definition.
- The base PNSN checkpoint used for transfer-learning experiments.
- Three-seed phase-picking checkpoints for both fine-tuning and scratch training: seed20260609, seed20260610, and seed20260611.
- Three-seed dispersion checkpoints for full, medium-SNR, and high-SNR matched training.
- Fixed reproduction configuration in configs/manuscript_reproduction.json, including the manuscript seeds, subset-selection seed rules, thresholds, train/evaluation budgets, and association settings.
- Exact training manifests:
  - training_manifests/phase_picker/seed*_train_records.jsonl.gz lists the CREDIT-X1local record keys and labels selected for each seed and condition.
  - training_manifests/dispersion/seed*_train_keys.txt lists the SeisDispFusion-NCF training keys selected for each seed and condition.
  - training_manifests/source_caches/ stores compressed record-index and SNR caches used to regenerate those manifests from the public datasets.
- Per-seed summaries, training logs, multi-seed aggregate tables, bootstrap tables, and SNR-filtered-test precision summaries.
- The open-data reproduction driver code/scripts/reproduce_paper_figures_from_open_data.py.
- The direct pretrained phase-picker baseline evaluator code/scripts/evaluate_phase_direct_baseline.py.
- Cached plotted-data CSV/JSON exports and the final main figures used in the current manuscript, for fast visual checks.
- Documentation for matching the archived results to the GRL manuscript figures and tables.
This archive does not include:
- CREDIT-X1local waveform HDF5 files or split-key files.
- SeismicX-Cont continuous waveform, annotation, station-index, or pick JSONL files.
- SeisDispFusion-NCF waveform/dispersion HDF5 files.
- Full intermediate REAL working directories or large phase-balanced continuous pick streams.
- Unreported exploratory smoke-test outputs that were not used in the GRL manuscript.
Data Sources
Download or access the data below before rerunning training, evaluation, or continuous association.
Continuous Waveform and Association Data
- Dataset: SeismicX-Cont
- URL: https://huggingface.co/datasets/cangyeone/SeismicX-Cont
- DOI: 10.57967/hf/9006
- Revision used in the manuscript: 96367f8
- Used for the two-day continuous association diagnostic and labeled pick coverage checks.
Ambient-Noise Dispersion Data
- Dataset: SeisDispFusion-NCF
- URL: https://huggingface.co/datasets/cangyeone/SeisDispFusion-NCF
- DOI: 10.57967/hf/9114
- Revision used in the manuscript: afcd805
- Used for the ambient-noise dispersion SNR-filtering experiment.
CREDIT-X1local
- Article: CREDIT-X1local: A reference dataset for machine learning seismology from ChinArray in Southwest China
- DOI: 10.1016/j.eqs.2024.01.018
- URL: https://www.equsci.org.cn/en/article/doi/10.1016/j.eqs.2024.01.018
- Used for the matched phase-picking training experiment.
BibTeX entries for the data sources are in DATASETS.bib.
The primary open-data reproduction command is documented in
OPEN_DATA_REPRODUCTION.md.
Environment
The scripts were run with Python 3.12 on macOS. Python 3.10 or newer should be adequate for the plotting and table scripts. Training requires PyTorch and HDF5 support.
python -m venv .venv
source .venv/bin/activate
pip install -U pip
pip install -r requirements.txt
For deterministic CPU-only checks, set:
export OMP_NUM_THREADS=1
export MKL_NUM_THREADS=1
export KMP_DUPLICATE_LIB_OK=TRUE
Reproduce the Phase-Picking Experiment
Expected local inputs:
- CREDIT-X1local waveform file, for example /path/to/credit-x1.h5
- CREDIT split keys, for example /path/to/creditkeys.npz
- Base checkpoint included here: checkpoints/base/pnsn.v3.pt
Run the three manuscript seeds:
for seed in 20260609 20260610 20260611; do
python code/scripts/snr_transfer_phase_balanced_experiment.py \
--h5 /path/to/credit-x1.h5 \
--keys /path/to/creditkeys.npz \
--base-ckpt checkpoints/base/pnsn.v3.pt \
--out-dir results/phase_picker/seed${seed}_rerun \
--seed ${seed} \
--train-steps 2000 \
--scratch-train-steps 10000 \
--train-batch 16 \
--eval-samples 10000 \
--init-modes finetune scratch \
--filter-mode record-any \
--s-threshold-mode same-as-p \
--match-mode phase-composition
done
The current manuscript used matched sample budgets:
- 60,837 waveforms per training condition.
- identical record-level phase composition: 2,951 P-only, 57,844 P+S, and 42 S-only records.
- retained-waveform rule: a waveform is kept if any P or S label passes the phase-aware SNR threshold, and all original labels in that waveform are kept.
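The retained-waveform rule and the record-level composition count can be illustrated with a minimal sketch. This is not the archived training script. The record layout (a "labels" list with "phase" and "snr_db" fields), the function names, and the use of >= at the threshold are assumptions made for illustration.

```python
def retain_waveforms(records, p_threshold_db, s_threshold_db):
    """Keep a waveform if any P or S label passes its threshold.

    Kept waveforms are returned unchanged, so all original labels survive.
    """
    kept = []
    for record in records:
        passes = any(
            label["snr_db"]
            >= (p_threshold_db if label["phase"] == "P" else s_threshold_db)
            for label in record["labels"]
        )
        if passes:
            kept.append(record)
    return kept


def phase_composition(records):
    """Count P-only, P+S, and S-only records."""
    counts = {"P-only": 0, "P+S": 0, "S-only": 0}
    for record in records:
        phases = {label["phase"] for label in record["labels"]}
        if phases == {"P"}:
            counts["P-only"] += 1
        elif phases == {"S"}:
            counts["S-only"] += 1
        elif phases == {"P", "S"}:
            counts["P+S"] += 1
    return counts


# The reported composition sums to the per-condition budget.
assert 2951 + 57844 + 42 == 60837
```

In this sketch a P+S waveform whose P label passes keeps its low-SNR S label as well, which is the behavior the rule describes.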
Published checkpoints from the manuscript runs are in:
checkpoints/phase_picker/seed20260609/
checkpoints/phase_picker/seed20260610/
checkpoints/phase_picker/seed20260611/
Each seed contains fine-tuned and scratch checkpoints for full, medium-SNR, and high-SNR matched training.
The exact seed-fixed training-record manifests are in:
training_manifests/phase_picker/
  phase_training_manifest_summary.csv
  seed20260609_full_train_records.jsonl.gz
  seed20260609_p5_s_bal_train_records.jsonl.gz
  seed20260609_p10_s_bal_train_records.jsonl.gz
  ...
Regenerate the manifests from the archived source caches with:
python code/scripts/export_training_manifests.py \
--out-dir training_manifests_rerun \
--phase-records-json training_manifests/source_caches/credit_records_train_all.json.gz \
--phase-snr-json training_manifests/source_caches/credit_train_phase_snr_db.json.gz \
--seeds 20260609 20260610 20260611
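To compare a regenerated manifest with an archived one, counting records is a reasonable first check. The sketch below only assumes that each manifest is gzip-compressed JSON Lines, as the .jsonl.gz extension suggests. It makes no assumption about the fields inside each record, and the function name is hypothetical.

```python
import gzip
import json


def count_manifest_records(path):
    """Count the non-empty JSON lines in a gzip-compressed JSONL manifest."""
    count = 0
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                json.loads(line)  # raises ValueError on a malformed line
                count += 1
    return count
```

If each line corresponds to one waveform record, the count for every condition should equal the 60,837-waveform budget stated above.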
Reproduce the Dispersion Experiment
Expected local input:
- SeisDispFusion-NCF HDF5 file, for example /path/to/ncf_disp_dataset_with_disp_image.h5
Run the three manuscript seeds:
for seed in 20260609 20260610 20260611; do
python code/scripts/disp_snr_transfer_experiment.py \
--h5 /path/to/ncf_disp_dataset_with_disp_image.h5 \
--out-dir results/dispersion/seed${seed}_rerun \
--seed ${seed} \
--mode scratch \
--epochs 5 \
--batch-size 256 \
--device cpu \
--num-workers 0
done
The manuscript run used:
- 11,033 training samples per condition.
- 8,292 unfiltered test samples.
- training conditions: full matched, SNR > 3.04 dB matched, and SNR > 6.77 dB matched.
- optimizer: AdamW, learning rate 2e-4.
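The idea of matched training conditions, in which every condition gets the same number of samples and only the SNR pool differs, can be sketched as follows. This is an illustration rather than the logic of disp_snr_transfer_experiment.py. The function name, the key-to-SNR dictionary, and the single seeded generator are assumptions, and the actual subset-selection seed rules are in configs/manuscript_reproduction.json.

```python
import random


def matched_subsets(snr_db_by_key, thresholds_db, budget, seed):
    """Draw equal-sized, seed-fixed training subsets for each SNR condition."""
    rng = random.Random(seed)
    # Sort keys so the draw does not depend on dictionary ordering.
    pools = {"full": sorted(snr_db_by_key)}
    for threshold in thresholds_db:
        pools[f"snr>{threshold}"] = sorted(
            key for key, snr in snr_db_by_key.items() if snr > threshold
        )
    subsets = {}
    for name, pool in pools.items():
        if len(pool) < budget:
            raise ValueError(f"pool {name!r} has fewer than {budget} samples")
        subsets[name] = sorted(rng.sample(pool, budget))
    return subsets
```

With thresholds (3.04, 6.77) and a budget of 11,033, this yields three subsets of identical size, so any performance difference between conditions cannot come from training-set size.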
Published trained checkpoints are in checkpoints/dispersion/seed*/.
The exact seed-fixed dispersion training-key manifests are in:
training_manifests/dispersion/
  dispersion_training_manifest_summary.csv
  seed20260609_full_train_keys.txt
  seed20260609_snr_q1_train_keys.txt
  seed20260609_snr_q2_train_keys.txt
  ...
Regenerate them from the archived SNR cache with:
python code/scripts/export_training_manifests.py \
--out-dir training_manifests_rerun \
--dispersion-snr-json training_manifests/source_caches/ncf_snr_cache_seed20260609.json.gz \
--seeds 20260609 20260610 20260611
Reproduce the SNR-Filtered Test Precision Table
After the phase-picking checkpoints and CREDIT-X1local inputs are available, rerun the Table 2 precision/recall audit with:
python code/scripts/snr_filtered_test_precision_table.py \
--h5 /path/to/credit-x1.h5 \
--keys /path/to/creditkeys.npz \
--source-dirs \
results/phase_picker/seed20260609_rerun \
results/phase_picker/seed20260610_rerun \
results/phase_picker/seed20260611_rerun \
--out-dir results/snr_filtered_test_precision_rerun
The manuscript-facing saved results are in
results/snr_filtered_test_precision/.
Aggregate the Three Training Seeds
The saved manuscript aggregates are in results/multiseed/. To recompute
aggregates from a parent-project layout, run:
python code/scripts/grl_aggregate_multiseed.py \
--seeds 20260609 20260610 20260611 \
--phase-prefix snr_transfer_phase_any_seed \
--out-dir results/multiseed_rerun
If running directly inside this archive after retraining, point the script at
the rerun summaries, or copy them, so that the expected
outputs/<experiment>/summary.json layout is available.
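For a quick sanity check of a rerun, per-seed summaries can be averaged by hand. The sketch below is not the logic of grl_aggregate_multiseed.py. It assumes each summary.json is a flat JSON object whose numeric entries are metrics, and the function names and the "f1" key in the usage note are hypothetical.

```python
import json
import statistics
from pathlib import Path


def load_summaries(paths):
    """Read one JSON summary per seed."""
    return [json.loads(Path(p).read_text(encoding="utf-8")) for p in paths]


def _numeric_keys(summary):
    """Names of top-level entries holding plain numbers (booleans excluded)."""
    return {
        key
        for key, value in summary.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


def aggregate(summaries):
    """Mean and sample standard deviation of each metric shared by all seeds."""
    shared = set.intersection(*(_numeric_keys(s) for s in summaries))
    result = {}
    for key in sorted(shared):
        values = [s[key] for s in summaries]
        std = statistics.stdev(values) if len(values) > 1 else 0.0
        result[key] = {"mean": statistics.mean(values), "std": std}
    return result
```

For three seeds with a hypothetical "f1" entry of 0.80, 0.82, and 0.84, aggregate returns a mean of 0.82 and a sample standard deviation of 0.02, up to floating-point rounding.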
Continuous Association Diagnostic
The continuous association diagnostic is a deployment-side audit using the same learned picker output stream but different retention rules before REAL association.
Manuscript settings:
- days: 2019-07-06 and 2021-11-13
- SNR conversion: 10*log10(SNR ratio), where the stored picker SNR is a standard-deviation ratio
- baseline SNR threshold: 4.25 dB
- retained picks: 576,875
- confidence comparator: top 576,875 picks by phase probability only
- REAL chunking: 15-min windows
- REAL -R: 0.4/25/0.05/3/5
- REAL -S: 4/2/3/2/1.0/0.1/1.0
- event matching: 5 s origin time and 30 km epicentral distance for the baseline association; 3 s and 20 km for the phase-balanced sensitivity
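The two retention rules compared in this diagnostic can be sketched as follows: an SNR filter using the stated dB conversion, and a confidence comparator that keeps the same number of picks ranked by phase probability alone. This is an illustration, not the helper code in code/odata/. The pick fields "snr_ratio" and "prob" and the function names are assumptions.

```python
import math


def snr_ratio_to_db(ratio):
    """Convert a stored SNR ratio to dB using the conversion stated above."""
    return 10.0 * math.log10(ratio)


def snr_filter(picks, threshold_db=4.25):
    """Keep picks whose SNR in dB is at or above the threshold."""
    return [
        pick
        for pick in picks
        if pick["snr_ratio"] > 0
        and snr_ratio_to_db(pick["snr_ratio"]) >= threshold_db
    ]


def top_confidence(picks, count):
    """Keep the `count` picks with the highest phase probability."""
    return sorted(picks, key=lambda pick: pick["prob"], reverse=True)[:count]
```

Calling top_confidence(picks, len(snr_filter(picks))) gives a comparator stream with exactly as many picks as the SNR-filtered stream, which is how both rows of the results table end up with 576,875 retained picks.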
Helper scripts are in code/odata/. The saved manuscript summaries are in
results/continuous_association/, results/bootstrap/, and
results/manuscript_figures/.
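The event-matching tolerances listed in the manuscript settings can be illustrated with a small sketch. The matching procedure used for the manuscript is not described in this README, so the greedy one-to-one assignment, the haversine distance on a 6371 km sphere, and the event fields "t", "lat", and "lon" are all assumptions.

```python
import math


def epicentral_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def count_matched_events(detected, catalog, max_dt_s=5.0, max_km=30.0):
    """Greedy one-to-one matching; returns the number of matched catalog events."""
    used = set()
    matched = 0
    for ref in catalog:
        best = None
        for index, det in enumerate(detected):
            if index in used:
                continue
            dt = abs(det["t"] - ref["t"])
            if dt > max_dt_s:
                continue
            dist = epicentral_km(ref["lat"], ref["lon"], det["lat"], det["lon"])
            if dist > max_km:
                continue
            # Prefer the detection closest in origin time.
            if best is None or dt < best[0]:
                best = (dt, index)
        if best is not None:
            used.add(best[1])
            matched += 1
    return matched
```

Passing max_dt_s=3.0 and max_km=20.0 applies the stricter tolerances used for the phase-balanced sensitivity test.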
Main Numerical Results in the Archive
Baseline continuous association:
| Filter | Retained picks | Event TP | Event precision | Event recall | Event F1 |
|---|---|---|---|---|---|
| SNR >= 4.25 dB | 576,875 | 1,301 | 0.472 | 0.556 | 0.511 |
| Top phase probability | 576,875 | 1,561 | 0.491 | 0.667 | 0.565 |
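The table entries are mutually consistent, which can be checked with a few lines. The sketch uses the 2,340 reference catalog events reported in the bootstrap summary as the recall denominator. The function names are illustrative. F1 recomputed from the rounded precision and recall can differ from the tabulated value by about 0.001.

```python
def recall(true_positives, reference_events):
    """Fraction of reference catalog events that were recovered."""
    return true_positives / reference_events


def f1_score(precision, recall_value):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall_value
    return 0.0 if total == 0 else 2 * precision * recall_value / total


# Recall recomputed from the event counts matches the table.
assert round(recall(1301, 2340), 3) == 0.556
assert round(recall(1561, 2340), 3) == 0.667

# F1 from rounded inputs agrees with the table to within rounding.
assert abs(f1_score(0.472, 0.556) - 0.511) < 1e-3
assert abs(f1_score(0.491, 0.667) - 0.565) < 1e-3
```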
Paired catalog-event bootstrap:
- SNR event recall: 0.556
- confidence event recall: 0.667
- SNR minus confidence recall difference: -0.111
- 95% interval: [-0.127, -0.094]
- reference catalog events: 2,340
- resamples: 10,000
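A paired catalog-event bootstrap of this kind can be sketched as below: each reference event carries a hit/miss flag for both filters, events are resampled with replacement, and the recall difference is recomputed per resample. This is a generic percentile bootstrap, not the archived script. The function name and the 0/1 input lists are assumptions.

```python
import random


def paired_bootstrap_recall_diff(hits_a, hits_b, n_resamples=10000, seed=0, alpha=0.05):
    """Percentile bootstrap interval for recall(A) - recall(B) over paired events.

    hits_a and hits_b are equal-length lists of 0/1 flags, one per catalog event.
    Returns (point_estimate, lower, upper).
    """
    if len(hits_a) != len(hits_b):
        raise ValueError("hit lists must be paired event by event")
    n = len(hits_a)
    rng = random.Random(seed)
    # Resampling per-event differences keeps the pairing intact.
    diffs = [a - b for a, b in zip(hits_a, hits_b)]
    point = sum(diffs) / n
    samples = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = samples[int((alpha / 2) * n_resamples)]
    upper = samples[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper
```

The point estimate depends only on the two hit counts: (1,301 - 1,561) / 2,340 is about -0.111, matching the reported difference. The interval also depends on which events each filter recovers, so reproducing [-0.127, -0.094] requires the per-event flags from the archived results.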
Matched learning results are stored in:
results/multiseed/phase_any/
results/multiseed/phase_dispersion/
results/snr_filtered_test_precision/
Check File Integrity
After download, verify the archive contents:
shasum -a 256 -c CHECKSUMS.sha256
If you rerun scripts, generated files will change and the checksum file should be regenerated for the new archive state.
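On systems without shasum, the same check can be approximated in Python. The sketch assumes CHECKSUMS.sha256 uses the usual "<hex digest>  <relative path>" line format that shasum -c reads. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hex SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(checksum_file, root="."):
    """Return (path, reason) pairs for entries that are missing or mismatched."""
    failures = []
    for line in Path(checksum_file).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, _, name = line.partition(" ")
        name = name.lstrip(" *")  # drop the text/binary mode marker
        target = Path(root) / name
        if not target.is_file():
            failures.append((name, "missing"))
        elif sha256_of(target) != expected.lower():
            failures.append((name, "mismatch"))
    return failures
```

An empty list from verify_checksums("CHECKSUMS.sha256") means every listed file is present and unchanged.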
Reproducibility Boundaries
This archive supports two levels of reproducibility.
First, manuscript figures and tables can be checked from the archived derived CSV/JSON summaries and plotted-data exports without downloading the raw data. This is the expected quick rendering check for reviewers.
Second, full training, testing, SNR filtering, and association can be rerun after
downloading the public datasets listed above. These reruns require substantial
storage and compute. The intended entrypoint is
code/scripts/reproduce_paper_figures_from_open_data.py; the local data paths
are explicit command-line arguments because raw data are not redistributed here.
The archive does not include the manuscript source itself. The submitted GRL manuscript should cite this repository URL and the final immutable commit hash in the Open Research section.
The archive includes exact training-key manifests and fixed seed values, but it does not redistribute raw CREDIT-X1local, SeismicX-Cont, or SeisDispFusion-NCF waveform/dispersion arrays. Those raw data should be obtained from the public sources listed above.
The archive does not claim that the reported SNR thresholds are universal physical thresholds across tasks. Phase-picking SNR and dispersion SNR are task-specific filtering scales, as described in the manuscript and SI.
Citation
Please cite this archive, the GRL manuscript, and the upstream datasets when
reusing the materials. A CITATION.cff file and data-source BibTeX entries are
included. The repository URL for this archive is:
https://huggingface.co/cangyeone/snr_bias
Use the final Hugging Face commit hash as the archive revision in the GRL Open Research statement after the upload is complete.
License
Unless otherwise noted, project-controlled code, derived outputs, summary tables, figure assets, model checkpoints, and reproducibility notes in this archive are released under the Creative Commons Attribution 4.0 International license (CC BY 4.0).
Raw datasets are not redistributed in this archive and should be used under the terms stated by their original repositories or publications.