Commit 0f17633
Parent(s): a19acb8
created demo
.gitignore CHANGED

@@ -1,2 +1,3 @@
 __pycache__/
 trainings/
+scores/
root_gnn_dgl/README.md CHANGED

@@ -40,6 +40,16 @@ Run the `setup/test_setup.py` script to confirm that all packages needed for tra
 ```bash
 python setup/test_setup.py
 ```
+## Running the Demo
+The demo training is an example of our ML workflow, consisting of pretraining a model, then finetuning it for an analysis task. The config files for the demo are located in the directory `configs/demo/`. The demo can be run on a login node.
+
+The pretraining for the demo is a multiclass classification training on 12 datasets corresponding to 12 distinct physics processes, containing 10,000 simulated collision events each. The pretrained model is then finetuned on a binary classification task between two datasets containing 10,000 simulated collision events each for two different processes, called ttH CP Even and ttH CP Odd.
+
+The entire demo can be run with the command
+```bash
+source run_demo.sh
+```
+
 
 ## Data Preparation
 The first step in the process is to convert the events stored in ROOT files into DGL graph objects. This conversion is handled automatically by the Dataset objects during their creation, provided the graph data has not already been saved to disk. To accomplish this, a simple script is used to initialize the relevant Dataset object and then exit. This script needs to be executed for each data chunk in each dataset being used for training.

@@ -47,7 +57,15 @@ The first step in the process is to convert the events stored in ROOT files into
 Below is an example of how to use the `scripts/prep_data.py` script:
 
 ```bash
-
+datasets=("ttH" "tHjb" "ggF" "VBF" "WH" "ZH" "ttyy" "tttt" "SingleT_schan" "ttbar" "ttW" "ttt")
+chunks=3
+
+for data in "${datasets[@]}"; do
+    python scripts/prep_data.py --config configs/demo/pretraining_multiclass.yaml --dataset "$data" --shuffle_mode --chunk 0
+    for ((i=0; i<chunks; i++)); do
+        python scripts/prep_data.py --config configs/demo/pretraining_multiclass.yaml --dataset "$data" --shuffle_mode --chunk "$i"
+    done
+done
 ```
 
 The `--shuffle_mode` flag performs shuffling and pre-batches the graphs in each chunk, since holding the entire dataset in memory and shuffling it together can be prohibitive for large datasets.
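The per-chunk shuffling and pre-batching that `--shuffle_mode` performs can be sketched in plain Python. This is an illustration of the idea only, not the actual `prep_data.py` implementation; `shuffle_and_prebatch` is a hypothetical name, and the "graphs" here are stand-ins for DGL graph objects:

```python
import random

def shuffle_and_prebatch(chunk, batch_size, seed=0):
    """Shuffle one chunk of graphs and group them into fixed-size batches.

    Shuffling per chunk keeps memory bounded: only one chunk is ever
    resident at a time, instead of the whole dataset.
    """
    graphs = list(chunk)
    random.Random(seed).shuffle(graphs)   # shuffle within this chunk only
    return [graphs[i:i + batch_size]      # pre-batch for the training loop
            for i in range(0, len(graphs), batch_size)]

# Toy usage: 10 "graphs" in a chunk, batches of 4 -> batch sizes 4, 4, 2
batches = shuffle_and_prebatch(range(10), batch_size=4)
```

Because each chunk is shuffled independently, the global ordering is only approximately random, which is the trade-off the README paragraph above describes.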
root_gnn_dgl/configs/demo/finetuning_ttH_CP_Even_vs_Odd.yaml CHANGED

@@ -1,17 +1,17 @@
-Training_Name:
-Training_Directory: trainings/
+Training_Name: finetuning_ttH_CP_Even_vs_Odd
+Training_Directory: trainings/demo/finetuning_ttH_CP_Even_vs_Odd
 Model:
   module: models.GCN
   class: Transferred_Learning_Finetuning
   args:
-    pretraining_path: trainings/
+    pretraining_path: trainings/demo/pretraining_multiclass/model_epoch_100.pt # update to the last epoch of the pretraining
     pretraining_model:
       module: models.GCN
       class: Edge_Network
       args:
         hid_size: 64
         in_size: 7
-        out_size:
+        out_size: 12
         n_layers: 4
         n_proc_steps: 4
     hid_size: 64

@@ -19,30 +19,30 @@ Model:
     out_size: 1
     n_layers: 4
     n_proc_steps: 4
-    dropout: 0
+    dropout: 0
 Training:
-  epochs:
+  epochs: 500
   batch_size: 1024
   learning_rate: 0.00001
   gamma: 0.99
 Datasets:
-
+  ttH_CP_Even: &dataset_defn
     module: root_gnn_base.dataset
     class: LazyDataset
-    shuffle_chunks:
+    shuffle_chunks: 3
     batch_size: 1024
     padding_mode: NONE #one of STEPS, FIXED, or NONE
     args: &dataset_args
-      name:
+      name: ttH_CP_Even
      label: 0
      weight_var: weight
-      chunks:
-      buffer_size:
+      chunks: 3
+      buffer_size: 1
      file_names: ttH_NLO.root
      tree_name: output
      fold_var: Number
-      raw_dir: /global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/data/ntuples/Hyy_pretraining/
-      save_dir: /pscratch/sd/j/joshuaho/
+      raw_dir: /global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/data/ntuples/Hyy_pretraining/multilabel_10K/
+      save_dir: /pscratch/sd/j/joshuaho/GNN4Colliders/root_gnn_dgl/data/demo/finetuning_ttH_CP_Even_vs_Odd/
      node_branch_names:
        - [jet_pt, ele_pt, mu_pt, ph_pt, MET_met]
        - [jet_eta, ele_eta, mu_eta, ph_eta, 0]

@@ -54,14 +54,14 @@ Datasets:
      node_branch_types: [vector, vector, vector, vector, single]
      node_feature_scales: [1e-1, 1, 1, 1e-1, 1, 1, 1]
      folding:
-        n_folds:
-        test: [0
+        n_folds: 3
+        test: [0]
        # validation: 1
-        train: [
-
+        train: [1, 2]
+  ttH_CP_Odd:
    <<: *dataset_defn
    args:
      <<: *dataset_args
-      name:
+      name: ttH_CP_Odd
      label: 1
      file_names: ttH_CPodd.root
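The `&dataset_defn` / `<<: *dataset_defn` anchors in the config above let the `ttH_CP_Odd` entry reuse the whole `ttH_CP_Even` definition while overriding only a few keys. A YAML loader resolves those merge keys to the same result as a plain dict merge, sketched here with abbreviated values (the keys are taken from the config; this is an illustration of the merge semantics, not code from the repo):

```python
# The anchored base entry (ttH_CP_Even: &dataset_defn, args: &dataset_args)
dataset_args = {"name": "ttH_CP_Even", "label": 0,
                "weight_var": "weight", "file_names": "ttH_NLO.root"}
dataset_defn = {"module": "root_gnn_base.dataset", "class": "LazyDataset",
                "shuffle_chunks": 3, "args": dataset_args}

# `<<: *dataset_defn` with an overriding `args` block is equivalent to:
ttH_CP_Odd = {**dataset_defn,
              "args": {**dataset_args,
                       "name": "ttH_CP_Odd", "label": 1,
                       "file_names": "ttH_CPodd.root"}}
```

Note that merge keys give a shallow merge: the nested `args` mapping must itself carry its own `<<: *dataset_args`, exactly as the config does.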
root_gnn_dgl/configs/demo/pretraining_multiclass.yaml CHANGED

@@ -19,7 +19,7 @@ Loss:
   class: Softmax
   args: {dim: 1}
 Training:
-  epochs:
+  epochs: 500
   batch_size: 1024
   learning_rate: 0.0001
   gamma: 0.99
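If `gamma: 0.99` is an ExponentialLR-style per-epoch decay factor, which is the common convention but should be confirmed against `scripts/training_script.py`, the resulting learning-rate schedule over the 500 epochs set above can be sketched as:

```python
def lr_schedule(lr0, gamma, epochs):
    """Per-epoch learning rates under exponential decay: lr_n = lr0 * gamma**n."""
    return [lr0 * gamma ** n for n in range(epochs)]

# Values from pretraining_multiclass.yaml: learning_rate 0.0001, gamma 0.99, epochs 500
lrs = lr_schedule(1e-4, 0.99, 500)
# By the final epoch the rate has decayed to under 1% of its initial value
```

This kind of slow decay keeps early epochs at the configured rate while gently annealing toward the end of the run.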
root_gnn_dgl/run_demo.sh CHANGED

@@ -1,8 +1,6 @@
 #!/bin/bash
 
 # Pretraining
-
-# Data Preparation
 datasets=("ttH" "tHjb" "ggF" "VBF" "WH" "ZH" "ttyy" "tttt" "SingleT_schan" "ttbar" "ttW" "ttt")
 chunks=3
 

@@ -13,9 +11,39 @@ for data in "${datasets[@]}"; do
     done
 done
 
-# Training
-
 python scripts/training_script.py --config configs/demo/pretraining_multiclass.yaml --preshuffle --nocompile --lazy
 
+# Finetuning
+
+datasets=("ttH_CP_Even" "ttH_CP_Odd")
+chunks=3
+
+for data in "${datasets[@]}"; do
+    python scripts/prep_data.py --config configs/demo/finetuning_ttH_CP_Even_vs_Odd.yaml --dataset "$data" --shuffle_mode --chunk 0
+    for ((i=0; i<chunks; i++)); do
+        python scripts/prep_data.py --config configs/demo/finetuning_ttH_CP_Even_vs_Odd.yaml --dataset "$data" --shuffle_mode --chunk "$i"
+    done
+done
+
+python scripts/training_script.py --config configs/demo/finetuning_ttH_CP_Even_vs_Odd.yaml --preshuffle --nocompile --lazy
+
+
 # Inference
 
+python scripts/inference.py \
+    --target "/global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/data/ntuples/Hyy_pretraining/multilabel_10K/ttH_NLO.root" \
+    --destination "/global/cfs/projectdirs/atlas/joshua/GNN4Colliders/root_gnn_dgl/scores/ttH_NLO.root" \
+    --config "configs/demo/finetuning_ttH_CP_Even_vs_Odd.yaml" \
+    --chunks 1 \
+    --chunkno 0 \
+    --write \
+    --branch 'GNN_Score'
+
+python scripts/inference.py \
+    --target "/global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/data/ntuples/Hyy_pretraining/multilabel_10K/ttH_CPodd.root" \
+    --destination "/global/cfs/projectdirs/atlas/joshua/GNN4Colliders/root_gnn_dgl/scores/ttH_CPodd.root" \
+    --config "configs/demo/finetuning_ttH_CP_Even_vs_Odd.yaml" \
+    --chunks 1 \
+    --chunkno 0 \
+    --write \
+    --branch 'GNN_Score'
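The two inference invocations added to `run_demo.sh` differ only in their target/destination pair, so they could equally be generated from a small helper. This is a sketch only: `inference_cmd` is a hypothetical name, the flags are copied verbatim from the script above, and the file names stand in for the full paths:

```python
def inference_cmd(target, destination,
                  config="configs/demo/finetuning_ttH_CP_Even_vs_Odd.yaml"):
    """Build the argument list for one scripts/inference.py call."""
    return ["python", "scripts/inference.py",
            "--target", target,
            "--destination", destination,
            "--config", config,
            "--chunks", "1",
            "--chunkno", "0",
            "--write",
            "--branch", "GNN_Score"]

# One (target, destination) pair per sample, mirroring the two calls above
pairs = [("ttH_NLO.root", "scores/ttH_NLO.root"),
         ("ttH_CPodd.root", "scores/ttH_CPodd.root")]
cmds = [inference_cmd(t, d) for t, d in pairs]
# Each cmd could then be executed with subprocess.run(cmd, check=True)
```

Keeping the shared flags in one place would make it harder for the two calls to drift apart as the demo evolves.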