ho22joshua committed
Commit 0f17633 · 1 Parent(s): a19acb8

created demo

.gitignore CHANGED
@@ -1,2 +1,3 @@
 __pycache__/
 trainings/
+scores/
root_gnn_dgl/README.md CHANGED
@@ -40,6 +40,16 @@ Run the `setup/test_setup.py` script to confirm that all packages needed for tra
 ```bash
 python setup/test_setup.py
 ```
+## Running the Demo
+The demo training is an example of our ML workflow, consisting of pretraining a model and then finetuning it for an analysis task. The config files for the demo are located in the directory `configs/demo/`. The demo can be run on a login node.
+
+The pretraining for the demo is a multiclass classification training on 12 datasets corresponding to 12 distinct physics processes, each containing 10,000 simulated collision events. The pretrained model is then finetuned on a binary classification task between two datasets of 10,000 simulated collision events each, corresponding to two different processes called ttH CP Even and ttH CP Odd.
+
+The entire demo can be run with the command
+```bash
+source run_demo.sh
+```
+
 
 ## Data Preparation
 The first step in the process is to convert the events stored in ROOT files into DGL graph objects. This conversion is handled automatically by the Dataset objects during their creation, provided the graph data has not already been saved to disk. To accomplish this, a simple script is used to initialize the relevant Dataset object and then exit. This script needs to be executed for each data chunk in each dataset being used for training.
@@ -47,7 +57,15 @@ The first step in the process is to convert the events stored in ROOT files into
 Below is an example of how to use the `scripts/prep_data.py` script:
 
 ```bash
-<insert exmaple here>
+datasets=("ttH" "tHjb" "ggF" "VBF" "WH" "ZH" "ttyy" "tttt" "SingleT_schan" "ttbar" "ttW" "ttt")
+chunks=3
+
+for data in "${datasets[@]}"; do
+    python scripts/prep_data.py --config configs/demo/pretraining_multiclass.yaml --dataset "$data" --shuffle_mode --chunk 0
+    for ((i=0; i<chunks; i++)); do
+        python scripts/prep_data.py --config configs/demo/pretraining_multiclass.yaml --dataset "$data" --shuffle_mode --chunk "$i"
+    done
+done
+```
 
 The `--shuffle_mode` flag performs shuffling and pre-batches the graphs in each chunk, since holding the entire dataset in memory and shuffling it together can be prohibitive for large datasets.
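
The diff does not show what `--shuffle_mode` does internally. As a rough illustration only, chunk-local shuffling and pre-batching with DGL could look like the sketch below; the `shuffle_and_prebatch` helper and its arguments are hypothetical stand-ins, not the repo's actual Dataset API.

```python
# Minimal sketch of per-chunk shuffle + pre-batch; hypothetical helper,
# not the repo's real implementation.
import random
import dgl

def shuffle_and_prebatch(graphs, batch_size, seed=0):
    """Shuffle one chunk's graphs, then group them into fixed-size batches."""
    rng = random.Random(seed)
    graphs = list(graphs)
    rng.shuffle(graphs)  # shuffling stays local to this chunk
    # dgl.batch merges a list of DGLGraphs into a single batched graph
    return [dgl.batch(graphs[i:i + batch_size])
            for i in range(0, len(graphs), batch_size)]
```

Because each chunk is shuffled and batched independently, only one chunk's graphs ever need to be resident in memory, which is the point made above about large datasets.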
root_gnn_dgl/configs/demo/finetuning_ttH_CP_Even_vs_Odd.yaml CHANGED
@@ -1,17 +1,17 @@
1
- Training_Name: ttH_CP_even_vs_odd_finetuning_12_process
2
- Training_Directory: trainings/ttH_vs_ttH_CPodd_TL_studies/ttH_CP_even_vs_odd_finetuning_12_process
3
  Model:
4
  module: models.GCN
5
  class: Transferred_Learning_Finetuning
6
  args:
7
- pretraining_path: trainings/Hyy_BIG/model_epoch_59.pt
8
  pretraining_model:
9
  module: models.GCN
10
  class: Edge_Network
11
  args:
12
  hid_size: 64
13
  in_size: 7
14
- out_size: 13
15
  n_layers: 4
16
  n_proc_steps: 4
17
  hid_size: 64
@@ -19,30 +19,30 @@ Model:
19
  out_size: 1
20
  n_layers: 4
21
  n_proc_steps: 4
22
- dropout: 0.10
23
  Training:
24
- epochs: 200
25
  batch_size: 1024
26
  learning_rate: 0.00001
27
  gamma: 0.99
28
  Datasets:
29
- ttH: &dataset_defn
30
  module: root_gnn_base.dataset
31
  class: LazyDataset
32
- shuffle_chunks: 10
33
  batch_size: 1024
34
  padding_mode: NONE #one of STEPS, FIXED, or NONE
35
  args: &dataset_args
36
- name: ttH
37
  label: 0
38
  weight_var: weight
39
- chunks: 100
40
- buffer_size: 11
41
  file_names: ttH_NLO.root
42
  tree_name: output
43
  fold_var: Number
44
- raw_dir: /global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/data/ntuples/Hyy_pretraining/
45
- save_dir: /pscratch/sd/j/joshuaho/root_gnn/root_gnn_dgl/data/processed_ttH_vs_ttH_CPOdd_10M
46
  node_branch_names:
47
  - [jet_pt, ele_pt, mu_pt, ph_pt, MET_met]
48
  - [jet_eta, ele_eta, mu_eta, ph_eta, 0]
@@ -54,14 +54,14 @@ Datasets:
54
  node_branch_types: [vector, vector, vector, vector, single]
55
  node_feature_scales: [1e-1, 1, 1, 1e-1, 1, 1, 1]
56
  folding:
57
- n_folds: 10
58
- test: [0, 1, 2]
59
  # validation: 1
60
- train: [3,4,5,6,7,8,9]
61
- ttH_CPodd:
62
  <<: *dataset_defn
63
  args:
64
  <<: *dataset_args
65
- name: ttH_CPodd
66
  label: 1
67
  file_names: ttH_CPodd.root
 
1
+ Training_Name: finetuning_ttH_CP_Even_vs_Odd
2
+ Training_Directory: trainings/demo/finetuning_ttH_CP_Even_vs_Odd
3
  Model:
4
  module: models.GCN
5
  class: Transferred_Learning_Finetuning
6
  args:
7
+ pretraining_path: trainings/demo/pretraining_multiclass/model_epoch_100.pt # update to the last epoch of the pretraining
8
  pretraining_model:
9
  module: models.GCN
10
  class: Edge_Network
11
  args:
12
  hid_size: 64
13
  in_size: 7
14
+ out_size: 12
15
  n_layers: 4
16
  n_proc_steps: 4
17
  hid_size: 64
 
19
  out_size: 1
20
  n_layers: 4
21
  n_proc_steps: 4
22
+ dropout: 0
23
  Training:
24
+ epochs: 500
25
  batch_size: 1024
26
  learning_rate: 0.00001
27
  gamma: 0.99
28
  Datasets:
29
+ ttH_CP_Even: &dataset_defn
30
  module: root_gnn_base.dataset
31
  class: LazyDataset
32
+ shuffle_chunks: 3
33
  batch_size: 1024
34
  padding_mode: NONE #one of STEPS, FIXED, or NONE
35
  args: &dataset_args
36
+ name: ttH_CP_Even
37
  label: 0
38
  weight_var: weight
39
+ chunks: 3
40
+ buffer_size: 1
41
  file_names: ttH_NLO.root
42
  tree_name: output
43
  fold_var: Number
44
+ raw_dir: /global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/data/ntuples/Hyy_pretraining/multilabel_10K/
45
+ save_dir: /pscratch/sd/j/joshuaho/GNN4Colliders/root_gnn_dgl/data/demo/finetuning_ttH_CP_Even_vs_Odd/
46
  node_branch_names:
47
  - [jet_pt, ele_pt, mu_pt, ph_pt, MET_met]
48
  - [jet_eta, ele_eta, mu_eta, ph_eta, 0]
 
54
  node_branch_types: [vector, vector, vector, vector, single]
55
  node_feature_scales: [1e-1, 1, 1, 1e-1, 1, 1, 1]
56
  folding:
57
+ n_folds: 3
58
+ test: [0]
59
  # validation: 1
60
+ train: [1, 2]
61
+ ttH_CP_Odd:
62
  <<: *dataset_defn
63
  args:
64
  <<: *dataset_args
65
+ name: ttH_CP_Odd
66
  label: 1
67
  file_names: ttH_CPodd.root
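
This config leans on YAML anchors and merge keys: `ttH_CP_Odd` inherits everything from `&dataset_defn` and `&dataset_args` except the keys it overrides (`name`, `label`, `file_names`). One quick way to confirm how the merges resolve is to load the file with PyYAML; this is just a suggested sanity check, assuming PyYAML is available, and is not part of the repo.

```python
# Sanity check of anchor/merge-key resolution in the demo config.
# Assumes PyYAML is installed; safe_load resolves &anchors and <<: merges.
import yaml

with open("configs/demo/finetuning_ttH_CP_Even_vs_Odd.yaml") as f:
    cfg = yaml.safe_load(f)

odd = cfg["Datasets"]["ttH_CP_Odd"]
print(odd["class"])               # LazyDataset, inherited via <<: *dataset_defn
print(odd["args"]["name"])        # ttH_CP_Odd, overriding the merged name
print(odd["args"]["file_names"])  # ttH_CPodd.root
```

Note that a YAML merge replaces whole keys rather than deep-merging, which is why the `args` block under `ttH_CP_Odd` must itself merge `*dataset_args` again before overriding individual fields.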
root_gnn_dgl/configs/demo/pretraining_multiclass.yaml CHANGED
@@ -19,7 +19,7 @@ Loss:
19
  class: Softmax
20
  args: {dim: 1}
21
  Training:
22
- epochs: 200
23
  batch_size: 1024
24
  learning_rate: 0.0001
25
  gamma: 0.99
 
19
  class: Softmax
20
  args: {dim: 1}
21
  Training:
22
+ epochs: 500
23
  batch_size: 1024
24
  learning_rate: 0.0001
25
  gamma: 0.99
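
With `epochs: 500`, the last pretraining checkpoint will generally not be the `model_epoch_100.pt` placeholder in the finetuning config, hence its "update to the last epoch" comment. A small hypothetical helper to locate the newest checkpoint, assuming checkpoints follow the `model_epoch_<N>.pt` naming seen in the configs:

```python
# Pick the latest pretraining checkpoint for pretraining_path.
# Assumes the model_epoch_<N>.pt naming convention from the demo configs.
import re
from pathlib import Path

def latest_checkpoint(train_dir="trainings/demo/pretraining_multiclass"):
    def epoch(p):
        # extract N from model_epoch_<N>.pt for numeric (not lexical) ordering
        m = re.search(r"model_epoch_(\d+)\.pt$", p.name)
        return int(m.group(1)) if m else -1
    return max(Path(train_dir).glob("model_epoch_*.pt"), key=epoch, default=None)

print(latest_checkpoint())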
root_gnn_dgl/run_demo.sh CHANGED
@@ -1,8 +1,6 @@
1
  #!/bin/bash
2
 
3
  # Pretraining
4
-
5
- # Data Preparation
6
  datasets=("ttH" "tHjb" "ggF" "VBF" "WH" "ZH" "ttyy" "tttt" "SingleT_schan" "ttbar" "ttW" "ttt")
7
  chunks=3
8
 
@@ -13,9 +11,39 @@ for data in "${datasets[@]}"; do
13
  done
14
  done
15
 
16
- # Training
17
-
18
  python scripts/training_script.py --config configs/demo/pretraining_multiclass.yaml --preshuffle --nocompile --lazy
19
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
20
  # Inference
21
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  #!/bin/bash
2
 
3
  # Pretraining
 
 
4
  datasets=("ttH" "tHjb" "ggF" "VBF" "WH" "ZH" "ttyy" "tttt" "SingleT_schan" "ttbar" "ttW" "ttt")
5
  chunks=3
6
 
 
11
  done
12
  done
13
 
 
 
14
  python scripts/training_script.py --config configs/demo/pretraining_multiclass.yaml --preshuffle --nocompile --lazy
15
 
16
+ # Finetuning
17
+
18
+ datasets=("ttH_CP_Even" "ttH_CP_Odd")
19
+ chunks=3
20
+
21
+ for data in "${datasets[@]}"; do
22
+ python scripts/prep_data.py --config configs/demo/finetuning_ttH_CP_Even_vs_Odd.yaml --dataset "$data" --shuffle_mode --chunk 0
23
+ for ((i=0; i<chunks; i++)); do
24
+ python scripts/prep_data.py --config configs/demo/finetuning_ttH_CP_Even_vs_Odd.yaml --dataset "$data" --shuffle_mode --chunk "$i"
25
+ done
26
+ done
27
+
28
+ python scripts/training_script.py --config configs/demo/finetuning_ttH_CP_Even_vs_Odd.yaml --preshuffle --nocompile --lazy
29
+
30
+
31
  # Inference
32
 
33
+ python scripts/inference.py \
34
+ --target "/global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/data/ntuples/Hyy_pretraining/multilabel_10K/ttH_NLO.root" \
35
+ --destination "/global/cfs/projectdirs/atlas/joshua/GNN4Colliders/root_gnn_dgl/scores/ttH_NLO.root" \
36
+ --config "configs/demo/finetuning_ttH_CP_Even_vs_Odd.yaml" \
37
+ --chunks 1 \
38
+ --chunkno 0 \
39
+ --write \
40
+ --branch 'GNN_Score'
41
+
42
+ python scripts/inference.py \
43
+ --target "/global/cfs/projectdirs/atlas/joshua/root_gnn/root_gnn_dgl/data/ntuples/Hyy_pretraining/multilabel_10K/ttH_CPodd.root" \
44
+ --destination "/global/cfs/projectdirs/atlas/joshua/GNN4Colliders/root_gnn_dgl/scores/ttH_CPodd.root" \
45
+ --config "configs/demo/finetuning_ttH_CP_Even_vs_Odd.yaml" \
46
+ --chunks 1 \
47
+ --chunkno 0 \
48
+ --write \
49
+ --branch 'GNN_Score'
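
After the demo finishes, each file written under `scores/` should contain the events with the new `GNN_Score` branch attached. A minimal inspection sketch with uproot, assuming uproot is installed and the score files keep the `output` tree name used in the demo configs:

```python
# Inspect the GNN_Score branch written by scripts/inference.py.
# Assumes uproot is installed and the output tree is named `output`,
# matching tree_name in the demo configs.
import uproot

with uproot.open("scores/ttH_NLO.root") as f:
    scores = f["output"]["GNN_Score"].array(library="np")
    print(scores.shape, scores.min(), scores.max())
```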