Commit 4173a0e (parent: cb4107f): updated readme

`root_gnn_dgl/README.md` (+20 -8)
Run the `setup/test_setup.py` script to confirm that all packages needed for training are installed:

```bash
python setup/test_setup.py
```
## Running the Demo

The demo training is an example of our ML workflow: pretraining a model, then finetuning it for an analysis task, while also training a model for the analysis task from scratch. The config files for the demo are located in the directory `configs/stats_100K/`. The demo can be run on a login node on Perlmutter.

The pretraining for the demo is a multiclass classification training on 12 datasets corresponding to 12 distinct physics processes, each containing 100,000 simulated collision events. The pretrained model is then finetuned on a binary classification task between two datasets of 100,000 simulated collision events each, corresponding to two different processes called ttH CP Even and ttH CP Odd.

The entire demo can be run with the command
```bash
source run_demo.sh
```

This shell script can also be used as an example of how to run the entire workflow.

## Data Preparation

The first step in the process is to convert the events stored in ROOT files into DGL graph objects. This conversion is handled automatically by the Dataset objects during their creation, provided the graph data has not already been saved to disk. To accomplish this, a simple script is used to initialize the relevant Dataset object and then exit. This script needs to be executed for each data chunk in each dataset being used for training.

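As a sketch of that pattern, the conversion script simply constructs the Dataset and exits, relying on the constructor's side effect of caching graphs to disk. `ChunkedGraphDataset` below is a hypothetical stand-in for the repository's real Dataset classes, with the ROOT-to-DGL conversion replaced by a placeholder:

```python
from pathlib import Path
import pickle

class ChunkedGraphDataset:
    """Hypothetical stand-in Dataset: converts one data chunk on first use
    and caches it to disk, skipping the conversion if the cache exists."""

    def __init__(self, root_file: str, chunk_no: int, cache_dir: str):
        stem = Path(root_file).stem
        self.cache = Path(cache_dir) / f"{stem}_chunk{chunk_no}.pkl"
        if not self.cache.exists():            # convert only if not saved yet
            graphs = self._convert(root_file, chunk_no)
            self.cache.parent.mkdir(parents=True, exist_ok=True)
            self.cache.write_bytes(pickle.dumps(graphs))

    def _convert(self, root_file: str, chunk_no: int) -> list:
        # Placeholder for the real ROOT -> DGL graph conversion.
        return [{"file": root_file, "chunk": chunk_no, "event": i}
                for i in range(4)]

# The conversion script would just do, once per (dataset, chunk):
# ChunkedGraphDataset("ttH_NLO.root", chunk_no=0, cache_dir="graphs")
```

The second run of the same (dataset, chunk) pair is a no-op, which is why the script can be re-executed safely.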
The `--shuffle_mode` flag performs shuffling and pre-batches the graphs in each chunk, since holding the entire dataset in memory and shuffling it together can be prohibitive for large datasets.
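A minimal illustration of that chunk-wise strategy, shuffling within one chunk and grouping it into fixed-size batches (function name and batch size are illustrative, not the repository's actual code):

```python
import random

def preshuffle_chunk(graphs: list, batch_size: int, seed: int = 0) -> list:
    """Shuffle one chunk in memory and split it into fixed-size batches,
    so training can stream pre-batched chunks instead of shuffling the
    whole dataset at once."""
    rng = random.Random(seed)      # fixed seed for a reproducible shuffle
    shuffled = graphs[:]           # shuffle a copy, leave the input intact
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size]
            for i in range(0, len(shuffled), batch_size)]

batches = preshuffle_chunk(list(range(10)), batch_size=4)  # sizes 4, 4, 2
```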

You will probably see a `list index out of range` error. This is expected, and the program will still save your graphs to disk before throwing the error. Fixing this error is a WIP, but for now it can be ignored:

```python
dgl.save_graphs(str(graph_path).replace('.bin', f'_{self.process_chunks[i]}.bin'), self.graph_chunks[i], {'labels': self.label_chunks[i], 'tracking': self.tracking_chunks[i], 'global': self.global_chunks[i]})
IndexError: list index out of range
```

## Training

Training is run by `scripts/training_script.py`. `--preshuffle` tells it to use the preshuffled and batched graphs rather than shuffling and batching on the fly, and `--restart` can be used to force the training to start from the beginning rather than from the last available checkpoint.

Using the `--nocompile` argument is also recommended, as using `torch.compile()` requires padding the graphs beforehand during data processing.

```bash
python scripts/training_script.py --config configs/stats_100K/pretraining_multiclass.yaml --preshuffle --nocompile --lazy
```

This step should produce the training directory `trainings/stats_100K/pretraining_multiclass/`, containing a copy of the config file, checkpoints (`model_epoch_*.pt`) with the model weights after each epoch of training, `.npz` files with the GNN outputs for each event after each epoch, and two files, `training.log` and `training.png`, which summarize the model performance and convergence.

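The checkpoint behavior described above (resume from the last `model_epoch_*.pt` unless `--restart` is given) can be sketched as follows; `resume_epoch` is a hypothetical helper, not the training script's actual code:

```python
import re
from pathlib import Path

def resume_epoch(training_dir: str, restart: bool = False) -> int:
    """Return the epoch to start from: 0 on --restart or a fresh directory,
    otherwise one past the last saved model_epoch_*.pt checkpoint."""
    if restart:
        return 0
    epochs = [int(m.group(1))
              for p in Path(training_dir).glob("model_epoch_*.pt")
              if (m := re.fullmatch(r"model_epoch_(\d+)\.pt", p.name))]
    return max(epochs) + 1 if epochs else 0
```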
## Inference

Inference is done by `scripts/inference.py`. This script applies the model defined by `--config` to the samples located at `--target`. A new set of samples, with the GNN scores saved as the `--branch` in the ntuples, will be created at `--destination`. The `--chunks` argument splits the inference into the specified number of chunks, and `--chunkno` selects which chunk to process.

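One plausible reading of the `--chunks`/`--chunkno` semantics is a contiguous near-equal split of the event range; the actual partitioning is defined inside `scripts/inference.py`, so treat this as an assumption:

```python
def chunk_slice(n_events: int, chunks: int, chunkno: int) -> range:
    """Event indices handled by chunk `chunkno` when the sample is split
    into `chunks` contiguous, near-equal pieces."""
    per = -(-n_events // chunks)  # ceiling division: events per chunk
    return range(chunkno * per, min((chunkno + 1) * per, n_events))

# e.g. 10 events in 3 chunks -> pieces of size 4, 4, 2
```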
```bash
python scripts/inference.py \
    --target "/global/cfs/projectdirs/trn007/lbl_atlas/data/stats_100K/ttH_NLO.root" \
    --destination "/global/cfs/projectdirs/trn007/lbl_atlas/data/scores/stats_100K/ttH_NLO.root" \
    --config "configs/stats_100K/finetuning_ttH_CP_even_vs_odd.yaml" \
    --chunks 1 \
    --chunkno 0 \
    --write \
    --branch 'GNN_Score'
```

You can also pass a list to `--config` and `--branch` to apply multiple models to the same set of samples simultaneously. An example of how to do this in a shell script is in the `run_demo.sh` file.